{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Work with binary files\n",
    "\n",
    "**The question.** You need to read or write non-text data — an image, an audio file, a custom binary format, a PNG header, a struct layout from a C program. Text mode will corrupt it: encoding translation and newline mangling are both active by default.\n",
    "\n",
    "The answer: open with `'rb'` or `'wb'`. You now work in `bytes`, not `str`, and Python does no translation. For simple read-everything / write-everything, `Path.read_bytes()` and `Path.write_bytes()` are one-line shortcuts.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Binary read + write — no encoding, no newline translation, no data corruption.\n",
    "# The canonical pattern: open('rb') / open('wb'), work in bytes.\n",
    "from pathlib import Path\n",
    "\n",
    "path = Path('/tmp/sample.bin')\n",
    "\n",
    "# Write — 'wb' plus bytes (note b'...'). Never specify encoding in binary mode.\n",
    "header = b'\\x89PNG\\r\\n\\x1a\\n'         # the real PNG magic bytes\n",
    "payload = b'\\x00' * 100                  # any binary data\n",
    "with open(path, 'wb') as f:\n",
    "    f.write(header)\n",
    "    f.write(payload)\n",
    "\n",
    "# Read — 'rb', result is a bytes object.\n",
    "with open(path, 'rb') as f:\n",
    "    data = f.read()\n",
    "\n",
    "print(f'size: {len(data)} bytes')\n",
    "print(f'first 8 bytes: {data[:8]!r}')     # the header\n",
    "print(f'as hex:        {data[:8].hex()}')\n",
    "print(f'recognised as PNG: {data.startswith(header)}')\n",
    "\n",
    "path.unlink()\n"
   ]
  },
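  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Path.read_bytes()` / `Path.write_bytes()` shortcuts mentioned in the introduction can be sketched as follows. The temporary directory and the file name `shortcut.bin` are assumptions made for illustration, chosen so the cell runs on any platform.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "# Hypothetical file name inside a throwaway directory (an assumption for this sketch).\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    shortcut_path = Path(tmp) / 'shortcut.bin'\n",
    "    original = bytes(range(256))                    # every possible byte value once\n",
    "    written = shortcut_path.write_bytes(original)   # opens in binary mode, writes, closes\n",
    "    round_trip = shortcut_path.read_bytes()         # opens in binary mode, reads all, closes\n",
    "\n",
    "print(f'wrote {written} bytes')\n",
    "print(f'round trip intact: {round_trip == original}')\n"
   ]
  },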
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variant: copy or hash in chunks\n",
    "\n",
    "For large binary files, read in fixed-size chunks. 8 KB is a reasonable default — big enough that overhead is low, small enough that memory stays tiny.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import hashlib\n",
    "from pathlib import Path\n",
    "\n",
    "# Set up a sample binary file.\n",
    "src = Path('/tmp/big.bin')\n",
    "src.write_bytes(b'\\x01\\x02\\x03\\x04' * 10_000)\n",
    "\n",
    "# Streaming SHA-256 — never holds the whole file in memory.\n",
    "hasher = hashlib.sha256()\n",
    "with open(src, 'rb') as f:\n",
    "    for chunk in iter(lambda: f.read(8192), b''):    # b'' is the EOF sentinel\n",
    "        hasher.update(chunk)\n",
    "\n",
    "print(f'sha256: {hasher.hexdigest()}')\n",
    "src.unlink()\n"
   ]
  },
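  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same read loop also covers the copy case named in the heading. This is a minimal sketch, not a definitive implementation: the `copy_in_chunks` helper and the file names are hypothetical, and for everyday use the standard library's `shutil.copyfileobj` does the same job.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "CHUNK_SIZE = 8192  # 8 KB, matching the hashing example\n",
    "\n",
    "\n",
    "def copy_in_chunks(src_path, dst_path, chunk_size=CHUNK_SIZE):\n",
    "    # Hypothetical helper: copy a file chunk by chunk and return the bytes copied.\n",
    "    copied = 0\n",
    "    with open(src_path, 'rb') as src_file, open(dst_path, 'wb') as dst_file:\n",
    "        for chunk in iter(lambda: src_file.read(chunk_size), b''):\n",
    "            dst_file.write(chunk)\n",
    "            copied += len(chunk)\n",
    "    return copied\n",
    "\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    source = Path(tmp) / 'source.bin'      # hypothetical file names\n",
    "    target = Path(tmp) / 'target.bin'\n",
    "    source.write_bytes(b'\\x01\\x02\\x03\\x04' * 10_000)\n",
    "    copied_bytes = copy_in_chunks(source, target)\n",
    "    copies_match = source.read_bytes() == target.read_bytes()\n",
    "\n",
    "print(f'copied {copied_bytes} bytes, identical: {copies_match}')\n"
   ]
  },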
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variant: structured records with `struct`\n",
    "\n",
    "For binary formats with a defined layout, `struct` pack/unpacks bytes in a declarative way. Format chars say byte-order (`>` big-endian), then each field's type (`I` = uint32, `H` = uint16, `f` = float32).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import struct\n",
    "\n",
    "# Big-endian: uint32 id, uint16 count, float32 value.\n",
    "FORMAT = '>IHf'\n",
    "\n",
    "packed = struct.pack(FORMAT, 12345, 42, 3.14)\n",
    "print(f'packed: {packed.hex()}  ({len(packed)} bytes)')\n",
    "\n",
    "record_id, count, value = struct.unpack(FORMAT, packed)\n",
    "print(f'id={record_id}, count={count}, value={value:.2f}')\n"
   ]
  },
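  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A file usually holds many such records back to back. The sketch below assumes fixed-size records with the same `>IHf` layout and no file header; the sample values are invented for illustration. `struct.Struct` precompiles the format, and its `iter_unpack` method walks a buffer one record at a time.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import struct\n",
    "\n",
    "RECORD = struct.Struct('>IHf')   # uint32 id, uint16 count, float32 value\n",
    "print(f'record size: {RECORD.size} bytes')\n",
    "\n",
    "# Invented sample records (an assumption, not data from a real format).\n",
    "samples = [(1, 10, 0.5), (2, 20, 1.5), (3, 30, 2.5)]\n",
    "blob = b''.join(RECORD.pack(*sample) for sample in samples)\n",
    "\n",
    "# iter_unpack requires the buffer length to be a multiple of the record size.\n",
    "decoded = list(RECORD.iter_unpack(blob))\n",
    "for record_id, count, value in decoded:\n",
    "    print(f'id={record_id}, count={count}, value={value}')\n"
   ]
  },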
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why this works\n",
    "\n",
    "Text mode (`'r'`, `'w'`) decodes bytes to `str` on read, encodes on write, and translates newlines (`\\r\\n` ↔ `\\n`) on Windows. That's fine — and necessary — for text. Applied to a PNG or a wav file, it will silently mutate your data: flip `\\r\\n` into `\\n`, or raise `UnicodeDecodeError` the first time it hits a byte that's not valid UTF-8.\n",
    "\n",
    "Binary mode (`'rb'`, `'wb'`) skips both steps. You get exactly the bytes that were on disk, and you write exactly the bytes you pass in. The read type is `bytes`; the write type must be `bytes` too (`TypeError` if you pass a `str`). Slicing, concatenation, `len()`, `.hex()`, and `.startswith()` all work the obvious way.\n",
    "\n",
    "For tiny files or tests, `Path.read_bytes()` / `Path.write_bytes()` are the single-call shortcuts — they do the `with open(...)` dance for you.\n"
   ]
  },
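  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between the two modes is easy to observe. This is a minimal sketch, assuming a temporary file (hypothetical name `newlines.bin`) that contains a Windows-style line ending. The `latin-1` encoding is chosen only because it can decode any byte, so the text-mode read cannot raise here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "raw = b'line one\\r\\nline two'\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    demo_path = Path(tmp) / 'newlines.bin'   # hypothetical file name\n",
    "    demo_path.write_bytes(raw)\n",
    "\n",
    "    # Binary mode: the bytes come back untouched.\n",
    "    with open(demo_path, 'rb') as f:\n",
    "        as_bytes = f.read()\n",
    "\n",
    "    # Text mode: bytes are decoded and '\\r\\n' is translated to '\\n' on read.\n",
    "    with open(demo_path, 'r', encoding='latin-1') as f:\n",
    "        as_text = f.read()\n",
    "\n",
    "print(f'binary: {as_bytes!r}')\n",
    "print(f'text:   {as_text!r}')\n"
   ]
  },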
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trade-offs\n",
    "\n",
    "`Path.read_bytes()` loads the whole file — fine up to a few MB, bad for a multi-GB video. For large binary files, read in fixed-size chunks with `iter(lambda: f.read(8192), b'')` — constant memory regardless of size. See the extras.\n",
    "\n",
    "For structured binary formats (a record is 'uint32 id, uint16 count, float32 value, big-endian'), the `struct` module packs and unpacks bytes into Python tuples. That's easier than hand-bit-fiddling and makes the format definition explicit. Also see the extras.\n",
    "\n",
    "A common gotcha: mixing modes. If you `open('w')` and try to `.write(b'...')` you get a `TypeError` because `str.write` rejects bytes, and vice versa. The mode and the data types have to agree.\n"
   ]
  },
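  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mode mismatch described above can be reproduced in a few lines. This is a minimal sketch using a temporary file with a hypothetical name; the exact error messages vary between Python versions, so only the exception type is recorded.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "errors = []\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    mismatch_path = Path(tmp) / 'mismatch.dat'   # hypothetical file name\n",
    "\n",
    "    # Text mode expects str, so bytes are rejected.\n",
    "    with open(mismatch_path, 'w', encoding='utf-8') as f:\n",
    "        try:\n",
    "            f.write(b'raw bytes')\n",
    "        except TypeError as exc:\n",
    "            errors.append(('text mode + bytes', type(exc).__name__))\n",
    "\n",
    "    # Binary mode expects a bytes-like object, so str is rejected.\n",
    "    with open(mismatch_path, 'wb') as f:\n",
    "        try:\n",
    "            f.write('some text')\n",
    "        except TypeError as exc:\n",
    "            errors.append(('binary mode + str', type(exc).__name__))\n",
    "\n",
    "for case, name in errors:\n",
    "    print(f'{case}: {name}')\n"
   ]
  },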
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Related reading\n",
    "\n",
    "- [Process large files](https://agilearn.co.uk/guides/file-handling/recipes/process-large-files) — chunk-at-a-time reading, also used for binary data.\n",
    "- [Avoid common file-handling mistakes](https://agilearn.co.uk/guides/file-handling/recipes/avoid-common-file-handling-mistakes) — encoding, newline, and mode traps.\n",
    "- [File modes reference](https://agilearn.co.uk/guides/file-handling/reference/file-modes-reference) — every combination in one place.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}