Work with binary files¶
The question. You need to read or write non-text data — an image, an audio file, a custom binary format, a PNG header, a struct layout from a C program. Text mode will corrupt it: encoding translation and newline mangling are both active by default.
The answer: open with 'rb' or 'wb'. You now work in bytes, not str, and Python does no translation. For simple read-everything / write-everything, Path.read_bytes() and Path.write_bytes() are one-line shortcuts.
# Binary read + write — no encoding, no newline translation, no data corruption.
# The canonical pattern: open('rb') / open('wb'), work in bytes.
from pathlib import Path
path = Path('/tmp/sample.bin')
# Write — 'wb' plus bytes (note b'...'). Never specify encoding in binary mode.
header = b'\x89PNG\r\n\x1a\n' # the real PNG magic bytes
payload = b'\x00' * 100 # any binary data
with open(path, 'wb') as f:
    f.write(header)
    f.write(payload)
# Read — 'rb', result is a bytes object.
with open(path, 'rb') as f:
    data = f.read()
print(f'size: {len(data)} bytes')
print(f'first 8 bytes: {data[:8]!r}') # the header
print(f'as hex: {data[:8].hex()}')
print(f'recognised as PNG: {data.startswith(header)}')
path.unlink()
Variant: copy or hash in chunks¶
For large binary files, read in fixed-size chunks. 8 KB is a reasonable default — big enough that overhead is low, small enough that memory stays tiny.
import hashlib
from pathlib import Path
# Set up a sample binary file.
src = Path('/tmp/big.bin')
src.write_bytes(b'\x01\x02\x03\x04' * 10_000)
# Streaming SHA-256 — never holds the whole file in memory.
hasher = hashlib.sha256()
with open(src, 'rb') as f:
    for chunk in iter(lambda: f.read(8192), b''):  # b'' is the EOF sentinel
        hasher.update(chunk)
print(f'sha256: {hasher.hexdigest()}')
src.unlink()
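The same chunked loop also handles copying. What follows is a minimal sketch, assuming a hypothetical helper named copy_in_chunks (it is not part of the standard library) and a temporary directory in place of /tmp. For production code, shutil.copyfileobj does the same job.

```python
import tempfile
from pathlib import Path

CHUNK_SIZE = 8192  # same 8 KB chunk size as the hashing example


def copy_in_chunks(src, dst, chunk_size=CHUNK_SIZE):
    """Hypothetical helper: copy src to dst one chunk at a time.

    Returns the number of bytes copied.
    """
    copied = 0
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        for chunk in iter(lambda: fin.read(chunk_size), b''):
            fout.write(chunk)
            copied += len(chunk)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'big.bin'
    dst = Path(tmp) / 'copy.bin'
    src.write_bytes(b'\x01\x02\x03\x04' * 10_000)

    copied = copy_in_chunks(src, dst)
    identical = src.read_bytes() == dst.read_bytes()

print(f'copied {copied} bytes, identical: {identical}')
```

Only one chunk is held in memory at a time, so the cost is the same for a 40 KB file and a 40 GB one.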
Variant: structured records with struct¶
For binary formats with a defined layout, the struct module packs and unpacks bytes declaratively. The format string starts with a byte-order character (> for big-endian), followed by one character per field type (I = uint32, H = uint16, f = float32).
import struct
# Big-endian: uint32 id, uint16 count, float32 value.
FORMAT = '>IHf'
packed = struct.pack(FORMAT, 12345, 42, 3.14)
print(f'packed: {packed.hex()} ({len(packed)} bytes)')
record_id, count, value = struct.unpack(FORMAT, packed)
print(f'id={record_id}, count={count}, value={value:.2f}')
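The pack/unpack pair extends naturally to a file of fixed-size records. This sketch assumes an illustrative file name (records.bin) and made-up sample values. It precompiles the format with struct.Struct and reads one record's worth of bytes at a time.

```python
import struct
import tempfile
from pathlib import Path

# Same layout as above: big-endian uint32 id, uint16 count, float32 value.
RECORD = struct.Struct('>IHf')

# Sample values chosen to be exactly representable as float32.
records = [(1, 10, 0.5), (2, 20, 1.5), (3, 30, 2.5)]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'records.bin'

    # Write: each record becomes RECORD.size bytes, back to back.
    with open(path, 'wb') as f:
        for rec in records:
            f.write(RECORD.pack(*rec))

    # Read: pull exactly one record's worth of bytes per iteration.
    loaded = []
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(RECORD.size), b''):
            loaded.append(RECORD.unpack(chunk))

print(f'record size: {RECORD.size} bytes')
print(f'round trip ok: {loaded == records}')
```

If the file ends with a partial record, RECORD.unpack raises struct.error, which is usually the behaviour you want for a truncated file.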
Why this works¶
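Text mode ('r', 'w') decodes bytes to str on read and encodes str to bytes on write. It also translates newlines: on read, \r\n and \r become \n on every platform; on write, \n becomes \r\n on Windows. That's fine, and necessary, for text. Applied to a PNG or a WAV file, it either corrupts the data silently (\r\n collapses to \n) or raises UnicodeDecodeError the first time it hits a byte sequence that isn't valid in the file's encoding (the platform default unless you pass encoding=, commonly UTF-8).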
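Both failure modes are easy to reproduce. The sketch below writes the PNG magic bytes, then reads them back in text mode twice. The latin-1 encoding is used only because it accepts every byte value, which isolates the newline damage; the file name is illustrative.

```python
import tempfile
from pathlib import Path

original = b'\x89PNG\r\n\x1a\n'  # the PNG magic bytes, 8 bytes long

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample.bin'
    path.write_bytes(original)

    # Failure 1: silent corruption. latin-1 decodes any byte, so no error
    # is raised, but newline translation turns \r\n into \n.
    with open(path, 'r', encoding='latin-1') as f:
        text = f.read()
    round_tripped = text.encode('latin-1')

    # Failure 2: loud failure. 0x89 cannot start a UTF-8 sequence.
    try:
        with open(path, 'r', encoding='utf-8') as f:
            f.read()
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

print(f'original:      {original!r} ({len(original)} bytes)')
print(f'round-tripped: {round_tripped!r} ({len(round_tripped)} bytes)')
print(f'utf-8 decode failed: {decode_failed}')
```

The first case is the more dangerous one: the read succeeds, and the file is simply one byte shorter than it should be.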
Binary mode ('rb', 'wb') skips both steps. You get exactly the bytes that were on disk, and you write exactly the bytes you pass in. The read type is bytes; what you write must be bytes or another bytes-like object such as bytearray (passing a str raises TypeError). Slicing, concatenation, len(), .hex(), and .startswith() all work the obvious way. The one surprise is that indexing a single position returns an int, not a length-1 bytes object.
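A quick tour of those bytes operations, using the PNG magic bytes as sample data. The variable names are illustrative.

```python
data = b'\x89PNG\r\n\x1a\n' + b'\x00' * 4

first_byte = data[0]         # indexing gives an int
first_slice = data[:1]       # slicing gives bytes
magic_hex = data[:4].hex()   # hex string of the first four bytes
is_png = data.startswith(b'\x89PNG')

# bytes is immutable; bytearray is the mutable, bytes-like counterpart.
buf = bytearray(data)
buf[0] = 0x00

print(first_byte)    # → 137
print(first_slice)   # → b'\x89'
print(magic_hex)     # → 89504e47
print(is_png)        # → True
print(len(data))     # → 12
```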
For tiny files or tests, Path.read_bytes() / Path.write_bytes() are the single-call shortcuts — they do the with open(...) dance for you.
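As a sketch of the shortcut form (the file name and temporary directory are illustrative):

```python
import tempfile
from pathlib import Path

payload = b'\x00\x01\x02\xff'

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'tiny.bin'

    written = path.write_bytes(payload)  # opens, writes, closes; returns byte count
    data = path.read_bytes()             # opens, reads everything, closes

print(f'wrote {written} bytes, read back equal: {data == payload}')
```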
Trade-offs¶
Path.read_bytes() loads the whole file into memory: fine up to a few MB, bad for a multi-GB video. For large binary files, read in fixed-size chunks with iter(lambda: f.read(8192), b''), which keeps memory use constant regardless of file size. See the chunked variant above.
For structured binary formats (say, a record of 'uint32 id, uint16 count, float32 value, big-endian'), the struct module packs Python values into bytes and unpacks bytes into tuples. That's easier than manual bit manipulation and makes the format definition explicit. See the struct variant above.
A common gotcha: mixing modes. If you open a file with 'w' and call .write(b'...'), you get a TypeError because a text-mode file's write() accepts only str; a binary-mode file rejects str the same way. The mode and the data type have to agree.
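The mismatch is easy to trigger in both directions. A small sketch, with an illustrative file name, that records each error and then writes correctly by encoding the str first:

```python
import tempfile
from pathlib import Path

errors = []

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'mismatch.bin'

    # Text mode rejects bytes.
    with open(path, 'w', encoding='utf-8') as f:
        try:
            f.write(b'raw bytes')
        except TypeError:
            errors.append('text mode rejected bytes')

    # Binary mode rejects str.
    with open(path, 'wb') as f:
        try:
            f.write('a string')
        except TypeError:
            errors.append('binary mode rejected str')
        # The fix: encode explicitly so the data type matches the mode.
        f.write('a string'.encode('utf-8'))

    final = path.read_bytes()

print(errors)
print(final)  # → b'a string'
```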
Related reading¶
- Process large files — chunk-at-a-time reading, also used for binary data.
- Avoid common file-handling mistakes — encoding, newline, and mode traps.
- File modes reference — every combination in one place.