Work with binary files¶

The question. You need to read or write non-text data — an image, an audio file, a custom binary format, a PNG header, a struct layout from a C program. Text mode will corrupt it: encoding translation and newline mangling are both active by default.

The answer: open with 'rb' or 'wb'. You now work in bytes, not str, and Python does no translation. For simple read-everything / write-everything, Path.read_bytes() and Path.write_bytes() are one-line shortcuts.

In [ ]:

Copied!





# Binary read + write — no encoding, no newline translation, no data corruption.
# The canonical pattern: open('rb') / open('wb'), work in bytes.
from pathlib import Path

path = Path('/tmp/sample.bin')

# Write — 'wb' plus bytes (note b'...'). Never specify encoding in binary mode.
header = b'\x89PNG\r\n\x1a\n'         # the real PNG magic bytes
payload = b'\x00' * 100                  # any binary data
with open(path, 'wb') as f:
    f.write(header)
    f.write(payload)

# Read — 'rb', result is a bytes object.
with open(path, 'rb') as f:
    data = f.read()

print(f'size: {len(data)} bytes')
print(f'first 8 bytes: {data[:8]!r}')     # the header
print(f'as hex:        {data[:8].hex()}')
print(f'recognised as PNG: {data.startswith(header)}')

path.unlink()
# Binary read + write — no encoding, no newline translation, no data corruption.
# The canonical pattern: open('rb') / open('wb'), work in bytes.
from pathlib import Path

path = Path('/tmp/sample.bin')

# Write — 'wb' plus bytes (note b'...'). Never specify encoding in binary mode.
header = b'\x89PNG\r\n\x1a\n'         # the real PNG magic bytes
payload = b'\x00' * 100                  # any binary data
with open(path, 'wb') as f:
    f.write(header)
    f.write(payload)

# Read — 'rb', result is a bytes object.
with open(path, 'rb') as f:
    data = f.read()

print(f'size: {len(data)} bytes')
print(f'first 8 bytes: {data[:8]!r}')     # the header
print(f'as hex:        {data[:8].hex()}')
print(f'recognised as PNG: {data.startswith(header)}')

path.unlink()

Variant: copy or hash in chunks¶

For large binary files, read in fixed-size chunks. 8 KB is a reasonable default — big enough that overhead is low, small enough that memory stays tiny.

In [ ]:

Copied!





import hashlib
from pathlib import Path

# Set up a sample binary file.
src = Path('/tmp/big.bin')
src.write_bytes(b'\x01\x02\x03\x04' * 10_000)

# Streaming SHA-256 — never holds the whole file in memory.
hasher = hashlib.sha256()
with open(src, 'rb') as f:
    for chunk in iter(lambda: f.read(8192), b''):    # b'' is the EOF sentinel
        hasher.update(chunk)

print(f'sha256: {hasher.hexdigest()}')
src.unlink()
import hashlib
from pathlib import Path

# Set up a sample binary file.
src = Path('/tmp/big.bin')
src.write_bytes(b'\x01\x02\x03\x04' * 10_000)

# Streaming SHA-256 — never holds the whole file in memory.
hasher = hashlib.sha256()
with open(src, 'rb') as f:
    for chunk in iter(lambda: f.read(8192), b''):    # b'' is the EOF sentinel
        hasher.update(chunk)

print(f'sha256: {hasher.hexdigest()}')
src.unlink()

Variant: structured records with `struct`¶

For binary formats with a defined layout, struct pack/unpacks bytes in a declarative way. Format chars say byte-order (> big-endian), then each field's type (I = uint32, H = uint16, f = float32).

In [ ]:

Copied!

import struct

# Big-endian: uint32 id, uint16 count, float32 value.
FORMAT = '>IHf'

packed = struct.pack(FORMAT, 12345, 42, 3.14)
print(f'packed: {packed.hex()}  ({len(packed)} bytes)')

record_id, count, value = struct.unpack(FORMAT, packed)
print(f'id={record_id}, count={count}, value={value:.2f}')
import struct

# Big-endian: uint32 id, uint16 count, float32 value.
FORMAT = '>IHf'

packed = struct.pack(FORMAT, 12345, 42, 3.14)
print(f'packed: {packed.hex()}  ({len(packed)} bytes)')

record_id, count, value = struct.unpack(FORMAT, packed)
print(f'id={record_id}, count={count}, value={value:.2f}')

Why this works¶

Text mode ('r', 'w') decodes bytes to str on read, encodes on write, and translates newlines (\r\n ↔ \n) on Windows. That's fine — and necessary — for text. Applied to a PNG or a wav file, it will silently mutate your data: flip \r\n into \n, or raise UnicodeDecodeError the first time it hits a byte that's not valid UTF-8.

Binary mode ('rb', 'wb') skips both steps. You get exactly the bytes that were on disk, and you write exactly the bytes you pass in. The read type is bytes; the write type must be bytes too (TypeError if you pass a str). Slicing, concatenation, len(), .hex(), and .startswith() all work the obvious way.

For tiny files or tests, Path.read_bytes() / Path.write_bytes() are the single-call shortcuts — they do the with open(...) dance for you.

Trade-offs¶

Path.read_bytes() loads the whole file — fine up to a few MB, bad for a multi-GB video. For large binary files, read in fixed-size chunks with iter(lambda: f.read(8192), b'') — constant memory regardless of size. See the extras.

For structured binary formats (a record is 'uint32 id, uint16 count, float32 value, big-endian'), the struct module packs and unpacks bytes into Python tuples. That's easier than hand-bit-fiddling and makes the format definition explicit. Also see the extras.

A common gotcha: mixing modes. If you open('w') and try to .write(b'...') you get a TypeError because str.write rejects bytes, and vice versa. The mode and the data types have to agree.

Process large files — chunk-at-a-time reading, also used for binary data.
Avoid common file-handling mistakes — encoding, newline, and mode traps.
File modes reference — every combination in one place.

Work with binary files¶

Variant: copy or hash in chunks¶

Variant: structured records with struct¶

Why this works¶

Trade-offs¶

Related reading¶

Variant: structured records with `struct`¶