gh-62944: Add performance note for out-of-order extraction from compressed archives #151115

Open
nahcmon wants to merge 1 commit into python:main from nahcmon:issue-62944-tarfile-perf-note

Conversation

nahcmon commented Jun 8, 2026

When a compressed tarfile is opened with one of the 'r:' modes ('r:gz',
'r:bz2', 'r:xz', 'r:zst'), the module wraps the compressed stream and emulates
seeking. Seeking forwards is cheap: the stream is decompressed until it
reaches the target offset. Seeking backwards is expensive: the stream must be
re-decompressed from the beginning, because compressed formats such as gzip,
bzip2, LZMA, and Zstandard do not support random access.

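Accessing archive members out of their storage order therefore causes
repeated re-decompression. In the worst case, extracting k equally sized
members in reverse order, every backward seek restarts from the beginning, so
the total data decompressed is roughly k/2 times the uncompressed size of the
archive and grows quadratically with the number of members. For a 500 MB gzip
archive this adds up quickly. This came up in the original bug report
(bpo-18744 / gh-62944), where the reporter saw a ~60× slowdown compared to
manually opening with gzip.open().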
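
To make the cost concrete, here is a small sketch of the seek behaviour described above. It is a simplified model rather than tarfile's actual code: the function name `bytes_decompressed` and the `(start, end)` member layout are made up for illustration, and header scanning is ignored.

```python
def bytes_decompressed(members, order):
    """Estimate bytes decompressed when reading members in the given order.

    members: list of (start, end) offsets in the uncompressed stream
             (hypothetical layout, not read from a real archive).
    order:   iterable of indexes into members.
    """
    position = 0  # current offset of the simulated decompressor
    total = 0
    for index in order:
        start, end = members[index]
        if start < position:
            # Backward seek: restart from the beginning of the stream.
            position = 0
        # Forward seek plus read: decompress everything up to the member's end.
        total += end - position
        position = end
    return total


# Ten equally sized members, one unit each.
members = [(i, i + 1) for i in range(10)]
in_order = bytes_decompressed(members, range(10))
reverse = bytes_decompressed(members, reversed(range(10)))
print(in_order, reverse)  # prints: 10 55
```

With ten members, reverse order decompresses 55 units instead of 10, which matches the k(k+1)/2 growth of the worst case.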

The current documentation is silent on this performance characteristic.
@serhiy-storchaka confirmed in the original issue: "adding a warning looks
reasonable."

Changes

Added a .. note:: block inside the tarfile.open() documentation, immediately
after the table of r:* modes, explaining:

  • why out-of-order extraction is slow for compressed archives
  • that performance is proportional to total data decompressed, not member size
  • that members should be extracted in archive order, or TarFile.extractall()
    used for best performance
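
For reviewers, a minimal sketch of the access pattern the note recommends when only some members are needed. The helper names `build_archive` and `read_in_archive_order` are hypothetical and exist only for this example; the archive is built in memory so the snippet is self-contained.

```python
import io
import tarfile


def build_archive(files):
    """Return gzip-compressed tar bytes holding the given {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_in_archive_order(archive_bytes, wanted):
    """Read the wanted members, visiting them in storage order."""
    results = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        selected = [m for m in tar.getmembers() if m.name in wanted]
        # Sort by header offset so every seek after the first goes forwards.
        selected.sort(key=lambda m: m.offset)
        for member in selected:
            results[member.name] = tar.extractfile(member).read()
    return results


files = {"a.txt": b"alpha", "b.txt": b"bravo", "c.txt": b"charlie"}
archive = build_archive(files)
contents = read_in_archive_order(archive, {"c.txt", "a.txt"})
```

getmembers() already returns members in archive order, so the sort only matters when the list of wanted members comes from somewhere else, such as user input.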

Test coverage

Documentation-only change; no test file modifications needed.

NEWS entry

Not required for documentation-only changes per the CPython developer guide.

CLA

Note: the CLA for the GitHub account nahcmon may not be signed. The PR will be
held by the CLA bot until it is signed at https://cla.python.org/.

… compressed archives

Extracting members in a different order than they appear in a compressed
tarfile requires re-decompressing from the beginning of the stream for each
backward seek. Add a note to tarfile.open() documenting this and recommending
in-order extraction or use of TarFile.extractall() for best performance.
@nahcmon nahcmon requested a review from ethanfurman as a code owner June 8, 2026 21:42

python-cla-bot Bot commented Jun 8, 2026

All commit authors signed the Contributor License Agreement.

@read-the-docs-community

Documentation build overview

📚 cpython-previews | 🛠️ Build #33048292 | 📁 Comparing a56e6be against main (29a920e)

  🔍 Preview build  

1 file changed
± library/tarfile.html

Labels

awaiting merge, docs (Documentation in the Doc dir), skip news

Projects

Status: Todo

2 participants