Iterate large directories efficiently with python.
python-getdents
is a simple wrapper around Linux system call getdents64
(see man getdents
for details). More details on approach.
- Verify that implementation works on platforms other than
x86_64
.
pip install getdents
python3 -m venv env
. env/bin/activate
pip install -e .[test]
pip install cibuildwheel
cibuildwheel --platform linux --output-dir wheelhouse
ulimit -v 33554432 && py.test tests/
Or
ulimit -v 33554432 && ./setup.py test
from getdents import getdents
for inode, type, name in getdents('/tmp', 32768):
print(name)
import os
from getdents import *
fd = os.open('/tmp', O_GETDENTS)
for inode, type, name in getdents_raw(fd, 2**20):
print({
DT_BLK: 'blockdev',
DT_CHR: 'chardev ',
DT_DIR: 'dir ',
DT_FIFO: 'pipe ',
DT_LNK: 'symlink ',
DT_REG: 'file ',
DT_SOCK: 'socket ',
DT_UNKNOWN: 'unknown ',
}[type], {
True: 'd',
False: ' ',
}[inode == 0],
name,
)
os.close(fd)
python-getdents [-h] [-b N] [-o NAME] PATH
+--------------------------+-------------------------------------------------+ | Option | Description | +==========================+=================================================+ | -b N
| Buffer size (in bytes) to allocate when | | | iterating over directory. Default is 32768, the | | | same value used by glibc, you probably want to | +--------------------------+ increase this value. Try starting with 16777216 | | --buffer-size N
| (16 MiB). Best performance is achieved when | | | buffer size rounds to size of the file system | | | block. | +--------------------------+-------------------------------------------------+ | -o NAME
| Output format: | | | | | | * plain
(default) Print only names. | | | * csv
Print as comma-separated values in | +--------------------------+ order: inode, type, name. | | --output-format NAME
| * csv-headers
Same as csv
, but print | | | headers on the first line also. | | | * json
output as JSON array. | | | * json-stream
output each directory entry | | | as single json object separated by newline. | +--------------------------+-------------------------------------------------+
- 3 - Requested buffer is too large
- 4 -
PATH
not found. - 5 -
PATH
is not a directory. - 6 - Not enough permissions to read contents of the
PATH
.
python-getdents /path/to/large/dir
python -m getdents /path/to/large/dir
python-getdents /path/to/large/dir -o csv -b 16777216 > dir.csv