head / tail operations are slow on larger files
jamestexas opened this issue
Howdy -
I wanted to preface this with: If I missed a contributor guideline or anything, please let me know. I did check other issues and did not see one relevant to this.
I am somewhat new to using rich-cli (but am familiar with rich) and recently attempted to parse a somewhat large CSV file (~119 MB, 483k lines).
I did not expect the whole CSV to load quickly, but I was somewhat surprised that running --head and --tail took as long as they did. Obviously they won't behave like GNU tail / head, but I took a jab at a minimal / naive change to this and was able to get it much faster. It's around this here; if you want, I am happy to open a PR. I'll also just put a code block of what I did. I took the somewhat naive approach to file parsing (rather than parsing the buffer stream per line, which would be more efficient for tail) to avoid making a huge change.
- head is just using the existing generator to parse out x rows and filtering out None values. Since the list gets iterated roughly twice, the second iteration (the one that adds indexes) is also way faster.
- tail is using a collections.deque example recipe (which, while still going through the whole file, does not store the whole file in memory).
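from collections import deque

rows = iter(reader)
if has_header:
    header = next(rows)
    for column in header:
        table.add_column(column)
if head is not None:
    # Pull at most `head` rows; next(rows, None) pads with None once the
    # reader is exhausted, and filter(None, ...) drops that padding
    # (note that it also drops any empty rows).
    table_rows = list(
        filter(
            None,
            (next(rows, None) for _ in range(head)),
        )
    )
elif tail is not None:
    # A deque with maxlen=tail keeps only the last `tail` rows in memory.
    table_rows = deque(rows, tail)
else:
    table_rows = list(rows)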
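As a self-contained illustration of the same idea, here is a minimal sketch that can be run outside of rich-cli. It assumes itertools.islice for the head case instead of the filter / next combination, which stops early in the same way but keeps empty rows. The name select_rows and the sample data are hypothetical and not part of rich-cli.

```python
import csv
import io
from collections import deque
from itertools import islice


def select_rows(reader, head=None, tail=None):
    """Return at most `head` leading or `tail` trailing rows from a row iterator."""
    rows = iter(reader)
    if head is not None:
        # islice stops after `head` rows, so the rest of the input is never parsed.
        return list(islice(rows, head))
    if tail is not None:
        # A bounded deque streams through every row but keeps only the last `tail`.
        return list(deque(rows, maxlen=tail))
    return list(rows)


# Hypothetical sample data with a header and one blank line.
sample = io.StringIO("a,b\n1,2\n\n3,4\n5,6\n")
reader = csv.reader(sample)
header = next(reader)
first_two = select_rows(reader, head=2)
print(header, first_two)
```

The blank line comes back from csv.reader as an empty list, so it is kept here, whereas filter(None, ...) would drop it.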
These are naive benchmarks, but comparing the two (where the rich command is the installed CLI, and python3 ./src/rich_cli has my changes):
Head
> time python3 ./src/rich_cli --head 500 large_csv.csv &> /dev/null    [3.10.5]
python3 ./src/rich_cli --head 500 large_csv.csv &> /dev/null  0.83s user 0.47s system 94% cpu 1.369 total
> time rich --head 500 large_csv.csv &> /dev/null    [3.10.5]
rich --head 500 large_csv.csv &> /dev/null  2.81s user 0.60s system 99% cpu 3.443 total
Tail
> time rich --tail 500 large_csv.csv &> /dev/null    [3.10.5]
rich --tail 500 large_csv.csv &> /dev/null  2.95s user 0.63s system 99% cpu 3.604 total
> time python3 ./src/rich_cli --tail 500 large_csv.csv &> /dev/null    [3.10.5]
python3 ./src/rich_cli --tail 500 large_csv.csv &> /dev/null  1.93s user 0.53s system 96% cpu 2.545 total
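The tail case still reads the whole file, which is likely why its gain is smaller. For comparison, a rough sketch of the buffer-based alternative mentioned earlier (reading chunks backwards from the end of the file) might look like the following. This is not what the change above does; tail_lines and chunk_size are hypothetical names, and the sketch assumes UTF-8 text with no quoted fields that contain newlines, which a real CSV tail would have to handle.

```python
import os
import tempfile


def tail_lines(path, n, chunk_size=64 * 1024):
    """Return the last `n` lines of a file by reading fixed-size chunks from the end."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Prepend chunks until the buffer holds more than n newlines,
        # or until the start of the file is reached.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    # Only complete lines are decoded; a partial first line is never in the last n.
    return [line.decode("utf-8") for line in data.splitlines()[-n:]]


# Hypothetical sample file: a header plus 1000 rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(1000)))
last_three = tail_lines(tmp.name, 3, chunk_size=16)
os.remove(tmp.name)
print(last_three)
```

The returned lines could then be passed to csv.reader, so only the tail of the file is ever parsed.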
Anyway, let me know if you want me to do anything here!