head / tail operations are slow on larger files

Question

jamestexas opened this issue 2 years ago

Howdy -

I wanted to preface this with: If I missed a contributor guideline or anything, please let me know. I did check other issues and did not see one relevant to this.

I am somewhat new to using rich-cli (but am familiar with rich) and recently attempted to parse a somewhat large CSV file (~119 MB, 483k lines).
I did not expect the whole CSV to load quickly, but I was somewhat surprised that running --head and --tail took as long as they did. Obviously they won't behave like GNU tail / head, but I took a jab at a minimal / naive change and was able to make both noticeably faster (timings below). It's around this here
If you want, I am happy to open a PR. I'll also just put a code block of what I did. I took the somewhat naive approach to file parsing (rather than parsing the buffer stream per line, which would be more efficient for tail) to avoid making a huge change.

head just uses the existing generator to pull out x rows, filtering out the None values that show up once the file runs out of rows. Since the list gets iterated roughly twice, the second iteration, which adds indexes, is also much faster.
tail uses the collections.deque recipe with a maximum length (which, while still going through the whole file, does not store the whole file in memory).


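    from collections import deque  # needed for the tail branch

    rows = iter(reader)
    if has_header:
        header = next(rows)
        for column in header:
            table.add_column(column)

    if head is not None:
        # Pull at most `head` rows. next(rows, None) returns None once the
        # reader is exhausted, and filter(None, ...) drops those. Note that
        # it also drops empty rows, since [] is falsy.
        table_rows = list(
            filter(
                None,
                (next(rows, None) for _ in range(head)),
            )
        )

    elif tail is not None:
        # A bounded deque keeps only the last `tail` rows in memory.
        table_rows = deque(rows, tail)

    else:
        table_rows = list(rows)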
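
For comparison, here is a small self-contained sketch of the same head / tail selection that can be run outside rich-cli. It is not the project's code: the function name select_rows and the sample data are made up for illustration, and it uses itertools.islice for head instead of the next() / filter() combination above. One difference worth knowing is that islice keeps empty rows (csv.reader yields [] for blank lines), whereas filter(None, ...) drops them.

```python
import csv
import io
from collections import deque
from itertools import islice


def select_rows(reader, head=None, tail=None):
    """Return only the requested rows from an iterable of CSV rows.

    Hypothetical helper for illustration; not part of rich-cli.
    """
    rows = iter(reader)
    if head is not None:
        # islice stops after `head` rows, so the rest of the input is never read.
        return list(islice(rows, head))
    if tail is not None:
        # A bounded deque still consumes every row, but keeps only the last `tail`.
        return list(deque(rows, maxlen=tail))
    return list(rows)


if __name__ == "__main__":
    # Made-up sample data with a blank line in the middle.
    sample = io.StringIO("a,b\n1,2\n\n3,4\n5,6\n")
    reader = csv.reader(sample)
    header = next(reader)
    print(header)
    print(select_rows(reader, head=2))
```

Either form avoids materialising the whole file for head. Which one is preferable depends on whether blank lines should count towards the requested number of rows.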

These are naive benchmarks, but here is a comparison of the two (where the rich command is the installed CLI, and python3 ./src/rich_cli has my changes):

Head

└> time python3 ./src/rich_cli --head 500 large_csv.csv &> /dev/null                                           [👾 3.10.5]➜
python3 ./src/rich_cli --head 500 large_csv.csv &> /dev/null  0.83s user 0.47s system 94% cpu 1.369 total

└> time rich --head 500 large_csv.csv &> /dev/null                                                             [👾 3.10.5]➜
rich --head 500 large_csv.csv &> /dev/null  2.81s user 0.60s system 99% cpu 3.443 total

Tail

└> time rich --tail 500 large_csv.csv &> /dev/null                                                             [👾 3.10.5]➜
rich --tail 500 large_csv.csv &> /dev/null  2.95s user 0.63s system 99% cpu 3.604 total

└> time python3 ./src/rich_cli --tail 500 large_csv.csv &> /dev/null                                           [👾 3.10.5]➜
python3 ./src/rich_cli --tail 500 large_csv.csv &> /dev/null  1.93s user 0.53s system 96% cpu 2.545 total
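
The tail timing still includes a full pass over the file. A rough sketch of the alternative mentioned earlier (reading the file from the end so that only the last few blocks are touched) could look like the following. This is not what the change above does and has not been benchmarked here. The names tail_lines and tail_csv_rows and the block_size parameter are hypothetical, and the sketch assumes a UTF-8 (or other ASCII-compatible) encoding and no quoted fields containing newlines, since it splits on raw line breaks before handing the lines to csv.reader.

```python
import csv
import os
import tempfile


def tail_lines(path, n, block_size=64 * 1024):
    """Return the last n lines of a file as bytes, reading backwards in blocks."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # Keep prepending blocks until more than n newlines are buffered
        # (n + 1 newlines guarantee n complete lines even when the file ends
        # with a newline) or the start of the file is reached.
        while position > 0 and data.count(b"\n") <= n:
            step = min(block_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
    # Any partial first line in the buffer falls outside the last n lines.
    return data.splitlines()[-n:]


def tail_csv_rows(path, n, encoding="utf-8"):
    """Parse the last n physical lines of a CSV file into rows."""
    lines = [line.decode(encoding) for line in tail_lines(path, n)]
    return list(csv.reader(lines))


if __name__ == "__main__":
    # Build a throwaway CSV file to try the functions on.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, newline=""
    ) as tmp:
        csv.writer(tmp).writerows([i, i * i] for i in range(1000))
    try:
        print(tail_csv_rows(tmp.name, 3))
    finally:
        os.remove(tmp.name)
```

If n is larger than the number of data rows, the header line comes back as an ordinary row, so a caller would need to handle that case separately.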

Anyway, let me know if you want me to do anything here!