Textualize / rich-cli

Rich-cli is a command line toolbox for fancy output in the terminal

Home Page: https://www.textualize.io

head / tail operations are slow on larger files

jamestexas opened this issue

Howdy -

I wanted to preface this with: If I missed a contributor guideline or anything, please let me know. I did check other issues and did not see one relevant to this.

I am somewhat new to using rich-cli (but am familiar with rich) and recently attempted to parse a somewhat large CSV file (~119 MB, 483k lines).
I did not expect the whole CSV to load quickly, but I was somewhat surprised that running --head and --tail took as long as they did. Obviously they won't behave like GNU tail / head, but I took a stab at a minimal / naive change and was able to get it much faster (timings below). It's around this here;
if you want, I am happy to open a PR. I'll also just put a code block of what I did. I took the somewhat naive approach to file parsing (rather than parsing the buffer stream per line, which would be more efficient for tail) to avoid making a huge change.

  • head is just using the existing generator to parse out x rows and filtering out None values. Since the list gets iterated ~ twice, this means the second iteration that adds indexes is also way faster.

  • tail is using a collections.deque example recipe (which, while still going through the whole file, does not store the whole file in memory).


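    # Requires `from collections import deque` at the top of the module.
    rows = iter(reader)
    if has_header:
        header = next(rows)
        for column in header:
            table.add_column(column)

    if head is not None:
        # Pull at most `head` rows from the reader; filter(None, ...) drops the
        # None padding (and any empty rows) if the file is shorter than `head`.
        table_rows = list(
            filter(
                None,
                (next(rows, None) for _ in range(head)),
            )
        )

    elif tail is not None:
        # A deque bounded to `tail` items keeps only the last rows seen.
        table_rows = deque(rows, tail)

    else:
        table_rows = list(rows)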
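
For comparison, the same head / tail selection can be sketched with itertools.islice in place of the filter(None, ...) generator. This is only an illustration, not rich-cli code: select_rows is a hypothetical helper name, and the in-memory CSV is made-up sample data. One behavioural difference to be aware of is that islice keeps empty rows (csv.reader yields [] for a blank line), whereas filter(None, ...) drops them.

```python
import csv
import io
from collections import deque
from itertools import islice


def select_rows(rows, head=None, tail=None):
    """Return the first `head` rows, the last `tail` rows, or all rows.

    Hypothetical helper for illustration; not part of rich-cli.
    """
    if head is not None:
        # islice stops after `head` items and never reads the rest of the input.
        return list(islice(rows, head))
    if tail is not None:
        # A bounded deque keeps only the newest `tail` rows while the iterator
        # is consumed, so memory stays proportional to `tail`, not file size.
        return list(deque(rows, maxlen=tail))
    return list(rows)


# Made-up sample data with one blank line in the middle.
data = "id,name\n1,a\n2,b\n\n3,c\n4,d\n"

reader = csv.reader(io.StringIO(data))
header = next(reader)  # consume the header row first
first_three = select_rows(reader, head=3)
print(first_three)  # the blank line shows up as an empty row
```

Whether empty rows should count toward --head is a design choice; the filter(None, ...) version above silently skips them, so it can return fewer than `head` rows even when the file is longer.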


These are naive benchmarks, but comparing the two (where the rich command is the installed CLI, and python3 ./src/rich_cli has my changes):

Head

└> time python3 ./src/rich_cli --head 500 large_csv.csv &> /dev/null                                           [👾 3.10.5]➜
python3 ./src/rich_cli --head 500 large_csv.csv &> /dev/null  0.83s user 0.47s system 94% cpu 1.369 total

└> time rich --head 500 large_csv.csv &> /dev/null                                                             [👾 3.10.5]➜
rich --head 500 large_csv.csv &> /dev/null  2.81s user 0.60s system 99% cpu 3.443 total

Tail

└> time rich --tail 500 large_csv.csv &> /dev/null                                                             [👾 3.10.5]➜
rich --tail 500 large_csv.csv &> /dev/null  2.95s user 0.63s system 99% cpu 3.604 total

└> time python3 ./src/rich_cli --tail 500 large_csv.csv &> /dev/null                                           [👾 3.10.5]➜
python3 ./src/rich_cli --tail 500 large_csv.csv &> /dev/null  1.93s user 0.53s system 96% cpu 2.545 total
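
The deque-based tail still has to read the entire file, which is probably why it gains less than head. A rough sketch of the more invasive alternative, reading blocks backwards from the end of the file until enough newlines have been seen, might look like the following. Everything here is an assumption for illustration: tail_lines and block_size are hypothetical names, the file is assumed to be UTF-8, and the approach breaks if a quoted CSV field contains a newline. The header row would also still need to be read separately from the start of the file.

```python
import csv
import os
import tempfile


def tail_lines(path, n, block_size=64 * 1024):
    """Return the last `n` lines of `path`, reading backwards in blocks.

    Hypothetical helper; assumes UTF-8 text and no newlines inside fields.
    """
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # Keep reading until more than `n` newlines are buffered, so the
        # earliest line we keep is guaranteed to be complete.
        while position > 0 and data.count(b"\n") <= n:
            step = min(block_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
    if position > 0:
        # We stopped mid-file: drop the first, possibly partial, line.
        data = data.split(b"\n", 1)[1]
    return [line.decode("utf-8") for line in data.splitlines()[-n:]]


# Usage on a throwaway file; the tiny block size just exercises the loop.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,name\n" + "".join(f"{i},row{i}\n" for i in range(1000)))

rows = list(csv.reader(tail_lines(tmp.name, 3, block_size=16)))
os.remove(tmp.name)
print(rows)
```

csv.reader accepts any iterable of strings, so the returned lines can be passed to it directly.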

Anyway, let me know if you want me to do anything here!