Inconsistent header treatment for csv tables

Question

AndydeCleyre opened this issue 2 years ago

Hello!

Sorry, I'm not sure exactly what's going on here, so I'll get right to it. Using Zsh:

$ rows=( Package,Version,Latest,Project 'tomli,2.0.0,2.0.1,~/Code/zpy' 'click,8.0.1,8.0.3,~/Code/archbuilder_iosevka' 'pep517,0.11.0,0.12.0,~/Code/archbuilder_iosevka' 'ruamel.yaml,0.17.17,0.17.21,~/Code/archbuilder_iosevka' 'tomli,1.2.1,2.0.1,~/Code/archbuilder_iosevka' )
$ rich --csv - <<<${(F)rows}

$ rows=( 'Package,Version,Latest,Project' 'tomli,2.0.0,2.0.1,~/Code/zpy' 'click,8.0.1,8.0.3,~/Code/archbuilder_iosevka' 'pep517,0.11.0,0.12.0,~/Code/archbuilder_iosevka' 'ruamel.yaml,0.17.17,0.17.21,~/Code/archbuilder_iosevka' 'tomli,1.2.1,2.0.1,~/Code/archbuilder_iosevka' )
$ rich --csv - <<<${(F)rows}

Same result as above

$ rows=( 'tomli,2.0.0,2.0.1,~/Code/zpy' 'click,8.0.1,8.0.3,~/Code/archbuilder_iosevka' 'pep517,0.11.0,0.12.0,~/Code/archbuilder_iosevka' 'ruamel.yaml,0.17.17,0.17.21,~/Code/archbuilder_iosevka' 'tomli,1.2.1,2.0.1,~/Code/archbuilder_iosevka' )
$ rich --csv - <<<${(F)rows}

What determines whether the first row gets treated as a header?

Thanks for any help!

Will McGugan · Answer 1 · Fri Feb 18 2022 06:55:27 GMT+0800 (China Standard Time)

It’s a heuristic used by the Python csv library, which is imperfect, as you have noticed. In the future I’ll expose a way to adjust this via an option.

Andy Kluger · Answer 2 · Fri Feb 18 2022 07:09:26 GMT+0800 (China Standard Time)

Thanks! Do you know what about the input in this case gives CSV the wrong idea, so that I can work around this?

Will McGugan · Answer 3 · Fri Feb 18 2022 07:20:19 GMT+0800 (China Standard Time)

Not sure. You could have a look at the source of the csv module.

Andy Kluger · Answer 4 · Fri Feb 18 2022 10:43:46 GMT+0800 (China Standard Time)

FYI:

csv.Sniffer.has_header:

      def has_header(self, sample):
          # Creates a dictionary of types of data in each column. If any
          # column is of a single type (say, integers), *except* for the first
          # row, then the first row is presumed to be labels. If the type
          # can't be determined, it is assumed to be a string in which case
          # the length of the string is the determining factor: if all of the
          # rows except for the first are the same length, it's a header.
          # Finally, a 'vote' is taken at the end for each column, adding or
          # subtracting from the likelihood of the first row being a header.
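
To see what the heuristic described in that comment decides for the rows in the question, the standard-library `csv.Sniffer().has_header` can be called directly on the same text. This is only a sketch: it assumes rich-cli passes the input to `csv.Sniffer` more or less unchanged, which the thread does not confirm, and `sniff_header`, `HEADER` and `DATA_ROWS` are names made up for this example.

```python
import csv

# Rows copied from the question above.
HEADER = "Package,Version,Latest,Project"
DATA_ROWS = [
    "tomli,2.0.0,2.0.1,~/Code/zpy",
    "click,8.0.1,8.0.3,~/Code/archbuilder_iosevka",
    "pep517,0.11.0,0.12.0,~/Code/archbuilder_iosevka",
    "ruamel.yaml,0.17.17,0.17.21,~/Code/archbuilder_iosevka",
    "tomli,1.2.1,2.0.1,~/Code/archbuilder_iosevka",
]


def sniff_header(lines):
    """Return csv.Sniffer's guess at whether the first line is a header.

    Hypothetical helper: it joins the lines with newlines (as Zsh's
    ${(F)rows} does) and hands the text to the standard-library heuristic.
    """
    sample = "\n".join(lines)
    return csv.Sniffer().has_header(sample)


if __name__ == "__main__":
    # First and second commands in the question: real header row present.
    print("with header row:   ", sniff_header([HEADER, *DATA_ROWS]))
    # Third command in the question: data rows only.
    print("without header row:", sniff_header(DATA_ROWS))
```

Because no column here parses as a number, the heuristic falls back to comparing string lengths, so the answer depends on whether the cells below the first row happen to share a length within a column. Changing a single value's length and re-running is a quick way to see which column tips the vote.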

I spotted "lexter" @ https://github.com/Textualize/rich-cli/blob/main/src/rich_cli/__main__.py#L569