s-weigand / flake8-nb

Flake8 checking for jupyter notebooks

Identifying code cells in un-run notebooks

psychemedia opened this issue · comments

If the flake8-nb tool is run over a notebook with unrun cells, all the cell references seem to be returned with an empty [ ] cell reference.

It would be useful if there were an option to at least add an accession number for each reported code cell, or perhaps to override the cell run numbers with the code cell accession number.

I was already wondering why no one had ever requested this feature, since the execution count formatting becomes quite useless when using tools like nbstripout to keep the git history cleaner 🤔
I just thought everyone is using nbQA by now since it supports a lot more tools. 😅

The only advantage flake8-nb has over nbQA (at the moment and as far as I know) is the use of cell tags to fine-tune the reports, which is aimed at educational material (tutorial/lecture/book), to show off and explain bad practices/code without the CI failing.

That said I started a branch for this feature, so users can easily switch reporting style by adding the --use-cell-nr-format flag.

Re: the branch featuring --use-cell-nr-format, I don't see that anywhere? Nor a PR that has since added it to main?

Yeah, I didn't push that branch yet.
Also, after thinking about it a bit, --use-cell-nr-format would still lack configurability, since a simple flag would rely on two default formatting strings.
My current draft implementation, with examples from the README:

$ flake8_nb example_notebook.ipynb
example_notebook.ipynb#In[34]:1:31: E231 missing whitespace after ':'
$ flake8_nb --use-cell-nr-format example_notebook.ipynb
example_notebook.ipynb:code_cell#5:1:31: E231 missing whitespace after ':'

In the long run, having an option --notebook-cell-format with a default of "{nb_path}#In[{cell_exec_count}]" would provide users with more freedom to customize the reports. And also leave the possibility to extend it without breaking current behavior.
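To illustrate the template idea, here is a minimal Python sketch of how such a format string could be filled in. It is not flake8-nb's actual implementation: the helper `format_cell_ref` and the `code_cell_count` variable are assumptions made for illustration, and only the default template is taken from the comment above.

```python
# Hypothetical sketch, not flake8-nb's real code.
# The default template is the one proposed in the comment above.
DEFAULT_CELL_FORMAT = "{nb_path}#In[{cell_exec_count}]"


def format_cell_ref(template, nb_path, cell_exec_count, code_cell_count):
    """Fill a user-supplied template with the values known for one code cell."""
    # Unrun cells have no execution count; render a blank like Jupyter's "In [ ]".
    exec_count = " " if cell_exec_count is None else cell_exec_count
    # str.format ignores keyword arguments the template does not use,
    # so every template can pick just the variables it needs.
    return template.format(
        nb_path=nb_path,
        cell_exec_count=exec_count,
        code_cell_count=code_cell_count,  # assumed variable name
    )


default_ref = format_cell_ref(DEFAULT_CELL_FORMAT, "example_notebook.ipynb", 34, 5)
custom_ref = format_cell_ref(
    "{nb_path}:code_cell#{code_cell_count}", "example_notebook.ipynb", 34, 5
)
unrun_ref = format_cell_ref(DEFAULT_CELL_FORMAT, "example_notebook.ipynb", None, 5)
print(default_ref)
print(custom_ref)
print(unrun_ref)
```

Because the default template reproduces the existing `#In[...]` output, current reports would stay unchanged unless a user opts in to a different template.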

I think #In[] works if the cells are run, but #N is neater if you are just counting. The reporting method used by nbqa seems to use index 1 rather than 0 for the first code cell when the report refers to the cell count number, which I guess also conforms to the run number when you run a notebook in a fresh kernel.

I think for consistency starting at 1 is the best way to go, since you won't need to overthink "Wait did this start at 0 or 1?".
Also, I can think of three wanted ways to identify a cell:

  • Execution count (#In[ ])
  • Total cell count (also taking markdown and raw cells into account)
  • Code cell count (ignoring markdown and raw cells)

That way users can choose their favorite counting method.
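The three counting methods could be derived from the notebook JSON roughly as follows. This is only a sketch under the assumption of an nbformat 4 style notebook (a top-level `cells` list whose entries carry `cell_type` and, for code cells, `execution_count`); the function name `enumerate_cell_counts` is made up for illustration, and all counts start at 1.

```python
def enumerate_cell_counts(notebook):
    """Yield (exec_count, total_cell_count, code_cell_count) per code cell."""
    code_cell_count = 0
    # total_cell_count covers markdown and raw cells too, starting at 1.
    for total_cell_count, cell in enumerate(notebook["cells"], start=1):
        if cell["cell_type"] != "code":
            continue
        code_cell_count += 1
        # execution_count is None for cells that were never run or were stripped.
        yield cell.get("execution_count"), total_cell_count, code_cell_count


# Tiny in-memory notebook used only for this example.
example_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": "# Title"},
        {"cell_type": "code", "execution_count": None, "source": "x = {'a':1}"},
        {"cell_type": "raw", "source": "raw text"},
        {"cell_type": "code", "execution_count": 7, "source": "print(x)"},
    ]
}

counts = list(enumerate_cell_counts(example_notebook))
print(counts)
```

Here the first code cell is unrun, so only its total cell count (2) and code cell count (1) identify it.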
Also, having all those values available would make implementing #127 a lot easier.

With total cell count, there is always the chance that folk are scrolling rather than down-arrowing through cells, in which case you might miscount on contiguous markdown cells etc.

But yes, in the general case, where you iterate or click through individual cells of whatever flavour, a simple cell count number (offset by 1? or in that case would you start at 0?!) also makes sense.

I meant that, to be consistent, all counts should start at 1, rather than using 0 in one case and 1 in the other.

While I also don't think that a total cell count would be THAT useful for reporting, it would be very useful to map back output to the original notebook.

Mapping back would certainly be simplified. I did a hacky thing here to take the nbqa flake report and inject flake8 reports into the notebooks that raised them, and that required maintaining a local count of code cells when parsing the notebook so they could be reconciled with the code cell number referenced in the flake8 report.
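The kind of reconciliation described above might look like the following Python sketch. It is not the code from the linked hack: the report line shape is assumed to be the draft `code_cell#N` format shown earlier in this thread, and the names `REPORT_RE`, `code_cell_positions` and `group_reports_by_cell_index` are hypothetical.

```python
import re

# Assumed report shape: "<path>:code_cell#<n>:<line>:<col>: <message>"
REPORT_RE = re.compile(
    r"^(?P<path>.+?):code_cell#(?P<cell>\d+):(?P<line>\d+):(?P<col>\d+): (?P<msg>.*)$"
)


def code_cell_positions(notebook):
    """Map 1-based code cell number to 0-based index in notebook['cells']."""
    positions = {}
    code_cell_number = 0
    for index, cell in enumerate(notebook["cells"]):
        if cell["cell_type"] == "code":
            code_cell_number += 1
            positions[code_cell_number] = index
    return positions


def group_reports_by_cell_index(report_lines, notebook):
    """Group report messages by the notebook cell index that raised them."""
    positions = code_cell_positions(notebook)
    grouped = {}
    for report_line in report_lines:
        match = REPORT_RE.match(report_line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        cell_index = positions[int(match["cell"])]
        message = f'{match["line"]}:{match["col"]}: {match["msg"]}'
        grouped.setdefault(cell_index, []).append(message)
    return grouped


example_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": "intro"},
        {"cell_type": "code", "source": "a = 1"},
        {"cell_type": "markdown", "source": "more text"},
        {"cell_type": "code", "source": "x = {'a':1}"},
    ]
}
reports = ["nb.ipynb:code_cell#2:1:10: E231 missing whitespace after ':'"]
grouped = group_reports_by_cell_index(reports, example_notebook)
print(grouped)
```

If the report carried the total cell count instead, the `code_cell_positions` lookup would not be needed, which is the simplification being discussed.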

Thanks:-)

@psychemedia fancy to have a look at the feature in #133? 😄

Nice... I tried with flake8_nb --notebook-cell-format "{nb_path}#In[{code_cell_count}]" test1.ipynb and it seemed to work well, e.g. as per the --help docs:

  --notebook-cell-format notebook_cell_format
                        Template string used to format the filename and cell
                        part of error report. Possible variables which will be
                        replaces 'nb_path', 'exec_count','code_cell_count' and
                        'total_cell_count'. (Default:
                        {nb_path}#In[{exec_count}])

The 0.3.0 release has the new feature 🎉