jupyter / nbformat

Reference implementation of the Jupyter Notebook format

Home Page:http://nbformat.readthedocs.io/

Fragment Identification Syntax for Jupyter (linking to cells in document)

krassowski opened this issue · comments

It is currently possible to link directly to a specific Markdown heading in a notebook, or to any HTML element with an id attribute, by using a URL fragment identifier, which automatically scrolls to that heading or element (e.g. via a destination anchor).

There is currently no way to scroll to other elements of the notebook. jupyter/nbconvert#1862 and executablebooks/jupyter-book#1812 proposed to allow linking to specific cells in the Jupyter notebooks. Elements which we may want to link to are:

  • cells (and ranges of cells)
  • outputs (a cell can have more than one output)
  • sections inside of outputs (imagine a large table which is lazy-loaded/paginated)
  • fragments of the editor (potentially useful for applications in education)

Fragment Identification Syntax is a way of defining how to recognise which element is referred to in the Fragment Identifiers. Adopting one for Jupyter notebooks will:

  • harmonise how frontends and converters (e.g. nbconvert) transform headings to fragments (there are many choices in escaping problematic characters and in dealing with duplicated headings, see notebook#77)
  • allow referring to arbitrary parts of the notebook in a future-proof way (so that we can add additional targets later thanks to scoping)
  • disambiguate elements bearing the same name, for example a cell with ID "example" and a heading "example" (human-readable Jupyter cell IDs are permitted and are proposed as default in the future)

Formats with Fragment Identification Syntax include:

  • text/csv (RFC 7111), which allows linking to a specified:
    • row: a.csv#row=1, a range of rows a.csv#row=5-7, or, using a wildcard for the last row, a.csv#row=5-*,
    • column: a.csv#col=2 (and ranges, as above), or
    • cell: a.csv#cell=1,2 (and ranges, as above),
  • text/plain (RFC 5147), which includes an optional integrity check in case the document has changed (MD5 or length-based) and allows linking to a specified:
    • line: a.txt#line=1, or a range of lines a.txt#line=10,20,
    • character: a.txt#char=100 (or a range, as above)
  • application/yaml (RFC draft):
    • alias nodes: a.txt#*alias

For more examples see Wikipedia: URI fragment.
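
To make the idea of a fragment identification syntax concrete, here is a minimal sketch of a parser for the single-range row= form quoted above. The function name parse_row_fragment is hypothetical, and the sketch deliberately ignores the rest of RFC 7111 (column and cell selections, multiple selections).

```python
import re

# Accepts "row=N", "row=N-M" and "row=N-*" only; this is a simplification,
# not a full RFC 7111 implementation.
_ROW = re.compile(r"^row=(\d+)(?:-(\d+|\*))?$")


def parse_row_fragment(fragment):
    """Return (first, last) for a row= fragment; last is None for the '*' wildcard."""
    match = _ROW.match(fragment.lstrip("#"))
    if match is None:
        raise ValueError(f"not a supported row fragment: {fragment!r}")
    first = int(match.group(1))
    end = match.group(2)
    if end is None:
        return first, first  # a single row
    if end == "*":
        return first, None   # open-ended: up to the last row
    return first, int(end)
```

A notebook-level syntax could follow the same key=value shape, which is what keeps it extensible to new targets later.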

The proposed way forward is to:

  • keep current behaviour for headings (auto-generated IDs point to the heading/element without any special prefix by default)
  • define syntax for cell fragment as the fundamental unit of the notebook (e.g. cell=my-cell-id or cell-id=my-cell-id)
  • create a PR with reference implementation in JupyterLab (and Notebook v7), nbconvert and jupyter-book

This proposal results in a very limited backward incompatibility: a fragment for a heading that starts with the cell= prefix would now be resolved to a cell instead of that heading. We could support a heading= target to allow disambiguation.
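
The resolution order described above could be sketched roughly as follows. This is only an illustration: resolve_fragment is a hypothetical helper, the notebook is assumed to be the parsed .ipynb JSON as a plain dict, and the cell-id= spelling is assumed (the proposal leaves cell= versus cell-id= open).

```python
from urllib.parse import unquote


def resolve_fragment(notebook, fragment):
    """Map a URL fragment to ("cell", index) or ("heading", id), or None if not found.

    Assumed prefixes (not part of any released spec): "cell-id=" and "heading=".
    """
    fragment = unquote(fragment.lstrip("#"))
    key, sep, value = fragment.partition("=")
    if sep and key == "cell-id":
        for index, cell in enumerate(notebook.get("cells", [])):
            if cell.get("id") == value:
                return ("cell", index)
        return None  # prefix recognised, but no such cell
    if sep and key == "heading":
        # Explicit scope: lets a heading and a cell share the same name.
        return ("heading", value)
    # No recognised prefix: keep the current behaviour for headings/elements.
    return ("heading", fragment)
```

With this ordering, #example still scrolls to the heading, #cell-id=example scrolls to the cell, and #heading=cell-id=example reaches a heading whose ID happens to start with the reserved prefix.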

Questions:

  1. do we want a JEP for this proposal?
  2. should we have multiple ways of referring to a cell? Cell IDs are an obvious way forward, but we could also support syntax like nth-cell=3 or nth-code-cell=2?
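
For the second question, a positional lookup might look like the sketch below. The function name nth_cell_index is hypothetical, and counting from 1 is an assumption; the issue does not say whether nth-cell=3 is 0-based or 1-based.

```python
def nth_cell_index(notebook, fragment):
    """Return the index into notebook["cells"] for nth-cell=N or nth-code-cell=N.

    Assumes N counts from 1. Returns None when N is out of range.
    """
    key, _, value = fragment.lstrip("#").partition("=")
    cells = notebook.get("cells", [])
    if key == "nth-cell":
        candidates = list(range(len(cells)))
    elif key == "nth-code-cell":
        candidates = [i for i, cell in enumerate(cells) if cell.get("cell_type") == "code"]
    else:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    n = int(value)
    if not 1 <= n <= len(candidates):
        return None
    return candidates[n - 1]
```

Unlike cell IDs, such positional references change meaning whenever cells are inserted or reordered, which is worth weighing before supporting them.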

Of note, the equals sign is allowed in identifiers in HTML 5 but not in HTML 4; we could consider using a colon instead (cell:my-cell), but using an equals sign seems in line with other formats. Spaces are always forbidden in identifiers.

To expand on things folk have wanted to point at in a client-agnostic way:

  • a specific position within an embedded CSV
  • a section of source code
  • a line of a log file at a date
  • a particular x,y,w,h in an image
  • a location on an embedded GeoJSON map output

Of particular note here is choosing something that can be made to work with the web annotation data model.

If these things are nbformat-first, it will be more client-independent (once implemented) rather than specifying a concrete DOM model (though it would be much more possible to use URLs rather than any specific frontend thing).

This is definitely a JEP-level concern, but could certainly be demonstrated first in a Lab4/Notebook7/nbconvert compatible extension before going for something in core... previous efforts have foundered on trying to integrate too deeply and do too much.

Maybe the position inside of output is out of scope for the fragment syntax specification; instead we could:

  • use a fragment to point to the output where the embedded CSV/image/etc. is attached: notebook.ipynb#nth-output-of-cell=my-cell,2 (select the second output of the cell with ID my-cell)
  • use a query string to specify the target within that output: notebook.ipynb?output-fragment="row=100"#nth-output-of-cell=my-cell,2 (scroll to row 100 of the 2nd output of the cell with ID my-cell in notebook.ipynb), using the existing fragment syntax for the given MIME type

This way we avoid re-inventing the syntax for specific data types. For source code we can use text/plain (char= and line=); for images this is handled by Media Fragments URI (e.g. #xywh=160,120,320,240). I don't know if there is a standard GeoJSON fragment syntax.
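
A rough sketch of how a client could take such a link apart with the standard library is shown below. The names nth-output-of-cell and output-fragment come from the example above and are not an agreed syntax; parse_output_link is a hypothetical helper, and stripping the surrounding double quotes is an assumption about how the value would be written.

```python
from urllib.parse import parse_qs, urlsplit


def parse_output_link(url):
    """Split a link into the cell ID, the output number and the inner MIME-specific fragment."""
    parts = urlsplit(url)
    key, _, value = parts.fragment.partition("=")
    if key != "nth-output-of-cell":
        raise ValueError(f"unsupported fragment: {parts.fragment!r}")
    # The last comma separates the cell ID from the output number.
    cell_id, _, number = value.rpartition(",")
    inner = parse_qs(parts.query).get("output-fragment", [None])[0]
    if inner is not None:
        inner = inner.strip('"')  # assumption: quotes are only decoration
    return {"cell_id": cell_id, "output": int(number), "output_fragment": inner}
```

The inner fragment (row=100 here) would then be handed to whatever renderer handles the MIME type of that output, so the notebook-level syntax never needs to understand it.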

If these things are nbformat-first, it will be more client-independent (once implemented) rather than specifying a concrete DOM model (though it would be much more possible to use URLs rather than any specific frontend thing).

Some thoughts here:

  • in JupyterLab/Notebook we can parse the fragment and find the relevant element dynamically (no DOM modifications would be introduced)
  • when the notebook gets exported via nbconvert/jupyter-book:
    • for HTML and Markdown files, the fragment syntax does not need changes (the exporters would need to add anchors by injecting id attributes into HTML tags; this could be opt-in for Markdown and default for HTML)
    • for PDF files, users would need to adjust the syntax, at the very minimum by prepending #nameddest= which is the PDF way of referring to sections (RFC 3778) (and the exporters would need to add sections too).

out of scope

Much like when cell IDs became a thing (but not output IDs), I feel like this would be a significant change to handle for implementations, and doing it piecemeal wouldn't be as much fun.

a standard GeoJSON fragment syntax.

I'd wager that's because there's no JSON fragment syntax. The closest is JSON Pointer, but it's a hair underpowered, as it lacks the ability to do attribute lookups. This means a cell would have to be #/cells/0/outputs/1/#sub-selector rather than something like #/cells/[id="abc1234"]/outputs/[some=thing]/#sub-selector.
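
The limitation can be illustrated with a minimal JSON Pointer (RFC 6901) resolver next to an ID-based lookup. Both function names are hypothetical, and the resolver is a simplified sketch (for example, it does not handle the "-" array token).

```python
def resolve_pointer(document, pointer):
    """Resolve a JSON Pointer such as "/cells/0/outputs/1" against parsed JSON."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    node = document
    for token in pointer[1:].split("/"):
        # Unescape in the order required by RFC 6901: "~1" first, then "~0".
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def pointer_for_cell_output(notebook, cell_id, output_index):
    """Translate an ID-based reference into the positional pointer valid right now."""
    for index, cell in enumerate(notebook["cells"]):
        if cell.get("id") == cell_id:
            return f"/cells/{index}/outputs/{output_index}"
    return None
```

The pointer only encodes positions, so inserting a cell above the target silently changes what /cells/0 refers to, whereas the ID-based lookup keeps working.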

Inventing a new syntax would be very frustrating. But at the end of it, if a notebook-derived document can't refer back to the logical location within a (potentially nbconvert-mangled) document, I don't know if we've moved the state-of-the-art forward.

If we did pick one of the many non-standard JSON reference mechanisms (jq, JMESPath, etc.), it would be important to pick something with a broad implementation profile. Really the most powerful thing is XPath, but XML gives everyone the willies.

I thought that GeoJSON users would be interested in pointing to a specific position on the map, not to a node in the JSON? That would be something like lat=a,long=b, right? Sorry if it sounds silly, I don't work with geospatial data.

Summarising what I see so far:

  • there is a commonly used concept for fragment identifiers, and many file types have it defined (csv, txt, pdf, media, etc)
  • it is easy to adopt fragment identifiers for cell location and maybe output location in notebooks:
    • which would be easy to extend in the future
    • which would work in both frontends and HTML exports the same
  • there are file types for which it is not defined (like JSON), but this does not prevent us from having one for ipynb
  • there is no agreed convention in the fragment identifiers world on how to handle deep references/nested targeting (like: "a row in table" in "output of a cell in a notebook")
    • which does not appear to be a blocker, because if we reserve the cell= prefix, we can still decide to add an xpath= prefix later on (if for some reason we decide to go the XML way)
    • or maybe we don't need to solve nested targeting with fragment syntax at all and we could use an entirely different mechanism for it

For what it is worth, I implemented fragment IDs for all of the editors in CoCalc recently. The format I used for our Jupyter notebooks is

#id=some-cell-id

That's it. Thanks for thinking through a format for more refined information! I'll attempt to follow what you do for any extensions, rather than inventing something new (except I'm sticking with #id rather than #cell-id).

executablebooks/meta#102

@westurner thank you for pinging interested parties and for the very useful links. Do you think that advancing with cell-id= takes us closer to the larger goal?

One slightly-technical question to all: if we go forward with cell-id=, should the base nbconvert template produce id="cell-id=some-unique-id" (as in current draft of jupyter/nbconvert#1897), or should it include id="some-unique-id" and a blob of JavaScript which would manually scroll to the relevant fragment?
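
The first option can be pictured with the sketch below. It is not nbconvert's template code; cell_wrapper is a hypothetical helper that only shows what the two candidate id attributes would look like in the exported HTML.

```python
from html import escape


def cell_wrapper(cell_id, inner_html, prefixed=True):
    """Wrap rendered cell HTML in a div whose id is the fragment target.

    prefixed=True  -> id="cell-id=<id>": the browser scrolls natively for #cell-id=<id>.
    prefixed=False -> id="<id>": a script would have to translate #cell-id=<id> itself.
    """
    anchor = f"cell-id={cell_id}" if prefixed else cell_id
    return f'<div id="{escape(anchor, quote=True)}">{inner_html}</div>'
```

The prefixed form needs no JavaScript in the exported page, at the cost of an id containing an equals sign.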

@gwincr11 this might be of your interest too.

@krassowski thanks for bringing this to my attention. I have been looking at a similar issue, mapping content to a Python notebook. A cell ID is useful, but I am curious whether a more granular approach would be helpful. One thing that is super helpful in the GitHub UI is linking directly to a block of code, which is granular down to the line being discussed. I would love a tool that made it easy to attach content to the notebook structure across platforms without that content needing to live in the notebook JSON structure.

It may be interesting to consider something like a JavaScript source map; this would potentially give very granular access, down to the line or even the character level. There are a number of JSON mapping tools in the Python ecosystem; here is a Stack Overflow question discussing this very idea: https://stackoverflow.com/questions/55684780/get-line-number-while-parsing-a-json-file

My thinking around this is that it may be nice to tie content to the rich text view of the notebook and make it portable with the notebook, without it needing to be part of the structure, since a consumer can map features into the notebook's structure at render time. For example, you could create a commenting system that worked with any third-party tooling (GitHub, GitLab, etc.), and since the comments could map to the underlying JSON structure, you could bring PR review comments into any plugin you wanted.

Web annotations may also be a good way to accomplish this... I am wondering how portable it may be to other plugins? I do like that it is an open standard though 😄

I have a platform that uses URLs to embed code: https://docs.metapage.io/docs

For a lot of the components (that are simply URLs/websites), I use the hash part of the URL, but re-use the query param format:

http://<origin>/<path>?key=val # <hashfragement> ?hashkey2=val

This way the hash parts can be very big without sending all that code to the server. The important bit is that it contains a hash fragment and hash query params. Whatever schema is decided here, it would be great if I could keep my hash key=val pattern within the schema, since I would like to add browser-only jupyter notebooks, as a pure URL defined notebook is a really useful pattern.
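
As a rough illustration of that pattern, the sketch below separates the server-side query from the hash fragment and its own key=val parameters. The function name parse_hash_params and the example URL are hypothetical, and treating the first "?" inside the fragment as the separator is an assumption based on the pattern shown above.

```python
from urllib.parse import parse_qs, urlsplit


def parse_hash_params(url):
    """Return (query params, hash fragment, hash params) for a URL.

    Everything after '#' stays in the browser; within it, the part before the
    first '?' is treated as the fragment and the rest as key=val parameters.
    """
    parts = urlsplit(url)
    hash_fragment, _, hash_query = parts.fragment.partition("?")
    return parse_qs(parts.query), hash_fragment, parse_qs(hash_query)
```

Under this reading, a cell-id= fragment and hash parameters could coexist, for example http://example.org/nb?key=val#cell-id=abc?hashkey2=val.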