nteract / papermill

📚 Parameterize, execute, and analyze notebooks

Home Page: http://papermill.readthedocs.io/en/latest/

Nbformat/nbformat_minor not well extracted with HTTP handler

LetMeR00t opened this issue

commented

🐛 Bug

I'm building a connector between Jupyter (using papermill) and another product named "Cortex" from the StrangeBee project.
During development I ran into an issue. I'm testing the HTTP handler by executing a notebook located on a JupyterHub instance, which has a "demo" user for whom a "cortex_job" server is configured.

import papermill as pm

pm.execute_notebook(
    "http://192.168.1.117:8000/user/demo/cortex_job/api/contents/notebook1.ipynb?token=SECRET",
    "http://192.168.1.117:8000/user/demo/cortex_job/api/contents/Folder1/notebook2.ipynb?token=SECRET",
    parameters = dict(var1 = "toto")
)

The notebook is retrieved successfully, but loading it then fails with this error:

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In[1], line 3
      1 import papermill as pm
----> 3 pm.execute_notebook(
      4     "http://192.168.1.117:8000/user/demo/cortex_job/api/contents/notebook1.ipynb?token=SECRET",
      5     "http://192.168.1.117:8000/user/demo/cortex_job/api/contents/Folder1/notebook2.ipynb?token=SECRET",
      6     parameters = dict(var1 = "toto")
      7 )

File /usr/local/lib/python3.10/dist-packages/papermill/execute.py:89, in execute_notebook(input_path, output_path, parameters, engine_name, request_save_on_cell_execute, prepare_only, kernel_name, language, progress_bar, log_output, stdout_file, stderr_file, start_timeout, report_mode, cwd, **engine_kwargs)
     86 if cwd is not None:
     87     logger.info("Working directory: {}".format(get_pretty_path(cwd)))
---> 89 nb = load_notebook_node(input_path)
     91 # Parameterize the Notebook.
     92 if parameters:

File /usr/local/lib/python3.10/dist-packages/papermill/iorw.py:512, in load_notebook_node(notebook_path)
    502 def load_notebook_node(notebook_path):
    503     """Returns a notebook object with papermill metadata loaded from the specified path.
    504 
    505     Args:
   (...)
    510 
    511     """
--> 512     nb = nbformat.reads(papermill_io.read(notebook_path), as_version=4)
    513     nb_upgraded = nbformat.v4.upgrade(nb)
    514     if nb_upgraded is not None:

File /usr/local/lib/python3.10/dist-packages/nbformat/__init__.py:91, in reads(s, as_version, capture_validation_error, **kwargs)
     89 nb = reader.reads(s, **kwargs)
     90 if as_version is not NO_CONVERT:
---> 91     nb = convert(nb, as_version)
     92 try:
     93     validate(nb)

File /usr/local/lib/python3.10/dist-packages/nbformat/converter.py:62, in convert(nb, to_version)
     60 except AttributeError as e:
     61     msg = f"Notebook could not be converted from version {version} to version {step_version} because it's missing a key: {e}"
---> 62     raise ValidationError(msg) from None
     64 # Recursively convert until target version is reached.
     65 return convert(converted, to_version)

ValidationError: Notebook could not be converted from version 1 to version 2 because it's missing a key: cells

Looking at the code, the HTTP handler simply returns the entire response body:

[image: screenshot of the HTTP handler code]

For this notebook, that response body is:

{
   "name":"notebook1.ipynb",
   "path":"notebook1.ipynb",
   "last_modified":"2023-07-12T11:43:37.265003Z",
   "created":"2023-07-12T11:43:37.265003Z",
   "content":{
      "cells":[
         {
            "cell_type":"markdown",
            "id":"e0882b67",
            "metadata":{
               
            },
            "source":"# My title\n\n## My subtitle\n\nHello world!"
         },
         {
            "cell_type":"code",
            "execution_count":1,
            "id":"e92789a6",
            "metadata":{
               "tags":[
                  "parameters"
               ],
               "trusted":true
            },
            "outputs":[
               
            ],
            "source":"var1 = 3\nvar2 = 5"
         },
         {
            "cell_type":"code",
            "execution_count":2,
            "id":"d49d5a2b",
            "metadata":{
               "trusted":true
            },
            "outputs":[
               {
                  "name":"stdout",
                  "output_type":"stream",
                  "text":"var1 is 3, var2 is 5\n"
               }
            ],
            "source":"print(\"var1 is {0}, var2 is {1}\".format(var1,var2))"
         }
      ],
      "metadata":{
         "celltoolbar":"Tags",
         "kernelspec":{
            "display_name":"Python 3 (ipykernel)",
            "language":"python",
            "name":"python3"
         },
         "language_info":{
            "codemirror_mode":{
               "name":"ipython",
               "version":3
            },
            "file_extension":".py",
            "mimetype":"text/x-python",
            "name":"python",
            "nbconvert_exporter":"python",
            "pygments_lexer":"ipython3",
            "version":"3.10.6"
         }
      },
      "nbformat":4,
      "nbformat_minor":5
   },
   "format":"json",
   "mimetype":"None",
   "size":1188,
   "writable":true,
   "type":"notebook"
}

Note that "nbformat" is set to 4 inside the "content" node, yet papermill treats the notebook as version 1 (the default value).

That version comes from here, in the nbformat library, which is what reads the notebook:

[image: screenshot of the nbformat code that reads the version]

As you can see, the version is read from the root-level "nbformat" key instead of "content.nbformat", which causes the issue.
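To illustrate the mismatch, here is a minimal sketch of a "root key or default" version lookup. The function `guess_version` is a hypothetical stand-in, not nbformat's actual code, and the default minor version of 0 is an assumption; the response dictionary is a trimmed copy of the one shown above.

```python
# Hypothetical stand-in for the version lookup described above.
# It only looks at the root of the dictionary it is given.
def guess_version(nb_dict):
    major = nb_dict.get("nbformat", 1)        # falls back to 1 when the key is missing
    minor = nb_dict.get("nbformat_minor", 0)  # assumed default for this sketch
    return major, minor


# Trimmed-down shape of the contents API response shown earlier.
api_response = {
    "name": "notebook1.ipynb",
    "type": "notebook",
    "content": {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5},
}

# The envelope has no root-level "nbformat", so the default is used.
envelope_version = guess_version(api_response)
# The real notebook lives one level down, under "content".
notebook_version = guess_version(api_response["content"])

print(envelope_version, notebook_version)
```

With the whole response passed in, the lookup never sees the nested "nbformat": 4, which matches the "version 1 to version 2" conversion attempt in the traceback.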

Do you know whether this is a bug on your side or in the nbformat library? I tested with the LocalHandler and it works fine, since the content it reads is:

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e0882b67",
   "metadata": {},
   "source": [
    "# My title\n",
    "\n",
    "## My subtitle\n",
    "\n",
    "Hello world!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e92789a6",
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "var1 = 3\n",
    "var2 = 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d49d5a2b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "var1 is 3, var2 is 5\n"
     ]
    }
   ],
   "source": [
    "print(\"var1 is {0}, var2 is {1}\".format(var1,var2))"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

A possible solution would be for the HTTP handler to parse the JSON response and return only its "content" node.
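As a rough sketch of that idea, the helper below unwraps the response only when it looks like a contents API model and otherwise returns the body untouched, so a URL serving a plain .ipynb file would still work. The name `unwrap_contents_model` and the "type"/"content" check are assumptions made for illustration, not part of papermill.

```python
import json


def unwrap_contents_model(body):
    """Return notebook JSON text, unwrapping a contents API envelope if present.

    Hypothetical helper: `body` is the raw text of the HTTP response.
    """
    data = json.loads(body)
    if isinstance(data, dict):
        content = data.get("content")
        # Heuristic: a contents API model has type "notebook" and a dict "content".
        if data.get("type") == "notebook" and isinstance(content, dict):
            return json.dumps(content)
    # Already a plain notebook document: hand it back unchanged.
    return body


# A plain notebook document and the same notebook wrapped in an envelope.
raw = json.dumps({"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5})
wrapped = json.dumps({"type": "notebook", "format": "json", "content": json.loads(raw)})

unwrapped = json.loads(unwrap_contents_model(wrapped))
print(unwrapped["nbformat"], unwrapped["nbformat_minor"])
```

Checking the shape before unwrapping is a design choice: it keeps the handler usable for servers that return the notebook file directly rather than a contents API model.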

Thank you

commented

This fix works on my side:

papermill/iorw.py

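# Imports used by this handler (iorw.py lives inside the papermill package).
import json

import requests

from .exceptions import PapermillException


class HttpHandler(object):
    @classmethod
    def read(cls, path):
        # The Jupyter contents API wraps the notebook in a JSON model;
        # return only the "content" node, serialized back to a string.
        response = requests.get(path, headers={'Accept': 'application/json'})
        response.raise_for_status()  # fail early on HTTP errors
        return json.dumps(response.json()["content"])

    @classmethod
    def listdir(cls, path):
        raise PapermillException('listdir is not supported by HttpHandler')

    @classmethod
    def write(cls, buf, path):
        # Wrap the notebook back into the model expected by the contents API.
        payload = {"type": "notebook", "format": "json", "path": path}
        payload["content"] = json.loads(buf)
        result = requests.put(path, json=payload)
        result.raise_for_status()

    @classmethod
    def pretty_path(cls, path):
        return path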