Relancer is an automatic technique that restores the executability of broken Jupyter Notebooks by upgrading deprecated APIs.
- Linux OS
- The system root directory `/` must have at least 20 GB of free space
- Docker (tested with version 20.10.7; any recent version should also work). See the official Docker installation guide: https://docs.docker.com/get-docker/
Build the Docker image:
docker build -f Dockerfile -t relancer .
Time estimation: ~10 mins on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.
If the build succeeds, the end of the command-line output should look like this:
Successfully built 837dd58b4bc6
Successfully tagged relancer:latest
1. Run demo cases (~15 mins)
2. Run Relancer on all the subjects (~6 hrs)
3. Run the baselines on all the subjects (~78 hrs)
4. Reproduce charts in the paper without running the full experiment
We first demonstrate Relancer's ability to fix deprecated APIs on 10 Jupyter Notebooks from Kaggle, a subset of the evaluation subjects in the paper.
First, launch a Docker container named `relancer-demo`:
docker run --name relancer-demo --rm -v `pwd`:/Scratch -i -t relancer
Then, set up the environment:
conda activate relancer
python setup_nltk.py
These commands activate the Conda environment and set up NLTK resources.
If successful, there will be a `(relancer)` prefix on your command-line prompt. For example,
(relancer) root@1758fb078432:/home/relancer#
And there will be command line output like this:
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
Finally, run the demo script:
./run_demo.sh | tee relancer-exp/batch-log.txt
Time estimation: ~15 mins on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.
This script executes Relancer on the 10 Jupyter Notebooks and stores the results in the following directory structure:
├──📂 relancer
│ ├──📂 relancer-exp
│ │ ├──📂 patches
│ │ ├──📂 fixed_notebooks
│ │ ├──📂 exec-logs
│ │ ├──📜 batch-log.txt
- `relancer`: the current working directory
- `relancer-exp`: the parent directory that stores all the experiment results
- `fixed_notebooks`: stores the fixed notebooks generated by Relancer
- `patches`: stores the patches generated by Relancer
- `exec-logs`: stores the log files generated by Relancer
- `batch-log.txt`: stores the accumulated logs produced in the current run, identical to the on-screen output
During the run, for each notebook, Relancer generates 4 files: a fixed notebook, a patch, an execution log file, and a repair log file.
They all follow the same directory structure `TOOL_NAME/DATASET_NAME/NOTEBOOK_NAME.*`.
For example, in this demo, for notebook "Auto Imports - Beginner Level Analysis" of dataset "Automobile Dataset" from Kaggle, Relancer generates the following files:
├──📂 relancer
│ ├──📂 relancer-exp
│ │ ├──📂 fixed_notebooks
│ │ │ ├──📂 relancer
│ │ │ │ ├──📂 toramky_automobile-dataset
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.py
│ │ ├──📂 patches
│ │ │ ├──📂 relancer
│ │ │ │ ├──📂 toramky_automobile-dataset
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.patch
│ │ ├──📂 exec-logs
│ │ │ ├──📂 relancer
│ │ │ │ ├──📂 toramky_automobile-dataset
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.exec.log
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.repair.log
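The `TOOL_NAME/DATASET_NAME/NOTEBOOK_NAME.*` convention above can be sketched as a small path helper; the function `output_paths` and its argument names are illustrative, not part of Relancer:

```python
# Sketch: derive the four per-notebook output paths from the
# TOOL_NAME/DATASET_NAME/NOTEBOOK_NAME.* convention described above.
from pathlib import Path

def output_paths(root, tool, dataset, notebook):
    rel = Path(tool) / dataset
    base = Path(root)
    return {
        "fixed": base / "fixed_notebooks" / rel / f"{notebook}.py",
        "patch": base / "patches" / rel / f"{notebook}.patch",
        "exec_log": base / "exec-logs" / rel / f"{notebook}.exec.log",
        "repair_log": base / "exec-logs" / rel / f"{notebook}.repair.log",
    }

paths = output_paths("relancer-exp", "relancer",
                     "toramky_automobile-dataset",
                     "auto-imports-beginner-level-analysis")
# prints relancer-exp/patches/relancer/toramky_automobile-dataset/auto-imports-beginner-level-analysis.patch
print(paths["patch"])
```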
The patch `auto-imports-beginner-level-analysis.patch` should look like this:
--- /home/relancer/relancer-exp/original_notebooks/toramky_automobile-dataset/auto-imports-beginner-level-analysis.py 2021-07-08 05:25:33.406257335 -0500
+++ /home/relancer/relancer-exp/fixed_notebooks/relancer/toramky_automobile-dataset/auto-imports-beginner-level-analysis.py 2021-07-08 05:25:42.890223574 -0500
@@ -1,3 +1,4 @@
+from sklearn.impute import SimpleImputer
#!/usr/bin/env python
# coding: utf-8
@@ -49,7 +50,7 @@
# Import Linear Regression machine learning library
from sklearn.linear_model import LinearRegression
-from sklearn.preprocessing import Imputer
+pass
from sklearn.preprocessing import Normalizer
@@ -158,7 +159,7 @@
# Imputting Missing value
-imp = Imputer(missing_values='NaN', strategy='mean' )
+imp = SimpleImputer(missing_values=np.nan, strategy='mean' )
df_1[['normalized-losses','bore','stroke','horsepower','peak-rpm','price']] = imp.fit_transform(df_1[['normalized-losses','bore','stroke','horsepower','peak-rpm','price']])
df_1.head()
#########################################################################################################################
In this case, Relancer performs two fixes:
- Update the fully qualified name of the API from `sklearn.preprocessing.Imputer` to `sklearn.impute.SimpleImputer`
- Update the value of the parameter `missing_values` from `'NaN'` to `np.nan`
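A minimal sketch of what the migrated code does (assuming scikit-learn >= 0.22, where `sklearn.preprocessing.Imputer` no longer exists; the sample matrix is synthetic):

```python
import numpy as np
# Old (removed): from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Old (removed): Imputer(missing_values='NaN', strategy='mean')
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_fixed = imp.fit_transform(X)  # each NaN becomes its column mean: 4.0 and 2.5
print(X_fixed)
```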
To get the number of fixed notebooks during the run, use the command
grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log.txt | wc -l
To get the total number of executed notebooks, use the command
grep -E '\-\-\- Running ' relancer-exp/batch-log.txt | wc -l
If the two numbers are equal, it means that Relancer successfully fixed all the notebooks during the execution.
For this demo, we expect that both commands output the number 10.
One can also manually walk through the execution log files to check the results. The command
find relancer-exp/exec-logs/relancer/ -name "*.exec.log"
will print the paths of all execution log files. Each file stores the final execution log of a notebook fixed by Relancer, so none of these logs should contain errors.
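The manual check can also be approximated programmatically. The failure markers below are my assumption about what an error leaves behind in a log (typically a Python traceback), not something the artifact defines:

```python
# Sketch: flag execution logs that contain signs of a failed run.
# The marker strings are assumptions; adjust them to what your logs show.
def log_is_clean(text: str) -> bool:
    markers = ("Traceback (most recent call last)", "[ERROR]")
    return not any(m in text for m in markers)

print(log_is_clean("loaded 205 rows\nmodel trained"))             # True
print(log_is_clean("Traceback (most recent call last):\n  ..."))  # False
```

On a real run, one would apply `log_is_clean` to the contents of each `*.exec.log` file found by the `find` command above.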
Run Relancer on all the Jupyter Notebooks used in our experiment.
First, launch a Docker container named `relancer-full`:
docker run --name relancer-full --rm -v `pwd`:/Scratch -i -t relancer
Then, set up the environment:
conda activate relancer
python setup_nltk.py
If successful, there will be a `(relancer)` prefix on your command-line prompt. For example,
(relancer) root@1758fb078432:/home/relancer#
And there will be command line output like this:
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
Finally, run the full script:
./run_full.sh | tee relancer-exp/batch-log.txt
Time estimation: ~6 hours on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.
This script executes Relancer on all the Jupyter Notebooks and stores each notebook's result in the same directory structure as the demo.
At the end of the run, all the notebooks should be fixed by Relancer. The output directories are as follows:
├──📂 relancer
│ ├──📂 relancer-exp
│ │ ├──📂 fixed_notebooks
│ │ │ ├──📂 relancer
│ │ │ │ ├──📂 neuromusic_avocado-prices
│ │ │ │ │ ├──📜 explore-avocados-from-all-sides.py
│ │ │ │ ├──📂 toramky_automobile-dataset
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.py
│ │ │ │ │ ├──📜 ...
│ │ │ │ ├──📂 ...
│ │ ├──📂 patches
│ │ │ ├──📂 relancer
│ │ │ │ ├──📂 neuromusic_avocado-prices
│ │ │ │ │ ├──📜 explore-avocados-from-all-sides.patch
│ │ │ │ ├──📂 toramky_automobile-dataset
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.patch
│ │ │ │ │ ├──📜 ...
│ │ │ │ ├──📂 ...
│ │ ├──📂 exec-logs
│ │ │ ├──📂 relancer
│ │ │ │ ├──📂 neuromusic_avocado-prices
│ │ │ │ │ ├──📜 explore-avocados-from-all-sides.exec.log
│ │ │ │ │ ├──📜 explore-avocados-from-all-sides.repair.log
│ │ │ │ ├──📂 toramky_automobile-dataset
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.exec.log
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.repair.log
│ │ │ │ │ ├──📜 ...
│ │ │ │ ├──📂 ...
To get the number of fixed notebooks during the run, use the command
grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log.txt | wc -l
To get the total number of executed notebooks, use the command
grep -E '\-\-\- Running ' relancer-exp/batch-log.txt | wc -l
If the two numbers are equal, it means that Relancer successfully fixed all the notebooks during the execution.
For the full run, we expect that both commands output the number 142.
One can also manually walk through the execution log files to check the results. The command
find relancer-exp/exec-logs/relancer/ -name "*.exec.log"
will print the paths of all execution log files. Each file stores the final execution log of a notebook fixed by Relancer, so none of these logs should contain errors.
Note: Relancer's performance can be affected by the configuration of the machine. We have tested `run_full.sh` multiple times on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM, on which Relancer always fixes all 142 notebooks (as in Figure 5a) within the preset 30-minute time limit (the same time limit as in the paper). However, running on a different machine may produce different results. If the machine specs are significantly lower than ours, Relancer may time out on several subjects, but most subjects should remain unaffected.
Run all the baselines (RQ2 and RQ3) on all the Jupyter Notebooks used in our experiment.
First, launch a Docker container named `relancer-baselines`:
docker run --name relancer-baselines --rm -v `pwd`:/Scratch -i -t relancer
Then, set up the environment:
conda activate relancer
python setup_nltk.py
If successful, there will be a `(relancer)` prefix on your command-line prompt. For example,
(relancer) root@1758fb078432:/home/relancer#
And there will be command line output like this:
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
Finally, run the baselines script:
./run_baselines.sh | tee relancer-exp/batch-log.txt
Time estimation: ~78 hours on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.
This script executes all the baselines on all the Jupyter Notebooks and stores their results in the same directory structure as in the previous section (`run_full.sh`), except that the name `relancer` in the `TOOL_NAME` part of each path is replaced by the corresponding baseline's directory name (`github`, `apidoc`, `text`, `random`, or `naive`).
For example, for baseline Relancer_github, the output directories are as follows:
├──📂 relancer
│ ├──📂 relancer-exp
│ │ ├──📂 fixed_notebooks
│ │ │ ├──📂 github
│ │ │ │ ├──📂 neuromusic_avocado-prices
│ │ │ │ │ ├──📜 explore-avocados-from-all-sides.py
│ │ │ │ ├──📂 toramky_automobile-dataset
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.py
│ │ │ │ │ ├──📜 ...
│ │ │ │ ├──📂 ...
│ │ ├──📂 patches
│ │ │ ├──📂 github
│ │ │ │ ├──📂 neuromusic_avocado-prices
│ │ │ │ │ ├──📜 explore-avocados-from-all-sides.patch
│ │ │ │ ├──📂 toramky_automobile-dataset
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.patch
│ │ │ │ │ ├──📜 ...
│ │ │ │ ├──📂 ...
│ │ ├──📂 exec-logs
│ │ │ ├──📂 github
│ │ │ │ ├──📂 neuromusic_avocado-prices
│ │ │ │ │ ├──📜 explore-avocados-from-all-sides.exec.log
│ │ │ │ │ ├──📜 explore-avocados-from-all-sides.repair.log
│ │ │ │ ├──📂 toramky_automobile-dataset
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.exec.log
│ │ │ │ │ ├──📜 auto-imports-beginner-level-analysis.repair.log
│ │ │ │ │ ├──📜 ...
│ │ │ │ ├──📂 ...
The mapping of baseline names to their directory names:
- Relancer_github: `github`
- Relancer_doc: `apidoc`
- Relancer_text: `text`
- Relancer_random: `random`
- Relancer_naive: `naive`
This script also generates 5 batch logs, one for each baseline: `relancer-exp/batch-log-github.txt`, `relancer-exp/batch-log-apidoc.txt`, `relancer-exp/batch-log-text.txt`, `relancer-exp/batch-log-random.txt`, and `relancer-exp/batch-log-naive.txt`.
Commands for getting the number of notebooks fixed by each baseline:
Relancer_github:
grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log-github.txt | wc -l
We expect the number to be 96.
Relancer_doc:
grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log-apidoc.txt | wc -l
We expect the number to be 92.
Relancer_text:
grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log-text.txt | wc -l
We expect the number to be 102.
Relancer_random:
grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log-random.txt | wc -l
Due to the random nature of Relancer_random, the result is not predictable. However, we expect the number to be around 100 if running on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.
Relancer_naive:
grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log-naive.txt | wc -l
Due to the random nature of Relancer_naive, the result is not predictable. However, we expect the number to be around 80 if running on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.
Note: All the baselines' performance can be affected by the configuration of the machine. We have tested `run_baselines.sh` on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM, where the baselines fix the numbers of notebooks shown in Figure 5b and Figure 6 of the paper within the preset 30-minute time limit. However, running on a different machine may produce different results. If the machine specs are significantly lower than ours, the baselines may time out on some subjects. Furthermore, Relancer_random and Relancer_naive use a random choice generator to determine repair actions, which can make their results fluctuate across runs.
As running the baselines can take substantial time, and their randomness can affect the results, we offer a script that reproduces the charts (Figure 5b, Figure 6) in the paper from the log files of our original experiment. Please note that since we do not have access to a GUI in Docker, we also present the results in tabular form.
Launch a Docker container named `relancer-charts` and activate the environment:
docker run --name relancer-charts --rm -v `pwd`:/Scratch -i -t relancer
conda activate relancer
Run the script using the following command:
python data/reproduce-plots.py
This script processes our experimental log files (in `data/exec-logs`) and outputs the data for plotting the charts:
----------------------------
1. Figure 5b data:
RELANCER: 142
RELANCER$_{github}$: 96
RELANCER$_{doc}$: 92
----------------------------
----------------------------
2. Figure 6 data:
RELANCER RELANCER$_{random}$ RELANCER$_{text}$ RELANCER$_{naive}$
Time (min)
1 98 73 57 42
5 124 92 80 60
10 133 102 85 66
15 138 105 91 70
20 140 107 93 71
25 141 107 99 75
30 142 108 102 78
----------------------------
In addition, two chart files will be generated in the current working directory: `fig5b.eps` and `fig6.eps`. They are the same as the figures in the paper.
If you would like to use Relancer in your research, please cite our ASE'21 paper.
@inproceedings{ZhuETAL2021Relancer,
author = {Zhu, Chenguang and Saha, Ripon and Prasad, Mukul and Khurshid, Sarfraz},
booktitle = {Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering},
title = {Restoring the Executability of Jupyter Notebooks by Automatic Upgrade of Deprecated APIs},
year = {2021}
}