
Relancer


Relancer is an automatic technique that restores the executability of broken Jupyter Notebooks by upgrading deprecated APIs.

Prerequisites

  • Linux OS
  • The system root directory / must have at least 20 GB of free space
  • Docker (tested with version 20.10.7; a more recent version should also work)
    Official Docker Installation Guide at https://docs.docker.com/get-docker/
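
Optionally, these prerequisites can be sanity-checked from the host with a short Python sketch (the script name check_prereqs.py is ours, not part of the artifact):

# check_prereqs.py -- optional host-side sanity check (not part of the artifact)
import shutil
import subprocess

free_gb = shutil.disk_usage("/").free / 2**30
print(f"Free space on /: {free_gb:.1f} GB ({'OK' if free_gb >= 20 else 'need at least 20 GB'})")

try:
    result = subprocess.run(["docker", "--version"], capture_output=True, text=True, check=True)
    print(result.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("Docker not found -- see https://docs.docker.com/get-docker/")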

Installation

Build the Docker image:

docker build -f Dockerfile -t relancer .

Estimated time: ~10 mins on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.
If the installation is successful, the end of the command line output should look like this:

Successfully built 837dd58b4bc6
Successfully tagged relancer:latest


Quick Start

1. Run demo cases (~15 mins)
2. Run Relancer on all the subjects (~6 hrs)
3. Run the baselines on all the subjects (~78 hrs)
4. Reproduce charts in the paper without running the full experiment

1. Run demo cases

We first demonstrate Relancer's ability to fix deprecated APIs on 10 Jupyter Notebooks from Kaggle, a subset of the evaluation subjects in the paper.

First, launch a Docker container relancer-demo:

docker run --name relancer-demo --rm -i -t -v `pwd`:/Scratch relancer

Then, set up the environment:

conda activate relancer
python setup_nltk.py

These commands activate the Conda environment and set up NLTK resources.
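
For reference, setup_nltk.py boils down to downloading the two NLTK corpora shown in the output below; a minimal equivalent sketch (the shipped script may do more) is:

import nltk

# Download the corpora Relancer relies on (sketch; roughly what setup_nltk.py does)
nltk.download("stopwords")
nltk.download("wordnet")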

If successful, there will be a (relancer) prefix on your command line prompt. For example,

(relancer) root@1758fb078432:/home/relancer#

And there will be command line output like this:

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.

Finally, run the demo script:

./run_demo.sh | tee relancer-exp/batch-log.txt

Estimated time: ~15 mins on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.

This script executes Relancer on the 10 Jupyter Notebooks and stores the results in the following directory structure:

├──📂 relancer  
   ├──📂 relancer-exp  
      ├──📂 patches  
      ├──📂 fixed_notebooks  
      ├──📂 exec-logs  
      ├──📜 batch-log.txt
  • relancer: the current working directory
  • relancer-exp: the parent directory that stores all the experiment results
  • fixed_notebooks: stores the fixed notebooks generated by Relancer
  • patches: stores the patches generated by Relancer
  • exec-logs: stores the log files generated by Relancer
  • batch-log.txt: stores the accumulated log output of the current run (the same as what is printed on the screen)

During the run, Relancer generates four files for each notebook: a fixed notebook, a patch, an execution log, and a repair log. They all follow the same path pattern TOOL_NAME/DATASET_NAME/NOTEBOOK_NAME.*.

For example, in this demo, for notebook "Auto Imports - Beginner Level Analysis" of dataset "Automobile Dataset" from Kaggle, Relancer generates the following files:

├──📂 relancer  
   ├──📂 relancer-exp  
      ├──📂 fixed_notebooks  
      │   ├──📂 relancer  
      │       ├──📂 toramky_automobile-dataset    
      │           ├──📜 auto-imports-beginner-level-analysis.py  
      ├──📂 patches  
      │   ├──📂 relancer  
      │       ├──📂 toramky_automobile-dataset    
      │           ├──📜 auto-imports-beginner-level-analysis.patch  
      ├──📂 exec-logs  
          ├──📂 relancer  
              ├──📂 toramky_automobile-dataset    
                  ├──📜 auto-imports-beginner-level-analysis.exec.log  
                  ├──📜 auto-imports-beginner-level-analysis.repair.log  

The patch auto-imports-beginner-level-analysis.patch should look like this:

--- /home/relancer/relancer-exp/original_notebooks/toramky_automobile-dataset/auto-imports-beginner-level-analysis.py 2021-07-08 05:25:33.406257335 -0500
+++ /home/relancer/relancer-exp/fixed_notebooks/relancer/toramky_automobile-dataset/auto-imports-beginner-level-analysis.py 2021-07-08 05:25:42.890223574 -0500
@@ -1,3 +1,4 @@
+from sklearn.impute import SimpleImputer
 #!/usr/bin/env python
 # coding: utf-8
 
@@ -49,7 +50,7 @@
 # Import Linear Regression machine learning library
 from sklearn.linear_model import LinearRegression
 
-from sklearn.preprocessing import Imputer
+pass
 
 from sklearn.preprocessing import Normalizer
 
@@ -158,7 +159,7 @@
 
 
 # Imputting Missing value
-imp = Imputer(missing_values='NaN', strategy='mean' )
+imp = SimpleImputer(missing_values=np.nan, strategy='mean' )
 df_1[['normalized-losses','bore','stroke','horsepower','peak-rpm','price']] = imp.fit_transform(df_1[['normalized-losses','bore','stroke','horsepower','peak-rpm','price']])
 df_1.head()
 #########################################################################################################################

In this case, Relancer performs two fixes:

  1. Update the fully qualified API name from sklearn.preprocessing.Imputer to sklearn.impute.SimpleImputer
  2. Update the value of the missing_values parameter from 'NaN' to np.nan
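
For context, sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; the repaired lines correspond to the replacement API, roughly:

import numpy as np
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

# Old, removed API:
#   imp = Imputer(missing_values='NaN', strategy='mean')
# Updated API, as applied by the patch above:
imp = SimpleImputer(missing_values=np.nan, strategy="mean")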

To get the number of fixed notebooks during the run, use the command

grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log.txt | wc -l

To get the total number of executed notebooks, use the command

grep -E '\-\-\- Running ' relancer-exp/batch-log.txt | wc -l

If the two numbers are equal, Relancer successfully fixed all the notebooks in the run.
For this demo, both commands should output 10.

One can also manually walk through the execution log files to check the results. The command

find relancer-exp/exec-logs/relancer/ -name "*.exec.log"

will print the paths of all execution log files. Each file stores the final execution log of a notebook fixed by Relancer; therefore, none of these logs should contain errors.
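
To automate that check, one can use a short sketch like the following (the error markers below are our assumption; adjust them as needed):

import pathlib

# Flag exec logs that contain typical Python error indicators (sketch, not part of the artifact)
markers = ("Traceback (most recent call last)", "Error:", "Exception")  # assumed indicators
for log in pathlib.Path("relancer-exp/exec-logs/relancer").rglob("*.exec.log"):
    text = log.read_text(errors="ignore")
    if any(marker in text for marker in markers):
        print(f"Possible error in {log}")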


2. Run Relancer on all the subjects

Run Relancer on all the Jupyter Notebooks used in our experiment.

First, launch a Docker container relancer-full:

docker run --name relancer-full --rm -i -t -v `pwd`:/Scratch relancer

Then, set up the environment:

conda activate relancer
python setup_nltk.py

If successful, there will be a (relancer) prefix on your command line prompt. For example,

(relancer) root@1758fb078432:/home/relancer#

And there will be command line output like this:

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.

Finally, run the full script:

./run_full.sh | tee relancer-exp/batch-log.txt

Estimated time: ~6 hours on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.

This script executes Relancer on all the Jupyter Notebooks and stores each notebook's result in the same directory structure as the demo.
At the end of the run, all the notebooks should be fixed by Relancer. The output directories are as follows:

├──📂 relancer  
   ├──📂 relancer-exp  
      ├──📂 fixed_notebooks  
      │   ├──📂 relancer  
      │       ├──📂 neuromusic_avocado-prices    
      │       │   ├──📜 explore-avocados-from-all-sides.py  
      │       ├──📂 toramky_automobile-dataset    
      │       │   ├──📜 auto-imports-beginner-level-analysis.py  
      │       │   ├──📜 ...  
      │       ├──📂 ... 
      ├──📂 patches  
      │   ├──📂 relancer  
      │       ├──📂 neuromusic_avocado-prices    
      │       │   ├──📜 explore-avocados-from-all-sides.patch  
      │       ├──📂 toramky_automobile-dataset    
      │       │   ├──📜 auto-imports-beginner-level-analysis.patch  
      │       │   ├──📜 ...  
      │       ├──📂 ... 
      ├──📂 exec-logs  
      │   ├──📂 relancer  
      │       ├──📂 neuromusic_avocado-prices    
      │       │   ├──📜 explore-avocados-from-all-sides.exec.log  
      │       │   ├──📜 explore-avocados-from-all-sides.repair.log        
      │       ├──📂 toramky_automobile-dataset    
      │       │   ├──📜 auto-imports-beginner-level-analysis.exec.log  
      │       │   ├──📜 auto-imports-beginner-level-analysis.repair.log  
      │       │   ├──📜 ...  
      │       ├──📂 ... 

To get the number of fixed notebooks during the run, use the command

grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log.txt | wc -l

To get the total number of executed notebooks, use the command

grep -E '\-\-\- Running ' relancer-exp/batch-log.txt | wc -l

If the two numbers are equal, Relancer successfully fixed all the notebooks in the run.
For the full run, both commands should output 142.

One can also manually walk through the execution log files to check the results. The command

find relancer-exp/exec-logs/relancer/ -name "*.exec.log"

will print the paths of all execution log files. Each file stores the final execution log of a notebook fixed by Relancer; therefore, none of these logs should contain errors (they can be checked the same way as in the demo).

Note: Relancer's performance can be affected by the machine configuration. We have tested run_full.sh multiple times on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM, on which Relancer always fixed all 142 notebooks (as in Figure 5a) within the preset 30-minute time limit (the same limit as in the paper). Running on a different machine may produce different results: if the machine specs are significantly lower than ours, Relancer may time out on several subjects, but most subjects should remain unaffected.


3. Run the baselines on all the subjects (only needed to replicate the baseline results)

Run all the baselines (RQ2 and RQ3) on all the Jupyter Notebooks used in our experiment.

First, launch a Docker container relancer-baselines:

docker run --name relancer-baselines --rm -i -t -v `pwd`:/Scratch relancer

Then, set up the environment:

conda activate relancer
python setup_nltk.py

If successful, there will be a (relancer) prefix on your command line prompt. For example,

(relancer) root@1758fb078432:/home/relancer#

And there will be command line output like this:

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.

Finally, run the baselines script:

./run_baselines.sh | tee relancer-exp/batch-log.txt

Estimated time: ~78 hours on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.

This script executes all the baselines on all the Jupyter Notebooks and stores their results in directory structures similar to those of the previous section (run_full.sh), except that the name relancer in the TOOL_NAME part of the path is replaced by the corresponding baseline's directory name (github/apidoc/text/random/naive).

For example, for baseline Relancer_github, the output directories are as follows:

├──📂 relancer  
   ├──📂 relancer-exp  
      ├──📂 fixed_notebooks  
      │   ├──📂 github  
      │       ├──📂 neuromusic_avocado-prices    
      │       │   ├──📜 explore-avocados-from-all-sides.py  
      │       ├──📂 toramky_automobile-dataset    
      │       │   ├──📜 auto-imports-beginner-level-analysis.py  
      │       │   ├──📜 ...  
      │       ├──📂 ... 
      ├──📂 patches  
      │   ├──📂 github  
      │       ├──📂 neuromusic_avocado-prices    
      │       │   ├──📜 explore-avocados-from-all-sides.patch  
      │       ├──📂 toramky_automobile-dataset    
      │       │   ├──📜 auto-imports-beginner-level-analysis.patch  
      │       │   ├──📜 ...  
      │       ├──📂 ... 
      ├──📂 exec-logs  
      │   ├──📂 github  
      │       ├──📂 neuromusic_avocado-prices    
      │       │   ├──📜 explore-avocados-from-all-sides.exec.log  
      │       │   ├──📜 explore-avocados-from-all-sides.repair.log        
      │       ├──📂 toramky_automobile-dataset    
      │       │   ├──📜 auto-imports-beginner-level-analysis.exec.log  
      │       │   ├──📜 auto-imports-beginner-level-analysis.repair.log  
      │       │   ├──📜 ...  
      │       ├──📂 ... 

The mapping of baseline names to their directory names:

  • Relancer_github: github
  • Relancer_doc: apidoc
  • Relancer_text: text
  • Relancer_random: random
  • Relancer_naive: naive

This script also generates 5 batch logs, one for each baseline: relancer-exp/batch-log-github.txt, relancer-exp/batch-log-apidoc.txt, relancer-exp/batch-log-text.txt, relancer-exp/batch-log-random.txt, relancer-exp/batch-log-naive.txt.

Commands for getting the number of notebooks fixed by each baseline:

Relancer_github:

grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log-github.txt | wc -l

We expect the number to be 96.

Relancer_doc:

grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log-apidoc.txt | wc -l

We expect the number to be 92.

Relancer_text:

grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log-text.txt | wc -l

We expect the number to be 102.

Relancer_random:

grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log-random.txt | wc -l

Due to the random nature of Relancer_random, the exact result varies across runs. However, we expect the number to be around 100 when running on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.

Relancer_naive:

grep -E '\[INFO\] This case is fully fixed!' relancer-exp/batch-log-naive.txt | wc -l

Due to the random nature of Relancer_naive, the exact result varies across runs. However, we expect the number to be around 80 when running on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM.
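
Equivalently, all five counts can be tallied in one pass with a short Python sketch (using the batch log names listed above):

# Tally the number of fixed notebooks per baseline from its batch log (sketch)
for name in ("github", "apidoc", "text", "random", "naive"):
    path = f"relancer-exp/batch-log-{name}.txt"
    with open(path, errors="ignore") as log:
        fixed = sum("[INFO] This case is fully fixed!" in line for line in log)
    print(f"{name}: {fixed} notebooks fixed")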

Note: All the baselines' performance can be affected by the machine configuration. We have tested run_baselines.sh on an Intel Core i7-8650U CPU @ 1.90GHz × 8 machine with 16 GB RAM, where the baselines fix the numbers of notebooks shown in Figure 5b and Figure 6 of the paper within the preset 30-minute time limit. Running on a different machine may produce different results: if the machine specs are significantly lower than ours, the baselines may time out on some subjects. Furthermore, Relancer_random and Relancer_naive use a random choice generator to determine repair actions, which can make their results fluctuate across runs.


4. Reproduce charts in the paper without running the full experiment

Since running the baselines takes substantial time and their randomness can affect the results, we offer a script that reproduces the charts in the paper (Figure 5b and Figure 6) from the log files of our original experiment. Please note that, since there is no GUI access inside Docker, the results are presented in tabular form.

Launch a Docker container relancer-charts and activate the environment:

docker run --name relancer-charts --rm -i -t -v `pwd`:/Scratch relancer
conda activate relancer

Run the script using the following command:

python data/reproduce-plots.py

This script processes our experimental log files (in data/exec-logs) and outputs the data for plotting the charts:

----------------------------
1. Figure 5b data:

RELANCER: 142
RELANCER$_{github}$: 96
RELANCER$_{doc}$: 92
----------------------------
----------------------------
2. Figure 6 data:

            RELANCER  RELANCER$_{random}$  RELANCER$_{text}$  RELANCER$_{naive}$
Time (min)                                                                      
1                 98                   73                 57                  42
5                124                   92                 80                  60
10               133                  102                 85                  66
15               138                  105                 91                  70
20               140                  107                 93                  71
25               141                  107                 99                  75
30               142                  108                102                  78
----------------------------

In addition, two chart files, fig5b.eps and fig6.eps, are generated in the current working directory; they correspond to the figures in the paper.


Citation

If you would like to use Relancer in your research, please cite our ASE'21 paper:

@inproceedings{ZhuETAL2021Relancer,
  author = {Zhu, Chenguang and Saha, Ripon and Prasad, Mukul and Khurshid, Sarfraz},
  booktitle = {Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering},
  title = {Restoring the Executability of Jupyter Notebooks by Automatic Upgrade of Deprecated APIs},
  year = {2021}
}

About

License: Apache License 2.0

Languages: Python 90.5%, Shell 9.5%, Dockerfile 0.0%