bdqnghi / infercode

[ICSE 2021] - InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees

Where are processing_data.sh, CodeClassificationData, CorderModel, yijun/fast and requirements.txt?

Ledenel opened this issue

Thanks for your wonderful work!
I'm trying to reproduce the results, but I got stuck while following the steps in the README:

  1. In the Data Preparation section, I am told to run source process_data.sh, but I can't find process_data.sh in any version of the repo; only the README mentions it. The newest version has a process_data.py, but it does not seem to be implemented yet (only the TreeSitterDataProcessor constructor is called; no processing happens).
  2. I tried skipping the data processing to see whether the existing data (downloaded by python3 download_data.py, named OJ_pycparser_train_test_val.zip) can train the model. Then I noticed that the class CodeClassificationData can only be found in the old version of this repo (since 5a89b64); after that, tree_loader.py was deleted.
  3. I then checked out an older version (5a89b64) and found that CorderModel does not exist. I changed it to InferCodeModel and then ran into path issues in tree_loader.py.

In the line features_file_path_splits[-4] = subtree_features_directory.split("/")[-2], changing both indices to 0 solved this issue. Then I put the .ids.csv files together with the .pkl files.

  4. I found #2 and followed the instructions. I managed to run the examples on CPU with tensorflow==1.15.0 installed via conda, but they failed with tensorflow-gpu==1.15.0. The README indicates that there is a requirements.txt file, but I couldn't find it in the repo.
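The index change from step 3 can be illustrated with a small sketch. The post does not say what the path error actually was, so this assumes one possible cause: a negative index such as [-4] only exists when the path has at least four components, whereas index 0 exists for any path. The helper swap_component and the example paths are hypothetical and are not taken from tree_loader.py.

```python
def swap_component(file_path, directory, target_index, source_index):
    """Overwrite one component of file_path with one component of directory."""
    parts = file_path.split("/")
    parts[target_index] = directory.split("/")[source_index]
    return "/".join(parts)


# Hypothetical, shallow paths (not the real layout of the dataset).
file_path = "features/train.pkl"
directory = "subtrees/train"

try:
    swap_component(file_path, directory, -4, -2)
    deep_index_ok = True
except IndexError:
    # file_path has only two components, so parts[-4] does not exist.
    deep_index_ok = False

# Index 0 is valid for any path, so the same call succeeds.
shallow_result = swap_component(file_path, directory, 0, 0)
```

If this assumption holds, the original indices would work only when the data sits at the directory depth the script expects.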

So these are my questions:
  0. Where are processing_data.sh, CodeClassificationData, and CorderModel? Is the deleted CodeClassificationData still valid? What is the relation between CorderModel and InferCodeModel?

  1. It seems that you are migrating the AST parser from srcml (yijun/fast) to tree-sitter, and that this repo is under heavy development. But results may differ between AST parsers (as mentioned in the README). So, in order to replicate this study, should I follow the newest version, or is there a specific version meant for replicating the study?
  2. I installed the newest pyarrow==3.0.0, scikit-learn==0.24.1, bidict==0.21.2, and keras-radam==0.15.0 with python==3.7.10. tensorflow==1.15.0 is installed via conda. Is this environment OK? Or better, is there a requirements.txt for reference?
  3. Since I can't find the source code of the yijun/fast image, where can I find the details of how the ASTs are processed?
  4. How do I use the existing .pkl files (in OJ_pycparser_train_test_val.zip)? Are they relevant to replicating the study?
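Until a requirements.txt is available, the environment from question 2 can at least be compared against the versions quoted there. The following is a minimal sketch: the pinned versions are simply the ones listed in this post (whether they match the authors' setup is unknown), and the helper names find_mismatches and installed_version are made up for illustration.

```python
def find_mismatches(pinned, lookup):
    """Return {package: (expected, found)} for every package whose version differs.

    `lookup` maps a package name to its installed version string,
    or returns None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


def installed_version(name):
    """Look up an installed version; returns None if the package is missing."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        # Python 3.7 fallback: pkg_resources ships with setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


# Versions quoted in this post; not confirmed by the repository authors.
PINNED = {
    "tensorflow": "1.15.0",
    "pyarrow": "3.0.0",
    "scikit-learn": "0.24.1",
    "bidict": "0.21.2",
    "keras-radam": "0.15.0",
}

if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED, installed_version).items():
        print(f"{name}: expected {expected}, found {found}")
```

Passing the lookup as an argument keeps the comparison independent of how the packages were installed (pip or conda).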

Sincerely, a student.

Regarding (3): there is a minimal explanation of how to use the "yijun/fast" Docker image in the cited ICSE'19 poster paper. You can also run "docker run --rm -v $(pwd):/e -it yijun/fast" to see a synopsis of the parser with all of its options (many of which are experimental). Again, InferCode does not need all of these options. For reproducibility, you can grep for the system calls to the "docker" command in the Python script to tell which options are used.
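The grep suggestion can also be done from Python. This is a rough sketch, assuming the calls appear as plain text containing the word "docker" in .py files somewhere under the checkout; the function name find_docker_calls is made up for illustration.

```python
from pathlib import Path


def find_docker_calls(root):
    """Yield (file, line_number, text) for every line under root that mentions 'docker'."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if "docker" in line:
                yield path, number, line.strip()


if __name__ == "__main__":
    # "." assumes the script is run from the root of a checkout.
    for path, number, line in find_docker_calls("."):
        print(f"{path}:{number}: {line}")
```

Each hit shows the full command string, so the options passed to the parser can be read off directly.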

Hi, did you successfully generate the data input?

Can you guys take another look at the latest instructions?