

A-natural-language-to-bash-commands-parallel-corpus

An effort to create a natural language to bash command parallel corpus.

The datasets are retrieved from Stack Exchange, whose content is licensed under CC BY-SA.

The files QueryFile 1.txt through QueryFile 6.txt contain the SQL queries used to retrieve data from the Stack Exchange Data Explorer, specifically from the Ask Ubuntu and Unix & Linux sites.

The Data folder contains all the intermediate data files as well as the final dataset.

The files Dataframe_Linux.csv and Dataframe_AskUbuntu.csv under the Data folder hold the results obtained from the queries above.
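
As a quick sanity check, the query results can be loaded with pandas; this sketch makes no assumption about the columns beyond inspecting what the files actually contain:

```python
import pandas as pd

df_linux = pd.read_csv("Data/Dataframe_Linux.csv")
df_askubuntu = pd.read_csv("Data/Dataframe_AskUbuntu.csv")

# Inspect what the SQL queries actually selected
print(df_linux.columns.tolist())
print(len(df_linux), "Unix & Linux rows,", len(df_askubuntu), "Ask Ubuntu rows")
```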

download_answers.ipynb downloads answers from Stack Exchange by sending HTTP GET requests for the answer IDs collected above in the Dataframe_Linux.csv and Dataframe_AskUbuntu.csv files. The responses are dumped with pickle into posts_from_unix.txt and posts_from_askubuntu.txt, located under the Data folder.
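
A minimal sketch of such a download step, assuming the notebook uses the public Stack Exchange API (api.stackexchange.com); the exact endpoint, batching, and error handling in the notebook may differ:

```python
import pickle
import requests

API = "https://api.stackexchange.com/2.3/answers/{ids}"

def fetch_answers(answer_ids, site):
    """Fetch answer bodies in batches of 100 (the API's per-request limit)."""
    posts = []
    for i in range(0, len(answer_ids), 100):
        batch = ";".join(str(a) for a in answer_ids[i:i + 100])
        resp = requests.get(API.format(ids=batch),
                            params={"site": site, "filter": "withbody"})
        resp.raise_for_status()
        posts.extend(resp.json().get("items", []))
    return posts

# e.g. site="unix" for Unix & Linux, site="askubuntu" for Ask Ubuntu
posts = fetch_answers([123456, 234567], site="unix")
with open("Data/posts_from_unix.txt", "wb") as f:
    pickle.dump(posts, f)
```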

The downloaded answers are then filtered according to various rules using Filter_Dataframe_AskUbuntu.ipynb and Filter_Dataframe_Linux.ipynb. The results are dumped with pickle into filtered1.txt and filtered2.txt, located under the Data folder.
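
The filtering rules themselves are not spelled out here, so the following sketch only shows the general shape of such a pass: pull code spans out of the answer HTML and keep plausible one-line commands. Every heuristic below is an assumption, not the notebooks' actual rules:

```python
import pickle
from bs4 import BeautifulSoup

def extract_commands(posts):
    """Extract <code> spans from answer HTML and keep plausible one-liners.
    The specific heuristics here are assumptions, not the notebooks' rules."""
    kept = []
    for post in posts:
        soup = BeautifulSoup(post["body"], "html.parser")
        for code in soup.find_all("code"):
            cmd = code.get_text().strip()
            if "\n" in cmd:            # keep single-line commands only
                continue
            if len(cmd.split()) < 2:   # drop bare words like 'ls'
                continue
            kept.append({"answer_id": post["answer_id"], "cmd": cmd})
    return kept

with open("Data/posts_from_unix.txt", "rb") as f:
    posts = pickle.load(f)
with open("Data/filtered1.txt", "wb") as f:
    pickle.dump(extract_commands(posts), f)
```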

The filtered answers are further processed using create_final.ipynb.

The answers are further filtered depending on whether or not the bashlint parser can parse them; this parser only accepts well-formed single-line bash commands. The parser is also used to convert the commands into a standard template form. This code is available in parse_and_filter.ipynb. The bashlint parser is obtained from the IBM/clai repository, available under utils.
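
A sketch of the parse-and-filter step, assuming the bashlint package from IBM/clai is importable as below; the function names (bash_parser, ast2template) follow bashlint's data_tools module but may differ between versions:

```python
from bashlint import data_tools

def to_template(cmd):
    """Return a standardized template for a command, or None if it
    cannot be parsed (multi-line or malformed commands fail here)."""
    ast = data_tools.bash_parser(cmd)
    if ast is None:
        return None
    return data_tools.ast2template(ast, loose_constraints=True)

print(to_template("find . -name '*.txt' -delete"))
# e.g. 'find Path -name Regex -delete' (exact placeholders depend on bashlint)
```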

Dataset.csv under the Data folder contains the final clean corpus of 10,260 natural language and bash command pairs, close in size to the 9,305-pair corpus provided by the IBM nlc2cmd competition. The two corpora can be mixed into a single corpus for a code generation task. Dataset_multiple.csv under the Data folder contains a noisy version of this data, with multiple code lines per natural language description.
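
Mixing the two corpora can be a simple concatenation; the column names ("invocation", "cmd") and the nlc2cmd data path below are assumptions and would need to be adapted to the actual file schemas:

```python
import pandas as pd

# Column names ("invocation", "cmd") are assumptions; rename to match
# whatever the two files actually use before concatenating.
ours = pd.read_csv("Data/Dataset.csv")
ibm = pd.read_json("nl2bash-data.json", orient="index")  # hypothetical path

mixed = pd.concat(
    [ours[["invocation", "cmd"]], ibm[["invocation", "cmd"]]],
    ignore_index=True,
).drop_duplicates()
mixed.to_csv("Data/mixed_corpus.csv", index=False)
```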

The natural language descriptions can be tokenized simply by whitespace or with a subword tokenizer and fed to a code generation model. A tokenizer for bash is provided by the IBM/clai repository and can be used to tokenize the bash commands that supervise the above task.
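
A sketch of tokenizing one pair, again assuming bashlint from IBM/clai; the bash tokenizer's exact name and signature may vary between versions:

```python
from bashlint import data_tools

nl = "find all txt files in the current directory and delete them"
cmd = "find . -name '*.txt' -delete"

nl_tokens = nl.split()                       # simple whitespace tokenization
cmd_tokens = data_tools.bash_tokenizer(cmd)  # bash-aware tokenization

print(nl_tokens)
print(cmd_tokens)
```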
