shubh0125 / LLM_Vulnerability

LLM_Vulnerability

Installation and Usage Guide

Installation

  1. Clone the repository to your local machine.
  2. Navigate to the repository directory.
  3. Install the required packages using the requirements.txt file:
pip install -r requirements.txt

Steps to Extract the C++ Files

Step 1: Preprocess Data

  • Remove unwanted columns from your DataFrame.
  • Add a primary key column if it is not already present.
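
A minimal sketch of this preprocessing step with pandas. The helper name `preprocess`, the key column name `id`, and the `notes` column are assumptions for illustration; only `Commit` and `FilePath_y` appear elsewhere in this guide.

```python
import pandas as pd

# Hypothetical input frame; "notes" stands in for any unwanted column.
df = pd.DataFrame({
    "Commit": ["abc123", "def456"],
    "FilePath_y": ["src/main.cpp", "src/util.cpp"],
    "notes": ["unused", "unused"],
})

def preprocess(frame, keep_columns, key_column="id"):
    """Keep only the needed columns and add a 1-based primary key if missing."""
    out = frame[list(keep_columns)].copy()
    if key_column not in out.columns:
        # Insert the key as the first column: 1, 2, 3, ...
        out.insert(0, key_column, list(range(1, len(out) + 1)))
    return out

clean = preprocess(df, ["Commit", "FilePath_y"])
print(clean)
```

Selecting the columns to keep, rather than listing the ones to drop, keeps the sketch stable if the input gains extra columns later.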

Step 2: Extract and Save Vulnerable and Commit Files

  • Use the SaveCPPFilesInFolder class to extract files using the GitHub API and save them in a designated folder.
SaveCPPFilesInFolder.to_save_as_CPP_files_in_folder(main_folder_name, repository_name, repository_owner, id_for_file_name, Github_Token, df, 'Commit', 'FilePath_y')

Step 3: Create JSON File to Record Outputs

  • Use the JsonFunctions class to create the JSON file that records the outputs.
JsonFunctions.create_json_file_format_1("newjson", total_rows)

Step 4: Extract Functions and Add Code to JSON File

  • Extract each function from the C++ files and add the code to the JSON file.

Files Used:

  • path: Current commit file path constructed from id.
  • previous_commit_path: Previous commit file path constructed from id.

Functions Used:

  • ParseCPPToExtractFunction.read_cpp_file(path): Reads the content of a given C++ file (path).
  • ParseCPPToExtractFunction.find_function_content(content, file_path_and_function_name): Finds and returns the content of a specific function (file_path_and_function_name) within the parsed C++ content.
  • JsonFunctions.add_neutral_commit_code_to_json_format1(json_file_name, function_code, id): Adds neutral commit code to a JSON format file.
  • JsonFunctions.add_vulnerable_commit_code_to_json_format1(json_file_name, previous_commit_function_code, id): Adds vulnerable commit code to a JSON format file.

These functions are applied in a loop over the rows of a DataFrame (df). Each row supplies the id and function_name_cfg values used to build the file paths and to look up the function name.

Following these steps preprocesses your data, downloads the C++ files from GitHub commits into local folders, creates the JSON file that records outputs, and stores each extracted function in that file.



File Information

1. CommitInformation

The CommitInformation class provides methods for interacting with the GitHub API to retrieve commit details, file contents, and related information. It is designed to handle requests, manage rate limits, and retry on failures.

Features

  • Retrieve commit details.
  • Retrieve the Git tree of a commit.
  • Retrieve file content from a commit.
  • Find a specific file in the Git tree.
  • Retrieve the previous commit in a repository.
  • Extract file information from a commit with retry logic and rate limit handling.

Requirements

  • Python 3.x
  • requests library
  • GitHub personal access token

Usage

Class Methods

1. get_commit_details(owner, repo, commit_hash, token)

Fetches the details of a specific commit.

Parameters:

  • owner: Owner of the repository.
  • repo: Name of the repository.
  • commit_hash: Commit SHA.
  • token: GitHub personal access token.

Returns:

  • JSON response containing commit details.
commit_details = CommitInformation.get_commit_details(owner, repo, commit_hash, token)
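
This guide does not show how the class builds its request, so the following is only a sketch of what such a call could look like against GitHub's REST endpoint for a single commit. The helper name `build_commit_request` is hypothetical, and the header choices are assumptions rather than the class's confirmed behavior.

```python
def build_commit_request(owner, repo, commit_hash, token):
    """Return (url, headers) for GitHub's 'get a commit' REST endpoint."""
    url = f"https://api.github.com/repos/{owner}/{repo}/commits/{commit_hash}"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    return url, headers

url, headers = build_commit_request("my-user", "my-repo", "abc123", "your_github_token")
print(url)
# Sending it with the requests library would look like:
#   requests.get(url, headers=headers, timeout=30).json()
```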

2. get_tree(owner, repo, tree_sha, token)

Fetches a Git tree by its SHA, for example the tree that a specific commit points to.

Parameters:

  • owner: Owner of the repository.
  • repo: Name of the repository.
  • tree_sha: Tree SHA.
  • token: GitHub personal access token.

Returns:

  • JSON response containing the tree details.
tree = CommitInformation.get_tree(owner, repo, tree_sha, token)

3. get_file_content(owner, repo, file_sha, token)

Fetches the content of a specific file from its SHA.

Parameters:

  • owner: Owner of the repository.
  • repo: Name of the repository.
  • file_sha: File SHA.
  • token: GitHub personal access token.

Returns:

  • Content of the file as text.
file_content = CommitInformation.get_file_content(owner, repo, file_sha, token)
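
GitHub's Git blobs endpoint returns file content base64-encoded by default, so a method like this has to decode it at some point. The sketch below shows that decoding step on its own; `decode_blob` is a hypothetical helper, and whether the class decodes this way or requests raw content instead is an assumption.

```python
import base64

def decode_blob(blob_json):
    """Return the text carried by a Git blob response."""
    content = blob_json.get("content", "")
    if blob_json.get("encoding") == "base64":
        # b64decode ignores the line breaks GitHub inserts in the payload.
        return base64.b64decode(content).decode("utf-8", errors="replace")
    return content

sample = {
    "encoding": "base64",
    "content": base64.b64encode(b"int main() { return 0; }\n").decode("ascii"),
}
print(decode_blob(sample))
```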

4. find_file_in_tree(tree, file_path)

Finds a specific file in the Git tree.

Parameters:

  • tree: JSON response of the tree.
  • file_path: Path of the file.

Returns:

  • SHA of the file if found, otherwise None.
file_sha = CommitInformation.find_file_in_tree(tree, file_path)
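
The lookup can be sketched as a scan over the `tree` list of a Git Trees response. The function name `find_file_sha` is hypothetical and the class's actual implementation may differ. Note that full paths such as `src/main.cpp` only appear in the response when the tree was requested recursively.

```python
def find_file_sha(tree, file_path):
    """Return the blob SHA for file_path in a Git Trees response, else None."""
    for entry in tree.get("tree", []):
        if entry.get("type") == "blob" and entry.get("path") == file_path:
            return entry.get("sha")
    return None

# Trimmed-down example of the response shape (SHAs are placeholders).
sample_tree = {
    "sha": "tree123",
    "tree": [
        {"path": "src", "type": "tree", "sha": "dir456"},
        {"path": "src/main.cpp", "type": "blob", "sha": "blob789"},
    ],
}
print(find_file_sha(sample_tree, "src/main.cpp"))
print(find_file_sha(sample_tree, "src/missing.cpp"))
```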

5. view_file_in_commit(owner, repo, commit_hash, file_path, token)

Fetches the content of a specific file from a commit.

Parameters:

  • owner: Owner of the repository.
  • repo: Name of the repository.
  • commit_hash: Commit SHA.
  • file_path: Path of the file.
  • token: GitHub personal access token.

Returns:

  • Content of the file as text, or None if the file is not found.
file_content = CommitInformation.view_file_in_commit(owner, repo, commit_hash, file_path, token)

6. get_previous_commit(owner, repo, commit_hash, token)

Fetches the SHA of the previous commit.

Parameters:

  • owner: Owner of the repository.
  • repo: Name of the repository.
  • commit_hash: Current commit SHA.
  • token: GitHub personal access token.

Returns:

  • SHA of the previous commit, or None if there are no parents (initial commit).
previous_commit_hash = CommitInformation.get_previous_commit(owner, repo, commit_hash, token)
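
In GitHub's commit response, parent commits are listed under `parents`. A minimal sketch of picking the previous commit from that list follows; `first_parent_sha` is a hypothetical name, and taking the first parent of a merge commit is an assumption about the intended behavior.

```python
def first_parent_sha(commit_details):
    """Return the SHA of the first parent, or None for an initial commit."""
    parents = commit_details.get("parents", [])
    return parents[0]["sha"] if parents else None

regular = {"sha": "def456", "parents": [{"sha": "abc123"}]}
initial = {"sha": "abc123", "parents": []}
print(first_parent_sha(regular))
print(first_parent_sha(initial))
```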

7. get_file_information(owner, repo, commit_sha, file_path, token)

Fetches file information from a commit with retry logic and rate limit handling.

Parameters:

  • owner: Owner of the repository.
  • repo: Name of the repository.
  • commit_sha: Commit SHA.
  • file_path: Path of the file.
  • token: GitHub personal access token.

Returns:

  • JSON response containing file information, or an error message if the request fails.
file_info = CommitInformation.get_file_information(owner, repo, commit_sha, file_path, token)

Notes

  • Ensure the GitHub token has the necessary permissions to access the repository and its commits.
  • Handle exceptions appropriately to avoid abrupt termination of the script.
  • Adjust the sleep duration and retry logic as needed to manage rate limits effectively.
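
One way to structure the retry and sleep logic mentioned in these notes is sketched below. It is not the class's actual code: `fetch_with_retry`, its parameters, and the `(status, headers, body)` shape of the `fetch` callable are all assumptions. The sketch waits for the `Retry-After` value when a rate-limited response supplies one and otherwise backs off exponentially.

```python
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns status 200, sleeping between failures.

    fetch must return a (status_code, headers, body) tuple.
    """
    last_status = None
    for attempt in range(max_attempts):
        status, headers, body = fetch()
        if status == 200:
            return body
        last_status = status
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        if status in (403, 429) and "Retry-After" in headers:
            delay = float(headers["Retry-After"])  # server-suggested wait
        else:
            delay = base_delay * (2 ** attempt)  # exponential backoff
        sleep(delay)
    raise RuntimeError(
        f"request failed after {max_attempts} attempts (last status {last_status})"
    )

# Simulated responses: one rate-limited reply, then success.
responses = iter([(429, {"Retry-After": "2"}, None), (200, {}, {"sha": "abc123"})])
waits = []
result = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter lets the logic be exercised without real waiting, as the simulated run shows.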


2. JsonFunctions

The JsonFunctions class provides methods to handle JSON file operations, including creating JSON files in specific formats, opening and reading JSON files, and updating JSON files with specific data.

Features

  • Create JSON files in two different formats.
  • Open and read JSON files.
  • Add vulnerable and neutral commit code to JSON files.
  • Save JSON data to files.

Requirements

  • Python 3.x
  • json library

Usage

Class Methods

1. open_json_file(file_name)

Opens and reads the content of a JSON file.

Parameters:

  • file_name: Name of the JSON file to open.

Returns:

  • The data from the JSON file as a dictionary.
data = JsonFunctions.open_json_file("data.json")

2. create_json_file_format_1(file_name, limit)

Creates a JSON file in the first specified format with a given limit.

Parameters:

  • file_name: Name of the JSON file to create.
  • limit: Number of versions to include in the JSON file.

Creates a JSON file with the following structure:

{
    "V_001": {
        "commit_code": {
            "Code": "",
            "Size": 0,
            "Complexity": 0,
            "Memory Management": 0,
            "Code Complexity": 0,
            "Error Handling": 0
        },
        "neutral_code": {
            "Code": "",
            "Size": 0,
            "Complexity": 0,
            "Memory Management": 0,
            "Code Complexity": 0,
            "Error Handling": 0
        }
    },
    ...
}
JsonFunctions.create_json_file_format_1("data_format_1", 5)
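
The structure above can be generated with a short loop. The sketch below is a guess at the shape of that logic, not the method's source: `build_format_1` and `empty_section` are hypothetical names, and numbering the keys from `V_001` with three-digit padding is inferred from the example key only.

```python
import json

METRIC_FIELDS = (
    "Size",
    "Complexity",
    "Memory Management",
    "Code Complexity",
    "Error Handling",
)

def empty_section():
    """Return one blank section: empty code plus zeroed metrics."""
    section = {"Code": ""}
    section.update({name: 0 for name in METRIC_FIELDS})
    return section

def build_format_1(limit):
    """Return a dict with `limit` entries keyed V_001, V_002, ..."""
    return {
        f"V_{i:03d}": {
            "commit_code": empty_section(),
            "neutral_code": empty_section(),
        }
        for i in range(1, limit + 1)
    }

data = build_format_1(5)
print(json.dumps(data["V_001"], indent=4))
```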

3. create_json_file_format_2(file_name, limit)

Creates a JSON file in the second specified format with a given limit.

Parameters:

  • file_name: Name of the JSON file to create.
  • limit: Number of versions to include in the JSON file.

Creates a JSON file with the following structure (same as format 1 in this example):

{
    "V_001": {
        "commit_code": {
            "Code": "",
            "Size": 0,
            "Complexity": 0,
            "Memory Management": 0,
            "Code Complexity": 0,
            "Error Handling": 0
        },
        "neutral_code": {
            "Code": "",
            "Size": 0,
            "Complexity": 0,
            "Memory Management": 0,
            "Code Complexity": 0,
            "Error Handling": 0
        }
    },
    ...
}
JsonFunctions.create_json_file_format_2("data_format_2", 5)

4. add_vulnerable_commit_code_to_json_format1(file_name, code_to_be_added, id)

Adds vulnerable commit code to the JSON file in format 1.

Parameters:

  • file_name: Name of the JSON file to update.
  • code_to_be_added: The code to be added.
  • id: The version ID to update.
JsonFunctions.add_vulnerable_commit_code_to_json_format1("data_format_1.json", "vulnerable code", "V_001")

5. add_neutral_commit_code_to_json_format1(file_name, code_to_be_added, id)

Adds neutral commit code to the JSON file in format 1.

Parameters:

  • file_name: Name of the JSON file to update.
  • code_to_be_added: The code to be added.
  • id: The version ID to update.
JsonFunctions.add_neutral_commit_code_to_json_format1("data_format_1.json", "neutral code", "V_001")
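
Both add methods amount to a read, modify, write cycle on one entry of the file. The sketch below illustrates that cycle with a hypothetical helper, `set_code`. This guide does not state which section (`commit_code` or `neutral_code`) each method writes to, so the sketch takes the section name as a parameter instead of assuming a mapping.

```python
import json
import os
import tempfile

def set_code(file_name, version_id, section, code):
    """Store `code` under data[version_id][section]["Code"] and save the file."""
    with open(file_name, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    data[version_id][section]["Code"] = code
    with open(file_name, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=4)

# Demonstration on a throwaway file with a single entry.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(
            {"V_001": {"commit_code": {"Code": ""}, "neutral_code": {"Code": ""}}},
            fh,
        )
    set_code(path, "V_001", "neutral_code", "int f() { return 0; }")
    with open(path, encoding="utf-8") as fh:
        updated = json.load(fh)

print(updated["V_001"]["neutral_code"]["Code"])
```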

6. dump_data_in_json_file(file_name, json_data)

Dumps the provided JSON data into a file.

Parameters:

  • file_name: Name of the JSON file to save.
  • json_data: The JSON data to save.
data = {"example": "data"}
JsonFunctions.dump_data_in_json_file("data.json", data)

Notes

  • Ensure the file name provided includes the correct path if the file is not in the same directory as the script.
  • Handle exceptions appropriately to avoid abrupt termination of the script.
  • The create_json_file_format_2 method currently creates the same format as create_json_file_format_1. If a different format is needed, modify the method accordingly.


3. ParseCPPToExtractFunction

The ParseCPPToExtractFunction class provides methods to read C++ files and extract the content of specific functions from the given code.

Features

  • Read the content of a C++ file.
  • Extract the complete content of a specified function from the C++ code.

Requirements

  • Python 3.x
  • re library

Usage

Class Methods

1. read_cpp_file(file_path)

Opens and reads the content of a C++ file.

Parameters:

  • file_path: Path to the C++ file to be read.

Returns:

  • The content of the C++ file as a string.
  • Error message if the file is not found or an I/O error occurs.
content = ParseCPPToExtractFunction.read_cpp_file("example.cpp")

2. find_function_content(content, function_name)

Searches for and extracts the complete content of a specified function from the given C++ code.

Parameters:

  • content: The content of the C++ file as a string.
  • function_name: The name of the function to extract.

Returns:

  • The complete function as a string.
  • "None" if the function is not found or the end of the function cannot be determined.
function_content = ParseCPPToExtractFunction.find_function_content(content, "myFunction")

Notes

  • Ensure the file path provided includes the correct path if the file is not in the same directory as the script.
  • Handle exceptions appropriately to avoid abrupt termination of the script.
  • The find_function_content method uses regular expressions to locate the function declaration and identify the function body based on brace matching. This method assumes well-formed C++ code.
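
The regular-expression-plus-brace-matching approach described in the last note can be sketched as follows. This is an independent illustration, not the class's code: `extract_function` is a hypothetical name, the pattern is an assumption, and the sketch deliberately ignores braces inside strings or comments and parameter lists that contain nested parentheses.

```python
import re

def extract_function(content, function_name):
    """Return the text of function_name's definition, or None if not found."""
    # Name, a parameter list without nested parentheses, then the opening brace.
    pattern = re.compile(
        r"\b" + re.escape(function_name) + r"\s*\([^;{}()]*\)\s*\{"
    )
    match = pattern.search(content)
    if match is None:
        return None
    # Start at the beginning of the line so the return type is included.
    start = content.rfind("\n", 0, match.start()) + 1
    depth = 0
    for pos in range(match.end() - 1, len(content)):
        ch = content[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return content[start:pos + 1]
    return None  # braces never balanced

source = """int helper(int x) {
    return x + 1;
}

int myFunction(int a) {
    if (a > 0) {
        return helper(a);
    }
    return 0;
}
"""
print(extract_function(source, "myFunction"))
```

Because the pattern requires an opening brace right after the parameter list, calls such as `helper(a);` are skipped and only definitions match.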


4. SaveCPPFilesInFolder

SaveCPPFilesInFolder saves the content of C++ files from specific commits, and from each commit's previous commit, in a GitHub repository to a local folder structure. Each subfolder is named after a value in the primary key column of a given dataframe, and each file name uses that value as a prefix.

Features

  • Creates a main folder and subfolders based on a primary key column from a dataframe.
  • Iterates through the dataframe to get commit hashes and file paths.
  • Retrieves and saves the current and previous commit file content from GitHub.
  • Saves the files locally with appropriate naming conventions.

Requirements

  • Python 3.x
  • pandas
  • os
  • CommitInformation module (provides functions to interact with GitHub commits)

Usage

Function Definition

def to_save_as_CPP_files_in_folder(main_folder_name, repository_name, repository_owner, id_for_file_name, Github_Token, Dataframe, Github_CommitHash_Column_name, Github_File_Path_column_name):

Parameters

  • main_folder_name: Name of the main folder where the files will be saved.
  • repository_name: Name of the GitHub repository.
  • repository_owner: Owner of the GitHub repository.
  • id_for_file_name: The primary key column in the dataframe which will be used as a file prefix.
  • Github_Token: GitHub personal access token for authentication.
  • Dataframe: Dataframe containing the required information (commit hashes, file paths, etc.).
  • Github_CommitHash_Column_name: Column name in the dataframe that contains the commit hash.
  • Github_File_Path_column_name: Column name in the dataframe that contains the file path.

Example

import pandas as pd

# Sample dataframe
data = {
    'ID': [1, 2],
    'Commit_Hash': ['abc123', 'def456'],
    'File_Path': ['src/main.cpp', 'src/util.cpp']
}
df = pd.DataFrame(data)

# Call the function through its class, as in Step 2 of the usage guide
SaveCPPFilesInFolder.to_save_as_CPP_files_in_folder(
    main_folder_name='MyCPPFiles',
    repository_name='my-repo',
    repository_owner='my-user',
    id_for_file_name='ID',
    Github_Token='your_github_token',
    Dataframe=df,
    Github_CommitHash_Column_name='Commit_Hash',
    Github_File_Path_column_name='File_Path'
)

Workflow

  1. Create Main Folder: Creates a main folder based on main_folder_name.
  2. Iterate DataFrame: Iterates through each row of the dataframe.
  3. Create Subfolders: Creates subfolders based on the primary key column.
  4. Retrieve File Content:
    • Retrieves the content of the file in the current commit.
    • Retrieves the content of the file in the previous commit.
  5. Save Files Locally: Saves the retrieved content to files in the corresponding subfolders.

CommitInformation Module Functions

  • view_file_in_commit: Retrieves the content of a file in a specific commit.
  • get_previous_commit: Retrieves the hash of the previous commit.

Example Usage

current_file_content = CommitInformation.view_file_in_commit(repository_owner, repository_name, commit_hash, file_path, Github_Token)
previous_commit_hash = CommitInformation.get_previous_commit(repository_owner, repository_name, commit_hash, Github_Token)
previous_file_content = CommitInformation.view_file_in_commit(repository_owner, repository_name, previous_commit_hash, file_path, Github_Token)

Notes

  • Ensure that the GitHub token has the necessary permissions to access the repository and its commits.
  • The script handles exceptions and prints appropriate error messages if the file is not present in the specified commit or if there are other issues.

Sample Output

  • The files will be saved in the following structure:
MyCPPFiles/
    ├── 1/
    │   ├── 1_commit.cpp
    │   └── 1_previous_commit.cpp
    └── 2/
        ├── 2_commit.cpp
        └── 2_previous_commit.cpp
  • Each file will contain the content of the C++ file from the specified commit and its previous commit.
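
The layout above can be reproduced with a few lines of `os` calls. The sketch below writes one current/previous pair into a temporary directory; `save_commit_pair` is a hypothetical helper and the file contents are placeholders, not output from the real script.

```python
import os
import tempfile

def save_commit_pair(main_folder, row_id, current_code, previous_code):
    """Write <id>_commit.cpp and <id>_previous_commit.cpp under main_folder/<id>/."""
    folder = os.path.join(main_folder, str(row_id))
    os.makedirs(folder, exist_ok=True)
    paths = {
        "commit": os.path.join(folder, f"{row_id}_commit.cpp"),
        "previous_commit": os.path.join(folder, f"{row_id}_previous_commit.cpp"),
    }
    for key, code in (("commit", current_code), ("previous_commit", previous_code)):
        with open(paths[key], "w", encoding="utf-8") as fh:
            fh.write(code)
    return paths

with tempfile.TemporaryDirectory() as tmp:
    main = os.path.join(tmp, "MyCPPFiles")
    paths = save_commit_pair(main, 1, "int fixed();\n", "int old();\n")
    created = sorted(os.listdir(os.path.join(main, "1")))

print(created)
```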

This script simplifies the process of retrieving and saving C++ file content from GitHub commits for further analysis or reference.
