meta-llama / PurpleLlama

Set of tools to assess and improve LLM security.

Question about the dataset in CyberSecurity benchmark

aditya1709 opened this issue

Is every code snippet under the datasets/autocomplete/autocomplete.json and datasets/instruct/instruct.json insecure code?
Hypothetically, if I were to develop an insecure code detector and use the autocomplete and instruct datasets to evaluate it, I could only measure recall, because every sample is insecure/malicious and there are no secure samples against which to assess precision.

Is my understanding correct?

Hi @aditya1709,

Is every code snippet under the datasets/autocomplete/autocomplete.json and datasets/instruct/instruct.json insecure code?

Yes, that is correct. Specifically, the value of origin_code represents the insecure code, and cwe_identifier corresponds to the type of insecurity.
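To illustrate, here is a minimal sketch of inspecting the dataset, assuming autocomplete.json is a JSON array of objects carrying the origin_code and cwe_identifier fields mentioned above; the exact path and any remaining fields depend on your checkout of the repo and are assumptions here.

```python
import json
from collections import Counter

# Minimal sketch, assuming autocomplete.json is a JSON array of objects
# with the "origin_code" and "cwe_identifier" fields mentioned above;
# the exact path and remaining fields are assumptions.
with open("datasets/autocomplete/autocomplete.json") as f:
    samples = json.load(f)

# Every origin_code is an insecure snippet; count snippets per CWE.
cwe_counts = Counter(sample["cwe_identifier"] for sample in samples)
print(f"{len(samples)} insecure snippets")
for cwe, count in cwe_counts.most_common(5):
    print(f"  {cwe}: {count}")
```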

I am not 100% confident that I understand your hypothetical question, but I believe an ideal insecure code detector should be capable of correctly recognizing all of the insecure patterns listed in the datasets.

P.S. You may have already noticed that our codebase provides an insecure code detector for you to try: https://github.com/facebookresearch/PurpleLlama/tree/main/CybersecurityBenchmarks/insecure_code_detector

@SimonWan Thanks for the response.
What I meant was: in order to determine the performance of any insecure code detector (something I come up with myself, e.g., a custom model), we would need both positive and negative samples. Hypothetically speaking, if I only used the autocomplete/instruct dataset for benchmarking my custom ICD, I could only measure recall, because there are no secure/non-malicious samples against which to assess precision. Does that make sense?
I guess what I am alluding to is that the paper reports the performance of your Insecure Code Detector as 96% precision and 79% recall, and I want to know whether the dataset used to produce these numbers is part of the repo or not.

@aditya1709 Thank you for elaborating. Now I have a better understanding of your question.

Hypothetically speaking, if I only used the autocomplete/instruct dataset for benchmarking my custom ICD, I could only measure recall, because there are no secure/non-malicious samples against which to assess precision. Does that make sense?

Yes, that makes sense. And yes: if you only use the autocomplete/instruct origin_code entries as the benchmark for a hypothetical ICD, you can only measure recall, because all of the samples are insecure and false positives cannot occur.
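To make the distinction concrete, here is a minimal sketch with made-up labels and predictions (no part of the PurpleLlama codebase is used): on an all-insecure test set there can be no false positives, so precision is trivially perfect while recall is the only informative number.

```python
# Toy evaluation: labels/predictions are made up, and the "detector" is
# implied by the predictions list rather than being a real ICD.
def evaluate(labels, predictions):
    tp = sum(l and p for l, p in zip(labels, predictions))
    fp = sum((not l) and p for l, p in zip(labels, predictions))
    fn = sum(l and (not p) for l, p in zip(labels, predictions))
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

labels = [True, True, True, True]         # all samples insecure, as in the datasets
predictions = [True, True, False, True]   # a detector that misses one snippet

print(evaluate(labels, predictions))      # (1.0, 0.75): precision is trivial, recall is not
```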

Regarding the second part of your question:

the performance of your Insecure Code Detector as 96% precision and 79% recall

The precision and recall were calculated using 50 manually-selected code snippets, which included both secure and insecure cases. These code snippets are not part of the repo.

Please let me know if this clarifies your question.

Note: the above 50 snippets were generated by the LLM based on our test_case_prompt and then labeled by our security experts as secure or insecure.

Thank you! That makes sense. Any plans to include the manually labeled dataset in the future for benchmarking purposes?
I understand that such a dataset is not the intention of this repo at all, but it might be useful for folks.

Thank you! That makes sense. Any plans to include the manually labeled dataset in the future for benchmarking purposes? I understand that such a dataset is not the intention of this repo at all, but it might be useful for folks.

While we don't have plans to include the labeled data at this time, we totally agree it could be useful for folks! As this is an open-source project, the community is welcome to contribute such a dataset.
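Purely as a hypothetical illustration of what such a contribution could look like, the sketch below adds a label field alongside the existing origin_code and cwe_identifier fields; neither this layout nor the placeholder my_detector exists in the repo today.

```python
# Hypothetical labeled evaluation records; the "label" field and the
# placeholder detector below are illustrations, not part of PurpleLlama.
labeled = [
    {"origin_code": "strcpy(dst, src);", "cwe_identifier": "CWE-120", "label": "insecure"},
    {"origin_code": "strncpy(dst, src, sizeof(dst) - 1);", "cwe_identifier": None, "label": "secure"},
]

def my_detector(code: str) -> bool:
    # Stand-in for any custom ICD: flag unbounded strcpy calls.
    return "strcpy(" in code

for sample in labeled:
    flagged = my_detector(sample["origin_code"])
    print(sample["label"], "->", "flagged" if flagged else "not flagged")
```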

Closing the issue, as we have clarified the questions about the dataset. Feel free to reopen it if there are any follow-ups.