This Python script identifies whether a given sentence is an error message.
The script uses a bag-of-words model to calculate how frequent and important a word is for each document class. Two slightly different datasets were created: one with roughly 84,000 lines and one with roughly 114,000 lines, of which about 29% are error messages. Each dataset was split into two sub-datasets: one containing only error messages and the other containing only non-error messages. This split makes it possible to calculate the likelihood that a word belongs to the error or the non-error sub-dataset, and to extend that calculation to entire sentences. As a result, the script can estimate the probability that a given sentence is an error message. By serializing the necessary data up front, the tool performs the calculation in under 0.03 seconds. This functionality can be integrated into bug bounty hunting workflows to spot error messages triggered by specific behaviors.
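The approach described above can be sketched as a Naive Bayes style bag-of-words classifier. This is a minimal illustration of the idea, not the script's actual code; all function names, tokenization details, and the add-one smoothing are assumptions.

```python
import math
from collections import Counter

def train(error_lines, text_lines):
    """Count word frequencies separately for each sub-dataset."""
    err_counts = Counter(w for line in error_lines for w in line.lower().split())
    txt_counts = Counter(w for line in text_lines for w in line.lower().split())
    return err_counts, txt_counts

def log_score(sentence, counts, total, vocab_size):
    """Sum of log-likelihoods with add-one smoothing; closer to 0 = more likely."""
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in sentence.lower().split()
    )

def predict(sentence, err_counts, txt_counts):
    """Score the sentence against both sub-datasets; the higher score wins."""
    vocab = set(err_counts) | set(txt_counts)
    err = log_score(sentence, err_counts, sum(err_counts.values()), len(vocab))
    txt = log_score(sentence, txt_counts, sum(txt_counts.values()), len(vocab))
    return ("error" if err > txt else "text"), err, txt

# Toy training data, purely for illustration.
errors = ["syntax error near line 3", "fatal error connection refused"]
texts = ["welcome to our homepage", "read more about our services"]
err_c, txt_c = train(errors, texts)
label, err_score, txt_score = predict("fatal error near line 7", err_c, txt_c)
```

Because both scores are sums of log-probabilities they are negative numbers, which is why the score closer to 0 indicates the more likely class.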
- numpy==1.24.3
- pandas==2.0.1
- python-dateutil==2.8.2
- pytz==2023.3
- six==1.16.0
- tzdata==2023.3
- python3 -m pip install -r requirements.txt
- Unzip TextAndErrorMessages-2_cleaned.7z or TextAndErrorMessages_cleaned.7z
- Build the pickled object with 1_Preprocess_and_serialize_dataset.py
NOTE: at line 11 in 1_Preprocess_and_serialize_dataset.py you have to specify the name of the unzipped CSV dataset file.
python3 1_Preprocess_and_serialize_dataset.py
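The preprocessing step boils down to splitting the CSV into the two sub-datasets and pickling the resulting word counts so later predictions skip retraining. A hypothetical sketch of that step, assuming a two-column CSV of text and label; the column order, label names, and output file name are illustrative, not the script's exact layout:

```python
import csv
import io
import pickle
from collections import Counter

def serialize_counts(csv_text, out_path="wordcounts.pickle"):
    """Split the dataset into error/non-error word counts and pickle them."""
    error_counts, text_counts = Counter(), Counter()
    for text, label in csv.reader(io.StringIO(csv_text)):
        # Assumed labels; adjust to the dataset's actual label values.
        target = error_counts if label == "ErrorMessage" else text_counts
        target.update(text.lower().split())
    with open(out_path, "wb") as f:
        pickle.dump({"error": error_counts, "text": text_counts}, f)
    return error_counts, text_counts

# Tiny in-memory sample instead of the real 84k/114k-line CSV.
sample = "page not found,ErrorMessage\r\nwelcome to the site,WebsiteText\r\n"
err, txt = serialize_counts(sample)
```

Loading a pre-built pickle at prediction time is what makes the sub-0.03-second runtime possible.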
- Now you can change the text to classify in the second-to-last line of 2_Predict_if_text_is_error_message_or_text.py, where the function predict_text() is called.
python3 2_Predict_if_text_is_error_message_or_text.py
The sentence "sorry, either you mistyped the url or we deleted that page, but let's agree to blame this on you." was successfully detected as an error message.
NOTE: the score closer to 0 wins. In the following example the Error Score is -91 and the Non-Error Score is -93; the Error Score is closer to 0, so the message is classified as an error message. The output also displays how biased each word is towards either of the two sub-datasets (error messages, non-error messages). From the scores we can see that "url" and "deleted" are strongly biased towards error messages rather than normal text.
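The per-word bias in the output can be thought of as comparing a word's smoothed likelihood under the two sub-datasets. A sketch of that idea with made-up counts (the real counts come from the pickled dataset, and the script's actual formula may differ):

```python
import math
from collections import Counter

# Illustrative counts only, not the real dataset statistics.
error_words = Counter({"url": 40, "deleted": 25, "page": 30, "you": 10})
text_words = Counter({"url": 5, "deleted": 2, "page": 20, "you": 60})

def word_bias(word):
    """Positive = leans towards error messages, negative = towards normal text."""
    err_total, txt_total = sum(error_words.values()), sum(text_words.values())
    err_p = (error_words[word] + 1) / (err_total + 1)   # add-one smoothing
    txt_p = (text_words[word] + 1) / (txt_total + 1)
    return math.log(err_p / txt_p)

for w in ["url", "deleted", "you"]:
    print(f"{w}: {word_bias(w):+.2f}")
```

With these toy counts, "url" and "deleted" get positive (error-leaning) bias while "you" gets a negative one.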
This section gives an overview of the performance and accuracy achieved by the trained model on both datasets. Enlarging and diversifying the dataset makes words more likely to appear in both sub-datasets, which can shift the bias associated with those words.
This model predicted 87% of the answers correctly when tested on thousands of never-before-seen sentences.
TextAndErrorMessages-2_cleaned.csv has 114,384 lines in total, of which 82,140 (72%) are website text and 32,244 (28%) are error messages.
TextAndErrorMessages-2_cleaned.csv contains Error messages from:
1. Android
2. C
3. C#
4. COBOL
5. DB2
6. Firebase
7. HPWorkstation
8. HSQL
9. HTTP
10. Java
11. JavaScript
12. Linux
13. macOS
14. MySQL
15. Oracle
16. Perl
17. PHP
18. Pointbase
19. PostgreSQL
20. Python
21. Solaris
22. SQLServer
23. Sybase
24. Windows
This model predicted 95% of the answers correctly when tested on thousands of never-before-seen sentences.
TextAndErrorMessages_cleaned.csv has 84,052 lines in total, of which 59,591 (71%) are website text and 24,461 (29%) are error messages.
TextAndErrorMessages_cleaned.csv contains Error messages from:
1. DB2
2. HSQL
3. HTTP
4. JavaScript
5. Linux
6. MySQL
7. Oracle
8. PHP
9. Pointbase
10. PostgreSQL
11. SQLServer
12. Sybase
13. Windows
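Accuracy figures like the 87% and 95% above are typically measured on a held-out split of the data. A generic sketch of such an evaluation; the classifier and data here are trivial placeholders, not the project's actual test harness:

```python
import random

def evaluate(classify, labeled_sentences, test_fraction=0.2, seed=0):
    """Accuracy of `classify` on a held-out split of (sentence, label) pairs."""
    rng = random.Random(seed)
    data = labeled_sentences[:]
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    train, test = data[:cut], data[cut:]
    # ...a real harness would train the model on `train` here...
    correct = sum(classify(s) == label for s, label in test)
    return correct / len(test)

# Trivial demo classifier: flags any sentence containing the word "error".
demo = [("fatal error occurred", "error"), ("welcome home", "text")] * 50
acc = evaluate(lambda s: "error" if "error" in s else "text", demo)
```

The key point is that the test sentences are never seen during training, so the reported percentage reflects generalization rather than memorization.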