BenderScript / DLPDataScraper

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

DLP Data Scraper and Generator

DLP Data Scraper and Generator is a Python tool designed to scrape and generate data for data loss prevention (DLP) purposes. It has two modules:

  • It fetches DLP test sample data from specified URLs, saves the data in text format, and then converts these text files into PDFs.
  • It uses an OpenAI Assistant to generate DLP mock data

The output is suitable for benchmarking DLP systems or Generative AI Language Learning Models (GenAI LLMs) for prompt injection testing.

Features

  • Web scraping from specified URLs.
  • Data extraction and saving in text format.
  • Conversion of text data to PDF format, ideal for benchmarking DLP systems or GenAI LLMs.

Installation

To install DLP Data Scraper, clone the repository and install the required packages:

git clone https://github.com/BenderScript/DLPDataScraper.git
cd DLPDataScraper/dlp_data_scraper
pip3 install -r requirements.txt

Usage

To use the DLP Data Scraper:

from dlp_data_scraper import Umbrella

pdf_data = "pdf_data"
text_data = "text_data"
url = ('https://support.umbrella.com/hc/en-us/articles/4402023980692-Data-Loss-Prevention-DLP-Test-Sample-Data-for'
       '-Built-In-Data-Identifiers')

scraper = Umbrella(url=url, text_data=text_data, pdf_data=pdf_data)
html_content = scraper.initialize_browser()
scraped_data = scraper.scrape_data()
scraper.save_data_to_files()
scraper.convert_txt_to_pdf()

print("Scraping and conversion to PDF completed.")

This example demonstrates initializing the scraper, scraping data from the specified URL, saving the data to text files, and then converting those text files to PDFs in specified directories.

Contributing

Contributions to DLP Data Scraper are welcome. Please feel free to submit pull requests or open issues to improve the project.


About

License:Apache License 2.0


Languages

Language:Python 99.2%Language:Shell 0.8%