KeepGrabbing
This is a collection of scripts for managing scrapers.
Summary
jsongen.rb
- Generates JSON files using a schema you specify. Can be used for anything, but it's good for making machine-readable lists of search terms
linkedin.rb
- Runs the LinkedIn scraper on a set of search terms in a JSON file
crypto
- Encrypts and decrypts all files in a directory with GPG
config
- Scripts for setting up and syncing a scraping machine
documents.rb
- Converts document files to JSON
emails.rb
- Converts email files to JSON
Detailed Instructions
jsongen.rb
Currently this only supports single-level (flat) JSON.
To Run:
- Run
ruby jsongen.rb
- Follow the prompts to enter the schema and the items manually
- Stop adding items by entering an item with all fields blank
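The input loop described above can be sketched as follows. This is only an illustration, not the code in jsongen.rb: the helper name `build_items` and the idea of passing rows as arrays (instead of reading them from prompts) are assumptions made to keep the example self-contained.

```ruby
require 'json'

# Hypothetical helper (not from jsongen.rb): turn a schema (a list of field
# names) and rows of answers into flat JSON items, stopping at the first
# row in which every field is blank.
def build_items(schema, rows)
  items = []
  rows.each do |row|
    # Missing answers are treated as empty strings.
    values = schema.each_index.map { |i| row[i].to_s.strip }
    break if values.all?(&:empty?) # an all-blank item ends the input
    items << schema.zip(values).to_h
  end
  items
end

schema = ['Search Term', 'Degrees']
rows = [['data journalism', '1'], ['open source', '2'], ['', ''], ['never read', '3']]
puts JSON.pretty_generate(build_items(schema, rows))
```

Because every item is a flat hash of field name to value, the result is single-level JSON, matching the limitation noted above.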
linkedin.rb
To run this, you need a JSON file in which every item has the following fields:
- Search Term: The phrase you want to search for
- Degrees: The number of degrees you want to go out with "people also viewed"
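As a rough illustration of that input format, the sketch below parses a small search-term list and reports items that lack one of the two fields. The helper `invalid_items` is hypothetical, and the exact key spelling and whether Degrees is stored as a number or a string are assumptions; check them against the file jsongen.rb actually produces.

```ruby
require 'json'

# Hypothetical check (not part of linkedin.rb): return the indexes of items
# that are missing, or have a blank value for, any required field.
def invalid_items(items, required = ['Search Term', 'Degrees'])
  items.each_index.reject do |i|
    required.all? { |field| !items[i][field].to_s.strip.empty? }
  end
end

# Example input; the second item has no Degrees field.
terms = JSON.parse(<<~TERMS)
  [
    { "Search Term": "data journalism", "Degrees": 1 },
    { "Search Term": "open source" }
  ]
TERMS

bad = invalid_items(terms)
puts bad.empty? ? 'all items look valid' : "items missing fields: #{bad.inspect}"
```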
To Run:
- Run
ruby linkedin.rb
- When prompted, type in the name of the file with the search terms
- When prompted, type in the name of the directory where you want to save results
- Wait. A new .json and .csv file will be generated for each search term
crypto/
- Encrypt files with encrypt.rb and decrypt with decrypt.rb.
Encrypting & Decrypting Files
Encrypting
- Run
ruby encrypt.rb
- When prompted, type the email address of a recipient (the recipient's key must already be imported into GPG). You can add as many recipients as you want
- Press Enter with the recipient left blank when you want to stop adding recipients
- When prompted, enter the path to the directory where you want to save results
- Wait as the files are encrypted
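For readers curious what encrypting one file to several recipients looks like at the GPG level, here is a minimal sketch. It is not taken from encrypt.rb: the helper name `gpg_encrypt_args`, the file names, and the choice of `--batch --yes` are assumptions. The example only prints the command; nothing is encrypted unless you pass the list to `system`.

```ruby
# Hypothetical helper (not from encrypt.rb): build the gpg argument list for
# encrypting one file to one or more recipients.
def gpg_encrypt_args(input_path, output_path, recipients)
  raise ArgumentError, 'at least one recipient is required' if recipients.empty?

  args = ['gpg', '--batch', '--yes', '--output', output_path]
  recipients.each { |r| args.push('--recipient', r) } # one flag per recipient
  args.push('--encrypt', input_path)
end

args = gpg_encrypt_args('notes.txt', 'results/notes.txt.gpg',
                        ['alice@example.com', 'bob@example.com'])
puts args.join(' ')
# To actually run the command: system(*args)
```

Passing the arguments as a list to `system` avoids shell quoting problems with file names that contain spaces.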
Decrypting
- Run
ruby decrypt.rb
- When prompted, enter the path to the directory containing the files you want to decrypt
- Enter the passphrase for your GPG key
- Wait as the files are decrypted
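The directory walk behind a bulk decrypt can be sketched like this. It assumes encrypted files carry a `.gpg` suffix and that the plaintext is written next to them with the suffix removed; decrypt.rb may use a different naming scheme, and the helper `decrypt_targets` is hypothetical. The example works on a temporary directory and only prints the commands.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not from decrypt.rb): pair every *.gpg file under dir
# with the plaintext path it would be written to.
def decrypt_targets(dir)
  Dir.glob(File.join(dir, '**', '*.gpg')).sort.map do |encrypted|
    [encrypted, encrypted.sub(/\.gpg\z/, '')]
  end
end

Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'results.json.gpg')) # stand-in for a real file
  decrypt_targets(dir).each do |encrypted, plain|
    # gpg asks for the key passphrase itself when the command is run.
    puts ['gpg', '--output', plain, '--decrypt', encrypted].join(' ')
  end
end
```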
config/
- Setup and syncing scripts for a scraping machine
Setup & Sync:
./install.sh
./sync.sh
Installing
- Install system dependencies for Debian
sudo apt-get install build-essential pkg-config curl libcurl3 libcurl3-gnutls \
  libcurl4-openssl-dev rmagic libmagickwand-dev imagemagick graphicsmagick \
  poppler-utils poppler-data ghostscript tesseract-ocr pdftk libreoffice
- Install Ruby dependencies by running the following from inside the directory
bundle install
- Run the document converter script
By default, documents and images are processed with the GiveMeText tool, but this IS NOT GOOD FOR SENSITIVE DOCUMENTS because it sends them in ordinary HTTP requests over the internet. To avoid that, you can run your own Tika server and convert the documents yourself.
Running
You can process either emails or normal text documents using the following scripts:
Documents
Run the script to convert documents to JSON. The first command uses the default converter; the second uses a local Tika instance
ruby documents.rb path/to/your/files/
ruby documents.rb --tika=http://localhost:9998 /path/to/your/documents
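To see what talking to a local Tika server involves, independent of how documents.rb does it, here is a minimal sketch that builds a text-extraction request. Tika server accepts a `PUT` to `/tika` with the file as the body and returns plain text when asked via the `Accept` header. The helper name `tika_request` is an assumption, and the example only builds the request; it sends nothing.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not from documents.rb): build a PUT request that asks
# a Tika server to extract plain text from the given file contents.
def tika_request(base_url, bytes)
  base = base_url.end_with?('/') ? base_url : "#{base_url}/"
  uri = URI.join(base, 'tika')
  request = Net::HTTP::Put.new(uri)
  request['Accept'] = 'text/plain' # ask for extracted text, not XHTML
  request.body = bytes
  [uri, request]
end

uri, request = tika_request('http://localhost:9998', 'example file contents')
puts "#{request.method} #{uri}"
# To send it to a running server:
# Net::HTTP.start(uri.host, uri.port) { |http| puts http.request(request).body }
```

Because the server runs on localhost, the document contents never leave the machine, which is the point of using `--tika` for sensitive files.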
Emails
Run the email script to convert emails to JSON
ruby emails.rb /path/to/your/emails
Attachments
If your emails generated an attachments/ folder, run the documents.rb script as described above to convert the attachments to JSON as well
ruby documents.rb --tika=http://localhost:9998 /path/to/your/emails_output/attachments