mrthlinh / SECGOV

Scrape data from https://www.sec.gov


SECGOV Data Scraper

Software Installation:

  1. Install Python >= 3.4:
  • Download the installer from https://www.python.org/getit/ and double-click it to run.
  • Select "Add Python to PATH", then click "Install Now".
  • Click "Next" or "Ok" to finish the installation.
  2. Firefox Driver:
  • Download the Firefox browser from https://www.mozilla.org/en-US/firefox/new/ and install it.
  • Unzip the geckodriver folder.
  • Add GeckoDriver to the Windows PATH as follows.
  • Press the "Windows" key, type "Edit the system environment variables", and hit Enter. In the Advanced tab, choose Environment Variables.
  • Under System Variables, find Path and double-click it to edit. If you are using Windows XP, append the new path by hand, preceded by ";" (don't forget the semicolon). For example, if the directory is "E:\SECGOV", add ";E:\SECGOV".
  • In the Edit environment variable window, press Browse... and choose the folder containing the unzipped GeckoDriver.
  • Hit "Enter" to finish the procedure.
  3. Install wkhtmltopdf:
  • Run wkhtmltox-0.12.5-1.msvc2015-win64.exe
  • Note the installation path of the program, usually C:/Program Files/wkhtmltopdf/bin
  • Add the wkhtmltopdf path to System Variables, as in the second step.
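
After editing PATH, it can help to confirm that both tools are actually reachable from a new command prompt. The following is a minimal sketch, not part of the repository: the executable names `geckodriver` and `wkhtmltopdf` are assumptions based on the tools named above, and `find_missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed executable names, based on the tools installed in the steps above.
REQUIRED_TOOLS = ["geckodriver", "wkhtmltopdf"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If a tool is reported as missing, open a new command prompt (PATH changes do not apply to windows that are already open) and check the folder added to the Path variable.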

Format Output

  • Columns with "exact": count only whole-word matches of the words in "listofword.txt". Matching is case-insensitive. For example, "retirement" does not match "postretirement".
  • Columns without "exact": also count matches inside longer words, so "postretirement" and "retirement" each count as 1.
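
The difference between the two kinds of columns can be illustrated with a short sketch. This is one plausible reading of the rules above, not the repository's actual code: `count_matches` is a hypothetical helper that treats "exact" as a case-insensitive whole-word match and the non-exact count as a case-insensitive substring match.

```python
import re


def count_matches(text, word):
    """Return (exact, partial) counts of `word` in `text`, ignoring case."""
    text = text.lower()
    word = word.lower()
    # Exact: the word must stand alone, bounded by non-word characters.
    exact = len(re.findall(r"\b" + re.escape(word) + r"\b", text))
    # Partial: every occurrence counts, even inside a longer word.
    partial = text.count(word)
    return exact, partial


sample = "Retirement plans and postretirement benefits"
print(count_matches(sample, "retirement"))  # (1, 2)
```

Here the exact count is 1 because only the standalone "Retirement" qualifies, while the non-exact count is 2 because "postretirement" also contains the search word.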

How to Run

  • install.bat: installs the required libraries. If you see "Windows protected your PC", choose "More info" then "Run anyway".
  • listofword.txt: defines your search criteria.
  • Compustat.csv: convert the Excel file to CSV before running.
  • RUN.bat: double-click this file to run the scraper.
  • download: folder containing the downloaded PDF files.
  • log: the log file. If there is a bug, please send me the log file and a screenshot.

Note: If something interrupts the process, press "Ctrl + C" repeatedly to terminate it.
