mrthlinh / FACTIVA

Scrape data from FACTIVA

FACTIVA Data Scraper

Update

7/20:

  1. Add merging functions when downloads are finished
  2. Add log function
  3. Delete "download_directory" in "config.json"
  4. Fix minor issues

Folder

  1. companyList: contains the lists of company names as CSV files
  2. json: program configuration
    • actions.json: defines the search criteria
    • config.json: defines directory paths (for example, "download directory") and the name of the "companyFile" in the folder "companyList"
    • error.json: records the failure points (auto-generated)
  3. download: each completed download consists of a folder of PDFs and a CSV file.
  4. temp: incomplete or interrupted files. These files are merged after the downloads finish.
  5. log: log files. If you find a bug, please send me the log file and a screenshot.
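
The merge step for the "temp" folder can be pictured with the minimal sketch below. It is not the repository's own merging function: it assumes the partial files are CSV files that share one header row, and the name merge_csv_parts and its parameters are hypothetical.

```python
import csv
import glob
import os


def merge_csv_parts(temp_dir, out_path):
    """Concatenate every *.csv in temp_dir into out_path, keeping one header.

    Assumption: all part files share the same header row.
    Write out_path outside temp_dir so it is never picked up as a part.
    Returns the number of data rows written.
    """
    parts = sorted(glob.glob(os.path.join(temp_dir, "*.csv")))
    header_written = False
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for part in parts:
            with open(part, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty part files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows += 1
    return rows
```

For example, merge_csv_parts("temp", os.path.join("download", "merged.csv")) would combine the partial files into a single CSV; the output file name here is only an illustration.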

Software Installation:

  1. Install Python >= 3.4:
    • Download the installer from https://www.python.org/getit/ and double-click it to run.
    • Select "Add Python to PATH", then click "Install Now".
    • Click "Next" or "OK" to finish the installation.
  2. Firefox driver (GeckoDriver):
    • Download the Firefox browser from https://www.mozilla.org/en-US/firefox/new/ and install it.
    • Unzip the geckodriver folder.
    • Add GeckoDriver to the Windows PATH: press the "Windows" key, type "Edit the system environment variables", and hit Enter. In the "Advanced" tab, choose "Environment Variables".
    • Under "System Variables", find "Path" and double-click it to edit. If you are using Windows XP, type ";" (don't forget the semicolon) followed by the new path. For example, my directory is "E:\Factiva", so I need to add ";E:\Factiva".
    • In the "Edit environment variable" window, press "Browse..." and choose the folder where you unzipped GeckoDriver.
    • Hit "Enter" to finish.
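
To confirm the installation before running the scraper, a small check like the sketch below may help. It is not part of the repository; the function name check_environment is hypothetical, and it only verifies the Python version and whether an executable named "geckodriver" can be found on PATH.

```python
import shutil
import sys


def check_environment(driver_name="geckodriver", min_python=(3, 4)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python {}.{} or newer is required".format(*min_python))
    # shutil.which searches the directories listed in PATH, like the shell does.
    if shutil.which(driver_name) is None:
        problems.append("{} was not found on PATH".format(driver_name))
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

If the script prints nothing, both checks passed. After editing PATH, open a new command prompt before re-running it, since already-open windows keep the old value.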

How to Run

  • install.bat: installs the required libraries. If you see "Windows protected your PC", choose "More info", then "Run anyway".
  • config.json: edit this file so that it matches the file name of your company list.
  • actions.json: edit this file to match your search criteria.
  • RUN-testSearch.bat: double-click to run this file. It tests your search criteria in actions.json.
  • RUN.bat: double-click to run this file. It loops over all the company names and downloads their files. If a download fails, re-run this file to continue the program.
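
As a rough illustration of how config.json and the company list fit together, the sketch below reads the "companyFile" entry and loads the names from the matching CSV in "companyList". Only the key "companyFile" and the folder names come from this README; the function name load_company_names and the one-name-per-row, first-column layout are assumptions, and the actual program may read the files differently.

```python
import csv
import json
import os


def load_company_names(json_dir="json", company_dir="companyList"):
    """Return the company names from the CSV that config.json points to."""
    # "companyFile" is the key named in this README; no other keys are assumed.
    with open(os.path.join(json_dir, "config.json"), encoding="utf-8") as f:
        config = json.load(f)
    csv_path = os.path.join(company_dir, config["companyFile"])

    names = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            # Assumption: one company per row, with the name in the first column.
            if row and row[0].strip():
                names.append(row[0].strip())
    return names
```

Running load_company_names() from the repository root is a quick way to check that the file name in config.json actually exists before starting a long download.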

Note: If something interrupts the process, press "Ctrl + C" repeatedly to terminate it.

Issues

  • 7/19: Add function "merging incomplete files"
  • 7/20: Download only works for the first company name

Fixed Issues

  • 7/19: Selecting "NOT" does not work in some cases

Languages

Python 99.5%, Batchfile 0.5%