jackvaughan09 / phil

Minimize the time requirement of audit report analysis with a containerized file conversion and scraping system

phil

A Dockerized environment, document conversion system, and automated scraping tool for extracting policy audit text data from .doc/.docx and .pdf files

INSTALLATION

For Windows Users

  1. Download and install Git Bash from https://git-scm.com/downloads

  2. Then ensure that you have WSL 2 on your machine.

    1. Check whether you have it:
# run this in PowerShell
wsl -l -v

# expected output
NAME        STATE        VERSION
something   something       2
    2. If the VERSION column does not show 2, or the command fails, you'll need to either install or upgrade WSL
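
The WSL check above can also be scripted from Git Bash. The following is a minimal sketch, not part of phil: `has_wsl2` is a hypothetical helper name, and it assumes `wsl.exe` is on your PATH. Because `wsl.exe` output is often UTF-16 encoded, the helper strips NUL bytes and carriage returns before looking at the VERSION column.

```shell
# Hypothetical helper (not part of phil): reads `wsl -l -v` style output on
# stdin and succeeds if any distribution row reports VERSION 2.
has_wsl2() {
  # Drop NUL bytes and carriage returns, then test the last column of each row.
  tr -d '\000\r' | awk '$NF == "2" { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage from Git Bash (assumes wsl.exe is available):
#   wsl.exe -l -v | has_wsl2 && echo "WSL 2 found" || echo "Install or upgrade WSL"
#
# If the check fails, these standard WSL commands (run in PowerShell) may help:
#   wsl --install                       # install WSL on a machine that lacks it
#   wsl --set-version <DistroName> 2    # convert an existing distribution to WSL 2
```

`<DistroName>` is a placeholder for whatever name appears in the NAME column on your machine.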

For Everyone

1. Install Docker

2. Clone the phil Git repository

  • in Git Bash (Windows) or your system terminal, run:

    git clone https://www.github.com/hudnash/phil.git

  • if your system doesn't have bash (very likely if you're on Windows), install Git Bash first

3. Create a data/zip folder in your phil directory

  • You can do this by running the following command from inside the phil directory (the -p flag also creates the data folder if it does not exist yet)
mkdir -p data/zip

4. Add some zip files from the audit website to the data/zip folder. You can find them here:

Philippines Audit Website

5. Run the version of the phil script that matches your processor

  • If you have an Intel or AMD processor, run
sh x64phil.sh
  • If you have an Apple Silicon chip or another arm64 CPU, run
sh arm64phil.sh
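
If you are unsure which script applies to your machine, the steps above can be sketched as a small wrapper. This is an illustrative sketch rather than something shipped with phil: `pick_phil_script` and `preflight` are hypothetical names, and the mapping assumes `uname -m` reports `x86_64`/`amd64` on Intel or AMD machines and `arm64`/`aarch64` on ARM machines.

```shell
# Hypothetical helper: map a machine architecture string (as printed by
# `uname -m`) to the matching phil script name from step 5.
pick_phil_script() {
  case "$1" in
    x86_64|amd64)  echo "x64phil.sh" ;;
    arm64|aarch64) echo "arm64phil.sh" ;;
    *)             return 1 ;;   # unknown architecture: choose manually
  esac
}

# Hypothetical preflight check for steps 1, 3, and 4: Docker is installed
# and data/zip contains at least one zip file.
preflight() {
  command -v docker >/dev/null 2>&1 || { echo "docker not found" >&2; return 1; }
  ls data/zip/*.zip >/dev/null 2>&1 || { echo "no zip files in data/zip" >&2; return 1; }
}

# Usage, from inside the phil directory:
#   preflight && sh "$(pick_phil_script "$(uname -m)")"
```

Keeping the architecture mapping in its own function makes it easy to check by hand, for example with `pick_phil_script "$(uname -m)"`, before anything is run.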

Languages

Jupyter Notebook 81.5%, Python 15.3%, Shell 1.4%, Dockerfile 1.4%, Makefile 0.4%