sfall / sec-file-parser

A parser used to extract text and info from SEC filings.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Approach
========
Once I read the projecrt description, I knew I'd be working with BeautifulSoup
since I was familiar with it. I first inspected the file in Google Chrome using
the developer tools and looked at the raw HTML file itself to better understand
how I'd develop the parser.

First, I preprocessed the files to speed up processing. I noticed that there
were a lot of style attributes; I stripped those since I wouldn't need them.
This reduced the file size by over 60%. I also converted '<br>' tags to '\n'.
This simplified the logic for formatting the 'document.txt' file.

To create document.txt, I went through the child tags in <body> and wrote
the text content to the file. Because of the <br> preprocessing, the spacing
was decent.

To create paragraphs.txt, I went again through all of the child tags in <body>
and looked for text content that had a '$' symbol but wasn't in a table. I initially
missed a lot of paragraphs due to unclosed <hr> tags that BeautifulSoup chose
to close at the end of the file, but I worked around that by using a 'dig' method
to process those nested tags.


Time On Project
===============
I completed the project over the course of 5 hours. I should note that there was
an hour-long break in my work.


Libraries
=========
I mostly relied on BeautifulSoup and Python builtins (re, json, sys).
I used the time module for timing purposes.


Instructions
============
Usage: python3 sec_parser.py <filename>


Output Files
============
The output files are located in the root of this repo, alongside this README file.

About

A parser used to extract text and info from SEC filings.


Languages

Language:HTML 99.8%Language:Python 0.2%