theOGognf / finagg

A Python package for aggregating and normalizing historical data from popular and free financial APIs.

File-based SEC EDGAR parser

theOGognf opened this issue

I currently implement the SEC EDGAR API, but the API is still relatively new and doesn't contain all the data that may be available through the SEC EDGAR historical data file archives. I think we'd want to use the file-based SEC EDGAR data as an alternative to the SEC EDGAR API in cases where a company's data can't be found through the API. Just glancing at how the data files are organized, I don't think it'd be too big of an effort to implement. A first implementation should probably have the following elements:

  • Methods for crawling through the index files for each year and quarter
  • Methods for parsing index files and storing them into SQL tables
  • Methods for getting filings based on an index entry
  • Methods for storing filings in SQL tables and querying them from SQL tables
  • Options for enabling the file-based methods as an alternative to the API methods in cases of errors
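The index-parsing and storage pieces of that list could look roughly like the sketch below. It is not finagg code: it assumes the quarterly index is the pipe-delimited "master" layout (a preamble, a `CIK|Company Name|Form Type|Date Filed|Filename` header, a dashed separator, then one row per filing), and the sample rows, the `filing_index` table, and the function names `parse_index`, `store_index`, and `filings_for` are all made up for illustration. Downloading the file is left out so the example runs offline.

```python
import sqlite3

# Made-up sample rows in the assumed pipe-delimited index layout.
SAMPLE_INDEX = """\
Description:           Master Index of EDGAR Dissemination Feed
Last Data Received:    March 31, 2020

CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000045|EXAMPLE CORP|10-Q|2020-02-10|edgar/data/1000045/0000000000-20-000001.txt
1000045|EXAMPLE CORP|8-K|2020-03-02|edgar/data/1000045/0000000000-20-000002.txt
1000099|SAMPLE HOLDINGS INC|10-K|2020-03-15|edgar/data/1000099/0000000000-20-000003.txt
"""


def parse_index(text):
    """Return (cik, company, form, date_filed, filename) tuples from index text."""
    lines = text.splitlines()
    # Data rows start right after the dashed separator under the column header.
    start = next(i for i, line in enumerate(lines) if line.startswith("-----")) + 1
    rows = []
    for line in lines[start:]:
        parts = line.split("|")
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        cik, company, form, date_filed, filename = (p.strip() for p in parts)
        rows.append((int(cik), company, form, date_filed, filename))
    return rows


def store_index(conn, rows):
    """Insert index rows into a hypothetical filing_index table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS filing_index (
               cik INTEGER, company TEXT, form TEXT,
               date_filed TEXT, filename TEXT PRIMARY KEY)"""
    )
    # The filename is unique per filing, so re-running a quarter is a no-op.
    conn.executemany(
        "INSERT OR IGNORE INTO filing_index VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()


def filings_for(conn, cik, form):
    """Return filing paths for one company and form type, oldest first."""
    cur = conn.execute(
        "SELECT filename FROM filing_index "
        "WHERE cik = ? AND form = ? ORDER BY date_filed",
        (cik, form),
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
store_index(conn, parse_index(SAMPLE_INDEX))
print(filings_for(conn, 1000045, "10-Q"))
```

A fallback option would then only need to call something like `filings_for` when the API lookup for a company comes back empty or raises.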

I messed around with this a bit. I'm not sure if this feature is quite worth the effort. The general workflow is as follows:

  • Use bs4 to crawl through the /Archives/edgar/full-index URL links and download the index table for each year-quarter pair to get the filing URLs for each company
  • Use bs4 to parse a filing and search through tags
  • Look through tags to get metadata about each XBRL tag
  • Store tags in their own table
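The last three steps of that workflow can be sketched as follows. To keep the example free of third-party dependencies it swaps bs4 for the standard library's `xml.etree.ElementTree`, which is enough when the filing contains a well-formed XBRL instance document. The sample document, the `xbrl_tags` table, and the function names `extract_facts` and `store_facts` are assumptions for illustration, and treating any element with a `contextRef` attribute as a fact is a simplification.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed XBRL instance document.
SAMPLE_XBRL = """\
<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
      xmlns:us-gaap="http://fasb.org/us-gaap/2020-01-31">
  <context id="FY2019">
    <entity><identifier scheme="http://www.sec.gov/CIK">1000045</identifier></entity>
    <period><startDate>2019-01-01</startDate><endDate>2019-12-31</endDate></period>
  </context>
  <unit id="USD"><measure>iso4217:USD</measure></unit>
  <us-gaap:Revenues contextRef="FY2019" unitRef="USD" decimals="-3">1250000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2019" unitRef="USD" decimals="-3">98000</us-gaap:NetIncomeLoss>
</xbrl>
"""


def extract_facts(xml_text):
    """Return (namespace, name, context, unit, value) for each tagged fact."""
    root = ET.fromstring(xml_text)
    facts = []
    for elem in root.iter():
        context = elem.get("contextRef")
        if context is None:
            continue  # contexts, units, and other structural elements
        tag = elem.tag
        # ElementTree spells namespaced tags as "{namespace-uri}LocalName".
        if tag.startswith("{"):
            namespace, name = tag[1:].split("}", 1)
        else:
            namespace, name = "", tag
        value = (elem.text or "").strip()
        facts.append((namespace, name, context, elem.get("unitRef"), value))
    return facts


def store_facts(conn, facts):
    """Insert extracted facts into a hypothetical xbrl_tags table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS xbrl_tags (
               namespace TEXT, name TEXT, context TEXT, unit TEXT, value TEXT)"""
    )
    conn.executemany("INSERT INTO xbrl_tags VALUES (?, ?, ?, ?, ?)", facts)
    conn.commit()


conn = sqlite3.connect(":memory:")
store_facts(conn, extract_facts(SAMPLE_XBRL))
for name, value in conn.execute("SELECT name, value FROM xbrl_tags"):
    print(name, value)
```

Real filings bundle several documents and may embed the facts as inline XBRL inside HTML, so a lenient parser such as bs4 would likely still be needed in front of a step like this.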

This is really straightforward, but I wonder if this is just replicating what the SEC EDGAR REST API is already doing behind the scenes. I'm going to pause development for now unless I find that this is not the case.

Settled on this not being worth the effort and will close this. Can be reopened if necessary.