aircup / Financial-data-collection-from-web-

A Python script that collects financial data from the Ju-Chao website. It can download PDF files from the site and, more importantly, parse the data you want out of those PDF files using pdfplumber.





Environment: Windows 10, Anaconda, Python 3.7, pdfplumber==0.5.12

(Do not install pdfminer if you have already installed pdfplumber. Doing so will break the environment, because pdfplumber uses a different version of pdfminer as its backend.)

  • original_data dir: files prepared for testing; you can use them by modifying the file paths in the .py files

  • download_files dir: the download directory, where files downloaded from the web are saved

  • output_files dir: the output directory, where you can find the files created by the .py files

  • downloads files from the web according to the URL links in .csv files; a URL link looks like:

  • is a formal script for getting PDF URL links from the Ju-Chao website; it creates a CSV file that stores URL links like:

  • is a multi-threaded project that gets the URLs, downloads the PDF files and parses them at the same time. To use it, you only need to change parameters such as "OUT_DIR", "START_DATE", ... to your own values.
    START_DATE = '2001' END_DATE = '2008' #str(time.strftime('%Y-%m-%d')) are the parameters that limit the years of the annual reports.
    table_keyword=['其他与经营活动','现金'] flags the data you want. It means that both '其他与经营活动' ("other, related to operating activities") and '现金' ("cash") must appear. You can change the keywords and the number of keywords, for example ['其他与经营活动','现金','支付','xxxx'] ('支付' means "paid").
    inside_keyword=['审计','咨询','中介'] lists the keywords you want to appear in a sentence. It means that '审计' ("audit") or '咨询' ("consulting") or '中介' ("intermediary") must appear. This is "or", not "and": one matching keyword is enough.
    outside_keyword=['收到'] lists the keywords you do not want to appear in a sentence. It means that a sentence containing '收到' ("received"), or any other keyword in this list, is not wanted.

  • is a script for analysing a single PDF file. To use it, you only need to change the PDF file path to your own. If the PDF file contains the information you want, it prints a message on the console like:
    " find咨询及审计费 value is 1,252,388 " " find in page75 "
    (here '咨询及审计费' means "consulting and audit fees")

  • is a script for parsing stock_id values from a CSV file. It can save a copy of an Excel file as a CSV file, and it returns a set of stock IDs with no repeated elements.

  • is a file for parsing PDF files that are stored under one directory. It finds all the PDF files under the directory and parses them one by one.
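
The keyword rules described above (all table keywords must appear, at least one inside keyword must appear, no outside keyword may appear) can be sketched as follows. This is only an illustration and not the code from this repository: the helper names `table_matches` and `row_matches` are hypothetical, and treating every entry of outside_keyword as an exclusion is an assumption based on the description.

```python
from typing import Iterable

# Parameter values taken from the description above.
table_keyword = ['其他与经营活动', '现金']
inside_keyword = ['审计', '咨询', '中介']
outside_keyword = ['收到']


def table_matches(table_text: str, keywords: Iterable[str]) -> bool:
    """Hypothetical helper: True only if every keyword appears ("and")."""
    return all(k in table_text for k in keywords)


def row_matches(sentence: str,
                wanted: Iterable[str],
                unwanted: Iterable[str]) -> bool:
    """Hypothetical helper: at least one wanted keyword ("or"),
    and none of the unwanted keywords (assumed interpretation)."""
    has_wanted = any(k in sentence for k in wanted)
    has_unwanted = any(k in sentence for k in unwanted)
    return has_wanted and not has_unwanted


if __name__ == '__main__':
    title = '支付的其他与经营活动有关的现金'
    rows = ['咨询及审计费', '收到的咨询费', '差旅费']
    if table_matches(title, table_keyword):
        for row in rows:
            if row_matches(row, inside_keyword, outside_keyword):
                print('find', row)
```

Adding a keyword to table_keyword makes the table test stricter, while adding one to inside_keyword makes the sentence test more permissive.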

Some testing files:

  • imports the other scripts as modules, so you can use them together as a simple processing pipeline.
  • is an example that uses the pdfminer module instead of pdfplumber for the analysis.
  • is just a test file that you can ignore; it can also be used for testing some functions.
  • and are examples showing techniques for using pdfplumber; you can preview them before running.
  • is also an example for testing ideas.
  • is also a test file that can be ignored.

Debug experience:

  • Use try-except to handle exceptions. Some exceptions may be raised when opening PDF files; wrap the call in a try-except block to skip the failing file and move on to the next one.
  • If you are downloading many PDF files, consider calling os.remove() on each file after it has been parsed.
  • There may be an error when accessing the Ju-Chao website to download many PDF files, because the site treats frequent access as a web attack. Use time.sleep or try-except to work around this problem. Error: ConnectionResetError: [WinError 10054] 远程主机强迫关闭了一个现有的连接 ("An existing connection was forcibly closed by the remote host")
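
The three tips above can be combined into a small sketch. It is not the code from this repository: `fetch_with_retry` and `parse_then_remove` are hypothetical helpers, the `fetch` and `parse` callables stand in for whatever download and parsing functions you use, and the retry count and delay are arbitrary assumptions.

```python
import os
import time
from typing import Callable, Optional


def fetch_with_retry(fetch: Callable[[], bytes],
                     retries: int = 3,
                     delay: float = 2.0) -> Optional[bytes]:
    """Hypothetical helper: call fetch(), sleeping and retrying on
    connection errors. ConnectionResetError is a subclass of OSError."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError as err:
            print('attempt', attempt, 'failed:', err)
            if attempt < retries:
                # Wait a little longer after each failure.
                time.sleep(delay * attempt)
    return None


def parse_then_remove(pdf_path: str,
                      parse: Callable[[str], object]) -> Optional[object]:
    """Hypothetical helper: parse one PDF, skip it if parsing fails,
    and delete the file afterwards to save disk space."""
    try:
        return parse(pdf_path)
    except Exception as err:
        # Ignore the broken file and let the caller move on to the next.
        print('skip', pdf_path, ':', err)
        return None
    finally:
        if os.path.exists(pdf_path):
            os.remove(pdf_path)
```

On Windows, os.remove() fails while a file is still open, so the `parse` callable should close the PDF before returning, for example by opening it in a `with pdfplumber.open(path) as pdf:` block.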

to be continued......



