iaminblacklist / fintech_spider

Based on Scrapy

FinTech Spider

FinTech (i.e. Financial Technology)

"FinTech Spider" is a Scrapy-based spider that crawls large amounts of financial data from the Internet.

The data crawled by "FinTech Spider" has been used by 嗅金牛 and 数知源.

Structure of "FinTech Spider"

Only the important directories and files are listed here.

| Directory/File | Author | Usage |
| --- | --- | --- |
| README.md | lxw | The document for this project |
| Anti_Anti_Spider/ | hee | |
| Demo/ | | Some demonstrations (e.g. PhantomJS, proxies, etc.) |
| Demo/ArticleSpider/ | hee | |
| Demo/CNKI_Patent/ | lxw | A demo Scrapy spider project that supports Selenium/PhantomJS/User-Agent/IP-Proxy |
| Demo/geetestcrack.py | hee | |
| Demo/phantomjs_proxy.py | lxw | Add an IP proxy in PhantomJS |
| Demo/user_agent.txt | hee | A large number of User-Agents |
| Spiders/ | | Python scripts that crawl the data we need from the Internet |
| Spiders/CJODocIDSpider/ | lxw | (w/ scrapy) Spiders for crawling data (case details) from 裁判文书网 (China Judgements Online) |
| Spiders/CJOSpider/ | lxw | (w/ scrapy) Spiders for crawling data (basic info) from 裁判文书网 (China Judgements Online) |
| Spiders/CninfoSpider/ | hee | Spiders for crawling data from 巨潮资讯 (cninfo) |
| Spiders/CNKI_Patent_Spider/ | lxw | (w/o scrapy) Spiders for crawling patent data from 知网 (CNKI) |
| Spiders/NECIPSSpider/ | lxw | (w/ scrapy) Spiders for crawling data from 国家企业信用信息公示系统 (National Enterprise Credit Information Publicity System) |
| Spiders/new_three_board/ | lxw | (w/ scrapy) Spiders for crawling data from 全国中小企业股份转让系统 (National Equities Exchange and Quotations, NEEQ) |
| Spiders/SBJSpider/ | hee | |
| Spiders/TYCSpider/ | lxw | (w/ scrapy, PhantomJS) Spiders for crawling patent/copyright data from 天眼查 (Tianyancha) |
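The User-Agent support mentioned for Demo/CNKI_Patent/, together with Demo/user_agent.txt, suggests a per-request User-Agent rotation. The following is a minimal sketch of how that could look as a Scrapy downloader middleware, not the code used in this repository. The names `load_user_agents` and `RandomUserAgentMiddleware` are hypothetical, and the assumption that user_agent.txt holds one User-Agent string per line is not confirmed by the source.

```python
import random


def load_user_agents(lines):
    """Return the non-empty, stripped lines as a list of User-Agent strings.

    Assumption: the input has one User-Agent per line.
    """
    return [line.strip() for line in lines if line.strip()]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: random User-Agent per request."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("at least one User-Agent is required")
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to keep processing the request
```

In a real project the list would typically be loaded with something like `load_user_agents(open("Demo/user_agent.txt"))`, and the middleware would be enabled through the `DOWNLOADER_MIDDLEWARES` setting.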

TODO

He Chen:

  1. Update README.md with the purpose of the key directories you have committed (if a subdirectory contains key files, please list them as well).

Xiaowei Liu:

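  • CJOSpider: the CJOSpider architecture has a problem. URL deduplication has been turned off, so the same pages may be crawled more than once.
  1. [Probably slightly better than rpush. Leaving this unchanged for now, since every alternative seems to have its own problems.] Change the proxy acquisition strategy to lpop() + insert (at the sixth position) instead of lpop() + rpush().
  2. [NO. In principle, simply re-running CJOSpider.py should be enough.] Add code to crawl the tasks in the Redis TASKS_HASH that have not finished crawling (there should always be fewer than CONCURRENT_REQUESTS of them?).
  3. [NO. In principle, simply re-running CJODocIDSpider.py should be enough.] Add code to crawl the tasks in the Redis DOC_ID_HASH that have not finished crawling.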
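The two proxy rotation strategies named in the proxy item of the TODO list can be compared with a small sketch. It uses a plain Python list as a stand-in for the Redis list, so no Redis calls are shown. The function names are hypothetical, and reading "the sixth position" as index 5 of the list after the pop is an assumption.

```python
def rotate_rpush(proxies):
    """lpop() + rpush(): take the head proxy and move it to the tail."""
    proxy = proxies.pop(0)
    proxies.append(proxy)
    return proxy


def rotate_insert(proxies, position=5):
    """lpop() + insert: take the head proxy and reinsert it at `position`.

    position=5 is the sixth slot (0-based). If the list is shorter than
    that, list.insert places the proxy at the end.
    """
    proxy = proxies.pop(0)
    proxies.insert(position, proxy)
    return proxy


pool = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(rotate_rpush(pool), pool)   # "a" goes to the back of the queue

pool = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(rotate_insert(pool), pool)  # "a" comes back after five other proxies
```

With rpush a proxy is reused only after every other proxy in the pool has been tried. With the insert variant it returns after a fixed number of other proxies, regardless of pool size, which means proxies near the tail of a large pool are reached much more slowly.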

Languages

Python 100.0%, Shell 0.0%