ACM-IEEE-arXiv Info Spider (Developing)

This project is part of my graduation project. It crawls structured information about papers from digital libraries.

A profile spider will be released soon.

Supported Libraries

ACM (Done, supports Digital Library search results)
IEEE (Developing, supports single pages)
arXiv (Done, supports all categories)
AAAI (Done, supports the 2009-2019 AAAI conferences)

Keywords: Python, Scrapy, MySQL, Papers

Dependencies & Requirements

Python 3.6
MySQL 8.0.17
scrapy
selenium
PhantomJS (optional, only needed for IEEE_Spider)
scrapy_proxies
pymysql
twisted
fake_useragent

Data Structure of Database

You can execute papers.sql to initialize the database.

MYSQL_DBNAME = 'papers'
TABLE_NAME = {'ACM_Data', 'IEEE_Data', 'arXiv_Data'}

attribute	data_type	length	not NULL
p_id	int	0	✅(key)
title	varchar	255
authors	varchar	2047
year	varchar	255
type	varchar	255
subjects	varchar	255
url	varchar	255
abstract	varchar	4095
citation	int	0
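
As a rough illustration of the table layout above, the sketch below builds a CREATE TABLE statement from the listed columns. It is not a replacement for papers.sql, which remains the authoritative schema. The helper name build_create_table is hypothetical, and whether p_id is auto-incremented is not stated here, so the sketch leaves that out.

```python
# Column spec copied from the table above: (name, MySQL type, length).
# A length of None means the type takes no length argument.
COLUMNS = [
    ("p_id", "int", None),
    ("title", "varchar", 255),
    ("authors", "varchar", 2047),
    ("year", "varchar", 255),
    ("type", "varchar", 255),
    ("subjects", "varchar", 255),
    ("url", "varchar", 255),
    ("abstract", "varchar", 4095),
    ("citation", "int", None),
]


def build_create_table(table_name):
    """Return a CREATE TABLE statement for one of the per-library tables."""
    lines = []
    for name, sql_type, length in COLUMNS:
        col_type = f"{sql_type.upper()}({length})" if length else sql_type.upper()
        # Only p_id is marked NOT NULL (and is the key) in the table above.
        suffix = " NOT NULL" if name == "p_id" else ""
        lines.append(f"  `{name}` {col_type}{suffix}")
    lines.append("  PRIMARY KEY (`p_id`)")
    body = ",\n".join(lines)
    return f"CREATE TABLE IF NOT EXISTS `{table_name}` (\n{body}\n);"


if __name__ == "__main__":
    for table in ("ACM_Data", "IEEE_Data", "arXiv_Data"):
        print(build_create_table(table))
```

Running the script prints one statement per table name listed in TABLE_NAME.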

Features

A script runs automatically to fetch free proxies (HTTP only); it will be integrated into the Scrapy-based main program soon.
Every request uses a randomly chosen proxy and user agent.
Data can be stored as a TXT file, as raw JSON (not strictly valid JSON), or in MySQL.
Optional level-based logging is provided.
The MySQL pipeline stores data asynchronously, so the program stays efficient and reliable when the spider produces a flood of data.
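
The asynchronous MySQL storage mentioned above can be sketched roughly as follows, using Twisted's adbapi connection pool with pymysql. This is a minimal sketch and not necessarily how the project's own pipeline is written. The class name AsyncMySQLPipeline, the helper build_insert, and the setting names MYSQL_HOST, MYSQL_USER and MYSQL_PASSWORD are assumptions for illustration; only MYSQL_DBNAME and the column names come from this README.

```python
# Columns of the paper tables, in the order listed in this README.
FIELDS = ("p_id", "title", "authors", "year", "type",
          "subjects", "url", "abstract", "citation")


def build_insert(table, item):
    """Build a parameterized INSERT statement; missing fields become NULL."""
    columns = ", ".join(f"`{name}`" for name in FIELDS)
    placeholders = ", ".join(["%s"] * len(FIELDS))
    sql = f"INSERT INTO `{table}` ({columns}) VALUES ({placeholders})"
    params = tuple(item.get(name) for name in FIELDS)
    return sql, params


class AsyncMySQLPipeline:
    """Item pipeline sketch: inserts run on Twisted's database thread pool."""

    def __init__(self, db_params, table="arXiv_Data"):
        self.db_params = db_params
        self.table = table
        self.dbpool = None

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        db_params = {
            # Hypothetical setting names, except MYSQL_DBNAME.
            "host": settings.get("MYSQL_HOST", "localhost"),
            "user": settings.get("MYSQL_USER", "root"),
            "password": settings.get("MYSQL_PASSWORD", ""),
            "database": settings.get("MYSQL_DBNAME", "papers"),
            "charset": "utf8mb4",
        }
        return cls(db_params)

    def open_spider(self, spider):
        # Imported here so build_insert stays usable without Twisted installed.
        from twisted.enterprise import adbapi
        self.dbpool = adbapi.ConnectionPool("pymysql", **self.db_params)

    def close_spider(self, spider):
        self.dbpool.close()

    def process_item(self, item, spider):
        sql, params = build_insert(self.table, dict(item))
        # runInteraction executes _insert in a worker thread and returns a Deferred.
        deferred = self.dbpool.runInteraction(self._insert, sql, params)
        deferred.addErrback(self._handle_error, spider)
        return item  # returned immediately; the insert completes in the background

    @staticmethod
    def _insert(cursor, sql, params):
        cursor.execute(sql, params)

    @staticmethod
    def _handle_error(failure, spider):
        spider.logger.error("MySQL insert failed: %s", failure)
```

Because process_item returns before the insert finishes, a slow database does not block the crawl; failed inserts only show up in the log through the errback.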

Install & Run

Customize the settings before launching Scrapy. To run IEEE_Spider, you also need to add the JS middleware based on selenium and PhantomJS.
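
For orientation, a settings file for this kind of setup might look roughly like the sketch below. The scrapy_proxies options follow that package's documented configuration. The proxy list path, the MySQL host, user and password setting names, and the commented-out JS middleware path are placeholders and assumptions, not values taken from this project.

```python
# settings.py (sketch; adjust to the actual project layout)

# scrapy_proxies: retry failed requests and pick a proxy at random.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
PROXY_LIST = "proxies.txt"  # placeholder path to the downloaded proxy list
PROXY_MODE = 0              # 0 = a different random proxy for every request

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
    # Hypothetical module path; enable only when running IEEE_Spider.
    # "myproject.middlewares.JSMiddleware": 543,
}

# MySQL connection. MYSQL_DBNAME comes from this README; the other
# setting names are assumptions.
MYSQL_DBNAME = "papers"
MYSQL_HOST = "localhost"
MYSQL_USER = "root"
MYSQL_PASSWORD = ""

# Log level used by Scrapy (DEBUG, INFO, WARNING, ERROR or CRITICAL).
LOG_LEVEL = "INFO"
```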

In a terminal:

scrapy crawl ACM_Spider

or

scrapy crawl IEEE_Spider

etc.

Development in Progress

IEEE Spider (The HTML is JS-dynamic.)
arXiv (easy)
Proxy Downloader Integration
MongoDB Storage
More Robust XPath Rules
UUID for Database
Crawl Specific Pages

Bugs Found (Ask for help)

arXiv_Spider returns no results when too many requests are sent.
The pipeline encounters a MySQL error.
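
If the empty arXiv results are caused by rate limiting (an assumption; the cause has not been diagnosed here), one possible mitigation is to slow the spider down with Scrapy's built-in throttling settings. The values and the name ARXIV_THROTTLE_SETTINGS below are illustrative guesses, not tested against arXiv.

```python
# Possible per-spider overrides; the keys are standard Scrapy setting names,
# the values are untested guesses.
ARXIV_THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 3,                  # seconds between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 3,
    "AUTOTHROTTLE_MAX_DELAY": 60,
    "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
}
```

The dictionary can be assigned to the custom_settings class attribute of the arXiv spider so that the other spiders keep their own settings.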

About

Crawl information of papers from ACM/IEEE/arXiv/AAAI digital library.

Apache License 2.0
