prirai/rare-diseases-data-scraping

This repository aims to be a central place for all data scraping and analysis related to rare diseases.


Data is scraped from https://rarediseases.info.nih.gov.

data1.csv contains detailed descriptions of the diseases (currently incomplete).

Files and their purpose

| File | Purpose |
| --- | --- |
| links.md | Links of interest |
| disease_links_scraper.py | Extracts the full disease list from https://rarediseases.info.nih.gov/diseases |
| disease_links.csv | Data extracted by the script above |
| scrape_specific_page.ipynb | Scrapes a given page for disease details (name, symptoms and causes) |
| reqmul.py | Multithreaded requests; saves all 5910 pages offline as HTML for easier scraping later |
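The link-extraction step can be sketched with the standard library alone. This is not the contents of disease_links_scraper.py (the real script may well use requests and BeautifulSoup), and the `/diseases/` path filter is an assumption about the site's URL scheme:

```python
import csv
from html.parser import HTMLParser


class DiseaseLinkParser(HTMLParser):
    """Collects hrefs that look like individual disease pages.

    The '/diseases/' substring filter is an assumption, not taken
    from the actual scraper.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "/diseases/" in href:
                self.links.append(href)


def extract_links(html_text):
    """Return all disease-page links found in one HTML document."""
    parser = DiseaseLinkParser()
    parser.feed(html_text)
    return parser.links


def save_links(links, path="disease_links.csv"):
    """Write the collected links to a one-column CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["link"])
        for link in links:
            writer.writerow([link])
```

A single-column CSV keeps the download step trivial: each row is one URL to fetch.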

Contents of the pages folder

This directory contains the downloaded pages whose links were already scraped into disease_links.csv. The naming convention is simply the part of the corresponding URL after the last /. reqmul.py checks for previously downloaded files and doesn't overwrite them. The current script can fetch roughly 700 pages before the server starts blocking requests. Previous versions were worse, managing only 100 to at most 200 pages per run, and slowly at that. wget, axel, aria2 and selenium have already been tried.
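The downloader's skip-and-name logic described above can be sketched as follows. This is a minimal stand-in for reqmul.py, not its actual code: the worker count, timeout, and use of urllib instead of requests are assumptions, and no retry or rate-limit handling is shown:

```python
import concurrent.futures
import os
import urllib.request


def slug(url):
    """Naming convention: the part of the URL after the last '/'."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def fetch(url, out_dir="pages"):
    """Download one page; skip files that already exist (no overwrite)."""
    path = os.path.join(out_dir, slug(url))
    if os.path.exists(path):  # previously downloaded, leave it alone
        return path, "skipped"
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = resp.read()
    with open(path, "wb") as f:
        f.write(data)
    return path, "downloaded"


def fetch_all(urls, workers=8):
    """Fetch many pages concurrently with a thread pool."""
    os.makedirs("pages", exist_ok=True)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Because the skip check runs before the request, re-running the script after a server block resumes where the last run stopped.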

The "Disease at a glance" section, number of people affected, symptoms, categories, ages and causes were scraped for all diseases from the offline HTML pages (code in page_details_extractor-offline.ipynb) and saved to disease_details.csv.
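The offline extraction loop can be sketched like this. The notebook itself isn't reproduced here; the column names are taken from the fields listed above, but the per-field extraction is a placeholder (only the page title is pulled) since the NIH page markup isn't shown in this README:

```python
import csv
import os
import re

# Column names taken from the fields described above.
FIELDS = ["name", "at_a_glance", "people_affected", "symptoms",
          "categories", "ages", "causes"]


def extract_details(html_text):
    """Placeholder extraction: uses the <title> tag as the disease name.

    The remaining fields depend on the NIH page markup; a real
    extractor would parse the relevant sections (e.g. with
    BeautifulSoup) rather than a title regex.
    """
    row = dict.fromkeys(FIELDS, "")
    m = re.search(r"<title>(.*?)</title>", html_text, re.S | re.I)
    if m:
        row["name"] = m.group(1).strip()
    return row


def build_csv(pages_dir="pages", out_path="disease_details.csv"):
    """Walk the offline pages and write one CSV row per disease."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for fname in sorted(os.listdir(pages_dir)):
            with open(os.path.join(pages_dir, fname), encoding="utf-8") as page:
                writer.writerow(extract_details(page.read()))
```

Working from the saved files means the extraction logic can be re-run and refined without hitting the server again.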



Languages

Jupyter Notebook 85.5%, Python 14.5%