marswh12312313 / GeneSumCrawler

Python-based web scraper for extracting gene summaries from GeneCards.

GeneCards Web Scraper

Project Overview

This project contains a Python script for web scraping, designed to extract summary information about genes from the GeneCards website. The script reads a list of genes from a CSV file, accesses each gene's page on GeneCards, extracts specific summary information, and saves this information to another CSV file.
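The read, visit, extract, and save flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in gene_spider.py: the URL pattern, the column names `gene` and `summary`, and the helper names `gene_url`, `collect_summaries`, and `write_summaries` are all hypothetical, and the page-fetching step is passed in as a function so the sketch does not depend on a browser.

```python
import csv
from typing import Callable, Iterable
from urllib.parse import urlencode

# Assumed URL pattern for a GeneCards gene page; the real script may differ.
GENECARDS_URL = "https://www.genecards.org/cgi-bin/carddisp.pl"


def gene_url(symbol: str) -> str:
    """Build the page URL for one gene symbol."""
    return f"{GENECARDS_URL}?{urlencode({'gene': symbol})}"


def collect_summaries(
    genes: Iterable[str], fetch_summary: Callable[[str], str]
) -> list[dict]:
    """Call fetch_summary on each gene's URL and collect one row per gene."""
    rows = []
    for gene in genes:
        try:
            summary = fetch_summary(gene_url(gene))
        except Exception as exc:  # record the failure and keep going
            summary = f"ERROR: {exc}"
        rows.append({"gene": gene, "summary": summary})
    return rows


def write_summaries(rows: list[dict], out_path: str) -> None:
    """Write the collected rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["gene", "summary"])
        writer.writeheader()
        writer.writerows(rows)
```

In a real run, `fetch_summary` would be whatever function loads the page (for example with Selenium) and pulls the summary text out of it (for example with BeautifulSoup). Keeping it separate makes the loop easy to try out with a stub.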

Features

  • Extracts summary information for specified genes.
  • Reads gene lists from a CSV file.
  • Outputs results in a CSV file for easy analysis.

How to Use

Prerequisites

Installation Steps

  1. Clone the repository to your local machine:
git clone https://github.com/marswh12312313/GeneSumCrawler.git
  2. Install the dependencies:
pip install selenium beautifulsoup4 pandas
  3. Set up GeckoDriver:
  • The repository includes a GeckoDriver binary for Linux, located in the root directory.
  • On Windows or macOS, download the matching build from GeckoDriver Releases, then either replace the file in the root directory or update the script to point to where you installed it.
  • In either case, make sure the GeckoDriver executable is on your system's PATH, or that the script contains the correct path to it.
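If you need to point the script at a specific GeckoDriver location, one possible approach is sketched below. It assumes Selenium 4 and that the bundled file is named `geckodriver` (or `geckodriver.exe` on Windows). The helper names `find_geckodriver` and `make_firefox_driver` are hypothetical and do not come from gene_spider.py.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_geckodriver(repo_root: str = ".") -> Optional[str]:
    """Return a GeckoDriver path: a file in repo_root first, then the system PATH."""
    for name in ("geckodriver", "geckodriver.exe"):
        bundled = Path(repo_root) / name
        if bundled.is_file():
            return str(bundled.resolve())
    # Falls back to PATH lookup; returns None when nothing is found.
    return shutil.which("geckodriver")


def make_firefox_driver(driver_path: str):
    """Start Firefox through Selenium 4 with an explicit GeckoDriver path."""
    # Imported lazily so the path lookup above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    return webdriver.Firefox(service=Service(executable_path=driver_path))
```

Calling `find_geckodriver()` from the repository root and passing the result to `make_firefox_driver` would cover both cases described above. A `None` result means GeckoDriver was found neither in the directory nor on PATH.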

Running the Script

  1. Save your list of genes in a file named genelist.csv in the root directory of the repository, with one gene name per line.

  2. Run the script:

python gene_spider.py
  3. The results are saved to a file named gene_summaries.csv.
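Before running the script, it can help to check that genelist.csv has the expected shape. The sketch below reads one gene name per line, trims whitespace, and skips blank lines and repeated names. It assumes the file has no header row. The function name `load_gene_list` is hypothetical, and the script itself may read the file differently (for example with pandas).

```python
import csv


def load_gene_list(path: str) -> list[str]:
    """Read one gene name per line, skipping blanks and repeated names."""
    genes, seen = [], set()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if not row:  # empty line
                continue
            name = row[0].strip()
            if name and name not in seen:
                seen.add(name)
                genes.append(name)
    return genes
```

For example, `print(load_gene_list("genelist.csv"))` shows the names that would be looked up, in their original order.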

License

MIT

Contributions

Contributions are welcome! Please submit pull requests or open issues to discuss proposed changes.

Contact

To get in touch, open an issue on the GitHub repository.
