GeneCards Web Scraper

Project Overview

This project contains a Python script for web scraping, designed to extract summary information about genes from the GeneCards website. The script reads a list of genes from a CSV file, accesses each gene's page on GeneCards, extracts specific summary information, and saves this information to another CSV file.
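The workflow described above can be sketched roughly as follows. This is an illustrative outline, not the code in gene_spider.py: the names gene_url, collect_summaries, and fetch_summary are hypothetical, and the URL pattern is an assumption about how GeneCards addresses gene pages.

```python
from typing import Callable, Iterable
from urllib.parse import urlencode

# Assumed URL pattern for a GeneCards gene page; verify it against the site.
GENECARDS_URL = "https://www.genecards.org/cgi-bin/carddisp.pl"


def gene_url(symbol: str) -> str:
    """Build the page URL for one gene symbol."""
    return f"{GENECARDS_URL}?{urlencode({'gene': symbol.strip()})}"


def collect_summaries(
    genes: Iterable[str],
    fetch_summary: Callable[[str], str],
) -> list[dict[str, str]]:
    """Visit each gene page and return one row per gene.

    fetch_summary is a hypothetical callable that takes a URL and returns
    the summary text; in a real scraper this is where Selenium and
    BeautifulSoup would do their work.
    """
    rows = []
    for gene in genes:
        gene = gene.strip()
        if not gene:
            continue  # skip blank entries in the input
        try:
            summary = fetch_summary(gene_url(gene))
        except Exception as exc:  # keep going if one page fails
            summary = f"ERROR: {exc}"
        rows.append({"gene": gene, "summary": summary})
    return rows
```

Passing the page-fetching step in as a function keeps the loop easy to try out with a stub before any browser is involved.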

Features

Extracts summary information for specified genes.
Reads gene lists from a CSV file.
Outputs results in a CSV file for easy analysis.

How to Use

Prerequisites

Python 3 with pip.
Firefox (GeckoDriver is the WebDriver for Firefox).

Installation Steps

Clone the repository to your local machine:

git clone https://github.com/marswh12312313/GeneSumCrawler.git

Install the dependencies:

pip install selenium beautifulsoup4 pandas

GeckoDriver:

The repository includes a GeckoDriver binary for Linux, located in the root directory.
If you are using Windows or macOS, download the matching GeckoDriver build from the GeckoDriver Releases page (https://github.com/mozilla/geckodriver/releases) and replace the file in the root directory.
Alternatively, install GeckoDriver somewhere on your system's PATH, or update the script with the path to your GeckoDriver executable.
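If you need to adapt the driver path yourself, a sketch along these lines may help. It assumes Selenium 4, and the helper names find_geckodriver and make_driver are hypothetical; the script in this repository may set up the driver differently.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_geckodriver(local_path: str = "./geckodriver") -> Optional[str]:
    """Return a usable GeckoDriver path, or None if none is found.

    Prefers a binary at local_path (for example the one in the repository
    root), then falls back to any `geckodriver` found on PATH.
    """
    candidate = Path(local_path)
    if candidate.is_file():
        return str(candidate.resolve())
    return shutil.which("geckodriver")


def make_driver(driver_path: str):
    """Start headless Firefox through the given GeckoDriver (Selenium 4 API)."""
    # Imported here so the path helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    options = Options()
    options.add_argument("-headless")  # run without a visible browser window
    service = Service(executable_path=driver_path)
    return webdriver.Firefox(service=service, options=options)
```

On Windows the binary is normally named geckodriver.exe, so pass that path explicitly as local_path.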

Running the Script

Save your list of genes in a file named genelist.csv in the root directory, with one gene name per line.
Run the script:

python gene_spider.py

The results will be saved in a file named gene_summaries.csv.
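For reference, the input and output files could be handled as in the sketch below. It uses the standard csv module for brevity, whereas the project lists pandas as a dependency. The assumptions that genelist.csv has no header row and that the output columns are named gene and summary are illustrative, so check them against gene_spider.py.

```python
import csv


def read_gene_list(path: str = "genelist.csv") -> list[str]:
    """Read gene names, one per line, ignoring blank lines."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [
            row[0].strip()
            for row in csv.reader(handle)
            if row and row[0].strip()
        ]


def write_summaries(rows: list[dict[str, str]], path: str = "gene_summaries.csv") -> None:
    """Write one (gene, summary) row per gene, preceded by a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # Column names are assumed for illustration.
        writer = csv.DictWriter(handle, fieldnames=["gene", "summary"])
        writer.writeheader()
        writer.writerows(rows)
```

Writing through the csv module quotes summaries that contain commas, so each summary stays in a single column when the file is opened in a spreadsheet.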

License

MIT

Contributions

Contributions are welcome! Please submit pull requests or open issues to discuss proposed changes.

Contact

For questions or feedback, open an issue on the GitHub repository.
