RPA News Scraper

This News Scraper showcases my ability to build a bot for process automation.

Overview

This project uses Robotic Process Automation (RPA) to extract news articles from a news website. It relies on the RPA framework and Selenium for web automation.

Features

Search for news articles based on a specified search phrase.
Filter news articles by category, section, or topic.
Extract data such as title, date, description, and picture URL for each news article.
Store extracted data in an Excel file for further analysis or reporting.
Download images associated with news articles and link them to the corresponding Excel entry.
Count occurrences of the search phrase in the title and description of each news article.
Identify if the title or description contains any monetary values.
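
The last two features can be illustrated with a short sketch. It is not taken from the project source: the function names and the regular expression are hypothetical, and because this README does not say which currency formats are recognized, the pattern assumes US dollar forms such as $11.1, $111,111.11, 11 dollars, and 11 USD.

```python
import re

# Assumed money formats (not specified in this README):
# $11.1, $111,111.11, 11 dollars, 11 USD.
MONEY_PATTERN = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?|\b\d[\d,]*(?:\.\d+)?\s*(?:dollars|USD)\b",
    re.IGNORECASE,
)


def count_phrase(phrase: str, title: str, description: str) -> int:
    """Count case-insensitive, non-overlapping matches in both fields."""
    if not phrase:
        return 0
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return len(pattern.findall(title)) + len(pattern.findall(description))


def contains_money(title: str, description: str) -> bool:
    """Return True if either field contains one of the assumed money formats."""
    return bool(MONEY_PATTERN.search(title) or MONEY_PATTERN.search(description))
```

Escaping the phrase with re.escape keeps characters such as "." or "+" in a search phrase from being read as regex syntax.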

Installation

Clone the repository to your local machine:

```
git clone https://github.com/tony-rsa/rpa-news-scraper.git
```

Navigate to the project directory:
```
cd rpa-news-scraper
```
Install the necessary dependencies:
```
pip install -r requirements.txt
```
Download the appropriate WebDriver for your browser (e.g., ChromeDriver for Google Chrome) and place it in the project directory.

Usage

Open the config.ini file and set the desired parameters:
- search_phrase: The keyword or phrase to search for in news articles.
- news_category: Optional parameter to filter news articles by category, section, or topic.
- num_months: Specifies the number of months for which to retrieve news articles (0 or 1 for the current month, 2 for the current and previous month, and so on).
Run the main Python script to start the automation process:
```
python main.py
```
After execution, the extracted data will be saved in an Excel file (news_data.xlsx) located in the output directory.
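
The num_months rule described above can be turned into a cutoff date. The sketch below is one possible interpretation, not the project's actual code: the function name is hypothetical, and it assumes that "current month" means the calendar month starting on its first day.

```python
from datetime import date


def earliest_date(num_months: int, today: date) -> date:
    """Return the first day of the oldest calendar month to include."""
    # 0 and 1 both mean "current month only", so go back zero months.
    months_back = max(num_months, 1) - 1
    # Work in total months so the subtraction can cross a year boundary.
    total = today.year * 12 + (today.month - 1) - months_back
    return date(total // 12, total % 12 + 1, 1)
```

Articles dated before the returned value would fall outside the requested window.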

Parameters

Search Phrase: The keyword or phrase to search for in news articles.
News Category/Section/Topic: Optional parameter to filter news articles by category, section, or topic.
Number of Months: Specifies the number of months for which to retrieve news articles (0 or 1 for the current month, 2 for the current and previous month, and so on).

These parameters can be provided via the config.ini file or as command-line arguments.
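
A minimal sketch of how the two sources could be combined, using only the standard library. The [settings] section name, the command-line flag names, the defaults, and the rule that command-line values override config.ini are all assumptions made for illustration; this README does not specify them.

```python
import argparse
import configparser


def load_settings(config_text: str, argv: list) -> dict:
    """Merge config.ini text with command-line arguments (CLI wins)."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    # Assumed section name; fall back to an empty mapping if it is absent.
    section = config["settings"] if config.has_section("settings") else {}

    # Assumed flag names, mirroring the config keys.
    cli = argparse.ArgumentParser(description="RPA News Scraper settings")
    cli.add_argument("--search-phrase")
    cli.add_argument("--news-category")
    cli.add_argument("--num-months", type=int)
    args = cli.parse_args(argv)

    def pick(cli_value, key, default):
        # Prefer the command-line value, then config.ini, then the default.
        return cli_value if cli_value is not None else section.get(key, default)

    return {
        "search_phrase": pick(args.search_phrase, "search_phrase", ""),
        "news_category": pick(args.news_category, "news_category", ""),
        "num_months": int(pick(args.num_months, "num_months", 1)),
    }
```

In a script, the text could come from reading config.ini and the argument list from sys.argv[1:].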

Directory Structure

src/: Contains the main Python script for the RPA News Scraper.
output/: Directory to store output files such as Excel files and downloaded images.
tests/: Directory containing unit tests for the RPA News Scraper.

Contributing

Contributions are welcome! If you have any suggestions, improvements, or feature requests, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
