This project uses Playwright, a browser automation library, to scrape data from the HPRERA website. It opens the site, identifies project cards, and retrieves details such as the project name, PAN, GSTIN, and permanent address from each card. The number of cards to read is configurable, and the extracted data is printed to the console for further processing or analysis.
- Python: Version 3.10
- Libraries:
  - playwright
- Clone the repository:
    git clone <repository-url>
    cd <repository-directory>
- Create a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install the required packages:
    pip install playwright
    playwright install
To run the script, use the following command:
python script.py
Replace script.py with the actual name of your script file.
The script performs the following actions:
- Launches a headless Chromium browser.
- Navigates to the HPRERA Public Dashboard.
- Waits for the registered projects section to load.
- Selects and iterates through project cards to extract details.
- Prints the extracted project details to the console.
- Closes the browser.
- extract_data_from_card(page, card_div): Extracts and prints project details from an individual project card.
- BASE_URL: The URL of the HPRERA Public Dashboard.
- REGISTERED_PROJECTS_SELECTOR, CARDS_SELECTOR, CARD_CLASS_SELECTOR, CARD_NAME_SPAN_SELECTOR, CARD_DETAILS_BUTTON_SELECTOR, CARD_DETAILS_ID_SELECTOR, CARD_DETAILS_TABLE_ROW_SELECTOR, CARD_DETAILS_NAME_SELECTOR, CARD_DETAILS_DATA_SELECTOR, CARD_DETAILS_CLOSE_BUTTON_SELECTOR: Various CSS selectors used to locate elements on the page.
- TIMEOUT: Timeout value for waiting for elements to load.
- NUMBER_OF_CARDS_TO_READ: The number of project cards to read and extract data from.
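The actions, the function, and the constants described above might fit together roughly as in the sketch below. It is an illustration, not the project's actual script: the dashboard URL, every selector value, the `td` cell lookup, the numeric values of TIMEOUT and NUMBER_OF_CARDS_TO_READ, and the helper names `pick_details`, `WANTED_LABELS`, and `main` are all assumptions made for this example. Only a subset of the selector constants is used.

```python
# Placeholder values only: the real URL and selectors live in the script.
BASE_URL = "https://hprera.nic.in/PublicDashboard"            # assumed URL
REGISTERED_PROJECTS_SELECTOR = "#registered-projects"         # hypothetical
CARDS_SELECTOR = "#registered-projects .card"                 # hypothetical
CARD_NAME_SPAN_SELECTOR = "span.card-name"                    # hypothetical
CARD_DETAILS_BUTTON_SELECTOR = "a.details"                    # hypothetical
CARD_DETAILS_TABLE_ROW_SELECTOR = "#details table tr"         # hypothetical
CARD_DETAILS_CLOSE_BUTTON_SELECTOR = "#details button.close"  # hypothetical
TIMEOUT = 60_000             # assumed; Playwright timeouts are in milliseconds
NUMBER_OF_CARDS_TO_READ = 6  # assumed value

WANTED_LABELS = ("Name", "PAN No.", "GSTIN No.", "Permanent Address")


def pick_details(rows):
    """Keep only the wanted fields from (label, value) pairs, in a fixed order."""
    found = {label.strip(): value.strip() for label, value in rows}
    return {label: found[label] for label in WANTED_LABELS if label in found}


def extract_data_from_card(page, card_div):
    """Open one card's details view, print the wanted fields, then close it."""
    print("NAME:", card_div.locator(CARD_NAME_SPAN_SELECTOR).first.inner_text())
    card_div.locator(CARD_DETAILS_BUTTON_SELECTOR).first.click()
    page.wait_for_selector(CARD_DETAILS_TABLE_ROW_SELECTOR, timeout=TIMEOUT)

    table_rows = page.locator(CARD_DETAILS_TABLE_ROW_SELECTOR)
    rows = []
    for i in range(table_rows.count()):
        # Assumes each row holds a label cell followed by a value cell.
        cells = table_rows.nth(i).locator("td").all_inner_texts()
        if len(cells) >= 2:  # skip header or spacer rows
            rows.append((cells[0], cells[1]))

    for label, value in pick_details(rows).items():
        print(f"{label}: {value}")
    page.locator(CARD_DETAILS_CLOSE_BUTTON_SELECTOR).first.click()


def main():
    # Imported here so the helpers above load even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(BASE_URL, timeout=TIMEOUT)
            page.wait_for_selector(REGISTERED_PROJECTS_SELECTOR, timeout=TIMEOUT)
            cards = page.locator(CARDS_SELECTOR)
            # Never read more cards than the page actually shows.
            for i in range(min(cards.count(), NUMBER_OF_CARDS_TO_READ)):
                extract_data_from_card(page, cards.nth(i))
        finally:
            browser.close()
```

Calling `main()` would run the whole flow. Keeping the field filtering in a small pure function such as `pick_details` makes that part checkable without opening a browser.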
NAME: Project XYZ
Name: Project XYZ Name data not available
PAN No.: ABCDE1234F
GSTIN No.: 22ABCDE1234F1Z5
Permanent Address: 123 Main St, City, State
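Lines shaped like the sample above could be produced by a small formatter such as the one below. This is a sketch under assumptions: `format_record` and `FIELDS` are hypothetical names, and the fallback wording is modeled on the "Name data not available" text in the sample, on the guess that the script prints it when a field is missing.

```python
FIELDS = ("Name", "PAN No.", "GSTIN No.", "Permanent Address")


def format_record(card_name, details):
    """Return the console lines for one project card.

    `details` maps a field label to its text; missing or empty fields
    get a "<label> data not available" line instead of a value.
    """
    lines = [f"NAME: {card_name}"]
    for label in FIELDS:
        value = (details.get(label) or "").strip()
        if value:
            lines.append(f"{label}: {value}")
        else:
            lines.append(f"{label} data not available")
    return lines


if __name__ == "__main__":
    sample = {"Name": "Project XYZ", "PAN No.": "ABCDE1234F"}
    print("\n".join(format_record("Project XYZ", sample)))
```

Returning a list of lines instead of printing directly leaves the caller free to print them, log them, or write them to a file.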
Contributions are welcome! Please fork the repository and submit a pull request with your changes.
This project is licensed under the MIT License.
Special thanks to the Playwright team for creating such a robust browser automation tool.
For any questions or feedback, please visit kumarsomesh.in.
PS: This README file was generated using GPT.