beautifulsoup4 crawling-sites csv filehandling googlesearch python3 regex scraping-websites

Email-Phone scraping

This project crawls the HTML of websites to collect emails and phone numbers in bulk, then dumps them into a .csv file in an organized way.

The main goal of this 'Advanced' email and phone scraper, written in Python 3, is to provide a way to gather the data (emails and phone numbers) neatly and quickly.

Applications:

Generally used by marketers to stockpile the data of several organizations.
Used in business and eCommerce for market analysis.

Getting Started

These instructions will help you set up this project on your local system for development and testing purposes. Follow the steps below in order.

Pre-requisites

What needs to be installed on your system?

This project is built using Python 3.7.

Libraries to be installed:

pip install regex==2020.7.14
pip install google-search==1.0.2
pip install requests==2.24.0
pip install beautifulsoup4==4.9.1
pip install tld==0.12.2

Deployment

Now you are good to go :)

Clone the repository or download the zip file.
Extract the files into the directory of your choice.
Erase the contents of the .csv file, but leave the header row intact.
Run the script.
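
The "erase the .csv but keep the header" step can also be done with a few lines of Python. This is a minimal sketch, not part of the project: the function name `reset_csv_keep_header` and the file name in the usage note are hypothetical, and it assumes the header is the first row of the file.

```python
import csv
from pathlib import Path


def reset_csv_keep_header(path):
    """Truncate a CSV file so that only its first (header) row remains."""
    csv_path = Path(path)
    # Read just the first row; header is None if the file is empty.
    with csv_path.open(newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), None)
    # Rewrite the file with the header only.
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        if header is not None:
            csv.writer(f).writerow(header)
    return header
```

For example, `reset_csv_keep_header("output.csv")` would clear an assumed `output.csv` before a fresh run; substitute the actual name of the .csv file shipped with the project.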

Execution

Enter the organization name, along with the location if necessary. Example: Deloitte Hyderabad
The links associated with it will be stored in 'web_urls.txt'.
Enjoy Harvesting Emails and Phone numbers :)
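
To illustrate how search results might end up in 'web_urls.txt', here is a rough sketch. It assumes the search step yields an iterable of URL strings; the search call itself is not shown, and the helper names `save_urls` and `load_urls` are hypothetical. Whether the project overwrites or appends to the file is not stated, so this sketch simply overwrites it.

```python
from pathlib import Path


def save_urls(urls, path="web_urls.txt"):
    """Write one URL per line, skipping blanks and duplicates, keeping order."""
    seen = set()
    unique = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    Path(path).write_text("".join(u + "\n" for u in unique), encoding="utf-8")
    return unique


def load_urls(path="web_urls.txt"):
    """Read the stored URLs back as a list, ignoring empty lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

The later steps can then iterate over `load_urls()` and process one site at a time.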

How does it Work?

First, it generates links for the given input. It does this using 'search' from the google-search library and stores the current and all subsequent URLs in 'web_urls.txt'.
Second, it processes each URL by sending an HTTP request to the website.
It parses the page at that URL into HTML text using bs4.
With the entire content of the web page extracted, it scrapes all the emails and phone numbers present on the home page.
All of the scraping is done with regular expressions.
The regex used in this project is a generalized one that detects emails and phone numbers on most websites. Nevertheless, it may not work well for some.
If no data is detected on the home page of the website, it locates the contact page and collects any data found there, since most websites keep their contact details on the contact-us page.
It then merges the home page data and contact page data into a single data structure.
Finally, it dumps everything into a .csv file so that the data is organized and ready for inspection.
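
The extraction, contact-page lookup, merge, and CSV steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the two regular expressions are deliberately simple stand-ins for the project's patterns, all function names are hypothetical, and the functions work on HTML strings that have already been fetched (the HTTP request is left out). To stay dependency-free, the sketch collects links with the standard library's `html.parser` rather than bs4.

```python
import csv
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Illustrative patterns only (assumptions, not the project's regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


class _LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_contacts(html):
    """Return (emails, phones) found in a page as sorted, unique lists."""
    emails = sorted(set(EMAIL_RE.findall(html)))
    phones = sorted({m.strip() for m in PHONE_RE.findall(html)})
    return emails, phones


def find_contact_url(base_url, html):
    """Return the first link whose href mentions 'contact', or None."""
    parser = _LinkCollector()
    parser.feed(html)
    for href in parser.hrefs:
        if "contact" in href.lower():
            return urljoin(base_url, href)
    return None


def merge_contacts(home, contact):
    """Merge two (emails, phones) pairs, dropping duplicates."""
    emails = sorted(set(home[0]) | set(contact[0]))
    phones = sorted(set(home[1]) | set(contact[1]))
    return emails, phones


def append_row(csv_path, url, contacts):
    """Append one row per site: URL, emails, phones (semicolon-separated)."""
    emails, phones = contacts
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([url, "; ".join(emails), "; ".join(phones)])
```

In a full run, the home page HTML would go through `extract_contacts`; if nothing is found, the page returned by `find_contact_url` would be fetched and extracted as well, and the two results passed to `merge_contacts` before `append_row` writes them out. A phone pattern this loose will also match other long digit runs, so real use would need a stricter expression.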

Built with

Python 3.x - A Programming Language

Contributing

Open to contributions from the public.

Author

K Sai Chaitanya

About

Web scraping of Emails and Phone numbers from various websites
