To scrape the data, I used the BeautifulSoup library in Python. The data was scraped from https://companiesmarketcap.com/. To scrape the 100 companies, I searched for the div elements that hold that data. I then visited the related links to extract extra data such as each company's description, category, and past performance. The data is then stored in dictionaries and arrays, to be exported into JSON files.
Using BeautifulSoup and some HTML knowledge, I was able to locate the exact elements I needed to scrape. I used the requests library to get the data from the website. After that, I used the json library to store the data in a JSON file.
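The export step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `export_records` and the sample record are hypothetical, and only the field names are taken from the Companies.json layout described in this document.

```python
import json
from pathlib import Path


def export_records(records, path):
    """Write a list of dictionaries to a JSON file and return its path."""
    path = Path(path)
    # Create the output folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII company names readable.
        json.dump(records, f, indent=2, ensure_ascii=False)
    return path


# Hypothetical sample record; real values would come from the scraper.
companies = [
    {
        "company_name": "Example Corp",
        "company_description": "A placeholder company.",
        "company_country": "Exampleland",
    }
]
```

Calling `export_records(companies, "Companies.json")` would then write the list to disk as a JSON array.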
To run the scraper, you can run the following command in the terminal:
python 'Data Scraping/src/scrape.py'
To insert the data into the database, you can run the following command in the terminal:
python 'Data Scraping/src/insert.py'
Make sure you install all the necessary dependencies before running the scraper. The reference section lists all the dependencies used in this project.
To insert the data into the database, you need a running MySQL server. You can set the database configuration in the .env file, using the .env.example file as a template.
If you encounter a collation error, make sure the installed version of mysql-connector-python is 8.0.17.
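As a rough sketch of how the database settings could be read, the function below builds a connection dictionary from environment variables. The variable names (`DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`) and the defaults are assumptions for illustration only; the real names are whatever .env.example defines.

```python
import os


def load_db_config(env=None):
    """Build MySQL connection settings from environment variables.

    The variable names used here are hypothetical; replace them with
    the names found in .env.example.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "3306")),
        "user": env.get("DB_USER", "root"),
        "password": env.get("DB_PASSWORD", ""),
        "database": env.get("DB_NAME", ""),
    }


# With python-dotenv installed, the .env file can be loaded into
# os.environ before calling load_db_config():
#   from dotenv import load_dotenv
#   load_dotenv()
```

Passing a plain dictionary instead of `os.environ` makes the function easy to try out without touching the real environment.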
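One way to check which version of a package is installed, sketched with the standard library only. The helper names `installed_version` and `matches_required` are hypothetical and not part of this project.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_required(package, required):
    """True only if the package is installed at exactly the required version."""
    return installed_version(package) == required
```

For example, `matches_required("mysql-connector-python", "8.0.17")` reports whether the version named above is the one in the current environment.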
After some transformation during the scraping process, this is the data stored in the final JSON files:
- Categories.json
{
"category_name": string,
"category_description": string
}
- Companies.json
{
"company_name": string,
"company_description": string,
"company_country": string
}
- Stocks.json
{
"stock_code": string,
"current_price": number,
"current_marketcap": number,
"company_name": string
}
- Performances.json
{
"stock_code": string,
"revenue": number,
"earnings": number
}
- Countries.json
{
"country_name": string
}
- Company_Category.json
{
"company_name": string,
"category_name": string
}
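To illustrate the kind of transformation involved, the sketch below splits one flat scraped record into rows for several of the files above. The shape of the input record (including its `categories` list) and the function name `split_record` are assumptions; only the output field names come from the layouts listed here.

```python
def split_record(raw):
    """Split one flat scraped record into rows for the separate JSON files."""
    company = {
        "company_name": raw["company_name"],
        "company_description": raw["company_description"],
        "company_country": raw["company_country"],
    }
    stock = {
        "stock_code": raw["stock_code"],
        "current_price": raw["current_price"],
        "current_marketcap": raw["current_marketcap"],
        "company_name": raw["company_name"],
    }
    country = {"country_name": raw["company_country"]}
    # One row per (company, category) pair.
    company_categories = [
        {"company_name": raw["company_name"], "category_name": name}
        for name in raw["categories"]
    ]
    return company, stock, country, company_categories


# Hypothetical input record, for illustration only.
sample = {
    "company_name": "Example Corp",
    "company_description": "A placeholder company.",
    "company_country": "Exampleland",
    "stock_code": "EXM",
    "current_price": 10.5,
    "current_marketcap": 1000000,
    "categories": ["Tech", "Software"],
}
```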
There are 4 strong entities and 1 weak entity in this ERD. The strong entities are Company, Stock, Category, and Country. The weak entity is Performance. In short, each entity is described as follows:
- Company: Contains the company's name and description, i.e. the company's general information.
- Stock: Contains the company's stock information such as the stock's name, price, and market cap.
- Category: The categories available within this scope.
- Performance: Contains the company's historical performance such as revenue and earnings.
- Country: Contains the country where the company is located.
The relationship between the entities can be seen in the picture.
The relational model is derived from the ERD. The tables are as follows:
- Company : Contains the company's name, description, and FK reference to Country as the country where the company is located.
- Stock : Contains the stock's name, price, market cap, and FK reference to Company as the company that the stock belongs to.
- Category : Contains the category's name and description.
- Performance : Contains the year, revenue, earnings, and FK reference to the Company table as the company that the performance belongs to.
- Country : Contains the country's name.
- Company_Category : Contains the company's name and category's name as the company's category.
The steps to translate from ERD to Relational Model are as follows:
- Change all strong entities into tables.
- Change all weak entities into tables whose PK combines the primary key of the owning strong entity with the discriminator of the weak entity.
- Turn each many-to-many relationship into a new table whose primary key combines the primary keys of the two related entities.
In detail, change the Category, Company, Stock, and Country entities into tables with the PK from the ERD as their PK. Then, change the Company-Category relationship into a new table called Company_Category. After that, change the Performance entity into a table with the PK from Company and the year as the PK.
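The translation steps can be sketched as table definitions. This is only an illustration under several assumptions: it uses Python's built-in sqlite3 with an in-memory database instead of MySQL so that it runs without a server, the column types are guesses, and the choice of name columns as primary keys follows the JSON field names in this document rather than the actual ERD.

```python
import sqlite3

# Hypothetical schema; column types and key choices are assumptions.
SCHEMA = """
CREATE TABLE country (
    country_name TEXT PRIMARY KEY
);
CREATE TABLE company (
    company_name TEXT PRIMARY KEY,
    company_description TEXT,
    company_country TEXT REFERENCES country(country_name)
);
CREATE TABLE stock (
    stock_code TEXT PRIMARY KEY,
    current_price REAL,
    current_marketcap REAL,
    company_name TEXT REFERENCES company(company_name)
);
CREATE TABLE category (
    category_name TEXT PRIMARY KEY,
    category_description TEXT
);
-- Weak entity: owner key plus discriminator (year) form the primary key.
CREATE TABLE performance (
    company_name TEXT REFERENCES company(company_name),
    year INTEGER,
    revenue REAL,
    earnings REAL,
    PRIMARY KEY (company_name, year)
);
-- Many-to-many relationship: both entity keys form the primary key.
CREATE TABLE company_category (
    company_name TEXT REFERENCES company(company_name),
    category_name TEXT REFERENCES category(category_name),
    PRIMARY KEY (company_name, category_name)
);
"""


def create_schema(conn):
    """Create all tables on the given SQLite connection and return it."""
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

With this layout, a company can have one performance row per year, and inserting a second row for the same company and year is rejected by the composite primary key.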
Some libraries used in this project:
- BeautifulSoup V4.12.2
- aiohttp V3.10.0
- mysql-connector-python V8.0.17
- python-dotenv V1.0.1
- pandas V1.5.1
- asyncio V3.4.3
The data is scraped from https://companiesmarketcap.com/
Name | NIM | Email |
---|---|---|
Imanuel Sebastian Girsang | 13522058 | imanuelgirsang1@gmail.com |