isa96 / webscrapping

Webscrapping-with-BeautifulSoup

In this module we will learn how to do simple web scraping using Beautiful Soup. Web scraping is one method we can use to collect data from the internet. In this particular module, we will scrape the Indonesian inflation rate from pusatdata.kontan.co.id, a data center run by an Indonesian economic newspaper that provides useful financial information. To do this we will use only a few of Python's default libraries and BeautifulSoup.

This module is kept as simple as possible so that new developers can use it to learn web scraping with Beautiful Soup. Web scraping does require some knowledge of HTML. I'll explain what you need for this module as we go, but it is always better to understand a bit of HTML first. You can read up on it quickly in the Beautiful Soup documentation, whose landing page explains what HTML is and what Beautiful Soup does with it.

Dependencies

To follow this module you only strictly need beautifulsoup4, which you can install with pip install beautifulsoup4. These are the libraries I use in this module, which should be installed first:

  • beautifulsoup4
  • pandas
  • matplotlib
  • requests
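
If you are unsure whether everything is installed, a small check like the one below can help. It is only a sketch: the helper name `missing_packages` is hypothetical, and the mapping assumes the usual import names (beautifulsoup4 is imported as `bs4`).

```python
from importlib.util import find_spec

# pip package name -> import name (assumed mapping; beautifulsoup4 imports as bs4)
REQUIRED = {
    "beautifulsoup4": "bs4",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "requests": "requests",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be found for import."""
    return [
        pip_name
        for pip_name, import_name in required.items()
        if find_spec(import_name) is None
    ]


# An empty list means every dependency can be imported.
print(missing_packages())
```

Anything the function reports can then be installed with pip install followed by the package name.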

What is BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib.
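
To make "pulling data out of HTML" concrete, here is a minimal sketch that parses a small HTML string with Python's built-in html.parser. The markup, the class name rates, and the values are placeholders invented for illustration; they do not reflect the real markup of pusatdata.kontan.co.id.

```python
from bs4 import BeautifulSoup

# Placeholder HTML: structure and numbers are invented for this example.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <table class="rates">
      <tr><th>Period</th><th>Value</th></tr>
      <tr><td>Period A</td><td>1.25</td></tr>
      <tr><td>Period B</td><td>1.50</td></tr>
    </table>
  </body>
</html>
"""

# Parse the text into a tree we can search.
soup = BeautifulSoup(html, "html.parser")

# Read the text of the <title> tag.
page_title = soup.title.get_text(strip=True)

# Find the table by tag name and class, then walk its rows.
table = soup.find("table", class_="rates")
rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 2:  # the header row uses <th>, so it is skipped
        period = cells[0].get_text(strip=True)
        value = float(cells[1].get_text(strip=True))
        rows.append((period, value))

print(page_title)
print(rows)
```

The same find and find_all calls work on HTML downloaded from a real page; only the tag names and classes you search for will differ.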

Since Beautiful Soup pulls data out of HTML, we first need to get the HTML itself. How do we do that? We will use the requests library.

The requests library does this by sending a GET request to the specific address we give. This is the same type of request your browser sends to view a page; the only difference is that Requests can't actually render the HTML, so instead you just get the raw HTML and the other response information.
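
A minimal sketch of such a request is shown below. The function name `fetch_html` and its `session` and `timeout` parameters are assumptions made for this example, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Send a GET request and return the raw HTML of the response as text."""
    # Use the given session if there is one, otherwise the requests module;
    # both expose the same get() call.
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx answers instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html with the address of the page you want to scrape returns its HTML as one string, which can then be handed to Beautiful Soup. The optional session argument is there so that a requests.Session can be reused across several requests.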

Conclusion

In conclusion, when you don't have direct access to the data behind a website, you can always fall back on scraping. There are a couple of libraries that can do the same task, such as Scrapy, which can build bots to crawl data automatically, but we chose Beautiful Soup because it is more beginner friendly and is a helpful utility that lets a programmer get specific elements out of a webpage (for example, a list of images).
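
As a small illustration of getting specific elements out of a page, the sketch below collects the source of every image in a snippet of HTML. The snippet and its file paths are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment; the third <img> has no src attribute on purpose.
html = """
<div>
  <img src="/static/logo.png" alt="Logo">
  <img src="/static/chart.png" alt="Chart">
  <img alt="No source">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# src=True keeps only <img> tags that actually carry a src attribute.
image_sources = [img["src"] for img in soup.find_all("img", src=True)]

print(image_sources)
```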

After this you can also wrap the scraping in a single function and put it in a Flask web app. You can find the demo here, and you can go to the inflation branch to see an example that scrapes the same page, or you can visit Pricemate, which scrapes tiket.com data to get a train price list. I hope this short module helps you understand web scraping and kickstarts you to learn more about it using BeautifulSoup. Feel free to contact us at mentor@algorit.ma if you have more questions.

Happy learning~
