pablo-pnunez / TripAdvisor-review-downloader

TripAdvisor Review Downloader

For efficiency, this program (Main.py) splits the data download into four stages, each of which can run in parallel using threads:

  1. Item download: The first step is to obtain a list of all available items (restaurants) and their basic details. This phase therefore yields each item's name, its identifier within the website, its rating (calculated by TripAdvisor from user reviews) and, most importantly, the URL of the page with the item's details. This URL is what allows the next step to run.
  2. Obtain reviews: Starting from the list of items produced by the previous step, we visit the URL of each item to extract the basic details of its reviews. Because of how the website is implemented, the full content of a review is not visible on the item's details page, so it is retrieved in a later step. This phase extracts the identifier, title, stars and URL of every review of every item. Reviews can be written in different languages; we chose to download only those in the native language of the city.
  3. Extend reviews: With the list and basic details of each review, we can now expand each one by obtaining its full text and a list of the images uploaded by the user. This phase also stores the name and identifier of each review's author. For the images, we store only the URL; the associated file is downloaded in the last phase. Note that, due to limitations of the website, at most four photographs can be viewed per review (even when the review has more), so at most four image URLs are stored per review.
  4. Image download: Finally, the last step downloads and stores the images. This phase can be skipped if the image files are not going to be used.
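
The four stages above can be sketched roughly as follows. This is only an illustrative outline, not the actual contents of Main.py: the `Item` and `Review` classes, the `run_stage` and `run_pipeline` functions, and the idea of listing items page by page (`page_numbers`) are all assumptions made for the example. The four fetcher callables stand in for the real scraping code, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import List

# The website shows at most four photographs per review.
MAX_IMAGES_PER_REVIEW = 4


@dataclass
class Item:  # hypothetical container for the stage 1 output
    item_id: str
    name: str
    rating: float
    url: str


@dataclass
class Review:  # hypothetical container for the stage 2 and 3 output
    review_id: str
    item_id: str
    title: str
    stars: int
    url: str
    text: str = ""
    author_id: str = ""
    author_name: str = ""
    image_urls: List[str] = field(default_factory=list)


def run_stage(worker, inputs, n_threads=4):
    """Apply `worker` to every input in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(worker, inputs))


def run_pipeline(list_items, list_reviews, expand_review, download_image,
                 page_numbers, with_images=True):
    """Run the stages in order; the work inside each stage runs in threads.

    All four callables are placeholders (assumptions) for the scraping code:
      list_items(page)    -> list of Item
      list_reviews(item)  -> list of Review with basic details only
      expand_review(rev)  -> Review with text, author and image URLs filled in
      download_image(url) -> the downloaded image content
    """
    # Stage 1: item download.
    pages = run_stage(list_items, page_numbers)
    items = [item for page in pages for item in page]

    # Stage 2: obtain the basic details of each review of each item.
    batches = run_stage(list_reviews, items)
    reviews = [review for batch in batches for review in batch]

    # Stage 3: extend each review with full text, author and image URLs.
    reviews = run_stage(expand_review, reviews)
    for review in reviews:
        review.image_urls = review.image_urls[:MAX_IMAGES_PER_REVIEW]

    # Stage 4: image download (optional).
    images = []
    if with_images:
        urls = [url for review in reviews for url in review.image_urls]
        images = run_stage(download_image, urls)

    return items, reviews, images
```

In this sketch the stages run one after another because each depends on the output of the previous one, while the requests inside a stage are spread across threads, which suits work that mostly waits on the network. Passing `with_images=False` skips the last stage.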

Languages

Python 51.2%, Jupyter Notebook 48.0%, Shell 0.8%