rdoume / HN_Search

Some notebooks exploring the Hacker News search queries dataset.

HN_Search

This repository contains an analysis of the Hacker News search queries dataset.

Requirements

This analysis is done in Python, interactively, through Jupyter notebooks.

The packages required for this analysis are listed in the file requirements.txt.

FastText can be tricky to install; you should follow the instructions on its GitHub page.

I advise creating a dedicated virtualenv so as not to interfere with your current Python installation. These notebooks have been tested under Ubuntu 16.04 and Linux Mint Sonya.

External data

The notebooks use the binarized fastText word vectors trained on the Common Crawl dataset, available at this address. Once you have downloaded the word vectors, put them in the data folder.

Prepare the dataset

The data was provided as one file per day. For the sake of simplicity, I concatenated these files into a single one with the command

cat ./data/2018* > all.txt

The resulting file, all.txt, is required to execute the notebooks.
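For readers without a Unix shell, the same concatenation can be sketched in Python. This is only an illustration of the step above, not code from the notebooks: the function name concatenate_daily_files and its parameters are hypothetical, and it assumes the daily files sit in the data folder with names starting with 2018, as in the cat command.

```python
import glob
import os
import shutil


def concatenate_daily_files(data_dir="data", prefix="2018", output="all.txt"):
    """Concatenate every file in data_dir whose name starts with prefix into output.

    Hypothetical helper mirroring `cat ./data/2018* > all.txt`.
    Returns the list of files that were merged, in the order they were written.
    """
    pattern = os.path.join(data_dir, prefix + "*")
    # Sort the matches so the days are appended in alphabetical (chronological) order.
    daily_files = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    # Work in binary mode so the bytes are copied unchanged, as cat does.
    with open(output, "wb") as merged:
        for path in daily_files:
            with open(path, "rb") as daily:
                shutil.copyfileobj(daily, merged)
    return daily_files
```

Calling concatenate_daily_files() from the repository root writes all.txt to the current directory, just as the shell command does.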

Visualisation

The visualisations in the notebooks are made with the plotly visualisation library, which is highly recommended for its interactivity.

However, because of the number of data points, the visualisations are several megabytes in size, so the notebooks contain static images of them (saved in the folder images/). To get the interactive versions, you should re-run the notebooks.

Three notebooks are provided in this analysis:

  • EDA.ipynb: This notebook is the first one to open. It is an exploratory notebook on the dataset, and its findings lead to the other two.

  • Clustering.ipynb: This notebook studies how to cluster user search queries in an unsupervised manner, in order to provide better insight into the types of queries.

  • Trends_per_day.ipynb: This notebook studies how to compute trending queries per day efficiently, in order to better represent which queries are made over time.
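To give an idea of what a per-day computation can look like, here is a minimal sketch that counts the most frequent queries for each day. It is not the method used in Trends_per_day.ipynb: the function top_queries_per_day is hypothetical, and it assumes the data has already been parsed into (day, query) pairs, which this README does not specify.

```python
from collections import Counter, defaultdict


def top_queries_per_day(records, k=3):
    """Return {day: [(query, count), ...]} with the k most frequent queries of each day.

    `records` is assumed to be an iterable of (day, query) pairs; this input
    format is an assumption, not the documented layout of the dataset.
    """
    counts = defaultdict(Counter)
    for day, query in records:
        # Normalise lightly so that "Python" and " python " count as the same query.
        counts[day][query.strip().lower()] += 1
    return {day: counter.most_common(k) for day, counter in counts.items()}
```

Raw frequency is the simplest possible notion of a trend. Comparing each day's counts with the query's usual frequency would highlight queries that are unusually popular on a given day, not just those that are always popular.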

License:GNU Affero General Public License v3.0


Languages

Language:Jupyter Notebook 100.0%