yorek / analyzing-stackexchange-with-azure-data-lake

Analyzing StackExchange data with Azure Data Lake

This repository contains all the code & scripts for my 'Analyzing StackExchange data with Azure Data Lake' series.

This series walks you through storing StackExchange data in Data Lake Store, aggregating the user data from all the websites into one file, and analyzing it with Data Lake Analytics. After that, we'll use Power BI to visualize the results.

In the introduction, I talked about the four major blocks in the series:

  1. Storing the data in Azure Data Lake Store or Azure Storage (post)
  2. Aggregating the data with Azure Data Lake Analytics
  3. Analyzing the data with Azure Data Lake Analytics
  4. Visualizing the data with Power BI
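
To make the aggregation block (step 2) more concrete, here is a minimal local sketch in Python. It is not the Data Lake Analytics job used in the series; it only illustrates the idea of merging the users of several sites into one file. It assumes that each site has its own folder containing a `Users.xml` file made of `<row>` elements with attributes such as `Id`, `DisplayName` and `Reputation`. The function name `aggregate_users`, the chosen columns and the CSV output are hypothetical.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

# Attributes copied from each <row>; assumed to be present in Users.xml.
# A missing attribute is written as an empty string.
COLUMNS = ["Id", "DisplayName", "Reputation"]


def aggregate_users(dump_root, out_csv):
    """Merge the Users.xml of every site folder under dump_root into one CSV.

    Returns the number of user rows written.
    """
    written = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Site"] + COLUMNS)
        # Sort the site folders so the output order is deterministic.
        for users_file in sorted(Path(dump_root).glob("*/Users.xml")):
            site = users_file.parent.name
            # iterparse reads the file incrementally instead of loading it whole.
            for _event, elem in ET.iterparse(str(users_file), events=("end",)):
                if elem.tag == "row":
                    writer.writerow([site] + [elem.get(c, "") for c in COLUMNS])
                    written += 1
                elem.clear()
    return written
```

Calling `aggregate_users("dump", "all-users.csv")` on a hypothetical `dump` folder would produce one CSV with a `Site` column in front of the user columns, so rows from different sites stay distinguishable after merging.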

Getting the StackExchange Data Dump

Stack Exchange has made the data from all their websites available under a Creative Commons license. It includes data about users, posts, comments, votes, etc. for every single site.

We will use this data as a demo set because it reflects real-world data. It covers every StackExchange website, from users and posts to comments, votes and beyond.

Here is an example of how the folder for coffee-stackexchange-com is structured:

+ coffee-stackexchange-com
	- Badges.xml
	- Comments.xml
	- PostHistory.xml
	- PostLinks.xml
	- Posts.xml
	- Tags.xml
	- Users.xml
	- Votes.xml
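
Before uploading a site folder, it can be useful to confirm that it matches the layout above. The following is a minimal Python sketch of such a check, not one of this repository's scripts. The expected file names come from the coffee-stackexchange-com listing; the function name `missing_dump_files` and the local path in the usage lines are hypothetical.

```python
from pathlib import Path

# File names taken from the coffee-stackexchange-com listing above.
EXPECTED_FILES = [
    "Badges.xml", "Comments.xml", "PostHistory.xml", "PostLinks.xml",
    "Posts.xml", "Tags.xml", "Users.xml", "Votes.xml",
]


def missing_dump_files(site_dir):
    """Return the expected dump files that are absent from site_dir, sorted by name."""
    site_path = Path(site_dir)
    return sorted(name for name in EXPECTED_FILES if not (site_path / name).is_file())


if __name__ == "__main__":
    # Hypothetical local path to an extracted site folder.
    missing = missing_dump_files("coffee-stackexchange-com")
    print("missing files:", missing)
```

An empty list means the folder contains all eight files; anything else names the files that still need to be downloaded or extracted.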

You can find all the data here.

License

Licensed under the terms of the MIT license.
