Zorigt / wiki_dump_etl

Map music artists who are mentioned on each other's wiki page

Map Music Artists in Wikipedia

#Table of Contents

#Introduction
This project maps music artists to the other music artists who are mentioned on their wiki pages.

#Data Set
The raw data is a wiki dump XML file. It contains all wiki page articles, including those for music artists.

#Data Transformations
The wiki dump XML file contains tags for musicians. The first step is therefore to parse the XML's page tags and stream each page through Kafka. The Kafka stage then filters the stream, keeping only the wiki pages whose page tags include singer tags. Kafka sends these XML pages to HDFS and Apache Spark. HDFS stores the logs, while Spark processes the singer pages into a NoSQL table and stores that table in Cassandra. Once the data is in Cassandra, each music artist is mapped to the other singers who are mentioned on that artist's wiki page.
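The parse-and-filter step described above might be sketched roughly as follows. This is a minimal illustration and not the repository's code: the class `WikiPageFilter`, the `Page` holder, and the `SINGER_MARKER` test (a plain, case-insensitive search for the word "singer" in the page text) are all assumptions made for the example. It uses the JDK's StAX reader so the dump is streamed instead of loaded whole; publishing each page to Kafka is left out because it needs the Kafka client library.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class WikiPageFilter {

    /** Hypothetical holder for one wiki page: its title and raw wikitext. */
    public static final class Page {
        public final String title;
        public final String text;

        Page(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    // Assumption: a page counts as a "singer page" if its text contains this word.
    static final String SINGER_MARKER = "singer";

    /** Streams <page> elements from the dump and keeps those that look like singer pages. */
    public static List<Page> singerPages(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The dump needs no DTD or external entities, so switch both off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        List<Page> result = new ArrayList<>();
        String title = "";
        String text = "";
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("page")) {
                        // A new page begins: reset the fields collected so far.
                        title = "";
                        text = "";
                    } else if (name.equals("title")) {
                        title = reader.getElementText();
                    } else if (name.equals("text")) {
                        text = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("page")) {
                    // The page is complete: apply the (assumed) singer filter.
                    if (text.toLowerCase(Locale.ROOT).contains(SINGER_MARKER)) {
                        result.add(new Page(title, text));
                    }
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }
}
```

In a full pipeline, each `Page` kept by the filter would be handed to a Kafka producer instead of being collected in a list.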

#Schemas

| artist name | other artist 1 | other artist 2 | ... |
|---|---|---|---|
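As a rough illustration of how one row of this schema could be built, the sketch below maps each artist to the other artists linked from that artist's page. It assumes that a mention appears as a `[[wiki link]]` whose target is exactly another artist's page title. The class `ArtistMentionMapper` and the method `mapMentions` are hypothetical names, and the in-memory map stands in for the Cassandra table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArtistMentionMapper {

    // Captures the target of a wiki link: [[Target]], [[Target|label]] or [[Target#section]].
    private static final Pattern WIKI_LINK = Pattern.compile("\\[\\[([^\\]|#]+)");

    /**
     * Takes singer pages keyed by page title (value: raw wikitext) and returns,
     * for each artist, the other known artists linked from that artist's page.
     */
    public static Map<String, List<String>> mapMentions(Map<String, String> singerPages) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        for (Map.Entry<String, String> page : singerPages.entrySet()) {
            String artist = page.getKey();
            // A LinkedHashSet drops repeated mentions but keeps first-seen order.
            Set<String> mentioned = new LinkedHashSet<>();
            Matcher link = WIKI_LINK.matcher(page.getValue());
            while (link.find()) {
                String target = link.group(1).trim();
                // Keep only links that point at another artist in the data set.
                if (!target.equals(artist) && singerPages.containsKey(target)) {
                    mentioned.add(target);
                }
            }
            rows.put(artist, new ArrayList<>(mentioned));
        }
        return rows;
    }
}
```

Each map entry corresponds to one schema row: the key is the artist name and the list holds other artist 1, other artist 2, and so on.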

Work in progress

#Live Demo
Work in progress

#Presentation Deck
Work in progress

#Instructions to Run this Pipeline
Work in progress

Languages

Language:Java 100.0%