tommenx / jaccard-similarity

Calculate Jaccard similarity using the MapReduce framework

Jaccard-similarity

This project uses Hadoop to calculate the Jaccard similarity between statements.

Task

  • Segmentation of a statement

Each statement is split into overlapping three-character items. Spaces and symbols such as _ , - are kept as ordinary characters (in the example below, _ marks a space).

Letters keep their original capitalization.

today is Monday
["tod","oda","day","ay_","y_i","_is","is_","s_M","_Mo","Mon","ond","nda","day"]
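The segmentation rule can be sketched in plain Java as below. This is a minimal illustration rather than the project's actual code: the class name `Segmenter` and the method `segment` are hypothetical, and spaces are replaced with `_` only when printing.

```java
import java.util.ArrayList;
import java.util.List;

class Segmenter {
    // Split a statement into overlapping 3-character items.
    // Spaces and symbols are kept as ordinary characters; case is preserved.
    static List<String> segment(String statement) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i + 3 <= statement.length(); i++) {
            items.add(statement.substring(i, i + 3));
        }
        return items;
    }

    public static void main(String[] args) {
        // Spaces are shown as '_' for readability only.
        for (String item : segment("today is Monday")) {
            System.out.print(item.replace(' ', '_') + " ");
        }
        System.out.println();
    }
}
```

A statement shorter than three characters yields no items under this sketch.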
  • First Map Stage

Input format:

ID	words
1   Jurgen Annevelink

Output format:

key: prefix_item1 value:ID|item1#item2#item3  ...
key: prefix_item2 value:ID|item1#item2#item3  ...
key: prefix_item3 value:ID|item1#item2#item3  ...
key: Jur value:1|Jur#urg#rge#gen#en #n A#Ann#nne  ...
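The map output above could be produced along these lines. This is a sketch under stated assumptions, written in plain Java without Hadoop types so it runs on its own: `FirstMapSketch`, `map`, and the `prefixLength` parameter are hypothetical names, and the prefix is assumed to be the first few items in their original order. The project may choose or order prefix items differently.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class FirstMapSketch {
    // Overlapping 3-character items, as in the segmentation step.
    static List<String> segment(String s) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            items.add(s.substring(i, i + 3));
        }
        return items;
    }

    // Emit one (prefix item, "ID|item1#item2#...") pair per prefix item.
    // Assumption: the prefix is simply the first prefixLength items.
    static List<Map.Entry<String, String>> map(String line, int prefixLength) {
        String[] parts = line.trim().split("\\s+", 2); // ID, then the words
        String id = parts[0];
        String words = parts.length > 1 ? parts[1] : "";
        List<String> items = segment(words);
        String value = id + "|" + String.join("#", items);

        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (int i = 0; i < Math.min(prefixLength, items.size()); i++) {
            out.add(new AbstractMap.SimpleEntry<>(items.get(i), value));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> kv : map("1   Jurgen Annevelink", 3)) {
            System.out.println("key: " + kv.getKey() + " value:" + kv.getValue());
        }
    }
}
```

Emitting the full item list with every key lets a reducer compute the similarity without a second lookup, at the cost of repeating the list once per prefix item.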
  • First Reduce Stage

This stage calculates the similarity between statements that share the same prefix item.

Input format:

key: prefix_item1
value: ID1|item1,item2,item3,... ; ID2|item1,item2,item3,...  ; ID3|item1,item2,item3,...  ; ...

Output format:

ID1 	ID2 	similarity
ID1 	ID3 	similarity
ID2 	ID3 	similarity
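The similarity computed in this stage is the Jaccard coefficient, |A ∩ B| / |A ∪ B|, over the item sets of two statements. The sketch below shows one way a reducer could compute it for every pair grouped under one key. It is not the project's code: `FirstReduceSketch` and its methods are hypothetical names, items are assumed to be separated by `#` as in the map output, and repeated items are assumed to count once (set semantics).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FirstReduceSketch {
    // Jaccard similarity: |A ∩ B| / |A ∪ B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    // values: all "ID|item1#item2#..." strings that share one prefix item.
    // Returns one "ID1<TAB>ID2<TAB>similarity" line per pair.
    static List<String> reduce(List<String> values) {
        List<String> ids = new ArrayList<>();
        List<Set<String>> sets = new ArrayList<>();
        for (String value : values) {
            int bar = value.indexOf('|');
            ids.add(value.substring(0, bar));
            // Assumption: items never contain '#'.
            sets.add(new HashSet<>(Arrays.asList(value.substring(bar + 1).split("#"))));
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            for (int j = i + 1; j < ids.size(); j++) {
                out.add(ids.get(i) + "\t" + ids.get(j) + "\t"
                        + jaccard(sets.get(i), sets.get(j)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        reduce(Arrays.asList("1|abc#bcd", "2|abc#bcd", "3|abc#xyz"))
                .forEach(System.out::println);
    }
}
```

Comparing only statements that share a prefix item avoids the full all-pairs comparison, but the pairwise loop is still quadratic in the size of each group.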
  • Final MapReduce Stage

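Removes duplicate pairs. Two statements that share more than one prefix item are compared under each of those keys, so the same pair can appear several times in the output of the first reduce stage.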
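One way to express the deduplication is to key each record by its ID pair and keep a single similarity per key. The following is only a sketch: `DedupSketch` and `dedup` are hypothetical names, IDs are assumed to be integers, and the pair is normalised so that the smaller ID comes first. In a real MapReduce job the pair would be the map output key and the reducer would emit one value per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupSketch {
    // lines: "ID1<TAB>ID2<TAB>similarity" records from the first reduce stage.
    // Returns one "(ID1,ID2)<TAB>similarity" line per distinct pair.
    static List<String> dedup(List<String> lines) {
        Map<String, String> seen = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            long a = Long.parseLong(fields[0]);
            long b = Long.parseLong(fields[1]);
            // Normalise the pair so (1,90) and (90,1) share one key.
            String key = "(" + Math.min(a, b) + "," + Math.max(a, b) + ")";
            seen.putIfAbsent(key, fields[2]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : seen.entrySet()) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        dedup(Arrays.asList("1\t90\t1.0", "90\t1\t1.0", "5\t385\t0.3"))
                .forEach(System.out::println);
    }
}
```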

Deployment

  • JDK 1.8
  • Hadoop 2.9.0
  • Maven
  • IntelliJ IDEA

How to use

Use HDFS to upload the test data into the directory input, then run:

hadoop jar jaccard.jar Jaccard input output

Sample

input

1 Jurgen Annevelink
2 Rafiul Ahad
3 Amelia Carlson
4 Daniel H. Fishman
5 Michael L. Heytens
6 William Kent
7 Jos
8 Yuri Breitbart
9 Hector Garcia-Molina
10 Abraham Silberschatz
11 Stavros Christodoulakis
12 Leonidas Koveos
13 Umeshwar Dayal
......
......

output

(1,90)		1.0
(4,89)		1.0
(5,385)		0.3
(5,551)		0.3333333333333333
(5,1045)	0.3125
(5,1789)	0.30434782608695654
(5,2746)	0.3157894736842105
(6,22)		0.5714285714285714
(6,39)		0.5714285714285714
(6,59)		1.0
(6,62)		1.0
(6,97)		1.0
(6,270)		0.375
(6,495)		0.3157894736842105
(6,790)		0.3
......
......

Languages

Language:Java 100.0%