thotamohan / Spark-streaming

Bloom filtering, Flajolet Martin algorithm, and reservoir sampling algorithm

Introduction:

This repository contains implementations of three algorithms:

  • Bloom filtering.
  • Flajolet-Martin algorithm.
  • Reservoir sampling.

Programming Requirements:

  • Task2 (Flajolet-Martin algorithm) requires the Spark Streaming library. Task3 uses the Twitter streaming API: the Python library tweepy and the Scala library sparkstreaming-twitter are allowed for this task.
  • Apart from the libraries named in the previous item, you can only use Spark RDD and standard Python or Scala libraries; that is, no points are awarded for solutions that use Spark DataFrame or DataSet.

Programming Environment:

Python 3.6, Scala 2.11 and Spark 2.3.2

Dataset:

Yelp Business Data i.e., business_first.json and business_second.json

  • For Bloom filtering you need to download the business_first.json and business_second.json from
  • The first file is used to set up the bit array for Bloom filtering, and the second file is used for prediction.

Tasks

Task1: Bloom Filtering:

  • I implemented the Bloom Filtering algorithm to estimate whether the city of a business in business_second.json has appeared before in business_first.json.
  • We need to find a proper bit array size, proper hash functions, and a proper number of hash functions for the Bloom Filtering algorithm.
  • Some possible hash functions are f(x) = (ax + b) % m or f(x) = ((ax + b) % p) % m, where p is any prime number and m is the length of the filter bit array. You can use any combination of the parameters (a, b, p). The hash functions should stay the same once you have created them.
  • Since the city of a business is a string, you need to convert it into an integer before applying the hash functions to it. The following code shows one possible solution:

      import binascii

      int(binascii.hexlify(s.encode('utf8')), 16)

(Only exactly identical strings are treated as the same city; I did not consider aliases. If a record in the business_second.json file does not contain the city field, or the city field is empty, I predicted zero for that record.)
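
The steps of Task1 can be sketched as follows. This is a minimal illustration, not the repository's actual code: the bit array length M, the prime P, the number of hash functions NUM_HASHES, the seed, and all function names are hypothetical choices, and plain Python lists stand in for the Spark RDDs that the real task requires.

```python
import binascii
import random

M = 10_007          # hypothetical bit array length (m)
P = 2_147_483_647   # hypothetical prime (p), here 2**31 - 1
NUM_HASHES = 5      # hypothetical number of hash functions

# Draw the (a, b) parameters once so the hash functions never change.
_rng = random.Random(42)
HASH_PARAMS = [(_rng.randrange(1, P), _rng.randrange(0, P))
               for _ in range(NUM_HASHES)]


def city_to_int(city):
    """Convert a non-empty city string to an integer via its UTF-8 hex form."""
    return int(binascii.hexlify(city.encode("utf8")), 16)


def bit_positions(city):
    """Bit positions for a city, using f(x) = ((a*x + b) % p) % m."""
    x = city_to_int(city)
    return [((a * x + b) % P) % M for a, b in HASH_PARAMS]


def build_filter(cities):
    """Set the bits for every non-empty city from the first file."""
    bits = [0] * M
    for city in cities:
        if city:
            for pos in bit_positions(city):
                bits[pos] = 1
    return bits


def predict(bits, city):
    """Return 1 if the city may have been seen before, otherwise 0."""
    if not city:  # missing or empty city field: predict 0
        return 0
    return int(all(bits[pos] for pos in bit_positions(city)))
```

For example, after `bits = build_filter(["Las Vegas", "Phoenix"])`, the call `predict(bits, "Las Vegas")` returns 1, while an unseen city returns 0 unless all of its bit positions happen to collide with bits that are already set.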

Execution Details:

The code runs within 60 seconds and is evaluated on the false positive rate (FPR) and the false negative rate (FNR).
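
The two metrics can be computed as in the sketch below, using the standard definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP). The function name and the input layout (two parallel lists of 0/1 values) are assumptions made for illustration; the actual grading script is not part of this repository.

```python
def error_rates(predictions, ground_truth):
    """Return (FPR, FNR) for parallel lists of 0/1 predictions and labels."""
    pairs = list(zip(predictions, ground_truth))
    false_pos = sum(1 for p, t in pairs if p == 1 and t == 0)
    false_neg = sum(1 for p, t in pairs if p == 0 and t == 1)
    negatives = sum(1 for _, t in pairs if t == 0)  # FP + TN
    positives = sum(1 for _, t in pairs if t == 1)  # FN + TP
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr
```

A Bloom filter never reports an inserted item as absent, so the FNR should be 0 for cities that were actually added to the filter; the FPR depends on the bit array size and the number of hash functions.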

Task2: Flajolet-Martin algorithm

  • In task2, I implemented the Flajolet-Martin algorithm (including the step of combining estimations from groups of hash functions) to estimate the number of unique cities within a window in the data stream.
  • I found proper hash functions and the proper number of hash functions for the Flajolet-Martin algorithm.
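
A minimal sketch of the estimation step for one window is shown below. It is not the repository's code: the prime P, the group sizes, the seed, and the function names are hypothetical, and the window is passed in as a plain list of city strings rather than read from a Spark Streaming source. Each hash function yields the estimate 2^R, where R is the largest number of trailing zero bits seen; the estimates are averaged within each group, and the median of the group averages is returned.

```python
import binascii
import random
from statistics import median

P = 2_147_483_647      # hypothetical prime modulus
NUM_GROUPS = 5         # hypothetical number of groups
HASHES_PER_GROUP = 4   # hypothetical group size

_rng = random.Random(7)
HASH_PARAMS = [(_rng.randrange(1, P), _rng.randrange(0, P))
               for _ in range(NUM_GROUPS * HASHES_PER_GROUP)]


def city_to_int(city):
    """Convert a non-empty city string to an integer via its UTF-8 hex form."""
    return int(binascii.hexlify(city.encode("utf8")), 16)


def trailing_zeros(n):
    """Number of trailing zero bits in n (0 for n == 0, by convention here)."""
    if n == 0:
        return 0
    return (n & -n).bit_length() - 1


def estimate_distinct(cities):
    """Flajolet-Martin estimate of the number of distinct cities in a window."""
    xs = [city_to_int(c) for c in cities if c]
    if not xs:
        return 0
    # One estimate 2**R per hash function, R = max trailing zeros observed.
    estimates = [
        2 ** max(trailing_zeros((a * x + b) % P) for x in xs)
        for a, b in HASH_PARAMS
    ]
    # Combine: average within each group, then take the median of the averages.
    group_means = [
        sum(estimates[i:i + HASHES_PER_GROUP]) / HASHES_PER_GROUP
        for i in range(0, len(estimates), HASHES_PER_GROUP)
    ]
    return int(median(group_means))
```

Repeated cities do not change the result, because the estimate depends only on the set of hash values observed. Using more hash functions generally makes the estimate more stable, at the cost of more computation per window.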

Task3: Fixed Size Sampling on Twitter Streaming

  • You will use the Twitter streaming API to implement the fixed size sampling method (Reservoir Sampling Algorithm) and find popular tags in tweets based on the samples.
  • In this task, we assume that memory can only hold 100 tweets, so we need to use the fixed size sampling method to keep only part of the tweets as a sample of the stream.
  • As the Twitter stream arrives, you can directly save the first 100 tweets in a list.
  • After that, for the nth tweet, you keep it with probability 100/n and otherwise discard it.
  • If you keep the nth tweet, you need to randomly pick one tweet in the list to be replaced. If the incoming tweet has no tag, you can ignore it.
  • You also need to keep a global variable representing the sequence number of the tweet. If the incoming tweet has no tag, the sequence number does not increase; otherwise the sequence number increases by one.
  • Every time you receive a new tweet, you need to find the tags in the sample list with the top 3 frequencies.
  • All results are written to a CSV file.
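
The sampling rules above can be sketched as follows. The class and method names are hypothetical, the connection to the Twitter stream and the CSV output are omitted, and each tweet is represented only by its list of tags. Note that this sketch returns the three most frequent tags; if several tags share a frequency, the real task may require reporting all of them.

```python
import random
from collections import Counter

RESERVOIR_SIZE = 100  # memory limit stated in the task


class TagReservoir:
    """Fixed size sample of tweets, each stored as its list of tags."""

    def __init__(self, size=RESERVOIR_SIZE, seed=None):
        self.size = size
        self.sample = []   # tag lists of the tweets currently kept
        self.seq = 0       # sequence number, counts only tweets with tags
        self.rng = random.Random(seed)

    def offer(self, tags):
        """Process one tweet; return True if the sample changed."""
        if not tags:       # tweets without tags are ignored entirely
            return False
        self.seq += 1
        if len(self.sample) < self.size:
            self.sample.append(list(tags))
            return True
        # Keep the nth tweet with probability size / n.
        if self.rng.random() < self.size / self.seq:
            # Replace a uniformly chosen tweet in the sample.
            self.sample[self.rng.randrange(self.size)] = list(tags)
            return True
        return False

    def top_tags(self, k=3):
        """Return the k most frequent tags in the sample with their counts."""
        counts = Counter(tag for tags in self.sample for tag in tags)
        return counts.most_common(k)
```

A stream listener would call `offer(tags)` for each incoming tweet and then `top_tags()` to get the current most frequent tags.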
