Spark-streaming
Bloom filtering, Flajolet-Martin algorithm, and reservoir sampling algorithm
Introduction:
This repository contains implementations of three algorithms:
- Bloom filtering.
- Flajolet-Martin algorithm.
- Reservoir sampling.
Programming Requirements:
- You will need the Spark Streaming library for Task2 (Flajolet-Martin algorithm). For Task3 you will use the Twitter streaming API: you can use the Python library tweepy or the Scala library sparkstreaming-twitter.
- You can only use Spark RDD and standard Python or Scala libraries, except for the ones in (b); i.e., no points are given if Spark DataFrame or DataSet is used.
Programming Environment:
Python 3.6, Scala 2.11 and Spark 2.3.2
Dataset:
Yelp business data, i.e., business_first.json and business_second.json
- For Bloom filtering you need to download the business_first.json and business_second.json from
- The first file is used to set up the bit array for Bloom filtering, and the second file is used for prediction.
Tasks
Task1: Bloom Filtering:
- I implemented the Bloom filtering algorithm to estimate whether the city of a business in business_second.json has appeared before in business_first.json.
- This requires choosing a proper bit array size, proper hash functions, and a proper number of hash functions for the Bloom filter.
- Some possible hash functions are f(x) = (ax + b) % m or f(x) = ((ax + b) % p) % m, where p is any prime number and m is the length of the filter bit array. You can use any combination of the parameters (a, b, p). The hash functions should stay the same once you have created them.
- Since the city of a business is a string, you need to convert it into an integer before applying the hash functions. The following code shows one possible solution:
    import binascii
    # s is the city name as a string
    int(binascii.hexlify(s.encode('utf8')), 16)
(Only exactly identical strings are treated as the same city; aliases are not considered. If a record in business_second.json does not contain the city field, or the city field is empty, I predicted zero for that record.)
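The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code from this repository: the bit array length `M`, the prime `P`, the `(a, b)` pairs in `HASH_PARAMS`, and the helper names are all assumptions chosen for the example.

```python
import binascii

M = 8192                                      # assumed bit array length
P = 1000003                                   # assumed prime, larger than M
HASH_PARAMS = [(13, 7), (31, 101), (97, 53)]  # assumed (a, b) pairs, fixed once chosen


def city_to_int(city):
    """Convert a city string to an integer via its UTF-8 hex encoding."""
    return int(binascii.hexlify(city.encode("utf8")), 16)


def bit_positions(city):
    """Apply f(x) = ((ax + b) % p) % m for every (a, b) pair."""
    x = city_to_int(city)
    return [((a * x + b) % P) % M for a, b in HASH_PARAMS]


def build_filter(cities):
    """Set the bits for every non-empty city from the first file."""
    bits = [0] * M
    for city in cities:
        if city:
            for pos in bit_positions(city):
                bits[pos] = 1
    return bits


def predict(bits, city):
    """Return 1 if the city may have been seen, 0 if not (or if it is missing/empty)."""
    if not city:
        return 0
    return int(all(bits[pos] for pos in bit_positions(city)))


bits = build_filter(["Las Vegas", "Phoenix", "Toronto"])
print(predict(bits, "Phoenix"))  # 1: a stored city is always reported as seen
print(predict(bits, ""))         # 0: empty city field
```

A city that was never inserted can still be predicted as 1 when all of its bit positions happen to be set by other cities; a larger `M` or a different number of hash functions changes how often that happens.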
Execution Details:
The code runs within 60 seconds and is evaluated on the false positive rate (FPR) and the false negative rate (FNR).
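For reference, the two rates can be computed from 0/1 predictions and ground-truth labels as sketched below. The function name `error_rates` and the convention that label 1 means "the city really appears in business_first.json" are assumptions for this example. A correctly built Bloom filter never misses an inserted item, so the FNR is expected to be 0.

```python
def error_rates(predictions, labels):
    """Return (FPR, FNR) for 0/1 predictions against 0/1 ground-truth labels."""
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    negatives = labels.count(0)
    positives = labels.count(1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


fpr, fnr = error_rates([1, 1, 0, 1], [1, 0, 0, 1])
print(fpr, fnr)  # 0.5 0.0
```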
Task2: Flajolet-Martin algorithm
- In Task2, I implemented the Flajolet-Martin algorithm (including the step of combining estimations from groups of hash functions) to estimate the number of unique cities within a window in the data stream.
- I chose suitable hash functions and a suitable number of hash functions for the Flajolet-Martin algorithm.
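A minimal sketch of the estimation step for one window is shown below. It is not the repository's code: the group layout (`NUM_GROUPS`, `GROUP_SIZE`), the constants `M` and `P`, the seeded hash parameters, and the combining rule (average within each group, then median across groups) are assumptions that follow one common formulation of the algorithm. In a Spark Streaming job, a function like `estimate_distinct` would be applied to the cities collected from each window.

```python
import binascii
import random
from statistics import median

NUM_GROUPS = 4     # assumed number of hash-function groups
GROUP_SIZE = 3     # assumed number of hash functions per group
M = 2 ** 16        # assumed hash range (16-bit values)
P = 1000003        # assumed prime, larger than M

# Fixed seed so the hash functions stay the same across windows.
_rng = random.Random(0)
HASH_PARAMS = [(_rng.randrange(1, P), _rng.randrange(0, P))
               for _ in range(NUM_GROUPS * GROUP_SIZE)]


def city_to_int(city):
    return int(binascii.hexlify(city.encode("utf8")), 16)


def trailing_zeros(n):
    """Number of trailing zero bits; 0 is treated as having none (a convention)."""
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count


def estimate_distinct(cities):
    """Estimate the number of distinct cities in one window."""
    max_zeros = [0] * len(HASH_PARAMS)
    for city in cities:
        if not city:
            continue  # skip missing or empty city fields
        x = city_to_int(city)
        for i, (a, b) in enumerate(HASH_PARAMS):
            h = ((a * x + b) % P) % M
            max_zeros[i] = max(max_zeros[i], trailing_zeros(h))
    estimates = [2 ** r for r in max_zeros]
    # Combine: average inside each group, then take the median of the group averages.
    group_means = [sum(estimates[g:g + GROUP_SIZE]) / GROUP_SIZE
                   for g in range(0, len(estimates), GROUP_SIZE)]
    return median(group_means)


window = ["Las Vegas", "Phoenix", "Toronto", "Phoenix", "Las Vegas"]
print(estimate_distinct(window))
```

Repeated cities do not change the estimate, because each hash function always maps the same city to the same value.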
Task3: Fixed Size Sampling on Twitter Streaming
- You will use the Twitter streaming API to implement the fixed size sampling method (reservoir sampling algorithm) and find popular tags on tweets based on the samples.
- In this task, we assume that memory can only hold 100 tweets, so we use the fixed size sampling method to keep only part of the tweets in the stream as a sample.
- As the Twitter stream arrives, you can save the first 100 tweets directly in a list.
- After that, you keep the nth tweet with probability 100/n and otherwise discard it.
- If you keep the nth tweet, you need to randomly pick one tweet in the list to be replaced. If the incoming tweet has no tag, you can ignore it.
- You also need to keep a global variable representing the sequence number of the tweet. If the incoming tweet has no tag, the sequence number does not increase; otherwise it increases by one.
- Every time you receive a new tweet, you need to find the tags in the sample list with the top 3 frequencies.
- All the results are written to a CSV file.
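The sampling rule described above can be sketched independently of the Twitter client. The class `TagReservoir` and its method names are hypothetical, introduced only for this illustration; each tweet is represented just by its list of tags, and how ties among the top frequencies are reported is left out.

```python
import random
from collections import Counter

RESERVOIR_SIZE = 100  # memory holds at most 100 tweets


class TagReservoir:
    def __init__(self, size=RESERVOIR_SIZE, seed=None):
        self.size = size
        self.sample = []   # one entry per kept tweet: its list of tags
        self.seen = 0      # sequence number; counts only tweets that have tags
        self.rng = random.Random(seed)

    def offer(self, tags):
        """Process one incoming tweet, given as its list of tags."""
        if not tags:
            return  # tweets without tags are ignored and do not advance the counter
        self.seen += 1
        if len(self.sample) < self.size:
            self.sample.append(tags)  # the first `size` tweets are kept directly
        elif self.rng.random() < self.size / self.seen:
            # Keep the nth tweet with probability size/n, replacing a random entry.
            self.sample[self.rng.randrange(self.size)] = tags

    def top_tags(self, k=3):
        """Return the k most frequent tags in the current sample with their counts."""
        counts = Counter(tag for tags in self.sample for tag in tags)
        return counts.most_common(k)


reservoir = TagReservoir(size=100, seed=42)
for i in range(1000):
    reservoir.offer(["spark"] if i % 2 else ["python", "spark"])
reservoir.offer([])  # no tags: ignored
print(reservoir.seen, len(reservoir.sample))  # 1000 100
print(reservoir.top_tags(1))                  # [('spark', 100)]
```

In a streaming client, `offer` would be called once per received tweet and `top_tags()` right after it, with the result appended to the CSV output.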