llijiajun / NDV_Estimation_in_distributed_environment

Sampling-based estimation of the number of distinct values in distributed environment

Environment

Simulated Experiment

  • Ubuntu
  • C++11
  • GCC 4.8

Experiments on Spark

  • Scala
  • Python 3.7
  • JDK
  • Hadoop 3.3
  • Spark 3.1

Preparation

Generate the sample data, drawn from a Poisson distribution and a Zipfian distribution:

python genpoi.py
python genzipf.py
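
For readers who want a feel for the kind of data involved before running the scripts, the following is a minimal sketch of drawing Poisson and Zipfian samples and counting their distinct values. It is not the code in genpoi.py or genzipf.py, whose contents are not shown here. The function names, the parameters (`lam`, `s`, `support`, `n`) and the truncation of the Zipfian distribution to a finite support are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random
from itertools import accumulate


def sample_poisson(lam, n, rng):
    """Draw n values from Poisson(lam) using Knuth's multiplication method.

    Hypothetical helper; suitable for small lam only, since exp(-lam)
    underflows for very large rates.
    """
    threshold = math.exp(-lam)
    values = []
    for _ in range(n):
        k = 0
        p = rng.random()
        # Keep multiplying uniforms until the product drops to the threshold.
        while p > threshold:
            k += 1
            p *= rng.random()
        values.append(k)
    return values


def sample_zipf(s, support, n, rng):
    """Draw n values from a Zipfian distribution truncated to 1..support.

    Hypothetical helper; value k is drawn with probability
    proportional to 1 / k**s.
    """
    weights = [1.0 / (k ** s) for k in range(1, support + 1)]
    cumulative = list(accumulate(weights))
    return rng.choices(range(1, support + 1), cum_weights=cumulative, k=n)


def count_distinct(values):
    """Exact number of distinct values (NDV) in a sample."""
    return len(set(values))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same data
    poisson_data = sample_poisson(lam=4.0, n=10_000, rng=rng)
    zipf_data = sample_zipf(s=1.2, support=1_000, n=10_000, rng=rng)
    print("Poisson NDV:", count_distinct(poisson_data))
    print("Zipfian NDV:", count_distinct(zipf_data))
```

The exact count from `count_distinct` is the ground truth that a sampling-based NDV estimator would be compared against. Skewed Zipfian data is typically the harder case, because many rare values never appear in a small sample.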

Languages

  • Jupyter Notebook 47.4%
  • C++ 44.4%
  • Python 6.2%
  • Cython 2.0%