There are 77 repositories under the pyspark topic.
The portable Python dataframe library
State of the Art Natural Language Processing
A curated list of awesome Apache Spark packages and resources.
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Implementing best practices for PySpark ETL jobs and applications.
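One best practice commonly advocated for PySpark ETL jobs is keeping transformation logic in pure functions, separate from Spark session setup and I/O, so it can be unit-tested without a cluster. A minimal sketch of that pattern, using plain Python dicts as a stand-in for a Spark DataFrame (all function names here are hypothetical, not taken from the repo):

```python
# Sketch of the "pure transform" pattern often recommended for PySpark ETL
# jobs: isolate business logic from I/O so it can be unit-tested without a
# cluster. Rows are modelled as plain dicts; in a real job they would be a
# Spark DataFrame and extract/load would read from and write to tables.

def extract(source_rows):
    """Extract step: in a real job this would read from files or a table."""
    return list(source_rows)

def transform(rows):
    """Pure transformation: drop rows without an id, normalise names."""
    return [
        {**row, "name": row["name"].strip().title()}
        for row in rows
        if row.get("id") is not None
    ]

def load(rows, sink):
    """Load step: in a real job this would write to a table; here, a list."""
    sink.extend(rows)
    return sink

if __name__ == "__main__":
    raw = [
        {"id": 1, "name": "  ada lovelace "},
        {"id": None, "name": "ghost"},
        {"id": 2, "name": "grace HOPPER"},
    ]
    sink = []
    load(transform(extract(raw)), sink)
    print(sink)  # two cleaned rows; the id-less row is dropped
```

Because `transform` touches no Spark objects, the same assertion-style tests run in plain CI; only a thin wrapper that swaps dicts for DataFrames needs integration testing.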
Jupyter magics and kernels for working with remote Spark clusters
PySpark-Tutorial provides basic algorithms using PySpark
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark, and PySpark.
Hopsworks - Data-Intensive AI platform with a Feature Store
MapReduce, Spark, Java, and Scala for Data Algorithms Book
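The MapReduce pattern underlying many of the book's algorithms can be conveyed with the classic word count. A pure-Python sketch of the three phases (map, shuffle, reduce) that Spark and Hadoop distribute across a cluster; function names are illustrative, not from the book:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

if __name__ == "__main__":
    lines = ["to be or not to be", "to see or not to see"]
    counts = reduce_phase(shuffle_phase(map_phase(lines)))
    print(counts)  # counts["to"] == 4, counts["be"] == 2
```

In PySpark the same pipeline collapses to roughly `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`, with the shuffle handled by the framework.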
Sparkling Water provides H2O functionality inside a Spark cluster
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF and function management, resource management, and intelligent diagnosis.
Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, and Great Expectations, together in one intuitive interface built with React Flow. In addition, we feed third-party data into data science models and products, with a focus on geospatial data. The following data connectors are currently available worldwide: a) high-resolution demographics data, b) points of interest from OpenStreetMap, and c) Google Popular Times.
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
A comprehensive Spark guide, collated from multiple sources, that can be used to learn more about Spark or as an interview refresher.
This is a repo documenting the best practices in PySpark.
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. :zap:
Pandas and Spark DataFrame comparison for humans and more!
A boilerplate for writing PySpark Jobs
Process Common Crawl data with Python and Spark
t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark
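The core idea behind t-digest is to summarise a stream as a small, sorted list of centroids (mean, count) and estimate quantiles from cumulative counts. The toy sketch below conveys only that centroid-merging idea; it is *not* Ted Dunning's actual algorithm (which bounds centroid sizes with a scale function so the tails stay accurate), and the class name is hypothetical:

```python
import bisect

class TinyTDigest:
    """Toy quantile sketch: keeps a bounded, sorted list of [mean, count]
    centroids and pairwise-merges neighbours when the list grows too large.
    Illustrative only; a real t-digest sizes centroids non-uniformly so
    extreme quantiles stay precise."""

    def __init__(self, max_centroids=64):
        self.max_centroids = max_centroids
        self.centroids = []  # sorted by mean

    def add(self, x):
        bisect.insort(self.centroids, [x, 1])
        if len(self.centroids) > self.max_centroids:
            self._compress()

    def _compress(self):
        # Merge each adjacent pair into one weighted-mean centroid.
        merged, i, cs = [], 0, self.centroids
        while i < len(cs):
            if i + 1 < len(cs):
                (m1, c1), (m2, c2) = cs[i], cs[i + 1]
                c = c1 + c2
                merged.append([(m1 * c1 + m2 * c2) / c, c])
                i += 2
            else:
                merged.append(cs[i])
                i += 1
        self.centroids = merged

    def quantile(self, q):
        # Walk cumulative counts until we pass the target rank.
        total = sum(c for _, c in self.centroids)
        target, cum = q * total, 0.0
        for mean, count in self.centroids:
            cum += count
            if cum >= target:
                return mean
        return self.centroids[-1][0]
```

Because centroids are tiny compared to the raw data and can be merged, sketches built on separate partitions combine cheaply, which is why this family of structures suits distributed settings like PySpark (where `DataFrame.approxQuantile` offers a built-in approximate alternative).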
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Spark Gotchas. A subjective compilation of Apache Spark tips and tricks
Gathers Python deployment, infrastructure, and best-practice examples.
🐍 Quick reference guide to common patterns & functions in PySpark.
Fundamentals of Spark with Python (using PySpark), with code examples
A tool for building feature stores.
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) can be used to generate large simulated/synthetic data sets for testing, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.