pyspark

There are 84 repositories under pyspark topic.

ibis-project / ibis
the portable Python dataframe library
python impala pandas database clickhouse postgresql sqlite mysql datafusion sql pyspark duckdb bigquery pyarrow mssql polars snowflake trino
Language:Python 6200
SynapseML
microsoft / SynapseML
Simple and Distributed Machine Learning
spark pyspark azure scala microsoft ml machine-learning databricks cognitive-services lightgbm http model-deployment deep-learning ai apache-spark data-science synapse big-data onnx opencv
Language:Scala 5175
spark-nlp
JohnSnowLabs / spark-nlp
State of the Art Natural Language Processing
nlp natural-language-processing spark pyspark named-entity-recognition sentiment-analysis lemmatizer spell-checker entity-extraction part-of-speech-tagger bert transformers tensorflow language-detection machine-translation text-classification llm question-answering llamacpp onnx
Language:Scala 4066
apache / linkis
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
sql spark hive pyspark livy linkis engine storage resource-manager application-manager context-service scriptis udf hive-table rest-api jobserver thrift-server jdbc presto impala
Language:Java 3392
AlexIoannides / pyspark-example-project
Implementing best practices for PySpark ETL jobs and applications.
pyspark etl-job python data-engineering spark data-science etl etl-pipeline
Language:Python 2019
uber / petastorm
Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
deep-learning machine-learning parquet parquet-files pyarrow pyspark pytorch sysml tensorflow
Language:Python 1867
awesome-spark / awesome-spark
A curated list of awesome Apache Spark packages and resources.
apache-spark awesome pyspark sparkr
Language:Shell 1839
jadianes / spark-py-notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
spark python pyspark data-analysis mllib ipython-notebook notebook ipython data-science machine-learning big-data bigdata
Language:Jupyter Notebook 1666
ptyadana / SQL-Data-Analysis-and-Visualization-Projects
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and pySpark.
sql mysql mysql-notes exercises mysqlworkbench mysql-database data-analysis postgresql postgres sqlite pgadmin tableau challenges digital-music-store sql-queries sql-data-analysis python pyspark apache-spark
Language:Jupyter Notebook 1594
hi-primus / optimus
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Language:Python 1524
jupyter-incubator / sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
spark kernel cluster livy magic sql-query pandas-dataframe jupyter pyspark kerberos notebook jupyter-notebook
Language:Python 1361
narwhals-dev / narwhals
Lightweight and extensible compatibility layer between dataframe libraries!
cudf ibis pandas polars pyarrow dask duckdb pyspark
Language:Python 1359
logicalclocks / hopsworks
Hopsworks - Data-Intensive AI platform with a Feature Store
feature-store aws azure data-science feature-engineering feature-management gcp governance kserve machine-learning mlops model-serving pyspark python serverless ml hopsworks
Language:Java 1258
mahmoudparsian / pyspark-tutorial
PySpark-Tutorial provides basic algorithms using PySpark
big-data big-data-analytics data-algorithms dataframes pyspark pyspark-sql pyspark-tutorial ranking-functions rdds spark spark-dataframes spark-rdd
Language:Jupyter Notebook 1257
graphframes
graphframes / graphframes
GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs
apache-spark big-data connected-components dataframe dataframes graphs network-motif network-motifs networks spark pyspark
Language:Scala 1092
mahmoudparsian / data-algorithms-book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
hadoop-mapreduce java distributed-computing scala mapreduce data-algorithms python machine-learning pyspark distributed-algorithms mappers reducers apache-hadoop apache-spark design-patterns partitioning
Language:Java 1081
lakehq / sail
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.
arrow artificial-intelligence big-data data data-engineering datafusion distributed-computing machine-learning pyspark python rust spark sql
Language:Rust 1047
h2oai / sparkling-water
Sparkling Water provides H2O functionality inside Spark cluster
h2o spark machine-learning integration pysparkling rsparkling big-data pyspark scala
Language:Scala 977
lyhue1991 / eat_pyspark_in_10_days
pyspark🍒🥭 is delicious，just eat it!😋😋
pyspark spark
Language:Python 821
WeBankFinTech / Scriptis
Scriptis is for interactive data analysis with script development(SQL, Pyspark, HiveQL), task submission(Spark, Hive), UDF, function, resource management and intelligent diagnosis.
hue zeppelin spark hive sql pyspark scala ide hql hive-table udf resouce-management errorcode linkis
Language:Vue 813
HariSekhon / DevOps-Python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
cloudformation python hbase json avro parquet spark pyspark travis-ci elasticsearch solr hadoop hdfs dockerhub docker linux aws devops gcp gcf
Language:Python 807
kuwala
kuwala-io / kuwala
Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data science models and products with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demographics data b) Point of Interests from Open Street Map c) Google Popular Times
data data-integration data-science open-data spatial-analysis elt kuwala open-source scraping dbt postgres pyspark python jupyter admin-boundaries google-trends population react no-code react-flow
Language:JavaScript 805
MrPowers / chispa
PySpark test helper methods with beautiful error messages
pyspark testing
Language:Python 724
ankurchavda / SparkLearning
A comprehensive Spark guide collated from multiple sources that can be referred to learn more about Spark or as an interview refresher.
spark big-data pyspark
681
mrpowers-io / quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
pyspark apache-spark
Language:Python 675
koheesio
Nike-Inc / koheesio
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
data-engineering delta-lake pydantic pyspark python
Language:Python 648
kevinschaich / pyspark-cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
pyspark pyspark-tutorial cheatsheet cheat cheatsheets reference references documentation docs data-science data spark spark-sql guide guides quickstart
625
capitalone / datacompy
Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!
compare dask data data-science dataframes fugue numpy pandas polars pyspark python snowflake snowpark spark
Language:Python 611
spark-standalone-cluster-on-docker
cluster-apps-on-docker / spark-standalone-cluster-on-docker
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. :zap:
docker jupyter pyspark python r scala spark sparkr
Language:Jupyter Notebook 497
cartershanklin / pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
spark apache-spark pyspark big-data
Language:Python 479
ericxiao251 / spark-syntax
This is a repo documenting the best practices in PySpark.
pyspark best-practices
Language:Jupyter Notebook 462
commoncrawl / cc-pyspark
Process Common Crawl data with Python and Spark
spark warc-files wet commoncrawl sparksql pyspark wat-files common-crawl
Language:Python 442
databrickslabs / dbldatagen
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
datagen pyspark python data-generation faker spark spark-streaming delta-live-tables deltalake databricks synthetic-data datageneration datagenerator
Language:Python 432
CamDavidsonPilon / tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed enviroments like PySpark
python percentile quantile estimate pyspark distributed-computing mapreduce
Language:Python 403
ekampf / PySpark-Boilerplate
A boilerplate for writing PySpark Jobs
python apache-spark pyspark boilerplate
Language:Python 394
typedef-ai / fenic
Build reliable AI and agentic applications with DataFrames
agents ai arrow dataframe-library dataframes duckdb elt etl llm orchestration polars pyspark python rust
Language:Python 378

pyspark

ibis-project / ibis

microsoft / SynapseML

JohnSnowLabs / spark-nlp

apache / linkis

AlexIoannides / pyspark-example-project

uber / petastorm

awesome-spark / awesome-spark

jadianes / spark-py-notebooks

ptyadana / SQL-Data-Analysis-and-Visualization-Projects

hi-primus / optimus

jupyter-incubator / sparkmagic

narwhals-dev / narwhals

logicalclocks / hopsworks

mahmoudparsian / pyspark-tutorial

graphframes / graphframes

mahmoudparsian / data-algorithms-book

lakehq / sail

h2oai / sparkling-water

lyhue1991 / eat_pyspark_in_10_days

WeBankFinTech / Scriptis

HariSekhon / DevOps-Python-tools

kuwala-io / kuwala

MrPowers / chispa

ankurchavda / SparkLearning

mrpowers-io / quinn

Nike-Inc / koheesio

kevinschaich / pyspark-cheatsheet

capitalone / datacompy

cluster-apps-on-docker / spark-standalone-cluster-on-docker

cartershanklin / pyspark-cheatsheet

ericxiao251 / spark-syntax

commoncrawl / cc-pyspark

databrickslabs / dbldatagen

CamDavidsonPilon / tdigest

ekampf / PySpark-Boilerplate

typedef-ai / fenic