from pyspark.sql import SparkSession
import pandas as pd
Create a SparkSession to start a session
# Set environment variables:
import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
# PySpark applications start with initializing SparkSession:
spark = SparkSession.builder.appName("spark").getOrCreate()
# Enable Arrow-based transfers between Spark and pandas (the older key "spark.sql.execution.arrow.enabled" is deprecated in Spark 3.x):
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark
# Output:
SparkSession - in-memory
SparkContext
    Spark UI
    Version: v3.4.1
    Master: local[*]
    AppName: spark
Read a dataset and create a dataframe
# Read the file:
df = spark.read.csv('Data.csv', header=True, inferSchema=True)
# Show the dataframe:
df.show()
# Show the schema details:
df.printSchema()
# df.show():
+----------+---+-----------------------+
|      Name|Age|                   Role|
+----------+---+-----------------------+
|Kirankumar| 28|Data Science Specialist|
| Paramveer| 29|           Data Analyst|
|    Gaurav| 29|                    SDE|
+----------+---+-----------------------+
# df.printSchema():
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Role: string (nullable = true)
# type(df):
pyspark.sql.dataframe.DataFrame
# List of all the column names:
df.columns
# Show only the top n rows:
df.head(3)
# In PySpark, head(n) returns a list of Row objects, not a DataFrame (see the limit() sketch after the output below).
# Output:
[Row(Name='Kirankumar', Age=28, Role='Data Science Specialist'),
Row(Name='Paramveer', Age=29, Role='Data Analyst'),
Row(Name='Gaurav', Age=29, Role='SDE')]
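If a DataFrame is needed instead of a list of Rows, limit(n) can be used; a minimal sketch:
# limit() returns a new DataFrame containing only the first n rows:
df.limit(3).show()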
# Select a particular column:
df.select('Name').show()
# Pass a list of columns to display multiple columns:
df.select(['Name', 'Age']).show()
A PySpark DataFrame can be created by passing a list of lists, tuples, dictionaries, or pyspark.sql.Row objects, a pandas DataFrame, or an RDD consisting of such a list; a couple of these are sketched below.
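For example, a minimal sketch of building a small DataFrame from a list of tuples and from a list of Row objects (the column names and values here are illustrative, not taken from the dataset above):
from pyspark.sql import Row
# From a list of tuples, supplying column names explicitly:
sdf1 = spark.createDataFrame([('Kirankumar', 28), ('Paramveer', 29)], ['Name', 'Age'])
# From a list of Row objects, where the field names supply the schema:
sdf2 = spark.createDataFrame([Row(Name='Kirankumar', Age=28), Row(Name='Paramveer', Age=29)])
sdf1.show()
sdf2.show()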
Create DataFrame using Pandas
import pandas as pd
data = {
    'Name': ['Kirankumar Yadav', 'Suraj Sanka', 'Sumit Suman'],
    'Age': [28, 28, 27],
    'Designation': ['Data Science Specialist', 'DevOps Engineer', 'Python Developer']
}
df = pd.DataFrame(data)
df
# Output:
               Name  Age              Designation
0  Kirankumar Yadav   28  Data Science Specialist
1       Suraj Sanka   28          DevOps Engineer
2       Sumit Suman   27         Python Developer
# Create spark DataFrame from Pandas DataFrame:
sdf = spark.createDataFrame(df)
sdf.show()
# Output:
+-----------------+----+------------------------+
| Name| Age| Designation|
+-----------------+----+------------------------+
| Kirankumar Yadav| 28| Data Science Specialist|
| Suraj Sanka| 28| DevOps Engineer|
| Sumit Suman| 27| Python Developer|
+-----------------+----+------------------------+
Convert the Spark DataFrame to a pandas DataFrame
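The conversion is done with toPandas(); a minimal sketch using the sdf created above (note that this collects all rows to the driver, so it is only suitable for small results):
pdf = sdf.toPandas()
pdf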
# Output:
               Name  Age              Designation
0  Kirankumar Yadav   28  Data Science Specialist
1       Suraj Sanka   28          DevOps Engineer
2       Sumit Suman   27         Python Developer
Describe the DataFrame (especially numerical features)
# Statistical descriptions (count, mean, stddev, max, min):
sdf.describe().show()
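describe() covers count, mean, stddev, min and max; summary() additionally reports the 25%, 50% and 75% percentiles. A minimal sketch:
# summary() adds approximate percentiles to the describe() statistics:
sdf.summary().show()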
Define the schema of a DataFrame
from pyspark.sql.types import StructField, StringType, IntegerType, StructType
# Define schema:
data_schema = [StructField(name='age', dataType=IntegerType(), nullable=True),
               StructField(name='name', dataType=StringType(), nullable=True)]
new_schema = StructType(fields=data_schema)
sdf = spark.read.json('People.json', schema=new_schema)
sdf.printSchema()
# Add a new column derived from an existing one:
sdf.withColumn(colName='New Age', col=sdf['age']).show()
sdf.withColumn(colName='New Age', col=sdf['age'] * 2).show()
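withColumn() does not modify sdf in place; DataFrames are immutable, so the result must be assigned to keep the new column. A minimal sketch (the variable name is illustrative):
# Keep the derived column by assigning the returned DataFrame:
sdf_new = sdf.withColumn('New Age', sdf['age'] * 2)
sdf_new.printSchema()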
Rename an existing column
# withColumnRenamed assumes the column exists; renaming a missing column is a no-op:
sdf.withColumnRenamed(existing='sex', new='gender').show()
Create a temporary view like a SQL view
sdf.createOrReplaceTempView('People')
# We can write SQL queries against the view:
spark.sql('SELECT * FROM People WHERE age > 20').show()
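Any SQL that Spark supports can be run against the view; a minimal aggregation sketch over the People view registered above:
# Count the rows and compute the average age via SQL:
spark.sql('SELECT COUNT(*) AS total, AVG(age) AS avg_age FROM People').show()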