Spark

In this repository will explain the Apache Spark an open source distributed general purpose cluster computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone.

1 Create a dataframe from the source with Spark

Spark can retrieve raw data from the source in different types.

  # JSON File
  dataframe = sc.read.json("the/path/of/the/file.json")
  
  # TXT File 
  dataframe_txt = sc.read.text("the/path/of/the/file.txt")
  
  # CSV File
  dataframe_csv = sc.read.csv("the/path/of/the/file.csv")

The raw data from the file is transformed into a spark dataframe format.

1.1 Shows the structure of the dataframe

Print only the columns of the dataframe

  dataframe = sc.read.json('file.json')
  print(dataframe.columns)

The below command shows the structure of the dataframe. Describe each column of the type.

  print(dataframe.printSchema())

root
-- _id: struct (nullable = true)

-- amazon_product_url: string (nullable = true)
-- author: string (nullable = true)
-- bestsellers_date: struct (nullable = true)


-- description: string (nullable = true)
-- price: struct (nullable = true)


-- published_date: struct (nullable = true)


-- publisher: string (nullable = true)
-- rank: struct (nullable = true)

-- rank_last_week: struct (nullable = true)

-- title: string (nullable = true)
-- weeks_on_list: struct (nullable = true)

1.2 How to show the dataframe

To show the dataframe with all the columns (20 first rows)

  print(dataframe.show())

To show a specific number of rows. The variable n indicates the number of rows.

  print(dataframe.show(n))

1.3 The shape of the dataframe

In order to print the number of rows and columns

  print(dataframe.count(), len(dataframe.columns))

The first number is the number of rows and the second number is the number of columns: 10195 12

1.4 Show the basic statistics for a column .

  print(dataframe.describe(['author']).show())

1.5 Show the data type of each column

  print(dataframe.dtypes)

Name
[('_id', 'struct<$oid:string>'), ('amazon_product_url', 'string'), ('author', 'string'), ('bestsellers_date', 'struct<$date:struct<$numberLong:string>>'), ('description', 'string'), ('price', 'struct<$numberDouble:string,$numberInt:string>'), ('published_date', 'struct<$date:struct<$numberLong:string>>'), ('publisher', 'string'), ('rank', 'struct<$numberInt:string>'), ('rank_last_week', 'struct<$numberInt:string>'), ('title', 'string'), ('weeks_on_list', 'struct<$numberInt:string>')]

Name

[('_id', 'struct<$oid:string>'), ('amazon_product_url', 'string'), ('author', 'string'), ('bestsellers_date', 'struct<$date:struct<$numberLong:string>>'), ('description', 'string'), ('price', 'struct<$numberDouble:string,$numberInt:string>'), ('published_date', 'struct<$date:struct<$numberLong:string>>'), ('publisher', 'string'), ('rank', 'struct<$numberInt:string>'), ('rank_last_week', 'struct<$numberInt:string>'), ('title', 'string'), ('weeks_on_list', 'struct<$numberInt:string>')]

1.6 The select operation

To select a specific column in spark dataframe. You have to use select command (like in SQL)

  print(dataframe.select("author").show())

author
Dean R Koontz
Stephenie Meyer
Emily Giffin
Patricia Cornwell
Chuck Palahniuk
James Patterson a...
John Sandford
Jimmy Buffett
Elizabeth George
David Baldacci
Troy Denning
James Frey
Garth Stein
Debbie Macomber
Jeff Shaara
Phillip Margolin
Jhumpa Lahiri
Joseph O'Neill
John Grisham
James Rollins
+--------------------+
only showing top 20 rows

If you want to select more than one column, then it needs to write:

  print(dataframe.select("author", "price").show(5))
  
  # Alternative command is 
  print(dataframe.select(dataframe["author"], dataframe["price"]).show(5))

author	price
Dean R Koontz	[, 27]
Stephenie Meyer	[25.99,]
Emily Giffin	[24.95,]
Patricia Cornwell	[22.95,]
Chuck Palahniuk	[24.95,]
only showing top 5 rows

1.7 The filter operation

The filter command, filters rows using the given condition. Both shows the same results.

  print(dataframe.filter(dataframe.author == "Dean R Koontz").show())
  
  # The where is an alias command (alternative command)
  print(dataframe.where(dataframe.author == "Dean R Koontz").show())

1.8 The "when" operation

In the first example, the “title” column is selected and a condition is added with a “when” condition.

  print(dataframe.select(dataframe.title, dataframe.author, when(dataframe.author == 'Debbie Macomber', 1).otherwise(0)).show())

title	author	CASE WHEN (author = Debbie Macomber) THEN 1 ELSE 0 END
ODD HOURS	Dean R Koontz	0
THE HOST	Stephenie Meyer	0
LOVE THE ONE YOU'...	Emily Giffin	0
THE FRONT	Patricia Cornwell	0
SNUFF	Chuck Palahniuk	0
SUNDAYS AT TIFFANY�S	James Patterson a...	0
PHANTOM PREY	John Sandford	0
SWINE NOT?	Jimmy Buffett	0
CARELESS IN RED	Elizabeth George	0
THE WHOLE TRUTH	David Baldacci	0
INVINCIBLE	Troy Denning	0
BRIGHT SHINY MORNING	James Frey	0
THE ART OF RACING...	Garth Stein	0
TWENTY WISHES	Debbie Macomber	1
THE STEEL WAVE	Jeff Shaara	0
EXECUTIVE PRIVILEGE	Phillip Margolin	0
UNACCUSTOMED EARTH	Jhumpa Lahiri	0
NETHERLAND	Joseph O'Neill	0
THE APPEAL	John Grisham	0
INDIANA JONES AND...	James Rollins	0
only showing top 20 rows

The table shows an overall result of the findings. It is easy to count only the results where the value is 1.

 print(dataframe.where(dataframe.author == 'Debbie Macomber').where(dataframe.title == 'TWENTY WISHES').count())

The result is: 1

1.9 Multiple operation

In one Spark query can be used multiple operation. Like multiple where.

print(dataframe.select(dataframe.title, dataframe.author, dataframe.publisher).filter(dataframe.author == 'Debbie Macomber').filter(dataframe.publisher == 'Mira').filter(dataframe.title != 'TWENTY WISHES').show())

title	author	publisher
A CEDAR COVE CHRI...	Debbie Macomber	Mira
A CEDAR COVE CHRI...	Debbie Macomber	Mira
A CEDAR COVE CHRI...	Debbie Macomber	Mira
A CEDAR COVE CHRI...	Debbie Macomber	Mira
A CEDAR COVE CHRI...	Debbie Macomber	Mira
A CEDAR COVE CHRI...	Debbie Macomber	Mira
SUMMER ON BLOSSOM...	Debbie Macomber	Mira
SUMMER ON BLOSSOM...	Debbie Macomber	Mira
SUMMER ON BLOSSOM...	Debbie Macomber	Mira
SUMMER ON BLOSSOM...	Debbie Macomber	Mira
THE PERFECT CHRIS...	Debbie Macomber	Mira
THE PERFECT CHRIS...	Debbie Macomber	Mira
HANNAH'S LIST	Debbie Macomber	Mira
HANNAH'S LIST	Debbie Macomber	Mira
HANNAH'S LIST	Debbie Macomber	Mira
HANNAH'S LIST	Debbie Macomber	Mira
CALL ME MRS. MIRACLE	Debbie Macomber	Mira
CALL ME MRS. MIRACLE	Debbie Macomber	Mira
A TURN IN THE ROAD	Debbie Macomber	Mira
A TURN IN THE ROAD	Debbie Macomber	Mira
``

Count the result of the above array.

  print(dataframe.filter(dataframe.author == 'Debbie Macomber').filter(dataframe.publisher == 'Mira').filter(dataframe.title != 'TWENTY WISHES').count())

The result is: 30

2.0 SQL queries with Spark

Another ability is taht, someone can write also SQL queries into spark. Create a register of dataframe as a global temporary view. After the global temporary view is tied to a system preserved database global_temp.

  dataframe.createGlobalTempView("people")

# The current sql query is the replication of the spark query of the 1.9 
  print(sc.sql("select title from global_temp.people where author = 'Debbie Macomber' and publisher = 'Mira' and title <> 'TWENTY     WISHES'").show())

BardisRenos / Spark