In this project, I utilized my knowledge of SparkSQL to determine key metrics about home sales data. By leveraging Spark SQL, I performed various transformations, aggregations, and calculations on the home sales data to extract valuable insights and key metrics.I also used Spark to create temporary views, partition the data, cache and uncache a temporary table, and verify that the table was uncached.
To begin, I imported the necessary PySpark SQL functions for this assignment. Following that, I read the home_sales_revised.csv data into a Spark DataFrame using the appropriate format and file reading functions.
-
After performing initial exploration of the loaded dataFrame to understand its structure and contents, I created a temporary table called
home_sales
and then SQL queries were written to calculate various key metrics from the home sales data in order to answer the following questions:- What is the average price for a four-bedroom house sold for each year?
- What is the average price of a home for each year it was built that has three bedrooms and three bathrooms?
- What is the average price of a home for each year that has three bedrooms, three bathrooms, two floors, and is greater than or equal to 2,000 square feet?
- What is the "view" rating for homes costing more than or equal to $350,000? I also determined the run time for this query.
-
Here, I explored the overall performance of cached data vs. uncached data under the following steps:
- Cache the temporary table home_sales and check if temporary table is cached.
- Using the cached data, run the query that filters out the view ratings with an average price of greater than or equal to $350,000. Determine the runtime and compare it to uncached runtime.
- Partition by the "date_built" field on the formatted parquet home sales data and create a temporary table for the parquet data.
- Run the query that filters out the view ratings with an average price of greater than or equal to $350,000. Determine the runtime and compare it to uncached runtime.
In this analysis, Spark SQL was utilized to extract key metrics from home sales data. The following steps highlighted above were followed. Caching the data in memory provided performance improvements by reducing the need for disk I/O operations. It avoided redundant computations and enabled faster iterations. Overall, the analysis demonstrated the power of Spark SQL for processing and analyzing home sales data, providing valuable insights through key metrics calculation.
Data for this dataset was generated by edX Boot Camps LLC, and is intended for educational purposes only.