Data Manipulation and Visualization using Pandas and Matplotlib

In this project, I use Python's pandas and Matplotlib libraries to manipulate and visualize data.

The project analyzes a dataset of police records of traffic stops. The dataset includes details about each stop, such as the stop date, driver demographics, type of violation, and outcome.

Project Overview

This project aims to analyze police violation records to uncover patterns and insights about the nature of traffic stops, driver demographics, and the outcomes of these stops. The analysis includes visualizations to better understand the data and identify any significant trends.

Dataset Description

The dataset contains the following columns:

  • stop_date: The date of the stop.
  • stop_time: The time of the stop.
  • country_name: The country where the stop occurred (dropped due to all null values).
  • driver_gender: The gender of the driver (M or F).
  • driver_age_raw: The raw age of the driver.
  • driver_age: The processed age of the driver.
  • driver_race: The race of the driver.
  • violation_raw: The raw violation description.
  • violation: The processed violation description.
  • search_conducted: Whether a search was conducted (True or False).
  • search_type: The type of search conducted.
  • stop_outcome: The outcome of the stop (e.g., citation, warning).
  • is_arrested: Whether the driver was arrested.
  • stop_duration: The duration of the stop.
  • drugs_related_stop: Whether the stop was drug-related.
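
Before any analysis, it helps to check how many values are missing in each column. The following is a minimal sketch on a hypothetical three-row sample that mimics a few of the columns above (the values are made up for illustration, not taken from the real dataset):

```python
import pandas as pd

# Hypothetical sample that mimics a few of the listed columns;
# the real project reads Police_dataset.csv instead.
sample = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18", "2005-02-20"],
    "country_name": [None, None, None],
    "driver_gender": ["M", "F", None],
    "driver_age": [20.0, 40.0, None],
    "search_conducted": [False, True, False],
})

# Count missing values per column.
null_counts = sample.isnull().sum()
print(null_counts)

# Columns in which every value is missing are candidates for dropping.
all_null_columns = [col for col in sample.columns if sample[col].isnull().all()]
print(all_null_columns)
```

In this sample only country_name is entirely null, which matches the reason the column is dropped in the project code.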

Setup Instructions

To set up the project locally, follow these steps:

Project code:

import pandas as pd 
import matplotlib.pyplot as plt

# Load the dataset (adjust the path to wherever the CSV lives on your machine)
data = pd.read_csv(r'D:\1. APPS\Works\Playground\Project\DATASETS\Police_dataset.csv')

# Define font dictionaries for titles and labels
font1 = {'family': 'serif', 'color': 'blue', 'size': 10}
font2 = {'family': 'serif', 'color': 'darkred', 'size': 10}

# How many null values does each column contain?
print('>> Null values per column: \n \n', data.isnull().sum())

## 1. Remove the column that contains only null / missing values:
data.drop(columns='country_name', inplace=True)
print('>> After dropping country_name: \n \n', data.isnull().sum())

## 2. For speeding, are men or women stopped more often?
speeding_data = data[data.violation == 'Speeding']  # Keep only speeding violations

gender_counts = speeding_data.driver_gender.value_counts()  # Number of stops per gender
# Plot the gender distribution for speeding violations as a bar chart
gender_counts.plot(kind='bar', color=['blue', 'pink'])
plt.title('Gender Distribution for Speeding Violations', fontdict=font1)
plt.xlabel('Gender', fontdict=font2)
plt.ylabel('Number of Stops', fontdict=font2)
plt.show()  # Render this figure before the next plot is drawn

Gender Distribution for Speeding Violations

This analysis determines whether men or women are stopped more often for speeding violations.
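
Raw counts can be easier to compare as proportions. As a rough sketch, assuming the same column names as the project and a small made-up sample, `value_counts(normalize=True)` returns each gender's share of speeding stops (rows with a missing gender are ignored by default):

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration only.
sample = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Speeding", "Speeding", "Equipment", "Speeding"],
    "driver_gender": ["M", "M", "F", "M", "F", None],
})

# Keep only speeding stops, then compute each gender's share of them.
speeding = sample[sample["violation"] == "Speeding"]
gender_share = speeding["driver_gender"].value_counts(normalize=True)
print(gender_share)
```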

## 3. Does gender affect who gets searched during a stop?
# Summing the boolean column counts the searches conducted for each gender
gender_count = data.groupby('driver_gender').search_conducted.sum()
# Plot each gender's share of all searches as a pie chart
gender_count.plot(kind='pie', autopct='%1.1f%%', startangle=140, colors=['red', 'yellow'])
plt.title('Search Conducted by Gender')
plt.show()  # Render this figure before the next plot is drawn

Search Conducted by Gender

This analysis examines whether gender affects the likelihood of a search being conducted during a stop. The pie chart shows each gender's share of all searches conducted.
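
A share of all searches depends on how many drivers of each gender were stopped in the first place. One way to account for that, sketched here on made-up data with the project's column names, is to compare the search rate (searches divided by stops) per gender. Because search_conducted is boolean, its mean is that rate:

```python
import pandas as pd

# Hypothetical sample: four stops of men, two stops of women.
sample = pd.DataFrame({
    "driver_gender": ["M", "M", "M", "M", "F", "F"],
    "search_conducted": [True, False, False, False, False, True],
})

grouped = sample.groupby("driver_gender")["search_conducted"]

# Number of searches per gender (what the pie chart is built from).
search_counts = grouped.sum()

# Fraction of stops that led to a search, per gender.
search_rate = grouped.mean()

print(search_counts)
print(search_rate)
```

In this invented sample both genders have the same number of searches but different search rates, which is why the two views can tell different stories.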

## 4. What is the mean stop duration?
# stop_duration holds text ranges, so map each range to a representative number of minutes first:
data['stop_duration'] = data['stop_duration'].map({'0-15 Min': 7.5, '16-30 Min': 24, '30+ Min': 45})

# Plot the histogram of the stop duration distribution
data['stop_duration'].plot(kind='hist', bins=[0, 15, 30, 45], edgecolor='black')
plt.title('Distribution of Stop Duration', fontdict=font1)
plt.xlabel('Stop Duration (minutes)', fontdict=font2)
plt.show()  # Render this figure before the next plot is drawn

# With numeric values in place, calculate the mean of the stop_duration column:
mean_stop_duration = data['stop_duration'].mean()
print(f'The mean value of stop_duration column is: {mean_stop_duration:.2f}')

Distribution of Stop Duration

This analysis visualizes the distribution of stop durations. Because each duration range is replaced by a single representative value, the reported mean is an approximation.
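
Any label that is not a key in the mapping dictionary becomes NaN after `map`, and NaN values are silently skipped by `mean`. The sketch below reuses the project's mapping on a made-up series and lists the labels that failed to map. The stray label "1" is a hypothetical example of dirty data, not something known to be in the dataset:

```python
import pandas as pd

# Mapping taken from the project code.
duration_map = {"0-15 Min": 7.5, "16-30 Min": 24, "30+ Min": 45}

# Hypothetical raw values, including one unexpected label and one missing value.
raw = pd.Series(["0-15 Min", "16-30 Min", "0-15 Min", "30+ Min", "1", None])

minutes = raw.map(duration_map)

# Labels that were present in the raw data but had no entry in the mapping.
unmapped = raw[minutes.isna() & raw.notna()].unique()
print(list(unmapped))

# The mean is computed over the successfully mapped values only.
mean_minutes = minutes.mean()
print(mean_minutes)
```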

## 5. Compare the age distribution for each violation:
# Overlaid histograms of driver age, one per violation type
violations = data['violation'].dropna().unique()  # Skip rows with no recorded violation
plt.figure(figsize=(12, 8))

for violation in violations:
    subset = data[data['violation'] == violation]
    plt.hist(subset['driver_age'], bins=20, alpha=0.5, label=violation)

plt.title('Age Distribution for Each Violation Type')
plt.xlabel('Driver Age')
plt.ylabel('Number of Stops')
plt.legend(title='Violation Type')  # Show the labels passed to plt.hist
plt.show()

Age Distribution for Each Violation Type

This analysis compares the age distribution of drivers for each violation type.
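
Overlaid histograms can be hard to read when many violation types overlap, so a numeric summary is a useful companion. A minimal sketch, assuming the project's column names and a made-up sample:

```python
import pandas as pd

# Hypothetical sample; one age is missing on purpose.
sample = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment", "Equipment"],
    "driver_age": [20.0, 30.0, 40.0, 25.0, None],
})

# Count of known ages, mean age, and median age per violation type.
age_summary = sample.groupby("violation")["driver_age"].agg(["count", "mean", "median"])
print(age_summary)
```

Note that "count" only includes rows where driver_age is present, so it also shows how much data each mean is based on.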

## 6. Monthly trends in violations: analyze how violations vary by month to identify seasonal patterns
# Convert stop_date to datetime and extract the month
data['stop_date'] = pd.to_datetime(data['stop_date'])
data['month'] = data['stop_date'].dt.month

# Count stops for each month and violation type
monthly_violations = data.groupby(['month', 'violation']).size().unstack()

# Plot the monthly trends, one line per violation type
monthly_violations.plot(kind='line', figsize=(12, 8))
plt.title('Monthly Trends in Violations')
plt.xlabel('Month')
plt.ylabel('Number of Violations')
plt.legend(title='Violation Type')
plt.show()

Monthly Trends in Violations

This analysis identifies seasonal trends in violations by month. Counts are grouped by calendar month only, so the same month from different years is pooled together.
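
When a violation type never occurs in a given month, `unstack()` leaves a NaN in that cell and the plotted line has a gap. As a sketch on made-up dates, `fill_value=0` records those cells as zero, and `reindex` makes sure all twelve months appear even if some have no stops at all:

```python
import pandas as pd

# Hypothetical sample spanning two years.
sample = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18", "2005-03-20", "2006-01-05"],
    "violation": ["Speeding", "Equipment", "Speeding", "Speeding"],
})
sample["stop_date"] = pd.to_datetime(sample["stop_date"])
sample["month"] = sample["stop_date"].dt.month

monthly = (
    sample.groupby(["month", "violation"])
    .size()
    .unstack(fill_value=0)               # missing month/violation pairs become 0
    .reindex(range(1, 13), fill_value=0)  # include months with no stops at all
)
print(monthly)
```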

## 7. Violation type by race:
# Count stops for each race and violation type
race_violations = data.groupby(['driver_race', 'violation']).size().unstack()

# Plot violation type by race as stacked bars
race_violations.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Violation Type by Race')
plt.xlabel('Driver Race')
plt.ylabel('Number of Violations')
plt.legend(title='Violation Type')
plt.show()

Violation Type by Race

This analysis examines the distribution of violation types by the race of the driver.
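
Stacked counts are dominated by whichever group has the most stops. To compare the mix of violations within each group instead, one option is a row-normalized cross-tabulation. This is a sketch on invented rows, using the project's column names:

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration only.
sample = pd.DataFrame({
    "driver_race": ["White", "White", "White", "White", "Black", "Black"],
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment", "Speeding", "Equipment"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
race_share = pd.crosstab(sample["driver_race"], sample["violation"], normalize="index")
print(race_share)
```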

## 8. Search conducted vs. violation type:
# Analyze whether certain types of violations are more likely to result in a search.
search_by_violation = data.groupby(['violation', 'search_conducted']).size().unstack()  # Stops per violation type, split by search_conducted

# Plot stops by violation type, stacked by whether a search was conducted
search_by_violation.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Search Conducted by Violation Type')
plt.xlabel('Violation Type')
plt.ylabel('Number of Stops')
plt.legend(title='Search Conducted', labels=['No', 'Yes'])
plt.show()

Search Conducted vs. Violation Type

This analysis explores whether certain types of violations are more likely to result in a search.
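
To put a number on "more likely", the stop count, search count, and search rate can be shown side by side for each violation type. The sketch below uses made-up rows, and the output column names (stops, searches, search_rate) are illustrative choices rather than part of the original project:

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration only.
sample = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Speeding", "Speeding", "Equipment", "Equipment"],
    "search_conducted": [False, False, False, True, True, False],
})

# Named aggregation: one output column per statistic.
search_summary = (
    sample.groupby("violation")["search_conducted"]
    .agg(stops="count", searches="sum", search_rate="mean")
    .sort_values("search_rate", ascending=False)
)
print(search_summary)
```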


