This repository is designed for practicing the use of Hugging Face datasets.
- Learn to load datasets from the Hugging Face hub using various methods.
- Explore techniques for preparing datasets.
- Understand different tokenization, batching, and padding techniques.
- Identify and choose suitable datasets from the Hugging Face hub.
- Practice applying preprocessing to real raw data from A to Z.
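As a toy illustration of the tokenization, batching, and padding ideas listed above (pure Python standing in for a real Hugging Face tokenizer; the vocabulary, unknown-token id, and pad id below are made up for the example):

```python
# Toy sketch: whitespace tokenization, then padding a batch of sequences
# to a uniform length. The vocab ids, unk id (1), and pad id (0) are
# illustrative assumptions, not a real tokenizer's values.

def tokenize(text, vocab):
    # Map whitespace-split words to integer ids; unknown words get id 1.
    return [vocab.get(word, 1) for word in text.lower().split()]

def pad_batch(batch, pad_id=0):
    # Pad every sequence in the batch to the length of the longest one.
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

vocab = {"datasets": 2, "are": 3, "fun": 4, "hugging": 5, "face": 6}
texts = ["Hugging Face datasets", "datasets are fun to use"]
batch = pad_batch([tokenize(t, vocab) for t in texts])
# batch → [[5, 6, 2, 0, 0], [2, 3, 4, 1, 1]]
```

A real tokenizer (e.g. one loaded with `AutoTokenizer`) does the same job with a learned vocabulary and handles padding via `padding=True`.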
Note
Here I used a raw Wikipedia dataset and applied preprocessing operations such as:
- Removing unnecessary columns.
- Shuffling data.
- Computing article body lengths.
- Filtering content.
- Cleaning HTML character codes.
- Converting between Datasets and DataFrames.
- Splitting the data into training and testing sets.
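One of the smaller operations above, cleaning HTML character codes, needs nothing beyond Python's standard library `html.unescape` (the sample string is invented for the example):

```python
import html

# HTML character codes such as &amp; and &quot; show up in raw Wikipedia
# text; html.unescape converts them back to plain characters.
raw = "AT&amp;T was founded in 1885 &quot;officially&quot;."
clean = html.unescape(raw)
# clean → 'AT&T was founded in 1885 "officially".'
```

Applied over a whole dataset, this one-liner typically goes inside a `map` call.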
- Clone this repository to your local machine:
git clone git@github.com:IsmaelMousa/playing-with-datasets.git
- Navigate to the playing-with-datasets directory:
cd playing-with-datasets
- Set up a virtual environment:
python3 -m venv .venv
- Activate the virtual environment:
source .venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
- Run the Notebook:
jupyter-notebook
The dataset used is a raw Wikipedia dataset stored in Parquet format.
- Load the Dataset: Load the dataset from a Parquet file.
- Remove Unnecessary Columns: Drop columns such as `URL`, `Introduction/Summary`, `Sections/Headings`, `References`, `Categories`, and `Infobox`.
- Shuffle the Dataset: Shuffle the dataset using a seed for reproducibility.
- Compute Article Body Length: Calculate the word count for the `Body` of each article.
- Filter Out Empty and Short Articles: Remove articles with empty bodies or fewer than 30 words.
- Clean HTML Character Codes: Convert HTML character codes to plain text.
- Convert Between Datasets and DataFrames: Convert the dataset to a Pandas DataFrame for easier manipulation, then convert it back to a dataset.
- Split the Dataset: Split the dataset into `training` (70%) and `testing` (30%) sets.
- Save the Processed Dataset: Save the processed dataset in various formats such as JSONL, CSV, and Parquet.
- Test Loading the Processed Dataset: Load the saved datasets to ensure they are correctly formatted and usable.