IsmaelMousa / playing-with-datasets

Practice using and preparing Hugging Face datasets for training

Home Page: https://www.kaggle.com/datasets/ismaeldwikat/train-and-test-wiki-datasets

Hugging Face Datasets Practice

Overview

This repository is designed for practicing the use of Hugging Face datasets.

Objectives

  • Learn to load datasets from the Hugging Face hub using various methods.
  • Explore techniques for preparing datasets.
  • Understand different tokenization, batching, and padding techniques (see the sketch after this list).
  • Identify and choose suitable datasets from the Hugging Face hub.
  • Practice applying preprocessing to real raw data from A to Z.
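
As a quick illustration of the tokenization, batching, and padding objective, the sketch below tokenizes a small batch of strings. The checkpoint bert-base-uncased and the example sentences are assumptions for illustration only, not necessarily what the notebook uses:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any tokenizer from the Hub works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = [
    "A short article body.",
    "A much longer article body that may eventually need to be truncated.",
]

# Pad every sequence in the batch to the longest one and truncate anything over max_length.
encoded = tokenizer(batch, padding=True, truncation=True, max_length=128)
print(encoded["input_ids"])
```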

Note

Here I used a raw Wikipedia dataset and applied preprocessing operations such as:

  1. Removing unnecessary columns.
  2. Shuffling data.
  3. Computing article body lengths.
  4. Filtering content.
  5. Cleaning HTML character codes.
  6. Converting between Datasets and DataFrames.
  7. Splitting the data into training and testing sets.

Usage

  1. Clone this repository to your local machine:
git clone git@github.com:IsmaelMousa/playing-with-datasets.git
  2. Navigate to the playing-with-datasets directory:
cd playing-with-datasets
  3. Set up a virtual environment:
python3 -m venv .venv
  4. Activate the virtual environment:
source .venv/bin/activate
  5. Install the required dependencies:
pip install -r requirements.txt
  6. Run the notebook:
jupyter-notebook

Dataset

The dataset used is a raw Wikipedia dataset stored in Parquet format.

Processing Steps

  1. Load the Dataset: Load the dataset from a Parquet file.
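
A minimal loading sketch; the file name wikipedia.parquet is a placeholder, not the actual path used in the notebook:

```python
from datasets import load_dataset

# "wikipedia.parquet" is an assumed file name for illustration.
dataset = load_dataset("parquet", data_files="wikipedia.parquet", split="train")
```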

  2. Remove Unnecessary Columns: Drop columns such as URL, Introduction/Summary, Sections/Headings, References, Categories, and Infobox.
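
Continuing from the loading sketch, dropping columns could look like this; the names below mirror the list above, but the real Parquet schema may use different identifiers:

```python
dataset = dataset.remove_columns(
    ["URL", "Introduction/Summary", "Sections/Headings", "References", "Categories", "Infobox"]
)
```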

  3. Shuffle the Dataset: Shuffle the dataset using a seed for reproducibility.
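
For example (the seed value is illustrative):

```python
dataset = dataset.shuffle(seed=42)
```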

  4. Compute Article Body Length: Calculate the word count for the Body of each article.
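
A sketch using Dataset.map, assuming the article text lives in a column named Body:

```python
# Adds a "body_length" column holding the word count of each article body.
dataset = dataset.map(lambda example: {"body_length": len(example["Body"].split())})
```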

  5. Filter Out Empty and Short Articles: Remove articles with empty bodies or fewer than 30 words.
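
Using the body_length column computed above:

```python
# Keeps only articles with at least 30 words (empty bodies have length 0).
dataset = dataset.filter(lambda example: example["body_length"] >= 30)
```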

  6. Clean HTML Character Codes: Convert HTML character codes to plain text.
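
One common approach is Python's built-in html.unescape; the notebook may use a different cleaning method:

```python
import html

# Turns entities such as &amp; or &quot; back into plain characters.
dataset = dataset.map(lambda example: {"Body": html.unescape(example["Body"])})
```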

  7. Convert Between Datasets and DataFrames: Convert the dataset to a Pandas DataFrame for easier manipulation, then convert it back to a dataset.
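
Sketch of the round trip:

```python
from datasets import Dataset

df = dataset.to_pandas()           # Dataset -> pandas DataFrame
dataset = Dataset.from_pandas(df)  # pandas DataFrame -> Dataset
```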

  8. Split the Dataset: Split the dataset into training (70%) and testing (30%) sets.
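
For example (the seed is illustrative):

```python
splits = dataset.train_test_split(test_size=0.3, seed=42)  # 70% train / 30% test
train_ds, test_ds = splits["train"], splits["test"]
```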

  9. Save the Processed Dataset: Save the processed dataset in various formats such as JSONL, CSV, and Parquet.
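
A sketch with placeholder file names:

```python
train_ds.to_json("train.jsonl")       # JSON Lines
train_ds.to_csv("train.csv")          # CSV
train_ds.to_parquet("train.parquet")  # Parquet
```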

  10. Test Loading the Processed Dataset: Load the saved datasets to ensure they are correctly formatted and usable.
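
A sketch reusing the placeholder file names from the previous step, assuming the test split was saved the same way:

```python
from datasets import load_dataset

reloaded = load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})
print(reloaded)
```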