FareedKhan-dev / NLP-1K-Stories-Dataset-Genres-100

This repository hosts a diverse NLP dataset comprising 1,000 stories spanning 100 genres for comprehensive language understanding tasks.

Home Page:https://huggingface.co/datasets/FareedKhan/1k_stories_100_genre

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

NLP-1K-Stories-Dataset-Genres-100

This repository hosts a diverse NLP dataset comprising 1,000 stories spanning 100 genres for comprehensive language understanding tasks.

HuggingFace Space

license: cc-by-2.0 (Personal or commercial use but give attribution)

language:

  • en

size_categories:

  • 1K<n<10K

pretty_name: Thousand Stories, Hundred Genres

task_categories:

  • summarization
  • text-generation
  • text-classification

tags:

  • data science
  • Storytelling
  • Genre Classification
  • NLP
  • LLM
  • Deep Learning

Dataset Documentation

Overview

This dataset contains 1000 stories spanning 100 different genres. Each story is represented in a tabular format using a dataframe. The dataset includes unique IDs, titles, and the content of each story.

Genre List

The list of all genres can be found in the genres.txt file.

reading genre_list variable

with open('story_genres.pkl', 'rb') as f:
    story_genres = pickle.load(f)

Sample of genre list:

genres = ['Sci-Fi', 'Comedy', ...]

Dataframe Format

The dataset is structured in the following format:

  1. id: Unique identifier for each dataframe.
  2. title: Title of the story.
  3. story: The content of the story.
  4. genre: The genre of the story.

Sample Dataframe

id title story genre
25235 The Unseen Miracle It was a stormy night in ... Horror
... ... ... ...

Average Length of Words

  • Title: 6 words
  • Story: 960 words

License

This dataset is licensed under the cc-by-2.0

About

This repository hosts a diverse NLP dataset comprising 1,000 stories spanning 100 genres for comprehensive language understanding tasks.

https://huggingface.co/datasets/FareedKhan/1k_stories_100_genre