maximegmd / HomoScriptor

Fuel innovation and advance language models with HomoScriptor: A vibrant, community-driven dataset for fine-tuning large language models.

HomoScriptor: A Community-driven Human-Written Dataset for Language Model Fine-Tuning

Together, let's create a remarkable dataset that fuels innovation and drives the progress of language models!

HomoScriptor is a vibrant and collaborative project that thrives on community contributions.
It serves as a curated collection of human-written datasets specifically designed for fine-tuning large language models (LLMs), organized by category into JSON files.

File Structure

  • CONTRIBUTING.md - Guidelines for contributing to the dataset.
  • data/
    • language_tasks.json - JSON file of language tasks including rhyming, poetry, tongue twisters, summarising, and some of the differences between UK and US spelling.
    • logic_tasks.json - JSON file containing logic-related tasks, including puzzles, riddles and brainteasers.
    • medicine_tasks.json - JSON file containing medicine-related questions, differential diagnoses, care plans, and medical knowledge.
  • LICENSE - License information for the dataset.
  • README.md - Contains information about the dataset.

Key Features

  • Categorized JSON Files: Our dataset is thoughtfully organized, with each category having its own JSON file. This structured approach makes it effortless to explore specific linguistic domains and seamlessly incorporate them into your LLM training pipeline.

  • Short and Long Variant Outputs: Every task in the JSON files includes both short and long variant outputs. This versatility allows you to tailor the dataset to your specific needs, accommodating a wide range of applications and use cases.

  • Open-Source and Collaborative: HomoScriptor embraces the power of community collaboration. We actively encourage and welcome contributors to join our project and contribute to its growth. Your input and expertise can help enhance the dataset's overall quality and ensure its relevance to the broader language model research community.
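As a rough illustration of how the categorized files and the short and long variant outputs might feed a training pipeline, here is a minimal Python sketch. The field names (`instruction`, `short_output`, `long_output`) and the helper names are assumptions made for this example, since this README does not document the JSON schema; check a real file under data/ and adjust the keys before relying on it. The sample record is made up and stands in for the contents of a category file.

```python
import json
from pathlib import Path

# Hypothetical field names: the README does not document the schema,
# so inspect a real file under data/ and adjust these before use.
PROMPT_KEY = "instruction"
VARIANT_KEYS = {"short": "short_output", "long": "long_output"}


def load_tasks(path):
    """Read one category file (e.g. data/logic_tasks.json) as a list of task dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        tasks = json.load(handle)
    if not isinstance(tasks, list):
        raise ValueError(f"expected a JSON array of tasks in {path}")
    return tasks


def build_pairs(tasks, variant="short"):
    """Return (prompt, response) pairs using the requested output variant."""
    if variant not in VARIANT_KEYS:
        raise ValueError(f"variant must be one of {sorted(VARIANT_KEYS)}")
    output_key = VARIANT_KEYS[variant]
    pairs = []
    for task in tasks:
        # Skip entries that lack the prompt or the chosen variant.
        if PROMPT_KEY in task and output_key in task:
            pairs.append((task[PROMPT_KEY], task[output_key]))
    return pairs


# Made-up sample standing in for the contents of a category file.
sample_tasks = json.loads("""
[
  {"instruction": "Give a word that rhymes with 'light'.",
   "short_output": "Night.",
   "long_output": "A word that rhymes with 'light' is 'night'."}
]
""")

short_pairs = build_pairs(sample_tasks, variant="short")
long_pairs = build_pairs(sample_tasks, variant="long")
```

With a local checkout, `build_pairs(load_tasks("data/logic_tasks.json"), variant="long")` would produce pairs from a single category, provided the assumed keys match the actual files.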

Contributing

We firmly believe that the strength of HomoScriptor lies in its community of contributors. We invite language model enthusiasts, researchers, and data scientists from all backgrounds to join us in shaping the future of language models. Contributing to HomoScriptor is a rewarding experience, as it allows you to leave your mark on this dynamic dataset.

If you are interested in contributing, please refer to the guidelines in CONTRIBUTING.md. We look forward to your contributions and appreciate your dedication to advancing the field of language modeling.

About

License: Apache License 2.0