ruped / clojurellm-data

Clojure LLM - Dataset curation for fine tuning an LLM for Clojure.

clojurellm-data

Dataset                    | Location                     | Size    | Launch Main                     | Launch Sample
-------------------------- | ---------------------------- | ------- | ------------------------------- | ---------------------------------
Clojure email groups       | data/clojure_mailgroup       | 25.31MB | clj -X:email-clojure-main       | clj -X:email-clojure-sample
Clojurescript email groups | data/clojurescript_mailgroup | 3.1MB   | clj -X:email-clojurescript-main | clj -X:email-clojurescript-sample
Clojurians chat logs       | data/clojurians_chat         | N/A     | N/A                             | N/A
Clojurians forum           | data/clojurians_forum        | N/A     | N/A                             | N/A
General programming        | data/general_programming     | N/A     | N/A                             | N/A
Clojure/script Projects    | data/projects                | N/A     | N/A                             | N/A
Stackoverflow              | data/stackoverflow           | N/A     | N/A                             | N/A
Synthetic Clojure          | data/synthetic               | N/A     | N/A                             | N/A
Clojure web crawl          | data/web_crawl               | N/A     | N/A                             | N/A

ClojureLLM Data Management Policy

  1. Intro
    1. What

      The data in this repository is intended for fine-tuning an LLM to serve as a Clojure coding assistant.

      This document outlines the policies and procedures by which data custodians acting on behalf of the ClojureLLM project manage the security and safety of the data used in the project. It also records other information useful to different stakeholders. Documents of this kind are sometimes called "datasheets."

    2. Who

      ClojureLLM Data is developed and supported by members of the Clojure community for the benefit of the Clojure community.

      In this document, any ClojureLLM developer working on the data in this repository shall be referred to as a "Data Custodian."

    3. Support and Funding

      Infrastructure for this project is currently funded by the ClojureLLM team, but we plan to offer a way for others to contribute funds for training runs soon.


  2. Data Sources
    1. Clojure Code Data Sources

      ClojureLLM will use the following sources of Clojure code for data.

      Some of these sources may not be used, and others may be added to this list over time.

    2. Clojure Conversation Data Sources

      ClojureLLM will use the following sources of Clojure conversations for data.

      Some of these sources may not be used, and others may be added to this list over time.

    3. Non-Clojure Code Data Sources

      ClojureLLM may leverage some existing and/or future datasets, made available in the larger open source community, so as to facilitate the translation of programming concepts from other languages into Clojure.


  3. Collection
    1. Scraping

      Data Custodians may use any Clojure web scraping tool they like, but skyscraper is recommended.

      Before scraping any given site, ensure that the site's copyright terms do not prohibit the use of its code-related data for LLM training.

      The script for a given dataset should be added as a launch alias in the project's deps.edn, so that it can be run with clj -X followed by the alias name.
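
      As a minimal sketch of such launch aliases, a deps.edn fragment might look like the following. The alias names come from the table at the top of this document; the namespace clojurellm.data.email-clojure, the function scrape!, and the argument keys are hypothetical and are not taken from the repository.

```clojure
;; deps.edn (fragment). The exec-fn symbol and the exec-args keys are
;; hypothetical placeholders; only the alias names appear in this README.
{:paths ["src"]
 :aliases
 {;; Full scrape, run with: clj -X:email-clojure-main
  :email-clojure-main
  {:exec-fn   clojurellm.data.email-clojure/scrape!
   :exec-args {:out-dir "data/clojure_mailgroup" :sample? false}}

  ;; Small sample for experimentation, run with: clj -X:email-clojure-sample
  :email-clojure-sample
  {:exec-fn   clojurellm.data.email-clojure/scrape!
   :exec-args {:out-dir "data/clojure_mailgroup" :sample? true}}}}
```

      With -X, the Clojure CLI calls the function named by :exec-fn with the :exec-args map, so one function can serve both the main and the sample alias.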

    2. Storing

      Due to storage constraints on GitHub, ClojureLLM will not store entire datasets in the repo. Instead, users running a particular pipeline will execute the download/scraping scripts for the dataset they're working on.

      However, it is advised to store a small sample of the dataset that the scripts will produce, so that folks can experiment without having to run the scrape.

      Note: We'd like to keep the repo under 100MB in general.
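
      One way to keep a stored sample small is to cap it by size. The sketch below assumes records are plain strings; the function names and the byte budget are illustrative, not part of the project.

```clojure
(defn byte-size
  "UTF-8 size of a string in bytes."
  [^String s]
  (alength (.getBytes s "UTF-8")))

(defn take-within-budget
  "Returns the longest prefix of `records` (strings) whose total UTF-8
  size stays at or under `max-bytes`."
  [max-bytes records]
  (->> records
       ;; running totals paired with the record that produced them
       (reductions (fn [[total _] r] [(+ total (byte-size r)) r]) [0 nil])
       rest
       (take-while (fn [[total _]] (<= total max-bytes)))
       (mapv second)))

(take-within-budget 10 ["abcd" "efgh" "ijkl"])
;; => ["abcd" "efgh"]
```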


  4. Sanitization
    1. Remove Garbage

      We're only interested in the Clojure code and the human language related to the Clojure code. However, that content will usually be embedded within HTML, JSON and various document formats. The surrounding markup should be stripped from the dataset.
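
      For HTML, a rough sketch of this step could look like the code below. It is a crude regex-based stripper written for illustration, with a hypothetical function name; a real pipeline would more likely rely on a proper HTML parser.

```clojure
(require '[clojure.string :as str])

(defn strip-markup
  "Crude tag stripper: drops <script>/<style> blocks, removes remaining tags,
  decodes a few common entities, and collapses runs of spaces and tabs.
  Entities are decoded after tag removal so that escaped code such as
  &lt; survives as <."
  [html]
  (-> html
      (str/replace #"(?is)<(script|style)\b.*?</\1>" "")
      (str/replace #"(?s)<[^>]+>" "")
      (str/replace "&lt;" "<")
      (str/replace "&gt;" ">")
      (str/replace "&quot;" "\"")
      (str/replace "&amp;" "&")   ; last, to avoid double decoding
      (str/replace #"[ \t]+" " ")
      str/trim))

(strip-markup "<p>Use <code>(map inc [1 2 3])</code> &amp; friends</p>")
;; => "Use (map inc [1 2 3]) & friends"
```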

    2. Deduplication

      Some code or conversation data may exist in more than one location on the internet, so the scraped data may contain duplicates. It is therefore the responsibility of the data custodian defining the download script to eliminate duplicates both within their own dataset and across the other datasets in the repo.
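
      A minimal sketch of exact deduplication, ignoring only whitespace differences, is shown below. It assumes records are strings and that `seen` holds the normalized texts already present in other datasets; both the names and the normalization rule are illustrative. Near-duplicate detection would need a different technique.

```clojure
(require '[clojure.string :as str])

(defn normalize
  "Canonical form used only for comparison: trimmed, with whitespace
  runs collapsed to a single space."
  [s]
  (-> s str/trim (str/replace #"\s+" " ")))

(defn dedupe-records
  "Keeps the first occurrence of each record whose normalized text has not
  been seen. `seen` is a collection of normalized texts from other datasets."
  [seen records]
  (first
   (reduce (fn [[kept seen] r]
             (let [k (normalize r)]
               (if (contains? seen k)
                 [kept seen]
                 [(conj kept r) (conj seen k)])))
           [[] (set seen)]
           records)))

(dedupe-records #{"(inc 1)"} ["(inc  1)" "(dec 1)" " (dec 1)\n" "(+ 1 2)"])
;; => ["(dec 1)" "(+ 1 2)"]
```

      For large datasets, storing a hash of the normalized text instead of the text itself would keep the `seen` set smaller.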

    3. Remove ClojureLLM Outputs

      ClojureLLM outputs may end up in chat logs, and we don't want to waste space allocated for human-written training data. This likely won't be a huge problem; just be sure to avoid including massive amounts of LLM outputs, especially from ClojureLLM.

      We may eventually develop data watermarks and tools that help automatically detect ClojureLLM-generated code in text, so that it can be elided from the dataset.

    4. Toxicity and Bias

      Make an effort to remove toxicity, sarcasm, hyperbole, bias, personal opinions, jokes, or anything not related to Clojure code or advice around the usage and understanding of Clojure code and other related programming technologies.

      We plan to have LLM-based sentiment/semantic classification tools in the future to help automate the detection of toxicity and general divergence from the target content for ClojureLLM. Different datasets will then be able to leverage those tools.

    5. PII

      Data custodians should make an effort to remove Personally Identifiable Information, including but not limited to:

      • credit card numbers
      • personal names (except library authors)
      • emails (except library/solution contact info)
      • home addresses
      • social security numbers
      • phone numbers
      • financial data
      • publicly accessible IP addresses (not local)
      • employer of speaker
      • social network handles
      • names that speakers mention explicitly

      We plan to provide generic PII scanning tools that all of the dataset collection scripts can share.
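
      Until such tools exist, a few of the items above can be approximated with regular expressions. The sketch below is illustrative only: the patterns, placeholders, and function name are assumptions, they cover just emails, US-style social security numbers, and card-like digit runs, and they will produce both false positives and misses.

```clojure
(require '[clojure.string :as str])

(def pii-patterns
  "Ordered [placeholder regex] pairs. Illustrative, not exhaustive."
  [["<EMAIL>" #"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"]
   ["<SSN>"   #"\b\d{3}-\d{2}-\d{4}\b"]
   ;; 13 to 16 digits, optionally separated by single spaces or hyphens
   ["<CARD>"  #"\b(?:\d[ -]?){13,16}\b"]])

(defn redact-pii
  "Replaces every match of each pattern in `text` with its placeholder."
  [text]
  (reduce (fn [t [placeholder re]] (str/replace t re placeholder))
          text
          pii-patterns))

(redact-pii "Mail me at jo@example.com, SSN 123-45-6789.")
;; => "Mail me at <EMAIL>, SSN <SSN>."
```

      Names, employers, and addresses are not matched by patterns like these and would need other techniques or manual review.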

    6. Injection Attack Detection

      One potential danger of LLMs is that an attacker can surreptitiously poison public datasets, either by corrupting the data or by injecting prompts or information that produce undesirable inferences or side effects in an LLM trained on, and inferring from, that data.

      We are still learning about this mode of attack, but as our understanding increases, we plan to automate the detection and removal of these instances from ClojureLLM datasets.

    7. User Anonymization

      Data custodians should make an effort to anonymize the users associated with Clojure code and conversations around Clojure code.

      This includes:

      • cross-conversation anonymization
      • pseudonyms will be well-known names
        • ["Bob" "Alice" "Jamal" "Myleen" "Oliver" etc]
      • 50% male / 50% female names (open to comment)
      • pseudonyms will be ethnically / culturally diverse
      • Redact descriptions of human likenesses ("oh, no, I have green eyes")
      • Redact any mention of children or family members

      These items are open to feedback and expansion. In general, we want to represent a diversity of backgrounds for a dataset that is helpful to existing and future Clojurists around the world.

      Tools for anonymizing users will be shared across the different dataset pipelines as they are built out.
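
      As a sketch of how pseudonym assignment might work, the code below replaces each speaker in one conversation with a name from a fixed pool, in order of first appearance. It assumes messages are maps with :user and :text keys, and it reads "cross-conversation anonymization" as meaning the mapping is not shared between conversations; both are assumptions, as are the function and var names.

```clojure
(def pseudonyms
  "Pool taken from the example list above; a real pool would be much larger."
  ["Bob" "Alice" "Jamal" "Myleen" "Oliver"])

(defn anonymize-conversation
  "Replaces each :user in `messages` with a pseudonym assigned in order of
  first appearance. The mapping is local to this call, so the same person
  is not linkable across conversations. With more speakers than names,
  pseudonyms repeat. Mentions inside :text are not rewritten."
  [messages]
  (let [users   (distinct (map :user messages))
        mapping (zipmap users (cycle pseudonyms))]
    (mapv #(update % :user mapping) messages)))

(anonymize-conversation
 [{:user "rh1977"  :text "Try transducers."}
  {:user "clj_fan" :text "Thanks!"}
  {:user "rh1977"  :text "np"}])
;; => [{:user "Bob", :text "Try transducers."}
;;     {:user "Alice", :text "Thanks!"}
;;     {:user "Bob", :text "np"}]
```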


  5. Data Enrichment
    1. Clojure Code Generation

      A large part of this project will involve the synthetic generation of large amounts of Clojure code, so as to give the ClojureLLM a very deep intuition around how the Clojure compiler behaves.

      This is an open question, and we hope the community will give feedback on how best to accomplish this.

      Eventually we may use Clojure code generation tools to help grow out the other datasets in this repository as well.
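
      Since the approach is still open, the following is only one possible starting point, not the project's method: generate small random arithmetic forms and pair each with the value it evaluates to. All names here are hypothetical, and the generator is seeded so that runs are repeatable.

```clojure
(import 'java.util.Random)

(defn gen-expr
  "Builds a random arithmetic form with at most `depth` levels of nesting.
  Leaves are integers from 0 to 9; operators are +, - and *."
  [^Random rng depth]
  (if (or (zero? depth) (zero? (.nextInt rng 3)))
    (.nextInt rng 10)
    (list (nth '[+ - *] (.nextInt rng 3))
          (gen-expr rng (dec depth))
          (gen-expr rng (dec depth)))))

(defn gen-examples
  "Returns a vector of `n` maps pairing printed source with its result."
  [seed n]
  (let [rng (Random. seed)]
    (vec (repeatedly n #(let [form (gen-expr rng 3)]
                          {:code (pr-str form) :result (eval form)})))))

;; Each element has the shape {:code "<printed form>" :result <number>}.
(gen-examples 42 3)
```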

    2. Conversation Grammar

      Everyday written conversation between humans often contains grammatical and syntactical errors. Tools that automatically fix those mistakes can increase the comprehensibility of the training data.

      Again, data custodians that build tools for cleaning up grammar in a particular dataset should make an effort to make those same tools available in the rest of the datasets in this repository.


  6. Usage
    1. Code Completion

      One model will be used primarily for Code Completion. This model will be smaller, fit in more applications and will execute faster, for more immediate feedback while the Clojurist is typing.

    2. Code Conversation / Pair Programming

      Another model will be used for pair programming, asking ClojureLLM questions and getting a written response in natural language explaining the answer.

      This model will necessarily be larger, so that it can understand more general natural language concepts and translate between them and code concepts. It will also be slower, which is acceptable because the Clojurist will see the words of the response written out in real time.


  7. Distribution
    1. Open / Restricted

      Some of the models we'll be starting with may have licenses that restrict what we can do with them. Some allow commercial use, others do not. We intend to support and work on both.

      If the model that provides the best experience for Clojure devs is a restricted one, we may still want to use it in some projects, such as an open source LSP server that can use an LLM. Because an open source project like that is not commercial, it is free to use models that have a commercial restriction. We're not going to settle for a lesser model for that purpose just because the better one cannot be used commercially.

      That being said, a stated purpose of this project is also to enable the development of commercial Clojure applications on top of LLM-based technologies.

    2. Large / Small

      As stated above, code completion models will likely need to be smaller, in order to be fast and useful. Conversational models will likely need to be larger.

      That being said, this space is evolving fast, and smaller and larger models with different performance characteristics will keep coming out. We intend to experiment with many of them.


  8. Maintenance
    1. Community Feedback

      The direction of this project is an open community effort and it is likely to change as things progress, so we encourage everyone to engage and provide feedback on what can be improved and where you'd like to see things go.

      Feel free to file a PR to update this document, or file an issue if you have any questions or concerns.

    2. Dataset Versioning and Updates

      Some datasets will be made available in the Releases of this project. A zip file of all the datasets will be made available on Hugging Face.

License:MIT License


Languages

Language:Clojure 100.0%