SciPhi is a configurable Python framework designed to tackle the challenges of efficiently training large language models (LLMs) through synthetic data. At its core, SciPhi offers:
- Configurable Data Generation: Efficiently produce LLM-mediated synthetic training and tuning datasets tailored to your specific needs.
- The Library of Phi: An initiative to leverage AI-driven techniques to craft high-quality open source textbooks.
- Engage with our active Discord community for discussions, troubleshooting, and collaboration.
- For specialized support or collaboration inquiries, feel free to reach out directly.
Introduction:
The Library of Phi is an initiative sponsored by SciPhi. Its primary goal is to democratize access to high-quality textbooks. The project utilizes AI-driven techniques to generate textbooks by processing information from the MIT OCW course webpages.
Workflow:
The workflow encompasses data scraping, data processing, YAML configuration creation, and RAG over all of Wikipedia, with intermittent work done by LLMs.
- Scrape MIT OCW Course Webpages.
- Extract Syllabi.
- Formulate Table of Contents.
- Craft Textbooks.
poetry run python sciphi/examples/library_of_phi/generate_textbook.py run --do-wiki=False --textbook=Aerodynamics_of_Viscous_Fluids --log-level=DEBUG
- Draft a table of contents and save as textbook_name.yaml.
- Place it in [Your Working Directory]/sciphi/data/library_of_phi/table_of_contents.
- Format similarly to Aerodynamics_of_Viscous_Fluids.yaml.
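The placement step above can be checked with a short script. This is a minimal sketch, not part of SciPhi: the helper names `toc_path` and `toc_is_in_place` and the textbook name `My_Custom_Textbook` are hypothetical, and only the directory layout comes from the steps above.

```python
from pathlib import Path

# Directory named in the steps above, relative to the working directory.
TOC_SUBDIR = Path("sciphi/data/library_of_phi/table_of_contents")


def toc_path(working_dir, textbook_name):
    """Return where the table-of-contents YAML for `textbook_name` should live."""
    return Path(working_dir) / TOC_SUBDIR / f"{textbook_name}.yaml"


def toc_is_in_place(working_dir, textbook_name):
    """Return True if the YAML file exists at the expected location."""
    return toc_path(working_dir, textbook_name).is_file()


if __name__ == "__main__":
    # "My_Custom_Textbook" is a hypothetical name used only for illustration.
    path = toc_path(".", "My_Custom_Textbook")
    print(path, "found" if path.is_file() else "missing")
```

Running this from the working directory before generating a textbook shows whether the file is where the steps above say it should be.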
- Enable the --do-wiki flag by setting --do-wiki=True.
- In .env, set:
  - WIKI_SERVER_URL
  - WIKI_SERVER_USERNAME
  - WIKI_SERVER_PASSWORD
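Before enabling Wikipedia RAG, it can help to confirm that the three settings are actually visible to the process. The following is a minimal sketch, assuming the values from .env have already been loaded into the process environment (how that happens is not specified here); the helper name `missing_wiki_vars` is hypothetical.

```python
import os

# Variable names taken from the list above.
REQUIRED_WIKI_VARS = (
    "WIKI_SERVER_URL",
    "WIKI_SERVER_USERNAME",
    "WIKI_SERVER_PASSWORD",
)


def missing_wiki_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_WIKI_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_wiki_vars()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All Wikipedia server settings are present.")
```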
Output:
Generated textbooks reside in:
[Your Working Directory]/sciphi/data/library_of_phi
Note: The Wikipedia embeddings server is not yet public. In the meantime, ensure your configuration aligns with our specifications if you wish to use Wikipedia for RAG. If you would like to peruse more example textbooks, go here.
# Clone the repository
git clone https://github.com/emrgnt-cmplxty/sciphi.git
cd sciphi
# Install dependencies
# If you don't have poetry installed: pip3 install poetry
poetry install -E all
# Set up your environment
# Note: Modify the .env file as needed after copying
cp .env.example .env && vim .env
- Python: >= 3.11 and < 3.12
- Poetry: For package management
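The Python requirement above can be verified before installing. This is a minimal sketch; the function name `python_version_supported` is hypothetical and simply encodes the stated range.

```python
import sys


def python_version_supported(version_info=None):
    """Return True for Python >= 3.11 and < 3.12, the range stated above."""
    major, minor = (version_info or sys.version_info)[:2]
    return (3, 11) <= (major, minor) < (3, 12)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_version_supported() else "not supported"
    print(f"Python {current} is {status}.")
```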
Install optional dependencies for enhanced features:
poetry install -E <extra_name>
Options include:
- anthropic_support: For Anthropic models.
- hf_support: For diverse model access with the HuggingFace package.
- openai_support: For OpenAI models.
- vllm_support: For VLLM, aiding fast inference.
- llama_index_support: For LlamaIndex, enhancing grounded synthesis.
- chroma_support: For Chroma support in large vector databases.
- all: Includes all dependencies (excluding vllm, which needs separate installation).
- all_with_cuda: Everything.
For fully configurable and flexible data generation, execute the relevant runner.py with various command-line arguments.
poetry run python sciphi/examples/basic_data_gen/runner.py --provider_name=openai --model_name=gpt-4 --log_level=INFO --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need_basic_split
The above command generates a single sample from GPT-4 using the textbooks_are_all_you_need_basic_split configuration and appends the output to example_output.jsonl.
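To inspect the generated samples, the output file can be read back line by line. This is a minimal sketch that assumes the file follows the JSON Lines convention implied by its .jsonl extension (one JSON value per line); it makes no assumption about the fields inside each record, and the helper name `read_jsonl` is hypothetical.

```python
import json


def read_jsonl(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `read_jsonl("example_output.jsonl")` returns one entry per generated sample; because the runner appends to the file, repeated runs accumulate records.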
The long-term goal of the SciPhi framework is to provide a training-feedback loop.
See arguments and their default values in the README. Notable ones include --provider_name, --model_name, and --temperature.
Step 0: Scrape MIT OCW for course details.
poetry run python sciphi/examples/library_of_phi/raw_data/ocw_scraper.py scrape
Step 1: Convert scraped data into 'draft' syllabi YAMLs.
poetry run python sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py run
Step 2: Refine the draft YAML into the finalized syllabi.
poetry run python sciphi/examples/library_of_phi/gen_step_2_clean_syllabi.py run
Step 3: Transition the syllabi to a 'draft' table of contents.
poetry run python sciphi/examples/library_of_phi/gen_step_3_draft_table_of_contents.py run
Step 4: Produce clean table of contents YAML files.
poetry run python sciphi/examples/library_of_phi/gen_step_4_clean_table_of_contents.py run
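Steps 0 through 4 can be chained so that a failure stops the run before later steps consume bad input. The following is a minimal sketch, not a SciPhi utility: the script paths and subcommands are copied from the steps above, while `build_commands` and `run_pipeline` are hypothetical names, and the sketch assumes it is run from the repository root with poetry on the PATH.

```python
import subprocess

SCRIPT_DIR = "sciphi/examples/library_of_phi"

# (script path relative to SCRIPT_DIR, subcommand), in the order listed above.
STEPS = [
    ("raw_data/ocw_scraper.py", "scrape"),
    ("gen_step_1_draft_syllabi.py", "run"),
    ("gen_step_2_clean_syllabi.py", "run"),
    ("gen_step_3_draft_table_of_contents.py", "run"),
    ("gen_step_4_clean_table_of_contents.py", "run"),
]


def build_commands():
    """Return one argument list per step, matching the commands above."""
    return [
        ["poetry", "run", "python", f"{SCRIPT_DIR}/{script}", subcommand]
        for script, subcommand in STEPS
    ]


def run_pipeline(runner=subprocess.run):
    """Run every step in order; check=True makes a nonzero exit raise."""
    for command in build_commands():
        print("Running:", " ".join(command))
        runner(command, check=True)
```

Calling `run_pipeline()` executes the five steps in sequence. The `runner` parameter exists so the sequence can be dry-run or tested without invoking poetry.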
Licensed under the Apache-2.0 License.
If using SciPhi in academic work, please cite:
@software{Emergent_AGI_SciPhi,
author = {Colegrove, Owen},
doi = {Pending},
month = {09},
title = {{SciPhi}},
url = {https://github.com/emrgnt-cmplxty/sciphi},
year = {2023}
}