jsurrea / LLM-Latino

Collection of ETL scripts used to create a dataset of text in Spanish to train Large Language Models.

Seneca Extractor

Description

seneca_extractor is a Python package for extracting files and metadata from Seneca, the institutional repository of Universidad de los Andes. It is part of the LLM-Latino project and aims to make the data stored in the repository easier to access and process.

Authors

  • Juan Sebastian Urrea Lopez
  • David Santiago Ortiz Almanza

Installation

A Python virtual environment is recommended for installing this package, since it avoids dependency conflicts. The following steps set up the environment and install seneca_extractor:

  1. Create and activate a virtual environment (optional, but recommended):

    • On Windows:
      python -m venv venv
      .\venv\Scripts\activate
    • On Unix or MacOS:
      python3 -m venv venv
      source venv/bin/activate
  2. Install the package:

    • Navigate to the directory where the source code is located and run:
      pip install -e .

    This installs seneca_extractor in editable mode, so changes to the package source code take effect immediately without reinstalling.
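
After the install step, a quick check can confirm that the package is visible from the active virtual environment. The snippet below is a minimal sketch, not part of the project: the helper name `check_install` is hypothetical, and it assumes the importable module and the installed distribution are both named `seneca_extractor`.

```python
import importlib.metadata
import importlib.util


def check_install(module_name: str = "seneca_extractor") -> str:
    """Return a short status message about whether a module can be imported."""
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        return f"{module_name}: not found in this environment"
    try:
        # Assumption: the distribution name matches the module name.
        version = importlib.metadata.version(module_name)
    except importlib.metadata.PackageNotFoundError:
        version = "unknown version"
    return f"{module_name}: importable ({version})"


if __name__ == "__main__":
    print(check_install())
```

If the message reports that the module was not found, the usual causes are that the virtual environment is not activated or that `pip install -e .` was run from a different directory than the one containing the source code.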
