etl-pipeline google-cloud-platform llm-datasets python web-scraping

Seneca Extractor

Description

seneca_extractor is a Python package for extracting files and metadata from Seneca, the institutional repository of Universidad de los Andes. It is part of the LLM-Latino project and aims to make the data stored in the repository easier to access and manipulate.

Authors

Juan Sebastian Urrea Lopez
David Santiago Ortiz Almanza

Contact

Installation

We recommend installing this package inside a Python virtual environment to avoid dependency conflicts. The following steps set up the environment and install seneca_extractor:

Create and activate a virtual environment (optional, but recommended):
- On Windows:
```
python -m venv venv
.\venv\Scripts\activate
```
- On Unix or macOS:
```
python3 -m venv venv
source venv/bin/activate
```
Install the package:
- Navigate to the directory where the source code is located and run:
```
pip install -e .
```
This will install seneca_extractor in editable mode, which means any changes to the package source code will be immediately available without needing to reinstall the package.
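
After following the steps above, a quick sanity check can confirm that the virtual environment is active and that the package is importable. The snippet below is a minimal sketch that uses only the Python standard library. It assumes the import name matches the package name, `seneca_extractor`; the helper names `in_virtualenv` and `is_installed` are hypothetical and not part of the package.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True if the interpreter is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Assumption: the import name is the same as the package name.
    name = "seneca_extractor"
    print(f"virtual environment active: {in_virtualenv()}")
    print(f"{name} importable: {is_installed(name)}")
```

If the second line reports `False`, re-run `pip install -e .` from the source directory with the virtual environment activated.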

About

Collection of ETL scripts used to create a dataset of text in Spanish to train Large Language Models.
