arjenpdevries / CIFF2DuckDB

Load CIFF indices into Tables in DuckDB

CIFF to DuckDB

Introduction

The Common Index File Format (CIFF) was introduced as a binary data exchange format for open-source search engines to interoperate by sharing index structures.

CIFF has been adopted by the OpenWebSearch.EU project to distribute (partitions of) Web indexes.

This repository provides the code necessary to load a CIFF file through Arrow into DuckDB.

The goal is to load and transform the CIFF data into an index for the DuckDB Full Text Search extension. The current version does not yet fully achieve that goal.
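One part of such a transformation is turning postings lists into flat rows suitable for a table. The sketch below assumes, following the CIFF specification, that the document identifiers within a postings list are stored as gaps (each value is the difference from the previous docid). The `PostingsList` type and `flatten` function are hypothetical names for illustration, not the repository's API, and term ids are simply assigned in input order.

```python
from typing import Iterable, Iterator, NamedTuple


class PostingsList(NamedTuple):
    """Hypothetical stand-in for one CIFF postings list."""
    term: str
    postings: list[tuple[int, int]]  # (docid gap, term frequency)


def flatten(lists: Iterable[PostingsList]) -> Iterator[tuple[int, str, int, int]]:
    """Yield (termid, term, docid, tf) rows with gaps decoded to absolute docids."""
    for termid, plist in enumerate(lists):
        docid = 0
        for gap, tf in plist.postings:
            docid += gap  # running sum undoes the gap encoding
            yield termid, plist.term, docid, tf


example = [
    PostingsList("duck", [(1, 2), (3, 1)]),
    PostingsList("index", [(4, 5)]),
]
rows = list(flatten(example))
print(rows)
```

Rows in this shape map directly onto the columns of a relational table, which is why a flat representation is convenient before loading.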

Preliminaries

Install the DuckDB CLI for testing:

wget https://artifacts.duckdb.org/latest/duckdb-binaries-linux.zip
unzip -p duckdb-binaries-linux.zip duckdb_cli-linux-amd64.zip | funzip > ./duckdb ; chmod a+rx ./duckdb

Load Index

python ciff-arrow.py

License: MIT License

