ocaml-semsearch-jsoo

OCaml + js_of_ocaml + SBERT + TensorFlow.js.

This project converts an SBERT model from PyTorch to TensorFlow to TensorFlow.js, and loads that model from OCaml code transpiled to JavaScript. The test code runs it under Node.js, but it can run in the browser as well (with the pure-JS, WASM, WebGL, or WebGPU TensorFlow.js backend).
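
As a rough illustration of that last step, here is a minimal, hypothetical js_of_ocaml sketch of loading a converted model through TensorFlow.js from OCaml. The package choice, model path, and wiring are placeholders for illustration, not this project's actual API:

(* Hypothetical sketch: calling TensorFlow.js from OCaml via Js.Unsafe. *)
open Js_of_ocaml

(* Node-style require, invoked from transpiled OCaml code. *)
let require name =
  Js.Unsafe.fun_call (Js.Unsafe.js_expr "require")
    [| Js.Unsafe.inject (Js.string name) |]

let () =
  (* Assumption: @tensorflow/tfjs-node is installed; it provides the
     "tensorflow" backend and a file:// model loader under Node.
     The model path below is a placeholder. *)
  let tf = require "@tensorflow/tfjs-node" in
  let model_promise =
    Js.Unsafe.meth_call tf "loadGraphModel"
      [| Js.Unsafe.inject (Js.string "file://./tfjs_model/model.json") |]
  in
  (* The promise resolves to the loaded SBERT graph model, which can then be
     executed on tokenized input to produce sentence embeddings. *)
  ignore model_promise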

Why? If you need semantic text embeddings in a js_of_ocaml project, this is one of the easiest ways to get them.

If you’re interested in running your code natively and don’t want anything to do with JavaScript, this is not for you.

This package internally uses require() (for TensorFlow.js, the BERT tokenizer…), so make sure the dependencies listed in package.json are available at runtime, including any optional dependencies corresponding to your desired TF.js backend.
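
For illustration, a runtime require() issued from js_of_ocaml looks roughly like the following sketch. The helper and the backend wiring are illustrative, not this project's code; the point is that if an optional package such as @tensorflow/tfjs-backend-wasm is absent from node_modules, the require throws at runtime:

(* Hedged sketch: why package.json dependencies must be present at runtime. *)
open Js_of_ocaml

let require name =
  Js.Unsafe.fun_call (Js.Unsafe.js_expr "require")
    [| Js.Unsafe.inject (Js.string name) |]

let () =
  let tf = require "@tensorflow/tfjs" in
  (* Requiring the optional backend package registers the "wasm" backend
     with TF.js; if it is missing, this call fails at runtime. *)
  ignore (require "@tensorflow/tfjs-backend-wasm");
  (* setBackend returns a promise; error handling is omitted for brevity. *)
  ignore (Js.Unsafe.meth_call tf "setBackend"
            [| Js.Unsafe.inject (Js.string "wasm") |])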

System requirements

Versions indicated are what I used, not hard requirements.

  • opam: 2.1.5
  • yarn: 1.22.19
  • Python: 3.11.3

Supported TF.js backends

The cpu, tensorflow, and wasm backends have been tested. Others should work as well, but may require changes.

Usage

Set up

# Export a SBERT model to TensorFlow.js
./export_model.sh

# Install Node dependencies
yarn install --ignore-optional
yarn add -O @tensorflow/tfjs-backend-wasm

# Set up opam switch
opam switch create . ocaml.4.14.1 --no-install --yes

# Install dev dependencies
opam install ocaml-lsp-server merlin --yes

# Install main dependencies
cd semsearch-jsoo
opam install . --deps-only --yes --with-test

Try it out

# in semsearch-jsoo
dune test

Caveats

  • The embedding dimension cannot be configured at runtime for now.
  • The tokenizer is fixed and relies on the bert-tokenizer npm package.
    • It hasn’t been tested beyond ASCII text.
  • The vector search is linear.

Possible / future improvements

  • Generally cleaning up the code
  • dune-project needs proper dependency specifications.
  • JS bindings
  • Documentation
  • Makefile
  • Allow loading models from IndexedDB
  • Pure-OCaml BERT tokenizer: would drop an unnecessary JS dependency.
  • Bind to huggingface/candle, which can target WASM, and has support for BERT. This would remove the need for the ugly Torch -> TF -> TFJS conversion. It also has the benefit of enabling portability to native code, though that would require binding twice (once through FFI, once through JSOO).
  • Approximate Nearest Neighbor (ANN) search algorithms instead of a linear search. Note that in practice, an optimized, vectorized linear search is very fast (4-16 ms for a few thousand entries), so ANN only becomes necessary when scaling up dramatically or under tight real-time constraints. A sketch of such a linear scan is given after this list.
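
For reference, the linear search mentioned above amounts to a brute-force cosine-similarity scan over the stored embeddings. The following OCaml sketch is illustrative only; the function names and types are hypothetical, not this project's API:

(* Brute-force cosine-similarity search over float-array embeddings. *)
let dot a b =
  let s = ref 0. in
  Array.iteri (fun i x -> s := !s +. x *. b.(i)) a;
  !s

let norm a = sqrt (dot a a)

let cosine a b = dot a b /. (norm a *. norm b)

(* Return the index (and score) of the corpus entry most similar to [query]. *)
let nearest ~(query : float array) ~(corpus : float array array) =
  let best_i = ref 0 and best_s = ref neg_infinity in
  Array.iteri
    (fun i doc ->
       let s = cosine query doc in
       if s > !best_s then (best_s := s; best_i := i))
    corpus;
  (!best_i, !best_s)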

Acknowledgements

Thanks to Philipp Schmid for his article on converting Sentence Transformers to TensorFlow.

Citation

If you’ve used this software in a scientific publication, please cite it as follows:

@software{BERREBY_ocaml-semsearch-jsoo_2023,
    author = {BERREBY, Yohaï-Eliel},
    month = aug,
    title = {{ocaml-semsearch-jsoo}},
    url = {https://github.com/yberreby/ocaml-semsearch-jsoo},
    version = {0.0.1},
    year = {2023}
}

About

OCaml + js_of_ocaml + SBERT + TensorFlow.js

License: GNU General Public License v3.0


Languages

OCaml: 67.0%, Python: 22.1%, JavaScript: 8.9%, Shell: 2.0%