ocaml-semsearch-jsoo
OCaml + js_of_ocaml + SBERT + TensorFlow.js.
This project converts a SBERT model from PyTorch to TensorFlow to TensorFlow.js, and loads that model in OCaml code transpiled to JavaScript. The test code runs it under Node.js, but this can run in the browser as well (with pure JS, WASM, WebGL or WebGPU TensorFlow.js backend).
Why? If you need semantic text embeddings in a js_of_ocaml
project, this is one of the easiest ways to do so.
If you’re interested in running your code natively and don’t want anything to do with JavaScript, this is not for you.
This package internally uses require()
(for TensorFlow.js, the BERT
tokenizer…), so make sure that the dependencies described in package.json
are available at runtime, including any optional dependencies corresponding to your desired TF.js backend.
System requirements
Versions indicated are what I used, not hard requirements.
- opam: 2.1.5
- yarn: 1.22.19
- Python: 3.11.3
Supported TF.js backends
cpu
, tensorflow
and wasm
were tested. Others should work as well, but may require changes.
Usage
Set up
# Export a SBERT model to TensorFlow.js
./export_model.sh
# Install Node dependencies
yarn install --ignore-optional
yarn add -O @tensorflow/tfjs-backend-wasm
# Set up opam switch
opam switch create . ocaml.4.14.1 --no-install --yes
# Install dev dependencies
opam install ocaml-lsp-server merlin --yes
# Install main dependencies
cd semsearch-jsoo
opam install . --deps-only --yes --with-test
Try it out
# in semsearch-jsoo
dune test
Caveats
- The embedding dimension cannot be configured at runtime for now.
- The tokenizer is fixed, relies on bert-tokenizer (npm)
- Hasn’t been tested beyond ASCII text.
- The vector search is linear.
Possible / future improvements
- Generally cleaning up the code
dune-project
needs proper dependency specifications.- JS bindings:
brr
?- LexiFi/gen_js_api?
- Documentation
Makefile
- Allow loading models from IndexedDB
- Pure-OCaml BERT tokenizer: would drop an unnecessary JS dependency.
- Bind to huggingface/candle, which can target WASM, and has support for BERT. This would remove the need for the ugly Torch -> TF -> TFJS conversion. It also has the benefit of enabling portability to native code, though that would require binding twice (once through FFI, once through JSOO).
- Approximate Nearest Neighbor search algorithms instead of a linear search. Note that in practice, optimized, vectorized linear search is very fast (4-16ms for a few thousand entries). This is only necessary when scaling up dramatically or with tight real-time constraints.
Acknowledgements
Thanks to Philipp Schmid for his article on converting Sentence Transformers to TensorFlow.
Citation
If you’ve used this software in a scientific publication, please cite it as follows:
@software{BERREBY_ocaml-semsearch-jsoo_2023,
author = {BERREBY, Yohaï-Eliel},
month = aug,
title = {{ocaml-semsearch-jsoo}},
url = {https://github.com/yberreby/ocaml-semsearch-jsoo},
version = {0.0.1},
year = {2023}
}