florisboard / nlp

NLP core implementation for FlorisBoard as well as Tools for preprocessing raw word data into dictionary files and n-gram models

FlorisBoard NLP

This repository is the heart of the NLP functionality of FlorisBoard. It consists of two major components:

  • nlpcore: The Core NLP library for FlorisBoard, which is responsible for managing dictionaries and for generating word suggestions/performing spell check for given input data. Compiles for both desktop and Android targets.
  • nlptools: Debug tools for preprocessing raw word data into dictionary files and a debug UI for testing the core library without having to compile the full FlorisBoard project. Compiles only for desktop targets.

This repository is currently in alpha and will move along with the 0.4.0 FlorisBoard development cycle.

If you want to contribute to this repository please see CONTRIBUTING!

Building & Running the project

This project can be built in two ways:

  • Submodule build: NLP is built together with FlorisBoard and included as a submodule in the main project. If you are unsure what you are doing, this option is probably the correct one.
  • Standalone build: NLP is built as a standalone utility and can only run on the machine on which it was compiled. Useful for testing and dictionary generation.

System requirements

To be able to compile this project on your PC, you must run a supported host system:

System Submodule build Standalone build Notes
Windows 7/8/10/11 -
Windows 10/11 with WSL2 ⚠️ tested with distro: Ubuntu 23.10
macOS untested but should be supported; please provide feedback
Debian 11.0 ⚠️ submodule: python3 package is outdated in the package repository
Debian 12.0+ ⚠️ standalone: cmake and clang packages are outdated in the package repository
Ubuntu 22.04 -
Ubuntu 22.10 ⚠️ standalone: cmake and clang packages are outdated in the package repository
Ubuntu 23.04 ⚠️ standalone: cmake and clang packages are outdated in the package repository
Ubuntu 23.10+ ⚠️ standalone: cmake and clang packages are outdated in the package repository
Other Linux systems try yourself

Submodule build (targeting Android)

Requirements if you compile NLP as a submodule for the main FlorisBoard project:

  • Android SDK
  • Java 17
  • Android NDK r25 or newer
  • CMake 3.22+
  • Ninja 1.10+
  • Clang 14.x+ & libc++ (bundled with Android NDK)
  • Python 3.10+
  • Git
  • Optional: GNU make 3.80+
    • Only required if ICU_BUILD_FROM_SOURCE is enabled
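
Before starting a submodule build, it can help to confirm that the command-line tools on your PATH meet the minimum versions listed above. The following is a minimal sketch, not part of this repository: the helper names `version_ge` and `check_tool` are hypothetical, and the comparison assumes a `sort` that supports `-V` (GNU coreutils). The Android SDK, the NDK and Java are not checked here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repository): compare installed
# tool versions against the minimums listed in the requirements above.

# version_ge A B: succeeds if version A is equal to or newer than version B.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME MINIMUM COMMAND...: run COMMAND, take the first dotted
# version number from its output and compare it against MINIMUM.
check_tool() {
    local name="$1" minimum="$2" found
    shift 2
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$name: not found (need $minimum+)"
        return 1
    fi
    found="$("$@" 2>&1 | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)"
    if version_ge "$found" "$minimum"; then
        echo "$name: $found (ok)"
    else
        echo "$name: ${found:-unknown} (need $minimum+)"
        return 1
    fi
}

missing=0
check_tool cmake 3.22 cmake --version || missing=1
check_tool ninja 1.10 ninja --version || missing=1
check_tool python3 3.10 python3 --version || missing=1

if [ "$missing" -eq 0 ]; then
    echo "all checked tools meet the minimum versions"
else
    echo "some tools are missing or too old"
fi
```

For the standalone build, the same helper could be called with the higher minimums given in that section (CMake 3.28, Ninja 1.11).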

If you have trouble installing the requirements or are unsure how to install some of them, you can refer to this excellent guide written by @Thithic, which walks you step by step through setting up the requirements on Ubuntu 22.04.

Building

NLP cannot be built as a standalone module if Android is targeted. In the main FlorisBoard project's root directory execute ./gradlew clean && ./gradlew assembleDebug to build FlorisBoard.

Standalone build (targeting Desktop)

Requirements if you compile NLP and NLP tools as a standalone utility for desktop:

  • CMake 3.28+
  • Ninja 1.11+
  • Clang 16.x+ (see below if your distro does not have version 16 yet)
    • The following packages are needed (all version 16.x+):
      • clang clang-tools libc++-dev libc++abi-dev
    • Tip: the default clang packages may not be clang 16.x+ yet; in that case you can try to install
      • clang-16 clang-tools-16 libc++-16-dev libc++abi-16-dev
  • Git
  • Optional: GNU make 3.80+
    • Only required if ICU_BUILD_FROM_SOURCE is enabled
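
As a sketch of how the versioned package names from the tip above could be installed: the snippet below assumes a Debian or Ubuntu system that uses `apt`, and that your package repository actually provides these versioned packages. The function name `clang_packages` is hypothetical. The install command is only printed, so you can review it before running it yourself.

```shell
#!/usr/bin/env bash
# clang_packages VERSION: print the versioned package names listed above
# for the given clang major version (hypothetical helper).
clang_packages() {
    local v="$1"
    echo "clang-$v clang-tools-$v libc++-$v-dev libc++abi-$v-dev"
}

# Print the command instead of executing it (assumes apt is the package manager).
echo "sudo apt install $(clang_packages 16)"
```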

Initializing the local source repository

# One-time setup
git clone https://github.com/florisboard/nlp.git
cd nlp
git submodule update --init --recursive

Set up cmake/clang compiler (Ubuntu 22.04+ only)

The toolchain setup is automated for Ubuntu 22.04+ and can be invoked like this:

./setup-toolchain.sh

After a successful run of the script, you can use ./cmake.sh in place of the normal cmake command.

Set up cmake/clang compiler (Manual)

Before you can build this project for Desktop targets you need to set up the clang compiler. First check which version you have installed:

clang -v
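
If you prefer to script this check, the following minimal sketch extracts the major version from the compiler's version banner and compares it against 16. The function name `clang_major` is hypothetical, and the parsing assumes the banner contains the text "clang version X.Y.Z", which may not hold for every vendor build.

```shell
#!/usr/bin/env bash
# clang_major: read `clang --version` style output on stdin and print the
# major version number (hypothetical helper).
clang_major() {
    sed -nE 's/.*clang version ([0-9]+)\..*/\1/p' | head -n 1
}

if command -v clang >/dev/null 2>&1; then
    major="$(clang --version 2>&1 | clang_major)"
    if [ "${major:-0}" -ge 16 ]; then
        echo "clang $major found: no further setup needed"
    else
        echo "clang ${major:-unknown} found: a newer compiler is required"
    fi
else
    echo "clang not found"
fi
```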

The reported version is 16.x or newer

Great! You do not have to set up anything else, and you can skip to the project build section!

The reported version is 15.x or older

In this case you do not have a supported version of clang installed, and we need to download and integrate the compiler manually. Head to https://github.com/llvm/llvm-project/releases and download the appropriate prebuilt llvm-project for your system. The example below assumes you are on Ubuntu 22.04 or newer.

# Download clang 17.0.x for Ubuntu 22.04+
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-17.0.5/clang+llvm-17.0.5-x86_64-linux-gnu-ubuntu-22.04.tar.xz
tar -xf clang+llvm-17.0.5-x86_64-linux-gnu-ubuntu-22.04.tar.xz
# Alternatively you can also download clang 16.0.x for Ubuntu 22.04+
#   not recommended anymore due to minor issues between cmake&clang, however still supported
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-16.0.4/clang+llvm-16.0.4-x86_64-linux-gnu-ubuntu-22.04.tar.xz
tar -xf clang+llvm-16.0.4-x86_64-linux-gnu-ubuntu-22.04.tar.xz

After this we need to change the compiler path to the downloaded one, else the build will fail. To change it, open CMakePresets.json in a text editor and change the C/CXX compiler vars like so:

    ...
    "CMAKE_C_COMPILER": "/path/to/downloaded/llvm-project/build/bin/clang",
    "CMAKE_CXX_COMPILER": "/path/to/downloaded/llvm-project/build/bin/clang++",
    ...
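
To avoid typos in these paths, a small helper can verify that both compiler binaries exist and print the two entries ready to paste into CMakePresets.json. This is a sketch under assumptions: the function name `compiler_vars` is hypothetical, and it assumes that the unpacked toolchain directory contains a `bin/` subdirectory with `clang` and `clang++`. Adjust the directory to wherever you extracted the archive.

```shell
#!/usr/bin/env bash
# compiler_vars DIR: check that DIR/bin/clang and DIR/bin/clang++ are
# executable, then print the two preset entries (hypothetical helper).
compiler_vars() {
    local dir="$1" tool
    for tool in clang clang++; do
        if [ ! -x "$dir/bin/$tool" ]; then
            echo "missing: $dir/bin/$tool" >&2
            return 1
        fi
    done
    printf '"CMAKE_C_COMPILER": "%s/bin/clang",\n' "$dir"
    printf '"CMAKE_CXX_COMPILER": "%s/bin/clang++",\n' "$dir"
}

# Example call; the directory name assumes the 17.0.5 archive from above
# was unpacked in the current directory.
compiler_vars "$PWD/clang+llvm-17.0.5-x86_64-linux-gnu-ubuntu-22.04" \
    || echo "toolchain directory not found, adjust the path"
```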

Building

# Initialize CMake
cmake --preset=release .

# Build the project
cmake --build --preset=release
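
If you used the automated toolchain setup, the same two steps can be run through ./cmake.sh instead. The sketch below picks the wrapper when it is present in the current directory and falls back to the system cmake otherwise. The function names `pick_cmake` and `build_release` are hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# pick_cmake: print ./cmake.sh if the wrapper exists and is executable in
# the current directory, otherwise print cmake (hypothetical helper).
pick_cmake() {
    if [ -x ./cmake.sh ]; then
        echo ./cmake.sh
    else
        echo cmake
    fi
}

# build_release: configure and build with the release preset, stopping at
# the first failing step. Run it from the repository root.
build_release() {
    local cmake_cmd
    cmake_cmd="$(pick_cmake)"
    "$cmake_cmd" --preset=release . && "$cmake_cmd" --build --preset=release
}

echo "using: $(pick_cmake)"
```

Calling `build_release` from the repository root then performs the configure and build steps shown above.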

Running the NLP tools binary

The NLP tools binary is intended to run on a desktop PC for debugging the core library and for preprocessing various data sources into dictionary files.

./build/release/bin/nlptools

TODO: documentation

Known issues

  • Compilation issues in submodule build mode:
    • This is known and tracked in florisboard#2218, please report there if you encounter any issues.
  • Memory usage of the NLP core trie map is high:
    • This is indeed a big issue right now, but unlike with the previous NLP attempt we are not restricted by Java's heap space restrictions anymore, only by natively available RAM, so for now we have to bite the bullet (or reduce the entries in the preprocessed dictionaries). If you think you have an idea on how to decrease the memory usage significantly (without overcomplicating the codebase) I am all ears!
  • Size of preprocessed dictionaries is quite large:
    • While this is also true for now, it poses less of an issue, as these preprocessed dictionaries are not included in the APK and thus do not count towards the strict maximum size of the FlorisBoard APK.
  • The suggestion ranking is weird for some inputs
    • The weighting system of the suggestions needs a lot of refinement. If you have expertise in this field and want to help, I would greatly appreciate it :)
  • There's little documentation for parts of this code base
    • I am aware and will work on this bit by bit in the future.
  • There aren't schemas available for all json files in this project / schemas.florisboard.org does not work yet
    • I am aware and will work on this in the near future.

External libraries used

License

Copyright 2022-2023 Patrick Goldinger

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Languages

C++ 72.5%, Python 17.2%, Shell 6.0%, CMake 4.3%