RD17 / ambar

:mag: Ambar: Document Search Engine

Home Page: https://ambar.cloud/

Support for Swedish [sv-SE] OCR

Yavari opened this issue

Can you please add support for the Swedish language, or guide me on how to do it so that I can open a pull request?

Here is some code I am using in another project. Please let me know if you want me to create a pull request.

    "ambar_sv": {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "icu_folding_se",
        "swedish_stop",
        "swedish_stemmer"
      ]
    },

    "swedish_stemmer": {
      "type": "stemmer",
      "language": "swedish"
    },

    "swedish_stop": {
      "type": "stop",
      "stopwords": "_swedish_"
    },

    "icu_folding_se": {
      "type": "icu_folding",
      "unicodeSetFilter": "[^åäöÅÄÖ]"
    }
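
For reference, here is a minimal sketch of how the fragment above could sit in a complete index-settings body. It assumes the standard Elasticsearch layout, with the analyzer under `settings.analysis.analyzer` and the three custom filters under `settings.analysis.filter`. Where Ambar itself keeps these settings is not shown in this issue, and the helper names `build_swedish_analysis` and `undefined_filters` are hypothetical. The check only confirms that every filter the analyzer references is either defined or listed as built-in.

```python
import json

# Built-in Elasticsearch token filters that need no custom definition.
BUILTIN_FILTERS = {"lowercase"}


def build_swedish_analysis() -> dict:
    """Return an index-settings body containing the Swedish analyzer.

    The "settings" / "analysis" wrapper is an assumption based on the
    standard Elasticsearch layout, not on Ambar's own configuration.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "ambar_sv": {
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "icu_folding_se",
                            "swedish_stop",
                            "swedish_stemmer",
                        ],
                    }
                },
                "filter": {
                    "swedish_stemmer": {"type": "stemmer", "language": "swedish"},
                    "swedish_stop": {"type": "stop", "stopwords": "_swedish_"},
                    "icu_folding_se": {
                        "type": "icu_folding",
                        # Fold everything except the Swedish letters å, ä, ö.
                        "unicodeSetFilter": "[^åäöÅÄÖ]",
                    },
                },
            }
        }
    }


def undefined_filters(body: dict) -> dict:
    """Map each analyzer name to the filters it references but nothing defines."""
    analysis = body["settings"]["analysis"]
    known = BUILTIN_FILTERS | set(analysis.get("filter", {}))
    missing = {}
    for name, analyzer in analysis.get("analyzer", {}).items():
        unknown = [f for f in analyzer.get("filter", []) if f not in known]
        if unknown:
            missing[name] = unknown
    return missing


if __name__ == "__main__":
    body = build_swedish_analysis()
    print(undefined_filters(body))
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON could then be used as the request body when creating an index, provided the analysis-icu plugin is available on the node.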

The analysis-icu plugin needs to be installed for icu_folding.

    RUN bin/elasticsearch-plugin install analysis-icu

I guess https://github.com/RD17/ambar/blob/master/Pipeline/Dockerfile also needs the following line:

    tesseract-ocr-swe \
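
Once the image is rebuilt with that package, one way to confirm that the Swedish language data is visible to Tesseract is to inspect the output of `tesseract --list-langs`. The sketch below is only an illustration: the function names `parse_list_langs` and `swedish_ocr_available` are hypothetical, and it assumes the `tesseract` binary is on the `PATH` inside the container. It does not show how Ambar's pipeline selects an OCR language.

```python
import subprocess


def parse_list_langs(output: str) -> set:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks and the "List of available languages ..." header line.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.add(line)
    return langs


def swedish_ocr_available() -> bool:
    """Run tesseract and report whether the "swe" language data is present."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract releases print the list to stderr, so check both streams.
    return "swe" in parse_list_langs(result.stdout + "\n" + result.stderr)
```

Calling `swedish_ocr_available()` inside the pipeline container should return `True` if the `tesseract-ocr-swe` package was installed correctly.
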
commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.