saidsef / tika-document-to-text

Apache Tika - Toolkit detects and extracts metadata

Apache Tika Implementation

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Prerequisite

Deployment

Kubernetes Deployment

Create the namespace with kubectl create ns web. Then, assuming you've checked out this repo, apply the manifests:

kubectl kustomize deployment/ | kubectl apply -f -

Or, to deploy via Argo CD:

kubectl apply -f deployment/argocd/application.yml

NOTE: Remember to update the Ingress hostname.

Take it for a test drive:

Via CLI:

First forward the service to your machine with kubectl port-forward -n web svc/tika-ui 8080, then send a request:

curl -d @test/url.json http://localhost:8080/ -H 'Content-Type: application/json'
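If you'd rather script the same call, here is a minimal Python sketch of the curl request above, using only the standard library. It assumes the port-forward is running and that you run it from the repo root so test/url.json resolves. The helper names build_request and send are made up for this example and are not part of the repo, and the response is treated as plain text because its format isn't documented here.

```python
import json
import urllib.request


def build_request(endpoint: str, payload: bytes) -> urllib.request.Request:
    """Build the same POST that the curl command sends, without sending it.

    Hypothetical helper for illustration; not part of this repo.
    """
    # Fail early on malformed JSON instead of letting the service reject it.
    json.loads(payload)
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the raw response body as text."""
    # Assumption: the body decodes as UTF-8; undecodable bytes are replaced.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (needs the port-forward from the step above):
#
#     with open("test/url.json", "rb") as fh:
#         print(send(build_request("http://localhost:8080/", fh.read())))
```

Building the request separately from sending it lets you inspect the method, headers, and body before anything touches the cluster.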

Or, via Web UI:

Using a browser visit:

http://localhost:8080/

About

License: MIT License


Languages

JavaScript 42.3%, Dockerfile 29.2%, HTML 17.2%, Python 11.3%