FieldDB / FieldDB

An offline/online field database which adapts to its user's terminology and I-Language. http://fielddb.github.io

Home Page: http://lingsync.org

Import Georgian legal judgements as PDFs to search

cesine opened this issue

for #1514

Install the ingest-attachment plugin:
https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

Remove the index
curl -X DELETE $SEARCH_URL/courtge

Resources
https://gist.github.com/karmi/5594127
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html
http://stackoverflow.com/questions/37861279/how-to-index-a-pdf-file-in-elasticsearch-5-0-0-with-ingest-attachment-plugin

Create the index 
curl -X PUT "$SEARCH_URL/courtge?pretty=true" -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2
        }
    },
    "mappings" : {
        "datum" : {
            "properties" : {
                "attachment.data" : {
                    "type": "text"
                }
            }
        }
    }
}'

Or create the index with the default shard and replica settings
curl -X PUT "$SEARCH_URL/courtge?pretty=true" -d'
{ 
    "mappings" : { 
        "datum" : { 
            "properties" : { 
                "attachment.data" : { 
                    "type": "text"
                } 
            } 
        } 
    } 
}'


Create a pipeline
curl -X PUT "$SEARCH_URL/_ingest/pipeline/attachment?pretty=true" -d'
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}'
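
Before indexing real documents, the pipeline can be dry-run through the ingest simulate API, which returns the extracted fields without storing anything. A minimal sketch, assuming the pipeline above was created under the id `attachment`; `simulate_body` and `simulate_attachment` are hypothetical helper names, not part of FieldDB or Elasticsearch.

```shell
# Sketch only: simulate_body and simulate_attachment are hypothetical helpers.

# Wrap a base64 string ($1) in the request body of the simulate API.
# Base64 text contains no characters that need JSON escaping.
simulate_body() {
  printf '{"docs": [{"_source": {"data": "%s"}}]}\n' "$1"
}

# Send the body to the simulate endpoint of the "attachment" pipeline.
# Assumes SEARCH_URL is set as in the other commands in this issue.
# The Content-Type header is harmless on 5.x and required from 6.0 on.
simulate_attachment() {
  simulate_body "$1" | curl -X POST \
    "$SEARCH_URL/_ingest/pipeline/attachment/_simulate?pretty=true" \
    -H 'Content-Type: application/json' \
    --data-binary @-
}

# Usage (not run here):
#   simulate_attachment "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
```

If the plugin is installed and the pipeline is valid, the response should contain an `attachment` object with the extracted text.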


Index a base64 string
curl -X PUT "$SEARCH_URL/courtge/datum/from_base64?pipeline=attachment&pretty=true" -d'
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}'

Index a file encoded as base64
result=`openssl base64 -in temp.txt`
echo $result
curl -X PUT "$SEARCH_URL/courtge/datum/fromfile?pipeline=attachment&pretty=true" -d "{
  \"data\"  : \"$result\"
}"

# Index larger file fails 
# result=`uuencode -m  temp.txt`
# echo $result
# curl -X PUT "$SEARCH_URL/courtge/datum/fromlargerfile?pipeline=attachment&pretty=true" -d "{
#   \"data\"  : \"$result\"
# }"

Indexing a larger text file fails
result=`openssl base64 -in 1376314887_1358519941_11.txt | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n//g'`
echo "{
  \"data\"  : \"$result\"
}"
curl -X PUT "$SEARCH_URL/courtge/datum/fromlargerfile?pipeline=attachment&pretty=true" -d "{
  \"data\"  : \"$result\"
}"

Indexing a larger PDF fails
result=`openssl base64 -in 1376314887_1358519941_11.pdf | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n//g'`
echo $result
curl -X PUT "$SEARCH_URL/courtge/datum/fromlargerfile?pipeline=attachment&pretty=true" -d "{
  \"data\"  : \"$result\"
}"
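
The issue does not record the error message for the larger files, so the cause is an assumption here: one likely candidate is the size of the string passed to `curl -d` on the command line. A possible workaround is to write the JSON body to a file and let curl read it with `--data-binary @file`. The sketch below uses the `base64` utility with `tr` in place of `openssl base64` with `sed`; `build_payload` and `post_payload` are hypothetical helper names.

```shell
# Sketch only: build_payload and post_payload are hypothetical helpers.

# Write {"data": "<base64 of file $1>"} to the file $2.
# tr removes the line wrapping that base64 encoders may add.
build_payload() {
  {
    printf '{"data": "'
    base64 < "$1" | tr -d '\r\n'
    printf '"}\n'
  } > "$2"
}

# PUT the payload file $1 as document id $2 through the attachment pipeline.
# curl reads the body from the file, so it never appears as an argument.
# Assumes SEARCH_URL is set as in the other commands in this issue.
post_payload() {
  curl -X PUT "$SEARCH_URL/courtge/datum/$2?pipeline=attachment&pretty=true" \
    -H 'Content-Type: application/json' \
    --data-binary "@$1"
}

# Usage (not run here):
#   build_payload 1376314887_1358519941_11.pdf body.json
#   post_payload body.json fromlargerfile
```

If the request still fails with the body in a file, the error returned by Elasticsearch would be the next thing to check.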


Search the index 
curl "$SEARCH_URL/courtge/_search?pretty=true"
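
The search above returns every document. To search the extracted text itself, a `match` query can be sent in the request body. A minimal sketch, assuming the pipeline keeps the attachment processor's default target field, so the text ends up in `attachment.content`; `search_body` and `search_judgements` are hypothetical helper names.

```shell
# Sketch only: search_body and search_judgements are hypothetical helpers.

# Build a match query for the search term $1.
# The term is inserted as-is, so it must not contain double quotes
# or backslashes.
search_body() {
  printf '{"query": {"match": {"attachment.content": "%s"}}}\n' "$1"
}

# Run the query against the courtge index.
# Assumes SEARCH_URL is set as in the other commands in this issue.
search_judgements() {
  search_body "$1" | curl -X POST "$SEARCH_URL/courtge/_search?pretty=true" \
    -H 'Content-Type: application/json' \
    --data-binary @-
}

# Usage (not run here):
#   search_judgements "Lorem"
```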