Import Georgian legal judgements as PDFs to search
cesine opened this issue
for #1514
Install the ingester:
https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html
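Per the linked page, the plugin is installed on each node with `bin/elasticsearch-plugin install ingest-attachment`, followed by a node restart. A rough way to confirm it took effect is to look for the plugin name in the `_cat/plugins` listing. In the sketch below, `has_ingest_attachment` is a made-up helper name, and `$SEARCH_URL` is assumed to point at the cluster as in the other commands in this issue.

```shell
# has_ingest_attachment is a hypothetical helper: it reads a plugin listing
# on stdin and succeeds if the ingest-attachment plugin appears in it.
has_ingest_attachment() {
  grep -q 'ingest-attachment'
}

# Only query the cluster when SEARCH_URL is set.
if [ -n "${SEARCH_URL:-}" ]; then
  if curl -s "$SEARCH_URL/_cat/plugins" | has_ingest_attachment; then
    echo "ingest-attachment is installed"
  else
    echo "ingest-attachment is missing"
  fi
fi
```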
Remove the index
curl -X DELETE $SEARCH_URL/courtge
Resources
https://gist.github.com/karmi/5594127
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html
http://stackoverflow.com/questions/37861279/how-to-index-a-pdf-file-in-elasticsearch-5-0-0-with-ingest-attachment-plugin
Create the index
curl -X PUT "$SEARCH_URL/courtge?pretty=true" -d'
{
"settings" : {
"index" : {
"number_of_shards" : 3,
"number_of_replicas" : 2
}
},
"mappings" : {
"datum" : {
"properties" : {
"attachment.data" : {
"type": "text"
}
}
}
}
}'
Alternatively, create the index with default shard and replica settings (run only one of the two; creating an index that already exists returns an error)
curl -X PUT "$SEARCH_URL/courtge?pretty=true" -d'
{
"mappings" : {
"datum" : {
"properties" : {
"attachment.data" : {
"type": "text"
}
}
}
}
}'
Create a pipeline
curl -X PUT "$SEARCH_URL/_ingest/pipeline/attachment?pretty=true" -d'
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data"
}
}
]
}'
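Before indexing anything, the pipeline can be tried out with the ingest `_simulate` endpoint, which runs the processors on a sample document and returns the result without storing it. This is only a sketch: `simulate_body` is a made-up helper, the sample is the same base64 RTF string used in the next step, and the `Content-Type` header is added as a precaution for newer Elasticsearch versions.

```shell
# simulate_body is a hypothetical helper: it wraps a base64 string in the
# request body shape that the pipeline _simulate endpoint expects.
simulate_body() {
  printf '{"docs":[{"_source":{"data":"%s"}}]}\n' "$1"
}

sample="e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Only call the cluster when SEARCH_URL is set; the body is piped in on stdin.
if [ -n "${SEARCH_URL:-}" ]; then
  simulate_body "$sample" |
    curl -s -X POST "$SEARCH_URL/_ingest/pipeline/attachment/_simulate?pretty=true" \
      -H 'Content-Type: application/json' --data-binary @-
fi
```

If the plugin and pipeline are set up correctly, the response should contain an `attachment` object with the extracted text.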
Index a base64-encoded document
curl -X PUT "$SEARCH_URL/courtge/datum/from_base64?pipeline=attachment&pretty=true" -d'
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}'
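Index a file as base64
# openssl wraps base64 output at 64 characters; strip the newlines so the JSON string stays valid
result=$(openssl base64 -in temp.txt | tr -d '\n')
echo "$result"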
curl -X PUT "$SEARCH_URL/courtge/datum/fromfile?pipeline=attachment&pretty=true" -d "{
\"data\" : \"$result\"
}"
# Index larger file fails
# result=`uuencode -m temp.txt`
# echo $result
# curl -X PUT "$SEARCH_URL/courtge/datum/fromlargerfile?pipeline=attachment&pretty=true" -d "{
# \"data\" : \"$result\"
# }"
Indexing a larger text file fails
# Strip the newlines that openssl inserts into its base64 output
result=$(openssl base64 -in 1376314887_1358519941_11.txt | tr -d '\n')
echo "{
\"data\" : \"$result\"
}"
curl -X PUT "$SEARCH_URL/courtge/datum/fromlargerfile?pipeline=attachment&pretty=true" -d "{
\"data\" : \"$result\"
}"
Indexing a larger PDF fails
# Strip the newlines that openssl inserts into its base64 output
result=$(openssl base64 -in 1376314887_1358519941_11.pdf | tr -d '\n')
echo "$result"
curl -X PUT "$SEARCH_URL/courtge/datum/fromlargerfile?pipeline=attachment&pretty=true" -d "{
\"data\" : \"$result\"
}"
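The issue does not record the actual error for the larger files, so the following is a guess at one possible cause rather than a confirmed fix. If the failure comes from passing a very large base64 string through a shell variable and a `-d` argument, writing the request body to a file and sending it with curl's `--data-binary @file` sidesteps that limit. `build_payload` and `payload.json` are made-up names, and coreutils `base64` is swapped in for `openssl base64` here.

```shell
# build_payload is a hypothetical helper (not part of curl or Elasticsearch).
# It streams the base64 text straight into a JSON file, so the encoded
# document never passes through a shell variable or a command-line argument.
build_payload() {
  # $1 = file to encode, $2 = JSON file to write
  {
    printf '{"data":"'
    base64 < "$1" | tr -d '\n'   # drop line wrapping so the JSON string stays valid
    printf '"}\n'
  } > "$2"
}

# Only talk to the cluster when SEARCH_URL is set and the PDF is present.
pdf=1376314887_1358519941_11.pdf
if [ -n "${SEARCH_URL:-}" ] && [ -f "$pdf" ]; then
  build_payload "$pdf" payload.json
  curl -X PUT "$SEARCH_URL/courtge/datum/fromlargerfile?pipeline=attachment&pretty=true" \
    -H 'Content-Type: application/json' --data-binary @payload.json
fi
```

If the request still fails with the body in a file, the cause is more likely on the Elasticsearch side (for example a request size limit or a parsing error in the attachment processor), and the error in the response is the thing to look at.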
Search the index
curl "$SEARCH_URL/courtge/_search?pretty=true"
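The search above returns every document. To search inside the extracted text, a `match` query on `attachment.content` is a reasonable starting point, assuming the attachment processor keeps its default target field (where the extracted text ends up under `attachment.content`). `match_query` is a made-up helper and does no JSON escaping, so it only suits simple search terms.

```shell
# match_query is a hypothetical helper: it builds a match query against the
# field where the attachment processor stores extracted text by default.
# It does not escape quotes or backslashes in the search term.
match_query() {
  printf '{"query":{"match":{"attachment.content":"%s"}}}\n' "$1"
}

# Only query the cluster when SEARCH_URL is set; the body is piped in on stdin.
if [ -n "${SEARCH_URL:-}" ]; then
  match_query "ipsum" |
    curl -s -X POST "$SEARCH_URL/courtge/_search?pretty=true" \
      -H 'Content-Type: application/json' --data-binary @-
fi
```

The term "ipsum" is chosen because the base64 sample indexed earlier decodes to an RTF snippet containing "Lorem ipsum dolor sit amet".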