ewg118 / numishare

Numishare is an open source suite of applications for managing digital cultural heritage artifacts, with a particular focus on coins and medals.

[Reindex] Error getting Solr document from XQuery ingestion pipeline.

Msch0150 opened this issue · comments

Software: current version of Numishare, running in Docker.
The database contains about 1,200 coins, all of them published.

When I start "Reindex Published Objects", I get:

Error getting Solr document from XQuery ingestion pipeline.

The orbeon.log shows:

2021-03-12 10:44:06,884 INFO lifecycle - event: {"request": "713", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "limiter", "message": "start: filter", "path": "/xforms-server", "method": "POST"}
2021-03-12 10:44:06,884 INFO lifecycle - event: {"request": "713", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "limiter", "message": "start: chain", "path": "/xforms-server", "method": "POST", "wait": "0"}
2021-03-12 10:44:06,884 INFO lifecycle - event: {"request": "713", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "service", "message": "start: handle"}
2021-03-12 10:44:06,884 INFO ProcessorService - /xforms-server - Received request
2021-03-12 10:44:06,912 INFO lifecycle - event: {"request": "713", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "xforms", "message": "ajax with update events", "uuid": "92144dbe1b8c8e50eeddf1ac0bbde96b8dd48b49"}
2021-03-12 10:44:06,913 INFO lifecycle - event: {"request": "713", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "xforms", "message": "before document lock", "uuid": "92144dbe1b8c8e50eeddf1ac0bbde96b8dd48b49"}
2021-03-12 10:44:06,913 INFO lifecycle - event: {"request": "713", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "xforms", "message": "got document lock", "path": "/xforms-server", "method": "POST", "uuid": "92144dbe1b8c8e50eeddf1ac0bbde96b8dd48b49", "wait": "0"}
2021-03-12 10:44:10,785 INFO lifecycle - event: {"request": "714", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "service", "message": "start: handle", "path": "/numishare/foc/ingest", "method": "GET"}
2021-03-12 10:44:10,786 INFO ProcessorService - /numishare/foc/ingest - Received request
2021-03-12 10:44:10,834 INFO PageFlowControllerProcessor - HTTP status code 414 {controller: "oxf:/apps/numishare/page-flow.xml", method: "GET", path: "/numishare/foc/ingest", status-code: "414"}
2021-03-12 10:44:10,835 INFO ProcessorService - /numishare/foc/ingest - Timing: 49
2021-03-12 10:44:10,835 INFO lifecycle - event: {"request": "714", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "service", "message": "end: handle", "time": "50 ms"}
2021-03-12 10:44:10,836 ERROR XFormsServer - xforms-submit-error - xf:submission for submission id: generate-add-document, error code received when submitting instance: 414
2021-03-12 10:44:10,846 WARN XFormsServer - instance() - instance not found {instance id: "list"}
2021-03-12 10:44:10,846 WARN XFormsServer - xf:send: submission does not refer to an existing xf:submission element, ignoring action {submission id: "query-solr"}
2021-03-12 10:44:10,899 INFO lifecycle - event: {"request": "713", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "xforms", "message": "after cacheOrStore", "document cache current size": "4", "document cache max size": "50"}
2021-03-12 10:44:10,899 INFO ProcessorService - /xforms-server - Timing: 4015
2021-03-12 10:44:10,901 INFO lifecycle - event: {"request": "713", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "service", "message": "end: handle", "time": "4,016 ms"}
2021-03-12 10:44:10,901 INFO lifecycle - event: {"request": "713", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "limiter", "message": "end: chain", "time": "4,017 ms"}
2021-03-12 10:44:10,901 INFO lifecycle - event: {"request": "713", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "limiter", "message": "end: filter", "time": "4,017 ms"}
2021-03-12 10:44:11,045 INFO lifecycle - event: {"request": "715", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "limiter", "message": "start: filter", "path": "/xforms-server", "method": "POST"}
2021-03-12 10:44:11,045 INFO lifecycle - event: {"request": "715", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "limiter", "message": "start: chain", "path": "/xforms-server", "method": "POST", "wait": "0"}
2021-03-12 10:44:11,045 INFO lifecycle - event: {"request": "715", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "service", "message": "start: handle"}
2021-03-12 10:44:11,045 INFO ProcessorService - /xforms-server - Received request
2021-03-12 10:44:11,049 INFO lifecycle - event: {"request": "715", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "xforms", "message": "ajax with update events", "uuid": "92144dbe1b8c8e50eeddf1ac0bbde96b8dd48b49"}
2021-03-12 10:44:11,049 INFO lifecycle - event: {"request": "715", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "xforms", "message": "before document lock", "uuid": "92144dbe1b8c8e50eeddf1ac0bbde96b8dd48b49"}
2021-03-12 10:44:11,050 INFO lifecycle - event: {"request": "715", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "xforms", "message": "got document lock", "path": "/xforms-server", "method": "POST", "uuid": "92144dbe1b8c8e50eeddf1ac0bbde96b8dd48b49", "wait": "0"}
2021-03-12 10:44:11,051 INFO lifecycle - event: {"request": "715", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "xforms", "message": "after cacheOrStore", "document cache current size": "4", "document cache max size": "50"}
2021-03-12 10:44:11,051 INFO ProcessorService - /xforms-server - Timing: 6
2021-03-12 10:44:11,051 INFO lifecycle - event: {"request": "715", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "service", "message": "end: handle", "time": "6 ms"}
2021-03-12 10:44:11,051 INFO lifecycle - event: {"request": "715", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "limiter", "message": "end: chain", "time": "6 ms"}
2021-03-12 10:44:11,052 INFO lifecycle - event: {"request": "715", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "limiter", "message": "end: filter", "time": "6 ms"}

It looks like this line might be causing the issue:
2021-03-12 10:44:10,834 INFO PageFlowControllerProcessor - HTTP status code 414 {controller: "oxf:/apps/numishare/page-flow.xml", method: "GET", path: "/numishare/foc/ingest", status-code: "414"}

HTTP status 414 usually means "Request-URI Too Long".
I have no idea what could cause this. My first thought was the new reference to RPC, so I unpublished all of these coins, but the error message remains the same.
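For a sense of scale (my own back-of-envelope sketch, not Numishare code; the identifier format is made up), roughly 1,200 identifiers packed into a single query parameter easily exceed the common 8 KB request-line default:

```python
# Rough estimate of the URI length produced when ~1200 coin IDs are
# packed into one 'identifiers' query parameter (hypothetical IDs).
from urllib.parse import quote

ids = [f"coin.collection.{n:04d}" for n in range(1200)]   # 20 chars each
identifiers = quote("|".join(ids))                        # '|' -> '%7C'

uri = "/numishare/foc/ingest?identifiers=" + identifiers
print(len(uri) > 8192)  # True: far beyond the usual 8 KB (8192 byte) default
```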

Do you have any hint for troubleshooting this?

Just for info: a manual request to

http://localhost:8081/orbeon/numishare/foc/ingest

returns a simple XML element whose content is:

false

The orbeon.log:
2021-03-12 10:56:05,825 INFO lifecycle - event: {"request": "716", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "limiter", "message": "start: nofilter", "path": "/numishare/foc/ingest", "method": "GET"}
2021-03-12 10:56:05,826 INFO lifecycle - event: {"request": "716", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "service", "message": "start: handle"}
2021-03-12 10:56:05,826 INFO ProcessorService - /numishare/foc/ingest - Received request
2021-03-12 10:56:05,910 INFO ProcessorService - /numishare/foc/ingest - Timing: 84
2021-03-12 10:56:05,910 INFO lifecycle - event: {"request": "716", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "service", "message": "end: handle", "time": "84 ms"}
2021-03-12 10:56:05,910 INFO lifecycle - event: {"request": "716", "session": "7CF0E421BC9C634AF57002EB085C4637", "source": "limiter", "message": "end: nofilter", "time": "85 ms"}

The batch indexing process passes a request parameter containing all the IDs in that batch, which can be a very long string. In Tomcat or Apache, you likely need to raise the maximum HTTP header size. The default is 8 KB, which is probably good enough for a few hundred IDs, but not for the 1,000+ needed to batch-publish a coin collection. Find the Connector in your $TOMCAT_HOME/conf/server.xml and add maxHttpHeaderSize for your HTTP port (8080) and/or HTTPS port (8443):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" maxHttpHeaderSize="1000000"/>

Be sure to restart Tomcat after editing server.xml.

I have had no luck with this issue.

I assume the Tomcat running Orbeon is meant. So I changed maxHttpHeaderSize to "100000", but the problem remains.
I then changed the value to "100" and, as expected, Tomcat stopped working correctly, so I am fairly sure I adjusted the right attribute. I tested many different values ("10000000", "65536", "65535", "49152", "49151"), but I always run into the same error.

Maybe the 414 is coming from Solr, so I added a larger request-header size setting to /opt/solr-8.6.2/server/etc/jetty.xml, but the problem stays.

Hm... strange. Which component actually has the problem with the long URI?
So I checked the access logs:
Tomcat (logs/localhost_access_log.2021-03-13.txt): no request resembling "/numishare/foc/ingest" and no response code 414 anywhere in the file.
Next I set up access logging for Solr: again, no such request and no 414.
Maybe the web server is involved? So I set up an access log for the Apache HTTP Server: again, no such request and no 414.

And now? I am running out of ideas.

Which HTTP server component is complaining about a long URI?
Why is neither the request nor the 414 error written to the access log files of Tomcat (Orbeon), Apache HTTP Server, or Solr?
Is there an internal channel of communication between Orbeon and Solr?

Does "unpublish" followed by "publish" perform the same function as "reindex"?

Any suggestion is welcome.

It could also be the Apache web server, if Numishare requests the ingest API through a URL proxied by Apache ProxyPass. Try putting LimitRequestLine 1000000 in your Apache config.
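In context, the directive belongs at server (or virtual-host) level next to the proxy rules; a sketch, with host and port assumed from a typical Orbeon setup:

```apache
# Raise the request-line limit (Apache's default is 8190 bytes)
LimitRequestLine 1000000

<VirtualHost *:80>
    ProxyPass        /orbeon http://localhost:8080/orbeon
    ProxyPassReverse /orbeon http://localhost:8080/orbeon
</VirtualHost>
```

Reload Apache after changing the config.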

It wouldn't be a Solr request-header limit. The ingest pipeline is what accepts the very long request parameter called identifiers, and the resulting XML document then gets posted into Solr. So the HTTP header size has to be increased in Tomcat and/or Apache.
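A generic client-side workaround, independent of any server limit, would be to split the IDs into batches whose encoded length stays under the limit. A minimal sketch (my own illustration, not part of Numishare):

```python
from urllib.parse import quote

def batch_identifiers(ids, max_len=8000):
    """Group IDs into batches whose URL-encoded, pipe-joined length
    stays under max_len (illustrative helper, not Numishare code)."""
    batches, current, current_len = [], [], 0
    for i in ids:
        enc_len = len(quote(i)) + 3  # +3 for the encoded '|' (%7C)
        if current and current_len + enc_len > max_len:
            batches.append(current)
            current, current_len = [], 0
        current.append(i)
        current_len += enc_len
    if current:
        batches.append(current)
    return batches

ids = [f"coin.{n:04d}" for n in range(1200)]  # hypothetical identifiers
batches = batch_identifiers(ids)
# every batch's encoded query value now fits under the 8 KB default
```

Each batch could then be sent as its own ingest request instead of one oversized URI.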

Is this still a problem?

I will recheck with the latest version in the next week.

I need some more time for the recheck.

Just a short update on this one:
The problem still exists in my environment (during reindexing in the UI). I traced the network traffic and could see:

HTTP/1.1 414 URI Too Long
...
Server: Jetty(9.4.26.v20200117)

And just before that, I see:

GET /exist/rest/db/cf/aggregate-ids.xql?identifiers=myidentifier_01%7Cmyidentifier02%7Cmyidentifier03%7Cmyidentifier04...(very very long)

Host: exist:8080
Connection: Keep-Alive
....

So it is a problem related to eXist-db. I will check how to get around the eXist-db limitation.
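The percent-encoding makes the request line even longer than the raw identifier list: each pipe separator becomes %7C, tripling its size. A quick check with made-up identifiers:

```python
from urllib.parse import quote, unquote

# 1000 hypothetical IDs of 16 characters each, pipe-separated
raw = "|".join(f"myidentifier_{n:03d}" for n in range(1000))
encoded = quote(raw)  # only '|' needs escaping; '_' is URL-safe

print(len(raw))      # 1000*16 + 999 separators = 16999
print(len(encoded))  # each '|' -> '%7C': 16999 + 2*999 = 18997
assert unquote(encoded) == raw
```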

Finally, I could fix it by modifying "requestHeaderSize" in etc/jetty/jetty.xml of the eXist-db installation:

<Set name="requestHeaderSize"><Property name="jetty.httpConfig.requestHeaderSize" deprecated="jetty.request.header.size" default="1008192" /></Set>

The example above shows the increase from the default of 8192 to 1008192.