AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

Uploading images from files is not supported

ansell opened this issue

The RemoteMediaStore.uploadImage(...) method looks like it should support uploading images from files, but it is not currently called from anywhere:

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/develop/src/main/scala/au/org/ala/biocache/load/MediaStore.scala#L532

The only reference in RemoteMediaStore.save is to images that are hosted externally:

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/develop/src/main/scala/au/org/ala/biocache/load/MediaStore.scala#L312

This blocks image uploads from local files, where the images have been sent directly to us rather than being hosted somewhere.

The image server now directly loads images from URLs, rather than them being downloaded to a temporary location and then uploaded. This, of course, doesn't work for locally stored images.

Notes on testing: attached is a test configuration that points to images-dev, collections-test etc. and uses a null Cassandra store. You will need localhost:8983 to point to a Solr instance.

biocache-test-config.properties.zip

https://collections-test.ala.org.au/dataResource/show/dr8259 is linked to a DwCA with images and a dud URL.

Note that the rowType of the core in the DwCA must be http://rs.tdwg.org/dwc/terms/Occurrence, otherwise biocache-store will decide that it shouldn't load any image extensions. Watch out for Occurrences. The extensions can be of type http://rs.gbif.org/terms/1.0/Image or, preferably, http://rs.gbif.org/terms/1.0/Multimedia.
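
For anyone preparing a test archive, a pre-flight check of meta.xml along these lines may save a failed load. This is only a sketch: the class and method names are made up for illustration, it is not the check biocache-store itself performs, and the exact string comparison is an assumption about how strict that check is.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of biocache-store.
public class RowTypeCheck {
    static final String OCCURRENCE = "http://rs.tdwg.org/dwc/terms/Occurrence";
    static final String IMAGE = "http://rs.gbif.org/terms/1.0/Image";
    static final String MULTIMEDIA = "http://rs.gbif.org/terms/1.0/Multimedia";

    // Parse the contents of a DwCA meta.xml file into a DOM document.
    static Document parse(String metaXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(metaXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("meta.xml could not be parsed", e);
        }
    }

    // True only if there is exactly one <core> and its rowType is exactly the Occurrence term.
    static boolean coreIsOccurrence(String metaXml) {
        NodeList cores = parse(metaXml).getElementsByTagNameNS("*", "core");
        return cores.getLength() == 1
                && OCCURRENCE.equals(((Element) cores.item(0)).getAttribute("rowType"));
    }

    // The rowTypes of all <extension> elements that are Image or Multimedia.
    static List<String> mediaExtensions(String metaXml) {
        NodeList extensions = parse(metaXml).getElementsByTagNameNS("*", "extension");
        List<String> found = new ArrayList<>();
        for (int i = 0; i < extensions.getLength(); i++) {
            String rowType = ((Element) extensions.item(i)).getAttribute("rowType");
            if (IMAGE.equals(rowType) || MULTIMEDIA.equals(rowType)) {
                found.add(rowType);
            }
        }
        return found;
    }
}
```

With an exact match like this, a core rowType ending in "Occurrences" is rejected, so the image extensions would be skipped.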

The test archive that Doug linked to above only has a small number of images. Running this code on a larger set shows a resource leak in the HTTP client connection pool: connections are not released after use. This is the console output and the start of the thread dump showing the resource leak (the thread dump was taken by running kill -3 on the PID from another console):

ans025@nci-sandbox-dev:~$ sudo -u tomcat7 biocache load-dwca dr13290
nci-sandbox-dev 2019-11-21 11:49:48,000 INFO : [ConfigModule] - Using config file: /data/biocache/config/biocache-config.properties
nci-sandbox-dev 2019-11-21 11:49:50,017 INFO : [Config] - Using the default set of blacklisted media URLs
nci-sandbox-dev 2019-11-21 11:49:51,457 INFO : [DataLoader] - SFTP the most recent from sftp://upload.ala.org.au:ala/dr13290
Nov 21, 2019 11:49:52 AM org.rev6.scf.SshConnection executeTask
INFO: Beginning SshTask of org.rev6.scf.SshCommand Task: date -r ala/dr13290/Mapped-Robert-Read-20181116.zip +%s
Nov 21, 2019 11:49:53 AM org.rev6.scf.SshConnection executeTask
INFO: Completed SshTask of org.rev6.scf.SshCommand Task: date -r ala/dr13290/Mapped-Robert-Read-20181116.zip +%s in 1 seconds.
Nov 21, 2019 11:49:53 AM org.rev6.scf.SshConnection executeTask
INFO: Beginning SshTask of org.rev6.scf.ScpDownload@c35af2a
Nov 21, 2019 11:50:35 AM org.rev6.scf.SshConnection executeTask
INFO: Completed SshTask of org.rev6.scf.ScpDownload@c35af2a in 41 seconds.
Nov 21, 2019 11:50:35 AM org.rev6.scf.SshConnection executeTask
INFO: Beginning SshTask of org.rev6.scf.SshCommand Task: date -r ala/dr13290/Mapped-Robert-Read-20181116.zip +%s
Nov 21, 2019 11:50:36 AM org.rev6.scf.SshConnection executeTask
INFO: Completed SshTask of org.rev6.scf.SshCommand Task: date -r ala/dr13290/Mapped-Robert-Read-20181116.zip +%s in 1 seconds.
Nov 21, 2019 11:50:36 AM org.rev6.scf.SshConnection executeTask
INFO: Beginning SshTask of org.rev6.scf.ScpDownload@aa4d8cc
Nov 21, 2019 11:51:22 AM org.rev6.scf.SshConnection executeTask
INFO: Completed SshTask of org.rev6.scf.ScpDownload@aa4d8cc in 46 seconds.
nci-sandbox-dev 2019-11-21 11:51:22,881 INFO : [DataLoader] - The most recent file is /data/biocache-load/dr13290/ala/dr13290/Mapped-Robert-Read-20181116.zip with last modified date : Thu Nov 21 11:47:59 AEDT 2019
nci-sandbox-dev 2019-11-21 11:51:22,884 INFO : [DataLoader] - Extracting ZIP /data/biocache-load/dr13290/ala/dr13290/Mapped-Robert-Read-20181116.zip
nci-sandbox-dev 2019-11-21 11:51:45,172 INFO : [DataLoader] - Archive extracted to directory: /data/biocache-load/dr13290/ala/dr13290/Mapped-Robert-Read-20181116
nci-sandbox-dev 2019-11-21 11:51:45,173 INFO : [DataLoader] - File last modified date: Thu Nov 21 11:47:59 AEDT 2019
nci-sandbox-dev 2019-11-21 11:51:45,174 INFO : [DataLoader] - Loading archive: /data/biocache-load/dr13290/ala/dr13290/Mapped-Robert-Read-20181116 for resource: dr13290, with unique terms: List(dwc:catalogNumber), stripping spaces:  false, incremental: true,  load missing only: false,  testing: false
nci-sandbox-dev 2019-11-21 11:51:45,304 INFO : [Vocab] - Reading vocab file: /data/biocache/vocab/dwc.txt
nci-sandbox-dev 2019-11-21 11:51:46,215 INFO : [Vocab] - Reading vocab file: /data/biocache/vocab/mime-types.txt
nci-sandbox-dev 2019-11-21 11:51:46,355 INFO : [DataLoader] - 10, >> last key : dr13290|107, UUID: , records per sec: 11.098779
nci-sandbox-dev 2019-11-21 11:51:46,418 INFO : [DataLoader] - 20, >> last key : dr13290|116, UUID: , records per sec: 163.93442
nci-sandbox-dev 2019-11-21 11:51:46,456 INFO : [DataLoader] - 30, >> last key : dr13290|126, UUID: , records per sec: 263.1579
nci-sandbox-dev 2019-11-21 11:51:46,494 INFO : [DataLoader] - 40, >> last key : dr13290|140, UUID: , records per sec: 270.27026
nci-sandbox-dev 2019-11-21 11:51:46,559 INFO : [DataLoader] - 50, >> last key : dr13290|151, UUID: , records per sec: 163.93442
nci-sandbox-dev 2019-11-21 11:51:46,616 INFO : [DataLoader] - 60, >> last key : dr13290|163, UUID: , records per sec: 178.57143
nci-sandbox-dev 2019-11-21 11:51:46,665 INFO : [DataLoader] - 70, >> last key : dr13290|173, UUID: , records per sec: 204.08163
nci-sandbox-dev 2019-11-21 11:51:46,718 INFO : [DataLoader] - 80, >> last key : dr13290|182, UUID: , records per sec: 188.67924
nci-sandbox-dev 2019-11-21 11:51:46,751 INFO : [DataLoader] - 90, >> last key : dr13290|192, UUID: , records per sec: 303.0303
nci-sandbox-dev 2019-11-21 11:51:46,809 INFO : [DataLoader] - 100, >> last key : dr13290|201, UUID: , records per sec: 172.4138
2019-11-21 12:08:51
Full thread dump OpenJDK 64-Bit Server VM (25.222-b10 mixed mode):

"Thread-6" #31 prio=5 os_prio=0 tid=0x00007ff074d48000 nid=0x6682 waiting on condition [0x00007fefecb65000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000b740fdb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380)
	at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69)
	at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:246)
	- locked <0x00000000b12633b8> (a org.apache.http.pool.AbstractConnPool$2)
	at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:303)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:279)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
	at au.org.ala.biocache.load.RemoteMediaStore$.uploadImage(MediaStore.scala:571)
	at au.org.ala.biocache.load.RemoteMediaStore$.save(MediaStore.scala:306)
	at au.org.ala.biocache.load.DataLoader$$anonfun$processMedia$1.apply(DataLoader.scala:323)
	at au.org.ala.biocache.load.DataLoader$$anonfun$processMedia$1.apply(DataLoader.scala:295)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at au.org.ala.biocache.load.DataLoader$class.processMedia(DataLoader.scala:295)
	at au.org.ala.biocache.load.DwCALoader.processMedia(DwCALoader.scala:95)
	at au.org.ala.biocache.load.DwCALoader$PersistConsumer.run(DwCALoader.scala:419)
	at java.lang.Thread.run(Thread.java:748)
.....

Note the time difference between 11:51 in the biocache-store log and 12:08 when I ran the thread dump: the load had logged nothing further for about 17 minutes.
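
To illustrate why the thread above ends up parked in AbstractConnPool.getPoolEntryBlocking, here is a small stand-alone sketch of the failure mode. It does not use Apache HttpClient: a Semaphore stands in for the bounded connection pool, and the pool size, timeout and all class and method names are assumptions made for the demonstration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical simulation of pool exhaustion; NOT the biocache-store or HttpClient code.
public class PoolLeakSketch {

    // Stand-in for a bounded connection pool.
    static final class Pool {
        private final Semaphore permits;

        Pool(int size) {
            permits = new Semaphore(size);
        }

        // Lease a slot, waiting at most timeoutMs; returns null if none became free.
        Lease lease(long timeoutMs) {
            try {
                return permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)
                        ? new Lease(permits) : null;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }

    // A leased slot; closing it hands the slot back to the pool.
    static final class Lease implements AutoCloseable {
        private final Semaphore permits;
        private boolean closed;

        Lease(Semaphore permits) {
            this.permits = permits;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                permits.release();
            }
        }
    }

    // Leaky variant: leases are never closed. Returns how many uploads got a slot.
    static int uploadLeaky(Pool pool, int uploads) {
        int done = 0;
        for (int i = 0; i < uploads; i++) {
            Lease lease = pool.lease(50);
            if (lease == null) {
                break; // a pool without a lease timeout would park the thread here instead
            }
            done++;
        }
        return done;
    }

    // Fixed variant: try-with-resources returns the slot after every upload.
    static int uploadClosing(Pool pool, int uploads) {
        int done = 0;
        for (int i = 0; i < uploads; i++) {
            try (Lease lease = pool.lease(50)) {
                if (lease == null) {
                    break;
                }
                done++;
            }
        }
        return done;
    }
}
```

With a pool of two slots the leaky variant completes two uploads and then cannot lease again, while the closing variant completes all of them. In Apache HttpClient 4.x the analogous step is closing the CloseableHttpResponse (or fully consuming its entity) after each request; the thread does not say whether that is what the actual fix did.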

Fixed the resource issue, but found a new issue: spaces in file names are not being encoded correctly somewhere in the image handling code:

Exception in thread "Thread-6" java.lang.IllegalArgumentException: Illegal character in path at index 88: file:/data/biocache-load/dr13290/ala/dr13290/Mapped-Robert-Read-20181116/images/P1010232 crop to pp.jpg
	at java.net.URI.create(URI.java:852)
	at au.org.ala.biocache.load.RemoteMediaStore$.save(MediaStore.scala:307)
	at au.org.ala.biocache.load.DataLoader$$anonfun$processMedia$1.apply(DataLoader.scala:323)
	at au.org.ala.biocache.load.DataLoader$$anonfun$processMedia$1.apply(DataLoader.scala:295)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at au.org.ala.biocache.load.DataLoader$class.processMedia(DataLoader.scala:295)
	at au.org.ala.biocache.load.DwCALoader.processMedia(DwCALoader.scala:95)
	at au.org.ala.biocache.load.DwCALoader$PersistConsumer.run(DwCALoader.scala:419)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.URISyntaxException: Illegal character in path at index 88: file:/data/biocache-load/dr13290/ala/dr13290/Mapped-Robert-Read-20181116/images/P1010232 crop to pp.jpg
	at java.net.URI$Parser.fail(URI.java:2848)
	at java.net.URI$Parser.checkChars(URI.java:3021)
	at java.net.URI$Parser.parseHierarchical(URI.java:3105)
	at java.net.URI$Parser.parse(URI.java:3053)
	at java.net.URI.<init>(URI.java:588)
	at java.net.URI.create(URI.java:850)
	... 9 more
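
Index 88 in that string is the first space in the file name, which URI.create rejects because it expects an already-encoded URI. A minimal sketch of one way to build a valid file URI from a local path is below; the class and method names are hypothetical, and this is not necessarily how the fix in MediaStore.scala was done.

```java
import java.io.File;
import java.net.URI;

// Hypothetical helper, not part of biocache-store.
public class FileUriSketch {

    // File.toURI() percent-encodes characters that are illegal in a URI path, such as spaces.
    static URI toFileUri(String localPath) {
        return new File(localPath).toURI();
    }

    // Reproduces the failure above: string concatenation leaves the space unencoded.
    static boolean rawCreateFails(String localPath) {
        try {
            URI.create("file:" + localPath);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }
}
```

The encoded URI converts back to the original file name with new File(uri), so the space survives the round trip.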

This appears to be fixed now. Just waiting on the next reindex to verify that the data resource was loaded properly. The images look okay in the viewer, just want to make sure they were linked up to the record correctly:

https://images.ala.org.au/?q=&fq=dataResourceUid%3Adr13290&offset=0&max=50&sort=dateUploaded&order=desc