kiselev-dv / gazetteer

OSM ElasticSearch geocoder and addresses exporter

Home Page:http://osm.me

Netherlands - Out of memory

ricadete opened this issue · comments

Good morning master,

I need your help once more. It seems that we need some tricks to handle one of the most detailed countries on OSM: the Netherlands. I ran the application as you suggested:

1st step
bzcat $inputFile | java -jar gazetteer-1.4.jar split - none

2nd step
java -jar gazetteer-1.4.jar slice --x10

3rd step
java -jar gazetteer-1.4.jar join --handlers out-gazetteer $outFile

2015-11-20 10.01.17.187 [join-stripe18544.gjson.gz] ERROR JoinSliceRunable - Join failed. File: data/stripe18544.gjson.gz.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:2367)
...

and more stripes fail after this one.

Source of the file: http://download.geofabrik.de/europe/netherlands-latest.osm.bz2

What can be done here?
Thank you in advance.

java -jar gazetteer-1.4.jar will run with default settings. The default memory limit depends on the JDK version, but it's something like one or two gigabytes of RAM.

So, first step: specify the amount of memory:

java -Xmx4g -jar gazetteer-1.4.jar 

Next step: how many execution threads do you have? Each one takes about 0.5-1 GB of RAM. (That's an estimated average; some of them could take more.)

So if you have 8 or 16 threads, restrict the number of threads used by join:

 java -Xmx4g -jar gazetteer-1.4.jar --threads 2 join --handlers out-gazetteer $outFile
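
The sizing advice above can be turned into a small helper. This is only a sketch: threads_for_heap is a hypothetical function, not part of gazetteer, and the 2 GB per thread budget is an assumption on the safe side of the 0.5-1 GB average (it happens to reproduce the -Xmx4g / --threads 2 pairing).

```shell
#!/bin/sh
# Hypothetical helper, not part of gazetteer: suggest a --threads value.
# Assumption: budget about 2 GB of heap per join thread.
threads_for_heap() {
    heap_gb=$1       # the value you pass to -Xmx, in gigabytes
    cores=$2         # number of CPU threads on the machine
    per_thread_gb=2  # assumed budget per thread
    threads=$((heap_gb / per_thread_gb))
    # Never suggest more threads than the machine has, and never fewer than one.
    if [ "$threads" -gt "$cores" ]; then threads=$cores; fi
    if [ "$threads" -lt 1 ]; then threads=1; fi
    echo "$threads"
}

# Example: a 4 GB heap on an 8-thread machine.
threads_for_heap 4 8   # prints 2
```

The result would then be passed as --threads to the join command.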

Good morning,

We have a VM with 15GB of RAM and we ran the join as:
java -Xmx10g -jar gazetteer-1.4.jar --threads 1 join --handlers out-gazetteer $outFile

It runs for hours and eventually gets stuck. It does not throw any exception, it just stops. We also tracked the memory: it rises up to 11GB. Eventually I had to stop the process. Do you have any idea what is happening? Maybe you can also check with this file:
http://download.geofabrik.de/europe/netherlands-latest.osm.bz2

Best regards,

Ok, I'll check it.
Not enough minerals.

Hi again!

So it finally did it: we had to increase the memory to 15GB, set a single thread and wait around 14h.
It seems that we may have a memory leak somewhere. The memory seems to be always increasing, even though your code goes split by split, and that is suspicious. To help you, I have attached my logs.
Let me know if you find this useful.

FYI:
The generated netherlands_2015-11-24-22:38:45.json.bz is 186GB.

These were the cmds:
bzcat /opt/data/regions/netherlands-latest.osm.bz2 | java -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-split-2015-11-24-22:38:45.log --data-dir netherlands split - none

java -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-slice-2015-11-24-22:50:06.log --data-dir netherlands slice --x10

java -Xmx15g -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-join-2015-11-24-23:12:40.log --data-dir netherlands --threads 1 join --handlers out-gazetteer netherlands_2015-11-24-22:38:45.json.bz

netherlands-join-2015-11-24-23:12:40.txt
netherlands-slice-2015-11-24-22:50:06.txt
netherlands-split-2015-11-24-22:38:45.txt

split consumes memory from start to end and frees it at the end.

join should work with small pieces of data, so could you give me the output of

ls -lh | grep stripe
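
If the listing is long, sorting it by size makes the heaviest stripes easier to spot. A minimal sketch, assuming the stripe files sit directly in the data directory as in the error message above; largest_stripes is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: show the N largest stripe files in a data directory.
# Same idea as "ls -lh | grep stripe", with -S added to sort by size.
largest_stripes() {
    dir=$1
    count=${2:-10}   # how many files to show, 10 if not given
    ls -lhS "$dir" | grep stripe | head -n "$count"
}

# Usage (the directory name is whatever was passed to --data-dir):
#   largest_stripes netherlands 20
```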

"The generated netherlands_2015-11-24-22:38:45.json.bz is 186GB."

It's actually a design issue: every line contains all the data for an address, including the data for all address parts, so it can be processed line by line without fetching related objects. All related parts of the address are imprinted into the main feature. This takes a huge amount of space, but it was done on purpose.

You could override the out-gazetteer handler with a Groovy script to produce less verbose output, or use the out-csv handler, which produces much less verbose output. If that's the case, I could write an example of such a handler.

The stripes are small: between a few KB and a few MB
stripe.txt

I think you have done a great job so far :) let me know if I can help you somehow.

Thank you, but there are still a lot of things to be done.

So as I understand, most of the time was taken by join?

Yes, the join is really the bottleneck. If you check the logs, the last steps take a really long time. Nothing was really happening in the foreground; the only thing I saw was that the PID was still running.

2015-11-25 06.59.42.555 [main] INFO JoinExecutor - Join stripes done in 7:47:01.702
2015-11-25 06.59.42.562 [main] INFO JoinBoundariesExecutor - Run join boundaries, with filter []
2015-11-25 06.59.48.480 [main] INFO JoinBoundariesExecutor - 2999 boundaries was sorted
2015-11-25 06.59.48.482 [main] INFO JoinBoundariesExecutor - Admin levels: [2, 3, 4, 6, 7, 8, 9, 10]
2015-11-25 07.00.05.797 [main] INFO JoinBoundariesExecutor - 0 boundaries skiped
2015-11-25 07.00.05.859 [main] INFO JoinBoundariesExecutor - Join boundaries done in 0:00:23.297
2015-11-25 07.00.05.859 [main] INFO JoinExecutor - Join boundaries done in 0:00:23.300
2015-11-25 12.30.30.31 [main] INFO GazetteerOutWriter - Wrote poi points: 277689
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote address points: 8701051
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote highway segments: 1139017
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote highway networks: 370605
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote place boundaries: 0
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote place points: 6502
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote admin boundaries: 2999
2015-11-25 12.30.30.55 [main] INFO JoinExecutor - All handlers done in 5:30:24.194
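
To see at a glance where the time goes, the "done in" lines can be pulled out of the log. A minimal sketch, assuming the log layout matches the excerpt above; phase_durations is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print "<phase>: <duration>" for every
# "... done in h:mm:ss.SSS" line of a gazetteer log file.
phase_durations() {
    # Drop everything up to " - ", then split the message at " done in ".
    sed -n 's/^.* - \(.*\) done in \([0-9:.]*\)$/\1: \2/p' "$1"
}

# Usage:
#   phase_durations netherlands-join-2015-11-24-23:12:40.log
```

For the excerpt above this would list the Join stripes, Join boundaries and All handlers phases with their durations.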

It's good news actually, a kind of good news :)

https://github.com/kiselev-dv/gazetteer/blob/develop/Gazetteer/src/main/java/me/osm/gazetter/join/out_handlers/GazetteerOutWriter.java#L969

So 5 hours 30 minutes were taken by sorting the results.
Two things actually happen there:

  1. sort features by hierarchy (referenced features come before the features that depend on them)
  2. merge highways into networks (to get one highway instead of tons of small segments)

I've added some options to skip this part in the last commit. I'll test it and let you know.
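
The first step is a dependency ordering, i.e. a topological sort. As an illustration only (this is not how GazetteerOutWriter implements it, and the feature names are made up), the standard tsort utility does the same kind of ordering on "referenced dependent" pairs:

```shell
#!/bin/sh
# Illustration only: order_features is a hypothetical wrapper around tsort.
# Input: one "referenced dependent" pair per line on stdin.
# Output: every name once, with referenced items before their dependents.
order_features() {
    tsort
}

# Example with a made-up address hierarchy.
printf '%s\n' \
    'country city' \
    'city street' \
    'street house' | order_features
# prints country, city, street, house, one per line
```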

Please try 1.5: https://github.com/kiselev-dv/gazetteer/releases/tag/Gazetteer-1.5
If you didn't delete the --data-dir netherlands folder, just run the join again with

java -Xmx10g -jar gazetteer-1.5.jar --log-file netherlands-join.log --data-dir netherlands --threads 1 join --handlers out-gazetteer out=netherlands.json.gz sort=NONE

Successfully converted the Netherlands within 4 hours, with 6 GB of RAM and two threads.