motis-project / motis

Intermodal Mobility Information System

Home Page: https://motis-project.de


Nigiri stuck at 0% RUNNING

laem opened this issue · comments

commented

When trying to load the new version of the Bretagne GTFS aggregate, motis start gets stuck.

https://www.korrigo.bzh/ftp/OPENDATA/KORRIGOBRET.gtfs.zip

I'm using the latest Motis release.

I cannot find any useful log. It's not stuck loading this particular GTFS file itself; it gets stuck at the global nigiri step, but only when this GTFS is included, whether it's one of multiple GTFS schedules in the config or the only one.


The nigiri logs:


My guess is that there is an error in the GTFS files, but I don't know how to probe nigiri's output to find it.

commented

This page provides some validation information about the GTFS file, but no error is visible: https://transport.data.gouv.fr/resources/81559#validation-report

commented

gtfstidy solved my problem.

FTR, this is the backtrace from that state:

#0  0x000055a5ea44131d in nigiri::floyd_warshall<unsigned short> (mat=...) at /k/transport/src/motis/deps/nigiri/include/nigiri/loader/floyd_warshall.h:21
#1  0x000055a5ea43d614 in nigiri::loader::process_component (tt=..., lb=..., ub=ub@entry=..., fgraph=..., matrix_memory=..., adjust_footpaths=true) at /k/transport/src/motis/deps/nigiri/src/loader/build_footpaths.cc:452
#2  0x000055a5ea43e4c5 in nigiri::loader::transitivize_footpaths(nigiri::timetable&, bool)::$_1::operator()<std::__1::__wrap_iter<std::__1::pair<unsigned int, unsigned int>*>, std::__1::__wrap_iter<std::__1::pair<unsigned int, unsigned int>*> >(std::__1::__wrap_iter<std::__1::pair<unsigned int, unsigned int>*>, std::__1::__wrap_iter<std::__1::pair<unsigned int, unsigned int>*>) const (lb=..., lb@entry=..., ub=..., this=<optimized out>)     at /k/transport/src/motis/deps/nigiri/src/loader/build_footpaths.cc:522
#3  utl::equal_ranges_linear<std::__1::__wrap_iter<std::__1::pair<unsigned int, unsigned int>*>, nigiri::loader::transitivize_footpaths(nigiri::timetable&, bool)::$_0, nigiri::loader::transitivize_footpaths(nigiri::timetable&, bool)::$_1>(std::__1::__wrap_iter<std::__1::pair<unsigned int, unsigned int>*>, std::__1::__wrap_iter<std::__1::pair<unsigned int, unsigned int>*>, nigiri::loader::transitivize_footpaths(nigiri::timetable&, bool)::$_0&&, nigiri::loader::transitivize_footpaths(nigiri::timetable&, bool)::$_1&&) (begin=..., end=..., eq=..., func=...) at /k/transport/src/motis/deps/utl/include/utl/equal_ranges_linear.h:34
#4  utl::equal_ranges_linear<std::__1::vector<std::__1::pair<unsigned int, unsigned int>, std::__1::allocator<std::__1::pair<unsigned int, unsigned int> > >, nigiri::loader::transitivize_footpaths(nigiri::timetable&, bool)::$_0, nigiri::loader::transitivize_footpaths(nigiri::timetable&, bool)::$_1>(std::__1::vector<std::__1::pair<unsigned int, unsigned int>, std::__1::allocator<std::__1::pair<unsigned int, unsigned int> > >&, nigiri::loader::transitivize_footpaths(nigiri::timetable&, bool)::$_0&&, nigiri::loader::transitivize_footpaths(nigiri::timetable&, bool)::$_1&&) (c=..., eq=..., func=...) at /k/transport/src/motis/deps/utl/include/utl/equal_ranges_linear.h:41
#5  nigiri::loader::transitivize_footpaths (tt=..., adjust_footpaths=<optimized out>) at /k/transport/src/motis/deps/nigiri/src/loader/build_footpaths.cc:518
#6  0x000055a5ea440024 in nigiri::loader::build_footpaths (tt=..., adjust_footpaths=true, merge_duplicates=false) at /k/transport/src/motis/deps/nigiri/src/loader/build_footpaths.cc:616
#7  0x000055a5ea3ea250 in nigiri::loader::finalize (tt=..., adjust_footpaths=true, merge_duplicates=false) at /k/transport/src/motis/deps/nigiri/src/loader/init_finish.cc:49

I also checked. The problem is that the feed has almost 13k stops with coordinates stop_lat=0, stop_lon=0:

$ csvcut -c stop_lat stops.txt | grep ^0$ | wc -l
12841

I guess the GTFS validator should consider stops with stop_lat == 0 && stop_lon == 0 invalid.
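Such a check could be sketched as follows. This is a minimal, hypothetical Python example (not part of any existing validator); the field names stop_id, stop_lat, stop_lon are standard GTFS, but the sample data is made up for illustration:

```python
import csv
import io

# Illustrative stops.txt excerpt (made-up data); in practice you would
# open the real stops.txt from inside the GTFS zip.
stops_txt = """stop_id,stop_name,stop_lat,stop_lon
A,Gare de Rennes,48.1034,-1.6723
B,Broken Stop 1,0,0
C,Broken Stop 2,0.0,0.0
D,Quimper,47.9936,-4.0916
"""

def invalid_stops(f):
    """Return the stop_ids whose coordinates are exactly (0, 0)."""
    return [
        row["stop_id"]
        for row in csv.DictReader(f)
        if float(row["stop_lat"]) == 0.0 and float(row["stop_lon"]) == 0.0
    ]

print(invalid_stops(io.StringIO(stops_txt)))  # → ['B', 'C']
```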

MOTIS computes the transitive hull of all footpaths, and since those 13k stops are close to each other (same coordinate), they are automatically connected to each other, so we end up running Floyd-Warshall on a component with 13k stops.
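For reference, the classic Floyd-Warshall recurrence that this step boils down to looks roughly like this (a generic textbook sketch, not the actual nigiri implementation):

```python
import math

def floyd_warshall(dist):
    """All-pairs shortest durations, updated in place.

    dist is an n×n matrix with math.inf for unconnected stop pairs.
    Three nested loops over n stops give O(n^3) time, which is why a
    single degenerate component with ~13k stops takes so long.
    """
    n = len(dist)
    for k in range(n):
        for i in range(n):
            dik = dist[i][k]
            for j in range(n):
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist

# Tiny usage example: three stops in a line, 3 min and 4 min apart.
INF = math.inf
m = [[0, 3, INF],
     [3, 0, 4],
     [INF, 4, 0]]
floyd_warshall(m)
print(m[0][2])  # → 7 (transitive footpath 0 → 1 → 2)
```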

The only thing we can do to "fix" it would be to exclude stops at (0, 0) from the process that creates additional footpaths.

Removing the whole "we create additional footpaths" step would require perfect datasets, which is unreasonable to assume if you have no control over the data creation process.


Thank you for checking! Yes, Floyd-Warshall has cubic complexity. Usually only relatively small numbers of stops are connected in one component.

Another option would be to figure out which component size causes problems and skip this step completely for components that are larger.
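Such a size cap could be as simple as the following sketch (hypothetical names and threshold, chosen only for illustration; the real cutoff would need tuning against real feeds):

```python
MAX_COMPONENT_SIZE = 1000  # hypothetical threshold, would need tuning

def maybe_transitivize(component, transitivize):
    """Run the expensive transitive-closure step only for components
    below the size cap; oversized components keep their direct
    footpaths and are skipped.

    component is a list of stop indices; transitivize is the closure
    routine (e.g. the Floyd-Warshall-based one) to apply to it.
    """
    if len(component) > MAX_COMPONENT_SIZE:
        # Skipping avoids the O(n^3) blow-up for degenerate components,
        # e.g. thousands of stops sharing the (0, 0) coordinate.
        return False
    transitivize(component)
    return True
```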

Something similar happens with the feed for Paris. While it does eventually finish, it takes really long to do so (~30 min).
I couldn't find any stops with (0, 0) coordinates, but it still spends most of the time in floyd_warshall.
I had already preprocessed the feed with gtfstidy.

In case someone wants to have a look, the feed is here:
https://data.iledefrance-mobilites.fr/api/v2/catalog/datasets/offre-horaires-tc-gtfs-idfm/files/a925e164271e4bca93433756d6a340d1

Thank you for the feed reference! I am experimenting with a different approach to build transitive footpaths.

motis-project/nigiri#76

However, I am not sure whether it will really solve the issue, and even if it works and is faster, what the memory usage will be.

@felixguendling btw since you asked about the transitous data set a few days ago, I have finished setting up a public rsync server.
rsync -rav --progress routing.spline.de::transitous /path/to/dest

rsync -rav --progress routing.spline.de::transitous ./transitous
rsync: [Receiver] failed to connect to routing.spline.de (130.133.110.91): Connection refused (111)
rsync: [Receiver] failed to connect to routing.spline.de (2001:470:51c5:babe::91:1): Cannot assign requested address (99)
rsync error: error in socket IO (code 10) at clientserver.c(139) [Receiver=3.2.7]

Seems like the port is closed.

Sorry about that, it seems rsyncd crashed. I'll need some time to debug that.

No worries. Let me know when I can try again.

Rsync should work now, please let me know if it stops working again :)

Perfect! Thank you. Then we'll include this in our benchmark datasets.

rsync -rav --progress routing.spline.de::transitous ./transitous
rsync: [Receiver] failed to connect to routing.spline.de (2001:470:51c5:babe::91:1): Connection refused (111)
rsync: [Receiver] failed to connect to routing.spline.de (130.133.110.91): Connection refused (111)
rsync error: error in socket IO (code 10) at clientserver.c(139) [Receiver=3.2.7]

Rsync to the transitous dataset seems down again.

I have restarted it now. Unfortunately, saving the coredump didn't work properly, so I still don't know what caused it.

Thank you, now it works :)

But only for so long.

...
sk_zssk.gtfs.zip
      1.042.907 100%    1,11MB/s    0:00:00 (xfr#190, to-chk=16/207)
uk_great-britain.gtfs.zip
    244.252.672  62%    2,51MB/s    0:00:58  
rsync: connection unexpectedly closed (2402516896 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(231) [receiver=3.2.7]
rsync: connection unexpectedly closed (10209 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(231) [generator=3.2.7]

I tried again but it results in the same error as before.

rsync -rav --progress routing.spline.de::transitous ./transitous
rsync: [Receiver] failed to connect to routing.spline.de (2001:470:51c5:babe::91:1): Connection refused (111)
rsync: [Receiver] failed to connect to routing.spline.de (130.133.110.91): Connection refused (111)
rsync error: error in socket IO (code 10) at clientserver.c(139) [Receiver=3.2.7]

Sorry :(
I hope it's better now, but I'm not super sure.

Don't worry about it. It finished the download of all the datasets now. Thanks again.