afimb / gtfslib-python

An open source library in python for reading GTFS files and computing various stats and indicators about Public Transport networks

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Blocking query with postgresql

mhugo opened this issue · comments

Trying to import twice the same gtfs feed with a different feed_id results in a blocking query the second time.

Steps to reproduce:

  1. create an empty postgresql database
  2. gtfdbloader postgresql://db --load=gtfs.zip --id=T1 => ok
  3. gtfdbloader postgresql://db --load=gtfs.zip --id=T2 => blocks on "Normalizing shapes and trips..."

The issue seems to come from https://github.com/afimb/gtfslib-python/blob/master/gtfslib/converter.py#L596 but I am not sure exactly what happens.

For information, if I change prefetch_stop_times=True, prefetch_stops=True to False, there is no problem anymore (but the import is slower). Not sure to understand ...

@laurentg any clue ?

I just tested with a sample GTFS with shapes to normalize and it works fine loading twice on a postgresql database. Can you share the GTFS you are experiencing problem with?

I do not really understand. Apparently the bug is caused by PostgreSQL choking on a simple query (100% CPU usage and never returning). The query, as generated by SqlAlchemy, is:

SELECT stop_times.feed_id AS stop_times_feed_id, stop_times.trip_id AS stop_times_trip_id, stop_times.stop_sequence AS stop_times_stop_sequence, stop_times.stop_id AS stop_times_stop_id, stop_times.arrival_time AS stop_times_arrival_time, stop_times.departure_time AS stop_times_departure_time, stop_times.interpolated AS stop_times_interpolated, stop_times.shape_dist_traveled AS stop_times_shape_dist_traveled, stop_times.timepoint AS stop_times_timepoint, stop_times.pickup_type AS stop_times_pickup_type, stop_times.drop_off_type AS stop_times_drop_off_type, stop_times.stop_headsign AS stop_times_stop_headsign, anon_1.trips_feed_id AS anon_1_trips_feed_id, anon_1.trips_trip_id AS anon_1_trips_trip_id FROM (SELECT trips.feed_id AS trips_feed_id, trips.trip_id AS trips_trip_id FROM trips WHERE trips.feed_id = 'TER1' AND trips.trip_id IN ('OCEDB018800F030015447', 'OCEDB018801F010015448', 'OCEDB018802F030015449')) AS anon_1 JOIN stop_times ON stop_times.trip_id = anon_1.trips_trip_id AND stop_times.feed_id = anon_1.trips_feed_id;

The funny thing is that:

  1. reducing batch size to low values (3) has no effect (still blocking)
  2. executing the exact same query manually works fine (using psql)
  3. changing a (to my taste unrelated) debug system parameter make PostgreSQL works:
    ALTER SYSTEM SET track_activity_query_size = 65536;

Leaving this as open for now.