OneBusAway / onebusaway-gtfs-modules

A Java-based library for reading, writing, and transforming public transit data in the GTFS format, including database support.

Home Page: https://github.com/OneBusAway/onebusaway-gtfs-modules/wiki

duplicate agency_id from new GTFS but different agency

jheld opened this issue

If I import feeds from different GTFS zip files into the same database, there is a high probability that agency_ids will match between two different agency.txt files. In the case I'm considering, the agencies really are different, but the duplicate ids won't work from a primary-key perspective. Is there anything built into the CLI application that can transform the ids so that both feeds can be loaded into the database when this happens?

@jheld Have you checked out the GTFS Transformer CLI tool? It's part of this repo.

@barbeau It doesn't seem like it can do that... at least not as built-in functionality. I would likely first have to check all the incoming agency_ids against what's in my current database, then come up with a set of brand-new distinct ids, then apply the transformation tool to change the agency_ids. I assume that if I do this, the dependent files that reference the agency_id will change as well.

Thoughts? Is this sane?

@jheld If you have two different GTFS zip files, a.zip and b.zip, each of which has an agency.txt with agency_id = 1, you could run the GTFS Transformer on a.zip and update the agency_id to 2. This would give you two GTFS datasets whose agency_ids no longer collide - at that point you can import them into any application.

Does that fix your problem?
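
For illustration, here is a minimal Python sketch of how a rule for that agency_id change might be generated. The `op`/`match`/`update` rule shape is an assumption based on the GTFS Transformer's documented "update" operation and should be checked against the transformer docs, as should whether references in dependent files (e.g. routes.txt) are rewritten along with it. The function name `agency_id_update_rule` is hypothetical.

```python
import json


def agency_id_update_rule(old_id, new_id):
    """Build one transform rule line that renames an agency_id.

    ASSUMPTION: the op/match/update rule shape follows the GTFS
    Transformer's documented "update" operation; verify before use.
    """
    rule = {
        "op": "update",
        "match": {"file": "agency.txt", "agency_id": old_id},
        "update": {"agency_id": new_id},
    }
    return json.dumps(rule)


# One rule per line, suitable for writing to a transform file.
print(agency_id_update_rule("1", "2"))
```

The printed line would be saved to a transform file and passed to the transformer along with the input and output GTFS paths.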

Kind of. The issue is that I won't necessarily be loading these zips at the same time; it could be a long time between imports. To detect a collision like that, I would have to either adjust your program to detect and fix it (as a fork/subclass), or write my own external program to pre-process the feed and then apply the transform. So, yeah, it fixes my problem; it's just not a baked-in feature the way I'd want -- can't win them all. Thank you for taking the time to answer this!

Gotcha - I think we'd be interested in a generalized feature to detect agency_id collisions, maybe based on a text file input that contains a list of existing agency_ids? PRs happily accepted :).
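
As a rough sketch of the text-file idea above (not an existing feature of this repo): read the agency_ids from an incoming feed's agency.txt and intersect them with a list of existing ids, one per line. The function names and the one-id-per-line file format are assumptions made for illustration.

```python
import csv
import io
import zipfile


def read_agency_ids(gtfs_zip):
    """Return the set of agency_id values found in agency.txt of a GTFS zip.

    gtfs_zip may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(gtfs_zip) as zf:
        with zf.open("agency.txt") as raw:
            # utf-8-sig tolerates the byte-order mark some feeds include.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            reader = csv.DictReader(text)
            return {
                row["agency_id"].strip()
                for row in reader
                if row.get("agency_id")
            }


def load_existing_ids(lines):
    """Parse existing ids from an iterable of lines (e.g. an open text file).

    ASSUMPTION: one id per line; blank lines are ignored.
    """
    return {line.strip() for line in lines if line.strip()}


def find_collisions(gtfs_zip, existing_ids):
    """Return the sorted agency_ids present in both the feed and existing_ids."""
    return sorted(read_agency_ids(gtfs_zip) & set(existing_ids))
```

Usage would look something like `find_collisions("b.zip", load_existing_ids(open("existing_agency_ids.txt")))`, where both file names are placeholders.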

Turns out we'd want to detect all sorts of *_id collisions. In my implementation, I'm running a governing Python process that gets the highest int-based id in the database; then, for each id in the incoming feed, if that id is already in the list of existing ids, it assigns the next-highest int id (which is also higher than the highest int id in the incoming feed). It then applies the transform CLI and runs the Hibernate import on the transformed version.
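
The allocation step described above could look roughly like the following. This is a reconstruction from the description, not the actual script; the function name `plan_id_remap` is hypothetical, and ids are assumed to be handled as strings, with non-integer ids ignored when computing the maximum.

```python
def plan_id_remap(existing_ids, incoming_ids):
    """Map each colliding incoming id to a fresh integer id (as a string).

    New ids start above the highest integer id seen in either the
    database (existing_ids) or the incoming feed (incoming_ids).
    """
    existing = set(existing_ids)
    incoming = set(incoming_ids)

    def as_int(value):
        # Non-integer ids are skipped when computing the maximum.
        try:
            return int(value)
        except ValueError:
            return None

    ints = [n for n in map(as_int, existing | incoming) if n is not None]
    next_id = max(ints, default=0) + 1

    remap = {}
    for old in sorted(incoming):  # sorted for a deterministic assignment
        if old in existing:
            remap[old] = str(next_id)
            next_id += 1
    return remap
```

Each entry in the returned mapping would then become one update rule for the transform CLI, and the same approach applies to other *_id columns.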

I wouldn't mind getting this into Java land. So, you'd want me to create a new [sub]module that gets the existing agency_ids and writes them out to a file, along with the highest int-based id? Int-based makes the most sense, first because ints have an inherently easy ordering, and second because the generated replacement ids just keep getting higher with each conflict.

So the user would run this, then the transform, then the Hibernate import?

I think that makes sense. The goal would be a modular operation that could easily be chained into any number of other workflows/CLI operations - so I'd target what really makes sense for you. Hopefully you'd use the contents of the PR in your own workflow - otherwise not sure it would make sense to add here (if we don't have any immediate users).