osm2pgsql-dev / osm2pgsql

OpenStreetMap data to PostgreSQL converter

Home Page: https://osm2pgsql.org

New middle tables experiments

joto opened this issue

I have opened PR #1969, which shows my current work on a new database format for the so-called "middle" tables, i.e. planet_osm_nodes/ways/rels, which contain the data needed to update the database from change files.

But the middle tables can also contain useful information to do specialized processing of the OSM data. The new format makes them not only smaller but also easier to use. We'd be interested to get some feedback from the community on the new format.
We hope to make the new middle table format "official" at some point, but want to be sure we have the right format first.
(And of course we'll be supporting the old format for a while next to the new format.)

This is what the tables look like:

=> \d planet_osm_nodes
                        Table "public.planet_osm_nodes"
┌──────────────┬─────────────────────────────┬───────────┬──────────┬─────────┐
│    Column    │            Type             │ Collation │ Nullable │ Default │
├──────────────┼─────────────────────────────┼───────────┼──────────┼─────────┤
│ id           │ bigint                      │           │ not null │         │
│ lat          │ integer                     │           │ not null │         │
│ lon          │ integer                     │           │ not null │         │
│ created      │ timestamp without time zone │           │          │         │
│ version      │ integer                     │           │          │         │
│ changeset_id │ integer                     │           │          │         │
│ user_id      │ integer                     │           │          │         │
│ tags         │ jsonb                       │           │ not null │         │
└──────────────┴─────────────────────────────┴───────────┴──────────┴─────────┘
Indexes:
    "planet_osm_nodes_pkey" PRIMARY KEY, btree (id)

=> \d planet_osm_ways
                        Table "public.planet_osm_ways"
┌──────────────┬─────────────────────────────┬───────────┬──────────┬─────────┐
│    Column    │            Type             │ Collation │ Nullable │ Default │
├──────────────┼─────────────────────────────┼───────────┼──────────┼─────────┤
│ id           │ bigint                      │           │ not null │         │
│ created      │ timestamp without time zone │           │          │         │
│ version      │ integer                     │           │          │         │
│ changeset_id │ integer                     │           │          │         │
│ user_id      │ integer                     │           │          │         │
│ nodes        │ bigint[]                    │           │ not null │         │
│ tags         │ jsonb                       │           │ not null │         │
└──────────────┴─────────────────────────────┴───────────┴──────────┴─────────┘
Indexes:
    "planet_osm_ways_pkey" PRIMARY KEY, btree (id)
    "planet_osm_ways_nodes_bucket_idx" gin (planet_osm_index_bucket(nodes)) WITH (fastupdate=off)

=> \d planet_osm_rels
                        Table "public.planet_osm_rels"
┌──────────────┬─────────────────────────────┬───────────┬──────────┬─────────┐
│    Column    │            Type             │ Collation │ Nullable │ Default │
├──────────────┼─────────────────────────────┼───────────┼──────────┼─────────┤
│ id           │ bigint                      │           │ not null │         │
│ created      │ timestamp without time zone │           │          │         │
│ version      │ integer                     │           │          │         │
│ changeset_id │ integer                     │           │          │         │
│ user_id      │ integer                     │           │          │         │
│ members      │ jsonb                       │           │ not null │         │
│ tags         │ jsonb                       │           │ not null │         │
└──────────────┴─────────────────────────────┴───────────┴──────────┴─────────┘
Indexes:
    "planet_osm_rels_pkey" PRIMARY KEY, btree (id)
    "planet_osm_rels_planet_osm_index_node_members_idx" gin (planet_osm_index_node_members(members)) WITH (fastupdate=off)
    "planet_osm_rels_planet_osm_index_way_members_idx" gin (planet_osm_index_way_members(members)) WITH (fastupdate=off)

=> \d planet_osm_users
           Table "public.planet_osm_users"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │  Type   │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id     │ integer │           │ not null │         │
│ name   │ text    │           │ not null │         │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
    "planet_osm_users_pkey" PRIMARY KEY, btree (id)

This is with the -x/--extra-attributes option. Without it, the created, version, changeset_id, and user_id columns are missing from all tables and the planet_osm_users table is not created.

If you want to try this, check out the branch from PR #1969 and compile it. You'll get the old format by default; to use the new format you need the special command line options described in the PR.

The new format is also slightly more efficient. The database size for the planet drops from 256 GB to 237 GB. The improvement is much larger if you use -x/--extra-attributes: in that case the database size drops from nearly 400 GB to about 260 GB.
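To put those sizes in relative terms, here is a small Python sketch of the arithmetic. The helper name `reduction_percent` is made up for illustration, and 400 GB stands in for "nearly 400 GB", so the second figure is only approximate.

```python
def reduction_percent(before_gb, after_gb):
    """Relative size reduction in percent."""
    return 100.0 * (before_gb - after_gb) / before_gb

# Figures quoted above; 400 is a stand-in for "nearly 400 GB".
without_extra = reduction_percent(256, 237)
with_extra = reduction_percent(400, 260)

print(f"without -x: about {without_extra:.1f}% smaller")
print(f"with -x:    about {with_extra:.1f}% smaller")
```

That works out to roughly 7% without the extra attributes and roughly 35% with them.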

created should be timestamptz because it is a UTC time. See https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_timestamp_.28without_time_zone.29_to_store_UTC_times.

fastupdate=off should not be needed on modern PostgreSQL. When we added it, the size of the pending list was bounded by work_mem, which was typically large on rendering servers (128MB or so). Since PostgreSQL 9.5, the limit is set by gin_pending_list_limit, which defaults to 4MB.

What schema will be used for the various jsonb columns?

> created should be timestamptz because it is a UTC time

No. See #1785.

> fastupdate=off ...

I just took those settings from the old setup. If we don't need them, we should remove them everywhere. There is an open issue, #37, about fastupdate. I never got around to researching this, but I have nothing against removing them.

> What schema will be used for the various jsonb columns?

Tag columns just have the obvious key-value structure. For relation members I showed the structure in the PR: [{"type": "W", "ref": 123, "role": "inner"}, ...].
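As a rough illustration of how a client might consume that members structure, here is a minimal Python sketch. The sample data and the helper name `way_members` are made up for illustration. Only the keys type, ref, and role and the "W" type code come from the description above; the "N" code for node members is an assumption.

```python
import json

# Hypothetical sample of a planet_osm_rels.members value, following the
# structure described above: a list of objects with type, ref, and role.
# The "N" type code for a node member is an assumption.
members_json = """[
  {"type": "W", "ref": 123, "role": "outer"},
  {"type": "W", "ref": 456, "role": "inner"},
  {"type": "N", "ref": 789, "role": "label"}
]"""


def way_members(members, role=None):
    """Return the refs of all way members, optionally filtered by role."""
    return [
        m["ref"]
        for m in members
        if m["type"] == "W" and (role is None or m["role"] == role)
    ]


members = json.loads(members_json)
print(way_members(members))           # refs of all way members
print(way_members(members, "inner"))  # refs of inner ways only
```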

> > created should be timestamptz because it is a UTC time
>
> No. See #1785.

We have a disagreement between your comment and the PostgreSQL wiki. I prefer the PostgreSQL wiki as a source for how to represent a UTC timestamp, as it agrees with all PostgreSQL experts I have talked to about the matter. Since we're making a breaking change for the middle, let's get it right.

I agree with "the PostgreSQL experts" that theirs is the right choice for most cases, and that if you don't want to think about the best solution for your use case, then you should do as they say. But every use case is different. Basically it comes down to which "operations" we want to do with the data. In our case I believe the most common "operation" is to look at the data and compare it with timestamps in the same database (in which case the type doesn't matter) or with timestamps we get from outside the database (most importantly OSM files, which always use UTC, or the osm.org web site, which also shows UTC (if you mouse over the "33 days ago" message)). Having the data show up in my local time zone, which is the default when opening a pgsql session, is rather annoying, especially with daylight saving shifts involved. And I never care about what my local time was when an edit was made. I might care about the local time zone of the editor, but that information is lost anyway. What I care about is comparing timestamps to see what comes before what. The only time I might care about comparing with local time is to get something like that "edited 3 hours ago" display. That is not something that will happen a lot in an osm2pgsql database, but I agree that this use case becomes a little bit more complex.

Depending on what you do, if you write a program to do something with the data you might have to do a little bit more work to set the time zone, but that's okay, you are writing a program anyway. The crucial part is ad-hoc queries, and they are easier and more natural when the data is stored without time zone. And yes, I know that you can change the time zone in your settings so that everything shows up in UTC anyway, but then I either have to remember to do this for every session or I have to put it into a config file, which breaks every other case where I have a database with timestamps that I do want to see in my local time. Unfortunately there is no way to say: this is a timestamp in UTC, always show it to me in UTC.

> or with timestamps we get from outside the database (most importantly OSM files, which always use UTC, or the osm.org web site, which also shows UTC (if you mouse over the "33 days ago" message))

The fact that we're using UTC timestamps is exactly the reason to use timestamptz! Specifically: don't use timestamp (without time zone) to store UTC times.

> I agree with "the PostgreSQL experts" that theirs is the right choice for most cases, and that if you don't want to think about the best solution for your use case, then you should do as they say. But every use case is different

I've considered this use case, and there's nothing unusual about it. Storing UTC time is a common use-case that is well documented.

Some practical problems

  • Say I want to find how old an object is as an interval. If I take the timestamp of the newest node from osm.org and compare it to now() with SELECT now() - '2023-05-28T07:36:45Z'::timestamp; I get a negative interval. With timestamptz I get the right answer.
  • My server with osm2pgsql runs in a different time zone than my desktop, so I'll get different results depending on where I query from.
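The first problem can be reproduced outside the database. The following Python sketch models the effect with naive and aware datetime values; it is an analogy, not a description of PostgreSQL's exact type resolution. The node timestamp is the one from the example above, while the "current" instant (one hour after the edit) and the session offset (7 hours west of UTC) are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Node timestamp from the example above, as an aware UTC value.
node_time_utc = datetime(2023, 5, 28, 7, 36, 45, tzinfo=timezone.utc)

# Assumed "current" instant: one hour after the edit (hypothetical).
now_utc = node_time_utc + timedelta(hours=1)

# Assumed session time zone: 7 hours west of UTC (hypothetical).
session_tz = timezone(timedelta(hours=-7))

# Zone-less handling: the UTC wall-clock value is stored without a zone
# and then compared against the local wall-clock reading of "now".
node_naive = node_time_utc.replace(tzinfo=None)
now_local_naive = now_utc.astimezone(session_tz).replace(tzinfo=None)
age_naive = now_local_naive - node_naive

# Zone-aware handling: both values refer to absolute instants.
age_aware = now_utc - node_time_utc

print("naive age:", age_naive)  # negative: off by the session's UTC offset
print("aware age:", age_aware)  # the true age of one hour
```

Under these assumptions the zone-less calculation is off by exactly the session's UTC offset, which is why a recent edit can appear to lie in the future.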

Your reason for wanting timestamp is item 4 of the last list on https://community.spiceworks.com/topic/2454825-zone-of-misunderstanding, linked from the PostgreSQL wiki.

If you want to display timestamps in UTC in some DBs and in local time in others, set the timezone specifically for that database. This is breaking behavior for others to force your preferences on them.

> My server with osm2pgsql runs in a different time zone than my desktop. I'll get different results depending on where I query from

That's an argument for my position. You get the same result in both cases with timestamp without time zone. If you have timestamp with time zone you'll get different results.

The code has been merged a while back. Closing here.