mkreilly / mapcraft_eg

This repository contains a proof-of-concept mapping application. (It was previously the Bay Area Urban Geodatabase, but that will move to a new repo called petrale.)

parcel file format

fscottfoti opened this issue

@mkreilly I'm adding this here to have a historical record of the design decision...

I've been playing around with Tom's latest parcel file (different sizes and file formats) this morning and have a proposal and want to know what you think.

First, the data doesn't come with a jurisdiction (juris) attached, or at least it's not in this file. I assume that means the only way to assign one is via a geometric operation. Because of that, I'm inclined to split the parcel data up by county and put it at the county level in our folder hierarchy.

As for file format, Tom was using a csv in which the geometry sits in the geom field, encoded as WKB (written out as a hex string, as in the sample rows below). I think that's odd, and I've never seen it before, but it's kind of genius. Shapefile is not my preference because 1) it's actually 4 files, 2) column names are limited to 10 characters, and 3) it's opaque and hard to get into pandas (and takes forever to load into geopandas). Although GeoJSON solves all of these, it's a very wasteful file format and takes a lot of space. This csv format is both compact and transparent, so I'm for it. I can also read it into pandas with zero extra steps, and the pandas csv reader is very fast. Also, some counties are small enough to read and edit in Excel.
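To make the geom encoding concrete, here is a minimal standard-library sketch that unpacks the hex string from the first sample row in this issue. It assumes the value is PostGIS-style EWKB (the `0x20000000` flag in the type word means an SRID follows) holding a MultiPolygon; `decode_multipolygon` is a hypothetical helper name, not something in this repo. In practice a library call such as `shapely.wkb.loads(value, hex=True)` would probably be the simpler route.

```python
import struct

# PostGIS sets this bit in the geometry-type word when an SRID follows (EWKB).
EWKB_SRID_FLAG = 0x20000000

# geom value copied from the first sample row (gid 839510) in this issue.
SAMPLE_GEOM = (
    "0106000020D00A0000" "01000000" "0103000000" "01000000" "05000000"
    "AA5ACB9E871F3C41" "55515CDF95732541"
    "94B17C228B1F3C41" "169FAFEF74732541"
    "1462383F891F3C41" "62C84B1074732541"
    "DB4EE9CF851F3C41" "3F8BCE0B95732541"
    "AA5ACB9E871F3C41" "55515CDF95732541"
)


def decode_multipolygon(hex_geom):
    """Decode a hex (E)WKB MultiPolygon into (srid, polygons).

    polygons is a list of polygons, each a list of rings,
    each ring a list of (x, y) tuples.
    """
    raw = bytes.fromhex(hex_geom)
    pos = 0

    def take(fmt):
        # Unpack at the current offset and advance past what was read.
        nonlocal pos
        values = struct.unpack_from(fmt, raw, pos)
        pos += struct.calcsize(fmt)
        return values

    def header():
        # Byte order (1 = little-endian), type word, optional SRID.
        (order,) = take("B")
        endian = "<" if order == 1 else ">"
        (type_word,) = take(endian + "I")
        srid = take(endian + "I")[0] if type_word & EWKB_SRID_FLAG else None
        return endian, type_word & 0xFF, srid

    endian, geom_type, srid = header()
    if geom_type != 6:  # 6 = MultiPolygon
        raise ValueError(f"expected MultiPolygon, got type {geom_type}")

    (n_polygons,) = take(endian + "I")
    polygons = []
    for _ in range(n_polygons):
        p_endian, p_type, _p_srid = header()
        if p_type != 3:  # 3 = Polygon
            raise ValueError(f"expected Polygon, got type {p_type}")
        (n_rings,) = take(p_endian + "I")
        rings = []
        for _ in range(n_rings):
            (n_points,) = take(p_endian + "I")
            rings.append([take(p_endian + "dd") for _ in range(n_points)])
        polygons.append(rings)
    return srid, polygons


srid, polygons = decode_multipolygon(SAMPLE_GEOM)
print(srid, len(polygons), len(polygons[0][0]))  # prints: 2768 1 5
```

The sample row decodes to a single polygon with one closed ring of five points (the first and last points are identical), tagged with SRID 2768.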

I'm inclined to keep this file format and write a script, to be included in the repo, that converts the csv to a shapefile. I'd also like to zip each csv up, in order to 1) make it smaller and 2) give it a .zip file extension, which I will store using Git Large File Storage (LFS).
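The zip-then-read round trip can be sketched with the standard library alone. This is an illustrative sketch, not the script proposed above: `write_zipped_csv` and `read_zipped_csv` are hypothetical helper names, the rows are a three-column subset of the sample data, and it assumes one csv member per archive. With pandas, `pd.read_csv("napa_parcels.zip", dtype={"apn": str})` should do the reading step directly, since `read_csv` infers zip compression from the extension when the archive holds a single file.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def write_zipped_csv(zip_path, member_name, fieldnames, rows):
    """Write rows (a list of dicts) as one deflated csv member in a .zip."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, buffer.getvalue())


def read_zipped_csv(zip_path):
    """Read the single csv member of a .zip without extracting it to disk."""
    with zipfile.ZipFile(zip_path) as zf:
        (member,) = zf.namelist()  # assumes exactly one file per archive
        with zf.open(member) as binary:
            text = io.TextIOWrapper(binary, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Illustrative subset of the columns and rows shown in this issue.
rows = [
    {"gid": "839510", "county_id": "Napa", "apn": "001043010000"},
    {"gid": "839550", "county_id": "Napa", "apn": "001061004000"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "napa_parcels.zip"
    write_zipped_csv(path, "napa_parcels.csv", ["gid", "county_id", "apn"], rows)
    loaded = read_zipped_csv(path)

print(loaded == rows)  # True: values come back as strings, leading zeros intact
```

One design point worth noting: APNs such as `001043010000` have leading zeros, so they need to be read as strings rather than numbers, which is why the pandas call above passes `dtype`.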

As for schema, these are the columns currently included - I took out a couple that I thought weren't useful. I'm inclined to leave all the rest of these as they all seem pretty useful, even though we'll drop about half of them before running UrbanSim. I'll also have to document what I know about the fields unless Tom has started that documentation. Any thoughts on the schema before I check in the files?

gid county_id apn land_use_type_id res_type land_value improvement_value year_assessed year_built building_sqft non_residential_sqft residential_units stories tax_exempt condo_identifier geom imputation_flag development_type_id calc_area
839510 Napa 001043010000 90 other 0.0 0.0 2009.0 0.0 0.0 0.0 0.0 0106000020D00A00000100000001030000000100000005000000AA5ACB9E871F3C4155515CDF9573254194B17C228B1F3C41169FAFEF747325411462383F891F3C4162C84B1074732541DB4EE9CF851F3C413F8BCE0B95732541AA5ACB9E871F3C4155515CDF95732541 _ VAC 31.9294348034
839550 Napa 001061004000 2122 multi 74565.5 74565.5 2009.0 0.0 0.0 2.0 0.0 0.0 0106000020D00A00000100000001030000000100000005000000768C93A1C81B3C4188C8D2540D7125414FA26201BA1B3C41E0E3893607712541C50D7F1CB11B3C419DF9F9A061712541F84FE93CBF1B3C41F479DB7667712541768C93A1C81B3C4188C8D2540D712541 _ HT 676.22569872

Also, if you're curious, these are the file sizes using this format, which are not bad at all. It takes 7 seconds to read the largest one into pandas, even unzipping on the fly.

-rw-rw-r-- 1 ubuntu ubuntu 107318040 Jan 9 20:16 alameda_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu 159976492 Jan 9 20:16 contra_costa_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu 17929128 Jan 9 20:16 marin_parcels.zip
-rw------- 1 ubuntu ubuntu 46127392 Jan 9 20:16 napa_parcels.csv
-rw-rw-r-- 1 ubuntu ubuntu 15959116 Jan 9 20:16 napa_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu 19070741 Jan 9 20:16 san_francisco_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu 34959002 Jan 9 20:16 san_mateo_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu 59913474 Jan 9 20:17 santa_clara_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu 23305911 Jan 9 20:17 solano_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu 62548116 Jan 9 20:17 sonoma_parcels.zip

Alternatively, if we think we rarely need parcel shapes, we could keep the attributes in a csv with centroid x and y and store the shapes in a zipped shapefile. Would the centroid be sufficient for all the joins we need to do?
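For reference, the centroid columns in that alternative could be computed from a parcel ring with the shoelace formula. The sketch below is an assumption-laden illustration: `ring_centroid` is a hypothetical helper, it handles a single closed exterior ring only (no holes, no multi-part parcels), and it shifts coordinates to the first vertex before summing to limit floating-point cancellation with large projected coordinates.

```python
def ring_centroid(ring):
    """Area-weighted centroid of a closed ring [(x, y), ..., (x0, y0)].

    The ring must repeat its first point at the end, as WKB rings do.
    """
    origin_x, origin_y = ring[0]
    # Work relative to the first vertex to keep the cross products small.
    shifted = [(x - origin_x, y - origin_y) for x, y in ring]

    twice_area = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for (x0, y0), (x1, y1) in zip(shifted, shifted[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        sum_x += (x0 + x1) * cross
        sum_y += (y0 + y1) * cross

    if twice_area == 0:
        raise ValueError("degenerate ring with zero area")

    # Cx = sum_x / (6 * A) with A = twice_area / 2, so divide by 3 * twice_area.
    return (
        origin_x + sum_x / (3 * twice_area),
        origin_y + sum_y / (3 * twice_area),
    )


# A 2 x 2 square with its lower-left corner at (10, 20).
square = [(10, 20), (12, 20), (12, 22), (10, 22), (10, 20)]
print(ring_centroid(square))  # (11.0, 21.0)
```

One caveat for the join question: the centroid of a concave or ring-shaped parcel can fall outside the parcel itself, so a point-in-polygon join on centroids may mis-assign a few parcels. A point guaranteed to lie inside the shape (for example shapely's `representative_point()`) would avoid that, at the cost of needing the geometry library when the points are generated.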