parcel file format
fscottfoti opened this issue
@mkreilly I'm adding this here to have a historical record of the design decision...
I've been playing around with Tom's latest parcel file (different sizes and file formats) this morning and have a proposal and want to know what you think.
First, the data doesn't come with juris attached, at least not in this file. I assume that means the only way to assign it is via a geometric operation. Because of that, I'm inclined to split the parcel data up by county and put it at the county level in our folder hierarchy.
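For what it's worth, the per-county split could be a few lines of pandas. A minimal sketch, assuming column names like the ones in the schema table (the toy frame here is made-up data, not the real file):

```python
import pandas as pd

# Toy stand-in for the parcel table; in practice this would be
# pd.read_csv(...) on the real file (column names assumed).
parcels = pd.DataFrame({
    "gid": [1, 2, 3],
    "county_id": ["Napa", "Napa", "Marin"],
    "land_value": [0.0, 74565.5, 1200.0],
})

# One csv per county; in real use you'd write each group to disk
# with group.to_csv("<county>_parcels.csv", index=False).
county_files = {
    county: group.to_csv(index=False)
    for county, group in parcels.groupby("county_id")
}
print(sorted(county_files))  # ['Marin', 'Napa']
```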
As for file format, Tom was using a csv in which the geometry is stored in a geom field encoded as WKB. I think that's odd - I've never seen it before, but it's kind of genius. Shapefile is not my preference because 1) it's actually four files, 2) column names are limited to 10 characters, and 3) it's opaque and hard to get into pandas (and takes forever to load into geopandas). Geojson solves all of these problems, but it's a very wasteful format and takes a lot of space. This csv format is both compact and transparent, so I'm for it - I can read it into pandas with zero extra steps and the pandas csv reader is very fast. Also, some counties are small enough to read and edit in Excel.
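To illustrate the "zero steps" claim - the geom column is just a hex string to pandas, and decoding it to raw WKB bytes is one call (from there, something like shapely's WKB loader would give you an actual geometry object; shapely not shown here). The two-column sample below is truncated made-up data in the proposed format:

```python
import io
import pandas as pd

# A tiny sample in the proposed format: plain csv, geometry
# hex-encoded as WKB in a `geom` column (value shortened here).
csv_text = "gid,geom\n839510,0106000020D00A0000\n"

df = pd.read_csv(io.StringIO(csv_text), dtype={"geom": str})

# The geom column round-trips as an ordinary string; decoding to
# raw WKB bytes is a single stdlib call.
raw = bytes.fromhex(df.loc[0, "geom"])
print(len(raw))  # 9 bytes for this truncated header
```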
I'm inclined to keep this file format and write a script, included in the repo, to convert the csv to shapefile. I'd also like to zip each file, in order to 1) make it smaller and 2) give it a .zip ending so I can store it using git large file support.
As for schema, these are the columns currently included - I took out a couple that I thought weren't useful. I'm inclined to leave all the rest of these as they all seem pretty useful, even though we'll drop about half of them before running UrbanSim. I'll also have to document what I know about the fields unless Tom has started that documentation. Any thoughts on the schema before I check in the files?
gid | county_id | apn | land_use_type_id | res_type | land_value | improvement_value | year_assessed | year_built | building_sqft | non_residential_sqft | residential_units | stories | tax_exempt | condo_identifier | geom | imputation_flag | development_type_id | calc_area |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
839510 | Napa | 001043010000 | 90 | other | 0.0 | 0.0 | 2009.0 | 0.0 | 0.0 | 0.0 | 0.0 | | | | 0106000020D00A00000100000001030000000100000005000000AA5ACB9E871F3C4155515CDF9573254194B17C228B1F3C41169FAFEF747325411462383F891F3C4162C84B1074732541DB4EE9CF851F3C413F8BCE0B95732541AA5ACB9E871F3C4155515CDF95732541 | _ | VAC | 31.9294348034 |
839550 | Napa | 001061004000 | 2122 | multi | 74565.5 | 74565.5 | 2009.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | | | 0106000020D00A00000100000001030000000100000005000000768C93A1C81B3C4188C8D2540D7125414FA26201BA1B3C41E0E3893607712541C50D7F1CB11B3C419DF9F9A061712541F84FE93CBF1B3C41F479DB7667712541768C93A1C81B3C4188C8D2540D712541 | _ | HT | 676.22569872 |
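A quick sanity check on the geom values: the leading bytes of the Napa rows above look like PostGIS extended WKB, whose header carries a byte-order flag, the geometry type, and (when the 0x20000000 EWKB flag is set) an embedded SRID. A stdlib-only sketch decoding the first nine bytes of the first row:

```python
import struct

# First 9 bytes of the geom value from the first Napa row:
# byte-order flag, geometry type, then the embedded SRID
# (present because the 0x20000000 EWKB flag is set).
header = bytes.fromhex("0106000020D00A0000")

byte_order = "<" if header[0] == 1 else ">"  # 1 = little-endian
geom_type, srid = struct.unpack(byte_order + "II", header[1:9])

assert geom_type & 0x20000000  # EWKB SRID flag is set
print(geom_type & 0xFF, srid)  # 6 (MultiPolygon), 2768
```

So these parcels are multipolygons in SRID 2768, which any WKB-aware reader should handle.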
Also if you're curious, these are the file sizes using this format - not bad at all. It takes 7 seconds to read the largest one into pandas even unzipping on the fly.
```
-rw-rw-r-- 1 ubuntu ubuntu 107318040 Jan 9 20:16 alameda_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu 159976492 Jan 9 20:16 contra_costa_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu  17929128 Jan 9 20:16 marin_parcels.zip
-rw------- 1 ubuntu ubuntu  46127392 Jan 9 20:16 napa_parcels.csv
-rw-rw-r-- 1 ubuntu ubuntu  15959116 Jan 9 20:16 napa_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu  19070741 Jan 9 20:16 san_francisco_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu  34959002 Jan 9 20:16 san_mateo_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu  59913474 Jan 9 20:17 santa_clara_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu  23305911 Jan 9 20:17 solano_parcels.zip
-rw-rw-r-- 1 ubuntu ubuntu  62548116 Jan 9 20:17 sonoma_parcels.zip
```
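The "unzipping on the fly" part works because pandas can read a single-member zip directly. A small self-contained sketch (the zip is built in memory here with made-up data; in practice you'd just pass the .zip path to read_csv):

```python
import io
import zipfile
import pandas as pd

# Build a zip holding a single csv, the way the county files would
# be stored; pandas reads a one-member zip directly.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("napa_parcels.csv", "gid,county_id\n839510,Napa\n")
buf.seek(0)

df = pd.read_csv(buf, compression="zip")
print(df.shape)  # (1, 2)
```

With a real file path, `pd.read_csv("napa_parcels.zip")` infers the compression from the extension, so no explicit `compression=` argument is needed.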
Alternatively, if we think we'll rarely need the parcel shapes, we could keep the attributes in a csv with centroid x and y columns and store the shapes in a zipped shapefile. A centroid should be sufficient to do all the joins we need to do?