UDST / spandex

Spatial Analysis and Data Extraction

Home Page: http://nbviewer.ipython.org/github/synthicity/user_meeting_2014/blob/gh-pages/spandex/spandex_demo.ipynb

Scale/synthesize to match aggregate totals

janowicz opened this issue

A common step in preparing UrbanSim base year data is ensuring that disaggregate data (e.g. buildings, households, jobs) matches aggregate targets at specified geographic levels. For example, in the case of disaggregate building data, we might want to match residential unit counts by block, match median home values by tract, match building year built by zipcode, or match non-residential-sqft totals by zone. To match totals for a given geography, we either synthesize new agents to match the total by sampling/copying/allocating existing agents within the geography, or we select agents within the geography for deletion. When matching an aggregate mean/median/total of some agent attribute in a certain geography, a scale-to-match approach can also be taken.

We want to be able to:

  • Synthesize new things in the disaggregate data to match an aggregate target value. Disaggregate things are replicated to match the target, and new things are then assigned a location_id within the aggregate geography (i.e. an allocation step).
  • Scale values in the disaggregate data to match an aggregate target value.

A control_to_target function is envisioned that takes the following arguments: agent_df, controls_df, and optionally allocation_geography_df. There are similarities between what this function would do and what the existing transition model does, as examples below will show. There are also similarities between this function and an UrbanSim refinement model. If we want the code to live in UrbanSim instead of Spandex, that is fine.

The operation of this function can be illustrated with the fake control table examples below. A couple of points: the agent_type column may not be needed, since the agent type is implied by the agent_df argument, and the allocate_to column may not be needed, since the allocation geography is implied by the allocation_geography_df argument.
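Purely as an illustration of the envisioned interface, here is a minimal sketch. The helper logic is elided, and the convention that location_type implies an "_id" column is an assumption based on the examples below, not existing spandex/UrbanSim API.

```python
import pandas as pd

def control_to_target(agent_df, controls_df, allocation_geography_df=None):
    """Sketch: adjust disaggregate agents to match the aggregate controls.

    Each row of controls_df describes one target for one geography;
    how_match selects synthesize vs. scale behavior.
    """
    for _, control in controls_df.iterrows():
        # location_type implies the id column to filter on, e.g. 'zone' -> 'zone_id'
        id_col = control['location_type'] + '_id'
        in_geog = agent_df[id_col] == control['location_id']

        if control['how_match'] == 'synthesize':
            # copy or delete agents within the geography to hit the target, then
            # allocate new agents within allocation_geography_df
            # (see the per-example sketches further down)
            ...
        elif control['how_match'] in ('scale_to_mean', 'scale_to_median', 'scale_to_sum'):
            # scale the accounting attribute of agents in the geography so its
            # mean/median/sum matches the target
            ...
    return agent_df
```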

Example 1

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | household | | | 1000 | synthesize | building | residential_units | residential_units |
| zone | 2 | household | | | 3000 | synthesize | building | residential_units | residential_units |
| zone | 3 | household | | | 2000 | synthesize | building | residential_units | residential_units |

In Example 1, we match household totals by zone and allocate to buildings within the zone according to the distribution of residential units, respecting a capacity constraint. If zone 1 contains fewer than 1,000 households, we randomly sample the needed number of new households from the existing households in zone 1, copy them, then allocate the new households to buildings in the zone (i.e. assign a building_id). If zone 1 contains more than 1,000 households, we randomly sample existing households for deletion. The agent_df argument in this case would be a DataFrame of households with a zone_id column. The controls_df argument would be the table shown above. The allocation_geography_df would be a DataFrame of buildings with a zone_id column. In the allocation step, we would respect the capacity constraint identified in the allocation_capacity_attribute column of controls_df (the number of households assigned to a building should not exceed the number of residential units in the building).
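To make the synthesize branch concrete, here is a minimal sketch for this example. It assumes DataFrames named households and buildings with the columns described above (zone_id, building_id, residential_units), integer indexes, and enough vacant units in the zone to place the copies; match_household_count is a hypothetical helper, not existing spandex/UrbanSim code.

```python
import numpy as np
import pandas as pd

def match_household_count(households, buildings, zone_id, target):
    """Sketch of Example 1: copy or delete households in one zone to hit a
    target count, then allocate the copies to buildings in that zone,
    weighting by (and capping at) vacant residential units."""
    in_zone = households[households.zone_id == zone_id]
    diff = target - len(in_zone)

    if diff > 0:
        # sample existing households in the zone with replacement and copy them
        new = in_zone.sample(diff, replace=True).copy()
        new.index = np.arange(households.index.max() + 1,
                              households.index.max() + 1 + diff)  # fresh ids

        # allocation step: each vacant residential unit is one candidate slot,
        # which both weights by residential_units and enforces the capacity
        blds = buildings[buildings.zone_id == zone_id]
        occupied = in_zone.building_id.value_counts().reindex(blds.index).fillna(0)
        vacant = (blds.residential_units - occupied).clip(lower=0).astype(int)
        slots = np.repeat(blds.index.values, vacant.values)
        new['building_id'] = np.random.choice(slots, size=diff, replace=False)

        households = pd.concat([households, new])
    elif diff < 0:
        # too many households: randomly sample existing ones for deletion
        households = households.drop(in_zone.sample(-diff).index)

    return households
```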

Example 2

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | household | persons | income < 40000 | 5600 | synthesize | building | residential_units | residential_units |
| zone | 2 | household | persons | income < 40000 | 5600 | synthesize | building | residential_units | residential_units |
| zone | 3 | household | persons | income >= 50000 | 2000 | synthesize | building | residential_units | residential_units |

In Example 2, we populate both the agent_accounting_attribute and agent_filter columns in the control table. This means that the target value now refers to persons, not households, and the households we sample to meet this target must pass the agent_filter. Summing household.persons for households in zone 1 where household income is less than 40,000 should result in 5,600. In other words, there are 5,600 people in zone 1 in households with household income less than 40,000.
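A hedged sketch of how the target would be interpreted when both columns are populated, assuming the same households DataFrame as above (the loop is illustrative only):

```python
# Example 2 sketch: the target counts household.persons, not households, and only
# households passing the agent_filter participate. Column names as in the table.
eligible = households.query('income < 40000')
in_zone = eligible[eligible.zone_id == 1]

target = 5600
current = in_zone.persons.sum()

copies = []
while current < target:
    # copy one sampled eligible household at a time; this may overshoot the
    # target by at most one household's persons
    row = in_zone.sample(1)
    copies.append(row)
    current += int(row.persons.iloc[0])
# the copies would then get new ids and a building_id, as in Example 1
```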

Example 3

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | job | | | 500 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |
| zone | 2 | job | | | 1200 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |
| zone | 3 | job | | | 700 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |

In Example 3, we want to match zonal job targets (agent_type == 'job') and allocate new jobs to buildings, weighted by non_residential_sqft. The allocation_capacity_attribute reflects the assumption that each job spot takes up 250 sq ft, and we don't want to exceed the number of job spots in the buildings being allocated to. After running control_to_target, there will be 500 jobs with zone_id 1.
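Since allocation_weight and allocation_capacity_attribute can hold simple expressions rather than bare column names, one possible (assumed, not specified) implementation detail is to evaluate them with pandas' DataFrame.eval against the allocation geography table:

```python
# Example 3 sketch: turn the control-table expressions into per-building
# weights and job-spot capacities using DataFrame.eval.
control = {'allocation_weight': 'non_residential_sqft',
           'allocation_capacity_attribute': 'non_residential_sqft/250'}

blds = buildings[buildings.zone_id == 1]
weights = blds.eval(control['allocation_weight'])                    # sampling weights
job_spots = blds.eval(control['allocation_capacity_attribute']).round().astype(int)
# weights drive which buildings receive new jobs; job_spots caps jobs per building
```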

Example 4

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | job | | sector_id == 11 | 150 | synthesize | building | non_residential_sqft + 50*residential_units | non_residential_sqft/250 |
| zone | 2 | job | | sector_id == 11 | 500 | synthesize | building | non_residential_sqft + 50*residential_units | non_residential_sqft/250 |
| zone | 3 | job | | sector_id == 32 | 200 | synthesize | building | non_residential_sqft + 50*residential_units | non_residential_sqft/250 |

In Example 4, we control to job targets again, but now the agent_filter column is populated. In zone_id 1, the target of 150 applies only to jobs in sector 11. There should be 150 jobs in zone_id 1 with sector_id 11. Existing jobs in sector 11 and zone 1 are either copied or deleted to match this target. New jobs get a new, unique job_id.
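One way the "new, unique job_id" requirement might be handled (an assumption; n_needed and the integer-index convention are illustrative, not part of the proposal):

```python
# Example 4 sketch: copies of sector-11 jobs in zone 1 get fresh ids so they
# don't collide with existing jobs (assumes integer job ids in the index).
pool = jobs.query('sector_id == 11 and zone_id == 1')
new_jobs = pool.sample(n_needed, replace=True).copy()
new_jobs.index = range(jobs.index.max() + 1, jobs.index.max() + 1 + len(new_jobs))
jobs = pd.concat([jobs, new_jobs])
```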

Example 5

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | building | residential_units | | 800 | synthesize | parcel | parcel_sqft | parcel_sqft/500 |
| zone | 2 | building | residential_units | | 200 | synthesize | parcel | parcel_sqft | parcel_sqft/500 |
| zone | 3 | building | residential_units | | 350 | synthesize | parcel | parcel_sqft | parcel_sqft/500 |

In Example 5, we want to match residential_units targets. Summing building.residential_units for buildings in zone_id 1 should give us 800. Existing buildings with residential units are sampled, copied, and allocated if the existing zonal residential unit count is too low; otherwise, residential buildings are sampled for deletion if the existing count is too high. We allocate new synthetic buildings to parcels, weighting the allocation by parcel_sqft and respecting the parcel_sqft/500 capacity constraint.

Example 6

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | building | non_residential_sqft | building_type_id == 5 | 30000 | synthesize | parcel | parcel_sqft | parcel_sqft/2 |
| zone | 2 | building | non_residential_sqft | building_type_id == 5 | 85000 | synthesize | parcel | parcel_sqft | parcel_sqft/2 |
| zone | 3 | building | non_residential_sqft | building_type_id == 5 | 72000 | synthesize | parcel | parcel_sqft | parcel_sqft/2 |

In Example 6, we match non_residential_sqft totals by zone for building_type_id 5. Note the agent_filter column.

Example 7

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| parcel | 111 | job | | | 50 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |
| parcel | 112 | job | | | 120 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |
| parcel | 113 | job | | | 70 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |

In Example 7, notice that the location_type is 'parcel' instead of 'zone'. We are matching parcel-level employment targets: there should be 50 jobs attached to buildings on the parcel with parcel_id 111.

Example 8

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| tract | 7 | household | income | | 70000 | scale_to_mean | | | |
| tract | 8 | household | income | | 84000 | scale_to_mean | | | |
| tract | 9 | household | income | | 39000 | scale_to_mean | | | |

In Example 8, we are scaling to match the target instead of synthesizing to match (see the how_match column). Here we scale household incomes by tract to match the observed mean household income by tract. The average household income in tract 7 is 70,000 and we want the disaggregate data to reflect this. Notice that when scaling to match, new agents are not synthesized, so the "allocate_" columns are left blank.
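A minimal sketch of the scale_to_mean case, assuming a households DataFrame with tract_id and income columns:

```python
# Example 8 sketch: scale household incomes in tract 7 so their mean hits 70,000.
households['income'] = households['income'].astype(float)   # avoid integer truncation
in_tract = households.tract_id == 7
factor = 70000 / households.loc[in_tract, 'income'].mean()
households.loc[in_tract, 'income'] *= factor
```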

Example 9

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| tract | 7 | building | year_built | building_type_id < 3 | 1995 | scale_to_median | | | |
| tract | 8 | building | year_built | building_type_id < 3 | 1978 | scale_to_median | | | |
| tract | 9 | building | year_built | building_type_id < 3 | 1925 | scale_to_median | | | |

In Example 9, we scale building year_built to match the observed tract median year built. In tract 7, we want the median year_built of buildings with building_type_id less than 3 to be 1995. Note 'scale_to_median' in the how_match column.

Example 10

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | building | non_residential_sqft | | 100000 | scale_to_sum | | | |
| zone | 2 | building | non_residential_sqft | | 40000 | scale_to_sum | | | |
| zone | 3 | building | non_residential_sqft | | 75000 | scale_to_sum | | | |

In Example 10, we scale building non_residential_sqft to match a zonal target for non_residential_sqft. We want there to be 100,000 square feet of non-residential space in zone 1, and we want to match this target by scaling existing building records with non-residential sqft instead of synthesizing new building records. We scale existing values downwards or upwards depending on whether the zonal target is currently exceeded or short. Note 'scale_to_sum' in the how_match column.
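The scale_to_sum case is the same idea applied to a total rather than a mean (a sketch, assuming a buildings DataFrame with zone_id and non_residential_sqft columns):

```python
# Example 10 sketch: scale non_residential_sqft in zone 1 so the zonal total is 100,000.
buildings['non_residential_sqft'] = buildings['non_residential_sqft'].astype(float)
in_zone = buildings.zone_id == 1
factor = 100000 / buildings.loc[in_zone, 'non_residential_sqft'].sum()
buildings.loc[in_zone, 'non_residential_sqft'] *= factor
```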

In all of your examples the how_match, allocate_to, allocation_weight, and allocation_capacity_attribute columns are always all the same. Is that ever not going to be the case?

When assigning, say, households to buildings, does there need to be any method or priority given to how those buildings are chosen for filling? Do we pick a bunch of buildings and put in one household per building, or fill buildings to capacity?

For a given control table, how_match, allocate_to, allocation_weight, and allocation_capacity_attribute will typically be the same unless, for whatever reason, someone puts multiple agent_types that they want different allocation rules for into the same table (e.g. household targets in the same table as job targets). If a table only deals with one kind of agent (and we can make this a requirement if it's useful), those columns would typically be populated with the same thing.

We want to allow buildings to fill to capacity, respecting allocation_capacity_attribute as the maximum capacity. Randomly allocating down to the building level is fine in this case. If there's no more capacity left in the given geography (hopefully we wouldn't run into this because we'd control to residential unit targets first), we'd want to print a warning and either overstuff or leave agents unplaced. Maybe we can add another argument to indicate whether overstuffing is OK.

Are the location_type and location_id really just another type of filter?

Yes, they're like a filter on agents and alternatives in the case of synthesizing to match. In the example of buildings and zone_id 1, we'd only allocate agents to buildings where buildings.zone_id == 1, and we'd only copy agents that are already in zone_id 1. The location_type tells us which location id column to filter on. E.g. location_type could be zone or parcel, and we'd use zone_id or parcel_id to filter accordingly.

Yes, in the case of scale_to_mean/scale_to_median/scale_to_sum, location_type and location_id are a filter on the agents/things to scale. E.g. if location_type was tract and location_id was 5, we'd only use buildings in tract 5 (buildings.tract_id == 5) when matching that tract's median year_built target.

How attached are we to the "control tables" idea for this? It seems like a lot of information in the tables is redundant and amenable to expression in other ways, such as function arguments. But I don't know if there is some context for putting everything in tables like your examples.

The table idea is to give clients a way to specify a set of controls for a whole region, e.g. they'd provide us with a CSV table of residential unit totals by block for all blocks in the region. Good point; it's true that a lot of the columns would always contain redundant information since we'd be taking the same approach in each geography. It's fine if we stripped the tables down to just a location_id and a target, with all other info provided as function arguments.

When adding and removing rows to meet an aggregate target metric (like persons), you can prioritize removing/adding lots of rows by focusing on rows with small metric values, or removing/adding fewer rows by focusing on rows with large metric values. Any opinion on which route is better?

No preference as to which is better, though to the extent possible it'd be great not to skew the distribution too much in one direction or the other (i.e. adding all small households or all big households).

I've gone with adding/removing large things for now, though it cycles through all the existing rows in size order. It could be randomized.
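A sketch of that "large things first, cycling in size order" behavior for a persons-type target (in_zone and target are illustrative names; the wrap-around is what "cycles through all the existing rows" refers to):

```python
import pandas as pd

# cycle through eligible households from largest to smallest persons count,
# copying one at a time until the summed persons reaches the target
ordered = in_zone.sort_values('persons', ascending=False)
shortfall = target - in_zone.persons.sum()

copies, i = [], 0
while shortfall > 0:
    row = ordered.iloc[[i % len(ordered)]]   # wrap around and keep cycling
    copies.append(row)
    shortfall -= int(row.persons.iloc[0])
    i += 1
new = pd.concat(copies) if copies else ordered.iloc[0:0]
```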

Just read through this so I may be behind your thoughts. Two things: 1) when you need to impute household or employee counts, wouldn't you very much want to use the UrbanSim location choice models, so you could use all the existing infrastructure to get richer households in more expensive buildings, employees in the correct building types, etc.? 2) on the housing unit side of things, to some extent the duplication of other buildings within the zone (as described) will move the zone toward the control total, but it seems that often there are enough buildings and their housing unit count is just too low. Can your architecture also treat individual units as the agent type? Or would you use a different approach to adjust the unit count attribute (as with year built)?

1) Definitely. But in some cases we may not want to (e.g. if a location choice model hasn't been estimated for the region yet, or random allocation is good enough because the location choice model is most appropriate for regional allocation, or building type was not explanatory), so I wanted to start with the simplest case first. Great reminder though that alternative methods of allocation should be supported, particularly the application of a location choice model. I'll add a story for this. 2) If not duplicating is preferred, the approach in Example 10 can be used, increasing attribute values to match an aggregate total. If a units table is available, then synthesizing units and allocating them to buildings would work too.

1) Great: I can see why both would be useful. 2) Also sounds good: just wanted to remind you of our problems last time we tried this. We ended up filling truly vacant lots (that we need for infill in our scenarios) with imputed buildings. Closer inspection revealed the need for (my poorly named) reverse-iceberg approach, where additional units are added to numerous existing buildings to account for what seems like a frequent undercounting of units in the assessor's roll (which is a tax calculation, after all).

Sounds good, thanks. The example you mention is a very useful situation to keep in mind.