UDST / spandex

Spatial Analysis and Data Extraction

Home Page: http://nbviewer.ipython.org/github/synthicity/user_meeting_2014/blob/gh-pages/spandex/spandex_demo.ipynb

Scale/synthesize to match aggregate totals

janowicz opened this issue

A common step in preparing UrbanSim base year data is ensuring that disaggregate data (e.g. buildings, households, jobs) matches aggregate targets at specified geographic levels. For example, in the case of disaggregate building data, we might want to match residential unit counts by block, match median home values by tract, match building year built by zipcode, or match non-residential-sqft totals by zone. To match totals for a given geography, we either synthesize new agents to match the total by sampling/copying/allocating existing agents within the geography, or we select agents within the geography for deletion. When matching an aggregate mean/median/total of some agent attribute in a certain geography, a scale-to-match approach can also be taken.

We want to be able to:

  • Synthesize new things in the disaggregate data to match an aggregate target value. Disaggregate things are replicated to match the target, and new things are then assigned a location_id within the aggregate geography (i.e. an allocation step).
  • Scale values in the disaggregate data to match an aggregate target value.

A control_to_target function is envisioned that takes the following arguments: agent_df, controls_df, and optionally allocation_geography_df. There are similarities between what this function would do and what the existing transition model does, as examples below will show. There are also similarities between this function and an UrbanSim refinement model. If we want the code to live in UrbanSim instead of Spandex, that is fine.

The operation of this function can be illustrated with the fake control table examples below. A couple of points: the agent_type column may not be needed, since the agent type is implied by the agent_df argument, and the allocate_to column may not be needed, since the allocation geography is implied by the allocation_geography_df argument.
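Purely as an illustration of the envisioned interface, here is a minimal sketch. The helper logic is elided, and the convention that location_type implies an "_id" column is an assumption based on the examples below, not existing spandex/UrbanSim API.

```python
import pandas as pd

def control_to_target(agent_df, controls_df, allocation_geography_df=None):
    """Sketch: adjust disaggregate agents to match the aggregate controls.

    Each row of controls_df describes one target for one geography;
    how_match selects synthesize vs. scale behavior.
    """
    for _, control in controls_df.iterrows():
        # location_type implies the id column to filter on, e.g. 'zone' -> 'zone_id'
        id_col = control['location_type'] + '_id'
        in_geog = agent_df[id_col] == control['location_id']

        if control['how_match'] == 'synthesize':
            # copy or delete agents within the geography to hit the target, then
            # allocate new agents within allocation_geography_df
            # (see the per-example sketches further down)
            ...
        elif control['how_match'] in ('scale_to_mean', 'scale_to_median', 'scale_to_sum'):
            # scale the accounting attribute of agents in the geography so its
            # mean/median/sum matches the target
            ...
    return agent_df
```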

Example 1

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | household | | | 1000 | synthesize | building | residential_units | residential_units |
| zone | 2 | household | | | 3000 | synthesize | building | residential_units | residential_units |
| zone | 3 | household | | | 2000 | synthesize | building | residential_units | residential_units |

In Example 1, we match household totals by zone and allocate to buildings within the zone according to the distribution of residential units, respecting a capacity constraint. If zone 1 contains fewer than 1,000 households, we randomly sample the needed number of new households from the existing households in zone 1, copy them, then allocate the new households to buildings in the zone (i.e. assign a building_id). If zone 1 contains more than 1,000 households, we randomly sample existing households for deletion. The agent_df argument in this case would be a DataFrame of households with a zone_id column. The controls_df argument would be the table shown above. The allocation_geography_df would be a DataFrame of buildings with a zone_id column. In the allocation step, we would respect the capacity constraint identified in the allocation_capacity_attribute column of controls_df (the number of households assigned to a building should not exceed the number of residential units in the building).
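To make the synthesize branch concrete, here is a minimal sketch for this example. It assumes DataFrames named households and buildings with the columns described above (zone_id, building_id, residential_units), integer indexes, and enough vacant units in the zone to place the copies; match_household_count is a hypothetical helper, not existing spandex/UrbanSim code.

```python
import numpy as np
import pandas as pd

def match_household_count(households, buildings, zone_id, target):
    """Sketch of Example 1: copy or delete households in one zone to hit a
    target count, then allocate the copies to buildings in that zone,
    weighting by (and capping at) vacant residential units."""
    in_zone = households[households.zone_id == zone_id]
    diff = target - len(in_zone)

    if diff > 0:
        # sample existing households in the zone with replacement and copy them
        new = in_zone.sample(diff, replace=True).copy()
        new.index = np.arange(households.index.max() + 1,
                              households.index.max() + 1 + diff)  # fresh ids

        # allocation step: each vacant residential unit is one candidate slot,
        # which both weights by residential_units and enforces the capacity
        blds = buildings[buildings.zone_id == zone_id]
        occupied = in_zone.building_id.value_counts().reindex(blds.index).fillna(0)
        vacant = (blds.residential_units - occupied).clip(lower=0).astype(int)
        slots = np.repeat(blds.index.values, vacant.values)
        new['building_id'] = np.random.choice(slots, size=diff, replace=False)

        households = pd.concat([households, new])
    elif diff < 0:
        # too many households: randomly sample existing ones for deletion
        households = households.drop(in_zone.sample(-diff).index)

    return households
```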

Example 2

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | household | persons | income < 40000 | 5600 | synthesize | building | residential_units | residential_units |
| zone | 2 | household | persons | income < 40000 | 5600 | synthesize | building | residential_units | residential_units |
| zone | 3 | household | persons | income >= 50000 | 2000 | synthesize | building | residential_units | residential_units |

In Example 2, we populate both the agent_accounting_attribute and agent_filter columns in the control table. This means that the target value now refers to persons, not households, and the households we sample to meet this target must pass the agent_filter. Summing household.persons for households in zone 1 where household income is less than 40,000 should result in 5,600. In other words, there are 5,600 people in zone 1 in households with household income less than 40,000.
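A hedged sketch of how the target would be interpreted when both columns are populated, assuming the same households DataFrame as above (the loop is illustrative only):

```python
# Example 2 sketch: the target counts household.persons, not households, and only
# households passing the agent_filter participate. Column names as in the table.
eligible = households.query('income < 40000')
in_zone = eligible[eligible.zone_id == 1]

target = 5600
current = in_zone.persons.sum()

copies = []
while current < target:
    # copy one sampled eligible household at a time; this may overshoot the
    # target by at most one household's persons
    row = in_zone.sample(1)
    copies.append(row)
    current += int(row.persons.iloc[0])
# the copies would then get new ids and a building_id, as in Example 1
```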

Example 3

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | job | | | 500 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |
| zone | 2 | job | | | 1200 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |
| zone | 3 | job | | | 700 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |

In Example 3, we want to match zonal job targets (agent_type == 'job') and allocate new jobs to buildings, weighted by non_residential_sqft. The allocation_capacity_attribute reflects the assumption that each job spot takes up 250 sq ft, and we don't want to exceed the number of job spots in the buildings being allocated to. After running control_to_target, there will be 500 jobs with zone_id 1.
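Since allocation_weight and allocation_capacity_attribute can hold simple expressions rather than bare column names, one possible (assumed, not specified) implementation detail is to evaluate them with pandas' DataFrame.eval against the allocation geography table:

```python
# Example 3 sketch: turn the control-table expressions into per-building
# weights and job-spot capacities using DataFrame.eval.
control = {'allocation_weight': 'non_residential_sqft',
           'allocation_capacity_attribute': 'non_residential_sqft/250'}

blds = buildings[buildings.zone_id == 1]
weights = blds.eval(control['allocation_weight'])                    # sampling weights
job_spots = blds.eval(control['allocation_capacity_attribute']).round().astype(int)
# weights drive which buildings receive new jobs; job_spots caps jobs per building
```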

Example 4

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | job | | sector_id == 11 | 150 | synthesize | building | non_residential_sqft + 50*residential_units | non_residential_sqft/250 |
| zone | 2 | job | | sector_id == 11 | 500 | synthesize | building | non_residential_sqft + 50*residential_units | non_residential_sqft/250 |
| zone | 3 | job | | sector_id == 32 | 200 | synthesize | building | non_residential_sqft + 50*residential_units | non_residential_sqft/250 |

In Example 4, we control to job targets again, but now the agent_filter column is populated. In zone_id 1, the target of 150 applies only to jobs in sector 11. There should be 150 jobs in zone_id 1 with sector_id 11. Existing jobs in sector 11 and zone 1 are either copied or deleted to match this target. New jobs get a new, unique job_id.
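One way the "new, unique job_id" requirement might be handled (an assumption; n_needed and the integer-index convention are illustrative, not part of the proposal):

```python
# Example 4 sketch: copies of sector-11 jobs in zone 1 get fresh ids so they
# don't collide with existing jobs (assumes integer job ids in the index).
pool = jobs.query('sector_id == 11 and zone_id == 1')
new_jobs = pool.sample(n_needed, replace=True).copy()
new_jobs.index = range(jobs.index.max() + 1, jobs.index.max() + 1 + len(new_jobs))
jobs = pd.concat([jobs, new_jobs])
```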

Example 5

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | building | residential_units | | 800 | synthesize | parcel | parcel_sqft | parcel_sqft/500 |
| zone | 2 | building | residential_units | | 200 | synthesize | parcel | parcel_sqft | parcel_sqft/500 |
| zone | 3 | building | residential_units | | 350 | synthesize | parcel | parcel_sqft | parcel_sqft/500 |

In Example 5, we want to match residential_units targets. Summing building.residential_units for buildings in zone_id 1 should give us 800. Existing buildings with residential units are sampled, copied, and allocated if the existing zonal residential unit count is too low; otherwise, residential buildings are sampled for deletion if the existing count is too high. We allocate new synthetic buildings to parcels, weighting the allocation by parcel_sqft and respecting the parcel_sqft/500 capacity constraint.

Example 6

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | building | non_residential_sqft | building_type_id == 5 | 30000 | synthesize | parcel | parcel_sqft | parcel_sqft/2 |
| zone | 2 | building | non_residential_sqft | building_type_id == 5 | 85000 | synthesize | parcel | parcel_sqft | parcel_sqft/2 |
| zone | 3 | building | non_residential_sqft | building_type_id == 5 | 72000 | synthesize | parcel | parcel_sqft | parcel_sqft/2 |

In Example 6, we match non_residential_sqft totals by zone for building_type_id 5. Note the agent_filter column.

Example 7

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| parcel | 111 | job | | | 50 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |
| parcel | 112 | job | | | 120 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |
| parcel | 113 | job | | | 70 | synthesize | building | non_residential_sqft | non_residential_sqft/250 |

In Example 7, notice that the location_type is 'parcel' instead of 'zone'. We are matching parcel-level employment targets: there should be 50 jobs attached to buildings on the parcel with parcel_id 111.

Example 8

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| tract | 7 | household | income | | 70000 | scale_to_mean | | | |
| tract | 8 | household | income | | 84000 | scale_to_mean | | | |
| tract | 9 | household | income | | 39000 | scale_to_mean | | | |

In Example 8, we are scaling to match the target instead of synthesizing to match (see the how_match column). Here we scale household incomes by tract to match the observed mean household income by tract. The average household income in tract 7 is 70,000 and we want the disaggregate data to reflect this. Notice that when scaling to match, new agents are not synthesized, so the "allocate_" columns are left blank.
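A minimal sketch of the scale_to_mean case, assuming a households DataFrame with tract_id and income columns:

```python
# Example 8 sketch: scale household incomes in tract 7 so their mean hits 70,000.
households['income'] = households['income'].astype(float)   # avoid integer truncation
in_tract = households.tract_id == 7
factor = 70000 / households.loc[in_tract, 'income'].mean()
households.loc[in_tract, 'income'] *= factor
```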

Example 9

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| tract | 7 | building | year_built | building_type_id < 3 | 1995 | scale_to_median | | | |
| tract | 8 | building | year_built | building_type_id < 3 | 1978 | scale_to_median | | | |
| tract | 9 | building | year_built | building_type_id < 3 | 1925 | scale_to_median | | | |

In Example 9, we scale building year_built to match the observed tract median year built. In tract 7, we want the median year_built of buildings with building_type_id less than 3 to be 1995. Note 'scale_to_median' in the how_match column.

Example 10

| location_type | location_id | agent_type | agent_accounting_attribute | agent_filter | target | how_match | allocate_to | allocation_weight | allocation_capacity_attribute |
|---|---|---|---|---|---|---|---|---|---|
| zone | 1 | building | non_residential_sqft | | 100000 | scale_to_sum | | | |
| zone | 2 | building | non_residential_sqft | | 40000 | scale_to_sum | | | |
| zone | 3 | building | non_residential_sqft | | 75000 | scale_to_sum | | | |

In Example 10, we scale building non_residential_sqft to match a zonal target for non_residential_sqft. We want there to be 100,000 square feet of non-residential space in zone 1, and we want to match this target by scaling existing building records with non-residential sqft instead of synthesizing new building records. We scale existing values downwards or upwards depending on whether the zonal target is currently exceeded or short. Note 'scale_to_sum' in the how_match column.
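The scale_to_sum case is the same idea applied to a total rather than a mean (a sketch, assuming a buildings DataFrame with zone_id and non_residential_sqft columns):

```python
# Example 10 sketch: scale non_residential_sqft in zone 1 so the zonal total is 100,000.
buildings['non_residential_sqft'] = buildings['non_residential_sqft'].astype(float)
in_zone = buildings.zone_id == 1
factor = 100000 / buildings.loc[in_zone, 'non_residential_sqft'].sum()
buildings.loc[in_zone, 'non_residential_sqft'] *= factor
```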

In all of your examples the how_match, allocate_to, allocation_weight, and allocation_capacity_attribute columns are always all the same. Is that ever not going to be the case?

When assigning, say, households to buildings, does there need to be any method or priority given to how those buildings are chosen for filling? Do we pick a bunch of buildings and put in one household per building, or fill buildings to capacity?

For a given control table, how_match, allocate_to, allocation_weight, and allocation_capacity_attribute will typically be the same unless, for whatever reason, someone puts multiple agent_types that they want different allocation rules for into the same table (e.g. household targets in the same table as job targets). If a table only deals with one kind of agent (and we can make this a requirement if it's useful), those columns would typically be populated with the same thing.

We want to allow buildings to fill to capacity, respecting allocation_capacity_attribute as the maximum capacity. Randomly allocating down to the building level is fine in this case. If there's no more capacity left in the given geography (hopefully we wouldn't run into this because we'd control to residential unit targets first), we'd want to print a warning and either overstuff or leave agents unplaced. Maybe we can add another argument to indicate whether overstuffing is OK.

Are the location_type and location_id really just another type of filter?

Yes, they're like a filter on agents and alternatives in the case of synthesizing to match. In the example of buildings and zone_id 1, we'd only allocate agents to buildings where buildings.zone_id == 1, and we'd only copy agents that are already in zone_id 1. The location_type tells us which location id column to filter on. E.g. location_type could be zone or parcel, and we'd use zone_id or parcel_id to filter accordingly.

Yes, in the case of scale_to_mean/scale_to_median/scale_to_sum, location_type and location_id are a filter on the agents/things to scale. E.g. if location_type was tract and location_id was 5, we'd only use buildings in tract 5 (buildings.tract_id == 5) when matching that tract's median year_built target.

How attached are we to the "control tables" idea for this? It seems like a lot of information in the tables is redundant and amenable to expression in other ways, such as function arguments. But I don't know if there is some context for putting everything in tables like your examples.

The table idea is to give clients a way to specify a set of controls for a whole region, e.g. they'd provide us with a CSV table of residential unit totals by block for all blocks in the region. Good point; it's true that a lot of the columns would always contain redundant information since we'd be taking the same approach in each geography. It's fine if we stripped the tables down to just a location_id and a target, with all other info provided as function arguments.

When adding and removing rows to meet an aggregate target metric (like persons), you can prioritize removing/adding lots of rows by focusing on rows with small metric values, or removing/adding fewer rows by focusing on rows with large metric values. Any opinion on which route is better?

No preference as to which is better, though to the extent possible it'd be great not to skew the distribution too much in one direction or the other (i.e. adding all small households or all big households).

I've gone with adding/removing large things for now, though it cycles through all the existing rows in size order. It could be randomized.
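A sketch of that "large things first, cycling in size order" behavior for a persons-type target (in_zone and target are illustrative names; the wrap-around is what "cycles through all the existing rows" refers to):

```python
import pandas as pd

# cycle through eligible households from largest to smallest persons count,
# copying one at a time until the summed persons reaches the target
ordered = in_zone.sort_values('persons', ascending=False)
shortfall = target - in_zone.persons.sum()

copies, i = [], 0
while shortfall > 0:
    row = ordered.iloc[[i % len(ordered)]]   # wrap around and keep cycling
    copies.append(row)
    shortfall -= int(row.persons.iloc[0])
    i += 1
new = pd.concat(copies) if copies else ordered.iloc[0:0]
```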

Just read through this so I may be behind your thoughts. Two things: 1) when you need to impute household or employee counts, wouldn't you very much want to use the UrbanSim location choice models, so you could use all the existing infrastructure to get richer households in more expensive buildings, employees in the correct building types, etc.? 2) on the housing unit side of things, to some extent the duplication of other buildings within the zone (as described) will move the zone toward the control total, but it seems that often there are enough buildings and their housing unit count is just too low. Can your architecture also treat individual units as the agent type? Or would you use a different approach to adjust the unit count attribute (as with year built)?

1) Definitely. But in some cases we may not want to (e.g. if a location choice model hasn't been estimated for the region yet, or random allocation is good enough because the location choice model is most appropriate for regional allocation, or building type was not explanatory), so I wanted to start with the simplest case first. Great reminder though that alternative methods of allocation should be supported, particularly the application of a location choice model. I'll add a story for this. 2) If not duplicating is preferred, the approach in Example 10 can be used, increasing attribute values to match an aggregate total. If a units table is available, then synthesizing units and allocating them to buildings would work too.

1) Great: I can see why both would be useful. 2) Also sounds good: just wanted to remind you of our problems last time we tried this. We ended up filling truly vacant lots (that we need for infill in our scenarios) with imputed buildings. Closer inspection revealed the need for (my poorly named) reverse-iceberg approach, where additional units are added to numerous existing buildings to account for what seems like a frequent undercounting of units in the assessor's roll (which is a tax calculation, after all).

Sounds good, thanks. The example you mention is a very useful situation to keep in mind.