Methodology - Population, Housing, and Income Estimates
First a quick overview:
Building population estimates requires several inputs. The population of an area changes through births (additions), deaths (subtractions), and people moving in or out. The starting point is the 2000 Census Summary File 1 (SF1) BLOCK level data set, which is built from the short-form questions. It has the most detailed and comprehensive numbers on where the entire population of the US lives, along with their age and race. To progress from the 2000 data to current year estimates, we use the US Census Bureau's (USCB) annual County and State level estimates to roll the numbers forward. Because the USCB data is only available at the County and State level, the next challenge is distributing the data down to the smaller geographies. The easiest (and sloppiest) way is to assume that all Block Groups, Tracts, and Zips in a given county changed at the same rate as the county from 2000 to the current year. So if the county grew by 7%, then each block group is assigned a 7% increase. This ignores differences in growth within a county, so it is not an ideal way to distribute these changes.
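The "sloppy" uniform approach described above can be sketched in a few lines of Python. This is only an illustration: the function name, block group labels, and population figures are hypothetical and are not taken from the actual data.

```python
def apply_county_multiplier(bg_pop_2000, county_pop_2000, county_pop_now):
    """Scale every block group by the county-wide growth ratio.

    This is the naive method: every block group in the county
    is assumed to grow at exactly the county's rate.
    """
    multiplier = county_pop_now / county_pop_2000
    return {bg: pop * multiplier for bg, pop in bg_pop_2000.items()}


# Hypothetical county of 3,000 people that grew by 7% to 3,210.
bg_2000 = {"BG-1": 1000, "BG-2": 500, "BG-3": 1500}
naive_estimates = apply_county_multiplier(bg_2000, 3000, 3210)
```

Every block group ends up 7% larger, whether or not any new housing was actually built there.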
Instead, the next step is to work with actuarial tables for births and deaths by age and race, and use them to model the "likelihood" of dying or of having a child. This model is the engine that drives the natural increase and decrease in population.
The third step is to look at immigration and emigration: where people are moving "to" and where they are moving "from". The US Postal Service records each move with a "to" and a "from" location.
Now the more detailed explanation:
1. Working with the Census Bureau "estimation base" county level numbers.
This data is processed to obtain "race distribution" coefficients. The Census Bureau estimation base data do not include an "other race" category, and the "two or more races" category is much smaller than it is in the SF1/SF3 Census data. By comparing the estimation base to SF1 county level data, we can derive numeric ratios describing how the "other race" and "two or more races" populations were distributed among the remaining races in the USCB's estimation base. These coefficients allow us to re-map the SF1 block level data and redistribute the "other race" population and part of the "two or more races" population among the 6 remaining mutually exclusive races.
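One way to picture these coefficients is the sketch below. It is an assumption about the mechanics, not the documented procedure: it pools the "other race" population with the part of "two or more races" that the USCB moved, and shares that pool out in proportion to each race's county-level gain. The "OTHER" key, the function names, and all counts are hypothetical.

```python
SINGLE_RACES = ["WA", "BA", "NA", "AA", "PA"]


def county_coefficients(sf1_county, base_county):
    """Compare SF1 and estimation-base county counts.

    Returns the share of the redistributed pool that each single race
    received, and the fraction of 'two or more races' that was moved.
    Assumes the county has a nonzero pool and a nonzero R2 count.
    """
    moved_r2 = sf1_county["R2"] - base_county["R2"]
    pool = sf1_county["OTHER"] + moved_r2
    shares = {r: (base_county[r] - sf1_county[r]) / pool for r in SINGLE_RACES}
    return shares, moved_r2 / sf1_county["R2"]


def remap_block(block, shares, r2_moved_fraction):
    """Apply the county coefficients to one block's SF1 counts."""
    pool = block["OTHER"] + block["R2"] * r2_moved_fraction
    out = {r: block[r] + shares[r] * pool for r in SINGLE_RACES}
    out["R2"] = block["R2"] * (1 - r2_moved_fraction)
    return out


# Hypothetical county: SF1 counts versus USCB estimation-base counts.
sf1 = {"WA": 700, "BA": 150, "NA": 10, "AA": 40, "PA": 5, "R2": 40, "OTHER": 55}
base = {"WA": 760, "BA": 160, "NA": 12, "AA": 42, "PA": 6, "R2": 20}
shares, r2_fraction = county_coefficients(sf1, base)

block = {"WA": 70, "BA": 15, "NA": 1, "AA": 4, "PA": 0, "R2": 4, "OTHER": 6}
remapped = remap_block(block, shares, r2_fraction)
```

The block's total population is unchanged by the re-mapping; only the racial breakdown shifts.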
2. The SF1 block level data are processed with these new racial distribution coefficients. The resulting dataset is our estimation base. It includes 8 race/origin groups:
| WA | White alone |
| BA | Black alone |
| NA | Native American alone |
| AA | Asian alone |
| PA | Pacific alone |
| R2 | Two or more races |
| HS | Hispanic |
| WN | White, not Hispanic |
A few words on Census analogs. The Total Population count corresponds to the Census table P001, count P0010001. The rest correspond to the Census age-race-sex tables P012A to P012I, with P012F (the Other Race table) dropped. We do not have the "Other Race" category in the estimates, even though Census 2000 does, because the USCB dropped the "Other Race" data from its estimates. They switched to 8 races in 2001 and we had to follow. It is worth mentioning that in its estimates the USCB redistributed the "Other Race" counts completely, and the "2 or more Races" counts partially, among the rest of the races. We did the same, so our racial breakdown differs from Census 2000 but fits the 2001 USCB estimates. We believe the USCB made these changes because there are no actuarial tables for "other" or "2 or more" races, so it needed to redistribute those people into race categories for which it could create estimates.
We then go back and verify that the total and each race count are identical to the ones in the Census Bureau "estimation base" data at the County, State and National levels. As a further check, we also sum up all of the Block Groups to make sure that they reproduce the same numbers.
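The Block Group check described above amounts to summing and comparing. A minimal sketch follows, assuming a simple list-of-dicts layout; the row structure, the county identifiers, and the counts are hypothetical.

```python
from collections import defaultdict


def county_totals_from_block_groups(bg_rows):
    """Sum block group race counts up to their parent county."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in bg_rows:
        for race, count in row["counts"].items():
            totals[row["county"]][race] += count
    return totals


def find_mismatches(bg_rows, county_base):
    """List (county, race, summed, expected) wherever the sums disagree."""
    summed = county_totals_from_block_groups(bg_rows)
    mismatches = []
    for county, expected in county_base.items():
        for race, value in expected.items():
            got = summed[county][race]
            if got != value:
                mismatches.append((county, race, got, value))
    return mismatches


# Hypothetical data: two block groups in one county.
rows = [
    {"county": "C1", "counts": {"WA": 60, "BA": 40}},
    {"county": "C1", "counts": {"WA": 30, "BA": 20}},
]
ok = find_mismatches(rows, {"C1": {"WA": 90, "BA": 60}})
bad = find_mismatches(rows, {"C1": {"WA": 91, "BA": 60}})
```

An empty list means the block groups add up to the county control totals for every race.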
3. Having dealt with Race, we then turn to Age. The USCB groups the population into 18 age groups covering ages 0 (under 1) to 108. Each group is a 5-year interval (0-4, 5-9, etc.), except that ages 85 and up (85-108) are treated as a single group.
4. Now that we have the entire population broken down into age and race categories, we begin building the death-birth model. Using Actuarial tables, we calculate the statistical likelihood for any given age/race group to die or to give birth. We then apply these coefficients to the 2000 data to create an estimation base for 2001. The coefficients are reapplied to create 2002, and so on until we reach the current year.
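The grouping can be expressed as a small mapping from a single year of age to one of the 18 groups. The function names below are illustrative only.

```python
def age_group_index(age):
    """Map an age in years (0..108) to one of the 18 groups (0..17)."""
    if not 0 <= age <= 108:
        raise ValueError("age must be between 0 and 108")
    # Seventeen 5-year groups (0-4 ... 80-84), then one open group 85-108.
    return min(age // 5, 17)


def age_group_label(index):
    """Human-readable label for a group index."""
    if index == 17:
        return "85-108"
    low = index * 5
    return f"{low}-{low + 4}"


labels = [age_group_label(i) for i in range(18)]
```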
The model includes:
- transformation of age group distribution to "exact age" distribution. The resulting data set has population groups for each single year of age from 0 to 108.
- application of death probabilities for a specific age, sex and race group.
- application of birth rates for a specific age, sex and race group. The white population is treated as a mix of white not Hispanic and Hispanic population. The mix ratio is determined from the block data.
- 1 year shift.
- collecting the annual data into 5-year buckets.
- comparison of the results with Census Bureau estimates for this year.
- the results of the comparison are used to tweak the birth rates and death probabilities so that the numbers of both newborn and deceased in the model exactly equal the Census Bureau numbers for each county. The racial distribution is also tweaked to reflect that of the Census Bureau data. This puts the annual estimates in sync with the USCB data as much as possible.
The results of the application of this model have the same totals as Census Bureau results on a county level.
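The steps of the model can be sketched for a single race group as follows. This is a heavily simplified illustration, not the production model: it ignores sex, splits each age group evenly across its single years, applies birth rates to everyone in the fertile ages, and uses invented rates and county control totals.

```python
MAX_AGE = 108   # oldest single year of age in the model
N_GROUPS = 18   # seventeen 5-year groups plus the 85-108 group


def expand_to_single_years(groups):
    """Age groups -> single years of age 0..108 (even split: an assumption)."""
    single = []
    for i, count in enumerate(groups):
        width = 5 if i < N_GROUPS - 1 else MAX_AGE - 85 + 1
        single.extend([count / width] * width)
    return single


def step_one_year(single, death_prob, birth_rate):
    """Apply deaths and births, then shift everyone up by one year of age."""
    deaths = [n * q for n, q in zip(single, death_prob)]
    births = sum(n * b for n, b in zip(single, birth_rate))
    survivors = [n - d for n, d in zip(single, deaths)]
    # Newborns enter at age 0; survivors of the last age leave the table.
    shifted = [births] + survivors[:-1]
    return shifted, births, sum(deaths)


def collect_into_groups(single):
    """Single years of age -> the 18 age groups."""
    groups = [sum(single[5 * i:5 * i + 5]) for i in range(N_GROUPS - 1)]
    groups.append(sum(single[85:]))
    return groups


def tweaked(rates, model_total, census_total):
    """Scale rates so the model total matches the Census Bureau total."""
    factor = census_total / model_total
    return [r * factor for r in rates]


# Hypothetical county: 100 people per age group, flat invented rates.
groups_2000 = [100.0] * N_GROUPS
death_prob = [0.01] * (MAX_AGE + 1)
birth_rate = [0.05 if 20 <= age < 40 else 0.0 for age in range(MAX_AGE + 1)]

single = expand_to_single_years(groups_2000)
_, model_births, model_deaths = step_one_year(single, death_prob, birth_rate)

# Invented Census Bureau county figures for the same year.
census_births, census_deaths = 22.0, 17.0
birth_rate_adj = tweaked(birth_rate, model_births, census_births)
death_prob_adj = tweaked(death_prob, model_deaths, census_deaths)

next_single, births, deaths = step_one_year(single, death_prob_adj, birth_rate_adj)
groups_2001 = collect_into_groups(next_single)
```

Because births and deaths are linear in the rates, one scaling pass is enough in this sketch to make the second run reproduce the county's births and deaths.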
5. The same model is applied to obtain the results for 2006. This time, however, there is nothing to compare the results against, so the "tweaking coefficients" are predicted from the tweaking coefficients for 2002, 2003, and 2004. The prediction algorithm is based on linear regression (the coefficients fit a straight line very closely).
6. The block level data are grouped and rounded so that we don't have partial people (137 instead of 137.2 people).
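A least-squares line through three yearly coefficients, extended to 2006, might look like the sketch below. The coefficient values are invented for illustration and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_coefficient(years, coefficients, target_year):
    """Extrapolate a tweaking coefficient to a year without Census data."""
    slope, intercept = fit_line(years, coefficients)
    return slope * target_year + intercept


# Invented birth-rate tweaking coefficients for one county.
years = [2002, 2003, 2004]
coefficients = [1.02, 1.03, 1.04]
coefficient_2006 = predict_coefficient(years, coefficients, 2006)
```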
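The text does not say which rounding rule is used. One common option that keeps the rounded counts adding up to the same total is largest-remainder rounding, sketched here purely as an assumption with invented values.

```python
import math


def round_preserving_total(values):
    """Round to whole people while keeping the (rounded) overall total.

    Every value is rounded down first; the leftover units go to the
    values with the largest fractional parts.
    """
    floors = [math.floor(v) for v in values]
    shortfall = round(sum(values)) - sum(floors)
    by_remainder = sorted(
        range(len(values)),
        key=lambda i: values[i] - floors[i],
        reverse=True,
    )
    for i in by_remainder[:shortfall]:
        floors[i] += 1
    return floors


whole_people = round_preserving_total([137.2, 58.7, 4.1])
```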
Methodology - Household Estimates
The household estimates were calculated from:
- the Census data on the household
- the estimated data on the households
- the Census data on the age-race-sex
- the estimated data on the age-race-sex.
GeoLytics calculated the ratios of Census household variables to Census age-race-sex data and Census housing data, and then applied these ratios to the estimated data of the same nature to get the estimated values. The underlying assumption is that the average family size by race has not changed dramatically in the years since the 2000 Census was compiled.
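In its simplest form, the ratio method carries the Census 2000 households-per-person ratio forward to the estimated population. The sketch below does this by race for one area. The function name and all figures are hypothetical, and the real calculation involves many more household variables.

```python
def estimate_households(census_households, census_population, estimated_population):
    """Apply Census 2000 households-per-person ratios to estimated population.

    All three arguments are dicts keyed by race code.
    """
    estimates = {}
    for race, households in census_households.items():
        population = census_population[race]
        # With no Census population there is no ratio to carry forward.
        ratio = households / population if population else 0.0
        estimates[race] = ratio * estimated_population[race]
    return estimates


# Invented figures for one block group.
hh_2000 = {"WA": 400, "BA": 100}
pop_2000 = {"WA": 1000, "BA": 300}
pop_now = {"WA": 1100, "BA": 330}
hh_now = estimate_households(hh_2000, pop_2000, pop_now)
```

Holding the ratio fixed is exactly the assumption stated above: average household size by race is treated as unchanged since 2000.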
Methodology - Housing Estimates
The number of housing units (HU) changes only when new buildings are built or old ones are torn down. Some houses are built on empty lots, but when many houses are built, usually a whole new development gets put in. So the first thing that we did was to look at the TIGER/Line files. This is the USCB file that shows every street in the US along with its address numbering. By looking at this dataset we can determine whether new streets have been put in, and from the numbering we can estimate roughly how many units are being built. We can also see whether new numbers have been added to an existing street.
1. The TIGER/Line records for the years 2000 and 2004 were analyzed. For each block, the sum of the associated address ranges was calculated. Each block was then assigned a Change Coefficient (CC), a number representing the change in the aggregate number of addresses within the block. The CC is a fraction between -1 and +1. A CC of 0 represents a block that did not change within this time interval. A CC of +1 represents a block that had no addresses in 2000 and has some in 2004, and a CC of -1 represents a block that had some addresses in 2000 and has none in 2004. The block changes were later summarized to the BG level.
2. The Census Bureau Housing Unit Estimates (at the county level) for the years 2000-2004 were used to project the number of HU per county for the year 2006 via a linear regression algorithm. The USCB estimates only go out to 2004, so we needed to extrapolate the 2005 and then the 2006 numbers.
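The exact CC formula is not given. One formula that satisfies every stated property (a range of -1 to +1, 0 for no change, +1 for a block that is new, -1 for a block that lost all its addresses) divides the change by the larger of the two counts. The sketch below uses that formula as an assumption and summarizes to the BG level by summing addresses first. The block IDs and counts are invented. The BG ID is taken as the first 12 characters of the 15-character census block ID.

```python
from collections import defaultdict


def change_coefficient(addr_2000, addr_2004):
    """Assumed CC: change divided by the larger count, so -1 <= CC <= +1."""
    largest = max(addr_2000, addr_2004)
    if largest == 0:
        return 0.0  # no addresses in either year: nothing changed
    return (addr_2004 - addr_2000) / largest


def block_group_coefficients(block_addresses):
    """Sum block address counts per block group, then compute a CC for each."""
    totals = defaultdict(lambda: [0, 0])
    for block_id, (a2000, a2004) in block_addresses.items():
        bg_id = block_id[:12]  # state + county + tract + block group digit
        totals[bg_id][0] += a2000
        totals[bg_id][1] += a2004
    return {bg: change_coefficient(a, b) for bg, (a, b) in totals.items()}


# Invented blocks: the first two share a block group, the third does not.
blocks = {
    "010010201001000": (50, 50),
    "010010201001001": (30, 50),
    "010010201002000": (40, 0),
}
bg_cc = block_group_coefficients(blocks)
```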
3. For each county, the Census Bureau HU growth/decline was distributed among BGs of this county so that:
- BGs with CC = 0 did not change any HU counts
- BGs with CC not equal to 0 received a share of the county change on a proportional basis, so that BGs with CC > 0 gained HUs and BGs with CC < 0 lost HUs. The results range from small changes (a few percent is typical) to, rarely, dramatic changes of 3-5 times. The dramatic changes are evidently where large housing complexes went in and sharply increased the number of housing units in the block group.
Once we had the change in the number of Housing Units, we could then look at the other housing variables such as number of rooms, vacancy status, tenure (own vs. rent) status, etc. People all live in either a household or a group quarter (military barracks, college dorms, nursing homes, prisons, mental institutions, half-way homes, etc.). The group quarters were left stable, so the changes in population were accounted for in the changes in Housing Units that had now been calculated. So for example, if the housing units stayed the same but the population numbers dropped, then the vacancy rate would go up.
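The proportional rule is not spelled out, so the following is only one possible reading. Each BG is weighted by its CC times its 2000 housing units, and the weights are scaled so that the BG changes add up to the county change. All names and numbers are hypothetical.

```python
def distribute_county_change(bg_units, bg_cc, county_change):
    """Split a county's HU change among block groups in proportion to CC * HU.

    Assumes the net weight has the same sign as the county change;
    otherwise a single scale factor would flip gains into losses.
    """
    weights = {bg: bg_cc[bg] * bg_units[bg] for bg in bg_units}
    net = sum(weights.values())
    if net == 0 or (net > 0) != (county_change > 0):
        raise ValueError("weights cannot be scaled to this county change")
    scale = county_change / net
    return {bg: weight * scale for bg, weight in weights.items()}


# Invented county with three block groups and a growth of 100 units.
units_2000 = {"BG-A": 400, "BG-B": 600, "BG-C": 500}
cc = {"BG-A": 0.0, "BG-B": 0.5, "BG-C": -0.2}
changes = distribute_county_change(units_2000, cc, 100)
```

A BG with no housing units in 2000 gets a weight of 0 under this weighting, so a real implementation would need a different weight for brand-new developments, for example the 2004 address count.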
The sum of all changes for all BGs in a county is equal to the Census Bureau HU county growth estimates.