10.2 Reading and Writing Files
10.2.4 Extended Example: Reading PUMS Census Files
The U.S. Census Bureau makes census data available in the form of Public Use Microdata Samples (PUMS). The term microdata here means that we are dealing with raw data and each record is for a real person, as opposed to statistical summaries. Data on many, many variables are included.
The data is organized by household. For each unit, there is first a Household record, describing the various characteristics of that household, followed by one Person record for each person in the household. Character positions 106 and 107 (with numbering starting at 1) in the Household record state the number of Person records for that household. (The number can be very large, since some institutions count as households.)
To enhance the integrity of the data, character position 1 contains H or P to confirm that this is a Household or Person record. So, if you read an H record, and it tells you there are three people in the household, then the following three records should be P records, followed by another H record; if not, you've encountered an error.
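The integrity check just described could be sketched as follows. This is only an illustration: validunit() is a hypothetical helper, not part of the code developed in this section, and it assumes the records for one household unit have already been read into a character vector and that the claimed number of Person records is known.

```r
# hypothetical helper, not part of extractpums():
# recs is a character vector holding one Household record followed by
#    its Person records
# npr is the number of Person records the Household record claims
validunit <- function(recs,npr) {
   types <- substr(recs,1,1)  # record type is in character position 1
   length(recs) == npr + 1 && types[1] == "H" && all(types[-1] == "P")
}

# one household with one person (records truncated for brevity)
ok <- validunit(c("H0000195","P0000195"),1)
# claims two persons, but the third record is another H record
bad <- validunit(c("H0000407","P0000407","H0000610"),2)
print(c(ok,bad))  # TRUE FALSE
```

A check like this could be called inside a reading loop to stop as soon as the file departs from the expected H/P pattern.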
As our test file, we'll take the first 1,000 records of the year 2000 1 percent sample. The first few records look like this:
H000019510649 06010 99979997 70 631973
15758 59967658436650000012000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0000 0 0 0 0 0 00000000000000000000000000000
00000000000000000000000000
P00001950100010923000420190010110000010147050600206011099999904200000 0040010000 00300280 28600 70 9997 9997202020202020220000040000000000000006000000 00000 00 0000 00000000000000000132241057904MS 476041-20311010310 07000049010000000000900100000100000100000100000010000001000139010000490000
H000040710649 06010 99979997 70 631973
15758 599676584365300800200000300106060503010101010102010 01200006000000100001 00600020 0 0 0 0 0000 0 0 0 0 0 02000102010102200000000010750 02321125100004000000040000
P00004070100005301000010380010110000010147030400100009005199901200000 0006010000 00100000 00000 00 0000 0000202020202020220000040000000000000001000060 06010 70 9997 99970101004900100000001018703221 770051-10111010500 40004000000000000000000000000000000000000000000000000000004000000040000349
P00004070200005303011010140010110000010147050000204004005199901200000 0006010000 00100000 00000 00 0000 000020202020 0 0200000000000000000000000050000 00000 00 0000 000000000000000000000000000000000000000000-00000000000
000 0 0 0 0 0 0 0 0 00000000349
H000061010649 06010 99979997 70 631973
15758 599676584360801190100000200204030502010101010102010 00770004800064000001 1 0 030 0 0 0 0340 00660000000170 0 06010000000004410039601000000 00021100000004940000000000
The records are very wide and thus wrap around. Each one occupies four lines on the page here.
We'll create a function called extractpums() to read in a PUMS file and create a data frame from its Person records. The user specifies the filename and lists fields to extract and names to assign to those fields.
We also want to retain the household serial number. This is good to have because data for persons in the same household may be correlated and we may want to add that aspect to our statistical model. Also, the household data may provide important covariates. (In the latter case, we would want to retain the covariate data as well.)
Before looking at the function code, let’s see what the function does.
In this data set, gender is in column 23 and age in columns 25 and 26. In the example, our filename is pumsa. The following call creates a data frame consisting of those two variables.
pumsdf <- extractpums("pumsa",list(Gender=c(23,23),Age=c(25,26)))
Note that we are stating here the names we want the columns to have in the resulting data frame. We can use any names we want, such as Sex and Ancientness.
Here is the first part of that data frame:
> head(pumsdf)
  serno Gender Age
2   195      2  19
3   407      1  38
4   407      1  14
5   610      2  65
6  1609      1  50
7  1609      2  49
The following is the code for the extractpums() function.
# reads in PUMS file pf, extracting the Person records, returning a data
# frame; each row of the output will consist of the Household serial
# number and the fields specified in the list flds; the columns of
# the data frame will have the names of the indices in flds

extractpums <- function(pf,flds) {
   dtf <- data.frame()  # data frame to be built
   con <- file(pf,"r")  # connection
   # process the input file
   repeat {
      hrec <- readLines(con,1)  # read Household record
      if (length(hrec) == 0) break  # end of file, leave loop
      # get household serial number
      serno <- intextract(hrec,c(2,8))
      # how many Person records?
      npr <- intextract(hrec,c(106,107))
      if (npr > 0)
         for (i in 1:npr) {
            prec <- readLines(con,1)  # get Person record
            # make this person's row for the data frame
            person <- makerow(serno,prec,flds)
            # add it to the data frame
            dtf <- rbind(dtf,person)
         }
   }
   return(dtf)
}

# set up this person's row for the data frame
makerow <- function(srn,pr,fl) {
   l <- list()
   l[["serno"]] <- srn
   for (nm in names(fl)) {
      l[[nm]] <- intextract(pr,fl[[nm]])
   }
   return(l)
}

# extracts an integer field in the string s, in character positions
# rng[1] through rng[2]
intextract <- function(s,rng) {
   fld <- substr(s,rng[1],rng[2])
   return(as.integer(fld))
}
Let's see how this works. At the beginning of extractpums(), we create an empty data frame and set up the connection for the PUMS file read.
dtf <- data.frame()  # data frame to be built
con <- file(pf,"r")  # connection
The main body of the code then consists of a repeat loop.
repeat {
   hrec <- readLines(con,1)  # read Household record
   if (length(hrec) == 0) break  # end of file, leave loop
   # get household serial number
   serno <- intextract(hrec,c(2,8))
   # how many Person records?
   npr <- intextract(hrec,c(106,107))
   if (npr > 0)
      for (i in 1:npr) {
         ...
      }
}
This loop iterates until the end of the input file is reached. The latter condition will be sensed by encountering a zero-length Household record, as seen in the preceding code.
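To see the end-of-file test in isolation, here is a small sketch that uses textConnection() as a stand-in for the PUMS file. The two strings are made-up records, and nread is a counter introduced only for this illustration. Once the input is exhausted, readLines() returns a zero-length character vector, which is what the length() test detects.

```r
# a text connection stands in for the file; the records are made up
con <- textConnection(c("first record","second record"))
nread <- 0  # illustrative counter, not in the original code
repeat {
   rec <- readLines(con,1)  # read one record
   if (length(rec) == 0) break  # nothing left, leave loop
   nread <- nread + 1
}
close(con)
print(nread)  # 2
```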
Within the repeat loop, we alternate between reading a Household record and reading the associated Person records. The number of Person records for the current Household record is extracted from columns 106 and 107 of that record and stored in npr. That extraction is done by a call to our function intextract().
The for loop then reads in the Person records one by one, in each case forming the desired row for the output data frame and then attaching it to the latter via rbind():
for (i in 1:npr) {
   prec <- readLines(con,1)  # get Person record
   # make this person's row for the data frame
   person <- makerow(serno,prec,flds)
   # add it to the data frame
   dtf <- rbind(dtf,person)
}
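Calling rbind() inside the loop is simple, but the growing data frame is rebuilt on each call, which may become slow for large files. One possible alternative, shown here only as a sketch and not as the approach used in this section, is to collect the rows in a list and bind them once at the end. The rows below are typed in by hand from the sample output; in practice they would come from makerow().

```r
# rows as makerow() would return them (values from the sample output)
rows <- list(
   list(serno=195,Gender=2,Age=19),
   list(serno=407,Gender=1,Age=38),
   list(serno=407,Gender=1,Age=14)
)
# convert each list to a one-row data frame, then bind them in one call
dtf <- do.call(rbind,lapply(rows,as.data.frame))
print(dtf)
```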
Note how makerow() creates the row to be added for a given person. Here the formal arguments are srn for the household serial number, pr for the given Person record, and fl for the list of variable names and column fields.
makerow <- function(srn,pr,fl) {
   l <- list()
   l[["serno"]] <- srn
   for (nm in names(fl)) {
      l[[nm]] <- intextract(pr,fl[[nm]])
   }
   return(l)
}
For instance, consider our sample call:
pumsdf <- extractpums("pumsa",list(Gender=c(23,23),Age=c(25,26)))
When makerow() executes, fl will be a list with two elements, named Gender and Age. The string pr, the current Person record, will have Gender in column 23 and Age in columns 25 and 26. We call intextract() to pull out the desired numbers.
The intextract() function itself is a straightforward conversion of characters to numbers, such as converting the string "12" to the number 12.
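As a quick check, we can run these two functions on the first Household and Person records shown earlier. Only the leading characters of each record are used here, which is enough to cover the serial number, gender, and age fields; the truncation is just for brevity.

```r
intextract <- function(s,rng) {
   fld <- substr(s,rng[1],rng[2])
   return(as.integer(fld))
}

makerow <- function(srn,pr,fl) {
   l <- list()
   l[["serno"]] <- srn
   for (nm in names(fl)) {
      l[[nm]] <- intextract(pr,fl[[nm]])
   }
   return(l)
}

hrec <- "H000019510649"               # start of the first Household record
prec <- "P0000195010001092300042019"  # first 26 characters of its Person record

serno <- intextract(hrec,c(2,8))  # "0000195" becomes 195
person <- makerow(serno,prec,list(Gender=c(23,23),Age=c(25,26)))
str(person)
```

The result is a list with serno 195, Gender 2, and Age 19, matching the first row of the head(pumsdf) output above.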
Note that, if not for the presence of Household records, we could do all of this much more easily with a handy built-in R function: read.fwf(). The name of this function is an abbreviation for "read fixed-width formatted," alluding to the fact that each variable is stored in given character positions of a record. In essence, this function alleviates the need to write a function like intextract().
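Here is a minimal sketch of how read.fwf() might be used if the file held only Person records. The temporary file and the truncated records are assumptions made for illustration. In the widths argument, a negative value means that many characters are skipped, so c(-22,1,-1,2) skips positions 1 through 22, reads position 23, skips position 24, and reads positions 25 and 26.

```r
# write three truncated Person records to a temporary file
# (the first 26 characters of the Person records shown earlier)
tf <- tempfile()
writeLines(c("P0000195010001092300042019",
             "P0000407010000530100001038",
             "P0000407020000530301101014"),tf)

# skip 22 characters, read 1 (Gender), skip 1, read 2 (Age)
persons <- read.fwf(tf,widths=c(-22,1,-1,2),col.names=c("Gender","Age"))
unlink(tf)  # remove the temporary file
print(persons)
```

The Gender and Age values obtained this way should agree with the first three rows of the head(pumsdf) output, though without the serno column, which read.fwf() cannot carry over from the Household records.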