Extended Example: Text Concordance

Web search and other types of textual data mining are of great interest today. Let’s use this area for an example of R list code.

We’ll write a function called findwords() that will determine which words are in a text file and compile a list of the locations of each word’s occurrences in the text. This would be useful for contextual analysis, for example.

Suppose our input file, testconcord.txt, has the following contents (taken from this book!):

The [1] here means that the first item in this line of output is item 1. In this case, our output consists of only one line (and one item), so this is redundant, but this notation helps to read voluminous output that consists of many items spread over many lines. For example, if there were two rows of output with six items per row, the second row would be labeled [7].

In order to identify words, we replace all nonletter characters with blanks and get rid of capitalization. We could use the string functions presented in Chapter 11 to do this, but to keep matters simple, such code is not shown here. The new file, testconcorda.txt, looks like this:

the here means that the first item in this line of output is item in this case our output consists of only one line and one item so this is redundant but this notation helps to read voluminous output that consists of many items spread over many lines for example if there were two rows of output with six items per row the second row would be labeled
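The text leaves the cleanup step out. Purely as an illustration, here is one way it might be done with base R's gsub() and tolower(). The helper name cleantext() and the temporary files are assumptions made for this sketch, not part of the original example.

```r
# Hypothetical helper (not from the text): normalize raw text for findwords().
cleantext <- function(infile, outfile) {
  raw <- readLines(infile, warn = FALSE)       # one string per line of the file
  lettersonly <- gsub("[^A-Za-z]", " ", raw)   # every nonletter becomes a blank
  writeLines(tolower(lettersonly), outfile)    # drop capitalization and save
}

# Small demonstration on temporary files
infile <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines("The [1] here means that the first item", infile)
cleantext(infile, outfile)
cleaned <- scan(outfile, "", quiet = TRUE)     # read back as a vector of words
print(cleaned)
```

Because scan() splits on whitespace, the extra blanks left behind where "[1]" used to be do no harm.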

In this file, for instance, the word item has locations 7, 14, and 27, which means that it occupies the seventh, fourteenth, and twenty-seventh word positions in the file.

Here is an excerpt from the list that is returned when our function findwords() is called on this file:

> findwords("testconcorda.txt")
Read 68 items
$the
[1]  1  5 63

$here
[1] 2

$means
[1] 3

$that
[1]  4 40

$first
[1] 6

$item
[1]  7 14 27

The list consists of one component per word in the file, with a word’s component showing the positions within the file where that word occurs.

Sure enough, the word item is shown as occurring at positions 7, 14, and 27.
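As a small sketch of how such a list can be inspected, the following builds three of the components shown above by hand (rather than calling findwords()) and then pulls out the entry for item in two equivalent ways.

```r
# A hand-built fragment of the concordance list, copied from the output above
wl <- list(the = c(1, 5, 63), here = 2, item = c(7, 14, 27))

wl$item           # positions of "item", accessed by component name
wl[["item"]]      # the same component, indexed by a quoted string
names(wl)         # the words recorded in the list
length(wl$item)   # how many times "item" occurs
```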

Before looking at the code, let’s talk a bit about our choice of a list structure here. One alternative would be to use a matrix, with one row per word in the text. We could use rownames() to name the rows, with the entries within a row showing the positions of that word. For instance, row item would consist of 7, 14, 27, and then 0s in the remainder of the row. But the matrix approach has a couple of major drawbacks:

• There is a problem in terms of the columns to allocate for our matrix. If the maximum frequency with which a word appears in our text is, say, 10, then we would need 10 columns. But we would not know that ahead of time. We could add a new column with cbind() each time a word’s count exceeded the current number of columns (in addition to using rbind() to add a row for each new word). Or we could write code to do a preliminary run through the input file to determine the maximum word frequency. Either of these would come at the expense of increased code complexity and possibly increased runtime.

• Such a storage scheme would be quite wasteful of memory, since most rows would probably consist of a lot of zeros. In other words, the matrix would be sparse, a situation that also often occurs in numerical analysis contexts.
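To make both drawbacks concrete, here is a rough sketch of the matrix alternative on the first seven words of the file. It is an illustration only, not code from the text; the names words, maxfreq, and m are assumptions of this sketch. Note the preliminary pass needed to size the matrix, and the zero padding that results.

```r
txt <- c("the", "here", "means", "that", "the", "first", "item")

words <- unique(txt)
maxfreq <- max(table(txt))     # preliminary pass: largest word frequency
m <- matrix(0, nrow = length(words), ncol = maxfreq)
rownames(m) <- words

for (w in words) {
  pos <- which(txt == w)       # positions at which word w occurs
  m[w, seq_along(pos)] <- pos  # fill from the left; the rest stays 0
}

m
sum(m == 0)                    # number of cells holding only padding
```

Even in this tiny case, every row except the one for the carries a padding zero, and the waste grows with the gap between the most frequent word and the typical one.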

Thus, the list structure really makes sense. Let’s see how to code it.

1  findwords <- function(tf) {
2     # read in the words from the file, into a vector of mode character
3     txt <- scan(tf,"")
4     wl <- list()
5     for (i in seq_along(txt)) {  # seq_along() is safe even if txt is empty
6        wrd <- txt[i]  # ith word in input file
7        wl[[wrd]] <- c(wl[[wrd]],i)
8     }
9     return(wl)
10 }

We read in the words of the file (words simply meaning any groups of letters separated by spaces) by calling scan(). The details of reading and writing files are covered in Chapter 10, but the important point here is that txt will now be a vector of strings: one string per instance of a word in the file. Here is what txt looks like after the read:

> txt
 [1] "the"        "here"       "means"      "that"       "the"
 [6] "first"      "item"       "in"         "this"       "line"
[11] "of"         "output"     "is"         "item"       "in"
[16] "this"       "case"       "our"        "output"     "consists"
[21] "of"         "only"       "one"        "line"       "and"
[26] "one"        "item"       "so"         "this"       "is"
[31] "redundant"  "but"        "this"       "notation"   "helps"
[36] "to"         "read"       "voluminous" "output"     "that"
[41] "consists"   "of"         "many"       "items"      "spread"
[46] "over"       "many"       "lines"      "for"        "example"
[51] "if"         "there"      "were"       "two"        "rows"
[56] "of"         "output"     "with"       "six"        "items"
[61] "per"        "row"        "the"        "second"     "row"
[66] "would"      "be"         "labeled"

The list operations in lines 4 through 8 build up our main variable, a list wl (for word list). We loop through all the words from our long line, with wrd being the current one.
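To watch this loop at work, it can be run by hand on a short vector. The sketch below uses the first five words of the file in place of the result of scan(). It also checks that a component which has not yet been created comes back as NULL, which is what allows the concatenation to work on a word's first appearance.

```r
txt <- c("the", "here", "means", "that", "the")  # first five words of the file

wl <- list()
is.null(wl[["the"]])   # TRUE: no component named "the" exists yet

for (i in seq_along(txt)) {
  wrd <- txt[i]
  # On a word's first appearance wl[[wrd]] is NULL, and c(NULL, i) is just i
  wl[[wrd]] <- c(wl[[wrd]], i)
}

wl[["the"]]   # positions 1 and 5
str(wl)       # one component per distinct word, in order of first appearance
```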

Let’s see what happens with the code in line 7 when i = 4, so that wrd = "that" in our example file testconcorda.txt. At this point, wl[["that"]] will not yet exist. As mentioned, R is set up so that in such a case, wl[["that"]] = NULL, which means in line 7, we can concatenate it! Thus wl[["that"]] will become the one-element vector (4). Later, when i = 40, wl[["that"]] will become (4,40), representing the fact that words 4 and 40 in the file are both "that". Note how convenient it is that list indexing can be done through quoted strings, such as in wl[["that"]].

An advanced, more elegant version of this code uses R’s split() function, as you’ll see in Section 6.2.2.
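As a brief preview of the idea, and not necessarily the exact code given in that section, grouping the word positions by the words themselves might look like the sketch below. The name findwords2() and the temporary test file are assumptions made for illustration.

```r
# Hypothetical split()-based variant; the name findwords2 is not from the text.
findwords2 <- function(tf) {
  txt <- scan(tf, "", quiet = TRUE)  # vector of words, as in findwords()
  split(seq_along(txt), txt)         # group positions 1, 2, ... by word
}

tf <- tempfile(fileext = ".txt")
writeLines("the here means that the first item", tf)
res <- findwords2(tf)
res$the   # positions of "the"
```

One difference worth knowing: split() orders the components by sorted word, whereas the loop version orders them by first appearance in the file.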
