5.4 Applying Functions to Data Frames
5.4.3 Extended Example: Aids for Learning Chinese Dialects
Standard Chinese, often referred to as Mandarin outside China, is officially termed putonghua or guoyu. It is spoken today by the vast majority of people in China and among many ethnic Chinese outside China, but the dialects, such as Cantonese and Shanghainese, still enjoy wide usage too. Thus, a Chinese businessman in Beijing who intends to do business in Hong Kong may find it helpful to learn some Cantonese. Similarly, many in Hong Kong may wish to improve their Mandarin. Let’s see how such a learning process might be shortened and how R can help.
The differences among the dialects are sometimes startling. The character for “down,” 下, is pronounced xia in Mandarin, ha in Cantonese, and wu in Shanghainese. Indeed, because of these differences, and differences in grammar as well, many linguists consider these tongues separate languages rather than dialects. We will call them fangyan (meaning “regional speech”) here, the Chinese term.
Let’s see how R can help speakers of one fangyan learn another one.
The key is that there are often patterns in the correspondences between the fangyans. For instance, the initial consonant transformation x → h seen in 下 in the previous paragraph (xia → ha) is common, arising also in characters such as 香 (meaning “fragrant”), pronounced xiang in Mandarin and heung in Cantonese. Note, too, the transformation iang → eung for the non–initial consonant portions of these pronunciations, which is also common.
Knowing transformations such as these could speed up the learning curve considerably for the Mandarin-speaking learner of Cantonese, which is the setting we’ll illustrate here.
We haven’t mentioned the tones yet. All the fangyan are tonal, and sometimes there are patterns there as well, potentially providing further learning aids. However, this avenue will not be pursued here. You’ll see that our code does need to make some use of the tones, but we will not attempt to analyze how tones transform from one fangyan to another. For simplicity, we also will not consider characters beginning with vowels, characters that have more than one reading, toneless characters, and other refinements.
Though the initial consonant x in Mandarin often maps to h, as seen previously, it also often maps to s, y, and other consonants. For example, the character xie, 謝, in the famous Mandarin term xiexie (for “thank you”) is pronounced je in Cantonese. Here, there is an x → j transformation on the consonant.
It would be very helpful for the learner to have a list of transformations and their frequencies of occurrence. This is a job made for R! The function mapsound(), shown a little later in the chapter, does exactly this. It relies on some support functions, also to be presented shortly.
To explain what mapsound() does, let’s devise some terminology, illustrated by the x → h example earlier. We’ll call x the source value, with h, s, and so on being the mapped values.
Here are the formal parameters:
• df: A data frame consisting of the pronunciation data of two fangyan
• fromcol and tocol: Names in df of the source and mapped columns
• sourceval: The source value to be mapped, such as x in the preceding example
Here is the head of a typical two-fangyan data frame, canman8, that would be used for df:
> head(canman8)
  Ch char    Can    Man Can cons Can sound Can tone Man cons Man sound Man tone
1      一   yat1    yi1        y        at        1        y         i        1
2      丁  ding1  ding1        d       ing        1        d       ing        1
3      七  chat1    qi1       ch        at        1        q         i        1
4      丈 jeung6 zhang4        j      eung        6       zh       ang        4
5      上 seung5 shang3        s      eung        5       sh       ang        3
6      下    ha5   xia4        h         a        5        x        ia        4
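Since the column names in this frame contain spaces, they need to be quoted when used as indices. The following sketch rebuilds a few columns of the six rows shown above as a small stand-in frame. The name canman_head and the use of check.names=FALSE are assumptions made for illustration; the text does not show how canman8 itself was created.

```r
# Stand-in for part of head(canman8); values copied from the listing above.
# check.names=FALSE keeps the spaces in the column names (an assumption
# about how such a frame could be built, not the author's construction).
canman_head <- data.frame(
  "Ch char"  = c("一", "丁", "七", "丈", "上", "下"),
  "Can"      = c("yat1", "ding1", "chat1", "jeung6", "seung5", "ha5"),
  "Man"      = c("yi1", "ding1", "qi1", "zhang4", "shang3", "xia4"),
  "Can cons" = c("y", "d", "ch", "j", "s", "h"),
  "Man cons" = c("y", "d", "q", "zh", "sh", "x"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Quoted names work with [[ ]] ...
canman_head[["Man cons"]]
# ... and with backticks after $
canman_head$`Can cons`

# Rows whose Mandarin initial consonant is x
xrows <- canman_head[canman_head[["Man cons"]] == "x", ]
xrows
```

This is the same style of indexing, df[[fromcol]], that mapsound() relies on.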
The function returns a list consisting of two components:
• counts: A vector of integers, indexed by the mapped values, showing the counts of those values. The elements of the vector are named according to the mapped values.
• images: A list of data frames. Again, the indices of the list are the mapped values, and each data frame consists of the rows for all the characters that correspond to the given mapped value.
To make this concrete, let’s try it out:
> m2cx <- mapsound(canman8,"Man cons","Can cons","x")
> m2cx$counts
ch  f  g  h  j  k kw  n  s  y
15  2  1 87 12  4  2  1 81 21
We see that x maps to ch 15 times, to f 2 times, and so on. Note that we could have called sort() on m2cx$counts, with decreasing=TRUE, to view the mapped images in order, from most to least frequent.
The Mandarin-speaking learner of Cantonese can then see that if he wishes to know the Cantonese pronunciation of a word whose Mandarin romanized form begins with x, the Cantonese most likely begins with h or s (168 of the 226 cases counted above). Little aids like this should help the learning process quite a bit.
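To make the ordering and the relative frequencies explicit, here is a small sketch that re-creates the counts vector by hand from the output above (so it runs without canman8) and sorts it. The variable names are illustrative only.

```r
# Counts copied from the m2cx$counts output above
counts <- c(ch=15, f=2, g=1, h=87, j=12, k=4, kw=2, n=1, s=81, y=21)

# Most frequent mapped values first
sorted <- sort(counts, decreasing=TRUE)
sorted

# Share of each mapped value among all characters with Mandarin initial x
shares <- round(sorted / sum(sorted), 2)
shares

# Combined share of h and s
hs_share <- sum(counts[c("h","s")]) / sum(counts)
hs_share
```

With the real data, the same calls would simply be applied to m2cx$counts.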
To try to discern more patterns, the learner may wish to determine in which characters x maps to ch, for example. We know from the result of the preceding example that there are 15 such characters. Which ones are they?
That information is stored in images. The latter is a list indexed by the mapped values. We are interested in the element corresponding to ch:
> head(m2cx$images[["ch"]])
     Ch char   Can  Man Can cons Can sound Can tone Man cons Man sound Man tone
613       嗅 chau3 xiu4       ch        au        3        x        iu        4
982       尋 cham4 xin2       ch        am        4        x        in        2
1050      巡 chun3 xun2       ch        un        3        x        un        2
1173      徐 chui4  xu2       ch        ui        4        x         u        2
1184      循 chun3 xun2       ch        un        3        x        un        2
1566      斜  che4 xie2       ch         e        4        x        ie        2
Now, let’s look at the code. Before viewing the code for mapsound() itself, let’s consider another routine we need for support. It is assumed here that the data frame df that is input to mapsound() is produced by merging two frames for individual fangyans. In this case, for instance, the head of the Cantonese input frame is as follows:
> head(can8)
  Ch char   Can
1      一  yat1
2      乙 yuet3
3      丁 ding1
4      七 chat1
5      乃 naai5
6      九  gau2
The one for Mandarin is similar. We need to merge these two frames into canman8, seen earlier. I’ve written the code so that this operation not only combines the frames but also separates the romanization of a character into initial consonant, the remainder of the romanization, and a tone number. For example, ding1 is separated into d, ing, and 1.
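To get a feel for what this three-way separation involves, here is a minimal sketch using a regular expression. This is not the approach taken in the code presented in this section; split_pronun() is a hypothetical helper, and it assumes lowercase romanizations with at most one trailing tone digit.

```r
# Hypothetical helper (not the book's sepsoundtone()): splits a romanized
# syllable into initial consonant(s), remainder, and tone digit.
split_pronun <- function(pronun) {
  # group 1: leading non-vowel letters; group 2: the rest of the letters;
  # group 3: an optional trailing digit (the tone)
  pattern <- "^([^aeiou0-9]*)([a-z]*)([0-9]?)$"
  parts <- regmatches(pronun, regexec(pattern, pronun))[[1]]
  parts[-1]  # drop the full match, keep the three captured pieces
}

split_pronun("ding1")
split_pronun("yuet3")
split_pronun("chat1")
```

Note that a missing piece comes back as an empty string here rather than NA, so this sketch would need adjusting before it could stand in for the real support function.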
We could similarly explore transformations in the other direction, from Cantonese to Mandarin, and involving the nonconsonant remainders of characters. For example, this call determines which characters have eung as the nonconsonant portion of their Cantonese pronunciation:
> c2meung <- mapsound(canman8,"Can sound","Man sound","eung")
We could then investigate the associated Mandarin sounds.
Here is the code to accomplish all this:
 1 # merges data frames for 2 fangyans
 2 merge2fy <- function(fy1,fy2) {
 3    outdf <- merge(fy1,fy2)
 4    # separate tone from sound, and create new columns
 5    for (fy in list(fy1,fy2)) {
 6       # saplout will be a matrix, with initial consonants in row 1, remainders in row
 7       # 2, and tones in row 3
 8       saplout <- sapply((fy[[2]]),sepsoundtone)
 9       # convert it to a data frame
10       tmpdf <- data.frame(fy[,1],t(saplout),row.names=NULL,
11          stringsAsFactors=FALSE)
12       # add names to the columns
13       consname <- paste(names(fy)[[2]]," cons",sep="")
14       restname <- paste(names(fy)[[2]]," sound",sep="")
15       tonename <- paste(names(fy)[[2]]," tone",sep="")
16       names(tmpdf) <- c("Ch char",consname,restname,tonename)
17       # need to use merge(), not cbind(), due to possibly different
18       # ordering of fy, outdf
19       outdf <- merge(outdf,tmpdf)
20    }
21    return(outdf)
22 }
23
24 # separates romanized pronunciation pronun into initial consonant, if any,
25 # the remainder of the sound, and the tone, if any
26 sepsoundtone <- function(pronun) {
27    nchr <- nchar(pronun)
28    vowels <- c("a","e","i","o","u")
29    # how many initial consonants?
30    numcons <- 0
31    for (i in 1:nchr) {
32       ltr <- substr(pronun,i,i)
33       if (!ltr %in% vowels) numcons <- numcons + 1 else break
34    }
35    cons <- if (numcons > 0) substr(pronun,1,numcons) else NA
36    tone <- substr(pronun,nchr,nchr)
37    numtones <- !(tone %in% letters)  # TRUE (a tone digit) is 1, FALSE is 0
38    if (numtones == 0) tone <- NA
39    therest <- substr(pronun,numcons+1,nchr-numtones)
40    return(c(cons,therest,tone))
41 }
So, even the merging code is not so simple. And this code makes some simplifying assumptions, excluding some important cases. Textual analysis is never for the faint of heart!
Not surprisingly, the merging process begins with a call to merge(), in line 3. This creates a new data frame, outdf, to which we will append new columns for the separated sound components.
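The comment in lines 17 and 18 of the listing says that merge() is needed rather than cbind() because the frames may be ordered differently. A small sketch shows why. The frames can3 and man3 are toy stand-ins invented for illustration, with the rows deliberately in different orders.

```r
# Toy frames in the style of can8 and its Mandarin counterpart; the row
# order differs on purpose. The names can3 and man3 are illustrative only.
can3 <- data.frame("Ch char"=c("一","丁","七"), Can=c("yat1","ding1","chat1"),
                   check.names=FALSE, stringsAsFactors=FALSE)
man3 <- data.frame("Ch char"=c("七","一","丁"), Man=c("qi1","yi1","ding1"),
                   check.names=FALSE, stringsAsFactors=FALSE)

# merge() matches rows on the shared column "Ch char"
merged <- merge(can3, man3)
merged[merged[["Ch char"]] == "一", ]

# cbind() just pastes columns side by side, ignoring the characters
bound <- cbind(can3, Man=man3$Man)
bound[bound[["Ch char"]] == "一", ]
```

In the merged frame, 一 is paired with yi1, as it should be. In the cbind() result, it is paired with qi1, the Mandarin reading of a different character.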
The real work, then, involves the separation of a romanization into its sound components. For that, there is a loop in line 5 across the two input data frames. In each iteration, the current data frame is split into sound components, with the result appended to outdf in line 19. Note the comment preceding that line regarding the unsuitability of cbind() in this situation.
The actual separation into sound components is done in line 8. Here, we take a column of romanizations, such as the following:
yat1 yuet3 ding1 chat1 naai5 gau2
We split it into three columns, consisting of initial consonant, remainder of the sound, and tone. For instance, yat1 will be split into y, at, and 1.
This is a very natural candidate for some kind of “apply” function, and indeed sapply() is used in line 8. Of course, this call requires that we write a suitable function to be applied. (If we had been lucky, there would have been an existing R function that worked, but no such good fortune here.) The function we use is sepsoundtone(), starting in line 26.
The sepsoundtone() function makes heavy use of R’s substr() (for substring) function, described in detail in Chapter 11. In line 31, for example, we loop until we collect all the initial consonants, such as ch. The return value, in line 40, consists of the three sound components extracted from the given romanized form, the formal parameter pronun.
Note the use of R’s built-in constant, letters, in line 37. We use this to sense whether the final character is a letter or is numeric; a numeric final character is the tone. Some romanizations are toneless.
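The test against letters can be tried on its own. In this sketch, last_char() is a hypothetical helper and "ma" is a made-up toneless romanization used only for illustration.

```r
# Hypothetical helper: the final character of a string
last_char <- function(s) substr(s, nchar(s), nchar(s))

# letters is the built-in vector "a", "b", ..., "z"
last_char("ding1") %in% letters   # FALSE: the final character is a digit
last_char("ma") %in% letters      # TRUE: no tone digit at the end

# A logical value behaves as 0 or 1 in arithmetic, so it can be used
# to decide how many trailing characters to drop
has_tone <- !(last_char("ding1") %in% letters)
end_of_sound <- nchar("ding1") - has_tone
substr("ding1", 2, end_of_sound)
```

The last line extracts the remainder of the sound, ing, by stopping one character short of the end when a tone digit is present.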
Line 8 will then return a matrix with three rows, one row for each of the three sound components, and one column for each romanization. We wish to convert this to a data frame for merging with outdf in line 19, and we prepare for this in line 10.
Note that we call the matrix transpose function t() to put our information into columns rather than rows. This is needed because data-frame storage is by columns. Also, we include a column fy[,1], the Chinese characters themselves, to have a column in common in the call to merge() in line 19.
Now let’s turn to the code for mapsound(), which actually is simpler than the preceding merging code.
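The shape of the matrix and the effect of t() can be checked on the six romanizations shown earlier. The sketch below restates the separation function compactly so that it runs on its own; the tone test is written here so that a trailing digit counts as the tone, and the output column names are chosen only for this illustration.

```r
# Compact restatement of the separation function for this sketch;
# a trailing non-letter is taken to be the tone
sepsoundtone <- function(pronun) {
  nchr <- nchar(pronun)
  vowels <- c("a","e","i","o","u")
  numcons <- 0
  for (i in 1:nchr) {
    if (!substr(pronun,i,i) %in% vowels) numcons <- numcons + 1 else break
  }
  cons <- if (numcons > 0) substr(pronun,1,numcons) else NA
  tone <- substr(pronun,nchr,nchr)
  hastone <- !(tone %in% letters)
  if (!hastone) tone <- NA
  therest <- substr(pronun,numcons+1,nchr-hastone)
  c(cons,therest,tone)
}

romans <- c("yat1","yuet3","ding1","chat1","naai5","gau2")

# one column per romanization, one row per sound component
saplout <- sapply(romans,sepsoundtone)
dim(saplout)

# transpose so that each sound component becomes a column
tmpdf <- data.frame(romans,t(saplout),row.names=NULL,stringsAsFactors=FALSE)
names(tmpdf) <- c("Can","cons","sound","tone")
tmpdf
```

Here saplout has three rows and six columns, and after the transpose tmpdf has one row per romanization.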
 1 mapsound <- function(df,fromcol,tocol,sourceval) {
 2    base <- which(df[[fromcol]] == sourceval)
 3    basedf <- df[base,]
 4    # determine which rows of basedf correspond to the various mapped
 5    # values
 6    sp <- split(basedf,basedf[[tocol]])
 7    retval <- list()
 8    retval$counts <- sapply(sp,nrow)
 9    retval$images <- sp
10    return(retval)
11 }
Recall that the argument df is the two-fangyan data frame, output from merge2fy(). The arguments fromcol and tocol are the names of the source and mapped columns. The string sourceval is the source value to be mapped. For concreteness, consider the earlier examples in which sourceval was x.
The first task is to determine which rows in df correspond to sourceval. This is accomplished via a straightforward application of which() in line 2.
This information is then used in line 3 to extract the relevant subdata frame.
In that latter frame, consider the form that basedf[[tocol]] will take in line 6. These will be the values that x maps to, that is, ch, h, and so on. The purpose of line 6 is to determine which rows of basedf contain which of these mapped values. Here, we use R’s split() function. We’ll discuss split() in detail in Section 6.2.2, but the salient point is that sp will be a list of data frames: one for ch, one for h, and so on.
This sets up line 8. Since sp will be a list of data frames, one for each mapped value, applying the nrow() function via sapply() will give us the counts of the numbers of characters for each of the mapped values, such as the number of characters in which the map x → ch occurs (15 times, as seen in the example call).
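To watch split() and sapply() at work on something small, we can run mapsound() on a toy frame. The frame toy below is made up for this sketch from five characters mentioned in this section; it is not the real canman8, so the counts are tiny.

```r
mapsound <- function(df,fromcol,tocol,sourceval) {
  base <- which(df[[fromcol]] == sourceval)
  basedf <- df[base,]
  sp <- split(basedf,basedf[[tocol]])
  retval <- list()
  retval$counts <- sapply(sp,nrow)
  retval$images <- sp
  return(retval)
}

# Toy stand-in for canman8 (illustrative only)
toy <- data.frame(
  "Ch char"  = c("下","香","謝","嗅","上"),
  "Can cons" = c("h","h","j","ch","s"),
  "Man cons" = c("x","x","x","x","sh"),
  check.names=FALSE, stringsAsFactors=FALSE)

m <- mapsound(toy,"Man cons","Can cons","x")
m$counts            # one count per mapped value
m$images[["h"]]     # the rows in which x maps to h
```

The row for 上 is dropped in line 2 of mapsound(), since its Mandarin consonant is sh rather than x, and the remaining four rows are split into groups for ch, h, and j.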
The complexity of the code here makes this a good time to comment on programming style. Some readers may point out, correctly, that lines 2 and 3 could be replaced by a one-liner:
basedf <- df[df[[fromcol]] == sourceval,]
But to me, that line, with its numerous brackets, is harder to read.
My personal preference is to break down operations if they become too complex.
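One caveat about the one-liner is worth a quick check: the two forms agree only when the source column has no NA values. Since sepsoundtone() can return NA for the consonant, that case is not far-fetched. The frame below is a made-up illustration.

```r
# Made-up frame with a missing consonant in row 2
df <- data.frame(cons=c("x",NA,"h","x"), id=1:4, stringsAsFactors=FALSE)

# two-step form, as in lines 2 and 3 of mapsound(): which() drops the NA
base <- which(df$cons == "x")
via_which <- df[base,]

# one-liner: the NA in the logical index yields an extra all-NA row
via_logical <- df[df$cons == "x",]

nrow(via_which)
nrow(via_logical)
```

So the version with which() is not only easier on the eye but also a little more forgiving of missing data.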
Similarly, the last few lines of code could be compacted to another one-liner:
list(counts=sapply(sp,nrow),images=sp)
Among other things, this dispenses with the return(), conceivably speeding up the code. Recall that in R, the last value computed by a function is automatically returned anyway, without a return() call. However, the time savings here are really small and rarely matter, and again, my personal belief is that including the return() call is clearer.
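As a quick check that the two styles are interchangeable, here is a sketch with two hypothetical functions that build the same kind of list, one with an explicit return() and one without.

```r
# Hypothetical functions for illustration; sp is any list of data frames
with_return <- function(sp) {
  retval <- list()
  retval$counts <- sapply(sp,nrow)
  retval$images <- sp
  return(retval)
}
without_return <- function(sp) {
  list(counts=sapply(sp,nrow),images=sp)
}

# a tiny list of data frames to try them on
sp <- split(data.frame(v=1:5), c("a","b","a","b","a"))
identical(with_return(sp), without_return(sp))
```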
6
FACTORS AND TABLES
Factors form the basis for many of R’s powerful operations, including many of those performed on tabular data. The motivation for factors comes from the notion of nominal, or categorical, variables in statistics. These values are nonnumerical in nature, corresponding to categories such as Democrat, Republican, and Unaffiliated, although they may be coded using numbers.
In this chapter, we’ll begin by looking at the extra information contained in factors and then focus on the functions used with factors. We’ll also explore tables and common table operations.