GIS and Evidence-Based Policy Making - Chapter 8

8 Pattern Identification in Public Health Data Sets: The Potential Offered by Graph Theory

Peter A. Bath, Cheryl Craigs, Ravi Maheswaran, John Raymond, and Peter Willett

© 2007 by Taylor & Francis Group, LLC.

CONTENTS
8.1 Introduction
8.1.1 Background
8.1.2 Computational Chemistry and Graph Theory
8.2 Methods
8.2.1 Program
8.2.2 Data
8.2.2.1 Geographical Area
8.2.2.2 Deprivation
8.2.2.3 Standardized Long-Term Limiting Illness for People Aged Less Than 75
8.2.2.4 Adjacency Information
8.2.3 Storage of Information
8.2.4 Queries
8.2.4.1 Query Patterns
8.2.4.2 Query Data File
8.3 Results
8.4 Discussion
Acknowledgments
References

8.1 Introduction

Pattern identification is an important issue in public health, but current methods are not designed to identify complex geographical patterns of illness and disease. Graph theory has been used successfully within the field of chemoinformatics to identify complex user-defined patterns, or substructures, within molecules in databases of two-dimensional (2D) and three-dimensional (3D) chemical structures. In this paper we describe a study in which one graph-theoretical method, the maximum common substructure (MCS) algorithm, which has been successful in identifying such patterns, was adapted for identifying geographical patterns in public health data. We describe how the RASCAL (RApid Similarity CALculator) program (Raymond and Willett, 2002; Raymond et al., 2002a,b), which uses the MCS method, was used to identify user-specified geographical patterns of socioeconomic deprivation and long-term limiting illness. The paper illustrates the use of this method, presents the results from searches in a large database of public health data, and then discusses the potential of graph theory for searching geographically based information.
8.1.1 Background

The need to identify patterns of illness and disease is common in public health. Examples include the identification of disease clusters and tendencies toward clustering, such as outbreaks of communicable disease (e.g., tuberculosis) and higher than expected prevalence/incidence of diseases (e.g., childhood leukemia). The basic building blocks or units for such patterns may be individuals or geographical units, but the key factor is the association between units in terms of time, space, or other complex links. Searching for patterns of disease using geographically based data can help not only to identify disease clusters in a geographical area but also to identify potential causes of such outbreaks, which may be geographical features themselves or characteristics of a geographical area.

Cluster detection, particularly the identification of geographical disease clusters, has been the subject of intensive research within public health and geographical information sciences (Openshaw et al., 1988; Knox, 1989; Besag and Newell, 1991; Alexander and Cuzick, 1992; Kulldorff, 1999). Within the domain of public health and spatial epidemiology, Besag and Newell (1991) classified tests for disease clustering into two groups. The first comprises general or nonspecific tests that examine the tendency for diseases to cluster. The second group comprises specific tests that assess clustering around predefined points, e.g., nuclear installations, or assess the locational structure of clusters. Among the better-known cluster detection methods are Openshaw's Geographical Analysis Machine (Openshaw et al., 1988), Kulldorff's spatial scan statistic (Kulldorff, 1999), Knox's test (Knox, 1989), and Besag and Newell's method (Besag and Newell, 1991). Issues related to clustering and cluster detection are discussed in detail in recent comprehensive publications in the subject area (Lawson et al., 1999; Elliott et al., 2000).
The methods described, however, are all concerned with statistical probability and estimation of effect size. They were not designed to handle complex pattern-searching queries, and there are currently no satisfactory methods available for this purpose. In the domain of geographical information science, the ability of current software systems to recognize the relationship between neighboring areas is determined by whether the software has the property of topology, and in particular the branch of topology called pointset topology. Pointset topology is concerned with the concepts of sets of points, their neighborhood, and nearness (Worboys, 1995). It is this concept that allows for the analysis of contiguous areas. Many current GIS, such as ArcView 3.2 (2002), do not have this property and so cannot deal with contiguity problems such as identifying complex geographical patterns involving neighboring areas. More sophisticated software such as ArcInfo, however, has topological properties and in theory can identify complex patterns of adjacent neighbors (ArcInfo 8.2, 2002). Three major difficulties are nevertheless associated with this type of searching. The first is that any complex geographical pattern search must be programmed into the software separately, which is time-consuming and requires a high level of programming expertise. The other two are that the resulting programs are computationally very intensive and generate very large result files. In this paper, we describe early work in adapting techniques that are used successfully in computational chemistry to identify geographical patterns in public health data.

8.1.2 Computational Chemistry and Graph Theory

In the field of computational chemistry, sophisticated techniques have been developed for the efficient storage and retrieval of various types of chemical information.
Highly specific, sophisticated, and flexible searches can be carried out within large databases of molecular structures using techniques derived from graph theory, a branch of mathematics. Graph-theoretical methods of storing 2D and 3D chemical structures have been developed within the Chemoinformatics Research Group in the Department of Information Studies at the University of Sheffield (Willett, 1995, 1999).

Graph theory is used to describe a set of objects, or nodes, and the relationships, or edges, between the nodes. In computational chemistry, nodes are used to represent the atoms in chemical structures. The edges represent the bonds in 2D chemical structure representations and interatomic distances in 3D chemical structure representations of the molecule. The resulting graph is called a connection table and contains a list of all the (non-hydrogen) atoms within the structure and their relationships to each other, in terms of bonds (2D) or distances (3D) (Willett, 1995, 1999). Thus, information about molecules can be stored in databases and retrieved using algorithms developed to identify identical structures (a relationship called isomorphism). There are three types of isomorphism used to compare pairs of graphs:

• Graph isomorphism, used to check whether two graphs are identical
• Subgraph isomorphism, used to check whether one graph is completely contained within another graph
• Maximum common subgraph isomorphism, used to identify the largest subgraph common to a pair of graphs

Algorithms using these types of isomorphism have been developed and used successfully within chemistry to represent and search large files of 2D and 3D structures. The principle of representing information in terms of nodes and edges is not, however, exclusive to computational chemistry and has been used in other areas.
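The first two types of isomorphism listed above can be illustrated with a small brute-force sketch. This is not the RASCAL algorithm and it does not attempt the maximum common subgraph problem; it only shows what "contained within" and "identical" mean for graphs stored as node and edge lists. The function names and the toy graphs are assumptions made for illustration, and the exhaustive search is only practical for very small graphs.

```python
from itertools import permutations

def has_subgraph(pattern_nodes, pattern_edges, target_nodes, target_edges):
    """Return True if the pattern can be embedded in the target graph.

    Brute force: try every assignment of pattern nodes to distinct target
    nodes and check that each pattern edge lands on a target edge.
    """
    target = {frozenset(edge) for edge in target_edges}
    for image in permutations(target_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, image))
        if all(frozenset((mapping[a], mapping[b])) in target
               for a, b in pattern_edges):
            return True
    return False

def are_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Two graphs with equal node and edge counts are identical
    if one of them embeds in the other."""
    same_size = (len(nodes_a) == len(nodes_b)
                 and len(set(map(frozenset, edges_a)))
                 == len(set(map(frozenset, edges_b))))
    return same_size and has_subgraph(nodes_a, edges_a, nodes_b, edges_b)

# Hypothetical toy target: a chain a-b-c-d with a branch b-e.
target_nodes = ["a", "b", "c", "d", "e"]
target_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "e")]

# Two toy patterns: a three-armed star and a triangle.
star_nodes, star_edges = [1, 2, 3, 4], [(1, 2), (1, 3), (1, 4)]
triangle_nodes, triangle_edges = [1, 2, 3], [(1, 2), (2, 3), (1, 3)]

star_found = has_subgraph(star_nodes, star_edges, target_nodes, target_edges)
triangle_found = has_subgraph(triangle_nodes, triangle_edges,
                              target_nodes, target_edges)
```

In this toy case the star is found (its center maps to node b) and the triangle is not, because the target graph contains no cycle.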
If one considers the map of the London Underground as an example of a geographical map, it can be regarded as a graph, with the nodes of the graph representing the stations and the edges representing connections between stations; for example, Russell Square and Covent Garden are on the same underground line, the Piccadilly line. Most other geographical maps or spatially distributed data could be represented in this way.

The aim of the study was to assess the ability of the graph-theoretical methods used in computational chemistry to identify a series of increasingly complex patterns of geographical areas that are of interest in public health. We were particularly interested in identifying areas of deprivation and areas of deprivation that have poor health. We briefly describe the MCS algorithm and the structure of the data files that were developed for searching the geographical data. After presenting the results of the searches, we discuss the utility of the method for identifying geographical patterns for public health.

8.2 Methods

8.2.1 Program

The RASCAL program, an example of a maximum common subgraph isomorphism method that has been used previously within chemoinformatics, was modified so that it could be used with geographically based public health data: the nodes were geographical areas and the edges were the associations between these areas. Just as chemical structures can have information associated with them, such as atomic type, geographical areas can also have information associated with them, such as deprivation, census variables, and mortality and morbidity information. The modified program had previously been validated using a test data set (Bath et al., 2002a). The modified RASCAL program can identify all geographical patterns within the area of interest that match a predefined geographical pattern, in terms of variable criteria and area adjacency.
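Matching a pattern "in terms of variable criteria and area adjacency" can be sketched as a small backtracking search over an attributed graph. The sketch below is a simplified stand-in, not the modified RASCAL program: the function name `find_matches`, the dictionary layout, and the five toy areas are all hypothetical, and the search makes no attempt at the efficiency a real MCS implementation would need.

```python
def find_matches(pattern_adj, pattern_ok, area_adj, area_attrs):
    """Enumerate assignments of pattern nodes to distinct areas.

    pattern_adj: {query_node: set of neighboring query nodes}
    pattern_ok:  {query_node: predicate applied to an area's attributes}
    area_adj:    {area: set of adjacent areas}
    area_attrs:  {area: attribute dict}
    """
    order = list(pattern_adj)
    matches = []

    def extend(mapping):
        if len(mapping) == len(order):
            matches.append(dict(mapping))
            return
        q = order[len(mapping)]
        for area in area_adj:
            if area in mapping.values() or not pattern_ok[q](area_attrs[area]):
                continue
            # Every pattern neighbor placed so far must be a map neighbor.
            if all(mapping[n] in area_adj[area]
                   for n in pattern_adj[q] if n in mapping):
                mapping[q] = area
                extend(mapping)
                del mapping[q]

    extend({})
    return matches

# Hypothetical toy map: area A touches B, C, and D; D also touches E.
area_adj = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"},
            "D": {"A", "E"}, "E": {"D"}}
area_attrs = {"A": {"dep": 5}, "B": {"dep": 5}, "C": {"dep": 5},
              "D": {"dep": 5}, "E": {"dep": 2}}

# Pattern: a central node adjacent to three others, all in quintile 5.
pattern_adj = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1}}
most_deprived = lambda attrs: attrs["dep"] == 5
pattern_ok = {q: most_deprived for q in pattern_adj}

matches = find_matches(pattern_adj, pattern_ok, area_adj, area_attrs)
distinct_area_sets = {frozenset(m.values()) for m in matches}
```

Because the three outer pattern nodes are interchangeable, the same set of areas is reported once per ordering; collapsing matches to sets of areas, as in `distinct_area_sets`, is one simple way to remove these symmetric duplicates.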
The program requires two distinct pieces of information about each geographical area: variable information that will be used in the selection criteria and information about which areas are neighboring.

8.2.2 Data

8.2.2.1 Geographical Area

The geographical area used in the study was the area previously covered by the Trent Region Health Authority, which includes South Yorkshire, Derbyshire, Leicestershire, Nottinghamshire, Lincolnshire, and South Humberside (Figure 8.1). The areas of interest were the 10,665 enumeration districts (EDs) that make up Trent region. EDs are the lowest level of census geography in England and Wales, representing on average 200 households in 1991. Information on two census-derived variables was used in the study: deprivation and the standardized long-term limiting illness ratio for people aged under 75 years (SLTLI<75).

8.2.2.2 Deprivation

The Townsend Material Deprivation Index (Townsend et al., 1988) was calculated for each ED within the Trent region, and this index was used to assign each ED a deprivation quintile. The Townsend Material Deprivation Index is a composite score formed by summing four standardized variables taken from the 1991 Census small area statistics (SAS). The census variables are unemployment, overcrowding, lack of owner-occupied accommodation, and lack of car ownership. This index was chosen because previous studies have suggested that it is a reasonable measure for explaining material disadvantage (Morris and Carstairs, 1991). A high positive score indicates relatively high levels of deprivation within an area, whereas a high negative score indicates relatively high levels of affluence within an area. The Townsend Material Deprivation Index was calculated for each ED within Trent, standardized to Trent.
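A composite score of this kind, a sum of four standardized variables, can be sketched as follows. The sketch follows only what the text states: each variable is converted to a z-score across the study region and the four z-scores are added. Any transformation applied to the raw variables before standardization is omitted, the use of the population standard deviation is an assumption, and the three EDs and their percentages are invented for illustration.

```python
from statistics import mean, pstdev

def townsend_scores(rows):
    """Sum of z-scores for each ED.

    rows: {ed_id: (unemployment, overcrowding,
                   non_owner_occupied, no_car)} as percentages.
    Standardization is relative to the EDs supplied, mirroring
    "standardized to Trent" in the text.
    """
    ids = list(rows)
    columns = list(zip(*(rows[ed] for ed in ids)))
    scores = dict.fromkeys(ids, 0.0)
    for column in columns:
        mu, sigma = mean(column), pstdev(column)
        for ed, value in zip(ids, column):
            scores[ed] += (value - mu) / sigma
    return scores

# Hypothetical percentages for three EDs (not census data).
toy_rows = {
    "ED1": (5, 2, 20, 15),
    "ED2": (10, 4, 40, 30),
    "ED3": (15, 6, 60, 45),
}
toy_scores = townsend_scores(toy_rows)
```

In the toy data ED3 has the highest value on every variable and therefore the highest (most deprived) score, ED1 has the lowest, and the scores sum to zero because each z-score column does.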
In total, 195 EDs could not be allocated a deprivation score because of missing values in one or more of the census variables, generally because of low counts and the suppression thresholds built into the census tables (Dale and Marsh, 1993). These EDs were given a deprivation quintile value of 99. The remaining 10,470 EDs were divided equally among the deprivation quintiles on the basis of their Townsend score. A quintile value of 5 indicated those EDs within the 20% most deprived areas, and a quintile value of 1 indicated those EDs within the 20% most affluent, relative to Trent. Figure 8.2 shows the map of Trent region shaded into quintiles on the basis of the Townsend deprivation score. Because of their relatively small size and large number, individual EDs are difficult to distinguish for the whole of Trent. To show individual EDs more clearly, an area within the south/center of Sheffield has been selected. The maps of Sheffield center show that the more deprived areas are predominantly to the northeast of the map, within the wards of Castle, Manor, Park, Sharrow, and Netherthorpe, which surround the south of the city center.

FIGURE 8.1 Map of Trent region showing the enumeration districts for the 1991 census. Areas labeled on the map: Barnsley, Doncaster, Rotherham, Sheffield, South Humber, Lincolnshire, North Nottinghamshire, Nottingham, North Derbyshire, South Derbyshire, and Leicester. (From 1991 Census: Digitised Boundary Data (England and Wales).)

8.2.2.3 Standardized Long-Term Limiting Illness for People Aged Less Than 75

Long-term limiting illness was also taken from the 1991 Census SAS. The indirect standardization method was used, standardizing each ED by age and sex to Trent region for all persons aged less than 75 years. The ED-based population estimates used in the standardization were taken from the Estimating with Confidence Project, which adjusted for the underenumeration that occurred in the 1991 Census (Simpson et al., 1995).
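The indirect standardization described here, and the equal-sized quintile banding with code 99 for missing values used for the Townsend scores, can be sketched in a few lines. The function names, the two age bands, and all rates and counts are hypothetical; the handling of ties and of totals not divisible by five is an assumption, since the text does not specify it.

```python
MISSING = 99  # code the chapter uses for EDs without a value

def standardized_ratio(observed_cases, local_population, reference_rates):
    """Indirectly standardized ratio, scaled so that 100 = as expected.

    local_population: {(sex, age_band): persons in the ED}
    reference_rates:  {(sex, age_band): regional illness rate}
    """
    expected = sum(reference_rates[stratum] * persons
                   for stratum, persons in local_population.items())
    return 100.0 * observed_cases / expected

def assign_quintiles(values):
    """Rank valid values into five equal bands (1 = lowest, 5 = highest)."""
    valid = sorted((v, ed) for ed, v in values.items() if v is not None)
    quintiles = {ed: MISSING for ed in values}
    for rank, (_, ed) in enumerate(valid):
        quintiles[ed] = rank * 5 // len(valid) + 1
    return quintiles

# Hypothetical regional rates and one hypothetical ED.
reference_rates = {("M", "0-44"): 0.05, ("M", "45-74"): 0.20,
                   ("F", "0-44"): 0.04, ("F", "45-74"): 0.18}
local_population = {("M", "0-44"): 100, ("M", "45-74"): 80,
                    ("F", "0-44"): 110, ("F", "45-74"): 90}
toy_ratio = standardized_ratio(52, local_population, reference_rates)

# Ten EDs with values plus one ED whose value was suppressed.
toy_values = {f"ED{i}": float(i) for i in range(10)}
toy_values["EDx"] = None
toy_quintiles = assign_quintiles(toy_values)
```

In the toy ED, 41.6 cases are expected and 52 are observed, so the ratio is about 125, that is, roughly 25% more limiting long-term illness than the regional rates would predict.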
A value of 100 signifies that the observed number of persons under 75 years with limiting long-term illness is equivalent to the number of persons expected, taking into account the age-specific rates of Trent region overall. The resulting SLTLI<75 values were then assigned to quintiles, with the 20% lowest values assigned a quintile value of 1 and the highest 20% assigned a value of 5. The SLTLI<75 for 194 EDs could not be calculated because of confidentiality issues in the Census SAS tables (Dale and Marsh, 1993). These EDs were given a value of 99. Figure 8.3 shows the SLTLI<75 quintiles for Trent region and for the selected area within Sheffield. The higher SLTLI<75 scores can again be seen predominantly within the northeast of the map, surrounding the city center to the south.

FIGURE 8.2 Maps showing the Townsend deprivation quintile for each ED within the Trent region and an inner-city area of Sheffield (striped areas signify missing data). Legend: Trent deprivation quintiles, standardized to Trent region (No. of EDs): 1 (2094), 2 (2094), 3 (2094), 4 (2094), 5 (2094), missing values (195). (From 1991 Census: Digitised Boundary Data (England and Wales); 1991 Census: Small Area Statistics (England and Wales).)

8.2.2.4 Adjacency Information

As well as having a deprivation quintile and an SLTLI<75 value, each ED also has information about its neighboring EDs. The EDs were each assigned a number between 1 and 10,665. For each ED a list of neighboring ED numbers was recorded.

8.2.3 Storage of Information

All the information relating to each ED was stored in one space-separated text file. The file contained three parts. Part 1 held, on one line, the total number of EDs, the maximum number of neighboring EDs, and the number of variables. Part 2 held, for each ED, one line containing the ED number, ED name, the deprivation quintile, and the SLTLI<75 value.
Part 3 held, for each ED, one line containing the ED number and the ED number of each neighboring (or adjacent) ED. Table 8.1 shows an extract from the data file, showing part 1 and parts 2 and 3 for the ED 38PMFF03.

Part 1 in Table 8.1 shows that there were 10,665 EDs within the data file, a maximum of 22 neighboring EDs to any one central ED, and two variables. Part 2 shows that the ED 38PMFF03 was numbered 10,000 and had a deprivation quintile of 4 and an SLTLI<75 quintile of 4. Part 3 shows the numbers of the six neighboring EDs. Because the maximum number of neighboring EDs was 22, the modified RASCAL program expected 22 numbers to follow each ED number in part 3. The ED 38PMFF03 had only six neighboring EDs, so 16 zeroes are included to ensure that the ED had the 22 expected values.

FIGURE 8.3 Maps showing SLTLI<75 quintiles for the EDs in the Trent region and an inner-city area of Sheffield (striped areas signify missing data). Legend: standardized long-term limiting illness ratio <75 years, standardized to Trent (No. of EDs): 1, up to 66.12 (2094); 2, 66.13 to <84.14 (2094); 3, 84.14 to <103.3 (2094); 4, 103.3 to <131 (2094); 5, 131+ (2095); missing values (194). (From 1991 Census: Digitised Boundary Data (England and Wales); 1991 Census: Small Area Statistics (England and Wales).)

8.2.4 Queries

8.2.4.1 Query Patterns

Figure 8.4 shows the query patterns that were used to identify geographical patterns within the Trent region. These queries were developed to provide a range of pattern sizes and arrangements of deprived EDs of potential interest. Query 1 is a fairly simple pattern looking for a central ED adjacent to three EDs, all with a deprivation quintile within the top 20% most deprived. Query 2 has a central ED adjacent to four EDs, all with deprivation quintiles within the top 20% most deprived and with the top 20% highest levels of SLTLI<75.
Query 3 looks for a pattern of EDs forming a chain of five, all with deprivation quintiles within the top 20% most deprived and with SLTLI<75 within the top 20% highest scores. Thus, although Queries 2 and 3 both contain the same number of EDs, i.e., five, they represent very different shapes of patterns. For example, Query 2 could represent a tight cluster of deprived EDs, with deprivation and poor health concentrated in a given area, whereas Query 3 could represent a chain of deprived EDs alongside, or bordering, a geographical feature, such as a road or river. Differentiating between clusters of deprivation and chains of deprivation in relation to geographical features in this way could be of value in understanding the local impact of deprivation and health for planning health-care and social-care services. Query 4 is similar to Query 3 but seeks to identify chains of nine EDs. Query 5 looks for a more complicated pattern of nine EDs, all with deprivation quintiles within the top 20% most deprived and with the top 20% highest levels of SLTLI<75. Thus, like Queries 2 and 3, Queries 4 and 5 had the same number of nodes, i.e., nine, but represented differently shaped patterns that could be linked with geographical features.

TABLE 8.1 Extract from the ED Information Data File

    10,665 22 2                                                                       (part 1)
    10,000 38PMFF03 4 4                                                               (part 2)
    10,000 9,998 9,999 10,001 10,002 10,003 10,004 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    (part 3)

8.2.4.2 Query Data File

The data files for each of the queries were set up in a similar way to that of the ED data file but with two extra parts.
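Since the query files share the first three parts of the ED data file layout shown in Table 8.1, a reader for that shared layout is a useful reference point. The sketch below is an assumption about how such a file could be parsed, not the input routine of the modified RASCAL program: the function name is hypothetical, the three-ED sample is invented, and stripping thousands separators is a defensive guess in case the commas printed in Table 8.1 also appear in the file.

```python
def parse_ed_file(text):
    """Parse the three-part, space-separated ED layout.

    Part 1: total EDs, maximum neighbors, number of variables.
    Part 2: one line per ED: number, name, variable values.
    Part 3: one line per ED: number, then neighbor numbers padded
            with zeroes up to the maximum number of neighbors.
    """
    def to_int(token):
        return int(token.replace(",", ""))

    lines = [line.split() for line in text.strip().splitlines()]
    n_eds, max_neighbors, n_vars = map(to_int, lines[0])

    attributes, adjacency = {}, {}
    for fields in lines[1:1 + n_eds]:
        number, name = to_int(fields[0]), fields[1]
        attributes[number] = (name, [to_int(v) for v in fields[2:2 + n_vars]])
    for fields in lines[1 + n_eds:1 + 2 * n_eds]:
        number = to_int(fields[0])
        neighbors = [to_int(v) for v in fields[1:1 + max_neighbors]]
        adjacency[number] = [n for n in neighbors if n != 0]  # drop padding
    return attributes, adjacency

# Hypothetical three-ED file in the same layout.
sample = """
3 2 2
1 EDA 5 4
2 EDB 5 5
3 EDC 1 2
1 2 3
2 1 0
3 1 0
"""
attributes, adjacency = parse_ed_file(sample)
```

Dropping the zero padding at read time keeps the in-memory adjacency lists at their true length, while the fixed-width rows remain available for any program that expects them.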
Part 1 held, on one line, the total number of query nodes, the maximum number of neighboring query nodes, and the number of variables. Part 2 held, for each query node, one line containing the query node number, query node name, and deprivation quintile. Part 3 held, for each query node, one line containing the query node number and the query node number of each neighboring query node.

FIGURE 8.4 Diagrams showing query patterns and selection criteria. Query 1 (query nodes 1 to 4): all EDs within the top 20% deprived. Query 2 (query nodes 1 to 5), Query 3 (query nodes 1 to 5), Query 4 (query nodes 1 to 9), and Query 5 (query nodes 1 to 9): all EDs within the top 20% deprived and SLTLI<75 within the top 20% highest scores.

Parts 4 and 5 allowed queries to be set up with ranges rather than absolute numbers. Part 4 held, for each query node, one line containing the query node number and a tolerance value, expressed as a percentage, for the deprivation quintile. Part 5 held, for each query node, one line containing the query node number and a tolerance direction for the deprivation tolerance value. The direction allowed the tolerance to be applied on both sides of the deprivation quintile value or in one direction only, i.e., greater than or less than. The query data file for Query 1 is displayed in Table 8.2.
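The effect of the tolerance value and tolerance direction in parts 4 and 5 can be sketched as a range calculation. The chapter gives only one direction code, 0, meaning that the tolerance applies on either side of the value; the positive and negative codes used below for the one-sided cases are hypothetical placeholders, as are the function names.

```python
def tolerance_range(value, tolerance_pct, direction=0):
    """Return the (low, high) range accepted for a query value.

    direction 0: tolerance on either side of the value (from the chapter).
    direction > 0 / < 0: upward-only / downward-only tolerance; these
    codes are assumptions, since the chapter does not list them.
    """
    delta = value * tolerance_pct / 100.0
    if direction == 0:
        return value - delta, value + delta
    if direction > 0:
        return value, value + delta
    return value - delta, value

def within_tolerance(candidate, value, tolerance_pct, direction=0):
    low, high = tolerance_range(value, tolerance_pct, direction)
    return low <= candidate <= high

# Query 1 settings: deprivation quintile 5, 1% tolerance, direction 0.
low, high = tolerance_range(5, 1, 0)
```

With a query value of 5 and a 1% two-sided tolerance the accepted range is 4.95 to 5.05, so only EDs in quintile 5 match; a wider tolerance of, say, 20% would also admit quintile 4.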
Part 1 of Table 8.2 states that there were four query nodes, a maximum of three connections, and one variable. Part 2 states that the four query nodes are called Q1, Q2, Q3, and Q4, with the query node numbers 1, 2, 3, and 4, respectively. All the query nodes have a deprivation quintile of 5. Part 3 shows the connections within the pattern. It states that query node 1 is connected to query nodes 2-4, while query nodes 2-4 are connected only to query node 1. Part 4 states that all the query node deprivation values have a tolerance of 1%. In part 5, all the query nodes have a tolerance direction of 0, indicating that the tolerance applies on either side of the deprivation quintile; that is, the deprivation quintile for each query node can be between 4.95 and 5.05. The query data files for Queries 2-5 follow a similar pattern to the data file for Query 1. The modified RASCAL program was used to run each of these queries against the Trent ED data file.

TABLE 8.2 Data File for Query 1

    4 3 1        (part 1)
    1 Q1 5       (part 2)
    2 Q2 5
    3 Q3 5
    4 Q4 5
    1 2 3 4      (part 3)
    2 1 0 0
    3 1 0 0
    4 1 0 0
    1 1          (part 4)
    2 1
    3 1
    4 1
    1 0          (part 5)
    2 0
    3 0
    4 0

[...]

... 1527 713 88 2 661 552 1 181 350 n/a n/a n/a 1 2 3 4 5

FIGURE 8.5 Map showing the results from Query 1 for the inner-city area of Sheffield. (From 1991 Census: Digitised Boundary Data (England and Wales).)

FIGURE 8.6 Map showing the results from Query 2 for the inner-city area of Sheffield. (From 1991 Census: Digitised Boundary Data (England and Wales).)

FIGURE 8.7 Map ...

FIGURE 8.9 Map showing the results from Query 5 for the inner-city area of Sheffield. (From 1991 Census: Digitised Boundary Data (England and Wales).)

... forming a chain of nine, is a subset of the EDs identified in Query 3, which form a link of five EDs. Comparing Figures 8.7 and 8.8 shows that the shaded area within Firth Park in Figure 8.7 is not shaded in Figure 8.8. This indicates that ...
districts, and Paul White and Paul Brindley for helpful discussions. Census output is Crown copyright and is reproduced with the permission of the Controller of HMSO and the Queen's Printer for Scotland. This work is based on data provided with the support of the ESRC and JISC and uses boundary material which is copyright of the Crown and the ED-Line consortium.

References

Alexander, F.E. and Cuzick, ... disease clusters. In Geographical and Environmental Epidemiology, edited by Elliott, P., Cuzick, J., English, D., and Stern, R., pp. 238-250 (Oxford: Oxford University Press).
ArcInfo Version 8.2. Available from ESRI GIS and Mapping Software, Redlands, CA (http://www.esri.com/, accessed on 25 May 2002).
ArcView Version 3.2. Available from ESRI GIS and Mapping Software, Redlands, CA (http://www.esri.com/, ...
