Data Analysis, Machine Learning and Applications (Part 3)
When the null hypothesis that the spatial correlation parameter ρ is zero is tested in the presence of spatial correlation in the error term, the Wald test has very good power, although the power is higher with the sparser weighting matrix. The latter characteristic is a general feature of all conducted tests. The power for the spatial error parameter in the presence of a non-zero spatial lag parameter is lower. However, the power of the Wald test in these circumstances is (much) greater than the power achievable with Lagrange Multiplier tests. In Figure 1c the best-performing LM test, LM_A, is plotted. All LM tests relying on OLS residuals fail seriously to detect the true DGP. The Wald test based on GMM estimates is comparable to the Wald test based on MLE estimates, but only in detecting a significant lag parameter in the presence of a significant spatial error parameter; in the reverse case the Wald test using GMM estimates performs much worse.

As a further model selection approach, the performance of information criteria is analyzed. The performance of the classical Akaike information criterion (AIC) and the bias-corrected AIC (AICc) are almost identical. In Figure 1d the share of cases in which AIC/AICc identifies the correct DGP is plotted on the y-axis. All information criteria fail in more than 15% of the cases to identify the correct, more parsimonious model, i.e. SARAR(1,0) or SARAR(0,1) instead of SARAR(1,1). However, in the remaining experiments (ρ = 0.05, ..., 0.2 or λ = 0.05, ..., 0.2) AIC/AICc is comparable in performance to the Wald test. BIC performs better than AIC/AICc at detecting SARAR(1,0) or SARAR(0,1) instead of SARAR(1,1), but much worse in the remaining experiments.

In order to propose a general procedure for model selection, the approach must also be suitable if the true DGP is SARAR(1,0) or SARAR(0,1). In this case the Wald test based on the general model again has the appropriate size and very good power. Furthermore, the sensitivity to different weighting matrices is less severe. However, the power is smallest for the test with the null hypothesis H0: λ = 0 and with distance as weighting scheme W2. The Wald test using GMM estimates is again comparable when testing for the spatial lag parameter, but worse when testing for the spatial error parameter. Both LM statistics based on OLS residuals are not significantly different from the power function of the Wald test based on the general model. However, in this case LM_A fails to identify the correct DGP.

The Wald test outperforms the information criteria regarding the identification of SARAR(1,0) or SARAR(0,1). If OLS is the DGP, the correct model is chosen only about two thirds of the time by AIC/AICc, but comparably often to Wald by BIC. If SARAR(1,0) is the data generating process, all information criteria perform worse than the Wald test, independent of the underlying weighting scheme. If SARAR(0,1) is the data generating process, BIC is worse than the Wald test, and AIC/AICc performs slightly better for small values of the spatial parameter but is outperformed by the Wald test for higher values of the spatial parameters. For the sake of completeness it is noted that no valid model selection can be conducted using likelihood ratio tests.
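For reference, the criteria compared here trade off fit against complexity as AIC = 2k − 2 ln L, AICc = AIC + 2k(k+1)/(n−k−1), and BIC = k ln n − 2 ln L, where k is the number of estimated parameters and n the sample size. A small sketch of a criterion-based model choice, with placeholder log-likelihoods rather than values from the study:

```python
import numpy as np

def information_criteria(loglik, k, n):
    """AIC, bias-corrected AICc, and BIC for a model with
    k estimated parameters fitted to n observations."""
    aic = 2 * k - 2 * loglik
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction
    bic = k * np.log(n) - 2 * loglik
    return aic, aicc, bic

# Hypothetical log-likelihoods and parameter counts for the four
# candidate specifications; the smallest criterion value wins.
candidates = {
    "OLS":        (-512.3, 4),
    "SARAR(1,0)": (-508.1, 5),
    "SARAR(0,1)": (-509.0, 5),
    "SARAR(1,1)": (-507.9, 6),
}
n = 400  # hypothetical sample size
for name, (ll, k) in candidates.items():
    aic, aicc, bic = information_criteria(ll, k, n)
    print(f"{name:11s} AIC={aic:7.1f}  AICc={aicc:7.1f}  BIC={bic:7.1f}")
```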
Fig. 1. a) Power of the Wald test based on the general model and MLE estimates. b) Power of the Wald test based on the general model and GMM estimates. c) Power of the Lagrange Multiplier test using LM_A as test statistic. d) Correct model choice of the better performing information criterion (AIC/AICc). [Each panel plots power (panel d: the share of correct model choices) against ρ, λ in [0, 0.2] for the weighting schemes W1 and W2, with curves for λ = 0.5 and ρ = 0.5 respectively; axis data omitted.]

To conclude, we find that the 'general to specific' approach is the most suitable procedure to identify the correct data generating process (DGP) for Cliff-Ord type spatial models. Independently of whether the true DGP is a SARAR(1,1), SARAR(1,0), SARAR(0,1), or just a regression model without any spatial correlation, the general model should be estimated and the Wald tests conducted. The chance of identifying the true DGP is then higher compared to the alternative model choice criteria based on LM tests, LR tests, or information criteria like AIC, AICc or BIC.
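A minimal sketch of the recommended general-to-specific step, assuming the general SARAR(1,1) model has already been fitted by MLE; the estimates and standard errors below are placeholders, not output from the paper:

```python
from scipy import stats

def wald_test(estimate, std_err):
    """Wald statistic and p-value for H0: parameter = 0
    (asymptotically chi-squared with 1 degree of freedom)."""
    w = (estimate / std_err) ** 2
    return w, stats.chi2.sf(w, df=1)

# Hypothetical MLE output from the general SARAR(1,1) model.
rho_hat, se_rho = 0.12, 0.04   # spatial lag parameter
lam_hat, se_lam = 0.03, 0.05   # spatial error parameter

for name, est, se in [("rho", rho_hat, se_rho), ("lambda", lam_hat, se_lam)]:
    w, p = wald_test(est, se)
    print(f"H0: {name} = 0  ->  W = {w:.2f}, p = {p:.3f}")

# Keep only the spatial terms whose Wald test rejects; with the numbers
# above one would retain the lag term and drop the error term,
# i.e. choose SARAR(1,0) over SARAR(1,1).
```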
Segmentation and Classification of Hyper-Spectral Skin Data

Hannes Kazianka (1), Raimund Leitner (2) and Jürgen Pilz (1)

(1) Institute of Statistics, Alpen-Adria-Universität Klagenfurt, Universitätsstraße 65-67, 9020 Klagenfurt, Austria, {hannes.kazianka, juergen.pilz}@uni-klu.ac.at
(2) CTR Carinthian Tech Research AG, Europastraße 4/1, 9524 Villach, Austria, Raimund.Leitner@ctr.at

Abstract. Supervised classification methods require reliable and consistent training sets. In image analysis, where class labels are often assigned to the entire image, the manual generation of pixel-accurate class labels is tedious and time consuming. We present an independent component analysis (ICA)-based method to generate these pixel-accurate class labels with minimal user interaction. The algorithm is applied to the detection of skin cancer in hyper-spectral images. Using this approach it is possible to remove artifacts caused by sub-optimal image acquisition. We report on the classification results obtained for the hyper-spectral skin cancer data set with 300 images using support vector machines (SVM) and model-based discriminant analysis (MclustDA, MDA).

1 Introduction

Hyper-spectral images consist of several, up to a hundred, images acquired at different (mostly narrow-band and contiguous) wavelengths. Thus, a hyper-spectral image contains pixels represented as multidimensional vectors whose elements indicate the reflectivity at a specific wavelength. For a contiguous set of narrow-band wavelengths these vectors correspond to spectra in the physical sense and are equal to spectra measured with, e.g., spectrometers.

Supervised classification of hyper-spectral images requires a reliable and consistent training set. In many applications labels are assigned to the full image instead of to each individual pixel, even if instances of all the classes occur in the image. To obtain a reliable training set it may be necessary to label the images on a pixel-by-pixel basis. Manually generating pixel-accurate class labels requires a lot of effort; cluster-based automatic segmentation is often sensitive to measurement errors and illumination problems. In the following we present a labelling strategy for hyper-spectral skin cancer data that uses PCA, ICA and K-means clustering. For the classification of unknown images, we compare support vector machines and model-based discriminant analysis.

Section 2 describes the methods that are used for the labelling approach. The classification algorithms are discussed in Section 3. In Section 4 we present the segmentation and classification results obtained for the skin cancer data set, and Section 5 is devoted to discussion and conclusions.

2 Labelling

Hyper-spectral data are highly correlated and contain noise, which adversely affects classification and clustering algorithms.
As the dimensionality of the data equals the number of spectral bands, using the full spectral information leads to high computational complexity. To overcome the curse of dimensionality we use PCA to reduce the dimensions of the data, and inherently also unwanted noise. Since different features of the image may have equal score values for the same principal component, an additional feature extraction step is proposed. ICA makes it possible to detect acquisition artifacts like saturated pixels and inhomogeneous illumination. These effects can be significantly reduced in the spectral information, giving rise to an improved segmentation.

2.1 Principal Component Analysis (PCA)

PCA is a standard method for dimension reduction and can be performed by singular value decomposition. The algorithm gives uncorrelated principal components. We assume that those principal components that correspond to very low eigenvalues contribute only noise. As a rule of thumb, we chose to retain at least 95% of the variability, which led to selecting 6-12 components.

2.2 Independent Component Analysis (ICA)

ICA is a powerful statistical tool to determine hidden factors of multivariate data. The ICA model assumes that the observed data, x, can be expressed as a linear mixture of statistically independent components, s. The model can be written as x = As, where the unknown matrix A is called the mixing matrix. Defining W as the unmixing matrix, we can calculate s as s = Wx. As we have already performed a dimension reduction, we can assume that noise is negligible and A is square, which implies W = A^{-1}. This significantly simplifies the estimation of A and s. Provided that no more than one independent component has a Gaussian distribution, the model can be uniquely estimated up to scalar multipliers. There exists a variety of algorithms for fitting the ICA model. In our work we focused on the two most popular implementations, which are based on maximisation of non-Gaussianity and minimisation of mutual information, respectively: FastICA and FlexICA.

FastICA

The FastICA algorithm developed by Hyvärinen et al. (2001) uses negentropy, J(y), as a measure of Gaussianity. Since negentropy is zero for Gaussian variables and always nonnegative, one has to maximise negentropy in order to maximise non-Gaussianity. To avoid computational problems the algorithm uses an approximation of negentropy: if G denotes a nonquadratic function and we want to estimate one independent component s, we can approximate

J(y) \approx \left[ E\{G(y)\} - E\{G(\nu)\} \right]^2 ,

where \nu is a standardised Gaussian variable and y is an estimate of s. We use G(y) = \log\cosh y, since this has been shown to be a good choice. Maximisation directly leads to a fixed-point iteration algorithm that is 20-50 times faster than other ICA implementations. To estimate several independent components a deflationary orthogonalisation method is used.

FlexICA

Mutual information is a natural measure of the information that members of a set of random variables have on the others. Choi et al. (2000) proposed an ICA algorithm that attempts to minimise this quantity. All independent components are estimated simultaneously using a natural gradient learning rule, under the assumption that the source signals have a generalized Gaussian distribution with density

q_i(y_i) = \frac{r_i}{2 \sigma_i \Gamma(1/r_i)} \exp\left( - \frac{1}{r_i} \left| \frac{y_i}{\sigma_i} \right|^{r_i} \right) ,

where r_i denotes the Gaussian exponent, which is chosen in a flexible way depending on the kurtosis of y_i.
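The preprocessing chain of Sections 2.1 and 2.2 could look as follows with scikit-learn; the image size is reduced and the data are random, so everything except the 95% variance threshold and the logcosh contrast is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Stand-in for a hyper-spectral cube (the real images are 512 x 512
# pixels with 300 bands); random data, so PCA keeps many components
# here, whereas correlated real spectra reduce to roughly 6-12.
rng = np.random.default_rng(0)
cube = rng.random((128, 128, 300))
X = cube.reshape(-1, 300)            # one spectrum per row

# Step 1: PCA retaining at least 95% of the variability.
pca = PCA(n_components=0.95)
scores = pca.fit_transform(X)
print("retained components:", pca.n_components_)

# Step 2: ICA on the reduced data; after PCA the mixing matrix can be
# assumed square, so no further dimension reduction happens here.
ica = FastICA(n_components=scores.shape[1], fun="logcosh", random_state=0)
S = ica.fit_transform(scores)        # independent component scores
ic_images = S.reshape(128, 128, -1)  # view each IC as an image
```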
2.3 Two-Stage K-Means Clustering

From a statistical point of view it may be inappropriate to use K-means clustering, since K-means cannot use all the higher-order information that ICA provides. There are several approaches that avoid using K-means; for example, Shah et al. (2004) proposed the ICA mixture model (ICAMM). However, for large images this algorithm fails to converge. We developed a two-stage K-means clustering strategy that works particularly well with skin data; a code sketch of the procedure follows the list below. The choice of 5 resp. 3 clusters for the K-means algorithm has been determined empirically for the skin cancer data set.

1. Drop ICs that contain a high amount of noise or correspond to artifacts.
2. Perform K-means clustering with 5 clusters.
3. Merge the clusters that correspond to healthy skin into one cluster, which is labelled as skin.
4. Perform a second run of K-means clustering on the remaining clusters (inflamed skin, lesion, etc.), this time with 3 clusters. Label the clusters that correspond to the mole and melanoma centre as mole and melanoma. The remaining clusters are considered 'regions of uncertainty'.
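A minimal sketch of the two-stage clustering, assuming the independent component scores from the previous step; which ICs are noisy and which stage-1 clusters correspond to healthy skin is image-specific, so the hard-coded choices here are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_kmeans(ic_pixels, keep_ics, skin_clusters):
    """ic_pixels: (n_pixels, n_ics) array of IC scores.
    keep_ics: indices of ICs kept after dropping noisy/artifact ICs.
    skin_clusters: stage-1 cluster ids judged to be healthy skin."""
    X = ic_pixels[:, keep_ics]

    # Stage 1: 5 clusters; merge the healthy-skin clusters into one label.
    stage1 = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
    labels = np.full(len(X), -1)                # -1 = not yet labelled
    labels[np.isin(stage1, skin_clusters)] = 0  # 0 = skin

    # Stage 2: re-cluster the remaining pixels into 3 clusters
    # (lesion kernel, transition ring, other); offset the ids by 1.
    rest = labels == -1
    stage2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[rest])
    labels[rest] = stage2 + 1
    return labels

# Illustrative call: keep ICs 0-3, treat stage-1 clusters 2, 3, 4 as skin.
# labels = two_stage_kmeans(S, keep_ics=[0, 1, 2, 3], skin_clusters=[2, 3, 4])
```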
m  i=1 O i y i = 0, O i ≥ 0, Segmentation and Classification of Hyper-Spectral Skin Data 249 where x i are training points belonging to classes y i . The cost parameter C and the kernel function have to be chosen to suit to the problem. It is also possible to use different cost parameters for unbalanced data as was suggested by Veropoulos et al. (1999). Although SVMs were originally designed as binary classifiers, there exists a vari- ety of methods to extend them to k > 2 classes. In our work we focused on one- against-all and one-against-one SVMs. The one-against-all formulation trains each class against all remaining classes resulting in k binary SVMs. The one-against-one formulation uses k(k−1) 2 SVMs, each separating one class from one another. 4 Results A set of 310 hyper-spectral images (512 ×512 pixels and 300 spectral bands) of malign and benign lesions were taken in clinical studies at the Medical University Graz, Austria. They are classified as melanoma or mole by human experts on the basis of a histological examination. However, in our survey we distinguish between three classes, melanoma, mole and skin, since all these classes typically occur in the images. The segmentation task is especially difficult in this application: We have to take into account that melanoma typically occurs in combination with mole.To reduce the number of outliers in the training set we define a ‘region of uncertainty’ as a transition region between the kernels of mole and melanoma and between the lesion and the skin. 4.1 Training Figures 1(b) and 1(c) display the first step of the K-Means strategy described in Sec- tion 2.3. The original image displayed in Figure 1(a) shows a mole that is located in the middle of a hand. For PCA-transformed data, as in Figure 1(b), the algorithm performs poorly and the classes do not correspond to lesion, mole and skin regions (left and bottom). Even the lesion is in the same class together with an illumination problem. If the data is also transformed using ICA, as in Figure 1(c), the lesion is already identified and there exists a second class in the form of a ring around the lesion which is the desired ‘region of uncertainty’. The other classes correspond to wrinkles on the hand. Figure 1(d) shows the second K-Means step for the PCA transformed data. Although the second K-Means step makes it possible to separate the lesion from the illumina- tion problem it can be seen that the class that should correspond to the kernel of the mole is too large. Instances from other classes are present in the kernel. The second K-Means step with the ICA preprocessed data is shown in Figure 1(e). Not only the kernel is reliably detected but there also exists a transition region consisting of two classes. One class contains the border of the lesion. The second class separates the kernel from the remaining part of the mole. We believe that the FastICA algorithm is the most appropriate ICA implementation 250 Hannes Kazianka, Raimund Leitner and Jürgen Pilz (a) (b) (c) (d) (e) Fig. 1. The two iteration steps of the K-Means approach for both PCA ((b) and (d)) and ICA ((c) and (e)) are displayed together with the original image (a). The different gray levels indicate the cluster the pixel has been assigned to. for this segmentation task. The segmentation quality for both methods is very simi- lar, however the FastICA algorithm is faster and more stable. 
4 Results

A set of 310 hyper-spectral images (512 x 512 pixels and 300 spectral bands) of malign and benign lesions was taken in clinical studies at the Medical University Graz, Austria. The images are classified as melanoma or mole by human experts on the basis of a histological examination. However, in our survey we distinguish between three classes, melanoma, mole and skin, since all these classes typically occur in the images. The segmentation task is especially difficult in this application: we have to take into account that melanoma typically occurs in combination with mole. To reduce the number of outliers in the training set we define a 'region of uncertainty' as a transition region between the kernels of mole and melanoma and between the lesion and the skin.

4.1 Training

Figures 1(b) and 1(c) display the first step of the K-means strategy described in Section 2.3. The original image displayed in Figure 1(a) shows a mole that is located in the middle of a hand. For PCA-transformed data, as in Figure 1(b), the algorithm performs poorly and the classes do not correspond to lesion, mole and skin regions (left and bottom). Even the lesion is in the same class as an illumination problem. If the data is also transformed using ICA, as in Figure 1(c), the lesion is already identified and there exists a second class in the form of a ring around the lesion, which is the desired 'region of uncertainty'. The other classes correspond to wrinkles on the hand.

Figure 1(d) shows the second K-means step for the PCA-transformed data. Although the second K-means step makes it possible to separate the lesion from the illumination problem, the class that should correspond to the kernel of the mole is too large; instances from other classes are present in the kernel. The second K-means step with the ICA-preprocessed data is shown in Figure 1(e). Not only is the kernel reliably detected, but there also exists a transition region consisting of two classes: one class contains the border of the lesion, the other separates the kernel from the remaining part of the mole.

Fig. 1. The two iteration steps of the K-means approach for both PCA ((b) and (d)) and ICA ((c) and (e)) are displayed together with the original image (a). The different gray levels indicate the cluster the pixel has been assigned to.

We believe that the FastICA algorithm is the most appropriate ICA implementation for this segmentation task. The segmentation quality of both methods is very similar; however, the FastICA algorithm is faster and more stable.

To generate a training set of 12,000 pixel spectra per class, we labelled 60 mole images and 17 melanoma images using our labelling approach. The pixels in the training set are chosen randomly from the segmented images.

4.2 Classification

In Table 1 we present the classification results obtained for the different classifiers described in Section 3. As a test set we use 57 melanoma and 253 mole images. We use the output of the LDA classifier as a benchmark. LDA turns out to be the worst classifier for the recognition of moles: nearly one half of the mole images are misclassified as melanoma. On the other hand, LDA yields excellent results for the classification of melanoma, giving rise to the presumption that there is a large bias towards the melanoma class. With MDA we use three subclasses in each class. Although both MDA and LDA keep the covariance fixed, MDA models the data as a mixture of Gaussians, leading to a significantly higher recognition rate compared to LDA. Using FDA or PDA in combination with MDA does not improve the results. MclustDA performs best among these classifiers. Notice, however, that BIC overestimates the number of subclasses in each class, which is between 14 and 21. For all classes the model with varying shape, varying volume and varying orientation of the mixture components is chosen. This extra flexibility makes it possible to outperform MDA even though only half of the training points could be used due to memory limitations. Another significant advantage of MclustDA is its speed, taking around 20 seconds for a full image.

Table 1. Recognition rates obtained for the different classifiers

Pre-Proc.  Class      MDA      MclustDA  LDA
FlexICA    Mole       84.5%    86.5%     56.1%
FlexICA    Melanoma   89.4%    89.4%     98.2%
FastICA    Mole       84.5%    87.7%     56.1%
FastICA    Melanoma   89.4%    89.4%     98.2%

Pre-Proc.  Class      OAA-SVM  OAO-SVM   unbalanced SVM
FlexICA    Mole       72.7%    69.9%     87.7%
FlexICA    Melanoma   92.9%    94.7%     89.4%
FastICA    Mole       71.5%    69.9%     87.3%
FastICA    Melanoma   92.9%    94.7%     89.4%

Since misclassification of melanoma into the mole class is less favourable than misclassification of mole into the melanoma class, we clearly have unbalanced data in the skin cancer problem. Following Veropoulos et al. (1999) we can choose C_melanoma > C_mole = C_skin. We obtain the best results using the polynomial kernel of degree three with C_melanoma = 0.5 and C_mole = C_skin = 0.1. This method is clearly superior to the other SVM approaches. For the one-against-all (OAA-SVM) and the one-against-one (OAO-SVM) formulations we use Gaussian kernels with C = 2 and σ = 20. A drawback of all the SVM classifiers, however, is that training takes 20 hours (Centrino Duo 2.17 GHz, 2 GB RAM) and classification of a full image takes more than 2 minutes.

We discovered that the different ICA implementations have no significant impact on the quality of the classification output. FlexICA performs slightly better for the unbalanced SVM and the one-against-all SVM; FastICA gives better results for MclustDA. For all other classifiers the performances are equal.
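A sketch of the class-weighted SVM described above using scikit-learn's SVC; the kernel, degree, and cost ratio follow the text, while the toy data are illustrative and the class_weight mechanism (scikit-learn scales one base C per class rather than taking separate cost parameters) is a stand-in for the formulation of Veropoulos et al. (1999):

```python
import numpy as np
from sklearn.svm import SVC

# Per-class costs as in the text: C_melanoma = 0.5, C_mole = C_skin = 0.1.
# SVC exposes a base C plus per-class multipliers, so the same ratio is
# expressed here as C = 0.1 with weight 5 for the melanoma class.
clf = SVC(kernel="poly", degree=3, C=0.1,
          class_weight={"skin": 1.0, "mole": 1.0, "melanoma": 5.0})

# Toy training data: 6-dimensional feature vectors after PCA + ICA.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = np.array(["skin"] * 150 + ["mole"] * 100 + ["melanoma"] * 50)
clf.fit(X, y)  # multiclass handled internally via one-against-one
print(clf.predict(X[:5]))
```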
5 Conclusion

The combination of PCA and ICA makes it possible to detect both artifacts and the lesion in hyper-spectral skin cancer data. The algorithm projects the corresponding features onto different independent components; dropping the independent components that correspond to the artifacts and applying the two-stage K-means clustering leads to a reliable segmentation of the images. It is interesting to note that for the mole images in our study there is always one single independent component that carries the information about the whole lesion. This suggests a very simple segmentation in the case where the skin is healthy: keep the single independent component that contains the desired information and perform the K-means steps. For melanoma [...] a sensitivity of 95% is possible at the cost of 20% false-positives.

References

ABE, S. (2005): Support Vector Machines for Pattern Classification. Springer, London.
CHOI, S., CICHOCKI, A. and AMARI, S. (2000): Flexible Independent Component Analysis. Journal of VLSI Signal Processing, 26(1/2), 25–38.
FRALEY, C. and RAFTERY, A. (2002): Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 97, 611–631.
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001): The Elements of Statistical Learning. Springer, New York.
HYVÄRINEN, A., KARHUNEN, J. and OJA, E. (2001): Independent Component Analysis. Wiley, New York.
SHAH, C., ARORA, M. and VARSHNEY, P. (2004): Unsupervised classification of hyperspectral data: an ICA mixture model based approach. International Journal of Remote Sensing, 25, 481–487.
VEROPOULOS, K., CAMPBELL, C. and CRISTIANINI, N. (1999): Controlling the Sensitivity of Support Vector Machines. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI99), Stockholm.