Using Models to Classify Data


T_k(u) = (1 - |u|)\, I(|u| \le 1), \qquad (11.45)

and the Epanechnikov kernel:

E_k(u) = \frac{3}{4}(1 - u^2)\, I(|u| \le 1), \qquad (11.46)

where I(|u| \le 1) is an indicator function which is one if the absolute value of u is less than or equal to one, and zero otherwise. The most commonly used unbounded kernel function is the Gaussian or normal function, which has the following form:

N_k(u) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}u^2}. \qquad (11.47)

The ranges of these kernels are fixed, a fact that is clearest for the bounded kernels.

However, it is generally desirable to be able to alter the range of observations used in the kernel smoothing process. This is done by including a bandwidth parameter, λ, in the smoothing formula used to give X̂_t, the smoothed value of the raw observation X_t, as shown in Equation 11.48:

k_\lambda(u) = \frac{1}{\lambda}\, k\!\left(\frac{u}{\lambda}\right). \qquad (11.48)

This can then be used to give a smoothed value, X̂_t, as follows:

\hat{X}_t = \frac{\sum_{s=1}^{T} k_\lambda(t-s)\, X_s}{\sum_{s=1}^{T} k_\lambda(t-s)}. \qquad (11.49)

The denominator ensures that the kernel weights sum to one – whilst the area under the curve for any kernel is equal to one, the weights when applied to discrete data may not be, so an adjustment is required. See Example 11.5.
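As a concrete illustration, the following Python sketch (the language and helper names are choices made here, not taken from the text) applies Equations 11.48 and 11.49 with the Epanechnikov kernel of Equation 11.46; with a bandwidth of three years it reproduces the smoothed 1992 value of 0.5151 found in Example 11.5.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, Equation 11.46."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def kernel_smooth(t, times, values, bandwidth, kernel=epanechnikov):
    """Smoothed value at time t using Equations 11.48 and 11.49."""
    u = (t - times) / bandwidth
    weights = kernel(u) / bandwidth          # k_lambda(t - s), Equation 11.48
    return np.sum(weights * values) / np.sum(weights)

# Central mortality rates for UK centenarian males, 1990-1994 (Example 11.5)
years = np.array([1990, 1991, 1992, 1993, 1994])
m100 = np.array([0.5230, 0.5009, 0.5085, 0.5626, 0.4659])

print(round(kernel_smooth(1992, years, m100, bandwidth=3.0), 4))  # 0.5151
```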

11.5 Using Models to Classify Data

So far, the focus has been on explaining the value of an observation using the characteristics of an individual or firm. However, observations are sometimes in the form of categories rather than values, for example whether an individual is alive or dead, or whether a firm is solvent or insolvent. In this case, different types of models need to be used to analyse the data.

Example 11.5 The following table gives the central mortality rates for UK centenarian males from 1990 to 2005. What smoothed mortality rates are found if an Epanechnikov kernel with a bandwidth of 3 years is used?

Year Central Mortality Rate (m_{100,t})

1990 0.5230

1991 0.5009

1992 0.5085

1993 0.5626

1994 0.4659

1995 0.4841

1996 0.5646

1997 0.4912

1998 0.5137

1999 0.5249

2000 0.5484

2001 0.4739

2002 0.5221

2003 0.5634

2004 0.4842

2005 0.5293

With a bandwidth of 3 years, the first year for which a smoothed rate can be calculated is 1992. Combining the structure for the Epanechnikov kernel in Equation 11.46 with the general structure for a kernel smoothing function gives:

\hat{m}_{100,1992} = \frac{\sum_{t=1990}^{1994} \frac{1}{3} \times \frac{3}{4}\left(1 - \left(\frac{1992-t}{3}\right)^2\right) m_{100,t}}{\sum_{t=1990}^{1994} \frac{1}{3} \times \frac{3}{4}\left(1 - \left(\frac{1992-t}{3}\right)^2\right)}.

This gives a smoothed value for m̂_{100,1992} of 0.5151. Continuing this process gives the following values for m̂_{100,t}:

Year Central Mortality Rate (m_{100,t}) Smoothed Central Mortality Rate (m̂_{100,t})

1990 0.5230 –

1991 0.5009 –

1992 0.5085 0.5151

1993 0.5626 0.5081

1994 0.4659 0.5123

1995 0.4841 0.5106

1996 0.5646 0.5081

1997 0.4912 0.5169

1998 0.5137 0.5233

1999 0.5249 0.5156

2000 0.5484 0.5173

2001 0.4739 0.5220

2002 0.5221 0.5189

2003 0.5634 0.5182

2004 0.4842 –

2005 0.5293 –

The raw and smoothed data are shown graphically below:

[Figure: raw (+) and smoothed central mortality rates (0.4–0.6) plotted against year, 1990–2005; axes labelled Year and Central Mortality Rate.]

11.5.1 Generalised Linear Models

A generalised linear model (GLM) is a type of model used to link a linear regression model, such as that described in least squares regression, and a dependent variable that can take only a limited range of values. Rather than being fitted using a least squares approach, such models are usually estimated by the method of maximum likelihood.

The most common use for a GLM is when the dependent variable can take only a limited number of values, and in the simplest case there are only two options.

For example, a firm can either default on its debt or not default; an individual can die or survive; an insurance policyholder can either claim or not claim. If trying to decide which underlying factors might have an impact on the option chosen, it is first necessary to give the options values of 0 and 1 and to define them in terms of some latent variable. So, if Z_i is the event that is of interest (credit default, death, insurance claim and so on) for company or individual i, the relationship between Z_i and a latent variable Y_i is:

Z_i = \begin{cases} 0 & \text{if } Y_i \le 0 \\ 1 & \text{if } Y_i > 0. \end{cases} \qquad (11.50)

The vector Y contains values of Y_i for each i. This is then described in terms of a matrix of independent variables, X, and the vector of coefficients, β. It is possible to extend this to allow for more than two categories. In this case:

Z_i = \begin{cases} 0 & \text{if } Y_i \le \alpha_1 \\ 1 & \text{if } \alpha_1 < Y_i \le \alpha_2 \\ 2 & \text{if } \alpha_2 < Y_i \le \alpha_3 \\ \vdots & \\ N-1 & \text{if } \alpha_{N-1} < Y_i \le \alpha_N \\ N & \text{if } Y_i > \alpha_N, \end{cases} \qquad (11.51)

where -\infty < \alpha_1 < \alpha_2 < \cdots < \alpha_N < \infty.

However, as mentioned above, some sort of link function is needed to convert the latent variable into a probability. Two common link functions are:

• probit; and

• logit.

The Probit Model

The probit model uses the cumulative distribution function for the standard normal distribution, Φ(x). If there are two potential outcomes, then the probit model is formulated as follows:

\Pr(Z_i = 1 \mid X_i) = \Phi(X_i \beta), \qquad (11.52)

where X_i is the vector of independent variables for company or individual i. Since Φ returns the cumulative normal distribution function – which is bounded by zero and one – when given any value between −∞ and ∞, it is a useful function for using unbounded independent variables to explain an observation, such as a probability, that falls between zero and one.

It is possible to extend the probit model to allow for more than two choices, the result being an ordered probit model.

The Logit Model

The logit model uses the same approach but, rather than using the cumulative normal distribution, it uses the logistic function to ensure that the fitted probability falls between zero and one. For two potential outcomes, this has the following form:

\Pr(Z_i = 1 \mid X_i) = \frac{e^{X_i \beta}}{1 + e^{X_i \beta}}. \qquad (11.53)

The logistic distribution is symmetrical and bell-shaped, like the normal distribution, but its tails are heavier.

As with the probit model, it is possible to extend the logit model to allow for more than two choices, the result being an ordered logit model.
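As a hedged sketch of how such a model might be fitted in practice (Python, with illustrative data and variable names chosen here rather than taken from the text), the binary GLM of Equations 11.52 and 11.53 can be estimated by maximising the log-likelihood numerically:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(beta, X, z, link="logit"):
    """Negative log-likelihood for a binary GLM (Equations 11.52 and 11.53)."""
    eta = X @ beta
    if link == "logit":
        p = 1.0 / (1.0 + np.exp(-eta))         # e^{Xb} / (1 + e^{Xb})
    else:                                       # probit
        p = norm.cdf(eta)                       # Phi(Xb)
    p = np.clip(p, 1e-12, 1 - 1e-12)            # guard against log(0)
    return -np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))

# Illustrative data: a column of ones (intercept) plus one financial ratio
rng = np.random.default_rng(0)
ratio = rng.normal(size=200)
X = np.column_stack([np.ones(200), ratio])
z = (0.5 * ratio + rng.normal(size=200) > 0).astype(float)  # 1 = default, say

fit = minimize(neg_log_likelihood, x0=np.zeros(2), args=(X, z, "logit"))
print(fit.x)  # fitted coefficients beta
```

Switching the `link` argument to `"probit"` fits the probit model of Equation 11.52 with the same maximum likelihood machinery.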

11.5.2 Survival Models

Probit and logit models – in common with many types of model – tend to consider the rate of occurrence in each calendar year, or for each year of age. For example, when using a probit model to describe the drivers of mortality, the model could be applied separately for each year of age using data over a number of years. When record keeping was more limited and only an individual’s age was known, then there were few alternatives to such an approach. However, dates of birth and death are now recorded and accessible, meaning that survival models can also be applied.

Survival models were developed for use in medical statistics, and the most ob- vious uses are still in relation to human mortality. However, there is no reason why such models cannot be used to model lapses, times until bankruptcy or other time-dependent variables.

In relation to mortality, a survival model looks at _t p_x, the probability that an individual aged x will survive for a further period t before dying. Importantly, if an underlying continuous-time mortality function is defined, then the exact dates of entry into a sample and subsequent death can be allowed for.

The survival probability for an individual can be defined in terms of the force of mortality, μ_x – the instantaneous probability of death for an individual aged x, quoted here as a rate per annum – as follows:

{}_t p_x = e^{-\int_0^t \mu_{x+s}\, ds}. \qquad (11.54)

This leaves two items to be decided:

• the form of μ_x; and

• the drivers of μ_x.

There are a number of forms that μ_{x+s} might take, but a simple model is μ_x = e^{α+βx}, also known as the Gompertz model (Gompertz, 1825). The next stage is to determine values for α and β. Ideally, these would be calculated separately for each individual n of N. For example:

\alpha_n = a_0 + \sum_{m=1}^{M} I_{m,n}\, a_m, \qquad (11.55)

and:

\beta_n = b_0 + \sum_{m=1}^{M} I_{m,n}\, b_m, \qquad (11.56)

where a_0 and b_0 are the 'baseline' levels of risk, a_m and b_m are the additions required for risk factor m, and I_{m,n} is an indicator function which is equal to one if the risk factor m is present for individual n and zero otherwise. For example, a_1 and b_1 might be the additional loadings required if an individual was male, a_2 and b_2 the additional loadings for smokers, and so on.

The next stage is to combine the survival probabilities into a likelihood function, and to adjust the values of the parameters to maximise the joint likelihood of the observations.

Unless a population is monitored until all lives have died, the information on mortality rates will be incomplete to the extent that data will be right-censored.

Furthermore, unless individuals are included from the minimum age for the model, data will be left-truncated. These two features should be taken into account when a model is fitted.
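The sketch below (Python, with invented data and variable names chosen for this illustration) shows one way such a likelihood might be built for the Gompertz model μ_x = e^{α+βx}: each individual contributes the survival probability from entry age to exit age, which handles left truncation, and, if they died, the force of mortality at the exit age; right-censored lives contribute no death term.

```python
import numpy as np
from scipy.optimize import minimize

def gompertz_neg_loglik(params, entry_age, exit_age, died):
    """Negative log-likelihood for mu_x = exp(alpha + beta * x), allowing for
    left truncation (entry age) and right censoring (died = 0)."""
    alpha, beta = params
    # Integrated hazard from entry to exit: (e^alpha / beta)(e^{beta*exit} - e^{beta*entry})
    cum_hazard = np.exp(alpha) / beta * (np.exp(beta * exit_age)
                                         - np.exp(beta * entry_age))
    log_mu_at_exit = alpha + beta * exit_age
    return -np.sum(-cum_hazard + died * log_mu_at_exit)

# Illustrative data only: ages at entry and exit, and a death indicator
entry_age = np.array([60.0, 62.5, 70.0, 65.0])
exit_age = np.array([75.2, 80.1, 71.3, 66.0])
died = np.array([1, 1, 0, 0])  # 0 = still alive when observation ended

fit = minimize(gompertz_neg_loglik, x0=np.array([-10.0, 0.1]),
               args=(entry_age, exit_age, died), method="Nelder-Mead")
alpha_hat, beta_hat = fit.x
```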

Whilst this approach has advantages over GLMs in that the exact period of survival can be modelled without the need to divide information into year-long chunks, there are some shortcomings. In particular, logit and probit models can allow for complex relationships between risk factors and ages, whilst the survival model approach requires any age-related relationship to be parametric. Even parametric relationships that are more complex than linear ones can be difficult to allow for.

It is worth considering the use of a GLM to determine the approximate shapes of any relationships that the factors have with age before deciding on the form of a survival model.

11.5.3 Discriminant Analysis

Discriminant analysis is an approach that takes the quantitative characteristics of a number of groups, G, and weights them in such a way that the results differ as much as possible between the groups. Its most well-known application is Altman's Z-score (Altman, 1968). There are a number of ways of performing discriminant analysis, but most approaches – including those discussed here – require the assumption that the independent variables are normally distributed, either within each group (as in Fisher's linear discriminant) or in aggregate (as in linear discriminant analysis).

Discriminant analysis can be carried out for any number of groups; however, the most relevant cases in financial risk involve considering only two. In this regard it is helpful to start with the original technique described by Fisher (1936).

Fisher’s Linear Discriminant

Fisher’s linear discriminant was originally demonstrated as a way to distinguish between different species of flower using various measurements of specimens’

sepals and petals. However, a more familiar financial example might be the use of discriminant analysis to distinguish between two groups of firms, one that be- comes insolvent and one that does not. These firms form the training set used to parametrise the model. Each firm will have exposure toM risk factors relating to levels of earnings, leverage and so on. For firmnofNthese financial measures are given byX1,n,X2,n,...,XM,n. The discriminant function for that firm is:

d_n = \beta_1 X_{1,n} + \beta_2 X_{2,n} + \cdots + \beta_M X_{M,n}, \qquad (11.57)

where β_1, β_2, ..., β_M are coefficients that are the same for all firms. In particular, the values of the coefficients are chosen such that the difference between the values of d_n is as great as possible between the groups of solvent and insolvent firms, but as small as possible within each group. Histograms of poorly discriminated and well-discriminated data are shown in Figures 11.6 and 11.7 respectively.

Using this approach, the distance between two groups is (d̄_1 − d̄_2)², where d̄_1 and d̄_2 are the average values of d_n for each of the two groups. The term d̄_g is often referred to as the 'centroid' of group g. A vector X̄_1 can be defined as the average values of X_{1,n}, X_{2,n}, ..., X_{M,n} for the first group and X̄_2 can be defined as the corresponding vector for the second group. If the vector of coefficients, β_1, ..., β_M, is also defined as β, then it is clear that d̄_1 = β'X̄_1, d̄_2 = β'X̄_2, and so (d̄_1 − d̄_2)² = (β'X̄_1 − β'X̄_2)².

The variability within each of the groups requires the calculation of a covariance matrix between X_{1,n}, X_{2,n}, ..., X_{M,n} for the first group, Σ_1, and for the second group, Σ_2. The total variability within the groups can then be calculated as β'Σ_1β + β'Σ_2β. This means that to both maximise the variability between groups

[Figure 11.6: Histogram of Poorly Discriminated Data – frequency plotted against the discriminant function.]

[Figure 11.7: Histogram of Well Discriminated Data – frequency plotted against the discriminant function.]

whilst minimising it within groups, the following function needs to be maximised:

D_F = \frac{(\beta'\bar{X}_1 - \beta'\bar{X}_2)^2}{\beta'\Sigma_1\beta + \beta'\Sigma_2\beta}. \qquad (11.58)

The numerator here gives a measure of the difference between the two centroids, which must be as large as possible, whilst the denominator gives a measure of the difference between the discriminant functions within each group, which should be as small as possible.

Since the data only make up a sample of the total population, Σ_1 and Σ_2 must be estimated from the data as S_1 for the first group and S_2 for the second. If the estimator of β that provides the best separation under Fisher's approach is b_F, then b_F can be estimated as:

b_F = (S_1 + S_2)^{-1}(\bar{X}_1 - \bar{X}_2). \qquad (11.59)

This vector can also be used to determine the threshold score between the two groups, d_c:

d_c = b_F'(\bar{X}_1 + \bar{X}_2)/2. \qquad (11.60)

This means that if the data described above are regarded as training data, then when another firm is examined the information on its financial ratios can be used to decide whether it is likely to become insolvent or not, based on whether the calculated value of d_n for a new firm n is above or below d_c.
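A minimal numerical sketch of Equations 11.59 and 11.60 is given below (Python; the data and variable names are illustrative assumptions, not taken from the text): the coefficient vector b_F is computed from the two group means and covariance matrices, and a new firm is classified by comparing its score with the threshold d_c.

```python
import numpy as np

def fisher_discriminant(group1, group2):
    """Return (b_F, d_c) from Equations 11.59 and 11.60.
    group1, group2: arrays of shape (n_firms, M) of financial measures."""
    xbar1, xbar2 = group1.mean(axis=0), group2.mean(axis=0)
    s1 = np.cov(group1, rowvar=False)
    s2 = np.cov(group2, rowvar=False)
    b_f = np.linalg.solve(s1 + s2, xbar1 - xbar2)   # (S1 + S2)^-1 (Xbar1 - Xbar2)
    d_c = b_f @ (xbar1 + xbar2) / 2.0               # threshold score
    return b_f, d_c

# Illustrative training data: two financial measures for solvent and insolvent firms
rng = np.random.default_rng(1)
solvent = rng.normal(loc=[2.0, 1.5], scale=0.5, size=(50, 2))
insolvent = rng.normal(loc=[0.5, 0.8], scale=0.5, size=(40, 2))

b_f, d_c = fisher_discriminant(solvent, insolvent)
new_firm = np.array([1.8, 1.4])
print("solvent-like" if b_f @ new_firm > d_c else "insolvent-like")
```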

However, sometimes even the best discrimination cannot perfectly distinguish between different groups. In this case, there exists a ‘zone of ignorance’ or ‘zone of uncertainty’ within which the group to which a firm belongs is not clear. This is shown in Figure 11.8. The zone of ignorance can be determined by inspection of the training set of data. For example, assume ¯d1 lies below dc, whilst ¯d2 lies above it. If there are firms from group 1 whose discriminant values lie above dc and firms from group 2 whose values are below it, then the zone of ignorance could be classed as the range between the lowest discriminant value for a group 2 firm up to the highest value for a group 1 firm. Furthermore, if there were sufficient observations a more accurate confidence interval could be constructed.

The zone of ignorance can also be defined in terms of confidence intervals if it is defined in terms of the statistical distribution assumed, in particular if normality is assumed. Let d̄_1 again lie below d_c, whilst d̄_2 lies above it. Then calculate the standard deviations of the values of d_n for each of the two groups, s_{d̄_1} and s_{d̄_2} respectively. For a confidence interval of α, the zone of ignorance could be defined as follows:

\begin{cases} \bar{d}_2 - s_{\bar{d}_2}\Phi^{-1}(1-\alpha) \text{ to } \bar{d}_1 + s_{\bar{d}_1}\Phi^{-1}(1-\alpha) & \text{if } \bar{d}_1 + s_{\bar{d}_1}\Phi^{-1}(1-\alpha) > \bar{d}_2 - s_{\bar{d}_2}\Phi^{-1}(1-\alpha) \\ 0 & \text{if } \bar{d}_1 + s_{\bar{d}_1}\Phi^{-1}(1-\alpha) \le \bar{d}_2 - s_{\bar{d}_2}\Phi^{-1}(1-\alpha). \end{cases} \qquad (11.61)

Example 11.6 A group of policyholders has been classified into 'low net worth' (LNW) and 'high net worth' (HNW) using Fisher's linear discriminant. If the LNW group discriminant functions have a mean of 5.2 and a standard deviation of 1.1, whilst the HNW group discriminant functions have a mean of 8.4 and a standard deviation of 0.6, where is the zone of ignorance using a one-tailed confidence interval of 1%?

[Figure 11.8: Zone of Ignorance (α = 0.01) – frequency plotted against the discriminant function, with the zone of ignorance lying between the α and (1−α) tails of the two groups.]

In this dataset, d̄_1 and d̄_2 are equal to 5.2 and 8.4 respectively, whilst s_{d̄_1} and s_{d̄_2} are 1.1 and 0.6. The upper-tail critical value of the normal distribution with a confidence interval of 1% is 2.326. The lower limit of the zone of ignorance is therefore:

8.4 − (0.6 × 2.326) = 7.004,

whilst the upper limit is:

5.2 + (1.1 × 2.326) = 7.759.

The zone of ignorance is therefore 7.004 to 7.759, and any individual whose discriminant function falls in this range cannot be classified within the confi- dence interval given above.
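The arithmetic in Example 11.6 can be checked with a short script such as the one below (Python; scipy's norm.ppf is used to obtain Φ⁻¹, and the variable names are chosen for this sketch).

```python
from scipy.stats import norm

d_bar = {"LNW": 5.2, "HNW": 8.4}        # group centroids
s_d = {"LNW": 1.1, "HNW": 0.6}          # standard deviations of the scores
alpha = 0.01
z = norm.ppf(1 - alpha)                  # upper-tail critical value, approx 2.326

lower = d_bar["HNW"] - s_d["HNW"] * z    # 7.004
upper = d_bar["LNW"] + s_d["LNW"] * z    # 7.759
print(round(lower, 3), round(upper, 3))
```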

Linear Discriminant Analysis

One of the advantages of Fisher’s linear discriminant is that it is relatively light on the assumptions required. In particular, an assumption of normally distributed ob- servations is needed only to measure the probability of misclassification. However, if some further assumptions are made then a simpler approach can be used. This approach is linear discriminant analysis (LDA).

The main simplifying assumption is that the independent variables for the two groups have the same covariance matrix, so Σ_1 = Σ_2 = Σ. This means that the function to be maximised becomes:

D_{LDA} = \frac{(\beta'\bar{X}_1 - \beta'\bar{X}_2)^2}{\beta'\Sigma\beta}. \qquad (11.62)

If Σ is estimated from the data as S, then the estimator of β that provides the best separation under the LDA approach is b_{LDA}, which can be estimated as:

b_{LDA} = S^{-1}(\bar{X}_1 - \bar{X}_2). \qquad (11.63)

The calculation of the threshold score, d_c, and the zone of ignorance is the same as for Fisher's linear discriminant.
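Under the equal-covariance assumption, Equation 11.63 amounts to a small change to the earlier Fisher sketch: a pooled covariance estimate replaces the sum of the two group matrices. One possible implementation (again with illustrative names, and using one common pooling convention) is:

```python
import numpy as np

def lda_coefficients(group1, group2):
    """b_LDA and d_c under the common-covariance assumption (Equation 11.63).
    The pooled estimate of Sigma weights each group's covariance matrix by its
    degrees of freedom; other pooling conventions are possible."""
    xbar1, xbar2 = group1.mean(axis=0), group2.mean(axis=0)
    n1, n2 = len(group1), len(group2)
    s_pooled = ((n1 - 1) * np.cov(group1, rowvar=False)
                + (n2 - 1) * np.cov(group2, rowvar=False)) / (n1 + n2 - 2)
    b_lda = np.linalg.solve(s_pooled, xbar1 - xbar2)
    d_c = b_lda @ (xbar1 + xbar2) / 2.0
    return b_lda, d_c
```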

Multiple Discriminant Analysis

It is possible to extend this approach to more than two classes. In this case, rather than considering the distance of the centroids from each other, the distance of the centroids from some central point is used. If the average value of d_n for all n is d̄, then the distance of the centroid for each group g from this point is d̄ − d̄_g. Considering the independent variables, a vector X̄ can be defined as the average values of X_{1,n}, X_{2,n}, ..., X_{M,n} for all firms, whilst a vector X̄_g can be defined as the average values of X_{1,n}, X_{2,n}, ..., X_{M,n} for group g. The covariance of the group averages of these observations can be defined as:

\Sigma_G = \frac{1}{G} \sum_{g=1}^{G} (\bar{X} - \bar{X}_g)(\bar{X} - \bar{X}_g)'. \qquad (11.64)

This means that a new function needs to be maximised to give maximum separation between groups whilst minimising separation within groups:

D_{MDA} = \frac{\beta'\Sigma_G\beta}{\beta'\Sigma\beta}. \qquad (11.65)
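Maximising the ratio in Equation 11.65 is a generalised (Rayleigh quotient) eigenvalue problem: the maximising β is the eigenvector of Σ⁻¹Σ_G with the largest eigenvalue. A hedged sketch of how this might be computed, assuming scipy's generalised symmetric eigensolver is acceptable, is:

```python
import numpy as np
from scipy.linalg import eigh

def mda_direction(sigma_g, sigma):
    """Direction beta maximising (beta' Sigma_G beta) / (beta' Sigma beta),
    i.e. Equation 11.65, via the generalised eigenvalue problem
    Sigma_G beta = lambda * Sigma beta."""
    eigvals, eigvecs = eigh(sigma_g, sigma)   # eigenvalues returned in ascending order
    return eigvecs[:, -1]                     # eigenvector for the largest eigenvalue
```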

11.5.4 The k-Nearest Neighbour Approach

One of the main purposes of discriminant analysis is to find a way of scoring new observations to determine the group to which they belong. However, another approach is to use a non-parametric method, and to consider which observations lie 'nearby'. This is the k-nearest neighbour (kNN) approach. It involves considering the characteristics of a number of individuals or firms that fall into one of two groups. These individuals or firms form the training set used to parametrise the model.

As before, these could easily be solvent and insolvent firms. When a new firm is considered, its distance from a number (k) of neighbours is assessed using some approach, and the proportion of these neighbours that have subsequently become

insolvent gives an indication of the likelihood that this firm will also fail. The kNN approach is shown graphically in Figure 11.9.

[Figure 11.9: the k-Nearest Neighbour Approach – solvent firms, insolvent firms and a candidate firm plotted against Measure X and Measure Y.]

The most appropriate measure of distance when M characteristics are being considered is the Mahalanobis distance, discussed in Chapter 10 in the context of testing for multivariate normality. The Mahalanobis distance between a new firm Y and one of the existing firms X_n, measured using m characteristics of those firms, where m = 1, 2, ..., M, is:

D_{X_n} = \sqrt{(Y - X_n)' S^{-1} (Y - X_n)}. \qquad (11.66)

In this expression, Y and X_n are column vectors of length M containing the values of the M characteristics, such as leverage, earnings cover and so on. The matrix S contains estimates of the covariances between these characteristics, calculated using historical data.

The Mahalanobis distance from firm Y must be calculated for all N firms X_n to see which the k nearest neighbours are. The score is then calculated based on the combination of the group to which X_n belongs and the distance of X_n from Y. Say, for example, an insolvent firm is given a score of one and a solvent firm is given a score of zero, k is taken to be 6, and firms X_1 to X_6 have the smallest Mahalanobis distances. In this case, the score for firm Y is:

kNN_Y = \frac{\sum_{n=1}^{6} I(X_n) / D_{X_n}}{\sum_{n=1}^{6} 1 / D_{X_n}}, \qquad (11.67)

where I(X_n) is an indicator function which is one if X_n is insolvent and zero otherwise.

In the same way that there are a number of ways of calculating the distances between firms, there are also a number of ways of determining the optimal value of k. One intuitively appealing approach is to calculate the score for all firms whose outcome is already known using a range of values of k. For each firm, kNN_{X_i} is calculated using Equation 11.67 but excluding X_n for n = i. For each i, the statistic [kNN_{X_i} − I(X_i)]² is calculated. These are summed over all i = 1, 2, ..., N, with the total being recorded for each value of k. The value of k used – the number of nearest neighbours – is the one that minimises ∑_{i=1}^{N}[kNN_{X_i} − I(X_i)]². However, this process can involve calculating a huge number of distances, so if it is being used, for example, to assess a commercial bank's borrowers it can quickly become unwieldy.
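A compact sketch of this scoring rule is shown below (Python; the data are invented for illustration and the function names are assumptions made here). It computes the Mahalanobis distances of Equation 11.66 from a candidate firm to each training firm, then applies the distance-weighted score of Equation 11.67 to the k closest.

```python
import numpy as np

def knn_score(candidate, training_X, insolvent_flags, k):
    """Distance-weighted kNN score (Equations 11.66 and 11.67).
    training_X: (N, M) array of characteristics; insolvent_flags: 1 if insolvent."""
    s_inv = np.linalg.inv(np.cov(training_X, rowvar=False))
    diffs = training_X - candidate
    dists = np.sqrt(np.einsum("ij,jk,ik->i", diffs, s_inv, diffs))  # Mahalanobis
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / dists[nearest]
    return np.sum(weights * insolvent_flags[nearest]) / np.sum(weights)

# Illustrative training set: two characteristics per firm
rng = np.random.default_rng(2)
solvent = rng.normal([2.0, 1.5], 0.5, size=(30, 2))
insolvent = rng.normal([0.5, 0.8], 0.5, size=(20, 2))
X = np.vstack([solvent, insolvent])
flags = np.array([0] * 30 + [1] * 20)

print(knn_score(np.array([0.7, 0.9]), X, flags, k=6))  # close to 1 => likely to fail
```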

11.5.5 Support Vector Machines

Another approach to classifying data is to find the best way of separating two groups of data using a line (for two variables), plane (for three variables) or hyperplane (for more than three variables). The functions used to separate data in this way are known as support vector machines (SVMs).

Linear SVMs

A linear SVM uses a straight line – or its higher-dimensional alternative – to separate two groups according to two or more measures. Consider again the two groups of solvent and insolvent firms, and two variables such as leverage and earnings cover.

In Figure 11.10 the two groups can clearly be divided by a single line. However, more than one line can divide the points into two discrete groups. Which is the best dividing line? One approach is to use tangents to each dataset. If pairs of parallel tangents are considered, then the best separating line can be defined as the line midway between the most separated parallel tangents.
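As a hedged illustration (scikit-learn is assumed to be an acceptable tool here; the text itself does not prescribe any particular software, and the data are invented), a linear SVM separating two groups of firms on two measures might be fitted as follows; the fitted coefficients define the separating hyperplane of Equation 11.68 below.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative data: leverage and earnings cover for solvent (0) and insolvent (1) firms
rng = np.random.default_rng(3)
solvent = rng.normal([1.0, 3.0], 0.4, size=(40, 2))
insolvent = rng.normal([2.5, 1.0], 0.4, size=(40, 2))
X = np.vstack([solvent, insolvent])
y = np.array([0] * 40 + [1] * 40)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
beta, beta_0 = svm.coef_[0], svm.intercept_[0]   # hyperplane beta'x + beta_0 = 0
print(svm.predict([[1.2, 2.8]]))                  # classify a candidate firm
```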

This criterion can be extended into higher dimensions, and expressed in mathematical terms. Consider a column vector X_h giving the co-ordinates of a point on a hyperplane. For a firm n, these co-ordinates could be the values of M financial ratios, each one corresponding to a dimension. A hyperplane can be defined as:

\beta' X_h + \beta_0 = 0, \qquad (11.68)

where β is the vector of M parameters and β_0 is a constant. The value of the expression β'X_n + β_0 can be evaluated for any vector of observations, X_n, for firm n. These firms constitute the training set used to parametrise the model. If β'X_n + β_0 > 0 for all firms in one group and β'X_n + β_0 < 0 for the other for a vector of parameters β, then Equation 11.68 can be said to be a separating hyperplane. To simplify this, a function J(X_n) can be defined such that J(X_n) = 1 if firm n belongs to the first group, whilst J(X_n) = −1 if it belongs to the second group. This means
