DISTRIBUTIONS AND UNIVARIATE ANALYSIS

Part of the document Dữ liệu lớn trong y tế phân tích thống kê y tế hồ sơ y tế (Pages 98–122)

with Munir Ahmed

Learning Objectives

1. Define a sample

2. Calculate averages and standard deviations

3. Transform variables

4. Display the distribution of a variable

5. Create a histogram

6. Plot change in variables over time

Key Concepts

• Sample

• Variable types

• Average

• Expected values

• Standard deviation

• Distributions

• Histogram

• Control chart plot

• Minimum observations

Chapter at a Glance

This chapter introduces the concept of distributions. The idea that a variable has a distribution is fundamental to statistics. A distribution allows us to have expectations about future samples of the variable. It allows us to calculate an average or a standard deviation so we can describe the distribution in a few parameters. Although the concepts introduced in this chapter may seem trivial, they are the building blocks of statistical thinking. Without these ideas, more advanced statistical analyses cannot be understood.


Introduction

All statistical analyses follow an organized set of steps. A first step in analysis begins with the examination of univariate data, which we describe in this chapter. Univariate data analysis involves the evaluation of one variable at a time. Analysis of this type helps healthcare managers, clinicians, and researchers assess patterns. Examples of how managers or clinicians can perform univariate data analysis follow:

• Counting the number of nursing home patients who fell in a given year

• Assessing the most common time of day for medication errors at a hospital

• Counting the number of children admitted with an influenza diagnosis in the month of January

Variables

A variable measures the presence of a concept, the degree to which an object is available, or the extent of a person’s characteristic. Sometimes variables are referred to as attributes, clues, or characteristics. Here are several example variables used in healthcare:

• Cost of a hospital stay

• Patients’ age

• Number of patients staying in the emergency room for more than six hours

• Patient satisfaction

– Score from 1 to 10 reflecting very dissatisfied to very satisfied

• Disease status

– Presence of diabetes

– Presence of hypertension

– Count of symptoms of depression

Variables have at least two levels, and some have many levels. The values of a variable change from one level to another. A measure that has the same value for all occasions is not really a variable; we refer to it as a constant. For example, if we are looking at emergency department (ED) visits at a single ED, the ED itself is a constant, but the time of visit is a variable.

If a variable can assume only two levels, then the variable is binary.

An example of a binary variable is the occurrence of an adverse event such as wrong-side surgery. The procedure has been done either correctly or incorrectly. If the event has occurred it will have a value of 1; otherwise, it will be 0. Sometimes, binary variables are referred to as indicators. If a variable has a countable set of levels, the variable is referred to as discrete or categorical.

An example of a categorical variable is race. The following are examples of categorical variables commonly used in healthcare.

• Race

– American Indian

– Black or African American

– White

– Latino or Hispanic

– Asian

• Insurance status

– Medicaid

– Medicare

– Private

• Marital status

– Never married

– Divorced

– Married

– Widowed

If variable levels show an order, then the variable is ordinal. A variable can be both ordinal and categorical. For example, “age in decades” is an ordinal categorical variable. In this example, we are categorizing patients as in their thirties, forties, or fifties, but we are not using their exact age. By contrast, if the variable can assume any real number in a range, then the variable is termed continuous.

An example of a continuous variable is cost. A continuous variable is also an interval variable, meaning that the values of the variable represent the magnitude of the presence of the variable. Cost is an interval variable because an operation that costs $12,000 is twice as expensive as an operation that costs $6,000. Count of anything is an interval scale; count of patients with adverse events is an interval scale, even though the variable itself is binary.

A ratio variable is an interval variable for which 0 is a valid value, meaning none of the variable is present. For example, the number of patients signing up for Medicaid can be considered a ratio variable, as a 0 would mean none signed up. The number of days since an admission began, known as length of stay, is a continuous variable. Ordinal scales cannot be averaged because they do not show the magnitude of the variable. There are exceptions to this rule, however; sometimes ordinal scales are treated as if they were interval scales. Satisfaction with care is typically rated on an ordinal scale but treated as if the scale were interval; thus we see reports of average satisfaction.

Some variables have what are called nominal levels. Unlike ordinal variables, nominal variables have levels that are not in any particular order, simply representing different concepts. For example, racial categories (e.g., white, black, Asian) are nominal, meaning that racial categories are not ranked and do not measure a quantity; they just name a race.

In healthcare, a set of variables is typically used to measure outcomes of care. These include cost of care, access to care, satisfaction with care, mortality, and morbidity. Cost of care is a continuous variable. Above- or below-average cost is a binary representation of cost. If above-average cost has a value of 1 and below-average cost has a value of 0, we can count the percentage of patients who have above-average cost. Access to care is sometimes measured in days between appointments; count of days is obviously an interval scale. Mortality is binary, as a patient is either dead or alive.

Probability of death is typically referred to as a patient’s prognosis or severity of illness. Any probability is also an interval scale. Morbidities are typically calculated on an ordinal scale. (For example, the Barthel index breaks extent of function into different activities, with each activity rated as 0, 5, 10, or 15; a score of 0 indicates complete dependence in an activity, and 15 indicates full independence.) Strictly speaking, morbidity scales are ordinal and cannot be averaged, but again the literature includes exceptions in which averages of ordinal scales are reported.

Sample

A population of interest is a group of individuals who share a key feature (e.g., diabetic patients or employees who are 40 years old). Often it is not possible to examine all members of a population, and thus it is convention to evaluate a sample, which is an organized subset of the population. Statistics are calculated on the sample, and if a sample is representative of the entire population, the calculated statistic can be generalized to the population. A sample is judged to be representative of the entire population if it reflects various subgroups in the population proportionally. For instance, if the proportion of persons in a population having income greater than $100,000 per year is 20 percent, one would need to ensure that the sample had about the same percentage of persons with that income level. A large sample may not be representative, and care must be taken to organize the sample so as to avoid introducing bias.

Statisticians’ sampling choices are influenced by several factors. In analysis of data from electronic health records (EHRs), typically all patients in the population are included. This is called a complete sample. If we are working with a complete sample, findings reflect the population. Use of a complete sample is ideal for drawing conclusions because everyone in the population is included, but when a large amount of data is analyzed, there is more computation, and analysis can be costly.

One can also randomly sample the data. During experiments, random sampling allows the statistician to assign patients randomly to experimental and control groups. This random assignment increases the probability that experimental and control groups do not differ in characteristics other than the medical intervention used. In analysis of data from EHRs, a random sample of patients may be taken in order to reduce computational difficulties resulting from too many data points.

An important method of sampling is called stratification. In stratified sampling, patients with certain characteristics are oversampled so that rare conditions are better represented in the sample. Adaptive sampling is a method of sampling in which initial samples are used to determine the size and parameters of subsequent samples. A convenience sample is a subset of the population chosen because it was easily available. Statistics calculated from a convenience sample usually do not generalize to the entire population.
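The random and stratified approaches above can be sketched in a few lines of code. This is a minimal illustration only; the patient IDs, the size of the rare stratum, and the sample sizes below are all hypothetical.

```python
import random

# Hypothetical population of 1,000 patient IDs
population = list(range(1, 1001))

random.seed(42)  # fixed seed so the sketch is reproducible

# Simple random sample: every patient has an equal chance of selection
random_sample = random.sample(population, 100)

# Stratified sample (hypothetical strata): patients with a rare condition are
# oversampled so the condition is better represented than its population share
rare = population[:50]      # assume the first 50 IDs have the rare condition
common = population[50:]
stratified_sample = random.sample(rare, 25) + random.sample(common, 25)

print(len(random_sample), len(stratified_sample))
```

In the stratified sample, the rare condition makes up half the sample even though it is only 5 percent of the population, which is exactly the oversampling described above.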

Sample elements are typically patients. It is generally assumed that patients are independent sample elements; that is, the disease and treatment decision of one patient does not depend on the conditions of another patient.

However, there are exceptions to this assumption; if sick siblings are brought to the pediatrician, they are seen back-to-back and the diseases of one might inform the diagnosis of another. Another example is a patient with a contagious disease, as his illness will affect the probability of subsequent patients contracting the condition.

This book refers to a variable as X (typically, capital letters are used for variables and small letters for the values of the variable). If there are multiple variables, they may be named X1, X2, . . ., Xm. In a sample of n cases, a variable assumes n values, one for each case. We show the value of the variable X in the sample as xi, where the index “i” refers to the value of x in the ith case in the sample. Throughout the book, the index “i” is reserved to indicate the ith case for variable X.

Average

In a sample of data, variables may have many different values. One way to understand the general tendency of a variable is to average the data. An average is a single value that is representative of all values that a variable can take. In most contexts, the term average refers to a very specific measure: the arithmetic mean, or simply the mean. The arithmetic mean is defined as the sum of all observations divided by their number. Thus, in order to calculate the arithmetic mean, we add up all values of our variable in the sample and then divide that sum by the total number of observations in our sample.

When calculated from a population, the arithmetic mean is called the population mean (μ); when calculated from a sample, it is called the sample mean (X̄). Clearly this statistic cannot, and should not, be calculated for nominal or ordinal data. For any variable X that is measured on an interval or ratio scale, the sample arithmetic mean X̄ can be calculated from n observations using the following formula:

X̄ = (Σᵢ₌₁ⁿ xᵢ) / n.
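As a minimal sketch, the sample mean can be computed directly from this definition. The length-of-stay values below are hypothetical, used only to exercise the formula.

```python
def sample_mean(values):
    """Arithmetic mean: sum of all observations divided by their number."""
    return sum(values) / len(values)

# Hypothetical lengths of stay (in days) for five patients
lengths_of_stay = [3, 5, 2, 7, 3]
print(sample_mean(lengths_of_stay))  # 4.0
```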

The expression for arithmetic mean discussed so far gives the same weight to all observations in the data set. When all observations do not have equal weight, a weighted arithmetic mean should be calculated. For the n observations of variable X, the weighted arithmetic mean is the sum product of values xi and their corresponding weights, wi, divided by the sum of wi. So

X̄ = (Σᵢ₌₁ⁿ wᵢ xᵢ) / (Σᵢ₌₁ⁿ wᵢ).

When the sample is not representative of the target population, weights can be used to correct the bias. For example, let us assume that a hospital manager wants to understand satisfaction with service in her ED. To use readily available data, the manager relies on patient reviews posted to the internet over the course of a month. After the initial examination of comments on satisfaction, the ED clinicians actively ask patients to leave more reviews. In the first month, reviewers left 20 comments; in the subsequent month, 30 comments were left. The hospital manager would like to weight the data so that month 2 would have the same number of comments as month 1.

Exhibit 4.1 displays the data. To calculate the weighted values, we multiply the comments left in month 2 by the weight 0.67, calculated by dividing 20 by 30. This process ensures that the total in the weighted column is 20, which is the same as month 1. After this weighting, 6.67 positive comments were left in month 2, which exceeds the five positive comments left in month 1. So we have received proportionally more positive comments in month 2 than in month 1.
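The reweighting in exhibit 4.1 can be reproduced in a few lines; the dictionaries below simply encode the comment counts from the example.

```python
# Comment counts from the two months (exhibit 4.1)
month1 = {"positive": 5, "negative": 15}   # 20 comments in total
month2 = {"positive": 10, "negative": 20}  # 30 comments in total

# Weight month 2 so its total matches month 1: 20 / 30 ≈ 0.67
weight = sum(month1.values()) / sum(month2.values())
weighted_month2 = {k: v * weight for k, v in month2.items()}

print(round(weighted_month2["positive"], 2))    # 6.67
print(round(sum(weighted_month2.values()), 2))  # 20.0
```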

Expected Values

The expected value of a variable is closely related to the idea of averages. If the levels of a variable are mutually exclusive and exhaustive, then the expected value of the variable is the sum, over all levels, of the probability of observing a particular level times the value of that level, which looks something like

Expected value of X = E(X) = Σᵢ p(xᵢ) xᵢ.

The formula for expected value is similar to the weighted average, where weights are replaced with the probability of the event and the sum of the probabilities is 1. A common expected value used in analysis of healthcare data is the case mix index. The case mix index is typically calculated over diagnosis related groups (DRGs). Given that the DRGs are mutually exclusive and exhaustive, the expected value can be used to measure the case mix of a hospital. We can calculate the case mix index by summing the product of a hospital’s probability of seeing a particular DRG and national estimates of length of stay for the DRG. The equation looks like

Case mix index = Σᵢ (Probability of DRGᵢ × Length of stay for DRGᵢ).

Note that while the probabilities reflect the experience of the hospital, the length of stay reflects the national experience.

EXHIBIT 4.1 Comments Posted to the Internet

           Month 1   Month 2   Weight   Weighted Month 2
Positive      5        10       0.67          6.67
Negative     15        20       0.67         13.33
Total        20        30                    20.00

Because the length of stay reflects the national experience, a high case mix index indicates the extent to which the hospital is seeing patients who require longer lengths of stay. Thus, it provides an estimate of the difficulty and severity of the patients seen in the hospital.

We can see the concept of expected values through data from a hypothetical insurer. This insurer wants to anticipate the expected cost across different groups of potential members. The expected cost is calculated as the sum of the product of the probability of the member joining the insurer and the cost he may incur. The insurer has eight types of patients who are likely to use the insurance, and each type has a different probability of joining and a different cost (see exhibit 4.2). The average cost for all eight categories is $6,125, but this cost is not weighted for the likelihood that the patients will join the insurance. Once weighted, the average cost increases to $7,150 because the more expensive persons seem to be more likely to enroll in the insurance.

In our hypothetical example, the combinations (rows in exhibit 4.2) are assumed to be mutually exclusive and exhaustive. Thus the probabilities of the events add up to 1. The average cost can be readily corrected by weighting each type of member. For example, suburban males aged 60 and over have a 20 percent chance of enrolling in the insurance and a $6,000 cost. The total contribution of this type of member to the weighted average cost is 0.20 × $6,000 = $1,200. In contrast, a suburban female between the ages of 40 and 60 has a 10 percent chance of enrolling in the insurance package and will cost $1,000 per year. As a result, this group’s contribution

EXHIBIT 4.2 Cost and Likelihood of Joining an Insurance Plan

Description                     Probability of Joining     Cost     Weighted Cost
Urban, 60+ years, male                  0.2             $10,000        $2,000
Urban, 60+ years, female                0.2             $10,000        $2,000
Urban, 40–60 years, male                0.1              $8,000          $800
Urban, 40–60 years, female              0.1              $7,000          $700
Suburban, 60+ years, male               0.2              $6,000        $1,200
Suburban, 60+ years, female             0.05             $5,000          $250
Suburban, 40–60 years, male             0.05             $2,000          $100
Suburban, 40–60 years, female           0.1              $1,000          $100
Average                                                  $6,125        $7,150

to the weighted average is 0.10 × $1,000 = $100. In this manner, the cost is weighted by the likelihood that the person will enroll.
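The calculation in exhibit 4.2 amounts to a sum of probability × cost. The sketch below encodes the table and compares the simple average with the expected (weighted) cost.

```python
# (description, probability of joining, annual cost) from exhibit 4.2
members = [
    ("Urban, 60+ years, male",        0.20, 10_000),
    ("Urban, 60+ years, female",      0.20, 10_000),
    ("Urban, 40-60 years, male",      0.10,  8_000),
    ("Urban, 40-60 years, female",    0.10,  7_000),
    ("Suburban, 60+ years, male",     0.20,  6_000),
    ("Suburban, 60+ years, female",   0.05,  5_000),
    ("Suburban, 40-60 years, male",   0.05,  2_000),
    ("Suburban, 40-60 years, female", 0.10,  1_000),
]

# The categories are mutually exclusive and exhaustive: probabilities sum to 1
assert abs(sum(p for _, p, _ in members) - 1.0) < 1e-9

simple_average = sum(c for _, _, c in members) / len(members)
expected_cost = sum(p * c for _, p, c in members)  # E(cost) = sum of p(x) * x

print(simple_average)           # 6125.0
print(round(expected_cost, 2))  # 7150.0
```

The expected cost exceeds the simple average because the more expensive member types carry higher probabilities of joining.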

Standard Deviation

This section describes the concept and measurement of standard deviation.

A variable, by definition, has values that vary across subjects in a sample.

Standard deviation is an estimate of the variation in the values of a variable around that variable’s measure of center (the average). Two variables can have the same average, but the observations around that average may be spread more widely for one than for the other. In exhibit 4.3, all three plots have the same average but different dispersion.

Dispersion matters a great deal, as we are not telling the complete story if we report only averages. Consider the average hospitalization costs calculated for two different populations: a group of patients who have been hospitalized and members of an insurance plan, some of whom have never been hospitalized. Compared to members of the health plan, patients in the hospital have a tight dispersion around the average hospitalization cost. Because a portion of the members are never hospitalized, they will have no hospitalization costs. In contrast, all patients in the hospital will have some hospital costs, so there will be no zeros. The dispersion around the average hospitalization cost will be larger for members than for patients.

The mean describes the center of data, but the variability in the data is also important. Exhibit 4.3 shows three sets of data. All have the same mean.


EXHIBIT 4.3 Dispersion Around Same Average

The solid line shows a tight spread of data around the mean, the dotted line a larger spread around the same mean, and the dashed line the largest spread around the same average.

To measure the spread of data we start with deviation. Deviation is the difference between the ith observation and the mean: (Xᵢ − μ). In this formula, μ is the mean of the population, which is often unknown. The mean of the sample, as described earlier, is shown as X̄ and can be calculated by summing the observations in the sample and dividing by the count of the observations. Think of deviation as how far off course we are if we deviate from our main road (the average) and end up at the observation point. We can deviate to the right of the mean or to the left, so the difference between the mean and the observation can be positive or negative.

If we want a single measure of the spread around the mean, we need to calculate a value across all deviations. One way is to sum the deviations, but doing so would be problematic. Some positive deviations will cancel out other negative deviations and create a false image that there is not much devi- ation across the sample points. To avoid this, we first square the deviations and then sum the square of deviations. The measure of spread calculated in this way is called variance. It is defined as the average of the square of devia- tions for every point in the population. It is calculated as

σ² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − μ)².

The standard deviation is the square root of the variance and can be calculated from the observations in the sample. The standard deviation is useful when considering how close the data are to the mean of the sample:

s = √[ Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) ].

To estimate the standard deviation, first calculate the deviations for each point. In exhibit 4.4, we have three observations: 5, 2, and 5. The mean is 4. The deviations from the mean are 1, −2, and 1. Note that the total sum of deviations is always 0. Once the deviations have been calculated, we square the deviations. The sum of the squared deviations is 6. Last, we divide the sum by the number of observations minus 1 and take the square root of the result. The sum is 6, and there are 3 observations, so the division yields 6 / (3 − 1) = 3, and the square root of 3 is approximately 1.73.
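The worked example can be checked with a short function that follows the formula step by step:

```python
from math import sqrt

def sample_std(values):
    """Square root of the sum of squared deviations divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n                            # mean of 5, 2, 5 is 4
    squared_devs = [(x - mean) ** 2 for x in values]  # 1, 4, 1 -> sum is 6
    return sqrt(sum(squared_devs) / (n - 1))          # sqrt(6 / 2) = sqrt(3)

print(round(sample_std([5, 2, 5]), 2))  # 1.73
```

Dividing by n − 1 rather than n matches the sample standard deviation formula above, which corrects for estimating the mean from the same sample.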

Standard deviation is a measure of the spread of data around the mean. For normally distributed data, roughly 68 percent of observations fall within one standard deviation of the mean.
