Health research method in Medicine


Study designs

Learning objectives

At the end of this session, participants should be able to:
- recognize and list the various types of descriptive studies
- understand the advantages and disadvantages of cross-sectional studies
- know and understand the principles of planning and implementing a case-control study
- know and understand the potential biases associated with a case-control study
- know and understand the advantages and disadvantages of a case-control study
- describe a cohort study design and indicate its strengths and weaknesses
- given a research question, design an appropriate cohort study to investigate the problem
- describe an RCT design and indicate its strengths and weaknesses
- describe potential sources of bias in RCTs

Introduction

A study may involve different study designs. Study design characteristics include the type of data (qualitative vs quantitative), the type of comparisons (with or without a control group), the type of setting or unit of analysis chosen, etc. The selection of a research strategy is therefore the core of a research design and is probably the single most important decision the investigator has to make. This section deals with the different types of epidemiological research designs.

Selection of study design

Depending on the existing state of knowledge about the problem being studied, different types of questions may be asked, and these require different study designs. Some examples are given in the following table:

Table 1: Research questions and study designs

Each entry gives the state of knowledge of the problem, the types of research questions, and the types of study design.

State of knowledge of the problem: Knowing that a problem exists but knowing little about its characteristics or possible causes
Types of research questions:
- What is the nature/magnitude of the problem?
- Who is affected? When and where?
- How do the affected people behave?
- What do they know, believe, and think about the problem?
Types of study design:
- Descriptive
- Case studies
- Cross-sectional surveys
- Qualitative methods

State of knowledge of the problem: Suspecting that certain factors contribute to the problem
Types of research questions:
- Are certain factors associated with the problem? (e.g. Is lack of school sex education related to high incidence of STD?)
Types of study design:
- Cross-sectional comparative
- Case-control studies
- Cohort studies

State of knowledge of the problem: Having established that certain factors are associated with the problem, desiring to establish the extent to which a particular factor causes or contributes to the problem
Types of research questions:
- What is the cause of the problem?
- Will the removal of a particular factor prevent or reduce the problem? (e.g. stopping Khat, stopping smoking, providing safe water)
Types of study design:
- Experimental or quasi-experimental study designs

State of knowledge of the problem: Having sufficient knowledge about cause to develop and assess an intervention which would prevent, control or solve the problem
Types of research questions:
- What is the effect of a particular intervention/strategy? (e.g. new drug, special educational programme)
Types of study design:
- Experimental or quasi-experimental study designs

The type of study design chosen depends on (see examples in Table 1):
- The type of problem
- The knowledge already available about the problem
- Resources available for the study

Observational versus Experimental (Intervention) studies

Observational study design is the more common approach in public health for testing hypotheses. The investigator can only observe the occurrence of disease in people who are already segregated into groups on the basis of some exposure. In this kind of study, allocation into groups on the basis of exposure to a factor is not under the control of the investigator.

The experimental (intervention) study is an epidemiologic design that can provide data of high quality. The distinguishing characteristic of experimental study design is that the investigators themselves allocate the exposure.

Although an experiment is an important step in establishing causality, it is often neither feasible nor ethical to subject human beings to risk factors in etiological studies. Therefore, experimental studies are not commonly done.

Observational studies

Observational studies are classified into two types: descriptive and analytical. The following sections describe each in detail.

When an epidemiological study is not structured formally as an analytical or experimental study, i.e. when it is not aimed specifically at testing a hypothesis, it is called a descriptive study.

Descriptive studies characterize the occurrence and distribution of problems by time, place and person. The wealth of material obtained in most descriptive studies allows the generation of hypotheses, which can then be tested by analytical or experimental designs.

A descriptive study assesses morbidity or mortality in a population and its occurrence and distribution in population groups according to (1) characteristics of persons, (2) characteristics of place, and (3) characteristics of time.

The numbers of events (mortality or morbidity) are enumerated and the population at risk is identified. Rates, ratios and proportions are calculated as measures of the probability of events. One must be careful to use the right measurements and the right ‘denominators’ when assessing these measures of probability.

The case report is the type of descriptive study that gives a detailed report of a single patient.

Classical example: In 1941 Gregg (an Australian ophthalmologist) reported a new syndrome of congenital cataract linked to rubella in the mother during pregnancy.

Clinical observations such as this can give the first clues in the identification of a new disease and the effect of an exposure.

A case series is a descriptive study that reports a series of cases of a specific condition, or a series of treated cases. These represent the numerator of disease occurrence, and should not be used to estimate risks.

Example: In the 1940s, Alton Ochsner, USA, observed that virtually all of the patients on whom he was operating for lung cancer gave a history of cigarette smoking. Based on his case series observation he hypothesized that cigarette smoking was linked with lung cancer.

In classical infectious disease epidemiology, a case series is often used as an early means of identifying the presence of an epidemic.

Ecological descriptive studies: when the unit of observation is an aggregate (e.g. family, clan or school) or an ecological unit (a village, town or country), the study becomes an ecological descriptive study.

As mentioned earlier, hypothesis testing is not generally an objective of the descriptive study. However, in some cross-sectional surveys and ecological studies, some hypothesis testing may be appropriate.

Descriptive cross-sectional studies or community (population) surveys: cross-sectional studies entail the collection of data on, as the term implies, a cross-section of the population, which may comprise the whole population or a proportion (sample) of it. Many cross-sectional studies do not aim at testing a hypothesis about an association, and are thus descriptive. They provide a prevalence rate at a particular point in time (point prevalence) or over a period of time (period prevalence). The study population at risk is the denominator for these prevalence rates. Included in this type of descriptive study are surveys in which the distribution of a disease, disability, or nutritional status is assessed. This design may also be used in health systems research to describe ‘prevalence’ by certain characteristics (pattern of health service utilization and compliance) or in opinion surveys. A common procedure used in family planning and in other services is the KAP survey (survey of knowledge, attitudes and practice).

Trend studies: data may be collected at different points in time, and changes in the pattern are analyzed. Though different study subjects are studied at each time, each sample can represent the same type of population.

It should be noted that trend studies often involve a rather long period of data collection. In most cases, the same researcher does not personally collect the data used in a trend study, but instead conducts a secondary analysis of data collected over time by several other observers or of routinely collected data.

Table 2: Advantages and disadvantages of cross sectional studies

Advantages
- They are relatively quick and inexpensive
- Often a good first step for a cohort study
- Provide prevalence information
- Researcher has control over the selection of study subjects
- Researcher has control over the measurements used
- Can study several factors or outcomes at one time
- Often provide early clues for hypothesis generation

Disadvantages
- Do not allow the true temporal sequence of exposure and outcome to be ascertained, and are therefore unable to shed light on cause and effect associations
- Potential bias in measuring exposure
- Potential sampling and/or survivor bias
- Not feasible for rare conditions
- Do not yield incidence or true relative risk

An example of a cross-sectional study

An indigenous malaria transmission in the outskirts of Addis Ababa, Akaki Town and its environs

Adugna Woyessa, Teshome Gebre-Micheal, Ahmed Ali

Abstract

Background: In recent years malaria is becoming endemic in highland areas beyond its previously known upper limit of transmission. Assessment of the situation of the disease in such areas is necessary in order to institute appropriate control activities.

Objectives: The objectives of the study were to determine the prevalence of malaria, the parasite species involved and Anopheles species responsible in local malaria transmission.

Methods: A systematic sampling technique was used to select survey households. Blood films were collected monthly between October and December 1999 from all household members by a trained and experienced laboratory technician. Larval and adult mosquitoes were monthly collected using different methods from September 1999 to October 2000.

Results: Among 2136 examined blood films, 78 (3.7%) of them were malaria positive, of which 54 (69%) were due to Plasmodium vivax and 24 (31%) due to P. falciparum. Anopheles gambiae s.l. (presumably An. arabiensis) and An. christyi were the dominant man-biting species, with the former being the major vector in the area. Both these species were found to be more of exophagic and active in the early evening, unlike An. pharoensis, which showed an endophagic tendency.

Conclusion: This study indicated that indigenous transmission of malaria occurs in the study area. Transmission is reckoned to be maintained by low density of vector species for short period of time under favorable conditions. Therefore, the acquisition of communal immunity is interrupted by long duration of non-malaria season leading to the occurrence of recurrent malaria epidemics. [Ethiop.J.Health Dev 2004;18(1):2-7]
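The percentages quoted in the Results of this abstract can be checked directly from the reported counts. The short sketch below is only an arithmetic illustration; the helper name `percentage` is an assumption introduced here and is not part of the original study.

```python
# Counts taken from the Results paragraph of the abstract above.
films_examined = 2136
films_positive = 78
vivax_positive = 54
falciparum_positive = 24


def percentage(numerator, denominator):
    """Express numerator/denominator as a percentage."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * numerator / denominator


# Proportion of examined films that were positive (the denominator is all films examined).
prevalence = percentage(films_positive, films_examined)
# Species shares use the positive films as the denominator, not all films.
vivax_share = percentage(vivax_positive, films_positive)
falciparum_share = percentage(falciparum_positive, films_positive)

print(f"Malaria-positive films: {prevalence:.1f}% of {films_examined}")
print(f"P. vivax: {vivax_share:.0f}% of positives")
print(f"P. falciparum: {falciparum_share:.0f}% of positives")
```

Note how the denominator changes between the two calculations; choosing the right denominator is the point made earlier about rates, ratios and proportions.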

An observational study whose primary goal is to establish a relationship (association) between a ‘risk factor’ (etiological agent) and an outcome (disease) is termed analytical. Analytical studies always require a comparison group.

The basic approach in analytical studies is to develop a specific, testable hypothesis, and to design the study to control any extraneous variables that could potentially confound the observed relationship between the studied factor and the disease. The approach varies according to the specific strategy used, as described below for case-control and cohort studies.

A case-control study is a design in which people diagnosed as having a disease (cases) are compared with persons who do not have the disease (controls) to determine whether the two groups differ in the proportion of persons exposed to a specific factor or factors.
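As a minimal sketch of this comparison, the code below computes the proportion exposed among cases and among controls, together with the odds ratio, which is the measure of association usually reported from a case-control study. The counts are hypothetical and chosen only for illustration; the function names are assumptions, not part of the source.

```python
def proportion_exposed(exposed, unexposed):
    """Proportion of a group (cases or controls) that was exposed."""
    return exposed / (exposed + unexposed)


def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


# Hypothetical 2x2 table: 100 cases and 100 controls.
cases_exposed, cases_unexposed = 60, 40
controls_exposed, controls_unexposed = 30, 70

p_cases = proportion_exposed(cases_exposed, cases_unexposed)
p_controls = proportion_exposed(controls_exposed, controls_unexposed)
or_estimate = odds_ratio(cases_exposed, cases_unexposed,
                         controls_exposed, controls_unexposed)

print(f"Exposed among cases:    {p_cases:.0%}")
print(f"Exposed among controls: {p_controls:.0%}")
print(f"Odds ratio: {or_estimate:.1f}")
```

An odds ratio above 1 suggests that exposure is more common among cases than among controls; it does not by itself rule out bias or confounding.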

Experimental Studies

The experimental study, or clinical trial, is an epidemiologic design that can provide data of high quality. As in a cohort study, individuals are enrolled on the basis of their exposure status; however, the distinguishing characteristic of an experimental study design is that the investigators themselves allocate the exposure.

The experimental study is the best epidemiological study design to prove causation. It can be viewed as the final or definitive step in the research process. The experimenter (investigator) has control of the subjects, the intervention and the outcome measurements, and sets the conditions under which the experiment is conducted. In particular, the investigator determines who will be exposed to the intervention and who will not. This selection is done in such a way that the comparison of outcome measures between the exposed and unexposed groups is as free of bias as possible.
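One common way to make this allocation as free of bias as possible is to let chance decide who receives the intervention. The sketch below shows simple random allocation into two equal groups; it is an illustration under stated assumptions (participant identifiers 1 to 20, a fixed seed so the result can be reproduced), not a description of any particular trial's procedure.

```python
import random


def randomize(participant_ids, seed=None):
    """Randomly split participants into two groups of (nearly) equal size."""
    rng = random.Random(seed)          # a seed is used here only for reproducibility
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)              # chance, not the investigator, fixes the order
    half = len(shuffled) // 2
    return {
        "intervention": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


# Hypothetical participants numbered 1 to 20.
groups = randomize(range(1, 21), seed=42)
print("Intervention:", groups["intervention"])
print("Control:     ", groups["control"])
```

Real trials often use blocked or stratified randomization and conceal the allocation list from recruiting staff; those refinements are outside this sketch.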

In health research, we are often interested in the comparative experiment, where one or more groups with specific interventions are compared with a group unexposed to interventions (clinical trials) or exposed to the best treatment currently available. The effect of the new interventions on one or more outcome variables is compared between the groups by the use of statistical procedures. Two types of comparative experiments, the randomized clinical trial (RCT) and the community intervention trial (CIT), are discussed in this section.

Figure 3: Flow chart of an experiment (figure not reproduced; recoverable labels: experimental (study) population, inclusion/exclusion criteria)

1.6.1 The randomized clinical trial (RCT)

The most commonly encountered experiment in health science research, and the research strategy by which evidence of effectiveness is measured, is the randomized, controlled, double-blind clinical trial, commonly known as the RCT. Clinical trials may be done for various purposes. Some of the common types of clinical trial (according to purpose) are:
a. prophylactic trials, e.g. immunization, contraception;
b. therapeutic trials, e.g. drug treatment, surgical procedure;
c. safety trials, e.g. side effects of oral contraceptives and injectables;
d. risk-factor trials, e.g. proving the etiology of a disease by inducing it with the putative agent in animals, or withdrawing the agent (e.g. smoking) through cessation.

Therapeutic trials may be conducted to test efficacy (e.g. does a therapeutic agent work in an ideal, controlled situation?) or to test effectiveness (e.g. after having established efficacy, if the therapy is introduced to the population at large, will it be effective when having to deal with other co-interventions, confounding, contamination, etc.?).

The intervention in a clinical trial may include:
- drugs for prevention, treatment or palliation;
- clinical devices, such as intrauterine devices;
- surgical procedures, rehabilitation procedures;
- medical counseling;
- diet, exercise, change of other lifestyle habits;
- hospital services, e.g. integrated versus non-integrated, acute vs chronic care;
- risk factors;
- communication approaches, e.g. face-to-face communication vs pamphlets;
- different categories of health personnel, e.g. doctors versus nurses;
- treatment regimens, e.g. once-a-day dispensation versus three times a day.

The major difference between randomized clinical trials and community intervention trials is that in the latter the randomization is done on communities rather than individuals. The classic example of a community intervention trial would be that of testing a vaccine. Some communities will be randomly assigned to receive the vaccine, while other communities will either not be vaccinated, or will be vaccinated with a placebo. Another example would be a test of whether the introduction of iron-fortified salt in the community would reduce the incidence of anemia in the community. Communities selected for entry to the study have to be as similar as possible, especially since only a small number of communities will be entered.

Very often, blinding is not possible in these types of studies, and contamination and co-interventions become serious problems. Contamination occurs when individuals from one of the experimental groups receive the intervention from the other experimental group. For example, in the study of iron-fortified salt, some of the members of the community receiving non-fortified salt might hear about the fortified salt, and may acquire it from the other community. (The reverse is also possible.) This is particularly so if the communities are geographically close.

Table 5: Advantages and disadvantages of the experimental approach

Advantages

- The ability to manipulate or assign the exposure

- The ability to randomize subjects to experimental and control groups

- The ability to control confounding and eliminate sources of spurious association

- The ability to ensure temporality

- The ability to replicate findings

Disadvantages

- Lack of reality. In most human situations, it is impossible to randomize all risk factors except those under examination

- Ethical problems. In human experimentation, people are either deliberately exposed to risk factors (in etiological studies) or treatment is deliberately withheld from cases (intervention trials)

- Difficulties in manipulating the independent variable

- Non-representativeness of samples. Many experiments are carried out on captive populations or volunteers, who are not necessarily representative of the population at large

- Experiments in hospitals (where the experimental approach is most feasible and is frequently used) suffer from several sources of selection bias

Example of a Randomized Clinical Trial

Clinical efficacy of three common treatments in acute otitis externa in primary care: randomized controlled trial

Frank A M van Balen, W Martijn Smit, Nicolaas P A Zuithoff, Theo J M Verheij

Abstract

Objective: To compare the clinical efficacy of ear drops containing acetic acid, corticosteroid and acetic acid, and steroid and antibiotic in acute otitis externa in primary care.

Participants: 213 adults with acute otitis externa.

Main outcome measures: Primary outcome: duration of symptoms (days) according to patient diaries. Secondary outcome: cure rate according to general practitioner completed questionnaires and recurrence of symptoms between days 21 and 42.

Results: Symptoms lasted for a median of 8.0 days (95% confidence interval 7.0 to 9.0) in the acetic acid group, 7.0 days (5.8 to 8.3) in the steroid and acetic acid group, and 6.0 days (5.1 to 6.9) in the steroid and antibiotic group. The overall cure rates at seven, 14, and 21 days were 38%, 68%, and 75%, respectively.

Compared with the acetic acid group, significantly more patients were cured in the steroid and acetic acid group and steroid and antibiotic group at day 14 (odds ratio 2.4, 1.1 to 5.3, and 3.5, 1.6 to 7.7, respectively) and day 21 (5.3, 2.0 to 13.7, and 3.9, 1.7 to 9.1, respectively).

Recurrence of symptoms between days 21 and 42 occurred in 29% (50/172) of patients and was seen significantly less in the steroid and acetic acid group (0.3, 0.1 to 0.7) and steroid and antibiotic group (0.4, 0.2 to 1.0) than in the acetic acid group.

Conclusions: Ear drops containing corticosteroids are more effective than acetic acid ear drops in the treatment of acute otitis externa in primary care. Steroid and acetic acid or steroid and antibiotic ear drops are equally effective.

Do this exercise in groups. One group will then present its answer, followed by discussion.

You are asked to design an RCT to evaluate the effect of a new anti-AIDS drug. Discuss the issues involved in selecting the study sample and implementing the study.

Summary points on study designs

There are two types of epidemiological research designs:
- Observational
- Experimental: the randomized clinical trial (RCT) and community intervention trials (CITs)

The distinctive feature of the descriptive study design is that its primary concern is with description rather than with the testing of hypotheses or proving causality.

Descriptive studies include:
- Case reports
- Case series
- Cross-sectional studies or community surveys
- Ecological descriptive studies

Observational studies, where establishing a relationship (association) between a ‘risk factor’ (etiological agent) and an outcome (disease) is the primary goal, are termed analytical. In this type of study, hypothesis testing is the primary tool of inference.

Types of analytical studies:
- Case-control study
- Cohort study

Sampling Methods and Sample Size

Learning Objectives

At the end of this session, participants should be able to:
- identify and define the population to be studied
- identify and describe common methods of sampling
- discuss problems of bias that should be avoided when selecting a sample
- list the factors to consider when deciding on sample size
- decide on the sampling methods and sample size most appropriate for the research design they are developing

What is sampling

Most research studies involve the observation of a sample from some predefined population of interest. The conclusions drawn from the study are often based on generalizing the results observed in the sample to the entire population from which the sample was drawn. Therefore, the accuracy of the conclusions will depend on how well the sample has been collected, and especially on how representative the sample is of the population. In this chapter, we will discuss the major issues that a researcher has to face in selecting an appropriate sample.

Sampling is a process of choosing a section of the population for observation and study.

Why sampling?

There are several reasons why samples are chosen for a study, rather than studying the entire population. First and foremost, a researcher wants to minimize the costs (financial and otherwise) of collecting the data, processing them and reporting on the results. If a reasonable picture of a population can be obtained by observing only a section of it, the researcher economizes by choosing such a section of the population. Obviously, when a sample is observed, the total information will be less than if one were to observe the entire population.

A major advantage of sampling over complete enumeration is the fact that the available resources can be better spent in refining the measuring instruments and methods so that the information collected is accurate (valid and reliable). Some information, such as monitoring of the body burden of toxic metals in the population, which may require specialized equipment and staff, cannot be collected from the entire population. A sample in such cases would provide a reasonable picture of the population status.

When we draw a sample from a population we will be confronted with the following questions:

- What is the group of people (study population) from which we want to draw a sample?

- How many people do we need in our sample?

- How will these people be selected?

The study population has to be clearly defined, for example according to age, sex, and residence. Apart from persons, a study population may consist of villages, institutions, records, etc.

Example of Study Population and Study Units

Problem: Malnutrition related to weaning in District X
Study population: All children 6-24 months of age in District X
Study unit: One child between 6-24 months in District X

Problem: High dropout rates in primary schools in District Y
Study population: All primary schools in District Y
Study unit: One primary school in District Y

Problem: Inappropriate record-keeping for leprosy patients registered in Hospital Z
Study population: All records on leprosy patients in Hospital Z
Study unit: One record on a leprosy patient registered in Hospital Z

The primary concern in selecting an appropriate sample is that the sample should be representative of the population. Every variable of interest should ideally have the same distribution in the sample as in the population from which the sample is chosen. This requires knowledge of the variables and their distribution in the population, which of course is why we are doing the study in the first place! Therefore, it is not often possible to ensure the representativeness of the sample. However, statisticians have come up with ways in which we can give a reasonable guarantee of representativeness. We will discuss some of these methods briefly in this section.

A REPRESENTATIVE SAMPLE has all the important characteristics of the population from which it is drawn

There are two types of sampling methods: non-probability sampling methods (convenience sampling, quota sampling) and probability sampling methods. The non-probability sampling methods are inappropriate if the aim is to measure variables and generalize findings obtained from a sample to the total study population. For this purpose, probability sampling methods should be used.

Many clinic-based studies use convenience samples.

CONVENIENCE SAMPLING is a method in which, for convenience's sake, the study units that happen to be available at the time of data collection are selected in the sample.

A researcher wants to study the attitudes of villagers towards family planning services provided by an MCH clinic. He decides to interview all adult patients who visit the outpatient clinic during one particular day. This is more convenient than taking a random sample of people in the village, and it gives a useful first impression.

A drawback of convenience sampling is that the sample may be quite unrepresentative of the population you want to study. Some units may be over-selected, others under-selected or missed altogether. It is impossible to adjust for such a distortion: if you need to be representative, you have to use another sampling method.

QUOTA SAMPLING is a method that ensures that a certain number of sample units from different categories with specific characteristics appear in the sample so that all these characteristics are represented.

In this method, the investigator interviews as many people in each category of study unit as he can find until he has filled his quota.

How sampling

If a sampling frame does exist or can be compiled, probability sampling methods can be used. With these methods, each study unit has an equal or at least a known probability of being selected in the sample. The following probability sampling methods will be discussed:
- Simple random sampling
- Systematic sampling
- Stratified sampling
- Cluster sampling
- Multi-stage sampling

PROBABILITY SAMPLING involves random selection procedures to ensure that each unit of the sample is chosen on the basis of chance. All units of the study population should have an equal, or at least a known, chance of being included in the sample.

Simple random sampling is the most common and the simplest of the sampling methods. In this method, the subjects are chosen from the population with equal probability of selection. One may use a random number table (see ANNEX 1 and 2), or use techniques such as putting the names of people into a hat and selecting the appropriate number of names blindly. Recently, computer programs have been developed to draw simple random samples from a given population; this will be dealt with in module 3. The simple random sample has the advantages that it is easy to administer, is representative of the population in the long run, and the analysis of data using such a sampling scheme is straightforward.
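A minimal sketch of drawing a simple random sample by computer is given below, using Python's standard `random` module. The sampling frame of 200 named persons and the sample size of 20 are hypothetical; the function name `simple_random_sample` is an assumption introduced for illustration.

```python
import random


def simple_random_sample(sampling_frame, sample_size, seed=None):
    """Draw sample_size units without replacement, each with equal probability."""
    rng = random.Random(seed)   # a seed makes the draw reproducible for checking
    return rng.sample(list(sampling_frame), sample_size)


# Hypothetical sampling frame: a list of 200 people.
frame = [f"person_{i:03d}" for i in range(1, 201)]
sample = simple_random_sample(frame, 20, seed=1)
print(sample)
```

This is the computer equivalent of drawing names from a hat: every unit in the frame has the same chance of selection and no unit can be drawn twice.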

In SYSTEMATIC sampling, individuals are chosen at regular intervals (for example, every fifth) from the sampling frame. Ideally, we randomly select a number to tell us where to start selecting individuals from the list.

A systematic sample is to be selected from 1200 students of a school. The sample size selected is 100. The sampling fraction is:

100 (sample size) / 1200 (study population) = 1/12

The sampling interval is, therefore, 12. The number of the first student to be included in the sample is chosen randomly, for example by blindly picking one out of twelve pieces of paper, numbered 1-12. If number 6 is picked, then every twelfth student will be included in the sample, starting with student number 6, until 100 students are selected: the numbers selected would be 6, 18, 30, 42, etc.
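The school example can be reproduced in a few lines. In the sketch below the starting number is passed in explicitly so that the result matches the text (start at student 6); in practice it would be drawn at random between 1 and the sampling interval. The function name is an assumption for illustration.

```python
def systematic_sample(population_size, sample_size, start):
    """Return the list positions selected by systematic sampling."""
    interval = population_size // sample_size      # 1200 // 100 = 12
    if not 1 <= start <= interval:
        raise ValueError("start must lie between 1 and the sampling interval")
    return [start + k * interval for k in range(sample_size)]


# Values from the example: 1200 students, sample of 100, first pick is student 6.
selected = systematic_sample(1200, 100, start=6)
print(selected[:4], "...", selected[-1])
```

To choose the start at random instead, one could use `random.randint(1, 12)` from the standard library.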

Systematic sampling is usually less time-consuming and easier to perform than simple random sampling. However, there is a risk of bias, as the sampling interval may coincide with a systematic variation in the sampling frame. For instance, if we want to select a random sample of days on which to count clinic attendance, systematic sampling with a sampling interval of 7 days would be inappropriate, as all study days would fall on the same day of the week, which might, for example, be market day.
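The weekday problem is easy to demonstrate. The sketch below lists which days of the week are hit by a systematic sample of ten days with an interval of 7 and, for contrast, an interval of 5. The start date is hypothetical and the function name is an assumption for illustration.

```python
from datetime import date, timedelta


def sampled_weekdays(first_day, interval_days, n_days):
    """Return the set of weekdays (0 = Monday ... 6 = Sunday) in a systematic sample of days."""
    days = [first_day + timedelta(days=interval_days * k) for k in range(n_days)]
    return {d.weekday() for d in days}


first_day = date(2024, 1, 1)                    # hypothetical start date (a Monday)
every_7 = sampled_weekdays(first_day, 7, 10)    # interval coincides with the weekly cycle
every_5 = sampled_weekdays(first_day, 5, 10)    # interval does not coincide with it

print("Interval of 7 days covers weekdays:", sorted(every_7))
print("Interval of 5 days covers weekdays:", sorted(every_5))
```

With an interval of 7 every sampled day falls on the same weekday, so any weekly pattern in attendance (such as a market day) is either always included or always missed.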

When the size of the sample is small and we have some information about the distribution of a particular variable (e.g. gender: 50% male, 50% female), it may be advantageous to select simple random samples from within each of the subgroups defined by that variable. By choosing half the sample from males and half from females, we ensure that the sample is representative of the population with respect to gender. When confounding is an important issue (such as in case-control studies), stratified sampling will reduce potential confounding by selecting homogeneous subgroups.

If it is important that the sample includes representative groups of study units with specific characteristics (for example, residents from urban and rural areas, or different age groups), then the sampling frame must be divided into groups, or strata, according to these characteristics. Random or systematic samples of predetermined size will then have to be obtained from each group (stratum). This is called STRATIFIED SAMPLING.

Stratified sampling is only possible when we know what proportion of the study population belongs to each group we are interested in.
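A minimal sketch of stratified sampling follows: the sampling frame is split into strata and a simple random sample of predetermined size is drawn from each. The frame (40 urban and 160 rural households) and the sample sizes are hypothetical and chosen so that both strata are sampled at the same fraction (1 in 8).

```python
import random


def stratified_sample(strata, sample_sizes, seed=None):
    """Draw a simple random sample of the given size from each stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(units, sample_sizes[name])
        for name, units in strata.items()
    }


# Hypothetical sampling frame, already divided into strata (20% urban, 80% rural).
strata = {
    "urban": [f"urban_household_{i}" for i in range(1, 41)],
    "rural": [f"rural_household_{i}" for i in range(1, 161)],
}

# Proportional allocation: 5 of 40 and 20 of 160 are both 1 in 8.
sample = stratified_sample(strata, {"urban": 5, "rural": 20}, seed=3)
for name, units in sample.items():
    print(name, len(units), units[:3])
```

Systematic sampling could be used within each stratum instead of `rng.sample`; the stratification step itself stays the same.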

An advantage of stratified sampling is that we can take a relatively large sample from a small group in our study population. This allows us to get a sample that is big enough to enable us to draw valid conclusions about a relatively small group without having to collect an unnecessarily large (and hence expensive) sample of the other, larger groups. However, in doing so, we are using unequal sampling fractions, and it is important to correct for this when generalizing our findings to the whole study population.

A survey is conducted on household water supply in a district comprising 20,000 households, of which 20% are urban and 80% rural. It is suspected that in urban areas the access to safe water sources is much more satisfactory. A decision is made to include 100 urban households (out of 4,000, which gives a 1 in 40 sample) and 200 rural households (out of 16,000, which gives a 1 in 80 sample). Because we know the sampling fraction for both strata, the access to safe water for all the district households can be calculated.
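The correction for unequal sampling fractions can be sketched as a weighted average, with each stratum weighted by its share of all households. The stratum sizes below follow the example, assuming the rural stratum is 16,000 households (80% of 20,000); the numbers of sampled households with safe water (80 urban, 60 rural) are hypothetical, since the text gives no results.

```python
# Stratum sizes follow the example; "with_safe_water" counts are hypothetical.
strata = {
    "urban": {"households": 4000, "sampled": 100, "with_safe_water": 80},
    "rural": {"households": 16000, "sampled": 200, "with_safe_water": 60},
}


def weighted_proportion(strata):
    """Estimate the district-wide proportion, weighting each stratum by its size."""
    total_households = sum(s["households"] for s in strata.values())
    estimated_with_access = sum(
        s["households"] * s["with_safe_water"] / s["sampled"] for s in strata.values()
    )
    return estimated_with_access / total_households


def unweighted_proportion(strata):
    """Naive pooled proportion that ignores the unequal sampling fractions."""
    return (sum(s["with_safe_water"] for s in strata.values())
            / sum(s["sampled"] for s in strata.values()))


weighted = weighted_proportion(strata)
naive = unweighted_proportion(strata)
print(f"Weighted estimate for the district: {weighted:.1%}")
print(f"Naive pooled estimate:              {naive:.1%}")
```

Because urban households are over-sampled (1 in 40 against 1 in 80), the naive pooled figure overstates district-wide access whenever urban access is better than rural access.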

The selection of groups of study units (clusters) instead of the selection of study units individually is called CLUSTER SAMPLING

In many administrative surveys, studies are done on large populations which may be geographically quite dispersed. Obtaining the required number of subjects for the study by simple random sampling would be costly and inconvenient. In such cases, clusters may be identified (e.g. households) and a random sample of clusters will be included in the study; every member of each selected cluster will then be part of the study. This introduces two types of variation in the data (between clusters and within clusters), and this will have to be taken into account when analyzing the data.

In a study of knowledge, attitudes, and practices related to family planning in rural communities of a region, a list is made of all the villages. Using this list, a random sample of villages is chosen and all the adults in the selected villages are interviewed.
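The village example can be sketched as follows: the sampling frame is the list of villages (clusters), a random sample of villages is drawn, and every adult in the chosen villages is included. The ten villages with five adults each are hypothetical, and the function name is an assumption for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and include every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # the frame is the list of clusters
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units


# Hypothetical region: 10 villages, each listed with its adults.
villages = {
    f"village_{v}": [f"village_{v}_adult_{a}" for a in range(1, 6)]
    for v in range(1, 11)
}

chosen_villages, interviewees = cluster_sample(villages, 3, seed=7)
print("Selected villages:", chosen_villages)
print("Adults to interview:", len(interviewees))
```

Only the list of villages is needed before fieldwork starts; a list of adults is required only for the villages that are actually selected.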

Many studies, especially large nationwide surveys, will incorporate different sampling methods for different groups, and may be done in several stages. In experiments, or common epidemiological studies such as case-control or cohort studies, this is not a common practice.

In a study of utilization of pit latrines in a district, 150 homesteads are to be visited for interviews with family members as well as for observations on types and cleanliness of latrines. The district is composed of six wards and each ward has between six and nine villages. The following four-stage sampling procedure could be performed:

1. Select three wards out of the six by simple random sampling.

2. For each ward, select five villages by simple random sampling (15 villages in total).

3. For each village, select ten households. Because simply choosing households in the center of the village would produce a biased sample, the following systematic sampling procedure is proposed:
¨ Go to the center of the village.
¨ Choose a direction in a random way: spin a bottle on the ground and choose the direction the bottleneck indicates.
¨ Walk in the chosen direction and select every third or every fifth household (depending on the size of the village) until you have the ten you need. If you reach the boundary of the village and you still do not have ten households, return to the center of the village, walk in the opposite direction and continue to select your sample in the same way until you get ten. If there is nobody in a chosen household, take the next nearest one.

4. Decide beforehand whom to interview (for example the head of the household, if present, or the oldest adult who lives there and who is available).
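The first three stages can be sketched in code under stated assumptions. The district below is made up, and the spin-the-bottle walk is replaced by a systematic sample from a household list with a random start, which is a substitute for the field procedure, not a description of it. The size cut-off of 50 households for choosing every third or every fifth household is also an assumption. Stage 4 (whom to interview) is a fieldwork rule and is not modeled.

```python
import random

def multistage_sample(district, n_wards=3, n_villages=5, n_households=10, seed=None):
    """Return (ward, village, household) tuples chosen in three random stages."""
    rng = random.Random(seed)
    selected = []
    # Stage 1: simple random sample of wards
    for ward in rng.sample(sorted(district), n_wards):
        # Stage 2: simple random sample of villages within the ward
        for village in rng.sample(sorted(district[ward]), n_villages):
            households = district[ward][village]
            # Stage 3: every 3rd or 5th household, depending on village size
            step = 3 if len(households) < 50 else 5  # hypothetical size cut-off
            start = rng.randrange(step)
            picked = households[start::step][:n_households]
            selected.extend((ward, village, h) for h in picked)
    return selected

# Hypothetical district: six wards, six to nine villages each, 45 to 85 households per village
district = {
    f"ward_{w}": {
        f"village_{w}_{v}": [f"hh_{w}_{v}_{h}" for h in range(40 + 5 * v)]
        for v in range(1, 7 + w % 4)
    }
    for w in range(1, 7)
}

sample = multistage_sample(district, seed=42)
print(len(sample))
```

Three wards, five villages per ward and ten households per village give the 150 homesteads required.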

Table 6: The main advantages and disadvantages of cluster and multi-stage sampling

Advantages
¨ A sampling frame of individual units is not required for the whole population. Initially a sampling frame of clusters is sufficient. Only within the clusters that are finally selected do we need to list and sample the individual units.
¨ The sample is easier to select than a simple random sample of similar size, because the individual units in the sample are physically together in groups instead of scattered all over the study population.

Disadvantages
¨ Compared to simple random sampling, there is a larger probability that the final sample will not be representative of the total study population. The likelihood of the sample not being representative depends mainly on the number of clusters selected in the first stage. The larger the number of clusters, the greater the likelihood that the sample will be representative.

The main determinant of the sample size is how accurate the results need to be

Learning objectives

At the end of this section participants should be able to: ¨ identify the sources of health data in the community where they work, ¨ describe various data collection techniques and state their uses and limitations ¨ identify the limitations and strength of routine data sources ¨ state the benefits of using a combination of different data collection techniques ¨ state various sources of bias in data collection and ways of preventing bias ¨ promote the collection of accurate data by members of their health team

Data collection techniques allow us to systematically collect information about our subjects of study and about the settings in which they occur

In the collection of data, we have to be systematic If data are collected haphazardly, it will be difficult to answer research questions in any conclusive way.

Data collection techniques

¨ Using available information (record review) ¨ Observing ¨ Interviewing ¨ Administering written questionnaires ¨ Focus group discussions ¨ Other data collection techniques

There is a large amount of data that has already been collected by others. Locating these sources and retrieving the information is a good starting point in any data collection effort. Some sources of such data are listed below: ¨ Mortality reports ¨ Morbidity reports ¨ Epidemic reports ¨ Reports of laboratory utilization (including laboratory test results) ¨ Reports of individual case investigations ¨ Reports of epidemic investigations ¨ Special surveys (e.g., hospital admissions, disease registers, and serologic surveys) ¨ Demographic data

Analysis of health services data, census data, unpublished reports, publications in libraries or in offices at the various levels of health and health related services, may be a study in itself In order to retrieve the data from available sources, the researcher will have to design an instrument such as a checklist or compilation sheet In designing such instruments, it is important to inspect the layout of the source documents from which the data is to be extracted and design the data compilation sheet so that the items of data can be transferred in the order in which the items appear in the source document This will save time and reduce error

The assessment of the health status of the community is the basis for planning and evaluation of the health services. Useful information needed for making decisions can often be obtained from routinely available data, even though these are not accurate or complete enough for detailed or elaborate analysis. We shall consider in this section what information you can obtain on the frequency and distribution of morbidity, mortality and their causes from routine sources.

Now do the exercise 3.1 below on uses and limitations of routine data (Can be done in a group or individually, general discussion at the end)

What do you think are the uses and limitations of the hospital-related sources of information shown in the table below? Think in terms of the people served and the levels of health care provided and then write down your ideas in the spaces provided in the table When you have done this, turn over the page and compare with the explanation provided on the next page

Source of data | Uses | Limitations
Health center and hospital returns | |
In-patient and outpatient records | |
Immunization reports | |

Health center and hospital returns: health center and hospital returns are likely to be accurate with respect to disease diagnosis but the data may only relate to the area served by the hospital Time-based data, such as length of stay, and organizational information, such as staffing or the distance patients travel to the hospital, can also be used

In-patient records: Analysis of hospital records can provide high quality information on the most important causes of major illness in a community. But to be useful as an indicator of the health status of the population, you must make allowances for the fact that patients treated in hospital are not representative of the general population in the area. People from remote areas, infants and the elderly, for example, will be under-represented. In some countries many, if not most, seriously ill patients never reach hospital.

Outpatient records: Records of outpatients seen in hospitals, health centers, health posts and clinics often provide much ill-defined data. Diagnostic data are usually given in terms of the chief complaint. Those coming for immunizations or other preventive services may be included with those who come because of illness. The patients who are seen are again probably not representative of the general population: although coverage of the population may be greater than with a hospital because of greater geographical distribution, the people who live near a facility or who can afford the time to come will be over-represented. However, these records do provide information about the usage of outpatient facilities and the most frequent complaints, and may help you to understand the pattern of disease in your community.

Immunization: It is useful to compare the number of births with the number of children immunized; this can give an indication of the coverage of any immunization programme.

Childhood diseases: MCH clinics are one of the best sources of data on childhood diseases such as measles and malnutrition and, over a period of months or years, are reasonably accurate. MCH records alone are not enough, as they are only a source of data on births and on deaths in children under five years. Use other sources of data to obtain a more representative picture. MCH records can also be used to measure the workload of the MCH workers.

Routine data ¨ Fail to include a great deal of important illness and disability. In particular, much of the chronic illness due to tropical diseases such as schistosomiasis, leprosy, blindness, undernutrition and crippling due to birth trauma or polio will not be detected from routine records ¨ Relate only to numerator data

OBSERVATION is a technique which involves systematically selecting, watching and recording behaviors and characteristics of living beings, objects or phenomena

Observation of human behaviors is a much used data collection technique It can be undertaken in two different ways: ¨ Participant observation: the observer takes part in the situation he or she observes ¨ Non-participant observation: the observer watches the situation, openly or concealed, but does not participate

Observations are usually complementary to other data collection techniques. They can give additional, more accurate information on the behavior of people than interviews or questionnaires: questionnaires may be incomplete because we forget to ask certain questions, and informants may forget or be unwilling to mention certain things. Observations can therefore check on information collected (especially on sensitive topics such as alcohol or drug use, or stigmatization of leprosy, TB, epilepsy or AIDS patients). Observation can also be a primary source of information.

Observations of human behaviors can form part of any type of study, but as they are time consuming they are most often used in small-scale studies Observations can also be made on objects For example, the presence or absence of latrine and its state of cleanliness may be observed

An INTERVIEW is a data collection technique that involves oral questioning of respondents, either individually or as a group

Answers to the questions posed during an interview can be recorded by writing them down. Interviews can be conducted with varying degrees of flexibility. The two extremes, high and low degree of flexibility, are described below:

a) High degree of flexibility:

An unstructured or loosely structured method of asking questions can be used for interviewing individuals as well as groups of key informants.

A flexible method of interviewing is useful if a researcher has as yet little understanding of the problem or situation under investigation. It is frequently applied in exploratory studies and also used during case studies.

Example: Interviews using an interview schedule to ensure that all issues are discussed, but allowing flexibility in timing and in the order in which the questions are asked. The interviewer may ask additional questions on the spot in order to gain as much useful information as possible. Questions are open ended: the respondent is unrestricted in what and how he answers.

b) Low degree of flexibility:

Less flexible methods of interviewing are useful when the researcher is relatively knowledgeable about expected answers or when the number of respondents being interviewed is relatively large

Example: Interviews using a questionnaire with a fixed list of questions in a standard sequence, which have mainly fixed or pre-categorized answers

A SELF-ADMINISTERED QUESTIONNAIRE is a data collection tool in which written questions are presented to be answered by the respondents in written form

Bias in Information Collection and its possible causes

BIAS in information collection is a distortion which results in the information not being representative of the true situation

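Bias in information collection can occur as a result of defective instruments, observer bias, selection bias, information bias, or the effect of the interview on the informant.

Defective instruments include, for example: ¨ questionnaires with fixed or closed questions on topics about which too little is known; ¨ questionnaires with open ended questions without guidelines on how to ask (or to answer) them; ¨ questionnaires with vaguely phrased questions or with questions placed in an illogical order; or ¨ weighing scales which are not standardized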

These sources of bias can be prevented by carefully planning the data collection process and by pre-testing the data collection tools

Observer bias can easily occur when conducting observation or utilizing loosely structured group or individual interviews. There is a risk that the data collector will only see or hear things in which he or she is interested or will miss information that is critical to the research. Observation protocols and guidelines for conducting loosely structured interviews should be prepared, and training and practice should be provided to data collectors in using both these tools. Moreover, it is highly recommended that data collectors work in pairs when using flexible research techniques and discuss and interpret the data immediately after collecting it.

If a large proportion of the population under study refuses to cooperate (non-response) or if the sampling procedure used in the study is not adequate, this results in selection bias. This type of bias affects the representativeness of the study and will be discussed at length in other sections.

Information bias may occur while abstracting information from records or statistics Many times, medical records are incomplete or incomprehensible

This poses some problems if you want to use these records in your research

Another example of information bias is called recall (or memory) bias This form of bias is related to the inconsistencies in the memory of informants

3.3.5 Effect of the Interview(er) on the Informant

This is a possible factor in all interview situations The informant may mistrust the intention of the interview and move away from certain questions or give misleading answers Such bias can be reduced by adequately introducing the purpose of the study to informants, by taking sufficient time for the interview, and by assuring informants that the data collected will remain confidential

It is also important to be careful in the selection of interviewers In a study soliciting the reasons for the low utilization of local health service, for example, one should not ask health workers of the health center concerned to interview the population Their use as interviewers would certainly influence the results of the study

By being aware of these potential biases it is possible, to a certain extent, to prevent them If the researcher does not fully succeed, it is important to report honestly in what ways the data may be biased.

Importance of Combining Different Data Collection Techniques

Different data collection techniques can complement each other A skillful use of a combination of different techniques can maximize the quality of the data collected and reduce the chance of bias

Example: To determine the extent of the malnutrition problem in your area, you could make use of: ¨ Growth charts and the existing health center records of malnourished children in the area; ¨ Focus group discussions (FGDs) with several groups of mothers and/or in-depth interviews with a small group of mothers to find out how they feed their young children; ¨ A household survey, testing the relevant findings of the exploratory study on a larger scale

Exercise 3.2: (Can be done in a group or individually, general discussion at the end)

Name three ways in which data about the age of individuals could be inaccurate Answer this based on your experience Classify your answers into those inaccuracies that may be due to the people questioned or due to the observer

Possible Sources of inaccuracy about age

People ¨ In areas where older age is highly respected, people will add to their age ¨ Where there is no tradition for counting age by years, events have to be used. For adults and children, therefore, the date of their birth, or their marriage, or their first child’s birth has to be related to an event

Observer ¨ An inaccurate observer may round off an age to the nearest five years or may routinely suggest ‘40’ or ‘25’ years Therefore, the ages which are registered in this way are of very little value
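One rough way to spot this kind of rounding in recorded data is to check how many ages end in 0 or 5. The sketch below uses an invented list of recorded ages. If ages were recorded accurately, roughly one fifth would be expected to end in 0 or 5, so a much larger share suggests rounding by the observer or the respondent.

```python
def share_ending_in_0_or_5(ages):
    """Fraction of recorded ages whose last digit is 0 or 5."""
    return sum(1 for age in ages if age % 5 == 0) / len(ages)

# Hypothetical ages as written down by one data collector
recorded_ages = [25, 30, 40, 40, 25, 35, 40, 22, 45, 50,
                 25, 40, 31, 40, 60, 25, 40, 55, 38, 40]

heaping = share_ending_in_0_or_5(recorded_ages)
print(heaping)  # well above the roughly 0.2 expected without rounding
```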

Exercise 3.3 (Can be done in a group or individually, general discussion at the end)

What could you do in your area to promote the collection of reliable data? Write down your own ideas before reading the suggestions on the following page

There are ways in which you can make the data collected more reliable:

Training: train all the members of your health team/data collectors to collect accurate data, to avoid bias and to record carefully and emphasize that you will check the accuracy of their work

Use of different sources: take the information from a number of different sources If you then compared the data from the different sources you might well be able to identify inconsistencies and thus inaccuracies

Pre-testing: pre-testing is a try-out of the questionnaire. Pre-testing is carried out on a small number of respondents who are comparable to the sample of respondents but are not part of it

Supervision: regular supervision during the data collection process

Few data are wholly accurate. The degree of inaccuracy that can be tolerated cannot be expressed in figures. Therefore, ¨ Be clear about the data you really want ¨ Decide how these can be collected most accurately ¨ Explain the reasons and methods carefully to the members of your team ¨ Check from time to time on the data which are being collected ¨ Let your colleagues and field workers know how the data they have collected have helped your work

When everyone is ready, one group will present the answers, there will be a discussion

Two health-related problems for which studies must be developed are described below For each problem you are asked to state: ¨ what type(s) of study you would propose ¨ from whom (or from what) you would collect the data required for each study; and ¨ what data collection techniques you would use

1 In your region you have noticed an increase in the defaulter rate from tuberculosis treatment You decide to study the reason and you also want to know the local socio-cultural aspects of the disease to improve the treatment outcome of tuberculosis in your region

2 You have recently been appointed as a Woreda Research Officer in a remote Woreda of your region The government wants to improve health services in this area You want to collect information that will contribute to the development of the plan

Summary points on Data collection ¨ The following are the methods of data collection: § Using available information (records) § Observing § Interviewing § Administering written questionnaires § Focus group discussions ¨ BIAS in information collection is a distortion, which results in the information not being representative of the true situation ¨ Possible sources of bias during data collection: § Defective instruments § Observer Bias § Selection bias § Information bias § Effect of the Interview on the Informant ¨ Data collection can be improved by: § Training of data collectors § Pre-testing the questionnaire § Supervision § Use of different sources for comparison

Chapter 4 Variables and Measurement Errors

Learning objectives

At the end of this course participants should be able to: ¨ define what variables are and describe why their selection is important in research ¨ state the difference between numerical and categorical variables ¨ discuss the difference between dependent and independent variables and how they are used in designing research ¨ identify the variables that will be measured in the research project you are designing and ¨ develop operational definitions with indicators for those variables that cannot be measured directly.

What is a variable?

A variable is a measurable characteristic of a person, object or phenomenon, which can take on different values

A simple example of a variable is “a person’s age”. The variable "age" is measurable and can take on different values since a person can be 20 years old, 35 years old and so on. Other examples of numerical variables are: § weight (expressed in kilograms or in pounds); § distance between homes and clinic (expressed in kilometers or in minutes walking distance); and § monthly income (expressed in birr, dollars).

The different values of a variable may also be expressed in categories. For example, the variable sex has two values, male and female, which are distinct categories. Other examples of categorical variables are:

Table 9: Examples of categorical variables

Variable | Categories
Color | red, blue, green, etc
Outcome of disease | recovery, chronic illness, death
Main type of staple food eaten | maize, millet, rice, cassava, etc

Operationalizing variables by choosing appropriate indicators

For some variables it is sometimes not possible to find meaningful categories unless the variables are made operational with one or more precise INDICATORS. Operationalizing variables means that you make them measurable.

1 You want to determine the level of knowledge concerning a specific issue in order to find out to what extent the factor “poor knowledge” influences the problem under study (for example low utilization of VCT programme by high school students)

The variable level of knowledge cannot be measured as such You would need to develop a series of questions to assess students' knowledge, for example on risk factors related to acquiring HIV/AIDS The answers to these questions form an indicator of someone’s knowledge on this issue, which can then be categorized If 10 questions were asked, you might decide that the knowledge of those with ¨ 0 to 3 correct answers is poor ¨ 4 to 6 correct answers is reasonable, and ¨ 7 to 10 correct answers is good
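The categorization above can be written as a small function. The cut-offs are those given in the text for a 10-question test; the function name `knowledge_category` is introduced here only for illustration.

```python
def knowledge_category(correct_answers, total_questions=10):
    """Turn a count of correct answers (out of 10) into a knowledge level."""
    if not 0 <= correct_answers <= total_questions:
        raise ValueError("number of correct answers is out of range")
    if correct_answers <= 3:
        return "poor"
    if correct_answers <= 6:
        return "reasonable"
    return "good"

for score in (2, 5, 9):
    print(score, knowledge_category(score))
```

Fixing the cut-offs before data collection means every data collector classifies the same score in the same way.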

2 You want to determine the nutritional status of under 5 year olds. You need to choose appropriate indicators for the variable “nutritional status”. Widely used indicators for nutritional status include: ¨ Weight for age; ¨ Weight for height; ¨ Height for age; and ¨ Upper-arm circumference

For the classification of nutritional status, internationally accepted categories already exist, which are based on standard growth curves. For the indicator “weight/age”, for example, children are: ¨ Well nourished if they are above 80% of the standard ¨ Moderately malnourished if they are between 60% and 80% ¨ Severely malnourished if they are below 60%
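A minimal sketch of this classification, assuming the standard weight for the child's age has already been looked up on a growth reference (the values used below are hypothetical). The text does not say how a child at exactly 60% or 80% is classified; the sketch places both in the moderate category, which is an assumption.

```python
def nutritional_status(weight_kg, standard_weight_kg):
    """Classify weight-for-age as a percentage of the standard weight."""
    percent_of_standard = 100 * weight_kg / standard_weight_kg
    if percent_of_standard > 80:
        return "well nourished"
    if percent_of_standard >= 60:  # assumption: exactly 60% and 80% count as moderate
        return "moderately malnourished"
    return "severely malnourished"

# Hypothetical child weights against a hypothetical standard of 10 kg
for weight in (9.0, 7.0, 5.0):
    print(weight, nutritional_status(weight, 10.0))
```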

Defining variables and indicators of variables

To ensure that everyone (the researcher, the data collectors, and eventually, the reader of the research report) understands exactly what has been measured and to ensure that there will be consistency in measurement, it is necessary to clearly define the variables (and indicators of variables). For example, to define the indicator “waiting time” it is necessary to decide what will be considered the starting point of the “waiting period”, e.g. is it when the patient enters the front door, or when he has been registered and obtained his card?

For certain variables, it may not be possible to adequately define the variable or the indicator immediately because further information may be needed for this purpose. The researcher may need to review the literature to find out what definitions have been used by other researchers, so that he can standardize his definitions and thus be able later to easily compare his findings with those of the other studies. In some cases the opinions of “experts” or of community members or health care providers may be needed in order to define the variable or indicator.

Because in health research you often look for causal explanations, it is important to make a distinction between dependent and independent variables

The variable that is used to describe or measure the problem under study is called the DEPENDENT variable This is also known as the outcome variable

The variables that are used to describe or measure the factors that are assumed to cause or at least influence the problem are called the INDEPENDENT variables These are also known as exposure variables

For example, in a study of the relationship between use of prophylactic Isoniazid (INH) treatment and Tuberculosis, “development of clinical TB” (with the values yes, no) would be the dependent variable and “prophylactic INH” the independent variable.

Whether a variable is dependent or independent is determined by the statement of the problem and objectives of the study. It is therefore important when designing a study to clearly state which variable is the dependent and which are the independent variables.

A variable that is associated with the problem and with a possible cause of the problem is a potential confounding variable.

Confounding is a mixing of the effect of the exposure under study on the disease with that of a third factor. This third factor must be associated with the exposure and, independent of that exposure, be a risk factor for the disease.

A confounding variable may either strengthen or weaken the apparent relationship between the problem and possible cause

Therefore, in order to give a true picture of cause and effect, the confounding variables must be considered, either during planning or while doing data analysis

For example: A relationship is shown between the low level of the mother’s education and malnutrition in under 5 children However, family income is related to the mother’s education as well as with malnutrition

Family income is therefore a potential confounding variable In order to give a true picture of the relationship between mother’s education and malnutrition, the family income should also be considered and measured This could either be incorporated into the research design, such as by selecting only mothers with a specific level of family income, or it can be taken into account in the analysis of the findings, with mother’s education and malnutrition among their children being analyzed for families with different categories of income
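The analysis-stage approach described above (looking at the relationship separately within income categories) can be illustrated with a small sketch. All counts below are invented to show the mechanism and do not come from any study. Each pair is (malnourished children, all children).

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of the outcome in the exposed group divided by risk in the unexposed group."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)

# Hypothetical counts: (malnourished children, all children) by income and mother's education
counts = {
    "low income":    {"low_edu": (90, 300), "high_edu": (30, 100)},
    "higher income": {"low_edu": (10, 100), "high_edu": (30, 300)},
}

# Risk ratio within each income stratum
stratum_rr = {
    income: risk_ratio(*c["low_edu"], *c["high_edu"])
    for income, c in counts.items()
}

# Crude risk ratio, ignoring income altogether
crude_low = [sum(c["low_edu"][i] for c in counts.values()) for i in (0, 1)]
crude_high = [sum(c["high_edu"][i] for c in counts.values()) for i in (0, 1)]
crude_rr = risk_ratio(*crude_low, *crude_high)

print(stratum_rr, round(crude_rr, 2))
```

In this made-up data the crude comparison suggests that children of less educated mothers are at higher risk, yet within each income category the risk is the same. The apparent effect of education is produced entirely by income, which is what a confounding variable does.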

4.5 What is validity and reliability?

Two common sources of error that need to be controlled arise from problems with ‘reliability’ and ‘validity’ Our inference should have high reliability (if the observations are repeated under similar conditions, the inferences should be similar) and high validity (the inference should be a reflection of the true nature of the relationship)

Reliability of measurements: If repeated measurements of a characteristic in the same individual under identical conditions produce similar results, we would say that the measurement is reliable

A study result is said to be reliable if the same result is obtained when the study is repeated under the same conditions

Reliability is often closely related to the matter of validity, but refers to the repeatability of scientific observations. If the same set of door-to-door interviews on respondents' sexual behavior produces approximately the same set of responses on repeated trials and with different interviewers, we can say that this observational technique has high reliability, regardless of the validity of the findings.

A measurement is said to be valid if it measures what it is supposed to measure

Thus, if we use a scale that is not calibrated to zero, the weights we obtain using this scale will not be valid

Validity refers to the degree to which scientific observations actually measure or record what they allege to measure. Door-to-door interviewing about intimate details of respondents' sexual behavior might produce a lot of answers duly recorded in interviewers’ notebooks, but we would seriously doubt that the answers were an accurate representation of actual behaviors. Thus, such interviewing on sensitive subjects generally lacks validity.
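The difference between the two ideas can be made concrete with the weighing scale example. In the sketch below the readings and the true weight are hypothetical: repeated readings that agree closely with each other indicate reliability, while a large gap between their average and the true value indicates a lack of validity.

```python
from statistics import mean, pstdev

true_weight_kg = 60.0
# Hypothetical repeated readings from a scale that has not been set to zero
readings_kg = [62.1, 61.9, 62.0, 62.2, 61.8]

spread = pstdev(readings_kg)               # small spread: the scale is reliable
bias = mean(readings_kg) - true_weight_kg  # large systematic error: the scale is not valid

print(round(spread, 2), round(bias, 2))
```

A measurement can therefore be reliable without being valid, as these readings are.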

Exercise 4 is also a group exercise. When everyone is ready, one group will present and there will be a discussion

1 A health researcher believes that in a certain region, anemia, malaria and malnutrition are serious problems among adult males and, in particular, among farmers. He wishes to study the prevalence of these diseases among adult males of various ages, occupational groups and educational backgrounds to determine how serious a problem these diseases are for this population ¨ What are the dependent and independent variables in the study? ¨ Which of these are categorical and which are numerical variables?

2 A Zonal health manager receives a complaint from a particular woreda that one health centre often runs out of anti TB drugs In a preliminary investigation, this shortage of anti TB drugs is confirmed The zonal manager decides to investigate why there is a shortage of anti TB drugs in the health centre ¨ What is the dependent variable in the study? ¨ What would be a meaningful indicator for the dependent variable? ¨ How would you define shortage of anti TB drugs? ¨ Can you think of some independent variables? ¨ Which variables are measurable as they are and which ones need indicators?

3 Look at the following description of a research problem and then answer the questions that follow:

In a study concerning the patterns of distribution of schistosomiasis in the adult population of a village community, a researcher found that the adults were predominantly farmers and that overall, 20% of them had schistosomiasis. The researcher believed that the prevalence of the disease was moderately low in the adult population ¨ Are there any variables whose inclusion in the study might have shown that the prevalence of the disease varied greatly among different categories of adults in the village?

Summary points on Variables and Measurement ¨ A variable is a measurable characteristic of a person, object or phenomenon, which can take on different values ¨ For some variables it is sometimes not possible to find meaningful categories unless the variables are made operational with one or more precise indicators

What is validity and reliability?