must be extremely aware of issues such as the level of data used and characteristics of the population, namely distributional assumptions.

Levels of Data

Data conforms to one of four levels:

Nominal (categorical) - the value is either present or not
Ordinal - the value is ranked relative to others
Interval - the value is scored absolute to others
Ratio - the value is scored absolute to others and to a meaningful zero

An example: Consider three horses in a race. Coding the race times under a nominal level will tell us if any particular horse won the race or not (e.g. Guttman's Folly did not win). Coding under an ordinal level, we can tell where a given horse came in relative to the others (e.g. Guttman's Folly came in second). Coding under an interval level, we know where a given horse came absolute to the others (e.g. Guttman's Folly was 1.5 seconds faster than Galloping Galton, but 2.3 seconds slower than Cattell's Chance). Coding under a ratio level we would know where a given horse came absolute to the others and to a meaningful zero point common to all of them (e.g. Guttman's Folly came home in 67.5 seconds, Galloping Galton in 69.0 seconds, and Cattell's Chance in 65.2 seconds).

Sometimes we use dichotomies. A dichotomy is a variable that can take only one of two values, either present (1) or absent (0). Its level is therefore nominal.

Descriptive Statistics

Measures of Central Tendency

There are three measures that give an indication of the 'average' value of a data set:

Mode - this is the most common value in the data set (most appropriate for nominal level data)
Mean - this is the arithmetic average, the one most familiar to people (most appropriate for interval and ratio level data)
Median - this is the middle value in the data set (most appropriate for ordinal level data)

As an example, the following are the numbers of children for seven families:

0 0 0 1 1 5 7

The mode (most common value) is 0.
The mean is calculated as 14 (sum of all seven scores) divided by seven (number of cases), which equals 2.
The median is 1 (the middle case of the seven cases).

Measures of Dispersion

Some measures of dispersion include:

Range - this is the difference between the highest and lowest scores, e.g. in the above example of families the range would be 7 minus 0, which is 7.
Standard Deviation - this represents, essentially, an average deviation from the mean. In this case, it is 2.6 when the squared deviations are averaged over all seven cases (the n - 1 sample formula gives 2.8). This measure of dispersion is normally calculated through SPSS.

Normal (and Abnormal) Distributions

A normal distribution is a reflection of a naturally occurring distribution of values, also known as the bell curve, where the mean, median and mode are all equal, e.g. IQ scores. If this is the case, then the researcher is able to make certain assumptions about the population parameters. These assumptions enable specific methods of analysis to be used.

[Figure: a normal distribution]

However, normal distributions are something of an ideal. For example, upon examining the earlier data on number of children, we see that the mean, median and mode are not equal. Therefore we cannot make assumptions about the population parameters. In other words, non-parametric methods of analysis must be used. Typically, the measures used to represent these kinds of distributions are the median and range, as opposed to the mean and standard deviation. Often, the data you will be using will not allow you to make assumptions about the population parameters, so non-parametric methods must be used (more on this later).

Frequencies

Frequencies represent the number of occurrences for values of a given variable. If, for example, ten participants in an experiment were made up of five males and five females, then the frequencies for the values of 1 (male) and 2 (female) would both be 5, i.e. 50% of cases. Expressed as a percentage, a frequency score for a given value is the share of all the subjects/cases/participants that have that value as a score.
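The figures quoted for the seven families, and the percentage frequencies, can be checked with a short sketch. It uses Python's standard library rather than SPSS, which is an assumption of this example and not the tool this handbook relies on; the variable names are illustrative only.

```python
from collections import Counter
import statistics

# Number of children in the seven families from the example
children = [0, 0, 0, 1, 1, 5, 7]

mode = statistics.mode(children)        # most common value
mean = statistics.mean(children)        # arithmetic average
median = statistics.median(children)    # middle value of the sorted scores
value_range = max(children) - min(children)

# Two versions of the standard deviation:
# pstdev divides by n (this reproduces the 2.6 quoted in the text),
# stdev divides by n - 1 (the usual sample estimate).
sd_population = statistics.pstdev(children)
sd_sample = statistics.stdev(children)

# Percentage frequencies for the ten-participant example (1 = male, 2 = female)
sex = [1] * 5 + [2] * 5
counts = Counter(sex)
percentages = {value: 100 * n / len(sex) for value, n in counts.items()}

print(mode, mean, median, value_range)
print(round(sd_population, 1), round(sd_sample, 1))
print(percentages)
```

The two standard deviations differ noticeably here because the sample is so small; with larger samples the choice of divisor matters much less.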
There are different forms of frequency counts in SPSS, all of which are detailed in the SPSS section on descriptive statistics.

Inferential Statistics

Parametric Statistics

Briefly, traditional statistics are used after conducting an experiment testing a research hypothesis. This hypothesis is about a relationship between the independent and dependent variables. Inferential (i.e. hypothesis testing) statistical methods test it by applying the findings of the descriptive statistics discussed earlier. Thus we can infer an aspect of the characteristics of the population from the samples we take of these populations.

Take an easy example. Let us try to determine whether or not our group of MSc students are 'normal' with respect to the population of postgraduate students in the UK. We hypothesise that you are not. We obtain scores for normality from the files and observations of your behaviour, and calculate the central tendency and variation of the group. The 'N' rating for this group is 40 with a standard deviation of 10. We know that other postgraduate students have an 'N' rating of 60 with a standard deviation of 20. In terms of probability, calculations would show that the chance of the MSc students coming from the 'normal' population is 2%, therefore you are statistically unlikely to come from that population.

However, the fun begins when we:

1. try to set up the experimental designs that allow the independent variable to be manipulated to cause a change in the dependent variable, and/or
2. have to estimate the population parameters from the sample(s) because we don't know enough about the population.

So, looking at (2) first, estimating population parameters when they are unknown is done using the sample itself. 'Aha', you think, 'surely that's got to be wrong because we are testing the sample against a population which is estimated using the sample.' That's where (1) comes in - by performing certain experimental manipulations (random selection, large sample size, etc.)
we can ensure that the sample provides an unbiased estimate of the population. If we do this then the error in the sample is minimised, though never eliminated - hence the need for p values. Shall we look at this in more detail?

Ways to allow the experimental design to overcome the difficulties in estimating population parameters include, most importantly, random assignment of subjects and random assignment of treatments. This includes levels of treatments as well. The use of experimental controls is also important to ensure that the participants are not biasing the sample in any way, i.e. independent or between groups. Alternatively, and preferably, subjects may act as their own controls, i.e. dependent or within groups. Placebos are an important way to reduce subject error and experimenter bias. 'Blind' and 'double-blind' experiments are those that take this into consideration. If such things are fully accounted for, then the parameters of the population (i.e. central tendency and variation in scores) are estimated using the sample, and the accuracy of this sample is given by statistical levels of probability.

Effect Size and Other Concerns

So, we've designed the experiment adequately and we've gathered the data bearing in mind all those things described above. Now we must check that the statistics we perform on the data are capable of rejecting or accepting the hypotheses we proposed. This depends on the effect size of the manipulation. The risk of falsely rejecting the null hypothesis when it is in fact true is traditionally set at 5%, i.e. α = 0.05. Traditional statistics is very conservative, and has a morbid fear of rejecting the null hypothesis when it is in fact true. In more applied settings, the ability of a test to be sensitive enough to reject the null hypothesis when it is in fact not true is also important. The risk of failing to do so is not usually mentioned, but is implicitly assumed to be 20%, i.e. β = 0.20.
These levels play a strong part in the effect size, as does the direction of the hypothesised relationship being one or two-tailed. Other influences on effect size include sample size. Often, this is limited by the number of subjects available, though ideally the sample size should be determined by the desired effect size. The other determining influence is the statistical test used. For more details on the theories underlying statistics, consult a statistics book, such as Howell.

Fundamentals of Statistical Testing

All parametric statistical tests have one basic idea in common: each produces a test statistic (t, F, etc.) that is associated with a significance value given the size of the sample. This statistic is a summary of the following ratio:

test statistic = (amount of systematic variation) / (amount of error variation)

Systematic variation comes from the (desirable) effect of the manipulation of the independent variable and error variation comes from the (undesirable) effect of error-ridden noise. Hence the larger the error is in sampling, the more powerful the manipulation of the independent variable must be to create a 'significant' effect. A sensible way to obtain a 'good' test statistic is to reduce the error in the sample (the denominator in the equation), though many psychologists prefer to have HUGE samples and increase the systematic variation (the numerator in the equation).

Which Statistical Test?

Parametric inferential tests can be divided based on the design of the experiment, the number of conditions being tested and the number of levels of study.

Designs can be of two types: between subject and within subject. The former is when you divide subjects into independent groups, such as on the basis of gender, or into one group that receives a drug, and a second that receives a placebo. Within subject designs are when all subjects are subjected to all conditions, e.g. testing reaction times before and after receiving a drug.
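As a rough illustration of the systematic-to-error ratio for a between subject design, the sketch below computes an independent-samples t by hand. The pooled-variance formula is the standard textbook one rather than something spelled out in this handbook, and the scores and the function name `independent_t` are invented for the example.

```python
import math
import statistics


def independent_t(group_a, group_b):
    """Independent-samples t: difference in means over its standard error."""
    n_a, n_b = len(group_a), len(group_b)
    # Numerator: systematic variation (difference between the group means)
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    # Denominator: error variation (pooled within-group variance)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return mean_diff / standard_error


# Hypothetical scores, invented for illustration only
drug = [4, 5, 6]
placebo = [1, 2, 3]
noisy_placebo = [0, 2, 4]   # same mean as placebo, but more spread

t_clean = independent_t(drug, placebo)
t_noisy = independent_t(drug, noisy_placebo)
print(round(t_clean, 2), round(t_noisy, 2))
```

The difference between the group means is the same in both comparisons; only the error variation changes, and the noisier comparison yields the smaller t.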
The number of conditions is merely how many "tests" you administer for an independent variable. So, in the above example, the between subjects design would have two conditions (drug and placebo). The within subjects design would also have two (before and after drug). For two conditions, you run a t-test. For three or more, you run an ANOVA.

Finally, the design can have multiple levels, e.g. two independent variables of drug and placebo and participant gender, creating four combinations. Different levels can also result in mixed designs. An example could have a between subjects independent variable (gender) and a within subjects IV (the test-retest of reaction times).

Data Level                              Design
                                        (Between Subjects)   (Within Subjects)
Nominal                                 Chi Squared          Sign
Ordinal                                 Mann-Whitney         Wilcoxon
Ratio/Interval (2 conditions)           Unrelated T          Related T
Ratio/Interval (3 or more conditions)   Unrelated ANOVA      Related ANOVA

Nonparametric Statistics

Unlike parametric statistics, which (as mentioned before) test hypotheses about specific population parameters, nonparametric statistics test hypotheses about such things as similarity of distributions or the measures of central tendency. It is important to note that the assumptions for these tests are weaker than those for parametric tests, so the results are not as powerful. On the other hand, there are a lot of analyses where parametric tests are not particularly appropriate, e.g. situations with very unequal sample sizes.

In Investigative Psychology, significant amounts of your data will not be of a nature that lends itself to parametric tests. Data quality and experimental control are not among our strong points, but this is not a weakness in our research as long as we are aware of the limitations and act accordingly. Nonparametric tests are one of the ways in which we try to deal with our problematic data.
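The test-selection table above can be read as a simple lookup, sketched here in Python. The function name `choose_test` and the lower-case labels are this sketch's own; the table gives no separate rows for three or more conditions at the nominal and ordinal levels, so the sketch applies the number of conditions only to interval and ratio data.

```python
# Nominal and ordinal rows of the table, keyed by (data level, design)
TEST_TABLE = {
    ("nominal", "between"): "Chi-square",
    ("nominal", "within"): "Sign",
    ("ordinal", "between"): "Mann-Whitney",
    ("ordinal", "within"): "Wilcoxon",
}


def choose_test(level, design, conditions=2):
    """Return the test the table suggests for a data level and design."""
    if design not in ("between", "within"):
        raise ValueError("design must be 'between' or 'within'")
    if conditions < 2:
        raise ValueError("at least two conditions are needed")
    if level in ("interval", "ratio"):
        # Two conditions: t-test; three or more: ANOVA
        family = "t-test" if conditions == 2 else "ANOVA"
        prefix = "Unrelated" if design == "between" else "Related"
        return f"{prefix} {family}"
    if (level, design) not in TEST_TABLE:
        raise ValueError("unknown data level: " + str(level))
    return TEST_TABLE[(level, design)]


print(choose_test("ordinal", "between"))
print(choose_test("ratio", "within", conditions=3))
```

A lookup like this only encodes the table; it does not check whether the data actually meet the distributional assumptions discussed earlier.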
Referring back to the previous table, there are 3 basic tests listed (we won't go into Sign here, it'll be in most statistics books) - Chi-square, Mann-Whitney and Wilcoxon. In addition, there are ANOVAs for nonparametric testing of more than two conditions.

Data Level       Design
                 (Between Subjects)   (Within Subjects)
Nominal          Chi Square           Sign
Ordinal          Mann-Whitney         Wilcoxon
Ratio/Interval   Unrelated T          Related T

Chi-squared tests look at associations between variables, while the nonparametric t-tests and ANOVAs examine differences in shape or location of the populations.

Chi-square Tests

Essentially, the Chi-square test uses frequency scores for a variable or variables to determine whether the actual observed frequencies (those that are recorded) are different from those we would expect if there were no differences between the values, in a between-subjects design. The closer the observed frequencies are to the expected ones, the lower the value of the Chi-square. If the two are similar enough, this indicates that no significant difference exists between the values.

Using the Chi-square with Crosstab

Often, the test is used in conjunction with doing a crosstab, which indicates the frequency counts for each combination of values for the two variables. In the table below, there are two variables, both with two values (present/not present). The frequencies of occurrence for each of the four possible combinations of values are listed (e.g. blindfolding and threats to not report co-occurred 10 times).

                                  Threat - No Report
                                  Present   Absent/not recorded
Blindfold   Present               10        5
            Absent/not recorded   5         5

A Chi-square might reveal that there is a significant difference between the cells, and examining the table would suggest that the difference lies in how often these behaviours co-occur versus when they occur alone or when neither occurs.

The Mann-Whitney

This is the nonparametric equivalent of the independent samples t-test.
The major difference is that this test looks at the ranks of the scores for the two distributions, regardless of which group they belong to, rather than the actual scores. Ranking is a pretty straightforward concept. Looking at the table below, we see that there are four scores for age, one of which would appear to be an extreme outlier and so skews the distribution (making it far from normal). If we rank the scores (listed in brackets beside the actual scores) the rank of age 2 is 4. The scores for age shift, by ranking, from interval to ordinal data and the effect of the extreme outlier is eliminated.

        Age 1    Age 2    Age 3    Age 4
Score   24 (1)   78 (4)   28 (3)   27 (2)

In the case of the Mann-Whitney, all the scores for both samples are listed together and ranked. If there is a difference between the two distributions, then there will be some sort of significant ordering effect in the ranking (i.e. a significant portion of one of the two samples will make up the lower ranks, rather than a random mix). The null hypothesis of no differences between the two samples will be retained if there is no significant difference. The actual results will depend on such things as sample sizes, but SPSS will adjust itself accordingly.

The Wilcoxon T-test

This, unsurprisingly, is the nonparametric equivalent of the dependent samples t-test. Ranks in this case are calculated based on the differences between the two scores for each subject over the two conditions, e.g. if one subject scored 3 acts of aggression before taking speed and 6 after, the difference score would be -3. These differences are then ranked, ignoring the sign, and then the statistics are carried out to identify whether the two conditions differ.

Kruskal-Wallis One-Way ANOVA

Used, as with the parametric ANOVA, when a variable has more than two levels (independent of each other), the KWANOVA tests for differences using the ranks of the scores rather than the actual scores.
Like the ANOVA, the KWANOVA is a test for differences in the averages of the values, but these averages are drawn from the relative ranking, rather than the actual scores. Again, a significant result indicates that differences do exist. As far as I can tell, SPSS does not have a post-hoc test option for KWANOVA, which means you'll have to do it by hand. Just find yourself a good statistics book and the information you need should lie within. This ANOVA, and its equivalent measure for related samples, are described in the SPSS section below.

Correlations and Associations

When trying to determine the relationship between two variables, graphing each case using the scores as x and y co-ordinates can give you something of an initial impression of what associations may be occurring. However, to statistically test the relationship - to see how strong it is, in a sense - you need to determine how correlated they are. In a nutshell, the results will show to what degree the scores of two variables relate to one another. The more they coincide, the stronger the degree of association between the two.

In general, the correlation coefficients you will use are appropriate in situations where there is, to some degree, a linear relationship between the two variables. If the relationship is strongly curvilinear (i.e. if you plotted the two variables and the line did a crazy zigzag pattern across the graph), then there are alternatives, which we won't go into here.

For most purposes, you will use one of two correlation coefficients - Pearson's Product Moment and Spearman's Rank Order. Deciding between the two is fairly easy. If you are using an ordinal scale, Spearman's is the one to use. If the variables are interval, and the actual plot of the variables is weakly curvilinear (not a straight line, but generally when x goes up, y goes up, just to varying degrees), you use Spearman's. If the variables are interval, and the graph is linear, then you use Pearson's.
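To make the choice concrete, here is a minimal sketch that computes both coefficients in plain Python, treating Spearman's as Pearson's applied to the ranks of the scores. The data and the helper names (`pearson`, `ranks`, `spearman`) are invented for illustration; in practice SPSS does this for you.

```python
import math


def pearson(x, y):
    """Pearson's product moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank from 1 (lowest); tied values share the average of their ranks."""
    ordered = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over any run of tied values
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[ordered[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank order correlation: Pearson's on the ranked scores."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y always rises with x, but not in a straight line
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r = pearson(x, y)
rho = spearman(x, y)
print(round(r, 3), round(rho, 3))
```

With these weakly curvilinear data Spearman's reaches 1 because the ordering is perfectly preserved, while Pearson's falls slightly short because the points do not lie on a straight line.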
You can easily run both at the same time, so you might as well. However, it's important to understand which one of the two is more appropriate for your analysis, so that you include the right one in your assignments or dissertations. Pearson's for linear interval or ratio data, Spearman's for the rest.

With all the correlations, you will end up with a score between +1.00 and -1.00. The best way to think of what it means is to split it into two parts. The sign of the coefficient indicates whether the relationship is positive (+) or negative (-). The former means that as x increases in value, so does y. The latter means that as the value of x increases, the value of y decreases (or, as y goes up, x goes down). The size of the coefficient, ignoring the sign, represents how powerful the relationship is. A score of 1.00 (with a + or - sign) would represent a perfect correlation, while a value of, say, +0.85 would be very strong - as x increases, so does y to a roughly equivalent degree. A value of 0.00 would indicate that there is no relationship between the two.

Some warnings

1. Remember that the coefficients indicate a degree of association, but not causality. Unless you have strong theoretical reasons to indicate such, you cannot clearly state that x influences y. It could be the case that it is y influencing x, or even that a third variable, z, exists that is influencing both.
2. A number of factors can influence a non-ranking correlation: the use of extreme groups, the presence of outliers, combining different groups and curvilinearity. Each of these can lead to inaccurate findings.

Pearson's Product Moment Correlation Coefficient

I won't bore you with detailed descriptions of covariance and such. As usual, if you want a deeper understanding of the inner workings of this procedure, you'll have to find a book on it. The value that is of importance to you is the squared correlation coefficient (r²).
This indicates how much of the variance in y can be accounted for by x - their common variance, as it were. Note that since r is between 1.00 and -1.00, r² is never larger than the size of r ignoring its sign, and is smaller unless r is 0 or a perfect 1.00.

Spearman's Rank Order Correlation Coefficient

This is basically the nonparametric version of Pearson's, by way of ranking the scores for the two variables, rather than using the raw scores themselves. Interpretation of the results is the same.

Other Measures of Association

1) Dichotomous Data:

Jaccard's
Jaccard's is the appropriate measure of association to use for dichotomous variables where mutual non-occurrence does not indicate anything about the relationship between the two variables. This is typically the case in content analysis, e.g. using police records.

Yule's Q/Guttman's Mu
This is the best measure to use if you do know that mutual non-occurrence does indicate something about the relationship between the two variables.

There was a tendency last year to automatically run SSAs using Jaccard's. This was appropriate most of the time, but there were times when Mu could have been used instead. Keep in mind that Jaccard's is the weakest of all possible measures of association - it is used for the type of variables with the least information (dichotomous) and then does not use all the information available from the variables. If you are using materials that aren't subject to the problems police records suffer, e.g. if you are doing analyses on a drug abuser's personal diaries, where you know whether the variables are present or not, then use Guttman's Mu.

2) Ordinal Data:

Kendall's Tau/Guttman's Mu
Use both of these for non-metric analyses (e.g. non-metric SSA). Use the former when you have equal numbers of categories between the two variables, and the latter, which is weaker, when you have unequal numbers.

3) Interval data:

Pearson's for parametric analyses (see above). Alternatively, in SSA you can use Guttman's Mu for non-parametric analyses of interval data.
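The difference between ignoring and using mutual non-occurrence can be sketched with the blindfold/threat crosstab from the Chi-square section. The formulas below are the standard textbook definitions of Jaccard's coefficient and Yule's Q, which this handbook does not spell out, so treat them as an assumption of the example; the function names are illustrative.

```python
def jaccard(both, only_first, only_second):
    """Jaccard's coefficient: co-occurrences over cases where either occurs.

    Cases where neither variable is present are left out entirely.
    """
    return both / (both + only_first + only_second)


def yules_q(both, only_first, only_second, neither):
    """Yule's Q: uses all four cells, including mutual non-occurrence.

    Undefined (division by zero) if both cross-products are zero.
    """
    concordant = both * neither
    discordant = only_first * only_second
    return (concordant - discordant) / (concordant + discordant)


# Cell counts from the blindfold / threat-not-to-report crosstab
both_present = 10      # blindfold and threat
blindfold_only = 5     # blindfold, no threat recorded
threat_only = 5        # threat, no blindfold recorded
neither_present = 5    # neither recorded

j = jaccard(both_present, blindfold_only, threat_only)
q = yules_q(both_present, blindfold_only, threat_only, neither_present)
print(round(j, 2), round(q, 2))
```

Changing `neither_present` leaves Jaccard's untouched but moves Yule's Q, which is why the choice between them rests on whether a joint absence in the records is informative.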