obvious: there are more women than men, and the peak for men occurs at a greater height than for women (about 1.80 m compared with 1.62 m).

The bins or intervals on the horizontal X-axis of the histogram can be labelled in a variety of ways. The bars may be labelled using the midpoint of the corresponding interval, or by placing a label at the start (or end) of the interval, as in Figure 4.6. For histograms, we recommend that you label the horizontal axis at the start (or end) of each interval, since this makes it easier to work out the width of the interval (as in Figure 4.6). Some intermediate interval labels can be omitted to avoid cluttering the scale, without any noticeable loss of clarity, as in Figure 4.6b.

Figure 4.6 (Continued.) Panel (c): histogram with X-axis 'Height in metres' and Y-axis 'Frequency'.

A useful feature of a histogram is that it is possible to assess the distributional form of the data; in particular, whether the data are approximately Normally distributed or are skewed. The Normal distribution (sometimes known as the Gaussian distribution) is one of the fundamental distributions of statistics, and the histogram of Normally distributed data will have a classic 'bell' shape, with a peak in the middle and symmetrical tails, such as that for the height of women in Figure 4.7b. Skewed data are data which are not symmetrical: positively skewed data have a peak at lower values and a long tail of higher values (Figure 4.8), while conversely negatively skewed data have a peak at higher values and a long left-hand tail of lower values (see Figure 4.9).

Figure 4.7 Separate histograms for the heights of men and women [3]: (a) for men (n = 77) and (b) for women (n = 145). Both panels: X-axis 'Height in metres', Y-axis 'Frequency'.

Histograms are similar to bar charts in that the variable of interest is displayed on the horizontal axis (X-axis) and the frequencies are displayed on the vertical axis (Y-axis). However, bar charts are used for discontinuous data, where the categories are entirely separate, while histograms are used for continuous data. Thus bar charts have gaps between the categories on the horizontal axis in order to emphasise that the categories are completely separate, whereas there are no spaces between the bins of a histogram, as the bins are adjoining intervals of a continuous scale whose width can be set by the investigator.

The count data for the number of deaths from SIDS per day in Table 4.1 could also be displayed as a histogram. This is because there are a large number of categories (14) of deaths per day, and it is reasonable to treat such discrete count data as if they were continuous, at least as far as the statistical analysis goes. However, we would recommend that count data be displayed using bar charts as opposed to histograms, as the gaps between the bars will emphasise that the categories represent discrete whole numbers and cannot take intermediate values (e.g. it is not possible to have 1.3 SIDS deaths per day).

Figure 4.8 Positively skewed data – histogram of baseline ulcer area (cm²) from leg ulcer trial (n = 217). [3] X-axis 'Baseline ulcer area (cm²)', Y-axis 'Frequency'.
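The recommendation to label histogram intervals at their start can be illustrated with a short sketch. The code below is only an illustration under stated assumptions: the function name histogram_counts is hypothetical, the heights are invented values (not the leg ulcer trial data), and the bins start at 1.40 m with a width of 0.06 m. Heights are held as whole centimetres so that the bin arithmetic is exact.

```python
def histogram_counts(values_cm, start_cm, width_cm, n_bins):
    """Count values in equal-width bins [left, left + width).

    Returns a list of (left_edge_cm, count) pairs, one per bin,
    so each bin is identified by the start of its interval.
    """
    counts = [0] * n_bins
    for v in values_cm:
        i = (v - start_cm) // width_cm  # index of the bin containing v
        if not 0 <= i < n_bins:
            raise ValueError(f"value {v} lies outside the binned range")
        counts[i] += 1
    return [(start_cm + i * width_cm, c) for i, c in enumerate(counts)]


# Hypothetical heights in whole centimetres (illustrative only).
heights_cm = [158, 161, 162, 163, 165, 166, 168, 170, 171, 176, 178, 183]

# Assumed bins: ten intervals of 6 cm starting at 140 cm (1.40 m).
bins = histogram_counts(heights_cm, start_cm=140, width_cm=6, n_bins=10)

# Text histogram: each bar is labelled with the start of its interval in metres,
# so the bin width is the difference between consecutive labels.
for left, count in bins:
    print(f"{left / 100:.2f} | {'#' * count}")
```

Because each label is the left edge of its interval, the bin width can be read off directly as the difference between two neighbouring labels, which is the advantage of this labelling scheme noted in the text.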
Figure 4.9 Negatively skewed data – histogram of baseline social functioning from leg ulcer trial (n = 233). [3] X-axis 'SF-36 Social functioning: baseline', Y-axis 'Frequency'.

4.6 Box–whisker plots

Another extremely useful method of plotting continuous data is a box-and-whisker plot, or box plot. This is described in detail in Figure 4.10. As with dot plots, box plots can be particularly useful for comparing the distribution of the data across several groups.

The box contains the middle 50% of the data, with the lowest 25% of the data lying below it and the highest 25% of the data lying above it. In fact, the lower and upper edges of the box are the lower and upper quartiles, and the distance between them is a particular quantity called the interquartile range. The horizontal line in the middle of the box represents the median value, as described in Section 4.4. The whiskers extend to the largest and smallest values, excluding the outlying values. The outlying values are defined as those values more than 1.5 box lengths from the upper or lower edges, and are represented as the dots outside the whiskers.

Figure 4.10 Annotated box plots of height for the leg ulcer patients by sex, showing what each of the items displayed means. [3] Groups: men (n = 77) and women (n = 145). Annotations: whiskers extend to the last observations within 1.5 times the box height; outlying values are observations more than 1.5 times the box height away from the upper side of the box; median; height of box is the interquartile range, with the lower limit at the 25th percentile and the upper limit at the 75th percentile.

Figure 4.10 shows box plots of the heights of the men and women in the leg ulcer trial. As with dot plots, the gender differences in height are immediately obvious from this plot, and this illustrates the main advantage of the box plot over histograms when looking at multiple groups. Differences in the distributions of data between groups are much easier to spot with box plots than with histograms.
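The quantities drawn in a box plot can be computed directly from the definitions above. The following is a minimal sketch, not the method used to draw Figure 4.10: the function name box_plot_summary is hypothetical, the heights are invented values, and the choice of quartile convention (the 'inclusive' interpolation of Python's statistics.quantiles) is an assumption, since statistical packages differ on how quartiles are calculated.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the quantities shown in a box plot, using the 1.5 box-length rule."""
    data = sorted(values)
    # Assumption: quartiles by 'inclusive' linear interpolation; other
    # conventions give slightly different box edges.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box (interquartile range)
    lower_limit = q1 - 1.5 * iqr  # 1.5 box lengths below the lower edge
    upper_limit = q3 + 1.5 * iqr  # 1.5 box lengths above the upper edge
    inside = [v for v in data if lower_limit <= v <= upper_limit]
    outliers = [v for v in data if v < lower_limit or v > upper_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers reach the smallest and largest values that are not outlying.
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Hypothetical heights in centimetres (illustrative only).
heights_cm = [150, 155, 158, 160, 162, 163, 165, 168, 170, 195]
summary = box_plot_summary(heights_cm)
print(summary)
```

With these invented values, the single height of 195 cm lies more than 1.5 box lengths above the upper edge of the box, so it would be plotted as a separate dot, and the upper whisker would stop at 170 cm.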
As a result of what they display (median, interquartile range, spread), box plots provide a good summary of the data and are more useful than dot plots for larger datasets, where a dot plot would look rather busy.

Summary
• Display univariate count data using bar charts as opposed to histograms, unless the number of categories is large enough for the data to be treated as approximately continuous, in which case a histogram can be used.
• Always display continuous data as dot plots if the sample size per group is low (≤100 subjects).
• For univariate data a stem and leaf plot can be useful, since all the data are available in the chart.
• Use histograms to show the distribution of single variables.
• To compare groups, for larger samples (say >50 subjects per group) use box–whisker plots.