Ho Chi Minh City University of Technology
Faculty of Computer Science and Engineering

Probability & Statistics Project Report
A Poultry Dataset Analysis

Lecturer: Prof. Nguyen Tien Dung

Team members:
Nguyen Nho Gia Phuc – 2052214
Tran Pham Minh Hung – 2053067
Le Khanh Duy – 2052003
Pham Nguyen Nam Khoa – 2052009
Nguyen Minh Khoa – 2052538

Ho Chi Minh City, May 25, 2022


Contents
1 Introduction
  1.1 Data description
  1.2 Steps to analyze the given dataset
2 Data Processing and Visualization
3 Data analysis
  3.1 Population average weight test
  3.2 Evaluation of feed effect on chicks' weight
4 Normality test
  4.1 Q-Q plots
  4.2 Shapiro-Wilk test


1 Introduction

1.1 Data description

The dataset chicken_feed.csv provides weight measurements and the feedstuff fed to a total of 71 chickens. The main variables in the dataset are as follows:

• weight: the weight measurement of a chick. Early observation shows some entries with N/A data.
• feed: the feedstuff fed to a chick. There are six items in total: horsebean, linseed, soybean, sunflower, meatmeal, and casein. The data entries are grouped by feed item in the order just listed.

Further description is given in the following sections.

1.2 Steps to analyze the given dataset

• Data import – chicken_feed.csv
• Data cleaning: clearing out entries with N/A information
• Summarization of data – before and after cleaning
• Data visualization
• t-test to compare the chosen data with results of other research
• One-way ANOVA to assess the effect of the feedstuff on the chickens' weight
• Normality test using Q-Q plots and the Shapiro-Wilk test

The call setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) is included at the top of our R script for better code portability. When the dataset and the script are in the same folder, this call sets the working directory to wherever the R script is, avoiding the need to hard-code a dataset path that is specific to the particular computer running the analysis. For this solution to work, RStudio is required, since one of its APIs is called.

2 Data Processing and Visualization

Null data can affect statistical calculations, so data processing aims to clear out null entries that provide little to no use to our analysis. The way data collection was done on the chicken_feed.csv dataset makes it convenient to identify such missing pieces: in the weight column, some entries are marked N/A (Not Available). With the help of R, we can remove them. First, the dataset is imported into RStudio:

    # load data
    setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
    chickFeedData <- read.csv("chicken_feed.csv")

Listing 1: Loading the dataset

    > summary(chickFeedData)
           X            weight           feed
     Min.   : 1.0   Min.   :108.0   Length:71
     1st Qu.:18.5   1st Qu.:206.0   Class :character
     Median :36.0   Median :260.0   Mode  :character
     Mean   :36.0   Mean   :263.6
     3rd Qu.:53.5   3rd Qu.:325.0
     Max.   :71.0   Max.   :423.0
                    NA's   :  2

Listing 2: Summary of imported dataset

We can see that the weight column contains two NA's, so we aim to clear those entries. This is done with the na.omit() function, which returns the data frame without any rows that contain NA values.
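Before the rows are purged, it can help to see exactly which ones are incomplete. The short sketch below is not part of the original script; it reuses the chickFeedData object loaded above and relies only on base-R helpers (is.na() and complete.cases()).

    # count how many weight entries are missing (two are expected here)
    sum(is.na(chickFeedData$weight))

    # show the offending rows before they are removed
    chickFeedData[!complete.cases(chickFeedData), ]

    # keeping only the complete rows has the same effect as na.omit()
    manualClean <- chickFeedData[complete.cases(chickFeedData), ]
    nrow(manualClean)   # 69 rows remain

Both routes leave the same 69 complete observations.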
The na.omit() call purges the incomplete records in one step:

    cleanData = na.omit(chickFeedData)
    summary(cleanData)

Listing 3: Eliminating NA entries using na.omit()

Once again, we use the summary() function to see what has changed in the cleaned version of our dataset.

    > summary(cleanData)
           X             weight           feed
     Min.   : 1.00   Min.   :108.0   Length:69
     1st Qu.:20.00   1st Qu.:206.0   Class :character
     Median :37.00   Median :260.0   Mode  :character
     Mean   :36.72   Mean   :263.6
     3rd Qu.:54.00   3rd Qu.:325.0
     Max.   :71.00   Max.   :423.0

Listing 4: Summary of cleaned dataset

Notice that the NA entries have been cleared and the number of observations has dropped from 71 to 69 (see the Length of the feed column). It is also worth looking at each feed item individually by subsetting the data on the feed variable and then calling summary() on each subset. The source code is as follows:

    horsebean = subset(cleanData, feed == "horsebean")
    # ... the same is done for linseed, soybean, sunflower and meatmeal ...
    casein    = subset(cleanData, feed == "casein")
    summary(casein)

Listing 5: Subsetting the data by feed item

    > summary(casein)
           X             weight           feed
     Min.   :60.00   Min.   :216.0   Length:12
     1st Qu.:62.75   1st Qu.:277.2   Class :character
     Median :65.50   Median :342.0   Mode  :character
     Mean   :65.50   Mean   :323.6
     3rd Qu.:68.25   3rd Qu.:370.8
     Max.   :71.00   Max.   :404.0

Listing 6: Information of each feed item's data

Let us plot the proportions of the feed items with a 3D pie chart: we first install and load the two libraries plotrix and colorspace, then choose a color palette and plot the chart with a legend table. The percentage of each item is calculated and used as the label through piepercent, with the numbers rounded.

    library(plotrix)
    library(colorspace)

3.2 Evaluation of feed effect on chicks' weight

    > summary(chickenFeedData.aov)
                Df Sum Sq Mean Sq F value   Pr(>F)
    feedType     5 225012   45002    15.2 8.83e-10 ***
    Residuals   63 186511    2960
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With the p-value of 8.83e-10 being much smaller than the significance level α = 0.05, we reject the null hypothesis H0, thereby supporting the alternative hypothesis H1, i.e. the feed has an effect on the chicks' weight. However, the test does not tell us which feed actually imparts stronger (or weaker) growth.

Post-hoc comparison

It remains our task to analyse the statistics from the ANOVA test in order to find out which treatment yields the best effect; this process is called post-hoc comparison. It is usually achieved by computing a specific measure, the least significant difference (LSD), and comparing it to the difference of each pair of treatment means. Since this is an unbalanced ANOVA design, the function LSD.test usually employed to calculate the LSD does not work properly here. Calculating the LSD without a supporting function is tedious because of the recalculation needed for every pair of treatments we wish to compare. This can be eased by employing a single LSD formula that does not affect the power of the method when the treatment sizes are close to each other, as shown in recent research [1]:

    LSD_d = t_{α/2, df} × √(2 · MSE / n̄)

where:

• LSD_d is the LSD for differing numbers of replications;
• n̄ is the mean number of sample observations per treatment;
• t_{α/2, df} is the critical value for a confidence level of 100(1 − α)% (two-sided) and df degrees of freedom;
• MSE is the mean squared error.

Both df and MSE can be taken from the summary of chickenFeedData.aov.
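The calculation of the LSD then proceeds from these quantities. The code below is a minimal sketch of how it could look, not code reproduced from the original script: the names chickenFeedData.aov and feedType follow the ANOVA output above, while converting feed into a factor and reading df and MSE through df.residual() and deviance() are our own assumptions.

    # assumed set-up: feed converted to a factor, matching the feedType term above
    cleanData$feedType <- as.factor(cleanData$feed)

    # one-way ANOVA of weight on feed type
    chickenFeedData.aov <- aov(weight ~ feedType, data = cleanData)

    # residual degrees of freedom and mean squared error from the fitted model
    df  <- df.residual(chickenFeedData.aov)    # 63 in the table above
    mse <- deviance(chickenFeedData.aov) / df  # about 2960 in the table above

    # mean number of observations per treatment (69 chicks over 6 feeds = 11.5)
    n.bar <- mean(table(cleanData$feedType))

    # LSD for unequal replication, following the cited formula [1]
    alpha <- 0.05
    lsd.d <- qt(1 - alpha / 2, df) * sqrt(2 * mse / n.bar)
    lsd.d

    # pairs of treatment means whose absolute difference exceeds lsd.d
    # are declared significantly different
    treatmentMeans <- tapply(cleanData$weight, cleanData$feedType, mean)
    abs(outer(treatmentMeans, treatmentMeans, "-")) > lsd.d

Comparing every pairwise difference of treatment means against the single lsd.d value is what makes the averaged-replication formula convenient for this unbalanced design.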
4.2 Shapiro-Wilk test

Notice that cleanData.W is larger than casein.W (the latter being 0.91663), yet cleanData.p < casein.p (0.2585 < 0.2592). This seems to contradict our belief that the further the statistic deviates from 1, the smaller its p-value becomes. The idea is almost correct; however, the sample size N also affects W, and here cleanData.N = 69 while casein.N = 12. As N grows large, the distribution of W clumps towards 1, as shown in Figure 5. Consequently, for large samples, W does not have to be much smaller than 1 for the test to be significant.

Figure 5: Distribution of W based on sample sizes N

References

[1] Al-Fahham, A. A. (2018). Development of new LSD formula when numbers of observations are unequal. Open Journal of Statistics, 8(2), 258–263.
[2] Havenstein, G. B. (2006). Performance changes in poultry and livestock following 50 years of genetic selection. Lohmann Information, 41, 30–37.
[3] R color cheatsheet – National Center for Ecological Analysis and Synthesis (n.d.). Retrieved May 11, 2022, from https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf
[4] Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3/4), 591–611.
[5] Razali, N. M., & Wah, Y. B. (2011). Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics, 2(1), 21–33.
[6] Zuidhof, M. J., Schneider, B. L., Carney, V. L., Korver, D. R., & Robinson, F. E. (2014). Growth, efficiency, and yield of commercial broilers from 1957, 1978, and 2005. Poultry Science, 93(12), 2970–2982.
[7] 60. Query (1948). Biometrics, 4(3), 213–215. https://doi.org/10.2307/3001566