(ESSAY) RMIT International University Vietnam ECON1193B – Business Statistics 1

RMIT International University Vietnam
ECON1193B – Business Statistics 3A

Subject code: ECON1193B
Subject name: Business Statistics
Campus: RMIT Hanoi Campus
Student names: Ngo Nguyen Phu, Dang Dinh Nguyen Vu, Duong Tri Dung
Student numbers: S3877298, S3819269, S3877892
Teacher: Pham Tran Minh Trang
Word count: 3954

Table of Contents
PART 1: DATA COLLECTION
PART 2: DESCRIPTIVE STATISTICS
    Check for outliers
    Measure of Central Tendency
    Measure of Variation
    Box and Whisker Plot
    Conclusion
PART 3: MULTIPLE REGRESSION
    Region A
    Region B
PART 4: MULTIPLE REGRESSION CONCLUSION
PART 5: TIME SERIES
    5.1 Regression Output
    5.2 Recommendation for the most suitable trend model for both regions
    5.3 Prediction
PART 6: TIME SERIES CONCLUSION
    6.1 Line chart
    6.2 Analysis
PART 7: OVERALL CONCLUSION
    7.1 Main factors that impact the number of COVID-19 deaths
    7.2 Predict the number of deaths in the world on Oct 31, 2021
    7.3 Analysis on whether global deaths will be reduced by the end of 2021
    7.4 Two variables that might impact the number of Covid deaths in the world

Since Covid-19 first appeared in China at the end of 2019, it has become a serious phenomenon with great impacts on many fields, including people's lifestyles, the economy, and medicine, and it has had many negative effects on global health goals (WHO 2021). Covid-19 spreads remarkably fast and has therefore reached almost every corner of the world, including the two regions discussed below, especially Africa, which has recorded over 7,075,119 cases and seen the emergence of a new variant of the virus (Saifaddin G 2021). This report takes a closer look at two specific regions, Africa and the Middle East & Oceania, by investigating the relationship between the number of Covid-19 deaths in each region and related factors, analyzing trend models, and making some predictions.

PART 1: DATA COLLECTION

This section provides the data that is crucial for further analysis
in the parts that follow. Region A, Africa, has 54 nations (Andrew W 2021), while Region B (the Middle East and Oceania) has only 31. However, because information could not be collected for some countries, the datasets consist of only 10 Middle East and Oceania countries and 25 African countries. Six variables are measured: population, average rainfall over the first four months, average temperature over the first four months, hospital beds per 10,000 people, medical doctors per 10,000 people, and the total number of Covid-19 deaths in each country over the period 22 January to 24 April.

PART 2: DESCRIPTIVE STATISTICS

Region | Mean | Median | Range | IQR | Variance | SD | CV
Africa | 31.5 | 16.75 | 318.38 | 22.81 | 4147.17 | 64.40 | 204.47%
Middle East & Oceania | 138.49 | 74.73 | 491.9 | 214.725 | 18930.454 | 137.58799 | 94.25%

Table 1: Africa and Middle East & Oceania dataset descriptive statistics

Check for outliers

In the Africa dataset, no observations are smaller than the lower bound (Q1 - 1.5 IQR), but two observations are larger than the upper bound (Q3 + 1.5 IQR). As a result, this dataset contains two outliers. In the Middle East & Oceania dataset, no observations are smaller than the lower bound (Q1 - 1.5 IQR) and none are larger than the upper bound (Q3 + 1.5 IQR). As a result, this dataset has no outliers.

Measure of central tendency

Measure | Africa | Middle East & Oceania
Mean | 31.5 | 138.492
Median | 16.75 | 74.73
Mode | #N/A | #N/A

Table 2: Measure of Central Tendency in Africa and Middle East & Oceania

The mean is an appropriate measure of central tendency in this situation, since there are no outliers for region B and only two outliers for region A. The Africa and Middle East & Oceania datasets have means of 31.5 and 138.492, respectively (Table 2). That is, the average Covid-19 death rate per nation in the African sample is 31.5 per million. Similarly, for region B, Covid-19 deaths average 138.492 per million in the Middle
East and Oceania. Because the mean of the African dataset is substantially lower than that of the Middle East and Oceania dataset, the Covid-19 death rate in the Middle East and Oceania is considerably greater than in Africa. However, the presence of two outliers in Africa means that the mean could be unreliable for region A specifically, because this measure is sensitive to extreme values.

Measure of variation

Measure | Africa | Middle East & Oceania
Range | 318.38 | 491.9
IQR | 22.81 | 214.725
Standard Deviation | 64.40 | 137.58799
Variance | 4147.17 | 18930.454
Coefficient of Variation | 204.47% | 94.25%

Table 3: Measure of Variation in Africa and Middle East & Oceania

The interquartile range (IQR) is the most appropriate measure of variation in this situation, since it is unaffected by outliers and quantifies how widely the middle 50% of observations are spread. The Africa and Middle East & Oceania datasets have interquartile ranges of 22.81 and 214.725, respectively (Table 3). The lower the IQR, the more consistent the middle 50% of the data. It is therefore possible to conclude that Covid-19 fatalities per nation are less consistent in the Middle East and Oceania. To put it another way, the disparity in the number of Covid-19 fatalities per nation is greater in the Middle East and Oceania than in Africa.

Box and Whisker plots

Figure 1: Africa and Middle East & Oceania box plot

According to the box and whisker plots, both datasets are right-skewed, since the right half is longer than the left. This means that most observations in each dataset are concentrated at the lower end of the scale, with a long tail of high values. In other words, most nations (in both regions) have relatively low Covid-19 fatality rates, while a few have much higher ones. The graph also shows that one region, Africa, has outliers in its dataset, indicating extreme values that might distort sensitive measures such as the mean or the range. As a result, the reliability of such measures might be
diminished. The box plot shows that Africa's box is smaller than the Middle East & Oceania's (Figure 1), so the middle 50% of Middle East & Oceania nations are spread more widely around the median than the middle 50% of African nations. Together with the higher median in Table 2, this suggests that Covid-19 fatalities in the Middle East & Oceania are greater than in African nations. Because the middle half of a dataset is typically the most concentrated, it is probable that the number of Covid-19 fatalities varies more in the Middle East and Oceania than in African nations.

Conclusion

After examining the descriptive statistics of the two datasets, it can be inferred that from April 1st to July 31st the Middle East and Oceania nations had more Covid-19 fatalities than Africa, because Africa's mean is lower than the other region's mean. Because only Africa's dataset contains outliers, and only two of them, the mean was still used to compare both regions, although it should be read with caution for Africa. However, we determined that the range is not as suitable as the IQR for analyzing the spread of the data, due to the presence of outliers. The IQR values for Africa and the Middle East and Oceania indicate a difference in the spread of Covid-19 fatalities between the two regions. Because the IQR of the African countries is much lower than that of the Middle East and Oceania, the number of Covid-19 deaths in Africa is considered more consistent, implying that the differences in Covid-19 fatalities among Middle East and Oceania countries are greater than the differences among African countries.

Part 3: Multiple Regression

Region A (Africa):

Through the process of backwards elimination (Appendix), the final regression model for the African region includes only one independent variable, which is significant at the 5% level of significance.

a. The
regression analysis output of the final model is presented below:

b. From this data output, we obtain the regression equation:

Total Deaths from 1/4 to 31/7 (per million) = 7.450 × Hospital Beds (per 10,000) - 33.062

c. Interpretation of the significant independent variable's coefficient in the context of the paper's research topic:

According to the regression equation, one extra hospital bed per 10,000 people corresponds to an increase of 7.45 total deaths per one million people during the period from April 1st to July 31st of 2021, and vice versa. The p-value for this model is 0.0002, which is statistically significant at the 95%, 98% and 99% confidence levels. This means that the variable hospital beds per 10,000 has a very strong association with the total number of deaths.

This is a surprising discovery. Since the number of hospital beds per 10,000 is an indicator of a country's capability to provide medical care, one would assume that it has a negative relationship with the dependent variable, total deaths. One possible explanation is that healthcare systems across Africa respond to the rising Covid-19 death toll by increasing the number of hospital beds available. However, the data for hospital beds are not regularly updated for African countries, and many figures were recorded before the start of the pandemic in 2019. Therefore, this explanation is unreliable, and either further research or additional independent variables are required to draw a solid conclusion. A promising variable could be one that indicates the strength of a nation's preventive and reactive measures against Covid-19.

d. Interpretation of the coefficient of determination:

The coefficient of determination, or R-squared, of this regression model is 46.1%. This value indicates that 46.1% of the variation in total deaths is explained by the independent variable. In terms of goodness of fit, 46.1% indicates that this linear regression does not fit the sample data very well, making estimations more
unreliable. This suggests that our study might need more independent variables in order to obtain better estimation models.

Region B (Middle East/Oceania):

Through the process of backwards elimination (Appendix B), the final regression model for the Middle East and Oceania regions includes only one independent variable, which is significant at the 5% level of significance.

a. The regression analysis output of the final model is presented below:

b. From this data output, we obtain the regression equation:

Total Deaths from 1/4 to 31/7 (per million) = 246.671 - 7.856 × Population (millions)

c. Interpretation of the significant independent variable's coefficient in the context of the paper's research topic:

According to the regression equation, one million extra people in the population corresponds to a decrease of 7.856 total deaths per one million people during the period from April 1st to July 31st of 2021, and vice versa. The p-value associated with the independent variable is 0.023. Thus, we can be confident that a significant relationship exists between the independent and dependent variables at the 95% confidence level, but not at 98% or 99%.

This is also a surprising result, because a larger population should theoretically have more infections and consequently more deaths. Different nations respond to the pandemic differently in terms of social distancing and travel, which significantly affects the rate at which the virus can spread among the population. As with the sample from region A, additional data on alternative variables regarding pandemic control could provide a better regression model for predicting Covid-19 deaths.

d. Interpretation of the coefficient of determination:

The R-squared of this sample is 49.6%. This means that 49.6% of the variation in total deaths is explained by the independent variable, population. Like the sample from Region A, this linear regression does not fit the sample data very well, making estimations more unreliable, and calls for more
independent variables to be identified to improve future models.

Part 4: Team Regression Conclusion

Regression Comparison:

The regression results for region A and region B returned different significant independent variables. While the number of hospital beds per 10,000 is significant for region A, population is the significant independent variable for region B. The corresponding p-values are 0.0002 and 0.023, respectively, which shows that both models are statistically significant at the 5% level. The coefficient of determination is roughly the same for region A and region B, at 46.1% and 49.6%. This indicates that both regression models could be improved by including new variables.

According to the final regressions, the coefficients of the independent variables for region A and region B are 7.450 and -7.856. In absolute terms, the number of hospital beds per 10,000 people therefore has a slightly lower impact on total deaths than the population in millions, but the difference is small. On the other hand, the results from Part 2 show that the average total deaths for region B is much higher than for region A, 138.492 compared with 31.5. Therefore, a one-unit change in each independent variable corresponds to a larger relative change in the total deaths of region A than of region B (roughly 24% of the regional mean for region A, against roughly 6% for region B). In conclusion, while the impact of each independent variable on its respective region is roughly the same in absolute terms, the number of hospital beds per 10,000 creates a larger percentage change in region A than population in millions does in region B.

PART 5: TIME SERIES

We collected this dataset from 24 countries in region A (Africa) and from the countries in region B (Middle East and Oceania). This part analyses the trend models of daily deaths due to Covid-19 in regions A and B from April 1 to July 31, 2021. It also presents the regression output and formula of each trend model. Moreover, it recommends the most suitable trend model to predict the number of deaths due to Covid-19
in both regions A and B. Lastly, it predicts the number of deaths due to Covid-19 in both regions on September 28, September 29, and September 30, 2021.

5.1 Regression output

Region A:

Linear trend model:

Figure 2: Linear trend model for region A

Formula: Ŷ = 85.08 + 3.74T

Ŷ is the predicted number of deaths due to Covid-19 and T is the trend (time).
B0 = 85.084 is the predicted value when the trend is 0, which means that there were about 85.084 deaths due to Covid-19 on March 31.
B1 = 3.74 indicates that the number of daily deaths increased by about 3.74 per day from April 1 to July 31.

Quadratic trend model:

Figure 3: Quadratic trend model for region A

Formula: Ŷ = 228.46 – 3.18T + 0.05T²

Ŷ is the predicted number of deaths due to Covid-19 and T is the trend (time).
B0 = 228.36 means that there were about 228.36 deaths on day 0 (March 31).
B2 = 0.056 is the coefficient on T²; because it is positive, the trend curves upward, so the daily change in deaths grew over the period from April 1 to July 31.

Exponential trend model:

Figure 4: Exponential trend model for region A

Formula:
Linear format: Log(Ŷ) = 2.15 + 0.004T
Non-linear format: Ŷ = 142.202 × 1.01^T

Ŷ is the predicted number of deaths due to Covid-19 and T is the trend (time).
B0 = 142.2 means that the number of deaths on day 0 (March 31) was about 142.2.
(B1 – 1) × 100% = 1% indicates that the number of deaths due to Covid-19 increased by about 1% each day from April 1 to July 31.

Region B:

Linear trend model:

Figure 5: Linear trend model for region B

Formula: Ŷ = 10.87 – 0.11T

Ŷ is the predicted number of deaths due to Covid-19 and T is the trend (time).
B0 = 10.87 means that there were about 10.87 deaths on day 0 (March 31).
B1 = -0.11 indicates that the number of daily deaths fell by about 0.11 per day from April 1 to July 31.

Quadratic trend model:

Figure 6: Quadratic trend model for region B

Formula: Ŷ = 101.81 – 1.48T + 0.008T²

Ŷ is the predicted number of deaths due to Covid-19 and T is the trend (time).
B0 = 101.81 means that there were about 101.81 deaths on day 0 (March 31).
B2 = 0.008 indicates
that the trend curves upward: because the coefficient on T² is positive, the daily change in deaths grew over the period from April 1 to July 31.

Exponential trend model:

Figure 7: Exponential trend model for region B

Formula:
Linear format: Log(Ŷ) = 1.904 – 0.003T
Non-linear format: Ŷ = 80.26 × 0.99^T

Ŷ is the predicted number of deaths due to Covid-19 and T is the trend (time).
B0 = 80.26 means that there were about 80.26 deaths on day 0 (March 31).
(B1 – 1) × 100% = -1% indicates that the number of deaths due to Covid-19 decreased by about 1% every day from April 1 to July 31.

5.2 Recommendation for the most suitable trend model for both regions

a. Region A

We use the least squares approach, which finds the best-fitting model by selecting the model with the smallest error terms. In this case, we choose the best model using the Mean Absolute Deviation (MAD) and the Sum of Squared Errors (SSE). Using the data collected for August 1 and August 2 (days 123 and 124), we predict the number of deaths due to Covid-19 on those two days with each model and compare the predictions with the actual values.

Model | Error, day 123 | Error, day 124 | Sum of absolute errors | MAD | SSE
Linear | 400 - 545 = -145 | 469 - 548 = -79 | 145 + 79 = 224 | 224/2 = 112 | 27266
Quadratic | 400 - 593 = -193 | 469 - 603 = -134 | 193 + 134 = 327 | 327/2 = 163.5 | 55205
Exponential | 400 - 483 = -83 | 469 - 588 = -119 | 83 + 119 = 202 | 202/2 = 101 | 21050

Table 4: MAD and SSE calculation of region A

The exponential trend model is the most suitable one because it has the smallest MAD and SSE of the three.

b. Region B

Model | Error, day 123 | Error, day 124 | Sum of absolute errors | MAD | SSE
Linear | 33 + 2 = 35 | 34 + 3 = 37 | 35 + 37 = 72 | 72/2 = 36 | 2594
Quadratic | 33 - 40 = -7 | 34 - 41 = -7 | 7 + 7 = 14 | 14/2 = 7 | 98
Exponential | 33 - 23 = 10 | 34 - 23 = 11 | 10 + 11 = 21 | 21/2 = 10.5 | 221

Table 5: MAD and SSE calculation of region B

The quadratic trend model is the best one for prediction because it has the smallest MAD and SSE of the three.

5.3 Prediction

In this part, we predict the number of deaths due to Covid-19, given that
September 28, September 29 and September 30, 2021, are day 181, day 182 and day 183, respectively.

Date | Region A | Region B
September 28, 2021 | 861 | 96
September 29, 2021 | 869 | 97
September 30, 2021 | 878 | 98

Table 6: Number of deaths prediction for region A and region B

Part 6: Time Series conclusion

6.1 Line chart

Figure 8: Daily deaths due to Covid-19 from April 1 to July 31 for region A and region B

6.2 Analysis

Both regions show many fluctuations throughout the specified period. However, this result may be affected by the fact that the two regions do not have similar numbers of countries: region A has 24 countries, while region B has fewer. Furthermore, region A is best described by the exponential trend model and region B by the quadratic trend model. The number of deaths in region B is more stable and trends downward, whereas region A fluctuates more and trends upward. The number of deaths in region A peaks on July 5, 2021, with 1314 deaths, whilst region B peaks in April with 106 deaths.

According to Rousson & Goşoniu (2007), the coefficient of determination can be used for model selection because it not only provides a rationale for deciding whether a variable should be included in the model, but also gives information about what is gained or lost if the variable is kept. As a result, the coefficient of determination makes model selection easier (Rousson & Goşoniu 2007). Accordingly, to find the best model for the fatality rate due to Covid-19, we use R square, the coefficient of determination, and select the model with the highest R square value, which is also the one with the lowest error.

R Square | Region A | Region B
 | 60% | 80%

Figure: R square table for region A and region B

The table above shows that the coefficient of determination in region B is higher than in region A. We therefore select the quadratic trend model as the best one for predicting the mortality rate due to Covid-19 in the world.

Part 7: Overall team conclusion

7.1 Main factors that impact the number of COVID-19 deaths

The higher death rate of Covid-19 may be the result of numerous causes, as evidenced by the measurements above. In particular, Africa, the Middle East, and Oceania have a restricted number of hospital beds and medical doctors, which is understandable given that these are not affluent parts of the globe compared with other regions. The two regions are putting a lot of effort into dealing with the pandemic, because the coronavirus infection level is high owing to underdeveloped medical systems and a lack of competent medical staff. COVID-19 is a lethal virus that can spread quickly, so a country's medical system needs to be well prepared to deal with the pandemic. However, because of a scarcity of ICU beds and medical personnel, these systems are unable to provide good Covid-19 treatment. As a result, patients are at high risk of infection and of not being treated in time, leading to an increase in COVID-19-related fatalities. Therefore, it can be argued that the numbers of hospital beds and medical doctors have a negative relationship with a country's total deaths. For instance, in the Middle East, we can compare two nations, Kuwait and Oman. Kuwait's medical workforce is larger than Oman's, at 26.46 per 10,000 and 19.3 per 10,000, respectively (WHO 2021). Correspondingly, Oman shows a higher death rate than Kuwait, at 408.36 per million against 231.25 per million people (WHO 2021).

7.2 Predict the number of deaths in the world on Oct 31, 2021

As mentioned in Part 6, we suggested that the quadratic trend model is the most suitable one. With this trend model, we predict the number of deaths from Covid-19 in the world on October 31, 2021. We apply the quadratic formula selected in Part 6, Ŷ = 101.81 – 1.48T + 0.008T², to predict the world's number of deaths due to Covid-19. October 31, 2021 will be day 214; therefore, the result
is 3448 deaths by Covid-19. This outcome means that the number of deaths will decrease in the future. However, because this model is based on only a limited number of observed countries, this result is only approximate, not precise.

7.3 Analysis on whether global deaths will be reduced by the end of 2021

Based on the same function that is best suited for forecasting global mortality, it is fair to assume that the number of fatalities due to Covid-19 will not decrease by the end of 2021. The graph (Figure 8), whose line rises despite some fluctuations, can be used to explain the predicted upward trend in deaths. This indicates that worldwide fatalities will continue to rise until the end of 2021 and beyond. However, Covid-19 mortality has been shown to be related to a person's age and pre-existing medical conditions, which are not taken into consideration in the model. The estimate may therefore not be realistic or appropriate in real life, as it may contain errors. As a result, the information obtained should only be used as a guideline.

7.4 Two variables that might impact the number of Covid deaths in the world

Citizens' awareness can be considered one of the first crucial factors that may affect the death rate of Covid-19 in the world. There is reasonable evidence to support this statement. A country where people are strongly aware of the coronavirus, and therefore protect themselves by every available method, is in a different position from a country where citizens are unconcerned and expose themselves to the virus. For instance, Vietnam in 2020 dealt with a very different situation compared with the current year (Todd H 2021). In 2020, Vietnam recorded few deaths and fewer than 70 cases throughout the country, owing to the early lockdown conducted by the government and the high awareness of citizens, who volunteered to stay at home to avoid the virus. In contrast, in 2021, when the virus returned, the
nation's awareness had dropped, as individuals still went out regularly (Todd H 2021), resulting in a surprising number of infections (601,349 cases) and a high number of deaths (15,279 cases) from June 2021 until now (WorldOMeter 2021). Although the country is now recovering, citizens still need to be conscious of the danger of the virus as well as its new variant.

Related to individuals' awareness, the vaccination rate is also a fundamental factor when considering the death rate. According to the World Health Organization, Covid-19 vaccines are authorized only if they show an efficacy of 50% or above. As an example, consider a vaccine that has been shown to have an 80 percent efficacy rate. This indicates that individuals who received the vaccine had an 80% lower risk of developing the illness than those who received the placebo in the clinical study. This is determined by comparing the number of illness cases in the vaccinated and placebo groups. A vaccine with an efficacy of 80% does not mean that 20% of the vaccinated population will fall sick. Therefore, vaccination can be considered crucial for a nation dealing with the pandemic, as it creates safety for individuals, helps the country move closer to full recovery, and lowers the danger of the virus (WHO 2021).

REFERENCES:

2021, 'COVID-19 responsible for at least million excess deaths in 2020', World Health Organization, May, viewed 11 September 2021.

2021, 'Number of coronavirus (COVID-19) cases in the African continent as of August 8, 2021, by country', Statista, August, viewed 11 September 2021.

'Medical doctors (per 10 000 population)', World Health Organization, viewed 13 September 2021.

Rousson, V 2007, 'An R-square coefficient based on final prediction error', ScienceDirect, July, viewed 12 September 2021.

Walton, A, 'How Many Countries are there in Africa?
Wow 54!', African Overload, viewed 12 September 2021.

2021, 'Vaccine efficacy, effectiveness and protection', World Health Organization, July, viewed 12 September 2021.

2021, 'Vietnam Coronavirus Cases', WorldOMeter, September, viewed 12 September 2021.

2021, 'World Coronavirus Cases', WorldOMeter, September, viewed 12 September 2021.

Pollack, T 2021, 'Emerging COVID-19 success story: Vietnam's commitment to containment', Our World in Data, March, viewed 12 September 2021.

Appendix

Backwards Elimination Process:

Region A:

Step 1: Run regression on all variables.
Step 2: Remove Population (millions) because it has the highest p-value and is greater than 0.05, then re-run regression analysis on the remaining variables.
Step 3: Remove Annual average rainfall because it has the highest p-value and is greater than 0.05, then re-run regression analysis on the remaining variables.
Step 4: Remove Medical doctors (per 10,000) because it has the highest p-value and is greater than 0.05, then re-run regression analysis on the remaining variables.
Step 5: Remove Annual average temperature because it has the highest p-value and is greater than 0.05, then re-run regression analysis on the remaining variables.
Step 6: Since all independent variables have p-value less than 0.05, this model is kept.

Region B:

Step 1: Run regression on all variables.
Step 2: Remove Annual average temperature because it has the highest p-value and is greater than 0.05, then re-run regression analysis on the remaining variables.
Step 3: Remove Annual average rainfall because it has the highest p-value and is greater than 0.05, then re-run regression analysis on the remaining variables.
Step 4: Remove Hospital beds (per 10,000) because it has the highest p-value and is greater than 0.05, then re-run regression analysis on the remaining variables.
Step 5: Remove Medical doctors (per 10,000) because it has the highest p-value and is greater than 0.05, then re-run regression analysis on the remaining variables.
Step 6: Since all
independent variables have p-value less than 0.05, this model is kept.

Statement of Authorship sheet

Student name | Work allocation | % contribution
Ngo Nguyen Phu | Part + + | 100%
Dang Dinh Nguyen Vu | Part + | 100%
Duong Tri Dung | Part + | 100%
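As a supplementary check on the model comparison in Section 5.2, the MAD and SSE figures can be recomputed with a short script. This is only a minimal sketch: the report does not say which software was used, the function and variable names are illustrative, and the predicted values are read off the error columns of Tables 4 and 5 (the region B linear predictions of -2 and -3 are inferred from the entries "33 + 2" and "34 + 3").

```python
def mad_and_sse(actual, predicted):
    """Return (MAD, SSE) for paired actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mad = sum(abs(e) for e in errors) / len(errors)  # mean of absolute errors
    sse = sum(e * e for e in errors)                 # sum of squared errors
    return mad, sse


# Actual deaths on days 123 and 124 (August 1 and 2), as used in Tables 4 and 5.
actual_a = [400, 469]
actual_b = [33, 34]

# Predictions per model, taken from the error columns of Tables 4 and 5.
predictions_a = {
    "linear": [545, 548],
    "quadratic": [593, 603],
    "exponential": [483, 588],
}
predictions_b = {
    "linear": [-2, -3],
    "quadratic": [40, 41],
    "exponential": [23, 23],
}

results_a = {m: mad_and_sse(actual_a, p) for m, p in predictions_a.items()}
results_b = {m: mad_and_sse(actual_b, p) for m, p in predictions_b.items()}

# The preferred model is the one with the smallest SSE (MAD gives the same ranking here).
best_a = min(results_a, key=lambda m: results_a[m][1])
best_b = min(results_b, key=lambda m: results_b[m][1])

for region, results, best in (("A", results_a, best_a), ("B", results_b, best_b)):
    for model, (mad, sse) in results.items():
        print(f"Region {region} {model}: MAD = {mad}, SSE = {sse}")
    print(f"Region {region} best model: {best}")
```

Because MAD averages absolute errors, it is never negative; the sign of each individual error only shows whether the model over- or under-predicted on that day.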

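The predictions in Table 6 can likewise be reproduced from the two selected trend formulas in Section 5.1: the exponential model for region A and the quadratic model for region B. The sketch below is illustrative only; the function names are hypothetical, the coefficients are the rounded ones printed in the report, and dropping the decimal part (rather than rounding) is an assumption inferred from the table values, not something the report states.

```python
def exponential_trend_a(t):
    """Region A exponential trend from Section 5.1 (rounded coefficients)."""
    return 142.202 * 1.01 ** t


def quadratic_trend_b(t):
    """Region B quadratic trend from Section 5.1 (rounded coefficients)."""
    return 101.81 - 1.48 * t + 0.008 * t ** 2


# Day numbers count from April 1, 2021 as day 1, so September 28 is day 181.
days = {
    "September 28, 2021": 181,
    "September 29, 2021": 182,
    "September 30, 2021": 183,
}

# Assumption: the decimal part is dropped, which matches the figures in Table 6.
forecast = {
    date: (int(exponential_trend_a(t)), int(quadratic_trend_b(t)))
    for date, t in days.items()
}

for date, (region_a, region_b) in forecast.items():
    print(f"{date}: region A = {region_a}, region B = {region_b}")
```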