Data science report of final examination: Predict the number of deaths


Overview of cancer and cancer mortality

Cancers are a group of diseases that involve uncontrolled cell proliferation. The abnormal cells can invade other tissues by growing directly into nearby tissue or by moving to other parts of the body (metastasis). Not all tumors are cancerous: benign tumors do not invade other parts of the body. Possible signs and symptoms of cancer include abnormal bleeding, a prolonged unexplained cough, weight loss, and abnormalities in urination. Although these symptoms can be signs of cancer, they can also have other causes. More than 100 types of cancer currently affect humans.

Cancer remains a global concern. According to GLOBOCAN 2020 statistics, illness and death from cancer are on the rise worldwide. In Vietnam, there were an estimated 182,563 new cases and 122,690 deaths due to cancer. For every 100,000 people, 159 are newly diagnosed with cancer and 106 die from it.

In Vietnam, the most common cancers in men are liver, lung, stomach, colorectal, and prostate cancers (accounting for about 65.8% of all cancers). In women, the most common cancers are breast, lung, colorectal, stomach, and liver cancers (accounting for about 59.4% of all cancers). Across both sexes, the most common types of cancer are liver, lung, breast, stomach, and colorectal cancers. The death rate from cancer is increasing.

Motivation

Lung cancer is today among the most common cancers, one of the hardest to detect, and one of the deadliest, with the poorest treatment results. Patients develop lung cancer because:

• 90% of patients develop lung cancer from smoking

• 10% of cases are due to other causes, such as smog, radioactive substances, etc.

People at high risk of lung cancer are smokers, passive smokers, people who have relatives with lung cancer, and people who work in environments with a risk of exposure to carcinogens. According to statistics, every year in our country about 18,000 people develop lung cancer, of whom 15,000 die. These statistics call for deeper and higher-quality research on respiratory diseases. Therefore, we analyze data on lung cancer patients so that people have a comprehensive view of this disease. This contributes to early diagnosis, effective treatment, and prevention, helping to reduce the burden of respiratory diseases.

Conclusion

Data set

Dataset Description

Dataset configuration: 204 observations (rows), 9 features (variables)

There are 9 attributes in each case of the dataset. Here is the table describing all the features.

No.  Feature     Description
1    District    Names of some districts in Vietnam
2    Regions     Region of Vietnam the district belongs to
3    Year        Year the figures were recorded
4    Population  Total number of people in a district
5    Death       Number of people who died due to the disease
6    Other       Number of people with lung cancer due to other causes: radon radioactive gas, asbestos dust, etc.
7    Smoker      Number of people with lung cancer due to tobacco use
8    Age         Average age of smokers
9    Disease     Number of people with lung cancer

Conclusion

Model

Introduction to linear regression

Linear regression is a statistical method for modeling data in which the dependent variable takes continuous values, while the independent variables can be either continuous or categorical. In other words, linear regression is a method to predict the dependent variable (Y) based on the value of the independent variable (X). It can be used whenever we want to predict a continuous quantity, for example the traffic in a retail store, how long users spend on a certain page, or the number of pages visited on a certain website.

Linear regression is a data analysis technique that predicts the value of unknown data by using another related and known data value. It mathematically models the unknown or dependent variable and the known or independent variable as a linear equation. For example, suppose you have data about your expenses and income for the last year. Linear regression analyzes this data and determines that your expenses are half of your income. It then calculates an unknown future expense by halving a known future income.
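The income and expenses example can be sketched in a few lines of Python. This is only an illustration: the income values are hypothetical, the expenses are constructed to be exactly half of income, and numpy's polyfit stands in for whatever fitting routine one prefers.

```python
import numpy as np

# Hypothetical monthly income values (illustrative only, not report data).
income = np.array([1000.0, 1200.0, 1500.0, 1800.0, 2000.0, 2400.0])
# By construction, expenses are exactly half of income in this toy data.
expenses = 0.5 * income

# Fit expenses = slope * income + intercept by ordinary least squares.
# np.polyfit returns the coefficients from the highest degree down.
slope, intercept = np.polyfit(income, expenses, deg=1)

# Predict the unknown future expense from a known future income.
future_income = 3000.0
predicted_expense = slope * future_income + intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"predicted expense for income {future_income:.0f}: {predicted_expense:.1f}")
```

The fitted slope recovers the "half of income" rule, and the prediction is simply that rule applied to the new income.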

Linear regression models are relatively simple and provide an easy-to-explain mathematical formula for making predictions. Linear regression is a long-established statistical technique that is easy to apply in software and calculations. Businesses use it to reliably and predictably transform raw data into business intelligence and actionable insights. Scientists in many fields, including biology and the behavioral, environmental, and social sciences, use linear regression to conduct preliminary data analysis and predict future trends. Many data science methods, such as machine learning and artificial intelligence, use linear regression to solve complex problems.

In essence, a simple linear regression algorithm attempts to fit a line between two data variables, x and y. As the independent variable, x is plotted along the horizontal axis. Independent variables are also known as explanatory variables or predictor variables. The dependent variable, y, is plotted on the vertical axis. The y values are also called response variables or predicted variables.

Some forms of linear regression

a Simple linear regression

Simple linear regression is defined using a linear function:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

$\beta_0$ and $\beta_1$ are two unknown constants that represent the regression coefficients, while $\varepsilon$ (epsilon) is the error term.

You can use simple linear regression to model a relationship between two variables, such as the following:

• Age and height in children

• Temperature and expansion of the mercury in a thermometer

b Multiple linear regression

In multiple linear regression analysis, the data set contains one dependent variable and multiple independent variables. The linear regression function is extended to include more factors as follows:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$

As the number of predictor variables increases, the number of $\beta$ constants increases correspondingly.

Multiple linear regression models multiple variables and their impact on an outcome:

• Rainfall, temperature, and level of fertilizer use on crop yield

• Diet and exercise on heart disease

• Wage growth and inflation on home loan interest rates
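As a small sketch of multiple linear regression, the code below fits a crop yield model with three predictors by ordinary least squares. Everything here is an assumption made for illustration: the data are synthetic, the "true" coefficients are invented, and np.linalg.lstsq is just one convenient way to solve the least squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fields = 50

# Hypothetical predictors for 50 fields (synthetic values only).
rainfall = rng.uniform(50, 200, size=n_fields)
temperature = rng.uniform(15, 35, size=n_fields)
fertilizer = rng.uniform(0, 10, size=n_fields)

# Invented coefficients: intercept, rainfall, temperature, fertilizer.
true_beta = np.array([2.0, 0.03, -0.05, 0.4])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n_fields), rainfall, temperature, fertilizer])

# Toy outcome: linear combination of the predictors plus a small error term.
crop_yield = X @ true_beta + rng.normal(0, 0.01, size=n_fields)

# Ordinary least squares: find beta minimizing ||X beta - y||^2.
beta_hat, *_ = np.linalg.lstsq(X, crop_yield, rcond=None)

for name, b in zip(["intercept", "rainfall", "temperature", "fertilizer"], beta_hat):
    print(f"{name}: {b:.3f}")
```

Each extra predictor adds one column to the design matrix and one coefficient to estimate, which is why the number of constants grows with the number of predictors.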

Linear regression is a simple, well-understood algorithm with wide practical application, so we decided to use it for our analysis. It gives a model of our data that is easier to interpret than those produced by more complex algorithms.

ANALYZE DATASETS

- To run code in Python, the first thing we need to do is import the libraries. Here we import 12 libraries:

+ pandas: provides the Series and DataFrame structures, which let you organize, drill down into, present, and manipulate data.

+ numpy: processing of multidimensional arrays and matrices.

+ from collections import Counter: the collections module implements high-performance container datatypes (beyond the built-in types list, dict, and tuple) and contains many useful data structures for storing information in memory. Counter is a container that tracks how many times equivalent values are added.

+ from sklearn.linear_model import LinearRegression: ordinary least squares linear regression from sklearn.linear_model.

+ from sklearn.model_selection import train_test_split: the train_test_split() function splits our data into train and test sets.

+ from sklearn.preprocessing import MinMaxScaler: this estimator scales and translates each feature individually so that it lies in the given range on the training set.

+ sklearn.metrics as metrics: the sklearn.metrics module implements functions for assessing prediction error for specific purposes. These metrics are detailed in the sections on classification metrics, multilabel ranking metrics, regression metrics, and clustering metrics.

+ from sklearn.feature_selection import RFE: feature ranking with recursive feature elimination.

+ statsmodels.api as sm: provides a complement to scipy for statistical computations, including descriptive statistics and estimation and inference for statistical models.

+ Variance inflation factor (VIF): computed for one exogenous variable.
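As a rough illustration of how several of the libraries listed above fit together, the sketch below splits a data set, scales the features, fits a linear regression, and scores it. The data are synthetic and the coefficients are invented; this is not the report's data, model, or result.

```python
import numpy as np
import sklearn.metrics as metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for three numeric features (hypothetical values only).
X = rng.uniform(0, 100, size=(200, 3))
# Invented linear relationship plus noise, used only to have something to fit.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 1.0, size=200)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ordinary least squares fit on the scaled training data.
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

r2 = metrics.r2_score(y_test, y_pred)
mae = metrics.mean_absolute_error(y_test, y_pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.3f}")
```

Fitting the scaler on the training rows alone keeps information from the test rows out of the model, which is the usual reason for splitting before scaling.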

- After importing the libraries, we create a dataframe with the pd.read_csv() command, which reads the CSV file stored on our computer, and name the dataframe cancer. We then print the dataframe to the screen with the print command.
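A minimal sketch of this loading step follows. The report's CSV file is not available here, so the example reads hypothetical CSV text from memory. The column names follow the report, while the district names and all values are made up.

```python
import io

import pandas as pd

# Hypothetical CSV content: report column names, invented values.
csv_text = """District,Regions,Year,Population,Death,Other,Smoker,Age,Disease
District A,North,2020,4200,510,300,1900,28,2200
District B,South,2020,3900,480,250,1700,35,1950
District C,Central,2020,4800,590,400,2300,41,2700
"""

# pd.read_csv accepts a file path or any file-like object. In practice the
# argument would be the path to the CSV file on disk.
cancer = pd.read_csv(io.StringIO(csv_text))
print(cancer)
```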

The above data set includes the data that we collected on lung cancer in 204 districts in 3 regions of Vietnam.

- To make it easy for everyone to follow, we print specific information about the data table to the screen:

+ First, we use the dataframe.shape attribute to count the number of rows and columns of the dataframe.

We have 204 rows, corresponding to 204 districts in Vietnam, and 9 columns: District, Regions, Year, Population, Death, Other, Smoker, Age, and Disease.

+ After that, we summarize the data with the dataframe.describe() command so that everyone can observe the data easily.

- The tabular dataframe can be difficult to follow, so we also print a brief summary of it with the dataframe.info() command.

From the summary, we can see that there are 7 columns of numerical data: Year, Population, Death, Other, Smoker, Age, and Disease. The 2 columns of text data are District and Regions.
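The three inspection calls can be tried on a small stand-in frame. The sketch below uses the report's column names with made-up values. The select_dtypes calls are an addition, not something the report describes, and simply show one way to confirm which columns are numeric.

```python
import pandas as pd

# Small hypothetical frame with the report's column names (values are made up).
cancer = pd.DataFrame({
    "District": ["District A", "District B", "District C", "District D"],
    "Regions": ["North", "North", "South", "Central"],
    "Year": [2020, 2020, 2020, 2020],
    "Population": [4200, 3900, 4800, 3100],
    "Death": [510, 480, 590, 450],
    "Other": [300, 250, 400, 200],
    "Smoker": [1900, 1700, 2300, 1500],
    "Age": [28, 35, 41, 30],
    "Disease": [2200, 1950, 2700, 1700],
})

# shape is an attribute holding a (rows, columns) tuple.
n_rows, n_cols = cancer.shape
print(n_rows, n_cols)

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
print(cancer.describe())

# info() prints dtypes and non-null counts directly and returns None.
cancer.info()

# Separate numeric columns from text columns by dtype.
numeric_cols = cancer.select_dtypes(include="number").columns.tolist()
text_cols = cancer.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, text_cols)
```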

• We delete rows and columns that contain null values.

- First, we use the dataframe.dropna() command to delete rows that contain null values.

- Next, we use the dataframe.dropna(axis=1) command to delete columns that contain null values: cancer.dropna(axis=1)

We check the data with the pd.isnull(dataframe) function to find null values:

- From the data returned, we can see that none of the data has a null value, which shows that the data is completely clean.
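The effect of these calls is easier to see on a frame that actually contains missing values. The sketch below uses a hypothetical frame with two deliberately missing entries. Note that dropna() returns a new frame, so the result has to be assigned (or inplace=True passed) for the cleaning to persist.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two deliberately missing values.
df = pd.DataFrame({
    "Population": [4200, 3900, np.nan, 3100],
    "Death": [510, 480, 590, 450],
    "Age": [28, np.nan, 41, 30],
})

# Drop every row that contains at least one null value.
rows_dropped = df.dropna()

# Drop every column that contains at least one null value.
cols_dropped = df.dropna(axis=1)

print(rows_dropped.shape, cols_dropped.shape)

# pd.isnull returns a boolean frame. Summing it counts the nulls per column.
print(pd.isnull(rows_dropped).sum())
remaining_nulls = int(pd.isnull(rows_dropped).sum().sum())
print("nulls left after dropping rows:", remaining_nulls)
```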

• We use the dataframe.iloc command to split the data into three subsets of 67 rows and 6 columns each.

- The first data set, which we named n1, consists of the first 67 rows and 6 columns, including Population, Death, Smoker, Age, and Disease. This is representative data for some districts in Northern Vietnam.

- The second data set, which we named n2, consists of the second 67 rows and 6 columns, including Population, Death, Smoker, Age, and Disease. This is representative data for some districts in Southern Vietnam.

- The third data set, which we named n3, consists of the last 67 rows and 6 columns, including Population, Death, Smoker, Age, and Disease. This is representative data for some districts in Central Vietnam.
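A sketch of this positional split is shown below on a synthetic 204-row frame. Two things are assumptions rather than facts from the report: that the rows are ordered North, South, Central, and that the 6 selected columns are the ones at positions 3 to 8 (Population through Disease).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 204  # number of districts stated in the report

# Synthetic frame with the report's column names (all values are made up).
cancer = pd.DataFrame({
    "District": [f"District {i}" for i in range(n)],
    "Regions": ["North"] * 68 + ["South"] * 68 + ["Central"] * 68,
    "Year": 2020,
    "Population": rng.integers(3000, 5000, size=n),
    "Death": rng.integers(400, 600, size=n),
    "Other": rng.integers(100, 500, size=n),
    "Smoker": rng.integers(1000, 2500, size=n),
    "Age": rng.integers(5, 60, size=n),
    "Disease": rng.integers(1000, 3000, size=n),
})

# iloc selects by integer position: [row slice, column slice].
n1 = cancer.iloc[:67, 3:9]      # first 67 rows
n2 = cancer.iloc[67:134, 3:9]   # second 67 rows
n3 = cancer.iloc[-67:, 3:9]     # last 67 rows

print(n1.shape, n2.shape, n3.shape)
print(list(n1.columns))
```

Because 3 x 67 = 201, taking the first, second, and last 67 rows of a 204-row frame leaves 3 rows (positions 134 to 136) outside every subset.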

- Second, we averaged the number of lung cancer deaths in each of the 3 small datasets using sum()/len().

+ First, we averaged lung cancer deaths in n1 and call the result f1:

We get the result f1.

+ Second, we averaged lung cancer deaths in n2 and call the result f2:

We get the result f2.

+ Next, we averaged lung cancer deaths in n3 and call the result f3:

We get the result f3.

Next, we averaged the number of people with lung cancer in each of the 3 small datasets using sum()/len().

- We first averaged the number of people with lung cancer in n1 and call the result g1:

We get the result g1.

- Second, we averaged the number of people with lung cancer in n2 and call the result g2:

We get the result g2.

- Next, we averaged the number of people with lung cancer in n3 and call the result g3:

We get the result g3.

Finally, we counted the occurrences of each average age of lung cancer across all regions, using the Age column (cancer.Age).
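The averaging and counting steps can be sketched on one small subset. The values below are made up, and the use of Counter for the age tally is an assumption based on the libraries the report imports.

```python
from collections import Counter

import pandas as pd

# Hypothetical regional subset (values are made up).
n1 = pd.DataFrame({
    "Death": [510, 480, 590, 450],
    "Disease": [2200, 1950, 2700, 1700],
    "Age": [28, 30, 30, 41],
})

# Average in the sum()/len() style described in the text.
f1 = sum(n1["Death"]) / len(n1["Death"])       # average number of deaths
g1 = sum(n1["Disease"]) / len(n1["Disease"])   # average number of patients

# The built-in pandas mean gives the same value.
print(f1, n1["Death"].mean())
print(g1, n1["Disease"].mean())

# Count how many districts report each average age.
age_counts = Counter(n1["Age"].tolist())
print(age_counts.most_common(2))
```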

• Based on the data we have, the average age at which lung cancer is most frequent is 30, followed by 14. Both are quite young. Meanwhile, relatively few cases fall in the 50s, which is a worrying sign.

• Next, we visualize the data to give an objective and accurate view of lung cancer in the districts of the 3 regions that we are considering.

• Using the counts from the preparation step, we grouped the average age of lung cancer in the 204 districts across the country into 5 age groups ('0-9', '10-19', '20-29', '30-39', '40-50') to draw a column chart of the frequency of each age group.

- To plot the first column chart, we declare the variable x as the age group and y as the frequency.

- We then use plt.bar() to draw the column chart and plt.show() to display it on the screen.

- From the chart, we can see that the age group with the highest frequency of lung cancer is 20-29 years old, with 57 districts in this group, while the group with almost the lowest frequency is 50-60 years old, with only 5 districts. Lung cancer patients are trending younger: 52 districts fall in the 30-39 age group and 40 districts in the 40-49 group, and no district falls in the 60-70 group.

- This is an alarming reality among young people today. Through this chart, we hope people will pay more attention to their own health, because the data above shows that young people are suffering from lung cancer and that this disease can ruin their lives.
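An age-group column chart of the kind described above could be produced along these lines. The ages are made up, and the bin edges are an assumption (left-closed decades, with the last group closed at 50). pd.cut is used here for the grouping, which may differ from how the report did it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical average ages for a handful of districts (values are made up).
ages = pd.Series([8, 14, 22, 25, 27, 30, 33, 38, 41, 47])

# Assumed bin edges: [0, 10), [10, 20), [20, 30), [30, 40), [40, 51).
bins = [0, 10, 20, 30, 40, 51]
labels = ["0-9", "10-19", "20-29", "30-39", "40-50"]
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

# Count the districts per group and keep the groups in age order.
freq = groups.value_counts().reindex(labels)

x = labels          # age groups on the horizontal axis
y = freq.tolist()   # number of districts on the vertical axis

plt.bar(x, y)
plt.xlabel("Age group")
plt.ylabel("Number of districts")
plt.show()
```

Remove the matplotlib.use("Agg") line to open the chart in a window when a display is available.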

• We draw a box plot to make the statistics on lung cancer patients as clear as possible.

- With the data taken from the Disease column of the dataframe cancer, we use sn.boxplot().

- Based on the plot, we see that no district has fewer than 1000 cancer patients and that the highest district has nearly 3000. The number of cancer patients per district therefore ranges from 1000 to 3000, and most districts have more than 2000 patients.

- We compiled summary statistics on the number of patients with lung cancer and got the following results.

- According to these statistics, the district with the highest number of people with lung cancer has 2994 patients, and the district with the lowest number still has 1010. The third quartile (the 75% row of the summary) is above 2500 people. The numbers are alarming because we are looking at patient counts in districts with small areas. On average, each province has 4 or more districts and each district has 1000 to 3000 people with lung cancer, so the total for a whole province is a very large number.
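The summary statistics behind a box plot can be obtained with describe() and quantile(). In the sketch below only the minimum (1010) and maximum (2994) come from the text. The other values are made up, so the quartiles printed here are not the report's results.

```python
import pandas as pd

# Min and max taken from the text; the values in between are invented.
disease = pd.Series([1010, 1500, 2100, 2300, 2600, 2800, 2994], name="Disease")

# describe() returns count, mean, std, min, 25%, 50%, 75%, and max.
summary = disease.describe()
print(summary)

# The three quartiles are the edges and center line of the box in a box plot.
q1, median, q3 = disease.quantile([0.25, 0.5, 0.75])
print(q1, median, q3)
```

Reading the 75% row answers questions such as "how many patients do the top quarter of districts exceed", which is the kind of statement the paragraph above makes.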

• To show how large the number of people with lung cancer in the districts is, we created a line chart of the population of each district:

- Taking the data from the Population column of the dataframe cancer, we use the plt.plot() command to plot the graph.

• The chart shows that the most populous district has only close to 5000 people and the least populous only about 3000, which clearly demonstrates how small the districts are. Although the populations are small, the number of people suffering from cancer is extremely large.

• Lung cancer is an extremely dangerous disease with a high mortality rate: the more people with lung cancer, the greater the number of people dying from it. That is clearly shown in our data, but to be more objective we visualize it with a bar chart.

- For easy comparison, we divide the chart into 3 columns representing the 3 regions (North, South, and Central), using the 3 numbers f1, f2, and f3 that we calculated in the preparation step.

- From the data in the chart, we can easily see that the central region has the highest average number of cancer deaths of the 3 regions, at nearly 600 people, while the average number of people infected in the central region is

Title	Predict the number of deaths
Authors	Le Phuong Thao, Giang Thuy Trang, Nguyen Thi Le Quyen, Bui Minh Duc
Supervisor	Associate Professor Do Trung Tuan
University	Vietnam National University, Hanoi
Major	Data Science
Type	Report of final examination
City	Hanoi
