VIETNAM NATIONAL UNIVERSITY, HANOI INTERNATIONAL SCHOOL
2.1 Introduction to linear regression
2.2 Some forms of linear regression
a Simple linear regression
b Multiple linear regression
II ANALYZE DATASETS
1 Reading and Understanding Data
2 Data Cleaning and Preparations
3 Visualize data
List of figures and tables
Figure 1.1 Pulmonary tuberculosis
Figure 1.2 People
Figure 1.3 Lungs
Table 2.1 Dataset Description
1 Lê Phương Thảo 16%
2 Giang Thùy Trang
3 Nguyễn Thị Lệ Quyên
4 Bùi Minh Đức
CHAPTER 1. INTRODUCTION
1.1 Overview of cancer and cancer mortality
Cancers are a group of diseases that involve uncontrolled cell proliferation, in which the cells can invade other tissues by growing directly into nearby tissue or by moving to other parts of the body (metastasis). Not all tumors are cancerous: some are benign, that is, they do not invade other parts of the body. Some signs and symptoms of cancer include abnormal bleeding, a prolonged unexplained cough, weight loss, and abnormalities in urination. Although these symptoms can be signs of cancer, they can also have other causes. There are currently more than 100 types of cancer that affect humans.
Figure 1.1 Pulmonary tuberculosis
Cancer remains a global concern. According to GLOBOCAN 2020 statistics, illness and death from cancer are on the rise worldwide. In Vietnam, there were an estimated 182,563 new cases and 122,690 deaths due to cancer. For every 100,000 people, 159 are newly diagnosed with cancer and 106 die from it.
In Vietnam, the most common cancers in men are liver, lung, stomach, colorectal, and prostate cancers (accounting for about 65.8% of all cancers). In women, the most common cancers are breast, lung, colorectal, stomach, and liver cancers (accounting for about 59.4% of all cancers). Across both sexes, the most common types of cancer are liver, lung, breast, stomach and colorectal cancers. The death rate from cancer is increasing.
Figure 1.2 People
1.2 Motivation
Lung cancer is currently the most common and hardest-to-detect cancer, and the one with the poorest treatment outcomes and the highest mortality. Patients develop lung cancer because:
90% of patients develop lung cancer from smoking
10% is due to other causes such as smog and radioactive substances
Figure 1.3 Lungs
People at high risk of lung cancer are smokers, passive smokers, people who have relatives with lung cancer, and people who work in environments with a risk of exposure to carcinogens. According to statistics, every year in our country about 18,000 people develop lung cancer, of whom 15,000 die. These statistics call for deeper and higher-quality research on diseases of the respiratory tract. Therefore, we analyze data on lung cancer patients so that people can have a comprehensive view of this disease. This contributes to early diagnosis, effective treatment and prevention, helping to reduce the burden of respiratory diseases.
1.3 Conclusion
Chapter 2 Data set
2.1 Dataset Description
Dataset configuration: 204 observations (rows), 9 features (variables), of which 7 are numerical and 2 are text. The table below describes the features.
Table 2.1 Dataset Description
No. | Feature | Description
1 | District | Names of some districts in Vietnam
5 | Number of deaths | Number of people who died due to the disease
 | | Number of people with lung cancer due to causes: radon radioactive gas, asbestos dust,
9 | Number of disease | Number of people with lung cancer
2.2 Features
2.3 Conclusion
Chapter 3 Model LINEAR REGRESSION
2.1 Introduction to linear regression
Linear regression is a statistical method for regressing data in which the dependent variable has continuous values, while the independent variables can have either continuous or categorical values. In other words, linear regression is a method to predict the dependent variable (Y) based on the value of the independent variable (X). It can be used whenever we want to predict a continuous quantity, for example, predicting traffic in a retail store, how long users spend on a certain page, or the number of pages visited on a certain website.
Linear regression is a data analysis technique that predicts the value of unknown data using another known and related data value. It mathematically models the unknown or dependent variable and the known or independent variable as a linear equation. For example, suppose you have data about your expenses and income for the last year. Linear regression analyzes this data and determines that your expenses are half of your income. It then calculates an unknown future expense by halving a known future income.
Linear regression models are relatively simple and provide an easy-to-interpret mathematical formula for making predictions. Linear regression is a long-established statistical technique that is easy to apply in software and calculations. Businesses use it to reliably and predictably transform raw data into business intelligence and actionable insights. Scientists in many fields, including biology and the behavioral, environmental, and social sciences, use linear regression to conduct preliminary data analysis and predict future trends. Many data science methods, such as machine learning and artificial intelligence, use linear regression to solve complex problems.
In essence, a simple linear regression algorithm attempts to plot a line through the data for two variables, x and y. As the independent variable, x is plotted along the horizontal axis. Independent variables are also known as explanatory variables or predictor variables. The dependent variable, y, is plotted on the vertical axis. The y values can also be referred to as response variables or predicted variables.
Figure 2: Linear regression
2.2 Some forms of linear regression
a Simple linear regression
Simple linear regression is defined using a linear function:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
β0 and β1 are two unknown constants that represent the regression coefficients, while ε (epsilon) is the error term.
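As a minimal sketch of how the two constants can be estimated, the least-squares formulas for the slope and intercept can be written in plain Python. The function name `fit_simple_linear_regression` and the data are hypothetical and chosen only for illustration; this is not the code used in the report.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) that minimize the squared error of y = beta0 + beta1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical data lying exactly on y = 1 + 2x, so the error term is zero.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
beta0, beta1 = fit_simple_linear_regression(x, y)
print(beta0, beta1)  # 1.0 2.0
```

Because the sample points were generated without noise, the fit recovers the intercept 1 and slope 2 exactly; with real data the error term absorbs the remaining scatter.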
You can use simple linear regression to model a relationship between two variables, such as the following:
• Rainfall and crop yields
• Age and height in children
• Temperature and expansion of the metal mercury in a thermometer
b Multiple linear regression
In multiple linear regression analysis, the data set contains one dependent variable and multiple independent variables. The linear regression line function is changed to include more factors as follows:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$
As the number of predictor variables increases, the number of β constants increases correspondingly.
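The growth in the number of coefficients can be illustrated with a short sketch that fits two predictors at once. It assumes NumPy is available and uses made-up, noise-free data; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 with no noise.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so that the first coefficient plays the role of beta_0.
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of design @ beta = y.
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
print(beta)  # approximately [1. 2. 3.]
```

With two predictors there are three coefficients (the intercept plus one per predictor); adding a third predictor would add a fourth coefficient.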
Multiple linear regression models multiple variables and their impact on an outcome:
• Rainfall, temperature and level of fertilizer use on crop yield
• Diet and exercise on heart disease
• Wage growth and inflation on home loan interest rates
1 Conclusion
Linear regression is a simple and widely applicable algorithm, so we decided to use it for our analysis. It gives a model of our data that is simpler and easier to interpret than those of more complex algorithms.
II ANALYZE DATASETS
1 Reading and Understanding Data
- To run our code in Python, the first thing we need to do is import the libraries. Here we import the following libraries and modules:
+ Pandas: provides the Series and DataFrame structures, which let you organize, drill down into, present, and manipulate data.
+ Numpy: multidimensional array and matrix processing.
+ Matplotlib: 2D graphing.
+ Seaborn: statistical data visualization.
+ from collections import Counter: the collections module implements high-performance container datatypes (beyond the built-in types list, dict and tuple) and contains many useful data structures that you can use to store information in memory. Counter is a container that tracks how many times equivalent values are added.
+ from sklearn.linear_model import LinearRegression: ordinary least squares linear regression from sklearn.linear_model.
+ from sklearn.model_selection import train_test_split: the train_test_split() function is used to split our data into train and test sets.
+ from sklearn.preprocessing import MinMaxScaler: this estimator scales and translates each feature individually such that it is in the given range on the training set.
+ import sklearn.metrics as metrics: the sklearn.metrics module implements functions assessing prediction error for specific purposes. These metrics are detailed in the scikit-learn documentation sections on classification metrics, multilabel ranking metrics, regression metrics and clustering metrics.
+ from sklearn.feature_selection import RFE: feature ranking with recursive feature elimination.
+ import statsmodels.api as sm: provides a complement to scipy for statistical computations, including descriptive statistics and estimation and inference for statistical models.
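To show how three of the scikit-learn pieces listed above typically fit together, here is a minimal sketch, assuming scikit-learn and NumPy are installed. The data is synthetic and the split ratio and random seed are arbitrary choices, not values taken from our analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 50 rows, 2 features, target is an exact linear function.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 2))
y = 5.0 + 0.5 * X[:, 0] - 0.2 * X[:, 1]

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ordinary least squares on the scaled features; score() returns R^2.
model = LinearRegression().fit(X_train_scaled, y_train)
r2 = model.score(X_test_scaled, y_test)
print(X_train.shape, X_test.shape, r2)
```

Fitting the scaler on the training set alone keeps information from the test rows out of the preprocessing step, which is why `fit_transform` is used only once.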
- After importing the libraries, we create a dataframe using the pd.read_csv() command to read the csv file stored on our computer, and name the dataframe cancer. We then print the dataframe to the screen using the print command.
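The loading step can be sketched as follows. Since the real csv file is not reproduced here, the example reads a few made-up rows from an in-memory string; the column names follow this report, but the district names, region labels and numbers are hypothetical.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the real CSV file; the column names follow
# the report, but the district names and numbers are made up for illustration.
csv_text = """District,Regions,Year,Population,Death,Other,Smoker,Age,Disease
District A,North,2020,4200,310,150,1900,34,2100
District B,Central,2020,3900,280,120,1700,27,1850
District C,South,2020,4700,450,200,2300,41,2600
"""

# With a real file, a path would be passed instead of the StringIO object.
cancer = pd.read_csv(io.StringIO(csv_text))
print(cancer)
```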
The above data set includes the data that we have collected related to lung cancer in 204 districts in 3 regions of Vietnam.
- To make it easy for everyone to follow, we print specific information about the dataframe to the screen:
+ First we use the dataframe.shape attribute to count the number of rows and columns of the dataframe.
We have 204 rows, corresponding to 204 districts in Vietnam, and 9 columns: District, Regions, Year, Population, Death, Other, Smoker, Age and Disease.
+ After that, we summarize the data with the dataframe.describe() command so that everyone can observe the data in the easiest way.
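A small sketch of these two steps, on a hypothetical four-row miniature of the dataframe rather than the real 204 rows:

```python
import pandas as pd

# Hypothetical miniature of the dataframe: three of the nine columns, four rows.
cancer = pd.DataFrame({
    "District": ["A", "B", "C", "D"],
    "Population": [4200, 3900, 4700, 3100],
    "Disease": [2100, 1850, 2600, 1400],
})

# shape is an attribute (no parentheses): (number of rows, number of columns).
n_rows, n_cols = cancer.shape
print(n_rows, n_cols)  # 4 3

# describe() is a method; by default it summarizes only the numerical columns.
summary = cancer.describe()
print(summary)
```

Note that the text column District does not appear in the summary, because count, mean, quartiles and so on are only computed for numerical columns by default.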
- Because the tabular dataframe is difficult to follow, we print a brief summary of the DataFrame with the dataframe.info() command.
From the summary, we can see that there are 7 columns of numerical data: Year, Population, Death, Other, Smoker, Age and Disease. The 2 columns of text data are District and Regions.
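The split into numerical and text columns can also be checked programmatically. The sketch below uses a hypothetical two-row frame with only four of the columns; `select_dtypes` is one possible way to do this and is not necessarily what we ran.

```python
import pandas as pd

# Hypothetical miniature of the dataframe with text and numerical columns.
cancer = pd.DataFrame({
    "District": ["A", "B"],
    "Regions": ["North", "South"],
    "Year": [2020, 2020],
    "Age": [34, 27],
})

# info() prints the column names, non-null counts, and dtypes.
cancer.info()

# The same split into numerical and text columns can be computed directly.
numeric_cols = cancer.select_dtypes(include="number").columns.tolist()
text_cols = cancer.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, text_cols)
```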
2 Data Cleaning and Preparations
- Next, we use the dataframe.dropna(axis=1) command to delete the columns that contain a null value: cancer.dropna(axis=1)
We check the data using the pd.isnull(dataframe) function to find null values:
- From the data returned, we can see that none of the data has a null value, which shows that the data is completely clean.
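The behaviour of these two calls can be seen on a tiny example. The frame below is hypothetical and deliberately contains one missing value, unlike our real data; note that dropna returns a new dataframe, so the result has to be assigned for the change to persist.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in the "Other" column.
df = pd.DataFrame({
    "Death": [310, 280, 450],
    "Other": [150, np.nan, 200],
})

# dropna returns a new frame; the result must be assigned to take effect.
cleaned = df.dropna(axis=1)  # axis=1 drops every COLUMN that contains a null
print(list(cleaned.columns))  # ['Death']

# Count the remaining nulls per column; all zeros means the data is clean.
null_counts = pd.isnull(cleaned).sum()
print(null_counts)
```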
- The third data set, which we named n3, consists of the last 67 rows and 6 columns, including Population, Death, Smoker, Age and Disease. This data is representative of some districts in Central Vietnam.
- Second, we averaged the number of lung cancer deaths in each of the 3 small datasets using the sum()/len() expression.
+ We first averaged lung cancer deaths in n1 and call the result f1:
We get the result f1=
+ Second, we averaged lung cancer deaths in n2 and call the result f2:
We get the result f2=
+ Next, we averaged lung cancer deaths in n3 and call the result f3:
We get the result f3=
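The averaging step can be sketched as below. The six-row frame and the equal-sized row slices are assumptions made for illustration only; the real subsets come from the 204-row dataframe and their exact boundaries are not reproduced here.

```python
import pandas as pd

# Hypothetical frame with six rows; the real data has 204 rows and the exact
# row boundaries for n1 and n2 are not shown, so equal thirds are assumed.
cancer = pd.DataFrame({"Death": [300, 320, 280, 260, 400, 420]})

n1 = cancer.iloc[0:2]
n2 = cancer.iloc[2:4]
n3 = cancer.iloc[4:6]

# Average of one column: total divided by the number of rows.
f1 = sum(n1["Death"]) / len(n1["Death"])
f2 = sum(n2["Death"]) / len(n2["Death"])
f3 = sum(n3["Death"]) / len(n3["Death"])
print(f1, f2, f3)  # 310.0 270.0 410.0
```

The same values can be obtained with the built-in pandas method, for example `n1["Death"].mean()`.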
• Next, we averaged the number of people with lung cancer in each of the 3 small datasets using the sum()/len() expression.
- We first averaged the number of people with lung cancer in n1 and call the result g1:
- Second, we averaged the number of people with lung cancer in n2 and call the result g2:
We get the result g2=
- Next, we averaged the number of people with lung cancer in n3 and call the result g3:
• Finally, we counted the occurrences of each average age of lung cancer in all regions:
ages = cancer.Age
Counter(ages)
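A minimal sketch of the counting step, using a short hypothetical list of ages in place of the Age column:

```python
from collections import Counter

# Hypothetical ages; in the report the values come from the Age column.
ages = [30, 14, 30, 52, 14, 30, 27]

age_counts = Counter(ages)
# most_common(n) returns the n most frequent values with their counts.
print(age_counts.most_common(2))  # [(30, 3), (14, 2)]
```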
• Based on the data we have, it is easy to see that the most common age at which people develop lung cancer is 30, followed by 14, both of which are quite young. Meanwhile, comparatively few people with lung cancer are in their 50s, which is a sad sign.
• Next, we will visualize the data to give you the most objective and accurate view of lung cancer in the districts of the 3 regions that we are considering.
3 Visualize data
• Taking the data from the counting function in the preparation step, we grouped the age of lung cancer in 204 districts across the country into 5 age groups, '0-9', '10-19', '20-29', '30-39' and '40-50', to draw a column graph of the disease frequency of each age group.
- To plot the first column graph, we declare variable x as the age group and y as the frequency.
- Then we use plt.bar() to draw the column chart and plt.show() to display it.
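The grouping and plotting steps above can be sketched as follows, assuming pandas and Matplotlib are installed. The ages are made up, and `pd.cut` is one possible way to form the groups; it is not necessarily how the groups were built for our chart.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical average ages, one per district.
ages = pd.Series([5, 14, 22, 27, 30, 35, 38, 44, 47, 25])

# With right=False the bins are left-inclusive: [0, 10), [10, 20), ...
# The report's last group is written '40-50'; here it is treated as 40-49.
edges = [0, 10, 20, 30, 40, 50]
labels = ["0-9", "10-19", "20-29", "30-39", "40-49"]
groups = pd.cut(ages, bins=edges, labels=labels, right=False)

x = labels                                              # age groups
y = [int((groups == label).sum()) for label in labels]  # districts per group
print(y)  # [1, 1, 3, 3, 2]

plt.bar(x, y)
plt.xlabel("Age group")
plt.ylabel("Number of districts")
plt.show()
```

In a notebook or an interactive session the `matplotlib.use("Agg")` line can be dropped so that `plt.show()` opens the chart.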
- From the chart, we can see that the age group with the highest incidence of lung cancer is 20-29 years old, with 57 districts in this group, while the age group with almost the lowest incidence is 50-60 years old, with only 5 districts. We can see that people with lung cancer are getting younger: there are as many as 52 districts in the 30-39 age group and 40 districts in the 40-49 group, and no district falls in the 60-70 group.
- This is an alarming reality among young people today. Through this chart we hope people will pay more attention to their own health, because the data above shows that young people are suffering from lung cancer and this disease can ruin their lives.
• We draw box plots to make the statistics on patients with lung cancer as clear as possible.
- With the data taken from the Disease column of the dataframe cancer, we use the sn.boxplot() command.
- Based on the graph shown, we see that no district has fewer than 1000 cancer patients and the highest district has nearly 3000. This shows that the number of cancer patients per district ranges from 1000 to 3000, and we can easily see that most districts have more than 2000 patients.
- We compiled data on the number of patients with lung cancer and got the following results.
- According to the statistics, the district with the highest number of people with lung cancer reached 2994 people, and the district with the smallest number still reached 1010. The third quartile (the 75% row of the summary) is greater than 2500 people. The numbers are alarming because we are looking at the number of patients in districts with small areas. On average, each province has 4 or more districts, and each district has 1000 to 3000 people with lung cancer, so the total for a whole province is an unimaginably large number.
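To make the reading of such a summary concrete, here is a small sketch with eight hypothetical patient counts. Only the minimum 1010 and maximum 2994 are taken from our results; the other values are invented.

```python
import pandas as pd

# Hypothetical patient counts for eight districts (the real column has 204 values).
disease = pd.Series(
    [1010, 1500, 1800, 2100, 2300, 2600, 2800, 2994], name="Disease"
)

stats = disease.describe()
print(stats)

# The "75%" row is the third quartile: 75% of districts are at or below it,
# so only about a quarter of districts exceed it.
q3 = stats["75%"]
share_above_q3 = (disease > q3).mean()
print(q3, share_above_q3)  # 2650.0 0.25
```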
• To demonstrate how many people have lung cancer relative to the size of the districts, we created a line chart representing the population of each district:
- Taking the data from the Population column of the dataframe cancer, we use the plt.plot() command to plot the graph.
• The chart shows that the most populous district has only close to 5000 people and the least populous only about 3000, which clearly demonstrates how small the districts are. Although the population is small, the number of people suffering from cancer is extremely large.
• Lung cancer is an extremely dangerous disease with a high mortality rate: the more people have lung cancer, the greater the number of people dying from it. This is clearly shown in our data, but to be more objective we will visualize it with a bar chart.