
Data Science Final Examination Report: Predicting the Number of Deaths


VIETNAM NATIONAL UNIVERSITY, HANOI INTERNATIONAL SCHOOL


2.1 Introduction to linear regression
2.2 Some forms of linear regression
a Simple linear regression
b Multiple linear regression
II ANALYZE DATASETS
1. Reading and Understanding Data
2 Data Cleaning and Preparations
3 Visualize data


List of figures and tables

Figure 1.1 Pulmonary tuberculosis (lao phoi)
Figure 1.2 People (nguoi)
Figure 1.3 Lung (Phoi)
Table 2.1 Dataset Description


1 Lê Phương Thảo 16%
2 Giang Thùy Trang
3 Nguyễn Thị Lệ Quyên
4 Bùi Minh Đức


CHAPTER 1. INTRODUCTION

1.1 Overview of cancer and cancer mortality

Cancers are a group of diseases involving uncontrolled cell proliferation, in which the cells can invade other tissues by growing directly into nearby tissue or by moving to other parts of the body (metastasis). Not all tumors are cancerous: benign tumors do not invade other parts of the body. Some signs and symptoms of cancer include abnormal bleeding, a prolonged unexplained cough, weight loss, and abnormalities in urination. Although these symptoms can be signs of cancer, they can also have other causes. There are currently more than 100 types of cancer that affect humans.

Figure 1.1 Pulmonary tuberculosis (lao phoi)

Cancer remains a global concern. According to GLOBOCAN 2020 statistics, illness and death from cancer are on the rise worldwide. In Vietnam, there were an estimated 182,563 new cases and 122,690 deaths due to cancer. For every 100,000 people, 159 are newly diagnosed with cancer and 106 die from it.

In Vietnam, the most common cancers in men are liver, lung, stomach, colorectal, and prostate cancers (accounting for about 65.8% of all cancers). In women, the most common cancers are breast, lung, colorectal, stomach, and liver cancers (accounting for about 59.4% of all cancers). Across both sexes, the most common types of cancer are liver, lung, breast, stomach, and colorectal cancers. The death rate from cancer is increasing.

Figure 1.2 People (nguoi)

1.2 Motivation

Lung cancer is currently one of the most common cancers, one of the hardest to detect, and one of the deadliest, with poor treatment outcomes. Patients develop lung cancer because:

• 90% of patients develop lung cancer from smoking

• 10% is due to other causes such as smog and radioactive substances

Figure 1.3 Lung (Phoi)


People at high risk of lung cancer include smokers, passive smokers, people who have relatives with lung cancer, and people who work in environments with a risk of exposure to carcinogens. According to statistics, about 18,000 people in our country develop lung cancer every year, of whom 15,000 die. These statistics call for deeper and higher-quality research into diseases of the respiratory tract. Therefore, we analyze data on lung cancer patients to give people a more comprehensive view of this disease and to contribute to early diagnosis, effective treatment, and prevention, helping to reduce the burden of respiratory diseases.

1.3 Conclusion


Chapter 2 Data set

2.1 Dataset Description

Dataset configuration: 204 observations (rows) and 9 features (variables), of which 7 are numeric. The table below describes the features.

Table 2.1 Dataset Description

No.  Feature            Description
1    District           Names of some districts in Vietnam
5    Number of deaths   Number of people who died due to the disease
                        Number of people with lung cancer due to causes: radon radioactive gas, asbestos dust,
9    Number of disease  Number of people with lung cancer

2.2 Features

2.3 Conclusion


Chapter 3 Model LINEAR REGRESSION

2.1 Introduction to linear regression

Linear regression is a statistical method for modeling data in which the dependent variable takes continuous values, while the independent variables can be either continuous or categorical. In other words, linear regression is a method for predicting the dependent variable (Y) based on the value of the independent variable (X). It can be used whenever we want to predict a continuous quantity, for example traffic in a retail store, how long users spend on a certain page, or the number of pages visited on a certain website.

Linear regression is a data analysis technique that predicts the value of unknown data using another known and related data value. It mathematically models the unknown (dependent) variable and the known (independent) variable as a linear equation. For example, suppose you have data about your expenses and income for the last year. Linear regression analyzes this data and determines that your expenses are half of your income. It then calculates an unknown future expense by halving a known future income.
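The expenses-and-income example can be sketched in a few lines. The figures below are invented for illustration (they are not from the report's dataset), and np.polyfit is only one of several ways to fit a straight line by least squares.

```python
import numpy as np

# Hypothetical monthly figures, made up for illustration only.
income = np.array([1000.0, 1200.0, 1500.0, 1800.0, 2000.0])
expenses = 0.5 * income  # expenses are exactly half of income

# Fit expenses = slope * income + intercept by least squares.
slope, intercept = np.polyfit(income, expenses, deg=1)

# Predict the expense for a known future income.
future_income = 2400.0
predicted_expense = slope * future_income + intercept

print(f"slope={slope:.2f}, predicted expense={predicted_expense:.0f}")
```

Because the made-up expenses are exactly half of income, the fitted slope comes out at one half and the prediction is simply half of the future income.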

Linear regression models are relatively simple and provide an easy-to-interpret mathematical formula for making predictions. Linear regression is a statistical technique that has been used for a long time and is easy to apply in software and calculations. Businesses use it to reliably and predictably transform raw data into business intelligence and actionable insights. Scientists in many fields, including biology and the behavioral, environmental, and social sciences, use linear regression to conduct preliminary data analysis and predict future trends. Many data science methods, such as machine learning and artificial intelligence, use linear regression to solve complex problems.

In essence, a simple linear regression algorithm attempts to plot a line between two data variables, x and y. As the independent variable, x is plotted along the horizontal axis. Independent variables are also known as explanatory variables or predictor variables. The dependent variable, y, is plotted on the vertical axis. The y values can also be referred to as response variables or predicted variables.


Figure 2: Linear regression


2.2 Some forms of linear regression

a Simple linear regression

Simple linear regression is defined using a linear function:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

$\beta_0$ and $\beta_1$ are two unknown constants that represent the regression coefficients, while $\varepsilon$ (epsilon) is the error term.
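For reference, when the two coefficients of the model $Y = \beta_0 + \beta_1 X + \varepsilon$ are estimated by ordinary least squares (the method behind the LinearRegression class this report imports later), they are chosen to minimize the sum of squared errors over the $n$ observations $(x_i, y_i)$. The notation $\hat{\beta}$, $\bar{x}$, and $\bar{y}$ is introduced here and does not appear in the original text.

```latex
(\hat{\beta}_0, \hat{\beta}_1)
  = \arg\min_{\beta_0,\, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

Here $\bar{x}$ and $\bar{y}$ are the sample means of the independent and dependent variables.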

You can use simple linear regression to model a relationship between two variables, such as the following:

• Rainfall and crop yields
• Age and height in children
• Temperature and the expansion of the mercury in a thermometer

b Multiple linear regression

In multiple linear regression analysis, the data set contains one dependent variable and multiple independent variables. The linear regression function is extended to include more factors as follows:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$

As the number of predictor variables increases, the number of $\beta$ coefficients increases correspondingly.
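A minimal sketch of fitting several coefficients at once, assuming ordinary least squares and using NumPy's np.linalg.lstsq. The observations and the "true" coefficients are invented so that the recovered values can be checked; they do not come from the report's data.

```python
import numpy as np

# Made-up observations: rainfall (mm), temperature (C), fertilizer (kg).
X = np.array([
    [100.0, 20.0, 5.0],
    [120.0, 22.0, 6.0],
    [90.0, 25.0, 4.0],
    [150.0, 18.0, 8.0],
    [110.0, 21.0, 7.0],
])
true_beta = np.array([2.0, 0.03, -0.1, 0.5])  # beta_0 .. beta_3 (invented)

# Prepend a column of ones so that beta_0 acts as the intercept.
design = np.column_stack([np.ones(len(X)), X])
y = design @ true_beta  # noise-free outcome, i.e. the error term is zero

# Ordinary least squares estimate of all coefficients at once.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(beta_hat, 3))
```

With three predictors there are four coefficients to estimate (one intercept plus one per predictor), which is the growth in coefficients described above.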

Multiple linear regression models multiple variables and their impact on an outcome:

• Rainfall, temperature, and fertilizer use on crop yield
• Diet and exercise on heart disease
• Wage growth and inflation on home loan interest rates

1 Conclusion

Linear regression is simple and widely applicable, so we decided to use it for our analysis. It keeps our model easy to interpret while still allowing us to make predictions from the data.

II ANALYZE DATASETS

1. Reading and Understanding Data

- To run our code in Python, the first thing we need to do is import the libraries. Here we import 12 libraries:


+ Pandas: provides the Series and DataFrame structures, which let you organize, drill down into, present, and manipulate data.
+ Numpy: multidimensional array and matrix processing.
+ Matplotlib: 2D graphing.
+ Seaborn: visualization of data and models.
+ from collections import Counter: the collections module implements high-performance container datatypes (beyond the built-in types list, dict, and tuple) and contains many useful data structures for storing information in memory. Counter is a container that tracks how many times equivalent values are added.
+ from sklearn.linear_model import LinearRegression: ordinary least squares linear regression from sklearn.linear_model.
+ from sklearn.model_selection import train_test_split: the train_test_split() method is used to split our data into train and test sets.
+ from sklearn.preprocessing import MinMaxScaler: this estimator scales and translates each feature individually so that it lies in the given range on the training set.
+ sklearn.metrics as metrics: the sklearn.metrics module implements functions that assess prediction error for specific purposes, covering classification, multilabel ranking, regression, and clustering metrics.
+ from sklearn.feature_selection import RFE: feature ranking with recursive feature elimination.


+ statsmodels.api as sm: provides a complement to scipy for statistical computations, including descriptive statistics and estimation and inference for statistical models.
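As a rough sketch of how the scikit-learn pieces in this list can fit together: the data, the choice of Death as the target, and the coefficients below are all invented for illustration, and this is not necessarily the pipeline the report itself runs.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the report's data (all values are invented).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Population": rng.integers(3000, 5000, size=60),
    "Smoker": rng.integers(500, 2000, size=60),
    "Disease": rng.integers(1000, 3000, size=60),
})
# Invented linear relationship, used only so the fit can be checked.
data["Death"] = 0.1 * data["Smoker"] + 0.3 * data["Disease"] + 50

X = data[["Population", "Smoker", "Disease"]]
y = data["Death"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression().fit(X_train_scaled, y_train)
r2 = r2_score(y_test, model.predict(X_test_scaled))
print(f"R^2 on the held-out split: {r2:.3f}")
```

Fitting the scaler on the training split alone keeps information from the test split out of the model, which is why transform (not fit_transform) is used on the test data.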

- After importing the libraries, we use the pd.read_csv() command to read the CSV file stored on our computer into a dataframe, which we name cancer. We then print the dataframe to the screen using the print command.

The data set contains the data we collected on lung cancer in 204 districts across 3 regions of Vietnam.

- To make the data easy to follow, we print specific information about the dataframe:

+ First, we use the dataframe.shape attribute to count the number of rows and columns of the dataframe.

We have 204 rows, corresponding to 204 districts in Vietnam, and 9 columns: District, Regions, Year, Population, Death, Other, Smoker, Age, and Disease.

+ After that, we summarize the data with the dataframe.describe() command so that the data is easy to inspect.


- Because the tabular output is hard to follow, we also print a brief summary of the dataframe with the dataframe.info() command.

From the summary, we can see that there are 7 numeric columns (Year, Population, Death, Other, Smoker, Age, and Disease) and 2 text columns (District and Regions).
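The inspection steps in this section can be sketched as follows. The report does not give the CSV file name or its contents, so three invented rows (including the region names) are read from an in-memory string instead; only the column names are taken from the text.

```python
import io

import pandas as pd

# Three invented rows standing in for the real CSV file.
csv_text = """District,Regions,Year,Population,Death,Other,Smoker,Age,Disease
District A,North,2020,4200,310,150,900,30,2100
District B,Central,2020,3800,280,120,850,14,1900
District C,South,2020,4600,350,200,1000,45,2500
"""

cancer = pd.read_csv(io.StringIO(csv_text))

print(cancer.shape)       # (number of rows, number of columns)
print(cancer.describe())  # summary statistics of the numeric columns
cancer.info()             # column names, non-null counts and dtypes

# Separate numeric columns from text columns.
numeric_columns = cancer.select_dtypes(include="number").columns.tolist()
text_columns = cancer.select_dtypes(exclude="number").columns.tolist()
```

Note that shape is an attribute, while describe() and info() are methods and need parentheses.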


2 Data Cleaning and Preparations


- Next, we use the dataframe.dropna(axis=1) command to delete columns that contain null values: cancer.dropna(axis=1)

We then check the data using the pd.isnull(dataframe) function to find any null values:


- From the returned output, we can see that no column contains a null value, which shows that the data is already clean.
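A small sketch of the null check, using an invented frame with one missing value so that the effect is visible. One detail worth keeping in mind is that dropna returns a new DataFrame rather than changing the original, so the result has to be assigned to a name.

```python
import numpy as np
import pandas as pd

# Invented frame with one missing value in the "Other" column.
cancer = pd.DataFrame({
    "Death": [310, 280, 350],
    "Other": [150, np.nan, 200],
    "Disease": [2100, 1900, 2500],
})

# Count missing values per column.
nulls_before = cancer.isnull().sum()

# dropna returns a new DataFrame; assign it to keep the result.
cleaned = cancer.dropna(axis=1)
nulls_after = cleaned.isnull().sum().sum()

print(nulls_before)
print(list(cleaned.columns), nulls_after)
```

With axis=1, any column containing at least one null value is dropped entirely, so it is worth checking how many columns survive.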


- The third data set, which we named n3, consists of the last 67 rows and 6 columns, including Population, Death, Smoker, Age, and Disease. This data is representative of some districts in Central Vietnam.

- Second, we averaged the number of lung cancer deaths in each of the 3 small data sets using sum()/len():

+ First, we averaged lung cancer deaths in n1 and called the result f1:

We get the result f1=

+ Second, we averaged lung cancer deaths in n2 and called the result f2:

We get the result f2=

+ Next, we averaged lung cancer deaths in n3 and called the result f3:

We get the result f3=
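The averaging step can be sketched like this. The nine rows and the block sizes are invented (the real data has 204 rows), so the numbers printed here are not the report's missing f1, f2, and f3 values.

```python
import pandas as pd

# Invented frame with nine rows; the real data has 204.
cancer = pd.DataFrame({
    "Death": [300, 320, 340, 200, 220, 240, 100, 120, 140],
    "Disease": [2000, 2100, 2200, 1500, 1600, 1700, 1000, 1100, 1200],
})

# Three consecutive row blocks, one per region (block sizes are assumptions).
n1 = cancer.iloc[:3]
n2 = cancer.iloc[3:6]
n3 = cancer.iloc[-3:]

# Average deaths per block, written as sum()/len() as in the text.
f1 = n1["Death"].sum() / len(n1)
f2 = n2["Death"].sum() / len(n2)
f3 = n3["Death"].sum() / len(n3)

print(f1, f2, f3)
```

The same values can be obtained with n1["Death"].mean(), which is the more common pandas idiom.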


- Next, we averaged the number of people with lung cancer in each of the 3 small data sets using sum()/len():

+ First, we averaged the number of people with lung cancer in n1 and called the result g1:

+ Second, we averaged the number of people with lung cancer in n2 and called the result g2:

We get the result g2=

+ Next, we averaged the number of people with lung cancer in n3 and called the result g3:

Finally, we counted how often each average age of lung cancer occurs across all regions:

ages = cancer.Age
Counter(ages)
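A minimal illustration of the counting step, with a handful of invented ages in place of the Age column:

```python
from collections import Counter

# Invented average ages for a handful of districts.
ages = [30, 14, 30, 52, 14, 30, 25]

age_counts = Counter(ages)

# most_common(n) lists the n most frequent values with their counts.
top_two = age_counts.most_common(2)
print(top_two)
```

Storing the column under a name such as ages, rather than list, keeps Python's built-in list type usable later in the same script.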


- Based on the data we have, the most frequent average age for lung cancer is 30, followed by 14. Both are quite young, while relatively few districts have an average age in the 50s, which is a worrying sign.

- Next, we visualize the data to give an objective and accurate view of lung cancer in the districts of the 3 regions we are considering.

3 Visualize data

- Taking the output of the counting step in the preparation stage, we grouped the average age of lung cancer in the 204 districts into 5 age groups ('0-9', '10-19', '20-29', '30-39', '40-50') to draw a column chart of the frequency of each age group.

- To plot the first column chart, we declare the variable x as the age groups and y as the frequencies.


- We then use plt.bar() to draw the column chart and plt.show() to display it.
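One possible way to do the grouping and the chart is sketched below. The ages, bin edges, and labels are assumptions made for this example, and pd.cut is a substitute for counting the groups by hand; the report does not say how its groups were built.

```python
import pandas as pd

# Invented average ages; the real values come from the Age column.
ages = pd.Series([14, 30, 30, 25, 41, 8, 52])

# Bin edges and labels are assumptions chosen for this sketch.
bins = [0, 10, 20, 30, 40, 50, 60]
labels = ["0-9", "10-19", "20-29", "30-39", "40-49", "50-59"]
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

# Number of districts per age group, sorted in age-group order.
frequency = groups.value_counts().sort_index()


def plot_frequency(frequency):
    """Draw the column chart; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    plt.bar(frequency.index.astype(str), frequency.values)
    plt.xlabel("Age group")
    plt.ylabel("Number of districts")
    plt.show()


print(frequency)
```

Calling plot_frequency(frequency) draws the chart. Using right=False makes each bin include its lower edge, so an age of 30 falls in the 30-39 group.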

- From the chart, we can see that the age group with the most districts is 20-29, with 57 districts, while the 50-60 group is close to the lowest, with only 5 districts. Lung cancer patients are trending younger: 52 districts fall in the 30-39 group and 40 districts in the 40-49 group, and no district falls in the 60-70 group.

- This is an alarming reality for young people today. Through this chart, we hope people will pay more attention to their own health, because the data shows that young people are suffering from lung cancer and that this disease can ruin their lives.

- We draw box plots to make the statistics on lung cancer patients as clear as possible.


- With the data taken from the Disease column of the dataframe cancer, we use the sn.boxplot() command.

- Based on the plot, no district has fewer than 1,000 cancer patients, and the highest district has nearly 3,000. The number of cancer patients per district therefore ranges from 1,000 to 3,000, and most districts have more than 2,000 patients.

- We compiled summary statistics on the number of lung cancer patients and got the following results.

- According to these statistics, the district with the most lung cancer patients has 2,994, and the district with the fewest still has 1,010. The 75% figure in the summary is greater than 2,500 people. These numbers are alarming because they describe districts with small areas. On average, each province has 4 or more districts, and each district has 1,000 to 3,000 people with lung cancer, so the total for a whole province is a very large number.
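When reading a describe() summary, it helps to remember that the "75%" row is the third quartile: the value below which about three quarters of the observations fall. The sketch below uses eight invented district counts whose minimum and maximum match the figures quoted in the text; the values in between are made up.

```python
import pandas as pd

# Invented patient counts for eight districts.
disease = pd.Series([1010, 1500, 1800, 2100, 2300, 2600, 2800, 2994])

summary = disease.describe()

# The "75%" row is the value below which 75% of the districts fall.
q75 = summary["75%"]
share_above_q75 = (disease > q75).mean()

print(summary[["min", "25%", "50%", "75%", "max"]])
print(f"Share of districts above the 75% value: {share_above_q75:.2f}")
```

In this toy data only a quarter of the districts lie above the "75%" value, which is the usual reading of that row.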

- To show how large the number of lung cancer patients is relative to district size, we created a line chart of district populations:

- Taking the data from the Population column of the dataframe cancer, we use the plt.plot() command to plot the graph.

- The chart shows that the most populous district has only close to 5,000 people and the least populous about 3,000, which clearly demonstrates how small the districts are. Although the populations are small, the number of people suffering from cancer is extremely large.
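One way to make this comparison concrete is to divide the number of patients by the population of each district. The rows below are invented, using only the ranges quoted in the text, and the column name DiseaseShare is a hypothetical name introduced for this sketch.

```python
import pandas as pd

# Invented rows using the ranges quoted in the text.
cancer = pd.DataFrame({
    "Population": [3000, 4000, 5000],
    "Disease": [1500, 2000, 3000],
})

# Share of each district's population recorded as lung cancer patients.
cancer["DiseaseShare"] = cancer["Disease"] / cancer["Population"]

print(cancer)
```

A per-district share like this expresses "small population, many patients" as a single number that can be compared across districts.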

- Lung cancer is an extremely dangerous disease with a high mortality rate: the more people with lung cancer, the more people die from it. This is visible in our data, but to be more objective we will visualize it with a bar chart.
