Trang 2
How much does the car cost?
Trang 3
Giá heo?
Data scientists
Jeffrey C Schlemmer
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/
Trang 4Thu tính liên tính liên quan ra.
Attribute1 Attribute2 Attribute3 Attribute4 0
1 2 3
n
Source: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
Trang 5-cao
Jeffrey C Schlemmer Giá xe
UCI, Kaggle, Kdnuggets :
1
2
3 N4
Trang 8
Hierarchical Data Format (HDF): hdf, pandas.read_hdf()
Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
header
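The data file above has no header row, so a minimal sketch of loading it might look like the following. The two rows are made up for illustration and only mimic the comma-separated layout; the four column names are an assumed, shortened subset.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real file; "?" marks a missing value.
sample = io.StringIO("3,?,alfa-romero,gas\n1,158,audi,gas\n")

# header=None stops pandas from using the first data row as column names.
df = pd.read_csv(sample, header=None)

# Assign column names afterwards (illustrative subset, not the full list).
df.columns = ["symboling", "normalized-losses", "make", "fuel-type"]
print(df.head())
```

To read the real file, the URL from the Data Source line can be passed to pandas.read_csv in place of the in-memory sample.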
Trang 9SQL database
Trang 10Comma-separated
Excel sheet: excel, pandas.read_excel(), df.to_excel()
Hierarchical Data
Trang 1246 47
-2
Trang 14df)(hay feature)
(average)(frequency)
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters
axis: Determine if rows or columns which contain missing values are removed.
    0, or 'index': Drop rows which contain missing values.
    1, or 'columns': Drop columns which contain missing values.
how: Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
thresh: Require that many non-NA values.
subset: Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
inplace: If True, do operation inplace and return None.
Returns
DataFrame: DataFrame with NA entries dropped from it.
Trang 15
df.dropna(subset=[ ], axis=0, inplace=True)
horsepower peak-rpm price
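A small sketch of the dropna call on a toy table with the three columns shown above. The values are invented, and subset=["price"] is an assumption, since the slide leaves the subset list empty.

```python
import numpy as np
import pandas as pd

# Toy data: the second row has no price.
df = pd.DataFrame({
    "horsepower": [111.0, 154.0, 102.0],
    "peak-rpm": [5000.0, 5000.0, 5500.0],
    "price": [13495.0, np.nan, 13950.0],
})

# Drop every row (axis=0) whose "price" is missing, modifying df in place.
df.dropna(subset=["price"], axis=0, inplace=True)

# Renumber the remaining rows 0..n-1.
df.reset_index(drop=True, inplace=True)
print(df)
```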
df.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False)
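One common use of replace is to fill missing entries with the column mean. The sketch below assumes a toy "horsepower" column with one missing value; the numbers are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [100.0, np.nan, 140.0]})

# mean() skips NaN by default, so this is (100 + 140) / 2.
mean_hp = df["horsepower"].mean()

# Replace each NaN with the column mean.
df["horsepower"] = df["horsepower"].replace(np.nan, mean_hp)
print(df)
```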
Trang 16horsepower peak-rpm price
Trang 1719 20
các khác nhau => nên không quán
theo tiêu quát chung, và cho phép
inplace = False , level = None ,
target with mapper Can be either the axis name
Trang 18
mpg -> L/100km
dataframe.astype( )
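A sketch of both steps on toy data: converting fuel economy from mpg to L/100km and fixing a column type with astype. The constant 235 is the usual rounded factor for US gallons (L/100km is roughly 235 / mpg); the column names and values are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"city-mpg": [21, 47], "price": ["13495", "16500"]})

# Unit conversion: L/100km is approximately 235 divided by mpg (US gallons).
df["city-L/100km"] = 235 / df["city-mpg"]

# Prices were read as strings; astype converts them to integers.
df["price"] = df["price"].astype("int64")
print(df.dtypes)
```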
Trang 190.167 0.99 0.75
df['length'] = df['length'] / df['length'].max()
df['width'] = df['width'] / df['width'].max()
df['height'] = df['height'] / df['height'].max()
Trang 20
df['length'] = (df['length'] - df['length'].min()) / (df['length'].max() - df['length'].min())
df['width'] = (df['width'] - df['width'].min()) / (df['width'].max() - df['width'].min())
df['height'] = (df['height'] - df['height'].min()) / (df['height'].max() - df['height'].min())
Z-score
df['length'] = (df['length'] - df['length'].mean()) / df['length'].std()
df['width'] = (df['width'] - df['width'].mean()) / df['width'].std()
df['height'] = (df['height'] - df['height'].mean()) / df['height'].std()
Z-score
Parameters:
a: array_like. An array-like object containing the sample data.
axis: int or None, optional. Axis along which to operate. Default is 0. If None, compute over the whole array a.
Returns:
zscore: array_like. The z-scores, standardized by mean and standard deviation of input array a.
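The three normalization methods can be compared side by side on a toy column. This is a sketch with invented values; note that pandas std() uses the sample standard deviation (ddof=1) while scipy.stats.zscore defaults to ddof=0, so the two z-scores differ unless ddof is set explicitly.

```python
import numpy as np
import pandas as pd
from scipy import stats

length = pd.Series([150.0, 170.0, 200.0])

# Simple feature scaling: divide by the maximum.
simple = length / length.max()

# Min-max scaling: map the range onto [0, 1].
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score with pandas (sample standard deviation, ddof=1).
z_pandas = (length - length.mean()) / length.std()

# Z-score with SciPy; ddof=1 makes it match the pandas result.
z_scipy = stats.zscore(length, ddof=1)
print(simple.tolist(), min_max.tolist())
```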
Trang 21binbin
Trang 22
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) -> DataFrame
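A minimal sketch of get_dummies turning a categorical column into indicator columns. The "fuel-type" column and its values are assumed here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"]})

# One 0/1 column per category; names are built as prefix + prefix_sep + value.
dummies = pd.get_dummies(df["fuel-type"], prefix="fuel", dtype=int)
print(dummies)
```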
Trang 256 7
1
2
Trang 26Là t
MeanMedianMode
là cho giátrung bình trung
là kê cho xu trung tâm, hay kê trí
Trang 27
2.1 Central tendency
Mode
Mean = Median = Mode: symmetric distribution
Mode < Median < Mean: positive skew
Mean < Median < Mode: negative skew
báo cáo
Trang 282.2 Dispersion Tính to
x1, x2, x3, ..., xn
Min = x1
Max = xn
Range = Max - Min = xn - x1
Quartile cung thông tin các giá
Trang 29
2.2 Dispersion
Interquartile Range (IQR)
IQR = Q3 - Q1
2.2 Dispersion
Variance
các quan sát so trung bình chúng
bình
Trang 30
2.2 Dispersion
Coefficient of Variation (CV)
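The dispersion measures in this section can be computed on a small made-up sample as follows. The sketch assumes the population form of variance and standard deviation (ddof=0) and takes CV as standard deviation divided by mean; the slides may use the sample form instead.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

value_range = x.max() - x.min()        # Range = Max - Min
q1, q3 = np.percentile(x, [25, 75])    # first and third quartiles
iqr = q3 - q1                          # IQR = Q3 - Q1
variance = x.var()                     # population variance (ddof=0)
std = x.std()                          # population standard deviation
cv = std / x.mean()                    # coefficient of variation

print(value_range, iqr, variance, cv)
```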
Trang 31
Skewness (formula over the observations x_i, i = 1, ..., N)
Trang 3234
Trang 33quan:
Ung th
T quan KHÔNGKHÔNG
2.4 Correlation
correlation
2.4 Correlation
Trang 34
Close to +1: Large positive relationship
Close to -1: Large negative relationship
Close to 0: No relationship
P-value
P-value < 0.001: Strong
P-value < 0.05: Moderate
P-value < 0.1: Weak
P-value > 0.1: No
Trang 35
Pearson correlation: 0.81
P-value: 9.35e-48
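A coefficient and P-value pair like the one above can be obtained with scipy.stats.pearsonr. The sketch below uses synthetic data, not the car dataset, so its numbers will differ from the slide.

```python
import numpy as np
from scipy import stats

# Synthetic data: y depends linearly on x plus some noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# pearsonr returns the correlation coefficient and its two-sided P-value.
pearson_coef, p_value = stats.pearsonr(x, y)
print(pearson_coef, p_value)
```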
2.4 Correlation - Statistics
quan(Correlation heatmap)
3 Grouping
drive-wheels, body-style, price
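Grouping by the two categorical columns and averaging the price could be sketched like this. The four rows are invented, and the pivot step is one assumed way of laying out the result as a table suitable for a heatmap.

```python
import pandas as pd

df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan"],
    "price": [10000.0, 8000.0, 20000.0, 24000.0],
})

# Mean price for every (drive-wheels, body-style) combination.
grouped = df.groupby(["drive-wheels", "body-style"], as_index=False)["price"].mean()

# Pivot: one row per drive-wheels value, one column per body-style.
pivot = grouped.pivot(index="drive-wheels", columns="body-style", values="price")
print(pivot)
```

Combinations that never occur in the data (here rwd hatchback) show up as NaN in the pivot table.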
Trang 363 Gom nhómPh
dòng
3 Gom nhómDùng heatmap
Trang 373 Gom nhómPh
Trang 38
4 ANOVA analysis
kê hay không
4 ANOVA analysis
ANOVA F-test
import scipy.stats as stats
groupB, groupC, )
Trang 39
4 ANOVA analysis
import scipy.stats as stats
honda and jaguar
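A one-way ANOVA between two makes can be sketched with scipy.stats.f_oneway. The price lists below are invented stand-ins, not values taken from the dataset.

```python
from scipy import stats

# Made-up prices for two makes.
honda = [7295, 7895, 8845, 9095, 10295]
jaguar = [32250, 35550, 36000]

# f_oneway returns the F statistic and the P-value of the test.
f_val, p_val = stats.f_oneway(honda, jaguar)
print(f_val, p_val)
```

A large F with a small P-value suggests the group means differ; more groups can be passed as additional arguments.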
- Shape)
Phân tích ANOVA
Trang 41
asethub/ds105/master/Model_Dataset.csv
https://raw.githubusercontent.com/datasethub/ds105/master/Model_Dataset_Lab.csv
ph
Mô hình'city-mpg'
39
'price'
7000
Trang 42'body-style' 'horsepower' 'highway-mpg' 'engine-size'
Trang 4312 13
-2019
-19 và 2017
Trang 46
Y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
b0: intercept (X = 0)
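A sketch of fitting the four-predictor model above with scikit-learn. The data is synthetic and noise-free with known coefficients, chosen only so the fitted b0 and b1..b4 are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic predictors x1..x4 and a target built from known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5, 3.0])

lm = LinearRegression()
lm.fit(X, y)

b0 = lm.intercept_    # estimate of b0
b = lm.coef_          # estimates of b1..b4
y_hat = lm.predict(X)
print(b0, b)
```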
Trang 494.2 Residual plot
Trang 504.3 Distribution plotDistribution Plots
A
Trang 53
- PolynomialFeatures in the preprocessing package of sklearn
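A minimal sketch of what PolynomialFeatures produces for one sample with two features. The input values and degree=2 are chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0]])  # one sample with features x1 = 1, x2 = 2

# degree=2 without the constant column gives: x1, x2, x1^2, x1*x2, x2^2
pr = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pr.fit_transform(X)
print(X_poly)
```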
Trang 55
Pipeline Constructor
pipeline object
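A pipeline chains preprocessing steps and a final model into one object. The sketch below assumes three steps (scaling, polynomial features, linear regression) with made-up step names and synthetic quadratic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data following an exact quadratic.
x = np.linspace(-2, 2, 30).reshape(-1, 1)
y = 3.0 * x[:, 0] ** 2 + 1.0

# Each step is a (name, estimator) pair; the last step is the model.
steps = [
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2)),
    ("model", LinearRegression()),
]
pipe = Pipeline(steps)

# fit runs every transform in order, then trains the final model.
pipe.fit(x, y)
y_hat = pipe.predict(x)
print(y_hat[:3])
```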
Trang 56mô hình.
7.1 Mean Squared Error (MSE)
7.2 R-squared (R^2)
7.1 Mean Squared Error (MSE)
y = 150
yHat = 50
y - yHat = 150 - 50 = 100
(100)^2 = 10000
Trang 577.1 Mean Squared Error (MSE)
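The MSE is the mean of the squared differences between actual and predicted values. A sketch using the single sample from the slide (y = 150, yHat = 50):

```python
import numpy as np

y = np.array([150.0])      # actual value
y_hat = np.array([50.0])   # predicted value

# Square each error, then average over all samples.
mse = np.mean((y - y_hat) ** 2)
print(mse)
```

With more samples the arrays simply get longer; the same expression averages over all of them.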
7.2 R-squared (R^2)
Y
= 6
Trang 58
7.2 R-squared (R^2)
.Các hình vuông màu xanh MSE
=
thì x
Trang 59
7.2 R-squared (R^2)
Calculating R^2
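R^2 can be computed as one minus the ratio of the squared error of the model to the squared error of always predicting the mean. This is a sketch with made-up values, assuming that standard definition.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])       # actual values
y_hat = np.array([1.1, 1.9, 3.2, 3.8])   # predictions of some model

ss_res = np.sum((y - y_hat) ** 2)        # squared error of the model
ss_tot = np.sum((y - y.mean()) ** 2)     # squared error of the mean baseline
r2 = 1 - ss_res / ss_tot
print(r2)
```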
Trang 6090Regression plot
Distribution plot
Trang 635 6
giá cho phép chúng ta
mô hình có phùdùng phát mô hình
Trang 65
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr, x_data, y_data, cv=3)
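A self-contained version of the call above, with synthetic data standing in for x_data and y_data and a LinearRegression assumed for lr. For regressors, cross_val_score reports R^2 on each fold by default.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a noisy linear relationship.
rng = np.random.default_rng(0)
x_data = rng.normal(size=(60, 1))
y_data = 4.0 * x_data[:, 0] + rng.normal(scale=0.1, size=60)

lr = LinearRegression()

# cv=3: train on two folds, score on the third, three times.
scores = cross_val_score(lr, x_data, y_data, cv=3)
mean_score = np.mean(scores)
print(scores, mean_score)
```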
Trang 66Cross validation (CV) cross_val_score()
Root Mean Squared Error (RMSE)
Relative Squared Error (RSE)
Mean Absolute Error (MAE)
Relative Absolute Error (RAE)
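The four error measures listed above can be sketched with NumPy on made-up values. The relative measures are assumed here in their usual form: the model's total error divided by the total error of always predicting the mean.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])       # actual values
y_hat = np.array([1.5, 2.0, 2.5, 4.0])   # predicted values
err = y - y_hat

rmse = np.sqrt(np.mean(err ** 2))                       # Root Mean Squared Error
mae = np.mean(np.abs(err))                              # Mean Absolute Error
rse = np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)    # Relative Squared Error
rae = np.sum(np.abs(err)) / np.sum(np.abs(y - y.mean()))  # Relative Absolute Error

print(rmse, mae, rse, rae)
```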
2 Overfitting - Underfitting
Trang 672.1 Overfitting
Trang 68
y(x) + noise
Trang 70
Code
Trang 7241 42
:
và overfitting.
có nhau.
Trang 73alpha
Trang 75-6
Trang 78So sánh
Trang 79
https://python-graph-gallery.com/
2 Matplotlib
Trang 80
Artist layer (Artist), Scripting layer (pyplot)
2 Matplotlib
Backend layer
(abstract interface class)
1 FigureCanvas: matplotlib.backend_bases.FigureCanvasBase
2 Renderer: matplotlib.backend_bases.RendererBase
3 Event: matplotlib.backend_bases.Event
2 Matplotlib
Artist layer
Artist object
Artist:
1 Primitive Artist
2 Container Artist: Axis, Axes, Figure, and Tick. A container artist can hold other container artists as well as primitive artists.
Ref.: https://www.aosabook.org/en/matplotlib.html
Trang 81
2 Matplotlib
Artist layer
2 Matplotlib
Scripting layer
pyplot
Trang 82
2 Matplotlib
Scripting layer
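The difference between the artist layer and the scripting layer can be sketched by drawing the same histogram both ways. The random data and the non-interactive Agg backend are choices made here so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no window is opened

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

data = np.random.default_rng(0).normal(size=1000)

# Artist layer: create the Figure, attach a canvas, add an Axes by hand.
fig = Figure()
canvas = FigureCanvasAgg(fig)
ax = fig.add_subplot(111)
counts, edges, patches = ax.hist(data, bins=10)

# Scripting layer: pyplot creates and tracks the figure and axes itself.
plt.figure()
counts2, edges2, patches2 = plt.hist(data, bins=10)
plt.close("all")
```

Both paths end up building the same Artist objects; pyplot only hides the bookkeeping.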
Trang 83
plot(*args, scalex=True, scaley=True, data=None, **kwargs)
(line)
plot([x], y, [fmt], *, data=None, **kwargs)
plot([x], y, [fmt], [x2], y2, [fmt2], ..., **kwargs)
Trang 84
Ref.: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
kê c
https://github.com/datasethub/ds105/blob/master/Canada.xlsx
Trang 85
https://www.un.org/en/development/desa/population/migration/data/empirical2/migrationflows.asp
df_can.head()
Trang 94
6 Pie chart
Trang 97
8 Scatter plot
Trang 98
8 Scatter plot
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
9 Waffle chart
h
9 Waffle chart
Trang 99
9 Waffle chart
Trang 10057 58
-8
2
Trang 101{
"ID" : "28",
Trang 102
Ref: http://132.72.155.230:3838/js/geojson-1.html
Trang 103
1.2 GeoJSON
Structure of GeoJSON
type: Feature, FeatureCollection
geometry: Point, LineString, MultiLineString, Polygon, MultiPolygon, GeometryCollection
properties
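The structure above can be illustrated by building a small FeatureCollection as a Python dict and round-tripping it through JSON. The place name and coordinates are illustrative values; GeoJSON positions are written as [longitude, latitude].

```python
import json

feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            # geometry: the shape; coordinates are [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [-79.38, 43.65]},
            # properties: free-form attributes attached to the shape
            "properties": {"name": "Toronto"},
        }
    ],
}

# Serialize to GeoJSON text and parse it back.
text = json.dumps(feature_collection, indent=2)
parsed = json.loads(text)
print(text)
```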
Trang 10727 28
Folium Open Street Map
Map Styles Stamen Toner
Trang 109Ontario
Trang 11040
2.4 Choropleth map
Trang 1112.4 Choropleth mapGeojson File
2.4 Choropleth map
Trang 113gmapsgmaps
gmapsAPI key
gmaps
import gmaps
gmaps.configure(api_key=GOOGLE_API_KEY)
fig = gmaps.figure()
fig
from ipywidgets.embed import embed_minimal_html
embed_minimal_html('export.html', views=[fig])
Trang 114cities, overlaid,'TERRAIN' is a map that emphasizes terrain
stroke_color = 'green' , scale = )
fig = gmaps.figure()
fig.add_layer(symbol_layer)
fig
Trang 115gmapsChoropleth map
gmaps
Heat map
import gmaps
gmaps.configure(api_key=GOOGLE_API_KEY)
fig = gmaps.figure(map_type='SATELLITE')
locations = [
weights=[1895, 926, 5785, 4256, 3745],
point_radius=50)
fig.add_layer(heatmap_layer)
fig