VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
REPORT OF GROUP 3
PROJECT: Determining the Effect of Income Level on NCAA Game Performance
Lecturer: Ha Manh Hung - Faculty of Applied Sciences
Members of group: Nguyen Duy Thuc - 21070278
Hoang Ngoc Khoa - 21070330
Le Thuy Huyen - 21070410
Dao Diem Quynh - 21070558
Bui Thanh Thao - 21070893
Contents
List of Abbreviations
1 Project Proposal
2 Data Source and Description
2.2 Fair Market Rent Data
3 Importing Modules and Data
4 EDA and Data Cleaning
4.1 Exploring team data
4.2 Exploring rent price data
4.3 Merging the teams information with the rent information
4.4 Exploring game information
4.5 Merging games data with team-rent information
5 Data Analytics and Hypothesis Testing
5.1 Impact of income level on the performance of the team
5.2 Impact of game venue in higher-income areas and attendance
5.3 Impact of team's hometown income level and the number and type of fouls they commit
5.4 Impact of rent near the team's venue and the number and type of fouls they commit (personal and flagrant fouls)
5.5 Impact of a team performance for home games vs away games
5.6 Impact on home team performance based on attendance
6.1 Dashboard
LIST OF ABBREVIATIONS
NCAA National Collegiate Athletic Association
HUD Department of Housing and Urban Development
ABSTRACT
This study examines the impact of zip code income levels on NCAA basketball teams' performance. Using zip code rental price data as a proxy for income, we employ Python-based analytics tools to explore the correlation between these factors. Hypothesis testing validates or invalidates this relationship, with key insights presented through plots. Our report underscores the role of environmental factors in NCAA basketball teams' success, drawing on datasets from NCAA basketball conferences and fair market rent sources.
1 Project Proposal
Our objective is to analyze how the environment, described by the income level of a team's zip code, affects the performance of NCAA basketball teams. Using zip code rental price data as a proxy for income, we will examine the correlation between these variables. Our analysis draws on datasets from NCAA basketball conferences and fair market rent sources and uses Python-based analytics tools.
Through hypothesis testing, we will validate, invalidate, or seek additional information regarding the relationship between zip code income and game performance. Each analysis will yield key insights, presented through plots. Our final report will summarize the findings and emphasize the role of environmental factors in NCAA basketball teams' success.
2 Data Source and Description
There are two datasets relevant to our project:
ncaa_basketball.mbb_games_sr: This dataset contains team statistics for every men's basketball game from the 2013-14 season to the 2017-18 season. It consists of 29,802 records and 132 variables, including an ID. Each row in this dataset represents the statistics for both teams in a single game.
ncaa_basketball.mbb_teams: This dataset provides general information about the 351 basketball teams. It contains 351 records and 28 variables, including an ID. Each row in this dataset represents the information for a specific basketball team. Part of the data structure of the games dataset:
game_id: (Data type: String, Can be NULL value)
season: (Data type: Integer, Can be NULL value)
status: (Data type: String, Can be NULL value)
coverage: (Data type: String, Can be NULL value)
In this data structure, game_id is the unique identifier for each match, season indicates the season of the match, status represents the final status of the game file from Sportradar, and coverage describes the type of coverage provided for that match.
2.2 Fair Market Rent Data
Rental price data is published annually by the Department of Housing and Urban Development (HUD) on its website. For this project we use the latest available dataset, which gives the 40th percentile of rent prices for the year 2022.
fy2022_erap_fmrs_revised: This dataset includes the 40th percentile of rent prices for each housing type. Each row provides statistical information for a zip code, or for a sub-area if a zip code encompasses multiple areas. The structure of this dataset is described as follows:
HUD Metro Fair Market Rental Area Name: Name of the area defined by HUD
CBSASub22: ID used by HUD for each area
Zip Code: Zip code of the area
erap_fmr_br0: 40th percentile of rent prices for studios in the area or zip code (whichever is smaller)
erap_fmr_br1: 40th percentile of rent prices for 1-bedroom apartments in the area or zip code (whichever is smaller)
erap_fmr_br2: 40th percentile of rent prices for 2-bedroom apartments in the area or zip code (whichever is smaller)
erap_fmr_br3: 40th percentile of rent prices for 3-bedroom apartments in the area or zip code (whichever is smaller)
erap_fmr_br4: 40th percentile of rent prices for 4-bedroom apartments in the area or zip code (whichever is smaller)
3 Importing Modules and Data
The data, stored as csv files in Google Drive, is connected to Google Colab. The necessary modules, such as pandas, numpy, and seaborn, are also imported to support the analysis in this project.
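The loading step could look roughly like the sketch below. The folder path and any file names are hypothetical placeholders, since the report does not state them, and the Drive mount is guarded so that the snippet also runs outside Colab. Only pandas is imported here; numpy and seaborn would be imported in the same way.

```python
import os
import pandas as pd

# Hypothetical folder: the report does not give the actual Drive path.
DATA_DIR = "/content/drive/MyDrive/ncaa_project"


def mount_drive_if_colab() -> bool:
    """Mount Google Drive when running inside Colab; do nothing elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount("/content/drive")
    return True


def load_csv(name: str, data_dir: str = DATA_DIR) -> pd.DataFrame:
    """Read one csv file from the data folder into a DataFrame."""
    return pd.read_csv(os.path.join(data_dir, name))
```

In a notebook, one would call mount_drive_if_colab() once and then load each file, for example load_csv("mbb_teams.csv"), where the file name is again only an assumed example.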
4 EDA and Data Cleaning
In the EDA and data cleaning section, we work with each dataset in turn. In this step we take an initial look at each dataset and clean the data.
4.1 Exploring team data
First is the information on the 351 basketball teams. With the three commands head, shape, and info we obtain some basic information and the first rows of the dataset.
RangeIndex: 351 entries, 0 to 350
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype
 0   market          351 non-null    object
 1   alias           351 non-null    object
 4   code_ncaa       351 non-null    int64
 5   kaggle_team_id  351 non-null    int64
 7   turner_name     351 non-null    object
 8   league_name     351 non-null    object
 9   league_alias    351 non-null    object
 10  league_id       351 non-null    object
 12  conf_alias      351 non-null    object
 13  conf_id         351 non-null    object
 17  venue_id        351 non-null    object
 18  venue_city      351 non-null    object
 20  venue_address   347 non-null    object
 21  venue_zip       350 non-null    float64
 22  venue_country   351 non-null    object
 23  venue_name      351 non-null    object
 24  venue_capacity  351 non-null    int64
 25  logo_large      351 non-null    object
 26  logo_medium     351 non-null    object
dtypes: float64(1), int64(3), object(24)
memory usage: 76.9+ KB
Figure 1 The first columns in the dataset
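As a minimal sketch of the three inspection commands named above, the snippet below applies them to a small stand-in DataFrame. The name toy_teams and its values are illustrative assumptions, not rows from the real team dataset.

```python
import pandas as pd

# Toy stand-in for the team data; the values are made up for illustration.
toy_teams = pd.DataFrame({
    "market": ["Team A", "Team B", "Team C"],
    "venue_zip": [10001.0, None, 30301.0],
    "venue_capacity": [5000, 8000, 12000],
})

print(toy_teams.head(2))  # first two rows of the dataset
print(toy_teams.shape)    # tuple of (number of rows, number of columns)
toy_teams.info()          # column names, non-null counts, and dtypes
```

Here head previews rows, shape gives the dimensions, and info prints a per-column summary of the kind shown in Figure 1.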
The dataset consists of 351 rows and 28 columns. It contains information about the basketball teams such as name, id, code_ncaa, school_ncaa, turner_name, league_name, conf_name, division_name, venue_id, venue_city, and so on. Most columns are of type object, and a few are int or float. Overall the dataset is very complete: there is only 1 missing value in the venue_zip column and 4 in the venue_address column. This is seen most clearly with the isnull command.
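A short sketch of the isnull check, again on assumed toy data rather than the real team table: summing the boolean mask column by column gives the number of missing values per column.

```python
import pandas as pd

# Toy data with deliberately missing entries (illustrative values only).
toy = pd.DataFrame({
    "market": ["Team A", "Team B", "Team C"],
    "venue_address": ["1 Main St", None, None],
    "venue_zip": [10001.0, None, 30301.0],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isnull().sum()

# Keep only the columns that actually have missing values.
print(missing_per_column[missing_per_column > 0])
```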
We clean this dataset before moving on to the next one. Although two columns have missing values, we handle only venue_zip, because it is used to connect to another dataset. The variable venue_address plays no role in the following steps and is missing in only four rows, so it can be handled later if needed.
teams_df[teams_df['venue_zip'].isna()]
Figure 3 Find the row missing zip data
# we fill in the missing value manually using information from Google
teams_df.loc[135, 'venue_zip'] = 79699
Figure 4 Fill in missing value
We locate the row containing the missing value, look up the missing zip code on Google, and fill it in.
To ensure a uniform data type in the venue_zip variable, we check all 351 values and, to be careful, convert the whole column to int (this step is not essential and may be omitted).
# since zip codes are integers, we check if there are any values that are inconsistent
(teams_df['venue_zip'] == teams_df['venue_zip'].astype('int64', errors='raise')).value_counts()
Figure 5 Check for inconsistent values
# We can safely cast the data to 'int64'
teams_df['venue_zip'] = teams_df['venue_zip'].astype('int64', errors='raise')
Figure 6 Cast the data to int64
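The check-then-cast idea can be sketched on a toy Series as follows. The variable names are illustrative assumptions. Note that the missing value has to be filled first, because casting a column that still contains NaN to int64 raises an error.

```python
import pandas as pd

# Toy zip codes stored as floats, as happens when a column once held a NaN.
zips = pd.Series([79699.0, 10001.0, 30301.0], name="venue_zip")

# Every value should survive the round trip float -> int unchanged.
is_whole = (zips == zips.astype("int64")).all()

# Cast only when the check passes, so no fractional part is silently dropped.
zips_int = zips.astype("int64") if is_whole else zips
print(is_whole, zips_int.dtype)
```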
4.2 Exploring rent price data
Next, we work with the second dataset on rent prices. Since we connect it to the team data through the zip code, we have to check the sanity of the zip code and rent price columns. We limit our analysis to the 1-bedroom rent prices. We proceed with a brief examination of this dataset.
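   HUD Metro Fair Market Rental Area Name  CBSASub22         ZIP Code  erap_fmr_br0  erap_fmr_br1  erap_fmr_br2  erap_fmr_br3  erap_fmr_br4
0  Abilene, TX MSA                         METRO10180M10180  76437     $688          $732          $945          $1,288        $1,598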
Figure 7 Second dataset on rent prices 1
Figure 9 Second dataset on rent prices 3
The dataset contains 29,283 rows and 8 columns, including the HUD fair market rent area name, CBSASub22, the zip code, and the rent prices for the different unit sizes. All columns are stored as objects and contain no missing values. We therefore convert the 1-bedroom rent prices to int: we create a new column called "rent" and fill it with the cleaned values from the 1-bedroom rental price column. In this dataset a single zip code can belong to several areas, so we group the rows by zip code and calculate the average rental price. Finally, we check the processed dataset again to confirm that no data was lost: the number of rows should equal the number of unique zip codes.
# cleaning rent prices for 0 bedroom by removing any non-digit characters
rent_df['rent'] = rent_df['erap_fmr_br0'].str.replace('$', '')
rent_df['rent'] = rent_df['rent'].str.replace(',', '').astype(int)
# if a zip code has multiple areas, it will have multiple entries
# we aggregate data by zip code and take the mean of the areas in the zip code
zip_rent_price_df = rent_df.groupby('ZIP\nCode').agg({'rent': 'mean'})
# checking for any data loss; the number of rows should be equal to the number of unique zip codes
print("Checking for data consistency:")
print(len(zip_rent_price_df) == len(rent_df['ZIP\nCode'].unique()))
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: FutureWarning: The default value of r
Checking for data consistency:
Figure 10 Cleaning, aggregating, and checking data
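To make the cleaning and aggregation concrete, here is a small worked example on made-up prices. The column names zip and price are simplified assumptions and not the HUD headers. Passing regex=False makes pandas treat "$" as a literal character, which is one way to avoid depending on the default that the warning above refers to.

```python
import pandas as pd

# Toy rent table: zip code 11111 appears in two areas (illustrative values).
toy_rent = pd.DataFrame({
    "zip": ["11111", "11111", "22222"],
    "price": ["$700", "$900", "$1,288"],
})

# Strip the currency symbol and thousands separator, then convert to int.
toy_rent["rent"] = (
    toy_rent["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

# One row per zip code, holding the mean rent over its areas.
mean_rent = toy_rent.groupby("zip").agg({"rent": "mean"})

# Consistency check: as many rows as there are unique zip codes.
print(len(mean_rent) == toy_rent["zip"].nunique())
```

For zip code 11111 the two areas at 700 and 900 average to 800, while 22222 keeps its single value of 1288.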
4.3 Merging the teams information with the rent information
Before moving on to the final dataset, we connect the two datasets of basketball teams and rental prices. We use an inner join because we only want to analyze teams with rent data. We also remove some unnecessary columns.
# we use an inner join because we only want to analyze teams with rent data
teams_rent_df = teams_df.merge(zip_rent_price_df, right_index=True, left_on='venue_zip', how='inner')
Figure 11 Analyze teams with rent data
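Because an inner join silently drops teams whose zip code has no rent entry, it can be useful to list those teams. The sketch below does this on toy data with a left join and the indicator option of merge. This check is an illustrative addition and not part of the original notebook, and all names and values are assumed.

```python
import pandas as pd

# Toy team table and toy rent table indexed by zip code (illustrative values).
toy_teams = pd.DataFrame({
    "market": ["A", "B", "C"],
    "venue_zip": [11111, 22222, 33333],
})
toy_rent = pd.DataFrame(
    {"rent": [800.0, 1288.0]},
    index=pd.Index([11111, 22222], name="zip"),
)

# Inner join: keeps only teams whose zip code appears in the rent table.
inner = toy_teams.merge(toy_rent, left_on="venue_zip", right_index=True, how="inner")

# Left join with indicator=True adds a _merge column marking unmatched rows.
check = toy_teams.merge(
    toy_rent, left_on="venue_zip", right_index=True, how="left", indicator=True
)
unmatched = check.loc[check["_merge"] == "left_only", "market"].tolist()
print(len(inner), unmatched)
```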
# we drop the irrelevant columns
teams_rent_cols_drop = [5, 25, 26, 27]
teams_rent_df.drop(teams_rent_df.columns[teams_rent_cols_drop], axis=1, inplace=True)
Figure 12 Drop irrelevant columns
Now we will check the information of the newly created dataset.
[Output of info for the merged dataset: 299 entries, 25 columns in total; visible columns: code_ncaa, school_ncaa, league_name, league_alias, league_id, conf_alias, conf_id, division_name, division_alias, division_id, venue_id, venue_city, venue_state, venue_address, venue_zip, venue_country, venue_name, venue_capacity, rent; dtypes: float64(1), int64(3), object(21); memory usage: 60.7+ KB]
Figure 13 Newly created dataset
4.4 Exploring game information
The final dataset contains match information. We repeat the steps above to get basic information about this dataset.
[Preview of the games dataset; visible columns include status, coverage, neutral_site, scheduled_date, gametime, conference_game, tournament, tournament_type, a_fast_break_pts, a_second_chance_pts, a_team_turnovers, a_points_off_turnovers]
Figure 14 Final dataset of match information