
VIETNAM NATIONAL UNIVERSITY, HANOI

INTERNATIONAL SCHOOL


REPORT OF GROUP 3

PROJECT Determining the Effect of Income Level on NCAA Game Performance

Lecturer: Ha Manh Hung - Faculty of Applied Sciences

Members of group:

Nguyen Duy Thuc - 21070278

Hoang Ngoc Khoa - 21070330

Le Thuy Huyen - 21070410

Dao Diem Quynh - 21070558

Bui Thanh Thao - 21070893


Contents

List of Abbreviation

1 Project Proposal

2 Data Source and Description

2.2 Fair Market Rent Data

3 Importing Modules and Data

4 EDA and Data Cleaning

4.1 Exploring team data

4.2 Exploring rent price data

4.3 Merging the teams information with the rent information

4.4 Exploring game information

4.5 Merging games data with team-rent information

5 Data Analytics and Hypothesis Testing

5.1 Impact of income level on the performance of the team

5.2 Impact of game venue in higher-income areas and attendance

5.3 Impact of team's hometown income level and the number and type of fouls they commit

5.4 Impact of rent near the team's venue and the number and type of fouls they commit (Personal and flagrant fouls)

5.5 Impact of a team performance for home games vs away games

5.6 Impact on home team performance based on attendance

6.1 Dashboard


LIST OF ABBREVIATION

NCAA National Collegiate Athletic Association

HUD Department of Housing and Urban Development

ABSTRACT

This study examines the impact of zip code income levels on NCAA basketball teams' performance. Using zip code rental price data as a proxy for income, we employ Python-based analytics tools to explore the correlation between these factors. Rigorous hypothesis testing validates or invalidates this relationship, with key insights presented through informative plots. Our report underscores the significant role of environmental factors in NCAA basketball teams' success, drawing from datasets in NCAA basketball conferences and fair market rent sources.


1 Project Proposal

Our objective is to analyze the effect of the environment, described as the income level of the zip code, on the performance of NCAA basketball teams. Using zip code rental price data as a proxy for income, we will examine the correlation between these variables. Our analysis draws on datasets from NCAA basketball conferences and fair market rent sources and uses Python-based analytics tools.

Through rigorous hypothesis testing, we will validate, invalidate, or seek additional information regarding the relationship between zip code income and game performance. Each analysis will yield key insights, showcased through informative plots. Our final report will summarize the findings and emphasize the role of environmental factors in NCAA basketball teams' success.
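As an illustration of what a correlation check with a hypothesis test can look like, the sketch below computes a Pearson correlation and a permutation-based p-value using only numpy. The helper names `pearson_r` and `permutation_p_value` and the toy numbers are assumptions made for this example; they are not the tests or data used in the report.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value for H0: x and y are not associated."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = abs(pearson_r(x, y))
    hits = 0
    for _ in range(n_permutations):
        # shuffling y breaks any real link between the two variables
        if abs(pearson_r(x, rng.permutation(y))) >= observed:
            hits += 1
    # add-one correction keeps the estimate away from exactly zero
    return (hits + 1) / (n_permutations + 1)

# Hypothetical toy values, not taken from the project data
rent = [700, 850, 900, 1100, 1300, 1500]
win_rate = [0.40, 0.45, 0.50, 0.48, 0.60, 0.65]
r = pearson_r(rent, win_rate)
p = permutation_p_value(rent, win_rate)
print(f"r = {r:.3f}, p = {p:.4f}")
```

A small p-value would suggest rejecting the hypothesis of no association; with real data, the choice of test should follow the distribution of the variables involved.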

2 Data Source and Description

There are two datasets relevant to our project:

ncaa_basketball.mbb_games_sr: This dataset contains team statistics for every men's basketball game from the 2013-14 season to the 2017-18 season. It consists of 29,802 records and 132 variables, including an ID. Each row in this dataset represents the statistics for both teams in a single game.

ncaa_basketball.mbb_teams: This dataset provides general information about the 351 basketball teams. It contains 351 records and 28 variables, including an ID. Each row in this dataset represents a specific basketball team. Part of the data structure of the games dataset:

game_id: (Data type: String, Can be NULL value)


season: (Data type: Integer, Can be NULL value)

status: (Data type: String, Can be NULL value)

coverage: (Data type: String, Can be NULL value)

In this data structure, game_id is the unique identifier for each match, season indicates the season of the match, status represents the final status of the game file from Sportradar, and coverage describes the type of coverage provided for that match.

2.2 Fair Market Rent Data

For this project, we use rental price data published annually by the Department of Housing and Urban Development (HUD) on its website. We use the latest available dataset, which gives the 40th percentile of rent prices for the year 2022.

fy2022_erap_fmrs_revised: This dataset includes the 40th percentile of rent prices for each housing type. Each row provides statistical information for a zip code, or for a sub-area if a zip code encompasses multiple areas. The structure of this dataset is described as follows:

HUD Metro Fair Market Rental Area Name: Name of the area defined by HUD

CBSASub22: ID used by HUD for each area

Zip Code: Zip code of the area

erap_fmr_br0: 40th percentile of rent prices for studios in the area or zip code (whichever is smaller)

erap_fmr_br1: 40th percentile of rent prices for 1-bedroom apartments in the area or zip code (whichever is smaller)

erap_fmr_br2: 40th percentile of rent prices for 2-bedroom apartments in the area or zip code (whichever is smaller)


3 Importing Modules and Data

The data is stored as CSV files in Google Drive, which we connect to Google Colab. Necessary modules such as pandas, numpy, and seaborn are also imported to support the analysis in this project.
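A minimal sketch of the loading step is given here, under the assumption that each dataset is a plain CSV file. The helper `load_csv` and the in-memory sample are hypothetical stand-ins so the example runs outside Colab; the actual Drive paths used in the project are not shown in this report.

```python
import io

import pandas as pd

def load_csv(source, **read_kwargs):
    """Read a CSV from a path or file-like object into a DataFrame."""
    return pd.read_csv(source, **read_kwargs)

# In Colab, Google Drive is usually mounted first, for example:
#   from google.colab import drive
#   drive.mount('/content/drive')
# `source` would then be a path under /content/drive (hypothetical here).

# Stand-in for a file on Drive so the sketch runs anywhere
sample = io.StringIO("id,venue_zip\nA,79699\nB,\n")
demo_df = load_csv(sample)
print(demo_df.dtypes)
```

Note that a numeric column with a missing entry is read as float64, which is consistent with venue_zip appearing as a float column in the team data.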

4 EDA and Data Cleaning

In the EDA and data cleaning section, we work with each data set in turn. In this step we get an initial view of each data set and clean the data.

4.1 Exploring team data

The first data set contains information on the 351 basketball teams. With the three commands head, shape, and info, we get some basic information and see the first rows of the data set.

RangeIndex: 351 entries, 0 to 350
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype
 0   market          351 non-null    object
 1   alias           351 non-null    object
 4   code_ncaa       351 non-null    int64
 5   kaggle_team_id  351 non-null    int64
 7   turner_name     351 non-null    object
 8   league_name     351 non-null    object
 9   league_alias    351 non-null    object
 10  league_id       351 non-null    object
 12  conf_alias      351 non-null    object
 13  conf_id         351 non-null    object
 17  venue_id        351 non-null    object
 18  venue_city      351 non-null    object
 20  venue_address   347 non-null    object
 21  venue_zip       350 non-null    float64
 22  venue_country   351 non-null    object
 23  venue_name      351 non-null    object
 24  venue_capacity  351 non-null    int64
 25  logo_large      351 non-null    object
 26  logo_medium     351 non-null    object
dtypes: float64(1), int64(3), object(24)
memory usage: 76.9+ KB

Figure 1 First columns of the dataset

The data set consists of 351 rows and 28 columns. We can see information about the basketball teams such as name, id, code_ncaa, school_ncaa, turner_name, league_name, conf_name, division_name, venue_id, venue_city, and so on. Most columns have the object data type, and a few are int or float. Overall the data set is very complete: there is only 1 missing value in the venue_zip column and 4 in the venue_address column. This is easiest to see with the isnull command.

We will clean this dataset before moving on to the next one. Although two columns have missing values, we only handle venue_zip, because it will be used to connect to another dataset. The venue_address variable is not used in the following steps and has only a few missing values, so we leave it as it is for now.
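The inspection commands mentioned above can be sketched on a small stand-in table. The frame `toy_teams` and its values are hypothetical; only the column names venue_zip and venue_address come from the team data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the teams table
toy_teams = pd.DataFrame({
    "market": ["A", "B", "C"],
    "venue_zip": [79699.0, np.nan, 10001.0],
    "venue_address": ["1 Main St", None, None],
})

print(toy_teams.head())   # first rows
print(toy_teams.shape)    # (rows, columns)
toy_teams.info()          # dtypes and non-null counts per column

# Missing values per column, keeping only columns that have any
missing = toy_teams.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

Filtering the isnull counts to non-zero entries keeps the summary short when a table has many complete columns, as is the case for the 28-column team data.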


teams_df[teams_df['venue_zip'].isna()]

Figure 3 Find the row missing zip data

# we fill in the missing value manually using information from Google

teams_df.loc[135, 'venue_zip'] = 79699

Figure 4 Fill in missing value

We find the row containing the missing value, look up the missing zip code on Google, and fill it in.

To ensure that the venue_zip variable has a uniform data type, we check all 351 values and, to be careful, convert the entire column to int (this step is not essential and may be omitted).

# since zip codes are integers, we check if there are any values that are inconsistent

(teams_df['venue_zip'] == teams_df['venue_zip'].astype('int64', errors='raise')).value_counts()

Figure 5 Check for inconsistent values

# we can safely cast the data to 'int64'

teams_df['venue_zip'] = teams_df['venue_zip'].astype('int64', errors='raise')

Figure 6 Cast the data to int64
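The fill-then-cast steps above can be wrapped in a small guard that refuses a lossy conversion. This is a sketch under assumptions: the helper `to_int_zip` and the toy series are hypothetical, and only the label 135 and the value 79699 mirror the manual fix shown earlier.

```python
import numpy as np
import pandas as pd

def to_int_zip(series):
    """Cast a float zip-code column to int64, refusing lossy conversions."""
    if series.isna().any():
        raise ValueError("fill or drop missing zip codes before casting")
    values = series.to_numpy(dtype=float)
    if not np.all(values == np.floor(values)):
        raise ValueError("non-integer zip codes found")
    return series.astype("int64")

# Toy column with one gap at index label 135
toy_zip = pd.Series([10001.0, np.nan], index=[7, 135])
toy_zip.loc[135] = 79699          # fill by label, as done with .loc above
zip_int = to_int_zip(toy_zip)
print(zip_int)
```

Storing zip codes as integers drops any leading zeros, so a join on this column only works if the other table stores its zip codes as integers too.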

4.2 Exploring rent price data

Next, we work with the second dataset, on rent prices. Since we connect it to the team data by zip code, we have to check that the zip code and rent price columns are sound. We limit our analysis to the 1-bedroom rent prices. We begin with a brief examination of this dataset.


0 Abilene, TX MSA METRO10180M10180 76437 $688 $732 $945 $1,288 $1,598

Figure 7 Second dataset on rent prices 1

Figure 9 Second dataset on rent prices 3

The data set contains 29283 rows and 8 columns, including the HUD fair market rent area name, CBSASub22, zip code, and the rent prices for different numbers of bedrooms. All columns have the object data type and contain no missing values. We therefore convert the 1-bedroom rent prices to int: we create a new column called "rent" and fill it with the data from the 1-bedroom rental price column. In this data set, one zip code can belong to several areas, so we group the rows by zip code and calculate the average rental price. Finally, we check the data set again after processing to make sure no data was lost: the number of rows should equal the number of unique zip codes.

# cleaning rent prices for 0 bedroom by removing any non-digit characters
# regex=False treats '$' as a literal character rather than a regex anchor

rent_df['rent'] = rent_df['erap_fmr_br0'].str.replace('$', '', regex=False)

rent_df['rent'] = rent_df['rent'].str.replace(',', '', regex=False).astype(int)

# if a zip code has multiple areas, it will have multiple entries

# we aggregate data by zip code and take the mean of the areas in the zip code

zip_rent_price_df = rent_df.groupby('ZIP\nCode').agg({'rent': 'mean'})

# checking for any data loss; the number of rows should be equal to the number of unique zip codes

print("Checking for data consistency:")

print(len(zip_rent_price_df) == len(rent_df['ZIP\nCode'].unique()))

Checking for data consistency:

Figure 10 Cleaning, aggregating, and checking the data
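The cleaning and aggregation steps can be tried on a few rows before running them on the full file. The sketch below is a hedged stand-in: the function `clean_rent`, the short column names `zip` and `br`, and the values are hypothetical, chosen only to show the string cleaning and the per-zip mean.

```python
import pandas as pd

def clean_rent(raw, zip_col, rent_col):
    """Return the mean rent per zip code from '$1,234'-style strings."""
    rent = (
        raw[rent_col]
        .str.replace("$", "", regex=False)   # literal dollar sign
        .str.replace(",", "", regex=False)   # thousands separator
        .astype(int)
    )
    out = raw[[zip_col]].assign(rent=rent)
    return out.groupby(zip_col).agg({"rent": "mean"})

# Toy rows: zip 76437 appears in two areas, so its rents are averaged
toy_rent = pd.DataFrame({
    "zip": [76437, 76437, 79699],
    "br": ["$688", "$712", "$1,288"],
})
zip_rent = clean_rent(toy_rent, zip_col="zip", rent_col="br")
print(zip_rent)

# Same consistency check as above: one row per unique zip code
print(len(zip_rent) == toy_rent["zip"].nunique())
```

Passing the column names as parameters makes it easy to switch between bedroom categories without editing the cleaning logic.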

4.3 Merging the teams information with the rent information

Before moving on to the final data set, we connect the two data sets of basketball teams and rental prices. We use an inner join because we only want to analyze teams with rent data. We also remove some unnecessary columns.

# we use an inner join because we only want to analyze teams with rent data

teams_rent_df = teams_df.merge(zip_rent_price_df, right_index=True, left_on='venue_zip', how='inner')

Figure 11 Analyze teams with rent data
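An inner join silently drops teams whose zip code has no rent entry (the merged table in this report has 299 entries against 351 teams). One way to see which rows are dropped is a left join with pandas' `indicator` option, sketched here on hypothetical toy frames.

```python
import pandas as pd

# Hypothetical stand-ins for the teams table and the per-zip rent table
toy_teams = pd.DataFrame({"market": ["A", "B", "C"],
                          "venue_zip": [76437, 79699, 99999]})
toy_zip_rent = pd.DataFrame({"rent": [700.0, 1288.0]},
                            index=pd.Index([76437, 79699], name="zip"))

# Left join with an indicator column to see which teams lack rent data
audit = toy_teams.merge(toy_zip_rent, left_on="venue_zip", right_index=True,
                        how="left", indicator=True)
unmatched = audit.loc[audit["_merge"] == "left_only", "market"].tolist()
print("teams without rent data:", unmatched)

# Inner join keeps only the matched teams, as in the report
matched = toy_teams.merge(toy_zip_rent, left_on="venue_zip",
                          right_index=True, how="inner")
print(matched)
```

Listing the unmatched teams before discarding them helps confirm that the losses come from missing rent data rather than from a formatting mismatch in the zip codes.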

# we drop the irrelevant columns

teams_rent_cols_drop = [5, 25, 26, 27]

teams_rent_df.drop(teams_rent_df.columns[teams_rent_cols_drop], axis=1, inplace=True)

Figure 12 Drop irrelevant columns
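Dropping columns by position depends on the column order staying the same. As an alternative sketch, columns can be dropped by name; the toy frame and the list `cols_to_drop` are hypothetical (logo_small in particular is an assumed name used to show how absent columns are handled).

```python
import pandas as pd

# Hypothetical miniature of the merged table
toy = pd.DataFrame({"market": ["A"], "logo_large": ["x.png"],
                    "logo_medium": ["y.png"], "rent": [700.0]})

# Names that are not present are skipped because of errors="ignore"
cols_to_drop = ["logo_large", "logo_medium", "logo_small"]
trimmed = toy.drop(columns=cols_to_drop, errors="ignore")
print(list(trimmed.columns))
```

Returning a new frame instead of using inplace=True also keeps the original table available for comparison.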

Now we check the information of the newly created dataset.

Int64Index: 299 entries, 0 to 350
Data columns (total 25 columns), including: code_ncaa, school_ncaa, league_name, league_alias, league_id, conf_alias, conf_id, division_name, division_alias, division_id, venue_id, venue_city, venue_state, venue_address, venue_zip, venue_country, venue_name, venue_capacity, rent
dtypes: float64(1), int64(3), object(21)
memory usage: 60.7+ KB

Figure 13 Newly created dataset

4.4 Exploring game information

The final dataset contains match information. We repeat the steps above to get basic information about this dataset.

Figure 14 Final dataset of match information (preview of the first rows; visible columns include status, coverage, neutral_site, scheduled_date, gametime, conference_game, tournament, tournament_type, a_fast_break_pts, a_second_chance_pts, a_team_turnovers, and a_points_off_turnovers)

