MINISTRY OF EDUCATION & TRAINING
HO CHI MINH CITY UNIVERSITY OF ECONOMICS AND
FINANCE
FINAL REPORT
MAJOR: BUSINESS INTELLIGENCE
MAJOR CODE: A01
SOCIAL MEDIA DATA ANALYSIS
Lecturer: Tran Thanh Cong
Group: 5
Le Nguyen Trieu Giang 205120305
Tran Ngoc My Dung 205023554
Ho Chi Minh City, March, 2024
TABLE OF CONTENTS
TABLE OF CONTENTS 1
LIST OF FIGURES 3
CHAPTER 1: PROBLEM IDENTIFICATION 1
1.1 OVERVIEW 1
1.2 PROBLEM STATEMENT 1
1.3 OBJECTIVES OF THE STUDY AND METHODOLOGY 1
1.4 REPORT STRUCTURE 2
1.4.1 INTRODUCTION: 2
1.4.2 DATA COLLECTION 2
1.4.3 ANALYSIS AND FINDINGS 3
1.4.4 RECOMMENDATIONS 3
1.4.5 CONCLUSION 3
1.4.6 REFERENCES 3
CHAPTER 2: DATA DESCRIPTION 4
2.1 IMPORT PYTHON LIBRARIES: 4
2.2 READING DATASET 4
2.2.1 DATA HEAD 5
2.2.2 DATA TAIL 6
2.2.3 DATA INFO 7
CHAPTER 3: DATA CLEANING 8
3.1 WHAT IS DATA CLEANING? 8
3.2 CHECK FOR DUPLICATION 9
3.3 MISSING VALUES CALCULATION 10
3.4 DATA REDUCTION 11
3.5 FEATURE ENGINEERING 12
3.6 CREATING FEATURES 12
CHAPTER 4: EXPLORATORY DATA ANALYSIS (EDA) 14
4.1 STATISTICS SUMMARY 14
4.2 EDA UNIVARIATE ANALYSIS 16
4.3 DATA TRANSFORMATION 20
4.4 EDA BIVARIATE ANALYSIS 21
4.5 EDA MULTIVARIATE ANALYSIS 23
CHAPTER 5: RECOMMENDATIONS 26
CHAPTER 6: CONCLUSIONS 27
6.1 SUMMARIZE THE PROJECT 27
6.2 LIMITATIONS 27
REFERENCES 29
LIST OF FIGURES
Figure 1: Code to access the data library 4
Figure 2: Data Table Description 5
Figure 3: Description of the last 5 lines of the data table 6
Figure 4: Description of data columns 7
Figure 5: Unique data description for each data column 9
Figure 6: Depicting no missing data 10
Figure 7: Description of columns 0 and 1 having duplicate data 11
Figure 8: Description of the redundant data column 11
Figure 9: Feature column description changed to ID 12
Figure 10: Description of the DateTime data table 12
Figure 11: Describe data without non-numeric columns 15
Figure 12: Describe data with all types of columns 15
Figure 13: Univariate chart “Platform” 16
Figure 14: Univariate chart “Like” 17
Figure 15: Univariate chart “Country” 18
Figure 16: Chart of 2 variables “Top country likes” 19
Figure 17: Separated variables 20
Figure 18: Remove unwanted characters 20
Figure 19: Bivariate graph between Hour and Like 21
Figure 20: Chart of 2 variables "like" and "platform" 22
Figure 21: Multivariate chart "platform": Instagram, Facebook, Twitter 23
Figure 22: Multivariate chart “Percentages of Platforms” 24
Businesses may gain a lot from analyzing social media data to maximize sales, including improved customer understanding, higher engagement, performance measurement, and lower marketing expenses.
• Identify trends and patterns: Analyze user-generated material to determine popular themes, hashtags, and interaction patterns across many social media sites.
• To understand user behavior: Examine how people interact with material, including their preferences, feelings, and engagement metrics such as likes, shares, and comments.
• To measure impact and influence: Determine the effect of social media initiatives, influencers, or viral material on brand perception, customer behavior, and audience attitude
• Explore Audience Segmentation: To better understand your social media audience's requirements and preferences, segment them by demographics, interests, and habits
• Identify Opportunities and Challenges: Identify new opportunities or risks in the social media ecosystem, such as algorithm updates, regulatory difficulties, or changes in customer behavior
• To Inform Decision Making: Provide stakeholders with information and suggestions so they can make educated decisions about content strategy, marketing campaigns, customer service, and product development.
• To enhance customer engagement: Create methods for increasing consumer engagement and loyalty via tailored content, timely communication, and community management on social media platforms
• To support research and innovation: Contribute to scholarly research and innovation in social media analysis methodology, tools, and best practices, helping to develop the discipline and handle emergent difficulties
1.4.1 INTRODUCTION:
• Provide an overview of the study's objectives and rationale
• Introduce the importance of social media analysis in understanding user behavior, brand perception, and market trends
• Outline the structure of the report
1.4.2 DATA COLLECTION
• Provide an overview of the social media platforms and datasets used in the study
• Explain the criteria for selecting data sources and the timeframe of data collection
1.4.3 ANALYSIS AND FINDINGS
• Present the results of the social media analysis based on the research objectives
• Analyze trends, patterns, and insights derived from the data
• Use visualizations such as charts, graphs, and heatmaps to illustrate key findings
• Interpret the findings in the context of existing literature and theoretical frameworks
1.4.4 RECOMMENDATIONS
• Provide actionable recommendations based on the study findings
• Suggest strategies for optimizing social media engagement, improving brand reputation, or addressing identified challenges
• Prioritize recommendations based on their potential impact and feasibility of implementation
1.4.5 CONCLUSION
• Summarize the key findings of the study and their implications
• Reflect on the contribution of the study to the field of social media analysis
• Highlight avenues for future research and areas for further exploration
1.4.6 REFERENCES
• Provide a list of references cited throughout the report, following the appropriate citation style
CHAPTER 2: DATA DESCRIPTION
We place a high priority on enhancing our raw data through the critical phases of exploratory data analysis (EDA) in our data-driven procedures. In this effort, feature engineering and data pre-processing are both essential. Data integration, analysis, cleaning, transformation, and dimension reduction are just a few of the many tasks involved in EDA. Preparing and cleaning raw data is known as data pre-processing, and it is done to make feature engineering easier. Meanwhile, feature engineering involves modifying the data using a variety of methods. This could entail, among other things, handling missing data, encoding variables, addressing categorical variables, and adding or deleting pertinent features.
Without a doubt, feature engineering is an important undertaking that has a big impact on a model's outcome. While pre-processing mainly concentrates on cleaning and arranging the data, feature engineering also involves creating new features based on the current data.
2.1 IMPORT PYTHON LIBRARIES:
The first step in any machine learning workflow with Python is to understand and explore our data using libraries. We import all the libraries needed for the analysis, covering data loading, statistical analysis, visualization, data transformation, merge and join, etc.
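The exact code shown in Figure 1 is not reproduced here; a minimal sketch of the imports such an analysis typically starts with might look like the following (the visualization libraries used in the later EDA chapters are shown commented):

```python
import pandas as pd   # data loading, transformation, merge and join
import numpy as np    # statistical and numerical operations

# Visualization libraries commonly added for the plotting chapters:
# import matplotlib.pyplot as plt
# import seaborn as sns

print(pd.__version__)  # confirm the environment is set up
```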
Figure 1: Code to access the data library
2.2 READING DATASET
The pandas library provides many options for loading data into a pandas DataFrame from formats such as JSON, CSV, XLSX, SQL, pickle, HTML, TXT, etc.
Most data is available in the table format of CSV files, which is popular and easily accessible. By using the read_csv() function, the data can be converted to a pandas DataFrame.
In this report, social media user data is used as an example. With this dataset, we analyze the behavior of social network users and how EDA helps determine what factors influence the business of an enterprise. We have stored the data in the DataFrame data.
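As a small illustration (the real file name and the column names of the social media dataset are assumptions here), loading a CSV into a DataFrame works like this:

```python
import io
import pandas as pd

# Stand-in CSV content; in the report the data would come from a file,
# e.g. data = pd.read_csv("social_media.csv")
csv_text = """Platform,Country,Likes
Instagram,USA,120
Facebook,UK,45
Twitter,India,80
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)  # (3, 3)
```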
2.2.1 DATA HEAD
Function: The data.head() method displays the first few rows of a DataFrame (two-dimensional data table) or Series (one-dimensional data list) in pandas. The default number of rows displayed is 5, but you can specify how many rows to display by passing a parameter to this method.
• Preview data: data.head() helps preview a part of the data without displaying the entire DataFrame or Series, saving time and increasing productivity
• Check the data structure: By displaying the first few rows, you can check the structure of the data, including the columns and their data types
• Check input data: You can use data.head() to check data after reading it from some source, such as from a CSV file or database
• Inspect data after processing: If you have performed data transformations, you can use data.head() to examine the results of those transformations
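The behavior described above can be sketched on a toy frame (column names assumed, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the social media data
data = pd.DataFrame({"Platform": ["Instagram", "Facebook", "Twitter",
                                  "Instagram", "Twitter", "Facebook"],
                     "Likes": [120, 45, 80, 230, 15, 60]})

print(data.head())   # first 5 rows by default
print(data.head(2))  # pass a count to show a different number of rows
```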
Figure 2: Data Table Description
2.2.2 DATA TAIL
Function: The data.tail() method displays the last few rows of a DataFrame or Series in pandas. The default number of rows displayed is 5, but you can also specify how many rows to display by passing a parameter to this method.
• Preview data: data.tail() helps preview the last part of the data without displaying the entire DataFrame or Series, saving time and increasing productivity
• Check the data structure: By displaying the last few rows, you can check the structure of the data, including the columns and their data types
• Check input data: You can use data.tail() to check data after reading it from some source, such
as from a CSV file or database
• Inspect data after processing: If you have performed data transformations, you can use data.tail() to examine the results of those transformations
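Mirroring the head() example, a short sketch on the same kind of toy frame:

```python
import pandas as pd

data = pd.DataFrame({"Platform": ["Instagram", "Facebook", "Twitter",
                                  "Instagram", "Twitter", "Facebook"],
                     "Likes": [120, 45, 80, 230, 15, 60]})

print(data.tail())   # last 5 rows by default
print(data.tail(2))  # last 2 rows
```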
Figure 3: Description of the last 5 lines of the data table
2.2.3 DATA INFO
Function: The data.info() method prints a concise summary of a DataFrame, including:
• Total number of rows (entries) and columns
• The name of each column and the data type of each column
• Total number of non-null values in each column
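A quick sketch of the call (toy data, columns assumed):

```python
import pandas as pd

data = pd.DataFrame({"Platform": ["Instagram", "Facebook", "Twitter"],
                     "Likes": [120, 45, 80]})

# Prints the index range, each column's name, non-null count, and dtype
data.info()
```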
Figure 4: Description of data columns
CHAPTER 3: DATA CLEANING
3.1 WHAT IS DATA CLEANING?
Data cleaning in the analytics pipeline is the process of cleaning and normalizing data before performing analysis. The goal of data cleaning is to remove or correct inaccurate, unreliable, or inconsistent values so that the data fed into the analysis process is accurate and trustworthy.
Below are some of the methods and operations commonly performed when cleaning data in an analytics pipeline:
• Eliminate duplicates: Check for and remove data records that have identical values across rows or columns.
• Handle missing data: Fill in missing values, often using the mean, median, or mode, or using model predictions to estimate the missing values based on other data.
• Standardize data: Ensure that values in each column are expressed in the same units or on the same scale, for example by converting units of measure (e.g., Fahrenheit to Celsius) or converting data to the same form (e.g., strings to lower or upper case).
• Remove noise: Identify and remove noisy or imprecise values in the data, such as outliers that may be the result of recording errors.
• Check and correct errors: Inspect the data to detect invalid or unprocessable values and correct any errors found.
• Reformat data: Ensure that column data is properly formatted as numbers, strings, dates, or other data types appropriate to their content
• Check integrity constraints: Verify relationships between columns of data or between data records, to ensure that data was recorded or collected correctly and consistently.
Data cleansing is an important part of the analytics data pipeline, ensuring that analytics results are accurate and meaningful.
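Two of the operations above, duplicate removal and mean imputation, can be sketched on a toy frame (illustrative data only):

```python
import numpy as np
import pandas as pd

# Toy frame with one duplicate row and one missing value
df = pd.DataFrame({"Platform": ["Instagram", "Instagram", "Twitter"],
                   "Likes": [120.0, 120.0, np.nan]})

df = df.drop_duplicates()                             # remove identical records
df["Likes"] = df["Likes"].fillna(df["Likes"].mean())  # fill missing with the mean
print(df)
```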
The data.nunique() method counts the number of unique values in each column of a DataFrame. It returns a Series containing the number of unique values per column, indexed by column name. Figure 5 shows the result for the social media data.
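A minimal sketch of the call (column names assumed for illustration):

```python
import pandas as pd

data = pd.DataFrame({"Platform": ["Instagram", "Facebook", "Instagram"],
                     "Country": ["USA", "USA", "UK"]})

# One count of distinct values per column, indexed by column name
print(data.nunique())
```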
Figure 5: Unique data description for each data column
3.3 MISSING VALUES CALCULATION
Missing values calculation is the process of computing and evaluating the proportion of missing values in a data set. When working with real-world data, it is common to encounter missing or incomplete data. There are many reasons for missing data, including errors in data collection, data conversion, or data not being available for specific observations.
data.isnull().sum()
Figure 6: Depicting no missing data
After checking the above data, there is no blank data
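A runnable sketch of the check (here with an artificial missing value so the counts are visible; the real dataset, as noted above, has none):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Likes": [120.0, np.nan, 80.0],
                     "Country": ["USA", "UK", "India"]})

# isnull() flags each missing cell; sum() totals the flags per column
print(data.isnull().sum())
```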
3.4 DATA REDUCTION
Some columns or variables can be dropped if they do not add value to our analysis
In the data table above, the columns "Unnamed: 0.1" and "Unnamed: 0" contain the same data. We will delete one of the two columns without affecting the social media data analysis we are doing.
Figure 7: Description of columns 0 and 1 having duplicate data
The code line data.drop('Unnamed: 0.1', axis=1, inplace=True) removes the column named 'Unnamed: 0.1' from the DataFrame data, leaving the remaining columns intact.
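The drop can be reproduced on a toy frame with the same duplicated index columns:

```python
import pandas as pd

# Toy frame reproducing the duplicated index columns described above
data = pd.DataFrame({"Unnamed: 0": [0, 1], "Unnamed: 0.1": [0, 1],
                     "Likes": [120, 45]})

# axis=1 targets a column; inplace=True modifies data in place
data.drop('Unnamed: 0.1', axis=1, inplace=True)
print(data.columns.tolist())  # ['Unnamed: 0', 'Likes']
```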
Figure 8: Description of the redundant data column
3.5 FEATURE ENGINEERING
Feature engineering refers to using domain knowledge to select and transform the most relevant variables from raw data when creating a predictive model using machine learning or statistical modeling. The main goal of feature engineering is to create meaningful features from raw data.
Figure 9: Feature column description changed to ID
3.6 CREATING FEATURES
"Creating features" is the process of creating new variables or attributes from the original data to improve model performance or to better understand the data In data analytics and machine learning, selecting and generating appropriate features can be an important and extremely creative part of the modeling process
Figure 10: Description of the datetime data table
We create a feature named "datetime" so that the data can be retrieved more specifically and analyzed more easily for sales optimization based on the social media user data.
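One way such a feature could be built (the Date and Hour column names here are assumptions, not necessarily those of the real dataset):

```python
import pandas as pd

# Hypothetical date and hour columns standing in for the real ones
data = pd.DataFrame({"Date": ["2024-03-01", "2024-03-02"],
                     "Hour": [9, 21]})

# Combine the parts into a single datetime feature
data["datetime"] = (pd.to_datetime(data["Date"])
                    + pd.to_timedelta(data["Hour"], unit="h"))
print(data["datetime"])
```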
CHAPTER 4: EXPLORATORY DATA ANALYSIS (EDA)
Exploratory Data Analysis (EDA) is a key approach in data research and analysis that allows you to explore, comprehend, and extract crucial information from the original data set. EDA not only allows us to handle data more thoroughly, but also to discover hidden patterns, trends, and information without making prior assumptions.
During the EDA process, we will utilize descriptive statistics and data visualization tools to investigate the data set's fundamental properties, such as variable distribution, variable dependencies, and potential outliers. This allows us to obtain a better grasp of the data and formulate specific questions for future research.
EDA is not only a vital first stage in the data analysis process, but also an effective instrument for producing preliminary ideas and hypotheses, which shape subsequent analysis techniques and tactics. It also aids in identifying difficulties and obstacles in the data collection, laying the groundwork for further data processing and preparation.
4.1 STATISTICS SUMMARY
A statistical summary provides a general understanding of the data's distribution, including whether it is normally distributed, skewed left or right, or contains any outliers. This may be accomplished in Python using describe(). The describe() function offers statistical summaries of the data; by default it covers numeric data types, such as float and int.
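A small sketch of both calls (toy data; the include="all" variant corresponds to the summary with all column types described in Figures 11 and 12):

```python
import pandas as pd

data = pd.DataFrame({"Platform": ["Instagram", "Facebook", "Twitter"],
                     "Likes": [120, 45, 80]})

print(data.describe())               # numeric columns only (default)
print(data.describe(include="all"))  # every column type
```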