MINISTRY OF EDUCATION & TRAINING
HO CHI MINH CITY UNIVERSITY OF ECONOMICS AND
FINANCE
FINAL REPORT
MAJOR: BUSINESS INTELLIGENCE
MAJOR CODE: A01
SOCIAL MEDIA DATA ANALYSIS
Lecturer: Tran Thanh Cong
Group: 5
Le Nguyen Trieu Giang 205120305
Tran Ngoc My Dung 205023554
Ho Chi Minh City, March, 2024
TABLE OF CONTENTS
TABLE OF CONTENTS 1
LIST OF FIGURES 3
CHAPTER 1: PROBLEM IDENTIFICATION 1
1.1 OVERVIEW 1
1.2 PROBLEM STATEMENT 1
1.3 OBJECTIVES OF THE STUDY AND METHODOLOGY 1
1.4 REPORT STRUCTURE 2
1.4.1 INTRODUCTION: 2
1.4.2 DATA COLLECTION 2
1.4.3 ANALYSIS AND FINDINGS 3
1.4.4 RECOMMENDATIONS 3
1.4.5 CONCLUSION 3
1.4.6 REFERENCES 3
CHAPTER 2: DATA DESCRIPTION 4
2.1 IMPORT PYTHON LIBRARIES: 4
2.2 READING DATASET 4
2.2.1 DATA HEAD 5
2.2.2 DATA TAIL 6
2.2.3 DATA INFO 7
CHAPTER 3: DATA CLEANING 8
3.1 WHAT IS DATA CLEANING? 8
3.2 CHECK FOR DUPLICATION 9
3.3 MISSING VALUES CALCULATION 10
3.4 DATA REDUCTION 11
3.5 FEATURE ENGINEERING 12
3.6 CREATING FEATURES 12
CHAPTER 4: EXPLORATORY DATA ANALYSIS (EDA) 14
4.1 STATISTICS SUMMARY 14
4.2 EDA UNIVARIATE ANALYSIS 16
4.3 DATA TRANSFORMATION 20
4.4 EDA BIVARIATE ANALYSIS 21
4.5 EDA MULTIVARIATE ANALYSIS 23
CHAPTER 5: RECOMMENDATIONS 26
CHAPTER 6: CONCLUSIONS 27
6.1 SUMMARIZE THE PROJECT 27
6.2 LIMITATIONS 27
REFERENCES 29
LIST OF FIGURES
Figure 1: Code to access the data library 4
Figure 2: Data Table Description 5
Figure 3: Description of the last 5 lines of the data table 6
Figure 4: Description of data columns 7
Figure 5: Unique data description for each data column 9
Figure 6: Depicting no missing data 10
Figure 7: Description of columns 0 and 1 having duplicate data 11
Figure 8: Description of the redundant data column 11
Figure 9: Feature column description changed to ID 12
Figure 10: Description of the DateTime data table 12
Figure 11: Describe data without non-numeric columns 15
Figure 12: Describe data with all types of columns 15
Figure 13: Univariate chart “Platform” 16
Figure 14: Univariate chart “Like” 17
Figure 15: Univariate chart “Country” 18
Figure 16: Chart of 2 variables “Top country likes” 19
Figure 17: Separated variables 20
Figure 18: Remove unwanted characters 20
Figure 19: Bivariate graph between Hour and Like 21
Figure 20: Chart of 2 variables "like" and "platform" 22
Figure 21: Multivariate chart "platform": Instagram, Facebook, Twitter 23
Figure 22: Multivariate chart “Percentages of Platforms” 24
Businesses may gain a lot from analyzing social media data to maximize sales, including improved customer understanding, higher engagement, performance measurement, and lower marketing expenses.
• Identify trends and patterns: Analyze user-generated material to determine popular themes, hashtags, and interaction patterns across many social media sites.
• To understand user behavior: Examine how people interact with material, including their preferences, feelings, and engagement metrics such as likes, shares, and comments.
• To measure impact and influence: Determine the effect of social media initiatives, influencers, or viral material on brand perception, customer behavior, and audience attitude
• Explore Audience Segmentation: To better understand your social media audience's requirements and preferences, segment them by demographics, interests, and habits
• Identify Opportunities and Challenges: Identify new opportunities or risks in the social media ecosystem, such as algorithm updates, regulatory difficulties, or changes in customer behavior
• To Inform Decision Making: Provide stakeholders with information and suggestions so they can make educated decisions about content strategy, marketing campaigns, customer service, and product development.
• To enhance customer engagement: Create methods for increasing consumer engagement and loyalty via tailored content, timely communication, and community management on social media platforms
• To support research and innovation: Contribute to scholarly research and innovation in social media analysis methodology, tools, and best practices, helping to develop the discipline and handle emergent difficulties
1.4.1 INTRODUCTION:
• Provide an overview of the study's objectives and rationale
• Introduce the importance of social media analysis in understanding user behavior, brand perception, and market trends
• Outline the structure of the report
1.4.2 DATA COLLECTION
• Provide an overview of the social media platforms and datasets used in the study
• Explain the criteria for selecting data sources and the timeframe of data collection
1.4.3 ANALYSIS AND FINDINGS
• Present the results of the social media analysis based on the research objectives
• Analyze trends, patterns, and insights derived from the data
• Use visualizations such as charts, graphs, and heatmaps to illustrate key findings
• Interpret the findings in the context of existing literature and theoretical frameworks
1.4.4 RECOMMENDATIONS
• Provide actionable recommendations based on the study findings
• Suggest strategies for optimizing social media engagement, improving brand reputation, or addressing identified challenges
• Prioritize recommendations based on their potential impact and feasibility of implementation
1.4.5 CONCLUSION
• Summarize the key findings of the study and their implications
• Reflect on the contribution of the study to the field of social media analysis
• Highlight avenues for future research and areas for further exploration
1.4.6 REFERENCES
• Provide a list of references cited throughout the report, following the appropriate citation style
CHAPTER 2: DATA DESCRIPTION
We place a high priority on enhancing our raw data through the critical phases of exploratory data analysis (EDA) in our data-driven procedures. In this effort, feature engineering and data pre-processing are both essential. Data integration, analysis, cleaning, transformation, and dimension reduction are just a few of the many tasks involved in EDA. Preparing and cleaning raw data is known as data pre-processing, and it is done to make feature engineering easier. Meanwhile, feature engineering involves modifying the data using a variety of methods. This could entail, among other things, handling missing data, encoding variables, addressing categorical variables, and adding or deleting pertinent features.
Without a doubt, feature engineering is an important undertaking that has a big impact on a model's outcome. While pre-processing mainly concentrates on cleaning and arranging the data, feature engineering also involves creating new features based on the current data.
2.1 IMPORT PYTHON LIBRARIES:
The first step in any machine learning workflow with Python is to understand and explore our data using libraries. We import all the libraries needed for the analysis, covering data loading, statistical analysis, visualization, data transformation, merge and join, etc.
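The exact code shown in Figure 1 is not reproduced here; a minimal sketch of the imports such an analysis typically starts with might look like the following (the visualization libraries used in the later EDA chapters are shown commented):

```python
import pandas as pd   # data loading, transformation, merge and join
import numpy as np    # statistical and numerical operations

# Visualization libraries commonly added for the plotting chapters:
# import matplotlib.pyplot as plt
# import seaborn as sns

print(pd.__version__)  # confirm the environment is set up
```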
Figure 1: Code to access the data library
2.2 READING DATASET
The pandas library provides many options for loading data into a pandas DataFrame from formats such as JSON, CSV, XLSX, SQL, pickle, HTML, TXT, etc.
Most data is available in the table format of CSV files, which is popular and easily accessible. By using the read_csv() function, the data can be converted to a pandas DataFrame.
In this report, social media user data is used as an example. With this dataset, we analyze the behavior of social network users and how EDA helps determine what factors influence the business of an enterprise. We have stored the data in the DataFrame data.
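As a small illustration (the real file name and the column names of the social media dataset are assumptions here), loading a CSV into a DataFrame works like this:

```python
import io
import pandas as pd

# Stand-in CSV content; in the report the data would come from a file,
# e.g. data = pd.read_csv("social_media.csv")
csv_text = """Platform,Country,Likes
Instagram,USA,120
Facebook,UK,45
Twitter,India,80
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)  # (3, 3)
```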
2.2.1 DATA HEAD
Function: The data.head() method displays the first few rows of a DataFrame (two-dimensional data table) or Series (one-dimensional data list) in pandas. The default number of rows displayed is 5, but you can specify how many rows to display by passing a parameter to this method.
• Preview data: data.head() helps preview a part of the data without displaying the entire DataFrame or Series, saving time and increasing productivity
• Check the data structure: By displaying the first few rows, you can check the structure of the data, including the columns and their data types
• Check input data: You can use data.head() to check data after reading it from some source, such as from a CSV file or database
• Inspect data after processing: If you have performed data transformations, you can use data.head() to examine the results of those transformations
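The behavior described above can be sketched on a toy frame (column names assumed, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the social media data
data = pd.DataFrame({"Platform": ["Instagram", "Facebook", "Twitter",
                                  "Instagram", "Twitter", "Facebook"],
                     "Likes": [120, 45, 80, 230, 15, 60]})

print(data.head())   # first 5 rows by default
print(data.head(2))  # pass a count to show a different number of rows
```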
Figure 2: Data Table Description
2.2.2 DATA TAIL
Function: The data.tail() method displays the last few rows of a DataFrame or Series in pandas. The default number of rows displayed is 5, but you can also specify how many rows to display by passing a parameter to this method.
• Preview data: data.tail() helps preview the last part of the data without displaying the entire DataFrame or Series, saving time and increasing productivity
• Check the data structure: By displaying the last few rows, you can check the structure of the data, including the columns and their data types
• Check input data: You can use data.tail() to check data after reading it from some source, such
as from a CSV file or database
• Inspect data after processing: If you have performed data transformations, you can use data.tail() to examine the results of those transformations
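Mirroring the head() example, a short sketch on the same kind of toy frame:

```python
import pandas as pd

data = pd.DataFrame({"Platform": ["Instagram", "Facebook", "Twitter",
                                  "Instagram", "Twitter", "Facebook"],
                     "Likes": [120, 45, 80, 230, 15, 60]})

print(data.tail())   # last 5 rows by default
print(data.tail(2))  # last 2 rows
```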
Figure 3: Description of the last 5 lines of the data table
2.2.3 DATA INFO
Function: The data.info() method prints a concise summary of a DataFrame, including:
• Total number of rows (entries) and columns
• The name of each column and the data type of each column
• Total number of non-null values in each column
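A quick sketch of the call (toy data, columns assumed):

```python
import pandas as pd

data = pd.DataFrame({"Platform": ["Instagram", "Facebook", "Twitter"],
                     "Likes": [120, 45, 80]})

# Prints the index range, each column's name, non-null count, and dtype
data.info()
```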
Figure 4: Description of data columns
CHAPTER 3: DATA CLEANING
3.1 WHAT IS DATA CLEANING?
Data cleaning in the analytics pipeline is the process of cleaning and normalizing data before performing analysis. The goal of data cleaning is to remove or correct inaccurate, unreliable, or inconsistent values so that the data fed into the analysis process is accurate and trustworthy.
Below are some of the methods and operations commonly performed when cleaning data in an analytics pipeline:
• Eliminate duplicates: Check for and remove data records that have identical values across rows or columns.
• Handle missing data: Fill in missing values, often using the mean, median, or mode, or using model predictions to estimate the missing values based on other data.
• Standardize data: Ensure that values in each column are expressed in the same units or on the same scale, for example by converting units of measure (e.g., Fahrenheit to Celsius) or converting data to the same form (e.g., strings to lower or upper case).
• Remove noise: Identify and remove noisy or imprecise values in the data, such as outliers that may be the result of recording errors.
• Check and correct errors: Inspect the data to detect invalid or unprocessable values and correct any errors found.
• Reformat data: Ensure that column data is properly formatted as numbers, strings, dates, or other data types appropriate to their content
• Check integrity constraints: Verify relationships between columns of data or between data records, to ensure that data was recorded or collected correctly and consistently.
Data cleansing is an important part of the analytics data pipeline, ensuring that analytics results are accurate and meaningful.
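Two of the operations above, duplicate removal and mean imputation, can be sketched on a toy frame (illustrative data only):

```python
import numpy as np
import pandas as pd

# Toy frame with one duplicate row and one missing value
df = pd.DataFrame({"Platform": ["Instagram", "Instagram", "Twitter"],
                   "Likes": [120.0, 120.0, np.nan]})

df = df.drop_duplicates()                             # remove identical records
df["Likes"] = df["Likes"].fillna(df["Likes"].mean())  # fill missing with the mean
print(df)
```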
The data.nunique() method counts the number of unique values in each column of a DataFrame. It returns a Series containing the number of unique values per column, indexed by column name. Figure 5 shows the result for the social media data.
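A minimal sketch of the call (column names assumed for illustration):

```python
import pandas as pd

data = pd.DataFrame({"Platform": ["Instagram", "Facebook", "Instagram"],
                     "Country": ["USA", "USA", "UK"]})

# One count of distinct values per column, indexed by column name
print(data.nunique())
```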
Figure 5: Unique data description for each data column
3.3 MISSING VALUES CALCULATION
Missing values calculation is the process of computing and evaluating the proportion of missing values in a data set. When working with real-world data, it is common to encounter missing or incomplete data. There are many reasons for missing data, including errors in data collection, data conversion, or data not being available for specific observations.
data.isnull().sum()
Figure 6: Depicting no missing data
After checking the above data, there is no blank data
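A runnable sketch of the check (here with an artificial missing value so the counts are visible; the real dataset, as noted above, has none):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Likes": [120.0, np.nan, 80.0],
                     "Country": ["USA", "UK", "India"]})

# isnull() flags each missing cell; sum() totals the flags per column
print(data.isnull().sum())
```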
3.4 DATA REDUCTION
Some columns or variables can be dropped if they do not add value to our analysis
In the data table above, the columns "Unnamed: 0.1" and "Unnamed: 0" contain the same data. We will delete one of the two columns without affecting the social media data analysis we are doing.
Figure 7: Description of columns 0 and 1 having duplicate data
The code line data.drop('Unnamed: 0.1', axis=1, inplace=True) removes the column named 'Unnamed: 0.1' from the DataFrame data, leaving the remaining columns intact.
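The drop can be reproduced on a toy frame with the same duplicated index columns:

```python
import pandas as pd

# Toy frame reproducing the duplicated index columns described above
data = pd.DataFrame({"Unnamed: 0": [0, 1], "Unnamed: 0.1": [0, 1],
                     "Likes": [120, 45]})

# axis=1 targets a column; inplace=True modifies data in place
data.drop('Unnamed: 0.1', axis=1, inplace=True)
print(data.columns.tolist())  # ['Unnamed: 0', 'Likes']
```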
Figure 8: Description of the redundant data column
3.5 FEATURE ENGINEERING
Feature engineering refers to using domain knowledge to select and transform the most relevant variables from raw data when creating a predictive model using machine learning or statistical modeling. The main goal of feature engineering is to create meaningful features from raw data.
Figure 9: Feature column description changed to ID
3.6 CREATING FEATURES
"Creating features" is the process of creating new variables or attributes from the original data to improve model performance or to better understand the data In data analytics and machine learning, selecting and generating appropriate features can be an important and extremely creative part of the modeling process
Figure 10: Description of the datetime data table
We create a feature named "datetime" so that the data can be retrieved more specifically and analyzed more easily for sales optimization based on the social media user data.
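One way such a feature could be built (the Date and Hour column names here are assumptions, not necessarily those of the real dataset):

```python
import pandas as pd

# Hypothetical date and hour columns standing in for the real ones
data = pd.DataFrame({"Date": ["2024-03-01", "2024-03-02"],
                     "Hour": [9, 21]})

# Combine the parts into a single datetime feature
data["datetime"] = (pd.to_datetime(data["Date"])
                    + pd.to_timedelta(data["Hour"], unit="h"))
print(data["datetime"])
```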
CHAPTER 4: EXPLORATORY DATA ANALYSIS (EDA)
Exploratory Data Analysis (EDA) is a key approach in data research and analysis that allows you to explore, comprehend, and extract crucial information from the original data set. EDA not only allows us to handle data more thoroughly, but also to discover hidden patterns, trends, and information without making prior assumptions.
During the EDA process, we will utilize descriptive statistics and data visualization tools to investigate the data set's fundamental properties, such as variable distribution, variable dependencies, and potential outliers. This allows us to obtain a better grasp of the data and formulate specific questions for future research.
EDA is not only a vital first stage in the data analysis process, but also an effective instrument for producing preliminary ideas and hypotheses, which shape subsequent analysis techniques and tactics. It also aids in identifying difficulties and obstacles in the data collection, laying the groundwork for further data processing and preparation.
4.1 STATISTICS SUMMARY
A statistical summary provides a general understanding of the data's distribution, including whether it is normally distributed, skewed left or right, or contains any outliers. This may be accomplished in Python using describe(). The describe() function offers statistical summaries of the data; by default it covers numeric data types, such as float and int.
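A small sketch of both calls (toy data; the include="all" variant corresponds to the summary with all column types described in Figures 11 and 12):

```python
import pandas as pd

data = pd.DataFrame({"Platform": ["Instagram", "Facebook", "Twitter"],
                     "Likes": [120, 45, 80]})

print(data.describe())               # numeric columns only (default)
print(data.describe(include="all"))  # every column type
```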