HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

PROJECT REPORT
Hotel customer reviews analysis

Instructor: Tran Viet Trung
Group: 16
Members:
- Ngo Quang Viet (20194881)
- Tran Tung Lam (20194788)
- Yannic Elias Hanel (2023T039)
- Le Dam Quan (2023T051)
- Ayoub Ala Mostafa (2023T011)

Hà Nội, 2023

Table of contents

Introduction
1. Data preparation
   1.1 Business analysis
   1.2 Data collection
   1.3 Data understanding
2. Data cleaning and preprocessing
   2.1 Handling missing values and duplications
   2.2 Datatypes
   2.3 One-hot encoding
   2.4 Cleaning text data
3. Exploratory Data Analysis
   3.1 Univariate analysis
   3.2 Bivariate analysis
   3.3 Multivariate analysis
   3.4 Text analysis
      3.4.1 Text length
      3.4.2 Common words
      3.4.3 Sentiment analysis
4. Sentiment Classification with Machine Learning model
   4.1 Overview
   4.2 Prepare data
   4.3 Feature extraction
      4.3.1 BoW features
      4.3.2 TF-IDF features
   4.4 Model training
   4.5 Model evaluation
Conclusion

Introduction

In an era where digital travel planning is becoming increasingly important, the analysis of hotel reviews is becoming more and more relevant. This project addresses the challenge of gaining valuable insights from an abundance of hotel reviews. By automatically collecting around 20,000 individual reviews via the Google Travel website, we provide a comprehensive insight into the needs and expectations of travelers.

The immense amount of information available, especially in the form of unstructured text data from hotel reviews, holds a potential that has so far remained largely untapped. This data contains valuable information about the customer experience, the quality of services and the strengths and weaknesses of hotels. In order to fully exploit this potential, we have supplemented the automatic data collection with advanced text understanding models.

The main goal of this project is not only to collect data, but to understand it in depth and breadth. By applying advanced text comprehension models and
comprehensive data analysis, we aim to paint a clear picture of customer reviews. This precise interpretation will allow companies not only to capture the tone of their customers, but also to derive concrete actions to optimize their services and maintain a positive customer experience.

For companies in the tourism sector, the benefit lies in the possibility of gaining in-depth insights into customer opinions. By precisely analyzing the collected data, companies can emphasize their strengths, address weaknesses and make targeted improvements. This serves not only as a response to past reviews, but also as a proactive approach to future customer expectations. At a time when customer loyalty is heavily influenced by online reviews, this project offers businesses the opportunity to strengthen their online reputation and gain a clear competitive advantage. By understanding the data collected, companies can deploy their resources more effectively and continuously adapt their services to the needs of their customers.

The following sections first describe our data preparation and collection process, including a detailed analysis of the user rating scheme. We then give an overview of our data cleansing procedures, including the more detailed cleansing of review texts. Finally, we provide an in-depth analysis of the cleansed data, supported by informative visuals. As an added feature, we conclude the report with a sentiment analysis for the ratings, which provides additional insight into user feedback.

1. Data preparation

1.1 Business analysis

This data science project centers around mining hotel reviews to discern the factors that resonate with customers. The goal extends beyond identifying surface-level preferences; it encompasses delving into the finer details that shape a guest's perception. Rather than just cataloging customer likes, the focus is on unraveling the underlying reasons behind their preferences. It goes beyond
recognizing, for instance, a fondness for comfortable beds, seeking to comprehend how these seemingly small details contribute to an overall positive guest experience.

The significance of this initiative transcends individual hotels; it aspires to provide valuable insights to the broader hospitality industry. The findings serve as a strategic tool, offering foresight into emerging trends and ensuring that hotels are not just keeping pace but staying ahead in meeting evolving guest expectations. It is akin to a strategic guide, empowering hotels to proactively enhance guest satisfaction and continually elevate their standards.

1.2 Data collection

For this analysis, we collect hotel information from 200 hotels across five locations (Birmingham, Edinburgh, Liverpool, London and Manchester), as well as their users' reviews: 100 for each hotel. All hotel information and reviews are collected from Google Travel.

Hotel information schema is as follows:

| Field name | Type | Description | May be empty? |
| --- | --- | --- | --- |
| source | str | Website which the hotel information was collected from (in this case, it's only "Google") | No |
| name | str | Hotel name | No |
| address | str | Hotel address | No |
| images_count | int | Number of photos submitted by hotel's owner | No |
| popular_amenities | list[str] | List of popular amenities | No |

Users' review schema is as follows:

| Field name | Type | Description | May be empty? |
| --- | --- | --- | --- |
| hotel_name | str | Name of the hotel | No |
| review_text | str | Review in text | No |
| rating | float | Rating score (scale of 5) | No |
| review_timestamp | datetime | Timestamp when the review was made | No |
| trip_type | str | Type of trip (possible values: "Business", "Vacation") | Yes |
| trip_companions | str | With whom the reviewer traveled (possible values: "Family", "Friends", "Couple", "Solo") | Yes |

The crawling workflow is as follows:

- First, crawl a list of URLs to each hotel's details page, by executing the `get_hotel_list.py` script:

    python3 get_hotel_list.py [-h] [--output OUTPUT] --limit LIMIT [--headless | --no-headless] locations [locations ...]

  List of parameters:

| Parameter | Required? | Description |
| --- | --- | --- |
| -h | No | Shows the help message and exits |
| --limit LIMIT | Yes | Maximum number of hotels per location |
| --output OUTPUT | No | Path to write the URL list (defaults to stdout, i.e. the console, if not specified) |
| --headless, --no-headless | No | Whether to run the script with a headless browser or not; useful for debugging. Defaults to headless mode |

- Then, from the list of URLs acquired in the above step, we proceed to get hotel details as well as reviews in separate flows: one for collecting hotel details, one for collecting the hotels' reviews.

To retrieve a list of hotel details (in CSV format), use the `get_hotel_details.py` script:

    python3 get_hotel_details.py [-h] [--output OUTPUT] [--headless | --no-headless] input

List of parameters:

| Parameter | Required? | Description |
| --- | --- | --- |
| -h | No | Displays the help message and exits |
| input | Yes | Path to the file containing the list of URLs collected in the previous step |
| --output OUTPUT | No | Path to the output file; defaults to a CSV file whose name contains the script's start timestamp |
| --headless, --no-headless | No | Whether to run the script with a headless browser or not; useful for debugging. Defaults to headless mode |

If your computer/runner is powerful enough, it is advised that you run a few batches at a time; the processes should not interfere with one another.

To retrieve a list of hotel reviews (in CSV format), use the `get_hotel_reviews.py` script:

    python3 get_hotel_reviews.py [-h] [--output OUTPUT] --limit LIMIT [--headless | --no-headless] input

List of parameters:

| Parameter | Required? | Description |
| --- | --- | --- |
| -h | No | Displays the help message and exits |
| input | Yes | Path to the file containing the list of URLs collected in the previous step |
| --output OUTPUT | No | Path to the output file; defaults to a CSV file whose name contains the script's start timestamp |
| --limit LIMIT | Yes | Maximum number of reviews per hotel |
| --headless, --no-headless | No | Whether to run the script with a headless browser or not; useful for debugging. Defaults to headless mode |

1.3 Data understanding

In the data understanding phase of this project, our focus is on gaining insights into the collected data, particularly the hotel reviews. This involves exploring, examining, and comprehending the structure and content of the dataset. The primary objectives are to identify patterns, trends, and potential challenges within the data, paving the way for more informed analysis and interpretation.

- Overview of Ratings Distribution: We begin by examining the distribution of ratings across all hotel reviews. Understanding the distribution helps us identify whether there is a skew towards positive or negative sentiments.
- Text Length Analysis: Analyzing the length of review texts can provide insights into customers' engagement levels. We explore the distribution of text lengths to understand whether there is a correlation between review length and the assigned rating.
- Hotel-wise Analysis: We conduct a detailed analysis of ratings, review lengths, and sentiments for each hotel individually. This allows us to identify specific patterns and variations unique to each location.
- Sentiment Analysis: Utilizing sentiment analysis models, we aim to categorize each review as positive, negative, or neutral. This step is crucial in understanding the overall sentiment of customers towards the hotels.
- Topic Modeling: Applying topic modeling techniques, we extract key themes and topics present in the reviews. This helps in understanding the major factors influencing customer opinions.
- Handling Unstructured Text: Given that the data primarily consists of
unstructured text, we address challenges related to natural language processing (NLP), including tokenization, stemming, and lemmatization.

2. Data cleaning and preprocessing

2.1 Handling missing values and duplications

The table below shows the count and percentage of null values in the columns of the dataset. The trip_type column has a significant number of missing values, roughly 49.83% of the data. The trip_companions column also has a high percentage of null values, around 45.51%.

Given that almost half of the data for trip_type is missing, the strategy for handling these null values is crucial. Removing such a large portion of the dataset is not advisable, as it would result in a significant loss of data. Imputation would also be challenging unless there are strong predictors for trip type within the data. One-hot encoding the trip_type column with an additional category for nulls is the most suitable approach here: it allows us to retain all the data and treat the missing values as a separate unknown category.

For trip_companions, the approach is similar due to the high percentage of missing values. Including an unknown category can benefit predictive modeling and analysis, as it allows the model to account for the fact that the information is missing, which might itself be a pattern of interest.

The review_text column has a relatively small percentage of missing values (0.166%), so it can be handled differently from the trip_type and trip_companions columns. Since we plan to perform text analysis, especially sentiment analysis, the quality and completeness of the text data is paramount. We therefore decided to drop all records that have null values in this column.

Moreover, there are no duplicate records within the dataset.

2.2 Datatypes

We use the `data.dtypes` attribute to list the data types of each column
in the dataset, as shown in the figure below. The rating column is of type float64, indicating numerical values with decimal points, which is common for rating data. The images_count column is an integer (int64), which is appropriate for count data. The other columns are of type object, which typically means they hold strings or mixed types.

It is worth noting that the `review_timestamp` column is also listed as an `object`, indicating it has not been interpreted as a date or time format by Pandas. Converting review_timestamp to a datetime type is a necessary step for any time-series analysis, as it enables functions such as resampling, time-based indexing, and extracting components of the date like the month or day of the week. We therefore convert the column from object to datetime, which makes temporal analyses and visualizations easier and gives access to Pandas' time-series functionality.

2.3 One-hot encoding

One-hot encoding is a common technique used in data preprocessing to convert categorical data into a numerical format that can be provided to machine learning algorithms. We perform one-hot encoding on three categorical columns: trip_type, trip_companions, and popular_amenities.

The trip_type column is one-hot encoded into three columns: type_Business, type_Vacation, and type_unknown. This indicates that there were originally two known types of trips (Business and Vacation), and that an additional encoding has been created for the unknown type, which represents the null values in the original data as discussed earlier.

The trip_companions column has been transformed into five columns: companions_Couple, companions_Family, companions_Friends, companions_Solo, and
companions_unknown. This suggests that there were four known categories for trip [...] with these groups. Additionally, the lower frequency of trips with friends and solo travelers offers an area for further investigation to see if there are opportunities to enhance the appeal for these segments.

3.2 Bivariate analysis

Review count over time

The line graph shows the trend of reviews from 2012 to 2024. The number of reviews stays consistently low from 2012 until an initial increase starts in 2019. The most striking feature of the graph, however, is the dramatic spike in 2023, where the count soars to a peak far exceeding any previous year. This could potentially be attributed to a special event, a promotional campaign, or a sudden surge in popularity of the service or product being reviewed.

For 2024, the data shows a very low number of reviews, but since we are just at the beginning of the year, this is not unexpected. It is too early to draw any conclusions for 2024, as the data is incomplete and will accumulate over the course of the year.

The violin plot indicates that across the different trip companions (family, couples, friends, and solo travelers), the ratings tend to lean towards the higher end of the scale, with most distributions extending towards a rating of 10. This suggests that regardless of the travel companion, the overall experience is perceived positively, indicating that the hotels being evaluated are likely providing a satisfactory experience to their guests. While there is some variation in satisfaction, particularly with family and solo travelers, the general trend towards higher ratings is a good sign, as it implies that guests tend to leave happy with their experience. The consistency in higher ratings for couple trips could suggest that the hotels' offerings are particularly well-suited for couples, while the wider spread in
ratings for family and solo trips may point to a diversity in expectations and experiences within these groups.

3.3 Multivariate analysis

The first heatmap presents a matrix of correlation values between various categorical variables related to travel data. In this matrix, each cell's color intensity and corresponding numerical value represent the strength and direction of the correlation between the variables: a value of 1 indicates a perfect positive correlation, while 0 indicates no correlation.

Strong positive correlations are visible between certain amenities and trip types or companions, such as type_Business with amenity_Wi-Fi, and companions_Solo with amenity_Parking. These relationships suggest that business travelers are likely to require Wi-Fi, and solo travelers may have a preference or need for parking facilities. However, many cells display a value of 0, which suggests that for many combinations of these categorical variables there is no apparent relationship. The matrix is useful for identifying potential patterns and relationships in the data that could inform targeted marketing strategies or service improvements.

3.4 Text analysis

3.4.1 Text length

To determine how much time users invest in writing individual reviews, we examined the length of the reviews in detail. Our results show that the majority of users use around 50 words/tokens and around 250 characters to describe their experience with the hotel. While there are also reviews with a length of 400 words/tokens and 2000 characters, such extensive reviews are relatively rare. The figure also shows how long the cleaned text is compared to the original.

3.4.2 Common words

We identify the words most frequently used by users, which inherently carry high information value regarding the customer's experience and highlight crucial aspects of service. These key terms reveal the prevalent themes and sentiments expressed by users and provide valuable insight into aspects that are most important to
customers during their experience. The graph shows that aspects such as rooms, staff and breakfast stand out as decisive evaluation criteria.

3.4.3 Sentiment analysis

We use libraries to score and classify the sentiment of review texts. Specifically, two approaches are implemented: lexicon-based and deep learning-based.

The scatter plot shows the relationship between the lexicon-based sentiment scores and the review ratings, plotted on the X and Y axes respectively. The sentiment values range from -1 to 1, with numerous data points at each level of sentiment, which suggests a good variety in the sentiment of the reviews. The ratings are discrete.

At a glance, there does not appear to be a strong visible correlation between the lexicon-based sentiment scores and the ratings. We can see a wide spread of sentiment values for each rating level: a single rating level can have sentiment scores ranging from very negative to very positive. It is interesting to note that there are many reviews with a sentiment score around 0 across all rating levels. This might suggest that, for these reviews, the lexicon-based method finds a balance between positive and negative terms.

There are several cases where reviews with negative sentiment scores have high ratings and vice versa. These could be outliers or instances where the lexicon-based method does not align well with the actual sentiment expressed in the review text. The variance in sentiment scores for similar ratings may arise because reviews contain sarcasm, mixed sentiments, or complex expressions that a simple lexicon-based method cannot accurately interpret. To fully evaluate the effectiveness of the lexicon-based sentiment analysis, it is useful to compare these results with the deep learning-based sentiment analysis.

The images below show the results of a deep learning-based sentiment analysis compared to user ratings, and the correlation between these sentiments and the ratings. Let's discuss each figure in
detail. The first set of bar charts shows the count of sentiments (negative, neutral, positive) for each rating level, for both raw and cleaned review texts.
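
The per-rating sentiment counts behind such bar charts can be produced with a cross-tabulation. The following is a minimal sketch, assuming Pandas is available and that each review already carries a predicted sentiment label. The sample rows, the `sentiment` column name and the helper `sentiment_counts_by_rating` are hypothetical and are not taken from the project's code or data.

```python
import pandas as pd

# Hypothetical sample: a few reviews with a star rating and a predicted
# sentiment label. The values are made up for illustration only.
reviews = pd.DataFrame(
    {
        "rating": [5, 5, 4, 1, 1, 3, 5, 2],
        "sentiment": [
            "positive", "positive", "positive", "negative",
            "negative", "neutral", "neutral", "negative",
        ],
    }
)

SENTIMENTS = ["negative", "neutral", "positive"]


def sentiment_counts_by_rating(df: pd.DataFrame) -> pd.DataFrame:
    """Count reviews per (rating, sentiment) pair.

    Rows are rating levels, columns are sentiment labels; combinations
    that never occur are reported as 0.
    """
    counts = pd.crosstab(df["rating"], df["sentiment"])
    # Make sure all three sentiment columns exist, in a fixed order.
    return counts.reindex(columns=SENTIMENTS, fill_value=0)


counts = sentiment_counts_by_rating(reviews)
print(counts)
```

Applying the same function once to labels predicted from the raw texts and once to labels predicted from the cleaned texts would give the two sets of counts to compare. If matplotlib is installed, `counts.plot(kind="bar")` draws them as grouped bars.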