Data Analytics MSc Dissertation MTH775P, 2019/20

Disquisitiones Arithmeticæ
Predicting the prices for breakfasts and beds

Hai Nam Nguyen, ID 161136118
Supervisor: Dr Martin Benning

A thesis presented for the degree of Master of Science in Data Analytics
School of Mathematical Sciences, Queen Mary University of London

Declaration of original work

This declaration is made on August 17, 2020.

Student's Declaration: I, Student Name, hereby declare that the work in this thesis is my original work. I have not copied from any other students' work, from work of mine submitted elsewhere, or from any other sources except where due reference or acknowledgement is made explicitly in the text, nor has any part been written for me by another person. Referenced text has been flagged by using italic fonts, using quotation marks “ ”, and explicitly mentioning the source in the text.

This work is dedicated to my niece Nguyen Le Tue An (Mochi), who has brought a great source of joy to me and my family recently.

Abstract

Pricing and guessing the right prices are vital for both hosts and renters on home-sharing platforms run by internet-based companies. To contribute to the growing interest and immense literature on applying Artificial Intelligence to predicting rental prices, this paper attempts to build machine learning models for that purpose using the Luxstay listings in Hanoi. The R^2 score is used as the main criterion for model performance, and the results show that Extreme Gradient Boosting (XGB) is the model with the best performance, with R^2 = 0.62, beating the most sophisticated machine learning model considered: Neural Networks.

Contents

Declaration of original work
Abstract
1 Introduction
2 Literature Review
3 Experimental Design
  3.1 Dataset
  3.2 K-Fold Cross Validation
  3.3 Measuring Model Accuracy
4 Methods
  4.1 LASSO
    4.1.1 FISTA
  4.2 Random Forest
  4.3 Gradient Boosting
  4.4 Extreme Gradient Boosting
  4.5 LightGBM
    4.5.1 Gradient-based One-sided Sampling
    4.5.2 Exclusive Feature Bundling
  4.6 Neural Networks
    4.6.1 Adam Algorithm
    4.6.2 Backpropagation
5 Experiments and Results
6 Conclusion and Outlook
A Some special mathematical notations
  A.1 Vector Norm
  A.2 The Hadamard product
B The Chain Rule
References

Chapter 1: Introduction

Since its establishment in 2016, Luxstay has become one of the most popular home-sharing platforms in Vietnam, alongside Airbnb, with a network of more than 15,000 listings. The platform connects guests who want to rent villas, houses and apartments with hosts, and vice versa. Hence, providing a reasonable price helps hosts to gain a high and stable income while guests get great experiences in new places. Therefore, working on a sensible predictor and suggestion tool for Luxstay prices can generate real-life value and practical applications.

Hanoi is the capital of Vietnam and has the second-most listings on Luxstay. The city has also been ranked among the top 10 destinations to visit by TripAdvisor. As a dynamic city with active bookings and listings, Hanoi is a good example for the study of Luxstay pricing.

In this paper, we build a price prediction model and compare the performance of different methods using R^2 as the main measure. The input to our models is data scraped from the Hanoi page of the website, which includes continuous and categorical records about the listings. A number of methods, including traditional machine learning models (LASSO, random forest, gradient boosting), Extreme Gradient Boosting, LightGBM and neural networks, are then applied to predict the prices of listings.
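To make this comparison concrete, the following is a minimal sketch of how such a model comparison can be set up with 5-fold cross-validated R^2, using standard scikit-learn, XGBoost and LightGBM estimators. The synthetic data generated by make_regression stands in for the scraped listing features and prices, and the hyperparameter values are placeholders for illustration, not the settings tuned in this project.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Stand-in for the preprocessed Luxstay feature matrix X and price vector y.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

models = {
    "LASSO": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "XGBoost": XGBRegressor(n_estimators=200, random_state=0),
    "LightGBM": LGBMRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")  # 5-fold CV, R^2 per fold
    print(f"{name}: mean R^2 = {scores.mean():.3f}")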
Chapter 2: Literature Review

The sharing economy is a socio-economic system that arranges "the peer-to-peer-based activity of obtaining, giving, or sharing the access to goods and services" through "community-based online services" (Hamari et al. 2015). Home-sharing is one of these sharing activities, and it has experienced significant growth due to high demand from tourism (Guttentag 2015). Given that Luxstay is a startup from an emerging economy, the platform has not received as much attention from the academic community as Airbnb, the leading company for this service (Wang & Nicolau 2017). Nevertheless, the Vietnamese home-sharing platform has characteristics similar to Airbnb, as it is also an internet-based company that coordinates the demand of short-term renters and hosts. Therefore, it is worth reviewing some findings on Airbnb from recent papers.

Gibbs et al. (2017) stated that one of the biggest challenges on Airbnb was setting the right prices, and identified two key reasons for this issue. Firstly, unlike the hotel business, where prices are set by trained experts and industry benchmarks, rental prices on Airbnb are normally determined by regular hosts with limited support. Secondly, instead of letting an algorithm control prices, as Uber and Lyft do, Airbnb leaves the prices for hosts to decide, even though they might not be well informed. Consequently, these two factors may cause a potential financial loss, and empirical evidence shows that incompetent pricing causes a loss of 46% of additional revenue on Airbnb. Hence, there has been an interest in the study of rental price prediction on the leading platform. The two trends in this topic are hedonic-based regression and artificial intelligence techniques.

The term hedonic is defined to describe "the weighting of the relative importance of various components among others in constructing an index of usefulness and desirability" (Goodman 1998). In other words, hedonic pricing identifies the factors and characteristics affecting an item's price (Investopedia.com). Wang & Nicolau (2017) aimed to design a system to understand which features are important inputs for an automated price suggestion on Airbnb using a hedonic-based regression approach. The functional forms used were Ordinary Least Squares and Quantile Regression, applied to 25 variables of 180,533 listings in 33 cities. The results show that features related to host attributes, such as the number of their listings and their profile pictures, are the most important features. Among those, super-host status, which marks experienced hosts on the platform, is the best one. However, the authors also discussed the limitations of this analysis. The approach rests on some economic assumptions that need to be examined; in particular, the assumption of hosts' rationality requires a qualitative check, which is skipped in the study. Generally, the effectiveness of hedonic-based regression for price prediction is restricted by the model assumptions and estimation (Selim 2009).

Another approach to price prediction is to apply artificial intelligence techniques, which mainly include machine learning and neural network models. Tang & Sangani (2015) produced a model for price prediction for San Francisco listings. To reduce the complexity of the task, they turned the regression problem into a classification task that predicts both the neighbourhood and the price range of a listing, and a Support Vector Machine was the main model to be tuned. Uniquely, they included images as inputs for the model by creating a visual dictionary to categorise the images of a listing.
The results show that while the price prediction achieves a high accuracy of 81.2% on the test set, the neighbourhood prediction suffers from overfitting, with a big gap between the train and test sets. Alternatively, Cai & Han (2019) attempted to work on the regression problem using the listings in Melbourne. The study implemented l1 regularisation as feature selection for all traditional machine learning methods and then compared these with models without it. The results show that the latter perform better overall, and the gradient boosting algorithm produces the best precision, with R^2 = 0.6914 on the test set. Recently, another study of the listings in New York obtained an interesting result, with a highest R^2 of 0.7768 (Kalehbasti et al. 2019). To reach that score, they performed a logarithmic transformation of the prices and then trained their models. Additionally, they compared three feature selection methods: manual selection, p-values and LASSO. The analysis shows that p-values and LASSO outperformed manual selection, and the best method applied in the paper is LASSO.

In this paper, we applied the knowledge of the last three studies to build our price predictor for the listings on Luxstay. Apart from widely used traditional machine learning methods and neural networks, we also attempted to code an algorithm to compute LASSO regression ourselves and used the two recent gradient boosting techniques, Extreme Gradient Boosting and LightGBM. The project worked on the original rental prices to produce a price prediction without any logarithmic transformation.

4.6 Neural Networks

  f_w(x) = φ_L(φ_{L-1}(... φ_1(x, w_1, b_1) ..., w_{L-1}, b_{L-1}), w_L, b_L)    (4.7)

Equation 4.7 shows how a Neural Network of L layers is presented mathematically. Here {φ_l}_{l=1}^L is a set of L activation functions that contain the weights w = {w_l}_{l=1}^L and biases b = {b_l}_{l=1}^L. Typically, an activation function takes the form of an affine-linear transformation, which is defined as

  φ(x, W, b) = W^T x + b    (4.8)

where x ∈ R^n represents the inputs, W ∈ R^{n×m} is a weight matrix and b ∈ R^m is a bias vector. In this way, the activation function maps n inputs onto m outputs. In this study, we only used this type of activation function in the output layer. For the input and hidden layers, the Rectified Linear Unit (ReLU) is chosen. The function is the following:

  φ(x, W, b) = max(0, W^T x + b)    (4.9)

The function is a combination of the affine-linear transformation and the rectifier, φ(x) = max(0, x). An advantage of using ReLU is that it is computationally efficient due to its simplicity. This has contributed to making ReLU one of the most popular activation functions for Neural Networks. However, the function is not flawless, as it suffers from a problem called 'Dying ReLU'. When the affine-linear part produces a negative value, ReLU outputs zero and its gradient is also zero. Thus, backpropagation, which is discussed in the subsection below, cannot update that neuron, effectively turning it off. If we train a network containing many such neurons, we may end up with a large part of the network doing nothing. A solution for the Dying ReLU problem is the so-called Leaky ReLU, where the function is adjusted so that its gradient has a small slope for negative values. Nevertheless, we ignored this issue and only used ReLU for the hidden layers in this study. We would like to try Leaky ReLU in our future work for further experiments with Neural Networks.
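As an illustration of equations 4.8 and 4.9, the following is a minimal PyTorch sketch of a network with ReLU hidden layers and an affine-linear output layer. The input dimension n_features and the layer widths are assumed values chosen for illustration, not the architecture tuned in this project.

import torch
import torch.nn as nn

n_features = 20  # assumed number of listing features after preprocessing

# ReLU units for the hidden layers (equation 4.9) and a plain affine-linear
# output layer (equation 4.8) producing a single predicted price.
model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

x = torch.randn(8, n_features)   # a batch of 8 hypothetical listings
price_pred = model(x)            # output of shape (8, 1)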
In order to produce an accurate prediction, the output of the network needs to be matched closely to the actual values through a loss function. Combined with the choice of loss function, training a Neural Network can be generalised into the problem

  arg min_{w,b} (1/s) Σ_{i=1}^{s} l_i(f_w(x_i), y_i)    (4.10)

where i ∈ {1, ..., s} and {l_i}_{i=1}^{s} is a family of loss functions. For the regression problem of this study, we choose the least-squares loss, l_i(f_w(x_i), y_i) = (1/2)(f_w(x_i) − y_i)^2, so that problem 4.10 becomes

  arg min_{w,b} (1/(2s)) Σ_{i=1}^{s} (f_w(x_i) − y_i)^2

Since we have a differentiable neural network with differentiable activation functions, we can choose an algorithm to train the model. Our choice for this task is Adam.

4.6.1 Adam Algorithm

Adaptive Moment Estimation (Adam) is an extension of Stochastic Gradient Descent (SGD) (Kingma & Ba 2014). While SGD keeps a single learning rate for all weights and does not adjust the rate during training, Adam provides a learning rate for each weight and separately adjusts these rates through the training process. The authors also claim that Adam inherits the advantages of the Adaptive Gradient Algorithm (Adagrad) and Root Mean Square Propagation (RMSProp): the benefit of the former is improved performance with sparse gradients, and the benefit of the latter is improved performance on online and non-stationary problems. As a result, Adam is argued to be effective for the practical problems that arise when training Neural Networks (Kingma & Ba 2014). The algorithm is as follows, following the original paper (Kingma & Ba 2014):

Algorithm 8: Adam
Specify: f(θ): stochastic objective function with parameters θ
Specify: α, β_1, β_2 ∈ [0, 1), ε > 0
Initialise: θ_0, m_0 = 0, v_0 = 0
for t = 1, ..., T
    compute g_t = ∇_θ f_t(θ_{t−1})
    compute g_t^2 = g_t ⊙ g_t   (⊙: Hadamard product, see Appendix A)
    compute m_t = β_1 m_{t−1} + (1 − β_1) g_t
    compute v_t = β_2 v_{t−1} + (1 − β_2) g_t^2
    compute m̂_t = m_t / (1 − β_1^t)
    compute v̂_t = v_t / (1 − β_2^t)
    compute θ_t = θ_{t−1} − α m̂_t / (√(v̂_t) + ε)
end for
return θ_T

In order to implement Algorithm 8, we need to compute the gradients with respect to the weights in the different layers. The process of this computation is called backpropagation.

4.6.2 Backpropagation

Backward propagation of errors (backpropagation) is the practice of readjusting the weights and biases of a Neural Network based on the error obtained in the previous epoch of the training process. The term "backwards" here means that the algorithm goes through the network backwards: it computes the gradient of the final layer first and the gradient of the first layer last. The backpropagation algorithm is given as (Benning 2020, p. 32):

Algorithm: Backpropagation
Specify: activation function φ, samples {(x_i, y_i)}_{i=1}^s, weight and bias dimensions, and the number of layers L
Iterate:
for i = 1, ..., s
    for l = 1, ..., L
        Forward pass: compute z_i^l = W_l^T x_i^{l−1} + b_l
        Forward pass: compute x_i^l = φ(z_i^l)
    end for
end for
for i = 1, ..., s
    for l = L, ..., 1
        Backward pass: compute δ_i^l = φ'(z_i^l) ⊙ ∇l(x_i^L, y_i) if l = L, and δ_i^l = φ'(z_i^l) ⊙ (W_{l+1} δ_i^{l+1}) if l < L
    end for
end for
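As a rough illustration of Algorithm 8, the following NumPy sketch performs the Adam update step by step. The default hyperparameter values follow those suggested by Kingma & Ba (2014), and the toy objective exists only to show the update in action; in practice a library optimiser such as the one shipped with PyTorch would be used rather than this hand-written version.

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One parameter update following Algorithm 8 (Kingma & Ba 2014)."""
    m = beta1 * m + (1 - beta1) * grad          # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # biased second-moment estimate (Hadamard square)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):
    grad = theta                                 # gradient of the toy objective
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.1)  # larger step for the toy problem
print(theta)                                     # theta has moved close to the minimiser at the origin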
Appendix A: Some special mathematical notations

A.1 Vector Norm

For a vector x ∈ R^n and p ≥ 1, we have the p-norm

  ||x||_p = ( Σ_{i=1}^{n} |x_i|^p )^{1/p}

Examples:

• p = 1: the l1-norm, ||x||_1 = |x_1| + |x_2| + ... + |x_n|
• p = 2: the l2-norm, ||x||_2 = √(x_1^2 + x_2^2 + ... + x_n^2)

A.2 The Hadamard product

The Hadamard product, or elementwise product, is an operator that performs multiplication of elements at the same index between matrices or vectors of the same dimension. Suppose x and y are two matrices of the same dimension; then we have

  (x ⊙ y)_{ij} = x_{ij} y_{ij}

Appendix B: The Chain Rule

Suppose we have two differentiable functions f(x) and g(x). Then, to differentiate y = f(g(x)), let u = g(x) so that y = f(u). We have

  dy/dx = (dy/du) × (du/dx)
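As a worked instance of this rule in the notation of Chapter 4, consider a single affine-linear unit with activation φ and the least-squares loss; the scalar-output setting below is an assumption made for illustration.

\[
z = w^\top x + b, \qquad l(w,b) = \tfrac{1}{2}\bigl(\varphi(z) - y\bigr)^2 .
\]
Applying the chain rule with the intermediate variable $u = \varphi(z) - y$ gives
\[
\frac{\partial l}{\partial w}
  = \frac{\partial l}{\partial u}\cdot\frac{\partial u}{\partial z}\cdot\frac{\partial z}{\partial w}
  = \bigl(\varphi(z) - y\bigr)\,\varphi'(z)\,x ,
\qquad
\frac{\partial l}{\partial b} = \bigl(\varphi(z) - y\bigr)\,\varphi'(z),
\]
which mirrors the term φ'(z^L) ⊙ ∇l(x^L, y) propagated by the backward pass for the output layer.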
Bibliography

Beck, A. & Teboulle, M. (2009), 'A fast iterative shrinkage-thresholding algorithm for linear inverse problems', SIAM Journal on Imaging Sciences 2, 183–202.

Benning, M. (2019), MTH786P Machine Learning with Python, Coursework 3, Queen Mary University of London.

Benning, M. (2020), MTH793P Advanced Machine Learning, Lecture Notes (last updated: May 6, 2020), Queen Mary University of London.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Information Science and Statistics, Springer.

Breiman, L. (2001), 'Random forests', Machine Learning 45, 5–32.

Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984), Classification and Regression Trees.

Cai, T. & Han, K. (2019), Melbourne Airbnb price prediction.

Chen, T. & Guestrin, C. (2016), XGBoost: A scalable tree boosting system, pp. 785–794.

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), 'Least angle regression', The Annals of Statistics 32(2), 407–451. URL: http://www.jstor.org/stable/3448465

Friedman, J. (2002), 'Stochastic gradient boosting', Computational Statistics & Data Analysis 38, 367–378.

Gibbs, C., Guttentag, D., Gretzel, U., Yao, L. & Morton, J. (2017), 'Use of dynamic pricing strategies by Airbnb hosts', International Journal of Contemporary Hospitality Management 30.

Goodman, A. C. (1998), 'Andrew Court and the invention of hedonic price analysis', Journal of Urban Economics 44(2), 291–298. URL: http://www.sciencedirect.com/science/article/pii/S0094119097920714

Guttentag, D. (2015), 'Airbnb: disruptive innovation and the rise of an informal tourism accommodation sector', Current Issues in Tourism 18, 1192–1217.

Hamari, J., Sjöklint, M. & Ukkonen, A. (2015), 'The sharing economy: Why people participate in collaborative consumption', Journal of the Association for Information Science and Technology.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, 2nd edn, Springer.

Investopedia.com (2020), 'Using hedonic pricing to determine the factors impacting home prices'. URL: https://www.investopedia.com/terms/h/hedonicpricing.asp (accessed: 17.07.2020)

James, G., Witten, D., Hastie, T. & Tibshirani, R. (2017), An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, Springer.

Kalehbasti, P. R., Nikolenko, L. & Rezaei, H. (2019), 'Airbnb price prediction using machine learning and sentiment analysis'.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T.-Y. (2017), LightGBM: A highly efficient gradient boosting decision tree, in I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan & R. Garnett, eds, 'Advances in Neural Information Processing Systems 30', Curran Associates, Inc., pp. 3146–3154. URL: http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf

Kingma, D. & Ba, J. (2014), 'Adam: A method for stochastic optimization', International Conference on Learning Representations.

Lewis, L. (2019), 'Predicting Airbnb prices with machine learning and deep learning'. URL: https://towardsdatascience.com/predicting-airbnb-prices-with-machine-learning-and-deep-learning-f46d44afb8a6 (accessed: 30.05.2020)

Oliphant, T. E. (2006), A Guide to NumPy, Vol. 1, Trelgol Publishing, USA.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J. & Chintala, S. (2019), PyTorch: An imperative style, high-performance deep learning library, in H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox & R. Garnett, eds, 'Advances in Neural Information Processing Systems 32', Curran Associates, Inc., pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. & Duchesnay, E. (2011), 'Scikit-learn: Machine learning in Python', Journal of Machine Learning Research 12, 2825–2830.

Richardson, L. (2007), 'Beautiful Soup documentation', April.

Selim, H. (2009), 'Determinants of house prices in Turkey: Hedonic regression versus artificial neural network', Expert Systems with Applications 36, 2843–2852.

Stock, J. H. & Watson, M. W. (2020), Introduction to Econometrics, Global Edition, Pearson Education Limited.

Tang, E. & Sangani, K. (2015), 'Neighborhood and price prediction for San Francisco Airbnb listings', CS 229 Final Project Report.

Tibshirani, R. (1996), 'Regression shrinkage and selection via the lasso', Journal of the Royal Statistical Society, Series B (Methodological) 58(1), 267–288. URL: http://www.jstor.org/stable/2346178

Tibshirani, R., Hastie, T. & Friedman, J. (2010), 'Regularized paths for generalized linear models via coordinate descent', Journal of Statistical Software 33.

Wang, D. & Nicolau, J. (2017), 'Price determinants of sharing economy based accommodation rental: A study of listings from 33 cities on Airbnb.com', International Journal of Hospitality Management 62, 120–131.

Yates, D. S., Moore, D. S. & Starnes, D. S. (2003), The Practice of Statistics: TI-83/89 Graphing Calculator Enhanced, W. H. Freeman.
