

Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Visual Sentiment Analysis by Attending on Local Image Regions

Quanzeng You (Department of Computer Science, University of Rochester, Rochester, NY 14627, qyou@cs.rochester.edu); Hailin Jin (Adobe Research, 345 Park Avenue, San Jose, CA 95110, hljin@adobe.com); Jiebo Luo (Department of Computer Science, University of Rochester, Rochester, NY 14627, jluo@cs.rochester.edu)

Abstract

Visual sentiment analysis, which studies the emotional response of humans to visual stimuli such as images and videos, has been an interesting and challenging problem. It tries to understand the high-level content of visual data. The success of current models can be attributed to the development of robust algorithms from computer vision. Most of the existing models try to solve the problem by proposing either robust features or more complex models. In particular, visual features from the whole image or video are the main proposed inputs. Little attention has been paid to local areas, which we believe are quite relevant to humans' emotional response to the whole image. In this work, we study the impact of local image regions on visual sentiment analysis. Our proposed model utilizes the recently studied attention mechanism to jointly discover the relevant local regions and build a sentiment classifier on top of these local regions. The experimental results suggest that 1) our model is capable of automatically discovering the sentimental local regions of given images, and 2) it outperforms existing state-of-the-art algorithms for visual sentiment analysis.

Introduction

Visual sentiment analysis studies the emotional response of humans to visual stimuli such as images and videos. It is different from textual sentiment analysis (Pang and Lee 2008), which focuses on humans' emotional response to textual semantics. Recently, visual sentiment analysis has achieved performance comparable to textual sentiment analysis (Borth et al. 2013; Jou et al. 2015; You et al. 2015). This can be attributed to the success of deep learning on vision tasks (Krizhevsky, Sutskever, and Hinton 2012), which makes the understanding of high-level visual semantics, such as image aesthetics analysis (Lu et al. 2014) and visual sentiment analysis (Borth et al. 2013), tractable.

Studies on visual sentiment analysis have focused on designing visual features, from the pixel level (Siersdorfer et al. 2010a) to the middle attribute level (Borth et al. 2013) and, more recently, to deep visual features (You et al. 2015; Campos, Jou, and Giro-i Nieto 2016). The performance of visual sentiment analysis systems has thus been gradually improved by increasingly robust visual features. However, almost all of these approaches try to infer the high-level sentiment from the global perspective of the whole image. Little attention has been paid to identifying which local regions evoke the sentimental response and how these local regions contribute to visual sentiment analysis. In this work, we address these two challenging problems. We employ the recently proposed attention model (Mnih et al. 2014; Xu et al. 2015) to learn the correspondence between local image regions and sentimental visual attributes. In this way, we are able to identify the local image regions that are relevant to sentiment analysis, and subsequently build a sentiment classifier on top of these localized features.
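The paper's own attention formulation (its Eq. (5)) is not reproduced in this extraction, so the following is only a minimal PyTorch sketch of attribute-conditioned soft attention over local region features, in the spirit of Xu et al. (2015). The class and parameter names (AttributeAttention, region_dim, attr_dim), layer sizes, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAttention(nn.Module):
    """Soft attention over local region features, conditioned on a visual-attribute
    embedding. A sketch in the spirit of Xu et al. (2015), not the paper's Eq. (5)."""

    def __init__(self, region_dim: int, attr_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, hidden_dim)
        self.proj_attr = nn.Linear(attr_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, attr):
        # regions: (batch, num_regions, region_dim) -- e.g. cells of a conv feature map
        # attr:    (batch, attr_dim)                -- embedding of one predicted attribute
        h = torch.tanh(self.proj_region(regions) + self.proj_attr(attr).unsqueeze(1))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=1)  # attention weights over regions
        return (alpha.unsqueeze(-1) * regions).sum(dim=1)    # attended local feature

# Each predicted attribute attends to the regions independently; the attended
# features are concatenated and fed to the sentiment classifier, matching the
# description in the experiments below.
attend = AttributeAttention(region_dim=512, attr_dim=300)
regions = torch.randn(8, 49, 512)                 # a 7x7 conv map flattened per image
attrs = [torch.randn(8, 300) for _ in range(5)]   # stand-ins for attribute embeddings
local_feats = torch.cat([attend(regions, a) for a in attrs], dim=1)  # (8, 5*512)
sentiment_logits = nn.Linear(5 * 512, 2)(local_feats)                # binary sentiment
```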
Because the top-5 accuracy of the attribute detector is much better than its top-1 accuracy, we use the top-5 attributes instead of the top-1 attribute to train our sentiment classifier. For each of the predicted attributes, we use Eq. (5) to compute the attended local visual features. These local visual features are then concatenated and passed to the sentiment classifier as inputs. To compare the performance of this sentiment classifier, we also train another deep CNN model using global visual features. Specifically, we follow the multi-task settings proposed in (Jou and Chang 2016) to train the global visual sentiment classifier (see Table 3).

Table 3: Performance of the two sentiment classifiers using global and local visual features, respectively.

Users often curate their images to create high-quality albums and share them with other users. In this section, we simulate users' curated visual attributes by randomly selecting different levels of correct visual attributes. This experiment follows the same steps as in the previous section; however, we manually change the predicted visual attributes of the previously trained attribute detector. We study the performance of two strategies: 1) for the incorrectly predicted top-1 visual attributes, we randomly replace some of them with the ground-truth visual attribute, and we study the performance when different percentages of the top-1 visual attributes are correct (the samples with correct visual attributes include both the samples correctly predicted by the visual attribute detector and the randomly replaced samples); 2) instead of providing the correct top-1 visual attribute, we provide the correct attribute to randomly replace one of the top-5 predicted attributes; specifically, for samples where all top-5 attributes are incorrect, we randomly replace one of them with the ground-truth visual attribute. In this way, we are able to manually curate visual attributes for all the images in the three splits, as sketched below.
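As a concrete illustration of the two curation strategies just described, here is a minimal sketch. It assumes each image comes with its predicted top-1 (or top-5) attributes and a single ground-truth attribute; the function names, data layout, and the exact mapping from "percentage correct" to replacements are hypothetical readings of the text, not the authors' code.

```python
import random

def curate_top1(pred_top1, ground_truth, pct_correct):
    """Strategy 1: randomly replace incorrect top-1 predictions with the ground
    truth until the desired fraction of images has a correct top-1 attribute."""
    curated = list(pred_top1)
    wrong = [i for i, p in enumerate(curated) if p != ground_truth[i]]
    need = max(0, int(pct_correct * len(curated)) - (len(curated) - len(wrong)))
    for i in random.sample(wrong, min(need, len(wrong))):
        curated[i] = ground_truth[i]
    return curated

def curate_top5(pred_top5, ground_truth, pct_curated):
    """Strategy 2: for a fraction of the images whose top-5 list misses the ground
    truth, replace one randomly chosen prediction with the ground-truth attribute."""
    curated = [list(p) for p in pred_top5]
    missing = [i for i, p in enumerate(curated) if ground_truth[i] not in p]
    for i in random.sample(missing, min(len(missing), int(pct_curated * len(curated)))):
        curated[i][random.randrange(len(curated[i]))] = ground_truth[i]
    return curated

# Example: push the top-1 accuracy of toy predictions up to 80%.
preds = ["dog", "cat", "sky", "car", "tree"]
truth = ["dog", "sun", "sea", "car", "rose"]
print(curate_top1(preds, truth, 0.8))
```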
Next, we train a local sentiment prediction model using the two curated datasets individually. Figure 4(a) shows the accuracy and the F1 score of the proposed model given different percentages of correct top-1 visual attributes. As expected, the model performs better when more correct visual attributes are provided; in particular, the performance increases almost linearly with the percentage of correct top-1 visual attributes. Meanwhile, the performance of our model also increases with more correct top-5 manually curated visual attributes. However, the increase is not as significant as in the top-1 case. This is expected given that the top-1 accuracy can only reach 35.8% even when we manually curate the top-5 accuracy to 100%. These results indicate that the proposed attention model needs good attributes in order to produce better visual sentiment analysis results. However, it is interesting to see that the proposed attention mechanism makes the localization of sentiment-related image regions possible, which is another interesting and challenging research problem.

Figure 4: Performance (accuracy and F1 score) of the proposed model on visual sentiment analysis with different levels of manually curated visual attributes: (a) manually curated on top-1; (b) manually curated on top-5.

Conclusions

Visual sentiment analysis is a challenging and interesting problem. Current state-of-the-art approaches focus on using visual features from the whole image to build sentiment classifiers. In this paper, we adopt an attention mechanism to discover sentiment-relevant local regions and build sentiment classifiers on these localized visual features. The key idea is to match local image regions with descriptive visual attributes. Because the visual attribute detector is not the main problem we set out to solve, we have experimented with different strategies for generating visual attributes to evaluate the effectiveness of the proposed model. The experimental results suggest that more accurate visual attributes lead to better performance on visual sentiment analysis. In particular, the studied attribute detector, a basic and direct fine-tuning strategy on a CNN, can achieve performance comparable to a CNN using global visual features. More importantly, the attention model enables us to localize the relevant regions in an image, which is much more interesting. We hope that our work on using local image regions will encourage more studies on visual sentiment analysis. In the future, we plan to incorporate visual context and large-scale user-generated images for building a rich and robust attribute detector, localizing sentiment-relevant local image regions, and learning a robust visual sentiment classifier.

Acknowledgment

This work was generously supported in part by Adobe Research and New York State through the Goergen Institute for Data Science at the University of Rochester.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

Borth, D.; Chen, T.; Ji, R.; and Chang, S.-F. 2013. SentiBank: Large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In Proceedings of the 21st ACM International Conference on Multimedia. ACM.

Borth, D.; Ji, R.; Chen, T.; Breuel, T.; and Chang, S.-F. 2013. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM International Conference on Multimedia, 223–232. ACM.

Campos, V.; Jou, B.; and Giro-i Nieto, X. 2016. From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction. arXiv preprint arXiv:1604.03489.

Cao, D.; Ji, R.; Lin, D.; and Li, S. 2014. A cross-media public sentiment analysis system for microblog. Multimedia Systems 1–8.

Chen, T.; Yu, F. X.; Chen, J.; Cui, Y.; Chen, Y.-Y.; and Chang, S.-F. 2014. Object-based visual sentiment concept analysis and application. In Proceedings of the ACM International Conference on Multimedia, 367–376. ACM.

Escorcia, V.; Niebles, J. C.; and Ghanem, B. 2015. On the relationship between visual attributes and convolutional networks. In CVPR 2015, 1256–1264. IEEE.

Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; and Mikolov, T. 2013. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (NIPS), 2121–2129.

Jou, B., and Chang, S.-F. 2016. Deep cross residual learning for multitask visual recognition. arXiv preprint arXiv:1604.01335.

Jou, B.; Chen, T.; Pappas, N.; Redi, M.; Topkara, M.; and Chang, S.-F. 2015. Visual affect around the world: A large-scale multilingual visual sentiment ontology. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM.
Karpathy, A., and Li, F. 2015. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3128–3137.

Kiros, R.; Salakhutdinov, R.; and Zemel, R. S. 2014. Unifying visual-semantic embeddings with multimodal neural language models. CoRR abs/1411.2539.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

Lu, X.; Lin, Z.; Jin, H.; Yang, J.; and Wang, J. Z. 2014. RAPID: Rating pictorial aesthetics using deep learning. In Proceedings of the ACM International Conference on Multimedia, 457–466. ACM.

Ma, L.; Lu, Z.; Shang, L.; and Li, H. 2015. Multimodal convolutional neural networks for matching image and sentence. In The IEEE International Conference on Computer Vision (ICCV).

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS), 3111–3119.

Mnih, V.; Heess, N.; Graves, A.; and Kavukcuoglu, K. 2014. Recurrent models of visual attention. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27, 2204–2212. Curran Associates, Inc.

Oquab, M.; Bottou, L.; Laptev, I.; and Sivic, J. 2014. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1717–1724.

Pang, B., and Lee, L. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2):1–135.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Siersdorfer, S.; Minack, E.; Deng, F.; and Hare, J. 2010a. Analyzing and predicting sentiment of images on the social web. In Proceedings of the 18th ACM International Conference on Multimedia, 715–718. ACM.

Siersdorfer, S.; Minack, E.; Deng, F.; and Hare, J. S. 2010b. Analyzing and predicting sentiment of images on the social web. In ACM MM, 715–718.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Socher, R.; Karpathy, A.; Le, Q. V.; Manning, C. D.; and Ng, A. Y. 2014. Grounded compositional semantics for finding and describing images with sentences. TACL 2:207–218.

Srivastava, N., and Salakhutdinov, R. 2014. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research 15(1):2949–2980.

Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 1556–1566.

Vinyals, O.; Kaiser, Ł.; Koo, T.; Petrov, S.; Sutskever, I.; and Hinton, G. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2755–2763.

Wang, M.; Cao, D.; Li, L.; Li, S.; and Ji, R. 2014. Microblog sentiment analysis based on cross-media bag-of-words model. In ICIMCS, 76:76–76:80. ACM.

Wang, Y.; Wang, S.; Tang, J.; Liu, H.; and Li, B. 2015. Unsupervised sentiment analysis for social media images. In 24th International Joint Conference on Artificial Intelligence (IJCAI).

Weston, J.; Bengio, S.; and Usunier, N. 2011. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, 2764–2770.

Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A. C.; Salakhutdinov, R.; Zemel, R. S.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2048–2057.
You, Q.; Luo, J.; Jin, H.; and Yang, J. 2015. Robust image sentiment analysis using progressively trained and domain transferred deep networks. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 381–388.

You, Q.; Jin, H.; Wang, Z.; Fang, C.; and Luo, J. 2016a. Image captioning with semantic attention. In CVPR 2016.

You, Q.; Luo, J.; Jin, H.; and Yang, J. 2016b. Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM), 13–22.
