Adaptive multimodal fusion based similarity measures in music information retrieval


ADAPTIVE MULTIMODAL FUSION BASED SIMILARITY MEASURES IN MUSIC INFORMATION RETRIEVAL

ZHANG BINGJUN (B.Sc., Hons, Tsinghua University)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE, SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2010

Acknowledgement

First and foremost, I would like to express my deepest gratitude to my supervisor, Dr Wang Ye. He has been guiding me since the beginning of my research journey. His enormous passion, deep knowledge, and great personality have been my strong support through each stage of that journey. All the virtues I learned from him will light up the rest of my life.

During my research journey, my wife, Gai Jiazi, my parents, and my parents-in-law have been my strongest spiritual support. I always have their warm arms when I go through difficult times. I am deeply indebted to them for their love and support.

I also would like to thank my lab mates, who have worked with me on the same projects, discussed tough questions, and shared many happy college times: Xiang Qiaoliang, Zhao Zhendong, Li Zhonghua, Zhou Yinsheng, Zhao Wei, Wang Xinxi, Yi Yu, Huang Yicheng, Huang Wendong, Zhu Jia, Charlotte Tan, Wu Zhijia, and many more. I miss you all and wish you a very bright future.

Last but not least, I would like to thank the School of Computing and the National University of Singapore. I feel very lucky to have pursued my PhD in this great school and university. Their inspiring research environment and excellent support have been part of the foundation for my research achievements.

Contents

Acknowledgement
Contents
Summary
List of Publications
List of Tables
List of Figures
Abbreviations
1 Introduction
  1.1 Background
    1.1.1 Multimodal Fusion based Similarity Measures
    1.1.2 Adaptive Multimodal Fusion based Similarity Measures
  1.2 Research Aims
  1.3 Methodology
  1.4 Contributions
2 Customized Multimodal Music Similarity Measures
  2.1 Introduction
  2.2 The Framework
    2.2.1 Fuzzy Music Semantic Vector - FMSV
    2.2.2 Adaptive Music Similarity Measure
    2.2.3 CompositeMap: From Rigid Acoustic Features to Adaptive FMSVs
    2.2.4 iLSH Indexing Structure
    2.2.5 Composite Ranking
  2.3 Experimental Configuration
    2.3.1 Design of Database and Query
    2.3.2 Methodology
  2.4 Result Analysis
    2.4.1 Effectiveness Study
    2.4.2 Efficiency Study
3 Query-Dependent Fusion by Regression-on-Folksonomies
  3.1 Introduction
  3.2 Automatic Query Formation
    3.2.1 Folksonomies to Social Query Space
    3.2.2 Social Query Sampling
  3.3 Regression Model for QDF
    3.3.1 Model Definition
    3.3.2 Regression Pegasos
    3.3.3 Online Regression Pegasos
    3.3.4 Class-based vs. Regression-based QDF
  3.4 Experimental Configuration
    3.4.1 Test Collection
    3.4.2 Multimodal Search Experts
    3.4.3 Methodology
  3.5 Result Analysis
    3.5.1 Effectiveness Study
    3.5.2 Efficiency Study
    3.5.3 Robustness Study
4 Multimodal Fusion based Music Event Detection and its Applications in Violin Transcription
  4.1 Introduction
  4.2 System Description
  4.3 Audio Processing
    4.3.1 Audio-only Onset Detection
    4.3.2 Audio-only Pitch Estimation
  4.4 Video Processing
    4.4.1 Bowing Analysis for Onset Detection
    4.4.2 Fingering Analysis for Onset Detection
  4.5 Audio-Visual Fusion
    4.5.1 Feature Level Fusion
    4.5.2 Decision Level Fusion
    4.5.3 Audio-Visual Violin Transcription
  4.6 Evaluation
    4.6.1 Audio-Visual Violin Database
    4.6.2 Evaluation Metric
    4.6.3 Experimental Results
  4.7 Related Works
5 Conclusions and Future Research
Bibliography

Summary

In the field of music information retrieval (MIR), one fundamental research problem is measuring the similarity between music documents. Based on a viable similarity measure, MIR systems can be made more effective in helping users retrieve relevant music information. Music documents are inherently multi-faceted. They contain not only multiple sources of information, e.g., textual metadata, audio content, video content, and images, but also multiple aspects of information, e.g., genre, mood, and rhythm. Fusing the multiple modalities effectively and efficiently is essential for discovering good similarity measures.

In this thesis, I propose and investigate a comprehensive adaptive multimodal fusion framework to construct more effective similarity measures for MIR applications. The basic philosophy is that music documents with different content require different fusion strategies to combine their multiple modalities.
In addition, the same documents in different contexts need adaptive fusion strategies to derive effective similarity measures for particular multimedia tasks.

Based on the above philosophy, I proposed a multi-faceted music search engine that allows users to customize their most preferred music aspects in a search operation, so that the similarity measure underlying the search engine is adapted to the users' instant information needs. This adaptive multimodal fusion based similarity measure allows more relevant music items to be retrieved. On this multi-faceted music search engine, a query-dependent fusion approach was also proposed to improve the adaptiveness of the music similarity measure to different user queries. As revealed in the experimental results, the proposed adaptive fusion approach improved the search effectiveness by combining the multiple music aspects with customized fusion strategies for different user queries. We also investigated state-of-the-art fusion techniques in the audio-visual violin transcription task and built a prototype system for violin tutoring in a home environment based on the audio-visual fusion techniques. Future plans are proposed to investigate adaptive fusion approaches in semantic music similarity measures so that a more user-friendly music search engine can be made possible.

List of Publications

Bingjun Zhang, Qiaoliang Xiang, Huanhuan Lu, Jialie Shen, and Ye Wang, Comprehensive query-dependent fusion using regression-on-folksonomies: a case study of multimodal music search. In ACM Multimedia, 2009. [regular paper]

Bingjun Zhang, Qiaoliang Xiang, Ye Wang, and Jialie Shen, CompositeMap: a novel music similarity measure for personalized multimodal music search. In ACM Multimedia, 2009. [demo]

Bingjun Zhang, Jialie Shen, Qiaoliang Xiang, and Ye Wang, CompositeMap: a novel framework for music similarity measure. In ACM SIGIR, 2009. [regular paper]

Bingjun Zhang and Ye Wang, Automatic music transcription using audio-visual fusion for violin practice in home environment. Technical Report, School of Computing, National University of Singapore, 2009.

Huanhuan Lu, Bingjun Zhang, Ye Wang, and Wee Kheng Leow, iDVT: a digital violin tutoring system based on audio-visual fusion. In ACM Multimedia, 2008. [demo]

Chee Chuan Toh, Bingjun Zhang, and Ye Wang, Multiple-feature fusion based onset detection for solo singing voice. In International Conference on Music Information Retrieval, 2008.

Ye Wang and Bingjun Zhang, Application-specific music transcription for instrument tutoring. In IEEE MultiMedia, 2008.

Olaf Schleusing, Bingjun Zhang, and Ye Wang, Onset detection in pitched non-percussive music using warping-compensated correlation. In ICASSP, 2008.

Bingjun Zhang, Jia Zhu, Ye Wang, [...]

Chapter 4  Multimodal Fusion based Music Event Detection and its Applications in Violin Transcription

[...] improvement brought by the visual modality in feature concatenation fusion. One way to balance the audio and visual dimensions is to apply an additional dimensionality reduction technique, such as PCA, to the audio feature space. However, according to our extra experiments, after reducing the 45-dimensional MFCCs to a much lower dimensionality by PCA, the classification performance of the audio-only modality suffers due to the loss of information. After fusing the more balanced audio and visual modalities, the overall onset detection performance is no better than that of the unbalanced feature concatenation fusion. Given this dilemma, feature concatenation fusion at the feature level is not suitable for our application. For the linear weighted sum fusion at the decision level, the noise of the visual modality propagates into the fused onset detection function more severely than in the SVM based fusion, which results in less improvement by the linear weighted sum fusion.
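The two decision-level strategies compared here can be sketched compactly. The following is a hedged illustration, not the system's implementation: the frame-level detection functions, the fixed weights, and the RBF-kernel choice are assumptions, and the actual features and training setup in this chapter differ.

```python
import numpy as np
from sklearn.svm import SVC

def weighted_sum_fusion(audio_odf, visual_odf, w_audio=0.7, w_visual=0.3):
    """Linear weighted sum fusion of two frame-level onset detection functions.
    Noise in either stream propagates directly into the fused function."""
    return w_audio * np.asarray(audio_odf) + w_visual * np.asarray(visual_odf)

def train_svm_fusion(audio_odf, visual_odf, onset_labels):
    """Decision-level SVM fusion: each frame becomes a 2-D point
    (audio onset score, visual onset score) and a non-linear SVM learns
    the fused onset decision. Labels are assumed to be 0/1 with 1 = onset."""
    X = np.column_stack([audio_odf, visual_odf])
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X, onset_labels)
    return clf

def svm_fused_odf(clf, audio_odf, visual_odf):
    """Fused onset detection function taken as the SVM's onset-class probability."""
    X = np.column_stack([audio_odf, visual_odf])
    return clf.predict_proba(X)[:, 1]
```

In this sketch the fused detection function would still be thresholded or peak-picked to obtain onset times, exactly as with the single-modality detection functions.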
Compared with the other fusion methods, the SVM based fusion at the decision level improves the onset detection performance by the largest F-measure margin. This not only reveals the advantage of decision level fusion, in which the audio and visual modalities have the same representation (a detection function for onsets) and balanced dimensions (one dimension for each data stream), but also verifies the effectiveness of the SVM's non-linearity and optimal separating hyperplane in fusing the audio and visual modalities of violin playing for onset detection.

Performance Comparison of Transcription

The best transcription accuracy for the audio-only and audio-visual transcription approaches is shown in Fig. 4.7. In the audio-only case, with MFCC GMM onset detection, the overall transcription accuracy is 71%, 65%, 42%, and 30% on the four databases, respectively. When fusing the audio and visual modalities, the best transcription accuracies are 85%, 79%, 62%, and 50% on the four databases, improving over the audio-only approach by 14 to 20 percentage points (shown in Fig. 4.7).

[Figure 4.7: Performance improvement by the visual modality with SVM based decision level fusion in different noisy conditions. F-measure / accuracy of audio-visual versus audio-only onset detection and transcription at 24 dB, 15 dB, 0 dB, and -5 dB SNR.]

As shown in the experimental results, the visual modality is helpful in improving onset detection performance and transcription accuracy. Especially for violin practice at home, where the acoustic conditions are far from ideal, introducing the visual modality is beneficial to a high-performance music transcription system.

4.7 Related Works

Few works have been published on music transcription by fusing multimodal features. The drum transcription in [28] is the first system we found dealing with percussive sounds using both audio and visual modalities. Tempo analysis of sitar performance based on multimodal sensor fusion is found in [8]. Our previous work in [69] is the first attempt at violin transcription with audio-visual inputs. However, that system used markers to aid bowing and fingering analysis, which is less practical compared with the system in this chapter. We have also made an attempt at automatic fingering analysis without markers in [79]. Nevertheless, the finger tracking algorithm in [79] is more computationally expensive and less suitable for practical applications than the work in this chapter. The correlation between violin music and the visual modality, bowing and fingering, has been shown in cognitive brain research [5] and other violin literature [6]. Inspired by those works, we introduced the visual modality to exploit the complementary information between the audio and visual modalities in violin transcription. The audio-visual fusion based approach significantly improves violin transcription performance in our experimental results. Superior performance has also been observed when using multiple modalities in audio-visual speech recognition [54], audio-visual biometrics [26], concept detection in multimedia data [70], etc.
Chapter 5  Conclusions and Future Research

In summary, multimedia content and user contexts affect the importance of different facets of multimedia documents in deriving multimedia similarity measures. The research work in this thesis investigated the effectiveness of different adaptive multimodal fusion approaches, with the vision that a more effective similarity measure for music information retrieval (MIR) applications can be made possible. In our research work, we built a multi-faceted music search engine that lets users retrieve more relevant music items based on their changing preferences in different search contexts.

We presented CompositeMap, a novel multimodal music similarity measure framework that facilitates various music retrieval tasks such as organizing, browsing, and searching in a large data set. We detailed the FMSV, which can map any existing audio features into high-level concepts such as genre and mood. CompositeMap unified content-based, metadata-based, and semantic description-based music retrieval approaches. It combined different music facets into a compact signature which can enable customized services for users with different information needs, background knowledge, and expectations. As a case study, we employed CompositeMap in a music search engine to evaluate its effectiveness, efficiency, adaptiveness, and scalability using two separate large-scale music collections extracted from YouTube. Our objective evaluation and user study showed the clear advantages of the proposed framework compared with other approaches [22, 41, 75]. Furthermore, our project led to several innovations, including an efficient SVM training algorithm with multi-class probability estimates and an incremental Locality Sensitive Hashing algorithm, which could be used to improve previous multimedia information retrieval systems as a whole.

We also proposed a query-dependent fusion approach based on this multimodal music search engine to improve the search effectiveness, adaptiveness, and robustness. We outlined a novel query-dependent fusion (QDF) method using regression-on-folksonomies to facilitate multimodal music search in large databases. Previous QDF approaches [15, 73, 33, 72, 71, 32] relied on manually designed queries, which imposed expensive human involvement in system development. We pursued the alternative of automatic query formation to easily generate a large number (in the millions) of comprehensive queries from readily available online folksonomy data. This approach not only provided better generalization performance for real-life search systems in accommodating future user queries, but also offered great feasibility and practicality in real-life system development. In addition, we proposed an online regression model for query-dependent fusion (RQDF). The model represents a further step towards optimal query-dependent fusion, which unleashes the power of multimodal music search. Its superior modeling capability not only enhanced the effectiveness of query-dependent fusion systems, but also significantly improved the system's efficiency, scalability, and robustness. Due to the generality of RQDF, we believe that it can be easily extended to other multimodal search applications such as text/video/image search and metasearch.
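As a simplified illustration of the regression-based QDF idea, the sketch below learns a linear mapping from query features to per-expert fusion weights with Pegasos-style stochastic sub-gradient updates. The class and function names, the squared-loss objective, and the 1/(lambda*t) step size are assumptions for illustration; they are not the exact Regression Pegasos formulation of Chapter 3.

```python
import numpy as np

class RegressionQDF:
    """Pegasos-style stochastic sub-gradient regression that learns a linear
    mapping from query features to fusion weights over multiple search experts.
    A hedged sketch: the thesis's Regression Pegasos objective may differ."""

    def __init__(self, n_query_features, n_experts, lam=1e-4):
        self.W = np.zeros((n_experts, n_query_features))  # one weight row per expert
        self.lam = lam  # L2 regularization strength
        self.t = 0      # update counter for the 1/(lam * t) step size

    def predict_weights(self, q):
        """Map query features q to non-negative, normalized fusion weights."""
        w = np.maximum(self.W @ q, 0.0)
        s = w.sum()
        return w / s if s > 0 else np.full(len(w), 1.0 / len(w))

    def partial_fit(self, q, target_weights):
        """One online update towards target weights (e.g., the weights that
        maximized retrieval quality for this automatically formed query)."""
        self.t += 1
        eta = 1.0 / (self.lam * self.t)
        err = self.W @ q - target_weights            # squared-loss residual per expert
        grad = self.lam * self.W + np.outer(err, q)  # regularized sub-gradient
        self.W -= eta * grad

def fuse(score_lists, weights):
    """Late fusion: weighted sum of per-expert relevance scores for each item."""
    return sum(w * np.asarray(s) for w, s in zip(weights, score_lists))
```

At query time, the learned model maps a new query's feature representation to fusion weights, which are then used to combine the ranked lists returned by the individual multimodal search experts.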
We also examined state-of-the-art multimodal fusion techniques in the task of audio-visual violin transcription. We built an audio-visual fusion based music transcription system for violin practice in a home environment. To address the difficulties of onset detection for pitched non-percussive (PNP) sounds, such as those of the violin, we proposed an audio-only onset detection approach based on supervised learning. Two GMMs are used to classify onset and non-onset audio frames based on MFCC features. The MFCC features model the spectral envelope effectively, which forms the basis of the superior classification performance. In addition, due to the efficient modeling by the GMMs and the low dimensionality of the MFCCs, the proposed audio-only onset detection method is computationally efficient, making it suitable for practical applications. To further enhance audio-only onset detection, the visual modality of violin playing, including bowing and fingering, was introduced into our system. The system's two webcams can be easily placed to capture bowing and fingering videos in a home environment. Fully automatic and real-time algorithms were devised to conduct the bowing and fingering analysis. These algorithms maximize the practicality of the system. State-of-the-art multimodal fusion techniques were evaluated to fuse the audio and visual modalities for enhanced onset detection and overall transcription performance. SVM based decision level fusion was verified to be superior to feature concatenation fusion at the feature level and linear weighted sum fusion at the decision level. Compared with previous audio-only transcription systems [7, 17, 36, 42], the visual modality and SVM based decision level fusion improved the transcription performance significantly. Especially in a home environment, where the acoustic conditions are far from ideal, the performance improvement from the visual modality is more substantial. Based on the above contributions and extensive evaluations, the violin transcription system achieved good performance even in acoustically inferior conditions. This transcription system is able to provide more accurate transcribed results as feedback to students when they practice the violin at home. With efficient and automatic audio-visual analysis algorithms, the system can be set up once and for all in a home environment.

Even though the effectiveness and efficiency of multimodal similarity measures have been improved by our research results, the semantic aspects of similarity measures in the music domain are not sufficiently addressed. In the future, more research effort should be allocated to investigating the semantic aspects of music similarity measures. This can be done by referring more to human-perceived importance when fusing multiple information sources from textual metadata, audio content, video content, and other modalities. More user-friendly music search engines will be made possible based on more semantically effective music similarity measures.
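The two-GMM onset/non-onset frame classification summarized above can be sketched as follows. This is a hedged illustration rather than the thesis implementation: the MFCC extraction, the number of mixture components, and the thresholding of the log-likelihood-ratio detection function are assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_onset_gmms(mfcc_frames, onset_labels, n_components=8):
    """Fit one GMM on MFCCs of onset frames and one on non-onset frames.
    mfcc_frames: (n_frames, n_mfcc) array; onset_labels: boolean array."""
    gmm_onset = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm_non_onset = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm_onset.fit(mfcc_frames[onset_labels])
    gmm_non_onset.fit(mfcc_frames[~onset_labels])
    return gmm_onset, gmm_non_onset

def onset_detection_function(mfcc_frames, gmm_onset, gmm_non_onset):
    """Per-frame log-likelihood ratio; larger values indicate likely onsets."""
    return gmm_onset.score_samples(mfcc_frames) - gmm_non_onset.score_samples(mfcc_frames)

def pick_onsets(odf, threshold=0.0):
    """Report frame indices where the detection function crosses the threshold upwards."""
    above = odf > threshold
    return np.where(above[1:] & ~above[:-1])[0] + 1
```

In practice the resulting detection function would typically be smoothed and peak-picked before being fused with the bowing and fingering cues.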
Bibliography

[1] http://opencvlibrary.sourceforge.net.
[2] http://www.last.fm.
[3] http://www.YouTube.com.
[4] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS'06, 2006.
[5] A. P. Baader, O. Kazennikov, and M. Wiesendanger. Coordination of bowing and fingering in violin playing. Cognitive Brain Research, 23:436–443, 2005.
[6] A. Bachmann. An encyclopedia of the violin. Da Capo Press, 1975.
[7] J. Bello and M. Sandler. Phase-based note onset detection for music signals. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 49–52, 2003.
[8] M. S. Benning, A. Kapur, B. C. Till, and G. Tzanetakis. Multimodal sensor analysis of sitar performance: Where is the beat? In IEEE Workshop on Multimedia Signal Processing, pages 74–77, 2007.
[9] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975.
[10] A. Berenzweig, B. Logan, D. Ellis, and B. Whitman. A large-scale evaluation of acoustic and subjective music-similarity measures. Comput. Music J., 2004.
[11] E. Brookner. Tracking and Kalman Filtering Made Easy. John Wiley & Sons, 1998.
[12] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998.
[13] M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney. Content-based music information retrieval: current directions and future challenges. Proc. of the IEEE, 2008.
[14] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001.
[15] T.-S. Chua, S.-Y. Neo, K.-Y. Li, G. Wang, R. Shi, M. Zhao, and H. Xu. TRECVID 2004 search and feature extraction task by NUS PRIS. In NIST TRECVID Workshop, 2004.
[16] N. Collins. A comparison of sound onset detection algorithms with emphasis on psycho-acoustically motivated detection functions. In Proceedings of the AES 118th Convention, 2005.
[17] N. Collins. Using a pitch detector for onset detection. In Proceedings of International Conference on Music Information Retrieval, 2005.
[18] B. Cui, L. Liu, C. Pu, J. Shen, and K. L. Tan. Quest: querying music databases by acoustic and textual features. In ACM Multimedia, 2007.
[19] C. V. Damme, M. Hepp, and K. Siorpaes. FolksOntology: An integrated approach for turning folksonomies into ontologies. In the ESWC Workshop, 2007.
[20] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv., 40(2):1–60, April 2008.
[21] J. S. Downie. The scientific evaluation of music information retrieval systems: Foundations and future. Computer Music Journal, 28(2):12–23, 2004.
[22] J. S. Downie. The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research. Acoustical Science and Technology, 2008.
[23] J. S. Downie. The music information retrieval evaluation exchange (MIREX). In ISMIR'08, 2008.
[24] N. R. Draper and H. Smith. Applied Regression Analysis. Wiley-Interscience, 1998.
[25] S. Essid, G. Richard, and B. David. Instrument recognition in polyphonic music based on automatic taxonomies. IEEE Trans. Acoust., Speech, Signal, 2006.
[26] J. Fierrez-Aguilar, J. Ortega-Garcia, D. Garcia-Romero, and J. Gonzalez-Rodriguez. A comparative evaluation of fusion strategies for multimodal biometric verification. Audio- and Video-based Biometric Person Authentication, pages 1056–1056, 2003.
[27] R. Garcia and O. Celma. Semantic integration and retrieval of multimedia metadata. In 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media, 2005.
[28] O. Gillet and G. Richard. Automatic transcription of drum sequences using audiovisual features. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 205–208, 2005.
[29] D. L. Hall. Mathematical Techniques in Multisensor Data Fusion. Artech House, Inc., 2004.
[30] T. Huang, R. Weng, and C. Lin. Generalized Bradley-Terry models and multi-class probability estimates. J. Mach. Learn. Res., 2006.
[31] I.-H. Kang and G. Kim. Query type classification for web document retrieval. In SIGIR'03, 2003.
[32] L. Kennedy, S. F. Chang, and A. Natsev. Query-adaptive fusion for multimodal search. Proc. of the IEEE, 2008.
[33] L. Kennedy, A. P. Natsev, and S. F. Chang. Automatic discovery of query-class-dependent models for multimodal search. In ACM Multimedia, 2005.
[34] N. Kiryati, X. Eldar, and A. Bruckstein. A probabilistic Hough transform. Pattern Recognition, 1991.
[35] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Trans. on Signal Processing, 2004.
[36] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 6, pages 3089–3092, 1999.
[37] A. Lacoste and D. Eck. A supervised classification algorithm for note onset detection. EURASIP Journal on Applied Signal Processing, 2007(1):153–153, 2007.
[38] M. Lew, N. Sebe, C. Djeraba, and R. Jain. Content-based multimedia information retrieval: State of the art and challenges. ACM Trans. Multimedia Comput. Commun. Appl., 2006.
[39] D. Li, N. Dimitrova, M. Li, and I. K. Sethi. Multimedia content processing through cross-modal association. In Proceedings of ACM Conference on Multimedia, pages 604–611, 2003.
[40] B. Logan. Mel frequency cepstral coefficients for music modeling. In Proc. of the ISMIR, 2000.
[41] B. Logan and A. Salomon. A music similarity function based on signal analysis. In Proc. of IEEE ICME, 2001.
[42] A. Loscos, Y. Wang, and W. Boo. Low level descriptors for automatic violin transcription. In Proceedings of International Conference on Music Information Retrieval, 2006.
[43] L. Lu, D. Liu, and H. Zhang. Automatic mood detection and tracking of music audio signals. IEEE Trans. Acoust., Speech, Signal, 2006.
[44] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[45] Q. Mei, J. Jiang, H. Su, and C. Zhai. Search and tagging: Two sides of the same coin? Technical report, UIUC, 2007.
[46] P. Mika. Ontologies are us: A unified model of social networks and semantics. Web Semantics, 2007.
[47] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 1995.
[48] W. Ng, D. Yeung, M. Firth, E. Tsang, and X. Wang. Feature selection using localized generalization error for supervised classification problems using RBFNN. Pattern Recognition, 2008.
[49] X. Olivares, M. Ciaramita, and R. van Zwol. Boosting image retrieval through aggregating search results based on visual annotations. In ACM Multimedia, 2008.
[50] N. Orio. Music retrieval: a tutorial and review. Found. Trends Inf. Retr., 1:1–96, 2006.
[51] P. Over, G. M. Awad, T. Rose, J. Fiscus, W. Kraaij, and A. F. Smeaton. TRECVID 2008 - goals, tasks, data, evaluation mechanisms and metrics. In TRECVID 2008, 2008.
[52] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 2000.
[53] M. F. Porter. An algorithm for suffix stripping. Program, 1980.
[54] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9):1306–1326, 2003.
[55] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
[56] D. A. Reynolds and R. C. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72–83, 1995.
[57] S. E. Robertson, S. Walker, M. M. Beaulieu, and M. Gatford. Okapi at TREC-4. In TREC-4, 1995.
[58] Y. Rui, T. S. Huang, and S.-F. Chang. Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10:39–62, 1999.
[59] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML'07, 2007.
[60] B. Schölkopf and A. J. Smola. Learning with Kernels. Cambridge, MA: MIT Press, 2001.
[61] J. A. Shaw and E. A. Fox. Combination of multiple searches. In TREC-2, 1994.
[62] J. Shi and C. Tomasi. Good features to track. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 593–600, 1994.
[63] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In ICML'08, 2008.
[64] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In MIR'06, 2006.
[65] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet. Towards musical query-by-semantic-description using the CAL500 data set. In Proc. of ACM SIGIR, 2007.
[66] G. Tzanetakis and P. Cook. Marsyas: a framework for audio analysis. Organized Sound, 2000.
[67] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Trans. on Speech and Audio Proc., 2002.
[68] V. Vezhnevets, V. Sazonov, and A. Andreeva. A survey on pixel-based skin color detection techniques. In Proceedings of Graphicon, pages 85–92, 2003.
[69] Y. Wang, B. Zhang, and O. Schleusing. Educational violin transcription by fusing multimedia streams. In Workshop on Educational Multimedia and Multimedia Education, 2007.
[70] Y. Wu, E. Y. Chang, K. C. Chang, and J. R. Smith. Optimal multimodal fusion for multimedia data analysis. In Proceedings of ACM Conference on Multimedia, pages 572–579, 2004.
[71] L. Xie, A. Natsev, and J. Tesic. Dynamic multimodal fusion in video search. In IEEE ICME'07, 2007.
[72] R. Yan and A. G. Hauptmann. Probabilistic latent query analysis for combining multiple retrieval sources. In SIGIR'06, 2006.
[73] R. Yan, J. Yang, and A. G. Hauptmann. Learning query-class dependent weights in automatic video retrieval. In ACM Multimedia, 2004.
[74] J. Yin, Y. Wang, and D. Hsu. Digital violin tutor: An integrated system for beginning violin learners. In ACM Multimedia, pages 976–985, 2005.
[75] J. You, S. Park, and I. Kim. An efficient frequent melody indexing method to improve the performance of query-by-humming systems. Journal of Information Science, 2008.
[76] B. Zhang, J. Shen, Q. Xiang, and Y. Wang. CompositeMap: A novel framework for music similarity measure. In Proc. of ACM SIGIR, 2009.
[77] B. Zhang and Y. Wang. Automatic music transcription using audio-visual fusion for violin practice in home environment. Technical Report, School of Computing, National University of Singapore, 2009.
[78] B. Zhang, Q. Xiang, H. Lu, J. Shen, and Y. Wang. Comprehensive query-dependent fusion using regression-on-folksonomies: A case study of multimodal music search. In review in ACM Multimedia, 2009.
[79] B. Zhang, J. Zhu, Y. Wang, and W. Leow. Visual analysis of fingering for pedagogical violin transcription. In ACM Multimedia, 2007.

[...]
[...] effective similarity measures for MIR applications by improving the adaptiveness of similarity measures within a comprehensive adaptive multimodal fusion framework. I investigate the multiple modalities in music documents that are informative to end users. In addition, I propose an adaptive fusion framework to derive similarity measures, which can combine the multiple modalities optimally depending on the [...]

[...] query-dependent fusion approach for the multimodal music search and investigate the influence of the music content on the fusion weights (Chapter 3);
• Evaluate the effectiveness of multimodal fusion approaches in multimedia content analysis tasks and violin music transcription. Introduce a visual modality, i.e., bowing and fingering of the violin playing, to infer onsets, thus enhancing the audio-only violin music [...]

[Table: comparison of music retrieval frameworks in terms of mapping, ranking measure (Mahalanobis, Euclidean, KL divergence, inner product of document vectors), indexing structure (inverted list, hashing, high-dimensional indexing tree, linear search, hybrid inverted list + iLSH with an index for the social dimension: title), and personalization support (No / No / No / Yes). Figure: offline mapping and indexing of an online music DB (LastFM, YouTube, etc.) into DV and FMSV indexes.]

[...] multiple and cross-modal music dimensions into a unified representation. These music dimensions further span a music space, in which adaptive music similarity can be measured between any two music items. Each dimension can be indexed separately using incremental Locality Sensitive Hashing (iLSH) or an inverted list in the indexing module. This framework facilitates flexible retrieval by involving the user's personalization [...]

The investigation of the multi-faceted music similarity measure should be helpful in determining whether adaptive or user-customized similarity measures are useful to improve search relevancy. The query-dependent fusion approach should shed light on how to further improve the adaptiveness of music similarity measures. The evaluation of the fusion techniques in multimodal violin [...]

[...] currently interested in; and 3) multimedia browsing systems can represent a collection of multimedia documents as a meaningful cluster hierarchy for users' easy navigation. Given its important position in multimedia information retrieval in general, similarity measures also play a key role in music information retrieval (MIR) [50], which is a sub-area of multimedia information retrieval specialized in dealing with music documents and their related information.

1.1.1 Multimodal Fusion based Similarity Measures

Early works on multimedia similarity measures [...]

[...] review on multimodal fusion based music similarity measures, we can see that in the music information retrieval (MIR) field the most significant music modalities for achieving effective MIR performance are not clear. In addition, how to combine different music aspects (e.g., genre, mood, tempo, etc.) in an optimal way with respect to the online queries or the music content is not well addressed. Different fusion approaches [...]
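Returning to the indexing module mentioned in the framework excerpt above, where each music dimension is indexed separately by iLSH or an inverted list, the following is a hedged sketch of random-hyperplane LSH with incremental insertion. It illustrates the general technique only; the class name, parameters, and candidate ranking are assumptions and not the thesis's iLSH algorithm.

```python
import numpy as np
from collections import defaultdict

class IncrementalLSHIndex:
    """Random-hyperplane LSH over one music dimension (e.g., one FMSV facet).
    Items can be inserted one at a time without rebuilding the index."""

    def __init__(self, dim, n_bits=16, n_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = [rng.standard_normal((n_bits, dim)) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.vectors = {}

    def _key(self, planes, v):
        # Sign pattern of projections onto random hyperplanes, used as a bucket key.
        return ((planes @ v) > 0).tobytes()

    def insert(self, item_id, v):
        """Add one item to every hash table; no global rebuild is needed."""
        self.vectors[item_id] = np.asarray(v, dtype=float)
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, self.vectors[item_id])].append(item_id)

    def query(self, q, top_k=10):
        """Collect hash-bucket candidates, then rank them by cosine similarity."""
        q = np.asarray(q, dtype=float)
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._key(planes, q), []))
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        return sorted(candidates, key=lambda i: cos(q, self.vectors[i]), reverse=True)[:top_k]
```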
[...] thesis and outlines future research directions.

Chapter 2  Customized Multimodal Music Similarity Measures

2.1 Introduction

Over the past decade, empowered by advances in networking, data compression, and digital storage, modern information systems have dealt with ever-increasing amounts of music data from various domain applications. Consequently, the development of advanced Music Information Retrieval (MIR) [...]

[...] used as musical content representations to facilitate applications [22, 41, 75] for searching similar music recordings in a database by content-related queries (audio clips, humming, tapping, etc.). However, the previous research on music content similarity measures focused mainly on a single-aspect similarity measure or a holistic similarity measure. In single-aspect similarity, only limited retrieval [...]
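The multi-aspect, customizable similarity idea behind these excerpts can be illustrated with a small sketch. The facet layout, the per-facet cosine similarity, and the preference-weighted combination below are assumptions made for illustration, not the exact CompositeMap / FMSV formulation of Chapter 2.

```python
import numpy as np

# Hypothetical facet layout of a fuzzy music semantic vector (FMSV):
# each facet holds membership scores for its concepts (e.g., genres, moods).
FACETS = {"genre": slice(0, 10), "mood": slice(10, 16), "rhythm": slice(16, 20)}

def facet_similarity(a, b, facet):
    """Cosine similarity restricted to one facet of two FMSVs."""
    x, y = a[FACETS[facet]], b[FACETS[facet]]
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def customized_similarity(a, b, preferences):
    """Adaptive similarity: combine per-facet similarities with weights that
    reflect the user's current preferences (normalized to sum to one)."""
    total = sum(preferences.values())
    return sum((w / total) * facet_similarity(a, b, f) for f, w in preferences.items())

# Example: a user who currently cares mostly about mood and somewhat about genre.
a, b = np.random.rand(20), np.random.rand(20)
print(customized_similarity(a, b, {"genre": 0.3, "mood": 0.6, "rhythm": 0.1}))
```

Under this kind of formulation, changing the preference weights changes the ranking of the same database for the same query, which is the sense in which the similarity measure adapts to the user's instant information needs.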

Ngày đăng: 11/09/2015, 09:15

Tài liệu cùng người dùng

  • Đang cập nhật ...

Tài liệu liên quan