Hindawi Publishing Corporation
International Journal of Distributed Sensor Networks
Volume 2014, Article ID 621805
http://dx.doi.org/10.1155/2014/621805

Research Article
Automatic Sound Scene Control Using Image Sensor Network

Changhee Cho,1 Jaehyung Park,2 and Kwangki Kim3

1 Graduate School of Interdisciplinary Program of E-Commerce, Chonnam National University, Yongbong-dong, Buk-gu, Gwangju 500-757, Republic of Korea
2 School of Electronics and Computer Engineering, Chonnam National University, Yongbong-dong, Buk-gu, Gwangju 500-757, Republic of Korea
3 Department of Digital Contents, Korea Nazarene University, Cheonan 331-718, Republic of Korea

Correspondence should be addressed to Jaehyung Park; hyeoung@chonnam.ac.kr and Kwangki Kim; k2kim@kornu.ac.kr

Received 28 February 2014; Accepted 14 April 2014; Published May 2014

Academic Editor: Carlos Ramos

Copyright © 2014 Changhee Cho et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose an automatic sound scene control system that uses an image sensor network to preserve a constant sound scene regardless of the user's movement. In the proposed system, the image sensor network detects the human location in the multichannel playback environment, and the SSC (sound scene control) module automatically controls the sound scene of the multichannel audio signals according to the estimated human location, given as angle information. To estimate the direction of the human face, we use the normalized RGB (red, green, and blue) values and the HSV (hue, saturation, and value) components calculated from the images obtained by the image sensor network. The direction of the human face is decided as the location of the image sensor whose captured image contains the highest number of pixels satisfying the normalized RGB and HSV thresholds. The estimated direction is fed directly to the SSC module, and the controlled sound scene is then simply generated. Experimental results show that the image sensor network detected the human location with an accuracy of about 98% and that the sound scene controlled by the SSC according to the detected location was perceived as the original sound scene with an accuracy of 95%.

1. Introduction

With the increase in multichannel audio sources such as DVD and consumers' demand for more realistic audio services, multichannel audio signals are becoming more important in audio coding and audio services. Because multichannel audio signals require a very high bit-rate for transmission, there have been many efforts to handle them efficiently with respect to bit-rate and sound quality, and spatial cue based multichannel audio coding schemes such as binaural cue coding (BCC), MPEG Surround, and sound source location coefficient coding have been introduced and developed [1–6]. These multichannel audio coding schemes do not attempt an approximate reconstruction of the original multichannel signals' waveforms; instead, they focus on delivering a perceptually satisfying replica of the original sound scene by exploiting knowledge about human perception [7–9]. In spatial cue based multichannel audio coding, the spatial image of the multichannel audio signals is captured by a compact set of parameters, that is, the spatial cues, and a down-mix signal.
In other words, the multichannel audio signals are represented as a down-mix signal and a small amount of side information while the sound image of the multichannel audio signals is successfully preserved. Accordingly, spatial cue based multichannel audio coding can dramatically reduce the bit-rate and provides an extremely efficient representation of the multichannel audio signals.

Apart from the coding efficiency and the sound quality determined by the spatial cues, spatial cue based multichannel audio coding offers another merit: the spatial cues can be used to create valuable functionality. Since the spatial image of the multichannel audio signals is preserved by the spatial cues, the sound scene can be controlled by modifying them. In other words, the spatial cues can be utilized not only to keep the sound quality of the multichannel audio signals but also to change their sound scene. We call this sound scene control (SSC) based on spatial cues, and this functionality provides users with interactivity [10]. Moreover, the SSC can be implemented in the frequency domain and needs only a few multiplications and additions, so the complexity of spatial cue based multichannel audio coding is rarely affected by the SSC [10]. One possible application of the SSC is to combine the sound scene controller with multiview video. In multiview broadcasting, which is expected in the near future, multichannel sound scene control can provide interactive audio playback systems and realistic audio sound by synchronizing the sound scene with the moving video scene.

Meanwhile, users perceive the original sound scene of the multichannel audio signals produced by content providers only when they are located at the center of the multichannel speaker layout. If users change their position, especially the direction of their head, they perceive a sound scene different from the original one due to the binaural effect [3, 11]. In other words, when the users' position changes, the original sound scene should be controlled according to the new position so that the users perceive a constant sound scene regardless of their movement. To achieve this goal, we propose an automatic sound scene control system using an image sensor network. In the proposed system, the human location (or the direction of the human face) is detected by the image sensor network, and the sound scene of the multichannel audio signals is automatically controlled by the SSC module mentioned above according to the estimated human location, given as angle information. The image sensor network consists of twelve image sensors uniformly arranged in the multichannel playback environment, giving a 30-degree resolution for detecting the direction of the human face. To estimate this direction, we use the normalized RGB (red, green, and blue) values and the HSV (hue, saturation, and value) components calculated from the images obtained by the image sensor network [12–14], because the normalized RGB and HSV values are useful for detecting human skin regions in images. Since the image obtained by the sensor located in the direction of the human face contains many pixels whose normalized RGB and HSV values satisfy the skin-detection thresholds, the direction of the human face can simply be decided as the location of the image sensor capturing the image with the highest number of such pixels. The estimated direction of the human face is directly fed to the SSC module, and the controlled sound scene is then simply generated.
The paper is organized as follows. In Section 2, the sound scene control method in MPEG Surround, a representative multichannel audio coder, is presented. In Section 3, the estimation of the direction of the human face using the image sensor network and the proposed automatic sound scene control system are described. In Sections 4 and 5, experimental results and the conclusion are given, respectively.

2. Sound Scene Control in MPEG Surround

MPEG Surround is a technology that represents multichannel audio signals as a down-mix signal and spatial cues [1–3]. MPEG Surround uses only the down-mix signal and the additional side information, that is, the spatial cues, to transmit the multichannel audio signals through a wired/wireless network system. Therefore, users can enjoy realistic multichannel audio through services such as digital audio broadcasting and digital multimedia broadcasting in wired/wireless network environments.

MPEG Surround uses the channel level difference (CLD) and the interchannel correlation (ICC) as spatial cues. The CLD is the main parameter in MPEG Surround because it determines the spectral power of the reconstructed multichannel audio signals and occupies a considerable amount of the side information [4]. The ICC, in contrast, is an ancillary parameter: it reflects the spatial diffuseness of the recovered multichannel audio signals and takes up only a small portion of the side information. As the multichannel audio sound is compressed and recovered using the down-mix signal and the spatial parameters, the performance of MPEG Surround in terms of coding efficiency and sound quality is determined by the spatial parameters. In other words, the sound image formed by the multichannel audio signals is captured and recovered by the CLD and the ICC. Consequently, the sound scene of the multichannel audio signals can also be controlled by modifying the spatial parameters.

The SSC is a tool that reproduces a new sound scene of the multichannel audio signals according to a global panning position freely input by a user or another system. Given a panning angle, denoted by θpan, the multichannel audio signals are rotated by θpan. To control the sound scene, the SSC modifies the spatial cues according to the input sound scene information and generates the modified spatial cues. Finally, the MPEG Surround decoder generates the multichannel audio signals with the controlled sound scene using the modified spatial cues. Figure 1 shows the structure of MPEG Surround with the SSC.

Figure 1: MPEG Surround with sound scene control module (the sound scene control block converts the spatial cue bitstream into a modified spatial cue bitstream according to the sound scene information, between the MPEG Surround encoder and the MPEG Surround decoder; the down-mix path is unchanged).

The procedure of the SSC in MPEG Surround is shown in Figure 2. First, the spatial parameters, the CLD and the ICC, are parsed from the transmitted spatial parameter bitstream. They are then modified according to θpan, the sound scene information; the CLD and the ICC are controlled separately, and it is assumed that θpan is fed to each modification module. The modified spatial parameters are formatted again, and finally the modified spatial parameter bitstream is generated.

Figure 2: Procedure of sound scene control (bitstream deformatter → gain factor converter → constant power panning → CLD converter for the CLD, and ICC modification for the ICC, both driven by θpan, followed by the bitstream formatter producing the modified spatial cue bitstream).

To modify the CLD, it is processed by the gain factor converter, constant power panning (CPP), and the CLD converter, sequentially. In the gain factor converter, the CLD is converted into a channel level gain for each subband. The gain factors are calculated from the CLD as follows:

\[
G_b^i = \sqrt{\frac{1}{1 + 10^{\mathrm{CLD}_b/10}}}, \qquad
G_b^{i+1} = G_b^i \cdot 10^{\mathrm{CLD}_b/20},
\tag{1}
\]

where \(G_b^i\) is the gain factor, the superscript \(i\) is the channel index, and the subscript \(b\) is the subband index. One CLD per subband thus provides the power gains of two channels. This gain conversion is applied to all CLDs, and each channel level gain is easily obtained by multiplying all the gain factors related to that channel.
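As a minimal illustration of (1), the following Python sketch converts a single CLD value into the corresponding pair of gain factors; the function name and the use of NumPy are our own choices rather than anything taken from the MPEG Surround specification.

```python
import numpy as np

def cld_to_gains(cld_db):
    """Convert one CLD value (in dB) into the two channel gain factors
    of equation (1). The squared gains sum to 1, so the pair preserves
    the total power of the subband."""
    g_i = np.sqrt(1.0 / (1.0 + 10.0 ** (cld_db / 10.0)))  # first channel
    g_next = g_i * 10.0 ** (cld_db / 20.0)                # second channel
    return g_i, g_next

# Example: CLD = 0 dB splits the power equally between the two channels.
g1, g2 = cld_to_gains(0.0)
print(g1, g2)          # both about 0.7071
print(g1**2 + g2**2)   # 1.0
```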
In the CPP module, the CPP law is applied to manipulate the position of each channel according to the desired sound scene [15, 16]. Assume that the channel gain \(G_b^i\) is to be positioned at θpan, located between the left front (Lf) and the left surround (Ls) channels as shown in Figure 3. Then \(G_b^i\) is projected onto the Lf and Ls channels as follows:

\[
\theta_m = \frac{\theta_{\mathrm{pan}} - \theta_1}{\mathrm{aperture} - \theta_1} \cdot \frac{\pi}{2}, \qquad
G_{b,\mathrm{new}}^{\mathrm{Lf}} = G_b^{\mathrm{Lf}} + \cos(\theta_m) \cdot G_b^i, \qquad
G_{b,\mathrm{new}}^{\mathrm{Ls}} = G_b^{\mathrm{Ls}} + \sin(\theta_m) \cdot G_b^i,
\tag{2}
\]

where \(\theta_m\) is the normalized angle limited to 90 degrees and the aperture is the angle between the two channels. In the same manner, any other channel gain can be flexibly handled to form the desired sound scene.

Figure 3: An example of the constant power panning law between two channels (a gain desired at θpan between Lf at θ1 and Ls is split according to the normalized angle θm within the aperture).

After the CPP processing, the modified CLDs are newly estimated from all the new channel gains in the CLD converter. Here, the CLD converter is exactly the same as the CLD extractor of the MPEG Surround encoder. If the CLD is estimated between the Lf and Ls channels, the modified CLD is calculated as follows:

\[
\mathrm{CLD}_{b,\mathrm{new}}^{\mathrm{Lf,Ls}} = 10 \log_{10} \left( \frac{\left(G_{b,\mathrm{new}}^{\mathrm{Ls}}\right)^2}{\left(G_{b,\mathrm{new}}^{\mathrm{Lf}}\right)^2} \right).
\tag{3}
\]

To modify the ICC perfectly, it would have to be reestimated according to the controlled sound scene. However, unlike the CLD, the ICC cannot be reestimated in the parameter domain, since the degree of correlation between channels can only be estimated in the signal domain. Because of this, the ICC cannot be perfectly controlled, which can degrade the overall sound quality after the sound scene is changed. In spite of this restriction, two kinds of ICC parameters can be modified in the case of sound scene rotation:

\[
\mathrm{ICC}_{\mathrm{new}}^{\mathrm{Ls,Lf}} = (1 - \eta)\, \mathrm{ICC}^{\mathrm{Ls,Lf}} + \eta\, \mathrm{ICC}^{\mathrm{Rs,Rf}}, \qquad
\mathrm{ICC}_{\mathrm{new}}^{\mathrm{Rs,Rf}} = (1 - \eta)\, \mathrm{ICC}^{\mathrm{Rs,Rf}} + \eta\, \mathrm{ICC}^{\mathrm{Ls,Lf}},
\tag{4}
\]

where \(\eta\) is given by

\[
\eta =
\begin{cases}
\dfrac{\theta_{\mathrm{pan}}}{\pi}, & \theta_{\mathrm{pan}} \le \pi, \\[2ex]
1 - \dfrac{\theta_{\mathrm{pan}} - \pi}{\pi}, & \theta_{\mathrm{pan}} > \pi.
\end{cases}
\tag{5}
\]

These equations mean that the left and right half-plane ICC parameters are completely cross-changed when the scene rotation equals 180 degrees. When the rotation angle increases beyond 180 degrees toward 360 degrees, the reverse cross-change occurs, and the modified ICC parameters become equal to the original ones at 360 degrees. This concept of modification originates from the common smoothing technique used in MPEG Surround [3, 4].
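The sketch below strings (2)–(5) together in Python as we have reconstructed them; angles are in radians, and the function names and argument layout are illustrative rather than taken from any MPEG Surround reference code.

```python
import numpy as np

def cpp_project(g_pan, g_lf, g_ls, theta_pan, theta1, aperture):
    """Project a channel gain g_pan, desired at angle theta_pan between
    Lf (at theta1) and Ls, onto the two channels per equation (2).
    Since cos^2 + sin^2 = 1, the power of g_pan is preserved."""
    theta_m = (theta_pan - theta1) / (aperture - theta1) * (np.pi / 2.0)
    return g_lf + np.cos(theta_m) * g_pan, g_ls + np.sin(theta_m) * g_pan

def gains_to_cld(g_lf_new, g_ls_new):
    """Re-estimate the modified CLD between Lf and Ls from the new
    channel gains, as in equation (3)."""
    return 10.0 * np.log10(g_ls_new**2 / g_lf_new**2)

def modify_icc(icc_ls_lf, icc_rs_rf, theta_pan):
    """Cross-fade the left and right half-plane ICCs for a scene
    rotation of theta_pan in [0, 2*pi], equations (4) and (5)."""
    if theta_pan <= np.pi:
        eta = theta_pan / np.pi
    else:
        eta = 1.0 - (theta_pan - np.pi) / np.pi
    return ((1.0 - eta) * icc_ls_lf + eta * icc_rs_rf,
            (1.0 - eta) * icc_rs_rf + eta * icc_ls_lf)
```

At θpan = π, η = 1 and the two ICC parameters are fully exchanged; at θpan = 2π, η returns to 0 and the original values are restored, matching the behavior described above.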
3. The Proposed Automatic Sound Scene Control System Using Image Sensor Network

As the human perception of the sound scene is determined by the listener's location in the multichannel playback environment, the image sensors are placed around the multichannel speaker layout as shown in Figure 4. Figure 4 shows that the twelve image sensors are uniformly distributed in the multichannel playback environment, so the resolution of the human localization is 30 degrees. Although the direction of the human face is an important factor for the perception of the sound scene, precise recognition of the human face is not necessary in our proposed system. Since the image obtained by the sensor located in the direction of the human face contains many pixels whose normalized RGB and HSV values satisfy the thresholds for detecting human skin, the direction of the human face can simply be decided as the location of the image sensor capturing the image with the highest number of such pixels.

Figure 4: Image sensor network in the multichannel playback environment.

Past research results have confirmed that human skin colors cluster in a small region of the RGB color space and differ more in brightness than in color [12–14]. Therefore, the normalized RGB values can be used to detect human faces with little variance in color. Generally, the color of each pixel in an image is represented by the combination of the R, G, and B components, whose ranges are 0 to 255, and the brightness value is calculated as

\[
I = R + G + B.
\tag{6}
\]

As the color information is very sensitive to the brightness of the pixel, the RGB components are normalized as

\[
r = \frac{R}{I}, \qquad g = \frac{G}{I}, \qquad b = \frac{B}{I},
\tag{7}
\]

where the sum of \(r\), \(g\), and \(b\) is 1. Thus, the normalized color values can be expressed with only \(r\) and \(g\).

In addition to the normalized RGB values, we use the HSV (hue, saturation, and value) components as additional parameters for recognizing the direction of the human face [12], because the HSV model is closer to the human perception of color. First, the hue (\(H\)) is a measure of the spectral composition and is represented as an angle varying from 0 to 360 degrees. Second, the saturation (\(S\)) is the purity of a color and varies from 0 to 1. Finally, the value (\(V\)) is defined as the darkness of a color and also ranges from 0 to 1. The HSV values can be simply calculated from the RGB values using the following equations:

\[
H_1 = \cos^{-1} \left( \frac{0.5\,[(R - G) + (R - B)]}{\sqrt{(R - G)^2 + (R - B)(G - B)}} \right), \qquad
H =
\begin{cases}
H_1, & B \le G, \\
360 - H_1, & B > G,
\end{cases}
\]
\[
S = \frac{\max(R, G, B) - \min(R, G, B)}{\max(R, G, B)}, \qquad
V = \frac{\max(R, G, B)}{255}.
\tag{8}
\]

We used the following threshold values of the normalized RGB and HSV components for the decision of human skin-like pixels [12]:

\[
0.36 \le r \le 0.465, \quad 0.28 \le g \le 0.363, \quad 0 \le H \le 50, \quad 0.20 \le S \le 0.68, \quad 0.35 \le V \le 1.0.
\tag{9}
\]

A pixel of the image obtained by a sensor is judged to be a human skin-like pixel only if its normalized RGB and HSV values satisfy all the thresholds of (9).
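A minimal NumPy sketch of the per-pixel test in (6)–(9) follows; the function name, the epsilon guards against division by zero, and the clipping of the arccos argument are our own additions for numerical safety, not part of the paper's method.

```python
import numpy as np

def skin_pixel_mask(img):
    """Return a boolean mask of skin-like pixels for an 8-bit RGB image
    of shape (H, W, 3), using the normalized-RGB and HSV thresholds of (9)."""
    rgb = img.astype(np.float64)
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    I = R + G + B + 1e-9                 # brightness, equation (6)
    r, g = R / I, G / I                  # normalized RGB, equation (7)

    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    # Hue via the arccos formula of (8), converted to degrees.
    num = 0.5 * ((R - G) + (R - B))
    den = np.sqrt((R - G) ** 2 + (R - B) * (G - B)) + 1e-9
    h1 = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    H = np.where(B <= G, h1, 360.0 - h1)
    S = (mx - mn) / (mx + 1e-9)
    V = mx / 255.0

    # All thresholds of (9) must hold simultaneously.
    return ((0.36 <= r) & (r <= 0.465) &
            (0.28 <= g) & (g <= 0.363) &
            (0 <= H) & (H <= 50) &
            (0.20 <= S) & (S <= 0.68) &
            (0.35 <= V) & (V <= 1.0))
```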
Using this rule, all skin-like pixels in the image captured by each sensor are counted, and the location of the image sensor with the highest number of skin-like pixels is determined as the direction of the human face. Figure 5 shows the whole procedure of the human localization using the image sensor network.

Figure 5: Procedure of the human localization using the image sensor network (obtained images by sensors → RGB reading → normalized RGB and HSV calculation → skin-like pixel counting → decision of human localization → location of the decided image sensor (angle)).

The proposed automatic sound scene control system using the image sensor network is shown in Figure 6. Compared to Figure 1, the input angle of the sound scene control module is simply replaced by the direction of the human face estimated from the human movement. Therefore, the sound scene control module explained in Section 2 can be used directly in the proposed system without any change in its operation. The proposed system has a maximum sound scene control error of 15 degrees, since the image sensor network has a 30-degree resolution for estimating the direction of the human face.

Figure 6: The proposed automatic sound scene control system using the image sensor network (the image sensor network captures images of the moving listener in the multichannel playback system, the human localization block estimates the direction of the human face, and this angle drives the sound scene control module between the MPEG Surround encoder and decoder, yielding the reconstructed multichannel output with the controlled sound scene).
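Putting the localization logic together, the decision step reduces to an argmax over the per-sensor skin-pixel counts. The sketch below assumes that sensor k sits at k × 30 degrees on the ring (the indexing convention is ours) and takes the masking function, for example the skin_pixel_mask sketch above, as a parameter.

```python
import numpy as np

def estimate_face_direction(images, mask_fn):
    """images: list of twelve (H, W, 3) uint8 arrays, one per sensor,
    where sensor k is assumed to sit at k * 30 degrees on the ring.
    mask_fn: function returning a boolean skin mask for one image.
    Returns the estimated direction of the human face in degrees,
    which is fed directly to the SSC module as the panning angle."""
    counts = [int(mask_fn(img).sum()) for img in images]
    best = int(np.argmax(counts))    # sensor seeing the most skin-like pixels
    return best * 30                 # 30-degree resolution of the sensor ring
```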
4. Experimental Results

To validate the performance of the proposed automatic sound scene control system using the image sensor network, we performed a subjective listening test focused on the sensing ability of the image sensor network and on the controllability of the SSC based on the result of the image sensor network. For the listening test, the five test items offered by the MPEG audio subgroup, listed in Table 1, were used [17]. The items were sampled at 44.1 kHz with 16-bit resolution and were all shorter than 20 seconds. Eight listeners participated in the listening test.

Table 1: Test items.
Index | Material       | Description
A     | Applause       | Ambience
B     | Chostakovitch  | Music (back: direct)
C     | Fountain music | Pathological
D     | Glock          | Pathological
E     | Rock concert   | Music (back: ambience)

To check the sensing ability of the image sensor network, the result estimated by the image sensor network was compared to the listeners' position, that is, the direction of their face, as they moved. For clarity of the test, only 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, and 330 degrees were allowed as listeners' positions. All listeners changed their position to each of the given angles three times, so the total number of trials was 264. Table 2 and Figure 7 show the recognition result of the image sensor network. The recognition rate of the image sensor network is about 98.1%, and only five trials were recognized as a wrong position. The main reason for the false recognitions was a wrong head direction of the listeners.

Table 2: Recognition rate of the image sensor network.
Position (degree) | True recognition | False recognition (recognized angle) | Recognition rate (%)
30    | 24  | 0            | 100
60    | 24  | 0            | 100
90    | 23  | 1 (60)       | 95.8
120   | 24  | 0            | 100
150   | 22  | 2 (120, 120) | 91.7
180   | 24  | 0            | 100
210   | 23  | 1 (240)      | 95.8
240   | 24  | 0            | 100
270   | 24  | 0            | 100
300   | 23  | 1 (270)      | 95.8
330   | 24  | 0            | 100
Total | 259 | 5            | 98.1

Figure 7: Recognition rate of the image sensor network (bar chart of the recognition rate in percent for each position from 30 to 330 degrees and the total).

To check the controllability of the SSC, we used two kinds of audio sound: the original and the controlled one. The original sound scene was given as the reference signal, and the listeners decided whether the sound scene controlled according to their position was equal to the original sound scene or not as they moved. For simplification of the test, 60, 120, 180, 240, and 300 degrees were used as listeners' positions. All listeners changed their position to each of the given angles once per test item, so the total number of trials was 200. If the estimated listeners' position was wrong, the trial was discarded and the listener tried again at the same position. Table 3 and Figure 8 show the result of checking the controllability of the SSC. The ratio at which the sound scene controlled by the SSC was perceived as the original sound scene was 95%. Because the SSC has the ICC modification problem described above, the audio sound newly generated by the SSC showed a different sound scene in some trials.

Table 3: Controllability result of the proposed system using the image sensor network.
Position (degree) | Same sound scene | Different sound scene | Accuracy (%)
60    | 38  | 2  | 95.0
120   | 36  | 4  | 90.0
180   | 40  | 0  | 100
240   | 38  | 2  | 95.0
300   | 35  | 5  | 87.5
Total | 228 | 12 | 95.0

Figure 8: Controllability result of the proposed system using the image sensor network (bar chart of the accuracy in percent for positions 60 to 300 degrees and the total).

5. Conclusion

In this paper, we proposed an automatic sound scene control system using an image sensor network to preserve a constant sound scene regardless of the users' movement. In the proposed system, the image sensor network detects the human location in the multichannel playback environment, and the SSC module automatically controls the sound scene of the multichannel audio signals according to the estimated human location, given as angle information. To estimate the direction of the human face, we used the normalized RGB and HSV values calculated from the images obtained by the image sensor network. The direction of the human face is decided as the location of the image sensor capturing the image with the highest number of pixels satisfying the normalized RGB and HSV thresholds. The estimated direction of the human face is directly fed to the SSC module, and the controlled sound scene is simply generated. Experimental results show that the image sensor network can successfully detect the human location with an accuracy of about 98%. Moreover, the sound scene controlled by the SSC according to the detected human location was perceived as the original sound scene with an accuracy of 95%. To enhance the performance of the image sensor network and the SSC, more precise human localization using eye detection in the images remains as future work.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This study was funded by the research fund of Korea Nazarene University in 2014 (Kwangki Kim) and supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (Grant no. 2012R1A1A4A01004195).

References

[1] C. Faller and F. Baumgarte, "Binaural cue coding—part II: schemes and applications," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 520–531, 2003.
[2] F. Baumgarte and C. Faller, "Binaural cue coding—part I: psychoacoustic fundamentals and design principles," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 509–519, 2003.
[3] J. Herre, H. Purnhagen, and J. Breebaart, "The reference model architecture for MPEG spatial audio coding," in Proceedings of the 118th AES Convention, Barcelona, Spain, 2005.
[4] ISO/IEC 23003-1, "Information technology—MPEG audio technologies—Part 1: MPEG Surround," 2007.
[5] H.-G. Moon, J.-I. Seo, S. Baek, and K.-M. Sung, "A multichannel audio compression method with virtual source location information for MPEG-4 SAC," IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1253–1259, 2005.
[6] S. Beack, J. Seo, H. Moon, K. Kang, and M. Hahn, "Angle-based virtual source location representation for spatial audio coding," ETRI Journal, vol. 28, no. 2, pp. 219–222, 2006.
[7] D. A. Burgess, "Techniques for low cost spatial audio," in Proceedings of the ACM Symposium on User Interface Software and Technology (UIST '92), 1992.
[8] S. H. Foster, E. M. Wenzel, and R. M. Taylor, Real-Time Synthesis of Complex Acoustic Environments, Crystal River Engineering, Groveland, Calif, USA.
[9] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, MIT Press, Cambridge, Mass, USA, 1983.
[10] K. Kim, "Sound scene control of multi-channel audio signals for realistic audio service in wired/wireless network," International Journal of Multimedia and Ubiquitous Engineering, vol. 9, no. 2, 2014.
[11] E. Zwicker and H. Fastl, Psychoacoustics, Springer, Berlin, Germany, 1999.
[12] Y. Wang and B. Yuan, "A novel approach for human face detection from color images under complex background," Pattern Recognition, vol. 34, no. 10, pp. 1983–1992, 2001.
[13] S.-H. Kim and H.-G. Kim, "Facial region detection using range color information," IEICE Transactions on Information and Systems, vol. 81, no. 9, pp. 968–975, 1998.
[14] J. Yang and A. Waibel, "Tracking human faces in real time," Tech. Rep. CMU-CS-95-210, Carnegie Mellon University, 1995.
[15] V. Pulkki, "Virtual sound source positioning using vector base amplitude panning," Journal of the Audio Engineering Society, vol. 45, no. 6, pp. 456–465, 1997.
[16] M. A. Gerzon, "Panpot laws for multispeaker stereo," in Proceedings of the 92nd Convention of the AES, 1992.
[17] ISO/IEC JTC1/SC29/WG11 (MPEG), "Procedures for the evaluation of spatial audio coding systems," Document N6691, Redmond, Wash, USA, 2004.