Chapter 4: Computational Modelling

4.1 Methods

We used a Spatio-Temporal Coherency model, which is an extension of the user attention model previously described by Ma et al. (2002). Our model used motion intensity, spatial coherency, temporal coherency, face detection and gist modulation to compute saliency maps (see Figure 4.1). The critical part of this model is motion saliency, defined as the attention due to motion. Abrams and Christ (2003) have shown that the onset of motion captures visual attention. Motion saliency for a frame is computed over multiple preceding frames. According to our investigations, overt orienting of the fovea to any motion-induced salient location in a given frame, n, is influenced by the saliency at that location in the preceding frames (n-1, n-2, ...). This is known as saccade latency in the literature, with latencies typically in the order of 200-250 milliseconds (Becker and Jürgens, 1979; Hayhoe et al., 2003). We investigated the influence of up to ten preceding frames (approximately 500 milliseconds); however, beyond the fifth frame (i.e., n-5) we did not see any significant contributions to overt orienting. This means that the currently fixated location in frame n was indeed selected based on the saliency at that location in up to five preceding frames. The on-screen time for five frames was about 210 milliseconds (the video frame rate in our experiments was 24 frames per second), which is well within the bounds of the reported saccade latencies.

Figure 4.1: Spatio-temporal saliency model architecture. The bottom-up (B/U) pathway computes motion intensity (I), spatial coherency (Cs) and temporal coherency (Ct) maps via entropy filtering, applies centre-surround suppression over a multi-scale Gaussian pyramid, and combines the features into a saliency map; a face channel (Viola and Jones, 2004) and a top-down (T/D) gist channel, which classifies each frame into a scene category using multi-scale oriented Gabor filtering and PCA, modulate the resulting map.

4.1.1 Spatio-Temporal Saliency

We computed motion vectors using the Adaptive Rood Pattern Search (ARPS) algorithm for fast block-matching motion estimation (Nie and Ma, 2002). The motion vectors are computed by dividing the frame into a matrix of blocks. The fast ARPS algorithm leverages the fact that general motion is usually coherent: if the blocks surrounding a given block moved in a particular direction, there is a high probability that the current block will have a similar motion vector. Thus, the algorithm estimates the motion vector of a given block using the motion vector of the macro block to its immediate left.

The Motion Intensity map, I (Figure 4.2), which is a measure of motion-induced activity, is computed from the motion vectors (dx, dy) normalized by the maximum magnitude in the motion vector field:

I(i, j) = \frac{\sqrt{dx_{i,j}^2 + dy_{i,j}^2}}{\max_{k,l} \sqrt{dx_{k,l}^2 + dy_{k,l}^2}} \quad (4.1)
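For concreteness, a minimal NumPy sketch of Equation 4.1 is given below. It assumes the block-level motion vectors dx and dy have already been produced by a block matcher such as ARPS; the function name, the 16-pixel block size and the small epsilon guard are illustrative choices, not values taken from the thesis.

```python
import numpy as np

def motion_intensity_map(dx, dy, block_size=16, eps=1e-8):
    """Equation 4.1: per-block motion magnitude normalised by the maximum
    magnitude in the motion vector field, expanded back to pixel resolution.
    dx, dy are 2-D arrays holding one motion vector per block."""
    magnitude = np.sqrt(dx.astype(float) ** 2 + dy.astype(float) ** 2)
    intensity = magnitude / (magnitude.max() + eps)   # normalise to [0, 1]
    # Replicate each block value over a block_size x block_size pixel patch.
    return np.kron(intensity, np.ones((block_size, block_size)))
```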
Figure 4.2: Computation of motion intensity on adjacent frames. Three examples from different movies are shown.

Spatial coherency (Cs) and temporal coherency (Ct) maps were obtained by convolving the frames with an entropy filter. The Cs maps captured regularity at a spatial scale of 9x9 pixels within a frame, while the Ct maps captured regularity at the same spatial scale (i.e., 9x9 pixels) but over a temporal scale of five frames. Spatial coherency (see Figure 4.3) measured the consistency of pixels around the point of interest, otherwise known as the correlation: the higher the correlation, the more probable it is that the pixels belong to the same object. This is computed as the entropy over the block of pixels. Higher entropy indicated more randomness in the block structure, implying lower correlation among pixels and hence lower spatial coherency. Figure 4.3 shows five examples of the spatial coherency computation using the following equation:

Cs(x, y) = -\sum_{i=1}^{n} P_s(i) \log P_s(i) \quad (4.2)

where P_s(i) is the probability of occurrence of the pixel intensity i and n corresponds to the 9x9 neighbourhood. Similarly, to compute consistency in pixel correlation over time, we used

Ct(x, y) = -\sum_{i=1}^{n} P_t(i) \log P_t(i) \quad (4.3)

where P_t(i) is the probability of occurrence of the pixel intensity i at the corresponding location in the preceding five frames (m = 5). Higher entropy implies greater motion and thus higher saliency at that location. The temporal coherency map (see Figure 4.4), in general, signifies the motion energy in each fixated frame contributed by the five preceding frames (with the exception of boundary frames, where motion vectors are invalid due to scene or camera transitions).

Figure 4.3: Examples of the spatial coherency map computed on five different movie frames.

Figure 4.4: Examples of the temporal coherency map computed over the previous five frames, shown for three different movie examples.
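As a rough illustration of the entropy filtering behind Equations 4.2 and 4.3, the sketch below computes local Shannon entropy with SciPy. The 256-bin histogram assumes 8-bit frames, and treating the temporal window as a 9x9 neighbourhood spanning the current frame and its five predecessors is our reading of the description above; the thesis only states the 9x9 spatial scale and the five-frame depth.

```python
import numpy as np
from scipy.ndimage import generic_filter

def _entropy(values, bins=256):
    """Shannon entropy of the intensity values in one neighbourhood."""
    hist, _ = np.histogram(values, bins=bins, range=(0, 256))
    p = hist[hist > 0] / values.size
    return -np.sum(p * np.log(p))

def spatial_coherency(frame):
    """Cs (Eq. 4.2): entropy over a 9x9 neighbourhood around each pixel."""
    return generic_filter(frame.astype(float), _entropy, size=9, mode='reflect')

def temporal_coherency(frames):
    """Ct (Eq. 4.3): entropy over a 9x9 neighbourhood spanning the stacked
    frames. `frames` is a (6, H, W) array ordered oldest to newest (five
    preceding frames plus the current one); the slice for the newest frame
    is returned."""
    volume = np.asarray(frames, dtype=float)
    ct = generic_filter(volume, _entropy,
                        size=(volume.shape[0], 9, 9), mode='reflect')
    return ct[-1]
```

This formulation favours clarity over speed; a sliding-histogram implementation would be needed to process full-length movies at a reasonable rate.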
Once all three feature maps are computed, we apply centre-surround suppression to these maps to highlight regions having higher spatial contrast. This is akin to simulating the behaviour of ganglion cells in the retina (Hubel and Wiesel, 1962). To achieve this, we first compute a dyadic Gaussian pyramid (Burt and Adelson, 1983) for each map by repeatedly low-pass filtering and subsampling the map (see Figure 4.5). For low-pass filtering, we used a separable Gaussian kernel (Walther and Koch, 2006) defined as K = [1 5 10 10 5 1]/32 (see Walther, 2006, Appendix A.1 for more details). We start with level 1 (L1), which is the actual size of the map. The image for each successive level is obtained by first low-pass filtering the image; this step results in a blurry image with suppressed higher spatial frequencies. The resulting image is then subsampled to half of its current size to obtain the level 2 (L2) image. The process continues until the map cannot be further subsampled (L9 in Figure 4.5).

Figure 4.5: Example of a temporal coherency map at nine different levels of the Gaussian pyramid. Starting at level 1 (L1 in the figure), which has the same size as the original map, each successive level is obtained by low-pass filtering and subsequently subsampling the map to half of its size at the current level.

To simulate the behaviour of centre-surround receptive fields, we take the difference between different levels of the pyramid for a given feature map, as previously described in Itti et al. (1998). We select different levels of the pyramid to represent the centre, C ∈ {2, 3, 4}, and the surround, S = C + δ, where δ ∈ {3, 4}. This results in six intermediate maps, as shown in Figure 4.6. To take point-wise differences across scales, the images are interpolated to a common size. All six centre-surround maps are then added across scales to obtain a single map per feature, as shown in Figure 4.7. All three feature maps are then combined to produce a standard saliency map:

\text{SaliencyMap} = I \times Ct \times \big((1 - I) \times Cs\big) \quad (4.4)

Since higher entropy in the temporal coherency map indicates greater motion over a particular region, intensity maps are directly multiplied with temporal coherency maps. This highlights the contribution of the motion-salient regions in the saliency maps. On the contrary, higher entropy in the spatial coherency map indicates randomness in the block structure, suggesting that the region does not belong to any single entity or object. Since we are interested in motion saliency induced by spatially coherent objects, we assign a higher value to the pixels belonging to the same object. In Figure 4.8 we show the standard saliency map computed for a randomly chosen frame from each of the movies in our database.

Figure 4.6: Taking point-wise differences across scales (2-5, 2-6, 3-6, 3-7, 4-7, 4-8) results in six intermediate maps for a given feature map.

Figure 4.7: Final feature maps obtained after adding the across-scale centre-surround differences. The top panel shows the feature maps before centre-surround suppression is applied; the bottom row shows the final feature maps after centre-surround suppression via across-scale point-wise differences followed by summation. The example, shown for one movie frame, clearly demonstrates the effectiveness of the centre-surround suppression in producing sparser feature maps.

Figure 4.8: Saliency maps shown for a randomly selected frame from every movie in the database (Animals, Cats, Matrix, BigLebowski, Everest, Hitler, ForbiddenCityCop, Galapagos, IRobot, KungFuHustle, WongFeiHong). Movie frames are shown alongside the saliency map for the corresponding frame; higher saliency values are indicated by warmer colours, as illustrated by the colour map on the right.
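The sketch below shows one way the centre-surround suppression and the combination in Equation 4.4 could be wired together. OpenCV's pyrDown stands in for the dyadic Gaussian pyramid built with the kernel K above, the levels are 1-indexed as in Figure 4.5, and the per-map re-normalisation before combining is our assumption rather than a detail stated in the text.

```python
import cv2
import numpy as np

def gaussian_pyramid(feature_map, levels=9):
    """Dyadic pyramid: repeatedly low-pass filter and halve the map
    (cv2.pyrDown approximates the separable kernel K described above)."""
    pyramid = [feature_map.astype(np.float32)]
    while len(pyramid) < levels and min(pyramid[-1].shape[:2]) >= 2:
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid

def centre_surround(feature_map, centres=(2, 3, 4), deltas=(3, 4)):
    """Across-scale point-wise differences (2-5, 2-6, 3-6, 3-7, 4-7, 4-8),
    interpolated to a common size and summed into one map per feature."""
    pyr = gaussian_pyramid(feature_map)
    h, w = pyr[0].shape[:2]
    out = np.zeros((h, w), np.float32)
    for c in centres:
        for d in deltas:
            centre = cv2.resize(pyr[c - 1], (w, h))                      # levels are 1-indexed
            surround = cv2.resize(pyr[min(c + d, len(pyr)) - 1], (w, h))
            out += np.abs(centre - surround)
    return out

def combine(I, Cs, Ct):
    """Equation 4.4: SaliencyMap = I x Ct x ((1 - I) x Cs), applied after
    centre-surround suppression of each feature map."""
    I, Cs, Ct = (centre_surround(m) for m in (I, Cs, Ct))
    I, Cs, Ct = (m / (m.max() + 1e-8) for m in (I, Cs, Ct))  # re-normalise to [0, 1]
    return I * Ct * ((1 - I) * Cs)
```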
4.1.2 Face Modulation

We modulate the standard saliency map with high-level semantic knowledge, such as faces, using a state-of-the-art face detector (Viola and Jones, 2004). This accounts for the fact that overt attention is frequently deployed to faces (Cerf et al., 2009), and it can be argued that faces are a part of the bottom-up information, as there are cortical areas specialized for faces, in particular the fusiform gyrus (Kanwisher et al., 1997).

The Viola and Jones (2004) face detector is based on training a cascaded classifier, using a learning technique called AdaBoost, on a set of very simple visual features. These visual features have Haar-like properties, as they are computed by subtracting the sum over one sub-region from the sum over the remaining region. Figure 4.9 shows examples of Haar-like rectangle features: panels A and B show two-rectangle features (horizontal/vertical), while panels C and D show three-rectangle and four-rectangle features, respectively. The value of a feature is computed by subtracting the sum of the pixel values in the white region from the sum of the pixel values in the grey region. These Haar-like features are simple and very efficient to compute using the integral image representation, which allows the computation of any rectangle sum in constant time. The Haar-like features are extracted over a 24 x 24 pixel sub-window, resulting in thousands of features per image.

Figure 4.9: Example of the four basic rectangular features, as shown in the Viola and Jones (2004) IJCV paper. Panels A and B show two-rectangle features, while panels C and D show three-rectangle and four-rectangle features. Panel E shows an example of two features overlaid on a face image: the first is a two-rectangle feature measuring the difference between the eye and upper-cheek regions, while the second, a three-rectangle feature, measures the difference between the eye region and the upper nose region.

The goal is to construct a strong classifier by selecting a small number of discriminant features from the limited set of labelled training images. This is achieved by employing the AdaBoost technique to learn a cascade of weak classifiers. Each weak classifier in the cascade is trained on a single feature; the term "weak" signifies that no single classifier in the cascade can classify all the examples accurately. For each round of boosting, the AdaBoost method selects the weak classifier with the lowest error rate, controlled by the desired hit and miss rates. This is followed by re-assignment of the weights to emphasize the examples with poor classification in the next round. AdaBoost is regarded as a greedy algorithm, since it associates a large weight with every good feature and a small weight with poor features. The final strong classifier is then a weighted combination of the weak classifiers.

Panel E of Figure 4.9 demonstrates how the selected features reflect useful properties of the image. The example shows a two-feature classifier (the top row shows the two selected features) trained over 507 faces. The first feature measures the difference in luminance between the regions of the eyes and the upper cheeks; the second measures the difference in luminance between the eye region and the nose bridge. An intuitive rationale behind the selection of these features is that the eye region is generally darker than the skin region.
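To make the constant-time rectangle sums concrete, here is a small NumPy sketch of the integral image and a generic horizontal two-rectangle feature; the helper names and the white-left/grey-right layout are illustrative, not the exact features selected by the trained cascade.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended, so the sum
    over any rectangle needs only four array look-ups."""
    ii = np.cumsum(np.cumsum(img.astype(np.int64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left corner (x, y),
    width w and height h, in constant time."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rectangle_feature(ii, x, y, w, h):
    """Horizontal two-rectangle Haar-like feature: the grey (right) half
    minus the white (left) half, following the sign convention above."""
    half = w // 2
    return rect_sum(ii, x + half, y, half, h) - rect_sum(ii, x, y, half, h)
```

Evaluating such features at every position and scale of a 24 x 24 sub-window is what yields the thousands of candidate features from which AdaBoost selects.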
Previous findings on static images suggest that people look at face components (eyes, mouth, and nose) preferentially, with the eyes being given more preference than the other components (Buswell, 1935; Yarbus, 1967; Langton et al., 2000; Birmingham and Kingstone, 2009). However, a recent study on gaze allocation in dynamic scenes (Võ et al., 2012) suggests that the eyes are not fixated preferentially. Võ et al. (2012) showed that the percentage of overall gaze distribution is not significantly different across face components for vocal scenes. For mute scenes, however, they did find a significant drop in the gaze distribution for the mouth compared to the eyes and nose. In fact, the nose was given priority over the eyes regardless of whether the person in the video made eye contact with the camera, although these differences were found to be insignificant.

To detect faces in our video database, we used trained classifiers from the OpenCV library (Bradski and Pisarevsky, 2000). The classifier detects a face region and returns a bounding box encompassing the complete face. This is followed by convolving the face region with a Gaussian whose width equals the width of the box and whose peak value lies at the centre of the box. This automatically assigns the highest feature value to the nose compared to the other face components. Figure 4.10 shows the process of face modulation for an example frame from the movie "The Matrix" (1999); note that the bottom-right panel highlights the salient regions in the movie frame by overlaying the face-modulated saliency map on the frame.

Figure 4.10: Example of saliency map modulation with a detected face region of interest (ROI). The top-left panel shows the original movie frame with the face ROI bounding box. Subsequent panels show how the face modulation is applied to the spatio-temporal saliency map. The bottom-right panel overlays the face-modulated saliency map on the movie frame, signifying the hot spots in the frame.
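A rough sketch of this modulation step is shown below, using OpenCV's pretrained frontal-face Haar cascade as the detector. The Gaussian parameters (sigma tied to the box width) and the use of a point-wise maximum in place of the convolution described above are our simplifications, not details taken from the thesis.

```python
import cv2
import numpy as np

def face_modulated_saliency(frame_bgr, saliency):
    """Boost the saliency map at detected faces with a Gaussian whose width
    matches the face bounding box and whose peak sits at the box centre."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    out = saliency.astype(np.float32).copy()
    ys, xs = np.mgrid[0:out.shape[0], 0:out.shape[1]]
    for (x, y, w, h) in faces:
        cx, cy = x + w / 2.0, y + h / 2.0
        sigma = w / 2.0                       # assumed: Gaussian width ~ box width
        bump = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        out = np.maximum(out, bump)           # peak value at the box centre
    return out / (out.max() + 1e-8)
```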
4.1.3 Gist Modulation

We investigated an improvement to the bottom-up spatio-temporal saliency model by incorporating top-down semantics of the scene. Our hypothesis is that variability in eye movement patterns across different scene categories (O'Connell and Walther, 2012) can help improve saliency prediction for the early fixations. Earlier experiments have shown the influence of scene context in guiding visual attention (Neider and Zelinsky, 2006; Chen et al., 2006). In Neider and Zelinsky (2006), scene-constrained targets were found faster, with a higher percentage of initial saccades directed to target-consistent scene regions. Moreover, they found that contextual guidance biases eye movements towards target-consistent regions (Navalpakkam and Itti, 2005) rather than excluding target-inconsistent scene regions (Desimone and Duncan, 1995). Chen et al. (2006) showed that in the presence of both top-down (scene preview) and bottom-up (colour singleton) cues, top-down information prevails in guiding eye movements: they observed faster manual reaction times, and more initial saccades were made to the target location, whereas the colour singleton attracted attention only in the absence of a scene preview.

Currently, there are different ways to compute the gist descriptor of an image (Oliva and Torralba, 2001; Renninger and Malik, 2004; Siagian and Itti, 2007; Torralba et al., 2003). In the framework proposed by Oliva and Torralba (2001) for the purpose of scene classification, an input image is subdivided into 4x4 equal-sized, non-overlapping segments. A magnitude spectrum of the windowed Fourier transform is then computed over each of these segments, followed by feature dimension reduction using Principal Component Analysis. Siagian and Itti (2007) computed the gist descriptor from the hierarchical model of Itti et al. (1998): a 4x4 non-overlapping grid is placed over 34 sub-channels from the colour, orientation and intensity channels, and an average value over each grid box is computed, yielding a 16-dimensional vector per sub-channel. The resulting 544 raw gist values are then reduced using PCA/ICA to an 80-dimensional gist descriptor, or feature vector. Subsequently, the scene classification is done using neural networks. In experiments conducted by Renninger and Malik (2004), subjects were asked to identify the scenes that they were exposed to very briefly (