Figure 4.12: Examples of movie frames with computed gist descriptors. Gist descriptors are colour coded for spatial scales. (Panels show movie frames alongside their gist descriptors and the descriptors projected onto PC2 and PC3, for N = 2, 3, and 4.)

4.1 Methods

Figure 4.13: Example scenes belonging to different clusters. The unsupervised clustering method has allocated scenes to two different and semantically meaningful clusters.

had either nature scenes (mountains, forests, landscapes, etc.) or manmade scenes (tall buildings, railway stations, houses, etc.), as shown in Figure 4.13. Secondly, the employed gist descriptor method uses the spatial frequency signature to quantify scene gist, thus resulting in a much simpler final gist vector. In fact, Oliva and Torralba (2001) also defined very few scene categories (mountain, beach, forest, highway, street, and indoor) over a much more variable scene database. To assess the quality of the clusters found, we used an isolation distance metric (Schmitzer-Torbert et al., 2005). The isolation distance metric is shown for different clusters in Figures 4.14 and 4.15. The method gives a measure of separation between clusters by computing the Mahalanobis distance of the K-th closest point outside the cluster, where K is the total number of points inside the cluster. Thus, a larger value for a given cluster implies that it is more isolated from its neighbouring clusters.
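For concreteness, the isolation distance can be sketched as below. This is a minimal sketch, not code from this work: the function name is ours, and we follow the original metric's convention of reporting the squared Mahalanobis distance under the cluster's own mean and covariance.

```python
import numpy as np

def isolation_distance(points, labels, cluster):
    """Isolation distance (after Schmitzer-Torbert et al., 2005).

    Squared Mahalanobis distance, under the cluster's own mean and
    covariance, of the K-th closest point *outside* the cluster,
    where K is the number of points inside the cluster.
    """
    inside = points[labels == cluster]
    outside = points[labels != cluster]
    k = len(inside)
    if k > len(outside):
        # Metric is undefined when the cluster holds more than
        # half of all points.
        return np.inf
    mu = inside.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(inside, rowvar=False))
    diff = outside - mu
    # Squared Mahalanobis distance of every outside point to the cluster.
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.sort(d2)[k - 1]  # K-th closest outside point
```

Well-separated clusters yield large values, matching the interpretation in the text that a larger number implies a more isolated cluster.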
Figure 4.14: Example of clusters found using the reduced dimension of the gist descriptor for the training frames. (Simulation 927, 2x2 regions: Cluster1 (936 frames) and Cluster2 (208 frames), shown as pairwise scatter plots of gist dimensions 1–8; isolation distances [47.14, 16.68].)
The example is shown for simulation 927, for which we found two scene categories (labeled using blue and red colours).

Figure 4.15: Additional examples of clusters found by the algorithm. These examples show the discovery of 3 and 4 categories of scenes in the training data for different simulations. (Simulation 930, 2x2 regions: Cluster1 (155 frames), Cluster2 (495 frames), Cluster3 (494 frames), shown as pairwise scatter plots of gist dimensions 1–8; isolation distances [14.91, 13.93, 26.60].)

4.1.3.4 Categorical Fixation Maps

Following the clustering of the 1150 scenes (each represented by the first frame of the corresponding scene) into scene categories, we computed categorical fixation maps. In total, three types of categorical fixation map were computed: a map with the centre bias intact, an average map, and a map without the centre bias. The map with the centre bias intact was built using early fixations in the training data.
For each scene category, we aggregated the early fixations from all subjects and over all the scenes in that particular category onto a blank 2D map. We then convolved the map with a 2D Gaussian kernel. The standard deviation (σ) of the kernel was set to cover the high-acuity foveal zone in visual angle (≈ 40 x 40 pixels). An average fixation map was computed in a similar fashion, except that early fixations were aggregated across scene categories, thus yielding one average map per simulation and per choice of N. A third type of fixation map, with no centre bias, was computed to illustrate differences in the fixation pattern for each scene category. The centre bias was removed by subtracting the average fixation map from each scene category's centre-bias-intact map (O'Connell and Walther, 2012). Figure 4.16 shows examples of all three types of categorical fixation maps for two simulations and different choices of N. As is evident, the maps with no centre bias exhibit distinguishing fixation patterns for the different scene categories. This encourages further investigation of the idea that saliency maps, modulated with scene-category-appropriate fixation maps, would yield a better prediction of human visual attention. It is important to mention that a few frames in each simulation were assigned to the noise cluster by the clustering algorithm (Harris et al., 2000). Thus, categorical fixation maps were computed from a total number of frames smaller than 1150, as is also evident from the numbers listed over the average fixation maps in Figure 4.16.

4.1.3.5 Scene Classification in Test Data

All scenes (1150 in total) of the test data were classified into one of the scene categories obtained from the training data. The classification process was carried out as follows. For each learned scene category we computed a cluster centroid in gist descriptor space. These centroids represented the centre of gravity of each scene category.
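The three categorical maps described in Section 4.1.3.4 (accumulate fixations, smooth with a Gaussian, subtract the average map) can be sketched as follows. This is a minimal sketch under our own assumptions: the function names are ours, fixations are taken as (x, y) pixel coordinates, and σ is set to 40 pixels as suggested by the foveal-zone figure in the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations, shape, sigma=40):
    """Accumulate (x, y) fixations on a blank 2D map and convolve
    with a 2D Gaussian kernel (sigma ~ 40 px, per the text)."""
    fmap = np.zeros(shape)
    for x, y in fixations:
        fmap[int(y), int(x)] += 1
    return gaussian_filter(fmap, sigma)

def categorical_maps(fixations_by_category, shape):
    """Return the three map types per scene category:
    centre-bias-intact maps, the average map, and no-centre-bias maps."""
    with_bias = {c: fixation_map(f, shape)
                 for c, f in fixations_by_category.items()}
    # Average across categories yields the category-independent map.
    average = np.mean(list(with_bias.values()), axis=0)
    # Subtracting the average removes the shared centre bias.
    no_bias = {c: m - average for c, m in with_bias.items()}
    return with_bias, average, no_bias
```

In this sketch the no-centre-bias map retains only the category-specific deviation from the shared fixation pattern, which is what makes the per-category patterns in Figure 4.16 visible.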
Subsequently, to classify a scene in the test data, we first computed the Euclidean distance from the scene's gist descriptor (computed from its first frame) to each centroid. This was followed by labeling the test scene with the scene category having the shortest distance. Figure 4.17 shows classification results for four simulations.

4.1.3.6 Control Conditions for Gist Modulation

We had two control conditions for the gist-dependent modulation of saliency maps. In the first condition, we modulated the saliency maps of a test scene with the fixation map from a different scene category. As an example, in the two-scene-category case, a test scene classified to scene category 1 would be modulated by the fixation map from scene category 2. We termed this the gist-scrambled condition. In the second condition, we modulated the saliency maps of a test scene using the average fixation map and termed this the average condition. These two control conditions enabled us to verify that improvements in the model's predictions after the integration of scene gist were not simply due to the inclusion of the centre bias.
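The nearest-centroid classification of Section 4.1.3.5 and the choice of modulating map under the control conditions above can be sketched as follows. This is a hedged sketch, not the thesis code: the function names are ours, gist descriptors are assumed to be rows of a NumPy array, and which "other" category supplies the scrambled map is an arbitrary choice here.

```python
import numpy as np

def learn_centroids(gists, labels):
    """Centre of gravity of each scene category in gist-descriptor space."""
    return {c: gists[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(gist, centroids):
    """Label a test scene with the category whose centroid is closest
    in Euclidean distance."""
    return min(centroids, key=lambda c: np.linalg.norm(gist - centroids[c]))

def pick_fixation_map(category, fixation_maps, average_map, condition="gist"):
    """Choose the fixation map used to modulate the saliency maps.

    'gist'           -- map of the scene's own category
    'gist_scrambled' -- map of a *different* category (first control)
    'average'        -- category-independent average map (second control)
    """
    if condition == "gist":
        return fixation_maps[category]
    if condition == "gist_scrambled":
        other = [c for c in fixation_maps if c != category]
        return fixation_maps[other[0]]
    return average_map
```

The exact modulation operation applied between the saliency map and the chosen fixation map is not specified in this passage, so it is left outside the sketch.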
Figure 4.16: Examples of categorical fixation maps from two simulations. A unique human fixation pattern emerges for each scene category, as seen in the maps with no centre bias. This confirms previous findings that fixation patterns are indicative of different scene categories (O'Connell and Walther, 2012). (For simulations 361 and 930 and region grids of 2x2, 3x3, and 4x4, panels show the per-category maps with the centre bias, without the centre bias, and the average map; the number of contributing frames is listed over each map.)

Figure 4.17: A histogram of test scene classifications for four different simulations. A given test scene was classified into one of the learned scene categories using the minimum Euclidean distance method. (Panels show frame counts per class label for simulations 361, 765, 927, and 930 at region grids of 2x2, 3x3, and 4x4.)