Chapter 5 Conclusion and Future Work

5.1 Thesis Conclusion

The main aim of this research, Visual Attention in Dynamic Natural Scenes, was to discover the mechanisms that govern the deployment of attention in response to natural visual stimuli. Humans effortlessly perceive and respond to the natural environment using the principles of selective attention and divided attention. An attention deployment mechanism is therefore critical to the successful design of an artificial system capable of performing at a human level of efficiency. However, since attention can be deployed through multiple senses at a time, we constrained our study to the visual modality. We were interested in how the visual system deploys attention. What properties of the complex natural environment (higher- or lower-order image statistics) attract attention, and what properties, even though perceived, are ignored by the brain? Is vision-based attention primarily visuo-sensory driven, prior world knowledge driven, or both? What prompts active scanning of the scene (i.e., the need to move gaze from the currently fixated location to a new location)? How does the visual system react to a global scene change? To answer these questions we studied the temporal (fixation duration) and spatial (fixation location) properties of visual attention.

In investigating temporal properties, we looked at how fixation durations varied in response to movie scene transitions. We focused our analysis on the fixation ending before the scene transition (last fixation), the fixation ongoing at the time of the scene transition (cross-over fixation), and the first fixation after the scene transition. In general, we found that the first fixation was shorter in duration than the last fixation, and that the cross-over fixation was longer in duration than both the last and first fixations. We further profiled all the first fixations and looked at how their duration changed over the first 1600 msec.
We found that first fixation duration varied with latency from the scene transition onset. We therefore separated first fixations into two sub-populations: early-starting fixations and late-starting fixations. The largest difference between early-starting and late-starting fixations occurred at approximately 180 to 220 milliseconds relative to the onset of the transition. Subsequently we formed early and late sets for cross-over and last fixations, corresponding to the early and late first fixations. As expected, we did not observe any difference in duration between the early and late sets of last fixations, because subjects could not anticipate the incoming scene transition. However, we did find differences in duration between the early and late sets of cross-over fixations. This gave rise to the saccade programming hypothesis. If the onset of the new scene occurred during the saccade programming phase of the cross-over fixation, it was too late to influence the subsequent saccade, so the first fixation landed at an unintended location in the new scene. If the scene onset occurred before the saccade programming phase, the visual system had time to analyse the new scene and then initiate the saccade to the intended first fixation location in the new scene. In conclusion, these results favoured a process monitoring mechanism (Henderson and Smith, 2009). Any global visual change appeared to affect not only the length of the ongoing cross-over fixation (immediate control), but also shortened the duration of the fixation immediately following that change (delayed control).

In another analysis we investigated differences in fixation duration in the context of visual correlates. Our analysis showed that for early-ending cross-over fixations there was a significantly large change in local luminance and contrast at the cross-over fixation location between before and after the scene transition.
Although the changes in local luminance and contrast were significant for the late set as well, they were still small compared to those of the early set. These results suggest that, despite the large changes in a scene transition, the visual system analyses the local information in the new scene and, if the contrast at the new location is sufficiently high, appears to continue the fixation rather than pick a new location. These results show that fixations were under the influence of moment-to-moment visual and cognitive analysis.

In a separate analysis we attempted to quantify the centre bias by comparing fixation durations with fixation distance to the frame centre. This strategy potentially removed systematic biases in the quantification process, such as inconsistent ranking among subjects and the choice of saliency model (Tseng et al., 2009). Moreover, the inferred results were based purely on behavioural data. Our analysis showed a significant negative correlation between fixation duration and fixation distance to the centre of the frame, and the strongest correlation was observed at the time of scene transition. In summary, our results support the hypothesis that centre bias is largely due to the photographer's bias of placing interesting things near the centre.

In investigating spatial properties of visual attention, we proposed a computational model that combined bottom-up stimulus features with top-down contextual/gist modulation for natural and dynamic scenes. The scene-specific gist modulation was learned from unlabelled training data of 1150 movie scenes. The learning process was completely unsupervised, in contrast to previous approaches that relied on labelled training data (Fei-Fei and Perona, 2005; Oliva et al., 2006). Moreover, the modulation was demonstrated for a free viewing task in dynamic scenes, as compared to a search task in static images (Torralba et al., 2006).
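As an illustration of the kind of modulation described above, the following minimal sketch combines a bottom-up saliency map with a scene-category fixation map. The combination rule is not restated in this chapter, so the pointwise product followed by renormalisation is an assumption made purely for illustration, as are the function names `normalise` and `modulate` and the toy 2x2 maps.

```python
def normalise(grid):
    """Scale a 2-D grid of non-negative values so that it sums to 1."""
    total = sum(sum(row) for row in grid)
    if total <= 0:
        raise ValueError("map must contain some positive mass")
    return [[value / total for value in row] for row in grid]


def modulate(saliency, fixation_prior):
    """Combine a saliency map with a category fixation map.

    ASSUMPTION: a pointwise product followed by renormalisation.
    The actual combination rule of the thesis model may differ.
    """
    same_shape = len(saliency) == len(fixation_prior) and all(
        len(s_row) == len(p_row)
        for s_row, p_row in zip(saliency, fixation_prior)
    )
    if not same_shape:
        raise ValueError("maps must have the same shape")
    product = [
        [s * p for s, p in zip(s_row, p_row)]
        for s_row, p_row in zip(normalise(saliency), normalise(fixation_prior))
    ]
    return normalise(product)


# Toy example (hypothetical values): bottom-up saliency favours the
# top-left cell, while the category prior favours the right-hand column.
saliency = [[4.0, 1.0], [1.0, 2.0]]
prior = [[1.0, 3.0], [1.0, 3.0]]
modulated = modulate(saliency, prior)
```

In this toy case the prior shifts the peak of the map from the top-left cell to the bottom-right cell, which is the qualitative effect a scene-specific prior is meant to have on a purely stimulus-driven map.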
To show the robustness of the proposed scene category learning and subsequent modulation method, we ran 1000 simulations, each with a unique permutation of training and test data. The model's performance was tested on early fixations from 32 subjects, and it demonstrated performance comparable to well-known models of visual attention (AUC = 0.9 and KL divergence = 1.6). The critical part was the unique way of modulating the movie scenes using human fixation maps. These maps had an inherent centre bias, which could potentially undermine the improved performance of the model. We therefore rigorously tested the model against control conditions, such as modulating with a fixation map averaged across the scene categories and with fixation maps of the wrong scene category. Our analysis showed that, with modulation by the correct scene-specific fixation maps, the model's performance improved significantly over modulation by the average fixation map and by incorrect scene fixation maps. These results suggest that gist plays a significant role in guiding visual attention by mediating early fixation behaviour.

Another important factor was the unsupervised learning of movie scene categories. In past research the scenes were either manually ranked or labelled (Oliva and Torralba, 2001), or the main category label had to be specified to learn the theme model (Fei-Fei and Perona, 2005). In contrast, we used unsupervised clustering to discover the latent scene categories in our movies. The clustering was performed for 1000 unique permutations of the training data, yielding between two and four scene categories across all the simulations.

5.2 Future Work

In the current experiments scene onset was not controlled and mostly occurred during a fixation with random latency. It would be interesting to extend the paradigm by including scene transitions that occur during saccades and by controlling the transition onset latency in both fixations and saccades.
This would give us a more complete picture of the effects of global and local visual change on subsequent fixation behaviour. Another potential direction is to propose a model of saccade programming that takes into account the variability observed in fixation duration for dynamic scenes, and to integrate it with the proposed saliency model.

Regarding improvements to the computational model, there are several interesting avenues to explore. Does gist play a role in the later part of the scene by influencing late fixation behaviour? The current model showed improved results for early fixations, indicating a significant role for gist in early scene processing (Võ and Henderson, 2010), but is it possible to quantify the role of gist over time as the scene progresses? What other features can be used to formulate an improved gist descriptor? The current model uses a spatial frequency signature to build the gist descriptor (Oliva et al., 2006). However, there are plenty of other methods to compute gist (Renninger and Malik, 2004; Siagian and Itti, 2007). How can the formulation of fixation maps for gist-dependent modulation be improved? Currently each scene in a cluster contributes equally to the fixation map used for subsequent gist modulation. One improvement could be to formulate the fixation map for each cluster using only the top 10% to 15% of scenes in that cluster. Another approach is to weight the contribution of each scene: a higher weight is assigned to a scene that lies closer to the cluster centroid, and a lower weight to one that lies farther away.
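The centroid-weighting idea suggested above could be sketched as follows. This is only an illustrative sketch under stated assumptions: the weighting function 1 / (1 + distance) is a hypothetical choice (any decreasing function of distance would fit the description), and the names `scene_weights` and `weighted_fixation_map`, the two-dimensional gist vectors, and the 1x2 fixation maps are invented for the example.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scene_weights(gist_vectors):
    """Weight each scene by its closeness to the cluster centroid.

    ASSUMPTION: weight = 1 / (1 + distance), normalised to sum to 1.
    """
    centre = centroid(gist_vectors)
    raw = [1.0 / (1.0 + distance(g, centre)) for g in gist_vectors]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_fixation_map(fixation_maps, weights):
    """Weighted average of per-scene fixation maps of identical shape."""
    rows, cols = len(fixation_maps[0]), len(fixation_maps[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, fixation_maps))
         for c in range(cols)]
        for r in range(rows)
    ]


# Toy cluster (hypothetical values): the third scene lies farthest
# from the centroid, so its fixation map contributes the least.
gist_vectors = [[0.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
fixation_maps = [[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]]]
weights = scene_weights(gist_vectors)
cluster_map = weighted_fixation_map(fixation_maps, weights)
```

Setting all weights to 1 / N recovers the current equal-contribution scheme, so the two variants can be compared within the same evaluation.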