
Contents

  • 1.3 Difficulties of Video Object Tracking and Objectives of Research
  • 1.4 Outline
  • 2. Previous Work on Video Object Tracking
    • 2.1 Introduction
    • 2.2 Observation Formation
      • 2.2.1 Model Matching
      • 2.2.2 Motion Detection
    • 2.3 Representation, Modelling, and Methods of Filtering
    • 2.4 Data Association Techniques
    • 2.5 Objectives of this Research
  • 3. Object Detection Based on Non-linear Prediction
    • 3.2 Gaussian Mixture Model and Its Optimal Prediction
      • 3.2.1 Multivariate Gaussian
      • 3.2.2 Optimal Prediction of Multivariate Gaussian Mixture
    • 3.3 Small Object Detection Based on Non-linear Optimal Prediction
    • 3.4 Experimental Results and Comparison with AR model


BAYESIAN VIDEO OBJECT TRACKING

By Dehong Ma
The University of Wisconsin-Milwaukee, 2006
Under the Supervision of Professor Jun Zhang

This is a study of tracking moving objects robustly an

Difficulties of Video Object Tracking and Objectives of Research

The causes that make video object tracking a difficult task can be summarized as follows.

The four types of deformations summarized by David Mumford [55]. In video object tracking, the "noise and blur" come mainly from the image-capture process, such as sensor noise, lighting artifacts, and lens distortion; the "superposition" is caused by the occlusions of video objects in tracking; the "domain warping" refers to internal deformation, such as human body pose changes and varying facial expressions, and to external deformation introduced by the projective transform (three dimensions to two dimensions); the final deformation is "interruption," for example, occlusion by clutter in the background.

Over/under segmentation in image processing. In most tracking processes, we need to separate the object of interest from the whole image, to some extent, by image segmentation, which is a notoriously difficult task in many situations. In particular, over/under segmentation usually causes misleading observations and results in failures of video object tracking.

These difficulties often result in frequent object loss and/or high false-alarm ratios in video object tracking, which can be observed in most current video object tracking systems [56][57].

The objectives of this research are to develop reliable, efficient, and fully automatic video object tracking techniques that can significantly reduce the false-alarm ratio. Also, for typical commercial applications, these techniques should be highly efficient and easy to implement on COTS (Commercial Off-The-Shelf) PTZ (Pan-Tilt-Zoom) cameras and inexpensive general-purpose computers, such as personal computers (PCs) running a Windows operating system.

The main contributions of this thesis are the following:

(1) A set of new models for video object tracking, which includes a system state model with a set of innovatively defined states, and a new observation model. The new models have been shown to perform robust and reliable video object tracking. These models can also easily deal with the over/under segmentation problem.

(2) An efficient implementation of the proposed Bayesian video object tracking using a particle filtering technique. This implementation makes the tracking cameras.

(3) A new technique for automatic object initialisation, confirmation, and deletion based on the sequential likelihood ratio test (SLHRT) principle. This technique can greatly reduce the false-alarm ratio and makes automatic video object tracking possible.

(4) A new segmentation technique to separate small objects from cluttered backgrounds in a single frame. This technique fits an image with a Gaussian mixture model, which is then used to predict each pixel from its neighbours. Because the background pixels are reflected in the model, the prediction errors highlight the pixels that are not from the background. This prediction-error image is then used to segment the small objects from the background.

(5) A new algorithm for multiple video object tracking that combines sophisticated data association techniques with particle filtering. The new system can reliably and efficiently track multiple video objects.

The thesis consists of six chapters. The previous work on general object tracking and video object tracking is reviewed in Chapter 2. Chapter 3 describes the new segmentation technique based on optimal non-linear image prediction. The proposed Bayesian approach for single video object tracking is described in Chapter 4, and the techniques for multiple video object tracking are detailed in Chapter 5. The conclusion and future directions are given in Chapter 6.

Previous Work on Video Object Tracking

Video object tracking is a branch of classic object tracking. It uses video sensors, typically video cameras, to acquire observations of objects. Classic object tracking does not specify the sources used to acquire the observations; its typical sensors are radar, sonar, or the like. The essential difference between them is that, in video object tracking, we can obtain more information, or extract more features, from the objects we are tracking.

A typical video tracking system consists of observation formation (sensor data processing), data association (for multiple-target tracking), track maintenance (initialisation, confirmation, and deletion), and state estimation/prediction, or filtering.

Figure 2.1 gives the diagram of a typical video object tracking system.

Most of the previous work on video object tracking can be grouped into different classes according to the approaches taken to observation formation, modelling, filtering, and data association. Here we review these approaches.

Figure 2.1 A typical video tracking system.

The direct observation from an imaging sensor is a video sequence, which consists of a series of frames. Each frame includes the objects that we want to track, the background, and other foreground objects. The process of picking out the tracked objects from the whole picture is called observation formation, which, in function, overlaps with image/video segmentation.

Semantic image segmentation separates the objects of interest from those that are not of interest in an image. This is a notoriously difficult problem [58][59], especially for real-world images taken under natural lighting conditions. In principle, segmentation is a classification or inference problem, and it can be improved by incorporating a priori information into the inference [58].

Most image segmentation methods are based on homogeneity metrics calculated from images [60][61]. The homogeneity is usually measured on a specific feature extracted from an image. Typical features used for image segmentation are intensity [62][63], colour [64][65], edges [66][67], texture [68][69], self-similarity [70][71], motion [72][73], wavelets [75][76], etc. Based on these features, the pixels in an image are then grouped into individual areas.

The typical segmentation techniques used in video object tracking for observation formation are model matching and motion detection. We will focus on these two techniques.

In this category, a model image is trained from the images of the objects of interest. Then, in each frame the observations are formed/found by model matching techniques.


An early and important approach is template matching [77]. An object's model/template is correlated with the frame at various locations, and the location with the highest correlation score is picked as the observation's location. This approach works well if the object's image does not change much from frame to frame, but it will fail if the object of interest is deformable or has large variations.
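The correlation search described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation of [77]: the frame, template, and the use of normalized cross-correlation as the score are assumptions made for the example.

```python
import numpy as np

def match_template(frame, template):
    """Score every placement of `template` in `frame` by normalized
    cross-correlation and return the best top-left corner and its score."""
    fh, fw = frame.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -np.inf, (0, 0)
    for i in range(fh - th + 1):
        for j in range(fw - tw + 1):
            patch = frame[i:i + th, j:j + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            score = (p * t).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (i, j)
    return best_pos, best_score

# Plant the template in a noisy frame and recover its location.
rng = np.random.default_rng(0)
frame = rng.normal(0.0, 0.1, (32, 32))
template = rng.normal(0.0, 1.0, (8, 8))
frame[10:18, 14:22] += template
pos, score = match_template(frame, template)   # pos == (10, 14)
```

A production tracker would do the same search with FFT-based correlation for speed; the exhaustive double loop here is only for clarity.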

A. L. Yuille et al. [31] suggested using a deformable template to deal with these difficulties. The deformable template they proposed is a parameterised model, and deformable template matching is an optimisation over the parameters. It can also be viewed as a parameter estimation or model fitting process. The image region corresponding to the best-fit parameters is extracted as the observation. A variation of this approach is to describe the face as a Gaussian mixture distributed in 2D space [78].

The above approaches are based on 2D shapes, but a real-world 3D object projects to different shapes on the 2D image plane from different relative positions. A direct extension of the above approaches is to sample all of the object's 2D shapes in 3D space [79] as reference images. One difficulty of this approach is balancing the number of reference images against the sampling accuracy. In some specific situations, such as indoor human trajectory tracking, the object's motion space and activity pattern are highly limited, which suggests the feasibility of constructing a highly efficient, compact sampling space as in [80].

Rather than use 2D shape models, some researchers use 3D shape models, such as a cylindrical human-body model [81] or articulated models with robust shape components [82]. These approaches acquire robustness at the cost of higher computational complexity.

A common problem with the above shape features is that they are sensitive to shape changes, which occur when the object of interest is non-rigid. A remedy is to select shape-insensitive features for tracking. The Condensation algorithm [83] uses a boundary-edge feature to track both rigid and non-rigid objects, because the boundary-edge feature is much less sensitive to an object's shape change. The number of edge points used for tracking can be reduced by applying smoothness constraints to an object's boundary. The Condensation algorithm uses B-spline curves, which are similar to the curves used in Snakes [84] and Active Contours [85]. One drawback of the Condensation algorithm is that it needs an initial curve, usually given by hand segmentation, which is not always possible in real-time tracking.

Other shape-insensitive features that are often used to form a model are corner points and grey-level or colour histograms. Corner points [86] are robust to noise and can be stably detected; a model can be observed through detecting its corner point sets.

The histogram of an object is another widely used feature that is insensitive to shape change. The mean-shift algorithm [87] uses a colour histogram to track non-rigid objects robustly.
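As an illustration of histogram-based matching, the sketch below compares greyscale histograms with the Bhattacharyya coefficient, a similarity measure commonly paired with mean-shift trackers. The patch sizes, bin count, and the use of greyscale rather than full colour histograms are simplifying assumptions for the example.

```python
import numpy as np

def grey_histogram(patch, bins=8):
    """Normalized histogram of an 8-bit patch (a stand-in for the joint
    colour histogram a real tracker would use)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return hist / hist.sum()

def bhattacharyya(p, q):
    """Similarity in [0, 1]; 1 means identical distributions."""
    return float(np.sum(np.sqrt(p * q)))

rng = np.random.default_rng(1)
model = rng.integers(0, 256, (16, 16))
perturbed = np.clip(model + rng.integers(-5, 6, (16, 16)), 0, 255)  # small intensity change
unrelated = rng.integers(128, 256, (16, 16))                        # different image region

s_perturbed = bhattacharyya(grey_histogram(model), grey_histogram(perturbed))
s_unrelated = bhattacharyya(grey_histogram(model), grey_histogram(unrelated))
```

Because the histogram discards spatial layout, the perturbed patch still scores close to the model while the unrelated region scores lower — the property that makes the feature insensitive to shape change.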

Because moving objects are usually of major interest for video object tracking, the motion cue naturally becomes one of the most important observations of the objects.

There are two kinds of motion in videos: global motion and local motion. If there is no global motion, which corresponds to video from a static camera, the simplest way to detect local motion is background subtraction [88] or frame differencing [74]. Though these approaches are computationally efficient, they are usually very sensitive to background noise. An immediate remedy is to model the intensity at each pixel as a Gaussian random variable and estimate the pdf's parameters before the background subtraction; this approach is hence called adaptive background estimation [89]. To compensate for periodic lighting changes, W. E. L. Grimson et al. extended the single Gaussian model to a Gaussian mixture model, which has been shown to be robust for local motion detection [90].
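A minimal sketch of the adaptive background idea follows, using a single running Gaussian per pixel rather than the full mixture of [90]; the learning rate `alpha`, the threshold `k`, and the synthetic scene are assumptions chosen for the example.

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.05, k=2.5):
    """One step of adaptive background estimation with a single Gaussian
    per pixel: flag pixels more than k standard deviations from the
    running mean as foreground, then update (mean, var) with exponential
    forgetting. The mixture version of [90] keeps several such models."""
    diff = frame - mean
    foreground = diff ** 2 > (k ** 2) * var
    mean = mean + alpha * diff
    var = np.maximum((1 - alpha) * var + alpha * diff ** 2, 1e-6)
    return foreground, mean, var

rng = np.random.default_rng(2)
mean = np.full((16, 16), 100.0)
var = np.full((16, 16), 4.0)

# Settle the model on a static scene with sensor noise (std 2)...
for _ in range(20):
    frame = 100.0 + rng.normal(0.0, 2.0, (16, 16))
    fg, mean, var = update_background(mean, var, frame)

# ...then let a bright object enter at [4:8, 4:8] and detect it.
frame = 100.0 + rng.normal(0.0, 2.0, (16, 16))
frame[4:8, 4:8] += 50.0
fg, mean, var = update_background(mean, var, frame)
```

The exponential forgetting is what lets the model absorb slow lighting drift while still flagging fast local changes as foreground.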

If the global motion is not zero, which corresponds to video from a moving camera, we have to estimate the global motion first and then calculate the local motion after global motion compensation. Usually, in the moving-camera situation, it is impossible to build a background model as we do when the camera is static.

The techniques for global motion estimation or, in a wider sense, general motion estimation, can be categorized into dense motion estimation and coarse motion estimation. Among the dense motion estimation techniques, the gradient-based optical flow [91] approach is the most reliable. This approach calculates the motion field based on the Brightness Constraint Equation (BCE) and a smoothness constraint on the motion field. It encounters problems in situations where those constraint assumptions do not hold, typically occlusions, lighting changes, correspondence, black-wall, and aperture problems [88]. When the inter-frame motion is large or, equivalently, when temporal aliasing exists, a coarse-to-fine process must be performed with the optical flow calculation in order to obtain the correct motion field [92][93].

Other motion detection techniques include filtering-based approaches, such as Gabor filtering [94] and wavelet-based motion filtering [95], phase-based approaches [96], 3D model-based geometry [97], etc. Among them, the phase-based approach has the best accuracy. The 3D model-based geometry approach is also used to reconstruct a 3D world from a set of correspondence points. The estimated 3D camera positions can be used to find the 2D displacement field, which in turn is used to calculate the global motion between two frames.

After the global motion has been obtained, the local motion is found through the motion difference, that is, by classification based on the motion feature.

The other category, coarse motion estimation, is also called region-based motion estimation. It estimates the motion field mainly through block-matching techniques that find a coarse displacement field. These approaches are easy to implement and are widely used in video compression [63] to eliminate temporal redundancy; many variants [98][99] can be found in the video compression literature. The coarse motion they produce can also be used to estimate the global motion in video [100].
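The block-matching idea can be sketched as an exhaustive SAD search. The block size, search range, and the synthetic shifted frame below are assumptions for illustration, not parameters from the cited systems.

```python
import numpy as np

def block_motion(prev, curr, block=8, search=4):
    """Exhaustive block matching: for each block of `curr`, search a
    (2*search+1)^2 window in `prev` for the displacement minimizing the
    sum of absolute differences (SAD). Returns a coarse motion field."""
    h, w = curr.shape
    field = np.zeros((h // block, w // block, 2), dtype=int)
    for bi in range(h // block):
        for bj in range(w // block):
            y, x = bi * block, bj * block
            target = curr[y:y + block, x:x + block]
            best, best_d = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    py, px = y + dy, x + dx
                    if py < 0 or px < 0 or py + block > h or px + block > w:
                        continue  # candidate block falls outside the frame
                    sad = np.abs(prev[py:py + block, px:px + block] - target).sum()
                    if sad < best:
                        best, best_d = sad, (dy, dx)
            field[bi, bj] = best_d
    return field

# Shift a textured frame by (2, 3) pixels and recover that displacement.
rng = np.random.default_rng(3)
prev = rng.normal(0.0, 1.0, (32, 32))
curr = np.roll(np.roll(prev, -2, axis=0), -3, axis=1)
field = block_motion(prev, curr, block=8, search=4)
```

Interior blocks recover the true displacement exactly; blocks touching the wrap-around border of the synthetic shift do not, which mirrors the border handling a real encoder must address.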

The techniques mentioned above are all automatic in that they need no operator intervention. Some semi-automatic techniques, such as those used in [83][101][102], have been proposed for video object tracking, where operators choose and segment the objects of interest manually in the first one or few frames before tracking them automatically. Although such techniques usually offer more accurate tracking results, this thesis focuses mainly on automatic techniques for their generality.

The detected moving objects are used both for new object initialisation and as the observations of existing objects. In the next chapter, we also introduce an innovative approach to obtain the initial objects from a single image.

2.3 Representation, Modelling, and Methods of Filtering

As stated in section 2.2.1, using robust features from images appears to increase the tracking algorithm's robustness, but in many situations it cannot prevent losing the objects being tracked. The reason is that the object's image changes over time, and such changes may be viewed as a random walk or a Brownian motion in the space of images [55]. So any measure derived from the object image is likely to change over time, and such changes may likewise be viewed as a random walk or Brownian motion in the space of measurements. From the theory of stochastic processes, the variance of a Brownian motion increases as a function of time. This means that, even when a robust measure is used, the object can still be lost as time increases.

A better and more fundamental solution is to update the template over time based on the observed images and tracking results, i.e., the estimation of the object. This is related to how to represent the object for tracking, how to model the system, and how to predict and estimate the objects' states.

2.4 Data Association Techniques

When we want to track multiple objects through multiple observations in the scene, we need to determine which observation comes from which object. This is called the data association problem. Intuitively, we can associate an observation with the object nearest to it, but that may be wrong. So we need a method that is more general and that can make the optimal decision.
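A toy example of why nearest-neighbour association can be suboptimal: with a squared-distance cost, greedily giving each object its nearest free observation can differ from the jointly optimal assignment found by exhaustive search (the kind of joint reasoning that MHT-style methods approximate at scale). The 1-D positions below are contrived purely for illustration.

```python
import numpy as np
from itertools import permutations

def greedy_nn(objects, observations):
    """Give each object, in order, its nearest unclaimed observation
    (squared-distance cost)."""
    taken, assign = set(), []
    for obj in objects:
        costs = [np.inf if j in taken else (obj - z) ** 2
                 for j, z in enumerate(observations)]
        j = int(np.argmin(costs))
        taken.add(j)
        assign.append(j)
    return assign

def optimal_assignment(objects, observations):
    """Jointly minimize the total squared distance over all assignments
    (brute force over permutations)."""
    best_cost, best = np.inf, None
    for perm in permutations(range(len(observations)), len(objects)):
        cost = sum((obj - observations[j]) ** 2
                   for obj, j in zip(objects, perm))
        if cost < best_cost:
            best_cost, best = cost, list(perm)
    return best

# The greedy choice for the first object forces a bad pairing overall.
objects = [0.0, 0.5]
observations = [0.6, -1.0]
greedy = greedy_nn(objects, observations)          # [0, 1]
joint = optimal_assignment(objects, observations)  # [1, 0]
```

Here the greedy pairing costs 0.36 + 2.25, while the joint optimum costs 1.0 + 0.01: locally best choices do not compose into the globally best hypothesis.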

Multiple Hypothesis Tracking (MHT) is such an optimal algorithm, which gives the minimum decision error [103]. MHT does this through an exhaustive search of the hypothesis space over all possible data associations; it then evaluates each hypothesis by its likelihood, and the final decision is made based on these evaluations.

Each hypothesis consists of an arrangement of all observations up to time t. In a hypothesis, some observations are grouped to form an object's track across different frames, and others are identified as false alarms. A track can die due to the lack of observation updates over a certain period. Figure 2.2 is an illustration of a hypothesis for a set of one-dimensional observations.

A hypothesis’s likelihood is evaluated according to the underlying tracking models.

These models describe the probability characteristics of the aspects of a hypothesis. For example, the object number model describes the probability of a given number of objects within a surveillance region during a period.

Figure 2.2 A hypothesis. The grouped observations are from objects; isolated observations are from false alarms.

The hypothesis evaluation can be done in batch processing [103], but this is impractical because the number of hypotheses grows exponentially over time. D. B. Reid [110] proposed a recursive way to simplify the evaluation process, which makes MHT a practical solution. To further reduce the computational load, hypothesis pruning techniques have been adopted, at the risk of missing the best hypotheses.

To further reduce the computational cost, Y. Bar-Shalom [53] proposed a highly simplified sub-optimal approach called Joint Probabilistic Data Association (JPDA). This approach assumes that all hypotheses are time-independent, so there is no need to generate and maintain a hypothesis tree over time. Rather, it generates a hypothesis tree within one frame and completes the estimation within that frame before processing the next frame.

Another sub-optimal approach, called Probabilistic Multiple Hypothesis Tracking (PMHT) [111], is similar to JPDA in that it performs estimation on a per-frame basis. It goes one step further by associating all observations with all objects. In this way, it no longer needs the hypothesis generation process. Instead, it assumes there are hidden variables associating each object with each observation, and it estimates those association variables using an EM algorithm, which converges to a set of concave association weights for each object [112]. A similar idea was adopted in K. J. Molnar's work [113]. Recently this approach was extended to handle a non-linear model with non-Gaussian noise by using particle filtering [114].

In their basic forms, both JPDA and PMHT assume that the number of objects and their initial states are known. Therefore they must have other, independent processes for object initialisation and deletion.

While the data association techniques discussed above are widely used in radar tracking [115], it is only recently that related work has appeared in the computer vision community [116][118]. In video object tracking, if there are not many objects and the information from each object's image is rich enough to distinguish the objects from one another, the data association process is usually omitted. But its importance for tracking multiple video objects in complicated situations, such as in an unmanned aerial vehicle (UAV) video surveillance system, has recently been recognized [119]. Even when it is omitted, the data association principle is implicitly applied to eliminate impossible data associations.

2.5 Objectives of this Research

Our goal is to develop a reliable, efficient, and automatic video object tracking system with a lower false-alarm ratio than current systems. It differs from systems that need semi-automatic or manual initial segmentation. To meet the reliability requirement, we need to select robust features along with a template update mechanism. The integrated template update process distinguishes our system from most current video tracking systems, which usually use fixed templates.

For a real system, our system should be able to deal with object initialisation, confirmation, false-alarm elimination, and out-of-region object deletion automatically.

We propose the sequential likelihood ratio test algorithm to fulfil this requirement.

To reduce the computational complexity while retaining reliable and fully automatic tracking, we need a highly efficient object representation and, accordingly, a highly efficient implementation of the tracking filter.

Object Detection Based on Non-linear Prediction

Similar to image segmentation, object detection is a process of separating the objects of interest from the background and from other objects not of interest. Semantic object detection is still an open research topic [58][59].

As stated in section 2.2, a large number of image segmentation techniques have been developed, most of them based on classification or pattern recognition theory [120].

In this chapter we introduce a new technique for object detection in highly textured and cluttered backgrounds, which is usually a very difficult task. This new technique was motivated by the observation that the complex textures in such backgrounds fit well into certain patterns. If we can find those underlying models, an object, which differs from the background, can be identified.

In the following sections we describe the theoretical basis of our technique, show how it works, and discuss the experimental results.

Gaussian Mixture Model and Its Optimal Prediction

Suppose $\mathbf{x} = [x_1, x_2, \cdots, x_n]^T$ is a random vector whose pdf has the following form:

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p_k(\mathbf{x}), \qquad (3.1)$$

where $p_k(\cdot)$ is a multivariate Gaussian with mean $\mathbf{m}_k$ and covariance matrix $C_k$, and the mixing weights satisfy

$$\pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1. \qquad (3.2)$$

Then the pdf $p(\mathbf{x})$ is called a multivariate Gaussian mixture.
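A Gaussian mixture density of this form can be evaluated directly; the sketch below computes it for an assumed two-component 2-D example (the weights, means, and covariances are invented for illustration).

```python
import numpy as np

def gaussian_pdf(x, m, C):
    """Multivariate Gaussian density N(x; m, C)."""
    d = len(m)
    diff = x - m
    expo = -0.5 * diff @ np.linalg.solve(C, diff)
    return np.exp(expo) / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(C))

def mixture_pdf(x, weights, means, covs):
    """p(x) = sum_k pi_k p_k(x), i.e. the mixture density of (3.1)."""
    return sum(w * gaussian_pdf(x, m, C)
               for w, m, C in zip(weights, means, covs))

# An assumed two-component mixture in 2-D; the weights sum to one.
weights = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
p = mixture_pdf(np.array([0.0, 0.0]), weights, means, covs)
```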

3.2.2 Optimal Prediction of Multivariate Gaussian Mixture

If a random vector is distributed as a Gaussian mixture with known parameters $\{(\pi_k, \mathbf{m}_k, C_k),\ k = 1, 2, \cdots, K\}$, then, given a partial observation of $\mathbf{x}$, for example $[x_1, x_2, \cdots, x_{n-1}]^T$, we can calculate the optimal estimate of the missing component $x_n$ in the mean-square-error sense. If the vector $\mathbf{x}$ is causally formed, this estimate can also be viewed as an optimal prediction from the previous observations.

The optimal prediction is the conditional expectation of $x_n$ given $[x_1, x_2, \cdots, x_{n-1}]^T$:

$$\hat{x}_n = E[x_n \mid x_1, x_2, \cdots, x_{n-1}] = \int x_n\, p(x_n \mid x_1, x_2, \cdots, x_{n-1})\, dx_n. \qquad (3.3)$$

The conditional pdf $p(x_n \mid x_1, x_2, \cdots, x_{n-1})$ can be calculated as follows:

$$p(x_n \mid x_1, \cdots, x_{n-1}) = \frac{p(x_1, \cdots, x_{n-1}, x_n)}{p(x_1, \cdots, x_{n-1})} = \frac{\sum_{k=1}^{K} \pi_k\, p_k(x_1, \cdots, x_{n-1}, x_n)}{\sum_{k=1}^{K} \pi_k \int p_k(x_1, \cdots, x_{n-1}, x_n)\, dx_n} = \sum_{k=1}^{K} \tilde{\pi}_k\, \frac{p_k(x_1, \cdots, x_{n-1}, x_n)}{p_k(x_1, \cdots, x_{n-1})}, \qquad (3.4)$$

where

$$\tilde{\pi}_k = \frac{\pi_k\, p_k(x_1, \cdots, x_{n-1})}{p(x_1, \cdots, x_{n-1})}$$

and

$$p(x_1, \cdots, x_{n-1}) = \sum_{k=1}^{K} \pi_k\, p_k(x_1, \cdots, x_{n-1}). \qquad (3.5)$$

In (3.5), $p_k(x_1, \cdots, x_{n-1})$ is the marginal density of the multivariate Gaussian $p_k(x_1, \cdots, x_{n-1}, x_n)$ with parameters $(\mathbf{m}_{k,n-1}, C_{k,n-1,n-1})$, which are the sub-vector and sub-matrix of $(\mathbf{m}_k, C_k)$ as in (3.6) [122]:

$$\mathbf{m}_k = \begin{bmatrix} \mathbf{m}_{k,n-1} \\ m_{k,n} \end{bmatrix}, \qquad C_k = \begin{bmatrix} C_{k,n-1,n-1} & \mathbf{c}_{k,n-1,n} \\ \mathbf{c}_{k,n,n-1} & c_{k,n,n} \end{bmatrix}. \qquad (3.6)$$

From (3.3), we can easily show that the optimal prediction can be calculated as (3.7):

$$\hat{x}_n = \sum_{k=1}^{K} \tilde{\pi}_k\, \hat{x}_{k,n} = \sum_{k=1}^{K} \tilde{\pi}_k \left( m_{k,n} + \mathbf{c}_{k,n,n-1}\, C_{k,n-1,n-1}^{-1} \left( [x_1, \cdots, x_{n-1}]^T - \mathbf{m}_{k,n-1} \right) \right). \qquad (3.7)$$

From (3.7), we can see that, conditioned on $[x_1, \cdots, x_{n-1}]^T$, $x_n$ is also a Gaussian mixture, and its optimal prediction (conditional expectation) is a combination of the optimal predictions from each of its components. The combination weight $\tilde{\pi}_k$ is a non-linear, joint-Gaussian function of $[x_1, \cdots, x_{n-1}]^T$. The non-linear function (3.7) is the analytical solution for the optimal prediction of a Gaussian mixture.
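This prediction is straightforward to implement. The sketch below computes it with NumPy and sanity-checks the single-component case, where it reduces to the standard conditional mean of a multivariate Gaussian; the partitioning convention follows (3.6), and the numeric parameters are assumptions made for the example.

```python
import numpy as np

def gaussian_pdf(x, m, C):
    """Multivariate Gaussian density N(x; m, C)."""
    d = len(m)
    diff = x - m
    expo = -0.5 * diff @ np.linalg.solve(C, diff)
    return np.exp(expo) / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(C))

def predict_last(x_obs, weights, means, covs):
    """Optimal MSE prediction of the last component x_n from the observed
    components x_1..x_{n-1}, following (3.7): per-component conditional
    means combined with the posterior weights pi~_k of (3.4)."""
    post, cond = [], []
    for w, m, C in zip(weights, means, covs):
        m1, mn = m[:-1], m[-1]
        C11 = C[:-1, :-1]          # C_{k,n-1,n-1}
        cn1 = C[-1, :-1]           # c_{k,n,n-1}
        post.append(w * gaussian_pdf(x_obs, m1, C11))   # pi_k p_k(x_1..x_{n-1})
        cond.append(mn + cn1 @ np.linalg.solve(C11, x_obs - m1))
    post = np.array(post) / np.sum(post)                # normalized pi~_k
    return float(post @ np.array(cond))

# Assumed numeric example; with K = 1 the formula reduces to the
# standard Gaussian conditional mean.
m = np.array([1.0, 2.0, 3.0])
C = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])
x_obs = np.array([1.5, 1.0])
x_hat = predict_last(x_obs, [1.0], [m], [C])
```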

Small Object Detection Based on Non-linear Optimal Prediction

We applied the above theory to small object detection in highly textured and cluttered backgrounds. In such images, the small objects can be viewed as anomalies with respect to the dominating background. In this implementation we model the texture patterns residing in the image background. We chose a spatial block as our model data, as illustrated in Figure 3.1, which consists of a pixel I(i,j) and its neighbourhood N(i,j). The neighbourhood can be either causal or non-causal; a non-causal model is more efficient, while a causal model is more amenable to real-time processing.

The block image data can be organized into one-dimensional vectors $[x_1, x_2, \cdots, x_n]^T$, and a multivariate Gaussian mixture image model can then be obtained by fitting the vectors extracted from all sites in the image to a multivariate Gaussian mixture model. Notice that for such a model to be practically useful, some stationarity conditions are usually imposed; for example, it is commonly assumed that the multivariate Gaussian mixture is invariant with respect to sites.

Figure 3.1 Left: site (i,j) and its neighbourhood N(i,j). Right: the corresponding one-dimensional vector.

The procedure is as follows.

(1) Image block data extraction. For each pixel, we obtain its neighbouring pixels and organize them into a one-dimensional vector. Here a causal block is chosen.

(2) Training. These vectors are input to a model parameter estimation algorithm. Here an EM-like algorithm [123] is chosen because it can simultaneously estimate the number of components in the data.

(3) Prediction. For each pixel, we use its causal neighbours to calculate its optimal prediction according to the formula given in section 3.2.2.

(4) Prediction-error image and post-processing. We generate the prediction-error image from the prediction image and the original image. The post-processing includes binarization and a morphological process in order to cut out the anomalies, i.e., the detected small objects.

(5) Parameter selection. The parameters important to the performance of the algorithm are chosen as follows. Neighbourhood size: to capture texture information, the neighbourhood cannot be too small; on the other hand, to ensure that the predictor is sensitive to small objects, it also cannot be too large. Our final selection is a compromise between these considerations, informed by some experimentation. Preset threshold A: the model parameter estimation algorithm uses this threshold to determine the number of components, or classes, in the Gaussian mixture model. The algorithm starts with a relatively large class number and estimates the model parameters by the EM algorithm. When EM converges, it checks the minimum $\pi_k$; if it is less than A, it decreases the class number and re-estimates the model parameters until every $\pi_k$ is larger than A. If A is too small, we end up with too many classes, and the small objects can become one of these background classes. If it is too large, some small classes merge into one large class, and the predictor becomes insensitive to the small objects. In our experiments, A is chosen to be at least larger than the ratio of the small-object size to the whole image size, so that the small objects are not classified as a background class.
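Steps (1) and (4) of the procedure can be sketched directly. The causal-neighbourhood layout and the mean-plus-k-sigma binarization threshold below are assumptions made for illustration; the trained mixture predictor of steps (2)-(3) and the morphological clean-up are omitted from the sketch.

```python
import numpy as np

def causal_blocks(img, order=2):
    """Step (1): for every pixel with a full causal neighbourhood, stack
    the `order` pixels to its left and the `order` pixels above it,
    followed by the pixel itself (last entry = the pixel to predict)."""
    h, w = img.shape
    vecs, sites = [], []
    for i in range(order, h):
        for j in range(order, w):
            left = img[i, j - order:j]       # same row, to the left
            above = img[i - order:i, j]      # same column, above
            vecs.append(np.concatenate([left, above, [img[i, j]]]))
            sites.append((i, j))
    return np.array(vecs), sites

def binarize_errors(err_img, k=3.0):
    """Step (4), simplified: threshold the prediction-error image at
    mean + k*std (morphological clean-up omitted in this sketch)."""
    return err_img > err_img.mean() + k * err_img.std()

rng = np.random.default_rng(4)
img = rng.normal(0.0, 1.0, (12, 12))
vecs, sites = causal_blocks(img, order=2)   # 10*10 vectors of length 5
```

The stacked vectors are exactly what the EM-like training of step (2) would consume, and the binarization shows how an isolated large prediction error survives a global threshold while the background does not.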

Experimental Results and Comparison with AR model

Typical experimental results on a synthetic image, a Brodatz texture image, and a real-world image are shown in Figures 3.2-3.4, respectively. The control parameters used to produce these results are shown in Table 3.1. The local signal-to-background ratio (LSBR), defined for a single object as in [124], is used to evaluate the effectiveness of our approach. For comparison, we have also implemented a predictor based on the AR model with the same neighbourhood.

In the LSBR, $e_{i,j}$ is the prediction error, and $\mu_b$ and $\sigma_b^2$ are, respectively, the (estimated) mean and variance of the prediction error in the background.

In each of the experimental result figures, the first part (part a) contains two rows of images. The first row consists of the input or original image, the prediction-error image produced by the non-linear predictor, and a "pre-threshold image," i.e., the image to be thresholded to identify objects; here, the pre-threshold image is the square of a lowpass-filtered prediction-error image. The second row consists of the prediction-error image produced by the linear predictor, i.e., the AR-model-based optimal linear predictor, and the corresponding pre-threshold image. The second part (part b) is similar to the first, except that the input image is a noisy version of the original image (obtained by adding zero-mean white Gaussian noise) with an SNR (signal-to-noise ratio)¹ of 20 dB. The third part (part c) is similar to the second, except that the SNR of the input image is 3 dB. The last part (part d) contains a table that records the LSBRs associated with the non-linear and linear predictors under various input-image noise conditions, from no noise down to an SNR of 3 dB. To help visualize the numbers in the table, they are also plotted as two curves. Notice that although the table records LSBRs for the 40 dB and 10 dB cases, to save space the images associated with these experiments are not shown.

Here, the SNR is defined in the usual way, as 10 log₁₀ [signal variance / noise variance].
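Under this definition, zero-mean white Gaussian noise at a target SNR can be added as follows (a small helper we introduce for illustration):

```python
import numpy as np

def add_noise_at_snr(img, snr_db, seed=None):
    """Return img plus zero-mean white Gaussian noise such that
    10*log10(signal variance / noise variance) equals snr_db,
    matching the SNR definition used in the text."""
    rng = np.random.default_rng(seed)
    # solve 10*log10(var_s / var_n) = snr_db for the noise variance
    noise_var = img.astype(float).var() / (10.0 ** (snr_db / 10.0))
    return img + rng.normal(0.0, np.sqrt(noise_var), img.shape)
```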

Table 3.1 Control parameters: Neighborhood Order (for both NL and AR), A (for NL only), and the initial value of K (for NL only).

(a) Results on the original image. First row: optimal non-linear predictor. Second row: optimal linear predictor.

(b) Results on the noisy image with a 20 dB SNR. First row: optimal non-linear predictor. Second row: optimal linear predictor.

(d) LSBRs of the non-linear and linear predictors for input SNRs of: 1. original image (no noise added); 2. 40 dB; 3. 20 dB; 4. 10 dB; 5. 3 dB (table and plot).

Figure 3.2 An Object in Synthetic Background.

Columns: Original Image, Error Image, Pre-threshold Image.

(a) Results on the original image. First row: optimal non-linear predictor. Second row: optimal linear predictor.

(b) Results on the noisy image with a 20 dB SNR. First row: optimal non-linear predictor. Second row: optimal linear predictor.

(c) Results on the noisy image with a 3 dB SNR. First row: optimal non-linear predictor. Second row: optimal linear predictor.

(d) LSBRs of the non-linear and linear (AR) predictors for input SNRs of: 1. original image (no noise added); 2. 40 dB; 3. 20 dB; 4. 10 dB; 5. 3 dB (table and plot).

Figure 3.3 An Object in a Brodatz Texture Background.

Columns: Original Image, Error Image, Pre-threshold Image.

(a) Results on the original image. First row: optimal non-linear predictor. Second row: optimal linear predictor.

(b) Results on the noisy image with a 20 dB SNR. First row: optimal non-linear predictor. Second row: optimal linear predictor.

(c) Results on the noisy image with a 3 dB SNR. First row: optimal non-linear predictor. Second row: optimal linear predictor.

(d) LSBRs of the non-linear and linear (AR) predictors for input SNRs of: 1. original image (no noise added); 2. 40 dB; 3. 20 dB; 4. 10 dB; 5. 3 dB (table and plot).

Figure 3.4.

Figure 3.5 Visualizing the Optimal Nonlinear and Linear Predictors - A Simple Example.

In part (a) of Fig 3.2, the original image is a synthetic image. It contains an object (a helicopter) manually planted in the center of a textured background generated by a block-based multivariate Gaussian mixture model. For this image, the non-linear predictor performs significantly better than the linear predictor: it generates a more white-noise-like prediction error image, a better pre-threshold image (with better object-background contrast), and a better LSBR (more than 7 times better; see part d of Fig 3.2).

The reason that the non-linear predictor performs better here, we believe, is that the background texture in the original image is highly “non-linear” in places.

That is, the texture there contains fast switches or transitions of dark and light patches.

The linear predictor cannot predict such quick transitions and therefore does not generate a good prediction error image nor, consequently, a good pre-threshold image.

The results in parts (b)-(d) of Fig 3.2 demonstrate the robustness (noise resistance) of the proposed non-linear predictor. Specifically, at 20 dB SNR, the input image contains a significant amount of noise, but the LSBR of the non-linear predictor is still not significantly smaller than that of the no-noise case. The predictor breaks down only when the noise is very severe (with an SNR less than or equal to 10 dB).

In Fig 3.3, the original image is similar to that of Fig 3.2, except that the background is a real texture, taken from the Brodatz photo album. Since the texture here is again highly non-linear throughout the image, the non-linear predictor performs significantly better than the linear predictor. It is also robust to noise, as in Fig 3.2.

Notice that since the original image (in part a of Fig 3.3) contains some noise to begin with (due to quantisation, etc.), the SNRs in the noisy images understate the amount of noise in these images. This may be why, compared with Fig 3.2, a larger reduction of LSBR is observed from the “no noise” case to the 20 dB SNR case. Finally, one may notice that in Fig 3.3, the prediction error images produced by the non-linear predictor are not as “white-noise looking” as those in Fig 3.2. This is because the Brodatz texture used here has a more complicated structure, and thus a more complicated Gaussian mixture model, such as the hierarchical models of [125] and [126], may be needed to produce better white-noise-looking prediction error images. However, since our application is object detection rather than generating perfect white-noise-looking prediction error images, the model we are using is sufficient: the objects are quite salient in our prediction error and pre-threshold images.

In Fig 3.4, the original image is a real-world image that contains a helicopter in the middle (not planted). The helicopter is difficult to see amongst the vegetation and clutter.

Again, the non-linear predictor provides a better prediction error image (more white-noise-like), a better pre-threshold image, and a better LSBR than the linear predictor, and is robust to noise. Notice that in this case, the original image contains more noise than the original image of Fig 3.3 (due to a noisier camera), and the SNRs of the noisy images (parts b-d of Fig 3.4) understate the amount of noise in these images. Hence, similar to Fig 3.3, this may explain the larger drop of LSBR from the “no noise” case to the 20 dB case in comparison with that of Fig 3.2. Nevertheless, in the pre-threshold image produced by the non-linear predictor for the 20 dB SNR image, the helicopter is still quite visible.

Finally, one may also notice that the net gain in LSBR of the non-linear predictor over the linear predictor is somewhat smaller in this case than in Figs 3.2 and 3.3. This may be because the original image here is less non-linear. Nevertheless, as can be seen from the pre-threshold images in Fig 3.4, the non-linear predictor still works much better than the linear predictor in detecting the helicopter.

The experimental results presented above show that when the images (i.e., background) contain relatively fast changes of intensity patches and complicated texture patterns, the mixture-model-based optimal non-linear predictor performs better in anomaly detection than the single-Gaussian (i.e., AR) optimal linear predictor. To understand better why this may be the case, we consider a simple example where things can be easily visualized. Specifically, suppose a two-dimensional random vector x = [x₁, x₂]′ has a pdf that is a multivariate Gaussian mixture with 3 components, shown as three sets of concentric ellipses in the left picture of Fig 3.5a. Then, we can calculate analytically x̂₂, the optimal prediction of x₂ based on x₁, by using the technique and equations described in Section 3.2. Clearly, x̂₂ = x̂₂(x₁) is a non-linear function of x₁ and can be seen in Fig 3.5a as a highly non-linear curve. Now, for each point on the plane, (x₁, x₂), we can find the prediction error |x₂ − x̂₂|. In Fig 3.5a, this prediction error is indicated by color, with red indicating large values and blue indicating small values.

Inspecting the geometry of Fig 3.5a, we observe that the non-linear predictor passes through the central regions of the ellipses. As a result, if a point (x₁, x₂) is in a “central region,” it produces a small prediction error; otherwise, the prediction error will be large. How does this relate to images and anomaly detection? We can think of the mixture model of Fig 3.5a as a very simple example of the block-based multivariate Gaussian mixture image model (with n = 2). Then, the result in Fig 3.5a indicates that the optimal non-linear predictor behaves in the expected way: producing a small prediction error when a block of pixels is typical of the background and producing a large prediction error when it is not.
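The non-linear predictor in this example is the conditional mean x̂₂(x₁) = E[x₂ | x₁] of a two-dimensional Gaussian mixture. The sketch below evaluates it with the standard conditional-expectation formula (each component contributes its linear conditional mean, weighted by its posterior responsibility given x₁); we assume this agrees with the equations of Section 3.2, and the function names are ours:

```python
import numpy as np

def gauss1d(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def gmm_conditional_mean(x1, weights, means, covs):
    """Optimal (MMSE) non-linear prediction x2_hat = E[x2 | x1] for a
    two-dimensional Gaussian mixture.

    weights: (K,) mixing proportions
    means:   (K, 2) component means
    covs:    (K, 2, 2) component covariances
    """
    x1 = np.asarray(x1, dtype=float)
    num = np.zeros_like(x1)
    den = np.zeros_like(x1)
    for pi_k, mu_k, S_k in zip(weights, means, covs):
        # responsibility of component k given x1 (up to normalization)
        resp = pi_k * gauss1d(x1, mu_k[0], S_k[0, 0])
        # linear conditional mean of x2 given x1 within component k
        cond = mu_k[1] + (S_k[1, 0] / S_k[0, 0]) * (x1 - mu_k[0])
        num += resp * cond
        den += resp
    return num / den
```

For a single component this reduces to the usual linear Gaussian predictor; with several well-separated components it bends toward whichever component x₁ currently makes most probable, producing exactly the kind of highly non-linear curve shown in Fig 3.5a.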

In comparison, consider now the optimal linear predictor. This predictor is derived based on the assumption that x = [x₁, x₂]′ is described by a single multivariate Gaussian even though, in fact, it is a mixture of multivariate Gaussians, as shown in Fig 3.5a. Specifically, we can calculate the mean vector and covariance matrix of the multivariate Gaussian mixture of Fig 3.5a and use them to produce an optimal linear predictor (as if the mean vector and covariance matrix were for a single multivariate Gaussian) [127]. This predictor is a straight line in the (x₁, x₂) plane, as shown in Fig 3.5b. Since the three

“central regions” in Fig 3.5b (the same as those in Fig 3.5a) are not located along a straight line and are not too close to each other, the optimal linear predictor is only relatively close to one of them. As a result, the prediction error can be large for many (x₁, x₂)'s in the other two central regions and small for many (x₁, x₂)'s that are close to the (predictor) straight line but far from any central region. What this means is that when an image is “from” a multivariate Gaussian mixture model, the optimal linear predictor derived based on a single multivariate Gaussian model may produce large prediction errors on many “background points” and small prediction errors on some anomalies, thereby making the prediction error an unreliable indicator for anomaly detection.
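The linear predictor discussed here can be sketched in the same setting: collapse the mixture to its overall mean and covariance (law of total covariance) and form the single-Gaussian linear MMSE predictor, which is the straight line of Fig 3.5b. Names and structure are ours:

```python
import numpy as np

def mixture_moments(weights, means, covs):
    """Overall mean and covariance of a Gaussian mixture
    (law of total mean / total covariance)."""
    m = np.einsum("k,kd->d", weights, means)
    d = means - m  # component-mean deviations from the overall mean
    C = np.einsum("k,kij->ij", weights, covs) + np.einsum("k,ki,kj->ij", weights, d, d)
    return m, C

def linear_predictor(weights, means, covs):
    """Optimal *linear* MMSE predictor of x2 from x1, derived as if the
    mixture were a single Gaussian with the same first two moments.
    Returns slope a and intercept b of x2_hat = a * x1 + b."""
    m, C = mixture_moments(weights, means, covs)
    a = C[1, 0] / C[0, 0]
    b = m[1] - a * m[0]
    return a, b
```

With a mixture whose central regions do not lie along one line, the resulting straight line must compromise among all of them at once, which is why its prediction error behaves poorly compared to the conditional-mean curve.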

In this chapter, we have derived an optimal predictor for an important class of non-
