
Document: 54 Video Scanning Format Conversion and Motion Estimation (PDF)


DOCUMENT INFORMATION

Basic information

Format
Number of pages: 20
File size: 398.14 KB

Contents

de Haan, G. "Video Scanning Format Conversion and Motion Estimation." Digital Signal Processing Handbook. Ed. Vijay K. Madisetti and Douglas B. Williams. Boca Raton: CRC Press LLC, 1999.

© 1999 by CRC Press LLC

54 Video Scanning Format Conversion and Motion Estimation

Gerard de Haan
Philips Research Laboratories

54.1 Introduction
54.2 Conversion vs. Standardization
54.3 Problems with Linear Sampling Rate Conversion Applied to Video Signals
     Temporal Interpolation • Vertical Interpolation and Interlaced Scanning
54.4 Alternatives for Sampling Rate Conversion Theory
     Simple Algorithms • Advanced Algorithms
54.5 Motion Estimation
     Pel-Recursive Estimators • Block-Matching Algorithm • Search Strategies
54.6 Motion Estimation and Scanning Format Conversion
     Hierarchical Motion Estimation • Recursive Search Block-Matching
References

54.1 Introduction

The scanning format of a video signal is a major determinant of general picture quality. Specifically, it determines such aspects as stationary and dynamic resolution, motion portrayal, aliasing, scanning structure visibility, and flicker. Various formats have been designed and standardized to strike a particular balance between quality, cost, transmission capacity, and compatibility with other standards.

The field of video scanning format conversion is concerned with the translation of video signals from one format into another. It consists of two basic parts: temporal interpolation and spatial interpolation. A particular case is de-interlacing, which poses an inseparable spatio-temporal interpolation problem.

Vertical and temporal interpolation cause practical and fundamental difficulties in achieving high-quality scanning format conversion. This is because the conditions of the sampling theorem are generally not met in video signals. If they were satisfied, standard conversions of arbitrary accuracy would be possible using suitable linear filters.
The earlier conversion methods neglected the fundamental problems and, consequently, negatively influenced the resolution and the motion portrayal. More recent algorithms apply motion vectors to predict the position of moving objects at unregistered temporal instances to improve the quality of the picture at the output format. A so-called motion estimator extracts these vectors from the input signal. The motion vectors partly solve the fundamental problems, but the demands on the motion estimator for scanning format conversion are severe.

In this section we shall first briefly indicate why we can expect that the importance of scanning format conversion will grow. Then we discuss in more detail the fundamental problems of temporal interpolation of video signals. Next we provide a concise overview of the basic methods in scanning format conversion, focused on temporal sampling rate conversion and de-interlacing. Finally, we give an overview of motion estimation algorithms, which are crucial in the more advanced scanning format converters.

54.2 Conversion vs. Standardization

Scanning formats have been designed in the past to strike a particular compromise between quality, cost, transmission capacity, and compatibility with other standards. There were three main formats in use a decade ago: 50 Hz interlaced, 60 Hz interlaced, and 24 (or 25) Hz progressive (film). With the arrival of video-conferencing, HDTV, workstations, and PCs, many new video formats have appeared. These include low-end formats such as CIF and QCIF with smaller picture size and lower frame rates, progressive and interlaced HDTV formats at 50 Hz and 60 Hz, and other video formats used on computer workstations and enhanced television displays with field rates up to 100 Hz. It will be clear that the problem of scanning format conversion is of growing importance, despite many attempts to globally standardize video formats.
54.3 Problems with Linear Sampling Rate Conversion Applied to Video Signals

High-quality scanning format conversion is difficult to achieve, as the conditions of the sampling theorem are generally not met in video signals. The solution of Sample Rate Conversion (SRC) for systems satisfying the conditions of the sampling theorem is well known for arbitrary sampling ratios [1]. Figure 54.1 illustrates the procedure for a ratio of 2. To arrive at the double output sampling rate, in a first step, zero-valued samples are inserted between every input pair of samples. In a second step, a low-pass filter (LPF) at the output rate is applied to remove the first repeat spectrum from the input data. In case of a temporal SRC, the interpolating LPF has to be a temporal LPF, i.e., a filter including picture delays. Though feasible, this makes it a fairly expensive filter.

A more complicated, though still not fundamental, problem occurs at the signal acquisition stage. Since scenes do occur with almost unlimited spatial and/or temporal bandwidth, the sampling theorem requires that this signal be low-pass filtered prior to the scanning process. Interlaced scanning, as commonly applied, even demands two-dimensional prefiltering in the vertical-temporal frequency plane. In a video system, it is the camera that samples the scene in a vertical and temporal sense; therefore, the prefilter has to be realized in the optical path. Although there are considerable practical problems achieving this filtering, it would apparently reduce the problem of temporal interpolation of video images to the common sampling rate conversion problem. The next section will show, however, that in addition to the practical problems there is a fundamental problem as well.

54.3.1 Temporal Interpolation

Considering the eye's sine-wave temporal frequency response for full brightness potential and full field display [2], as shown in Fig. 54.2, temporal prefiltering with a bandwidth of 75 Hz at first sight seems sufficient.
The fundamental problem now is that the relation shown in Fig. 54.2 holds for temporal frequencies as they occur at the retina of the observer. These frequencies, however, equal the frequencies at the display only if the eye is stationary with respect to this display. Particularly with the eye tracking objects moving on the screen, this assumption is no longer valid. For a tracking observer, very high temporal frequencies on the screen can be transformed to much lower frequencies or even DC at the retina. Consequently, suppression of these frequencies with an interpolating low-pass filter results in excessive blurring of moving objects, as will be discussed next.

FIGURE 54.1: Consecutive steps in upsampling with a factor of two.

Figure 54.3 shows, in a time-discrete representation, a simple object, a square, moving with a constant velocity. Again, in this example, we consider up-sampling with a factor of two. Therefore, the true position of the object is available at every second temporal position only (e.g., the odd numbered samples). The "tracking observer" views along the motion trajectory, represented with a line in the illustration, which results in a stationary image of the object on the retina. If the output field sampling frequency exceeds the cutoff temporal frequency of the human visual system,¹ the viewer will have the illusion that the object is continuously present. Therefore, the object is actually seen at a position corresponding with the motion trajectory.

If now, e.g., in the 6th output field, the object is interpolated according to SRC theory, weighted copies of the object from surrounding fields resulting from the interpolating LPF are displayed. Figure 54.3 illustrates the case of a symmetrical transversal low-pass filter. In this situation, the viewer sees the object at the correct position but also various attenuated and displaced copies (the impulse response of the interpolating temporal filter) of the object in a neighborhood.
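The effect described for the 6th output field can be reproduced with a small numerical sketch. It is illustrative only: the one-dimensional "square", the object speed, and the 4-tap half-band interpolation filter (coefficients -1/16, 9/16, 9/16, -1/16 plus a unity center tap) are assumptions chosen for the demonstration and are not taken from the chapter.

```python
# Sketch: temporal up-sampling by two of a moving 1-D "square".
# All names and numbers here are illustrative assumptions.

def make_field(position, width=24, size=2):
    """One video line containing a bright object of `size` pixels at `position`."""
    return [1.0 if position <= x < position + size else 0.0 for x in range(width)]

def upsample_temporal_x2(fields, taps):
    """Step 1: insert a zero-valued field after every input field.
    Step 2: apply a temporal FIR filter running at the output field rate.
    `taps` maps a temporal offset (in output fields) to a coefficient."""
    width = len(fields[0])
    stuffed = []
    for field in fields:
        stuffed.append(field)
        stuffed.append([0.0] * width)
    out = []
    for n in range(len(stuffed)):
        line = [0.0] * width
        for k, h in taps.items():
            m = n - k
            if 0 <= m < len(stuffed):
                for x in range(width):
                    line[x] += h * stuffed[m][x]
        out.append(line)
    return out

# Symmetrical transversal low-pass filter (assumed coefficients, gain 2).
TAPS = {-3: -1 / 16, -1: 9 / 16, 0: 1.0, 1: 9 / 16, 3: -1 / 16}
SPEED = 4  # object displacement in pixels per input field period (assumption)

inputs = [make_field(2 + SPEED * i) for i in range(5)]  # object at x = 2, 6, 10, 14, 18
outputs = upsample_temporal_x2(inputs, TAPS)

sixth_field = outputs[5]  # 6th output field, halfway between inputs at x = 10 and x = 14
copies = {x: v for x, v in enumerate(sixth_field) if v != 0.0}
print(copies)
```

In this sketch the object should be at x = 12 in the interpolated field, but the linear filter places nothing there. Instead it produces weighted copies at x = 10 and x = 14 (weight 9/16) and at x = 6 and x = 18 (weight -1/16), i.e., the temporal impulse response spread out along the direction of motion.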
The attenuation depends on the coefficients of the interpolating filter, and the distance between the copies is related to the displacement of the moving object in a field period. For the object-tracking observer, therefore, the temporal LPF is transformed into a spatial LPF. For an object velocity of one pixel per field period (one pel/field), its frequency characteristic equals the temporal frequency characteristic of the interpolating LPF.² A velocity of 1 pel/field is slow; in broadcast picture material, velocities exceeding 16 pel/field do occur. Thus, the spatial blur caused by the SRC process becomes unacceptable even for moderate object velocities.

¹ Actually, the picture update frequency may be as low as 16 Hz to guarantee smooth perceived motion (see, e.g., [3]). The higher display rates are merely necessary to prevent the annoying large area flicker.

² It is assumed here that both filters are normalized to their respective sampling frequency.

FIGURE 54.2: The contrast sensitivity of the human observer (y-axis) for large areas of uniform brightness, as a function of the temporal frequency (x-axis).

FIGURE 54.3: The effect of temporal interpolation for an object-tracking observer. The field numbers are counted at the output field rate.

54.3.2 Vertical Interpolation and Interlaced Scanning

Much as in the case of field rate conversion, it may seem that sequential scan conversion is an up-sampling problem for which SRC-theory provides an adequate solution. However, straightforward, one-dimensional up-sampling in the vertical frequency domain is incorrect, as the data is clearly sub-Nyquist sampled due to interlace.

If, more correctly, the sequential scan conversion is considered as a two-dimensional up-sampling problem in the vertical-temporal frequency domain, we arrive at a discussion similar to the one in Section 54.3.1: the problem cannot be solved as we do not know the temporal frequency at the retina of a movement-tracking observer.
It is possible to disregard this problem and to perform a two-dimensional SRC, implicitly assuming a stationary viewer and prefiltered information. Such systems were described and have been implemented for studio applications. With the older image pick-up tubes the results can be satisfactory, as these devices have a poor dynamic resolution. When modern (CCD) cameras are used, however, the limitations of the assumptions become obvious.

54.4 Alternatives for Sampling Rate Conversion Theory

With the problem of linear interpolation of video signals clarified, we will discuss alternative algorithms developed over time. These algorithms fall into two categories. A first category simplifies the interpolation filter prescribed by SRC-theory, considering that a completely correct solution is impossible anyway. The resulting "simple algorithms" are more attractive for hardware realization than the method from which they are derived and under certain conditions can perform quite similarly. The second category includes the most "advanced algorithms" for scanning format conversion. These methods can be characterized by their common attempt to interpolate the 3-D image data in the direction in which the correlation is highest. The difference between the various options lies mainly in the number of possible directions, and dimensions, which are considered. The implementation can show various linear interpolation filters controlled by one or more detectors, or a multi-dimensional nonlinear filter that has an inherent edge adaptivity. As this description allows a large number of algorithms, we will illustrate it with some important examples.

54.4.1 Simple Algorithms

SRC-theory in the temporal and vertical frequency domain is not applicable due to the missing prefilter in common video systems. A sophisticated linear interpolation filter therefore makes little sense.
Any interpolating (spatio-)temporal low-pass filter will suppress original temporal frequency components as well as aliased signal components, as they occupy, by definition, the same spectrum. As only the suppression of the alias is desired, the transfer function of the filter strikes a compromise between alias and blurring. Repetition of the most recent sample in this sense is optimal for the dynamic resolution and worst for alias. A strong temporal low-pass filter suppresses much (not necessarily all) alias and yields a poor dynamic resolution.

The annoyance of the temporal alias depends on the input and output picture frequency, and particularly their difference. In the easiest case, both frequencies are high and their difference is 50 Hz or more. In the worst case, input and output picture rate are low and their difference is on the order of 10 Hz. In case of an annoying beat frequency, an interpolating LPF usually improves picture quality; otherwise the best compromise is closer to repetition of the most recent sample.

54.4.2 Advanced Algorithms

As indicated before, these methods are characterized by their common attempt to interpolate the 3-D image data in the direction in which the correlation is highest. To this end they either have an explicit or implicit detector to find this direction. In case of (1-D) temporal interpolation the explicit detector is usually called a motion detector; for 2-D spatial interpolation it is called an edge detector, while the most advanced device estimating the optimal spatio-temporal (3-D) interpolation direction is usually called a motion estimator. The interpolation filter can be recursive or transversal, and can have any number of taps, but a transversal filter with one or two taps is the most common choice.
For a two-tap FIR approach we can write the interpolated video signal $F_{\mathrm{int}}$, in picture $n$, at spatial position $\mathbf{x} = (x, y)^T$ as a function of the input video signal $F(\mathbf{x}, n)$:

\[
F_{\mathrm{int}}(\mathbf{x}, n) = 0.5 \left[ F\!\left(\mathbf{x} + \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix},\, n + \delta_3\right) + F\!\left(\mathbf{x} - \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix},\, n - \delta_3\right) \right] \tag{54.1}
\]

In this terminology a motion detector controls $\delta_3$, an edge detector $\delta_1$ and $\delta_2$, while a motion estimator can be applied to determine $\delta_1$, $\delta_2$, and $\delta_3$.

Algorithms with a Motion Detector

To detect motion, the difference between two successive pictures is calculated. It is too simple, however, to expect this signal to become zero in a picture part without moving objects. The common problems with the detection are noise and alias. Additional problems occurring in some systems are color subcarriers causing non-stationarities in colored regions, interlace causing non-stationarities in vertically detailed picture parts, and timing jitter of the sampling clock, which is particularly harmful in detailed areas.

All these problems imply that the output of the motion detector usually is not a binary, but rather a multi-level signal, indicating the probability of motion. Usual (but not always valid) assumptions made to improve the detector are:

1. Noise is small and signal is large.
2. The spectrum part around the color carrier carries no motion information.
3. Low-frequency energy in the signal is larger than in the noise and alias.
4. Moving objects are large compared to a pixel.

The general structure of the motion detector resulting from these assumptions is depicted in Figure 54.4. As can be seen, the difference signal is first low-pass (and carrier reject) filtered to profit from assumptions 2 and 3. It also makes the detector less "nervous" for timing jitter in detailed areas. After the rectification another low-pass filter improves the consistency of the motion signal, based on assumption 4.

FIGURE 54.4: General structure of a motion detector.
Finally, the nonlinear (but monotonic) transfer function in the last block translates the signal into a probability figure for the motion, $P_m$, using assumption 1. This last function may have to be adapted to the expected noise level. Low-pass filters are not necessarily linear. More than one detector can be used, working on more than just two pictures in the neighborhood of the current image, and a logical or linear combination of their outputs may lead to a more reliable indication of motion.

The motion detector (MD) is applied to switch or fade between two processing modes, one of which is optimal for stationary and the other for moving image parts. Examples are:

• De-interlacing. The MD fades between intra-field interpolation (line-averaging, or edge dependent spatial interpolation) and inter-field interpolation (repetition of the previous field, averaging of neighboring fields, etc.).
• Field rate doubling on interlaced video. The MD fades between repetition of fields (best dynamic resolution without motion compensation for moving picture parts) and repetition of frames (best spatial resolution in stationary image parts).

To slightly elaborate on the first example of de-interlacing, we define the interpolated pixel $X_m(\mathbf{x}, n)$ in a moving picture part as:

\[
X_m(\mathbf{x}, n) = 0.5 \left[ F\!\left(\mathbf{x} - \begin{pmatrix} 0 \\ 1 \end{pmatrix},\, n\right) + F\!\left(\mathbf{x} + \begin{pmatrix} 0 \\ 1 \end{pmatrix},\, n\right) \right] \tag{54.2}
\]

while for stationary picture parts the interpolated pixel $X_s(\mathbf{x}, n)$ is taken as:

\[
X_s(\mathbf{x}, n) = F(\mathbf{x}, n - 1) \tag{54.3}
\]

and, taking the probability of motion $P_m$ from the motion detector into account, the output is given by:

\[
F_{\mathrm{int}}(\mathbf{x}, n) = P_m\, X_m(\mathbf{x}, n) + (1 - P_m)\, X_s(\mathbf{x}, n) \tag{54.4}
\]

In most practical cases the output $P_m$ has a nonlinear relation with the actual probability.

Algorithms with an Edge Detector

To detect the orientation of a spatial edge, usually the differences between pairs of spatially neighboring pixels are calculated.
Again it is a bit unrealistic to expect that a zero difference is a reliable indication of a spatial direction in which the signal is stationary. The same problems (noise, alias, carriers, timing jitter) occur as with motion detection. The edge detector (ED) is applied to switch or fade between at least two but usually more processing modes, each of them optimal for interpolation of a certain orientation of the spatial edge. Examples are:

• De-interlacing. The ED fades between vertical line-averaging and diagonal averaging (±45°, or even more angles).
• Up-conversion to a higher resolution format. A simple bi-linear interpolation filter is applied with its coefficients adapted to the output of the edge detector.

FIGURE 54.5: Identification of pixels as applied for direction dependent spatial interpolation.

In Fig. 54.5, $X$ is the pixel to be interpolated for the sequential scan conversion, and the result applying pixels in a neighborhood ($A$, $B$, $C$, $D$, $E$, and $F$) is either $X_a$, $X_b$, or $X_c$, where:

\[
X_a = 0.5\,[A + F] = 0.5 \left[ F\!\left(\mathbf{x} - \begin{pmatrix} 1 \\ 1 \end{pmatrix},\, n\right) + F\!\left(\mathbf{x} + \begin{pmatrix} 1 \\ 1 \end{pmatrix},\, n\right) \right] \tag{54.5}
\]

and:

\[
X_b = 0.5\,[B + E] = 0.5 \left[ F\!\left(\mathbf{x} - \begin{pmatrix} 0 \\ 1 \end{pmatrix},\, n\right) + F\!\left(\mathbf{x} + \begin{pmatrix} 0 \\ 1 \end{pmatrix},\, n\right) \right] \tag{54.6}
\]

and:

\[
X_c = 0.5\,[C + D] = 0.5 \left[ F\!\left(\mathbf{x} + \begin{pmatrix} +1 \\ -1 \end{pmatrix},\, n\right) + F\!\left(\mathbf{x} + \begin{pmatrix} -1 \\ +1 \end{pmatrix},\, n\right) \right] \tag{54.7}
\]

The selection of $X_a$, $X_b$, or $X_c$ as the interpolated output $F_{\mathrm{int}}$ is controlled by a luminance gradient indication calculated from the same neighborhood:

\[
F_{\mathrm{int}}(\mathbf{x}, n) =
\begin{cases}
X_a, & \left( |A - F| < |C - D| \,\wedge\, |A - F| < |B - E| \right) \\
X_b, & \left( |B - E| \le |A - F| \,\wedge\, |B - E| \le |C - D| \right) \\
X_c, & \left( |C - D| < |A - F| \,\wedge\, |C - D| < |B - E| \right)
\end{cases} \tag{54.8}
\]

In this example, the gradient is calculated on the same pixels that are used in the interpolation step. This is not necessarily the case. Similar to the earlier described motion detector, it is advantageous to filter the video signal prior to and/or after the rectification in Eq. (54.8).
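As a rough illustration of Eqs. (54.5) to (54.8), the direction-dependent interpolation of one missing line can be sketched as follows. The function names are hypothetical, and two details are assumptions not stated in the text: border pixels use plain line averaging, and the tie case that Eq. (54.8) leaves open (equal diagonal differences, both smaller than the vertical one) falls back to the vertical average $X_b$.

```python
# Sketch of edge-dependent intra-field interpolation, Eqs. (54.5)-(54.8).
# `above` and `below` are the existing lines around the missing line.

def edge_dependent_pixel(above, below, x):
    """Interpolate the missing pixel X at column x (pixel names as in Fig. 54.5)."""
    A, B, C = above[x - 1], above[x], above[x + 1]
    D, E, F = below[x - 1], below[x], below[x + 1]
    d_a, d_b, d_c = abs(A - F), abs(B - E), abs(C - D)
    if d_a < d_c and d_a < d_b:
        return 0.5 * (A + F)          # X_a, Eq. (54.5)
    if d_c < d_a and d_c < d_b:
        return 0.5 * (C + D)          # X_c, Eq. (54.7)
    # X_b, Eq. (54.6); also used for ties not covered by Eq. (54.8) (assumption)
    return 0.5 * (B + E)

def edge_dependent_line(above, below):
    """Interpolate a full line; borders use vertical averaging (assumption)."""
    width = len(above)
    out = []
    for x in range(width):
        if x == 0 or x == width - 1:
            out.append(0.5 * (above[x] + below[x]))
        else:
            out.append(edge_dependent_pixel(above, below, x))
    return out

# A diagonal edge: the step sits at column 3 in the line above
# and at column 1 in the line below.
above = [0, 0, 0, 100, 100, 100]
below = [0, 100, 100, 100, 100, 100]

line_average = [0.5 * (a + b) for a, b in zip(above, below)]
edge_adaptive = edge_dependent_line(above, below)
print(line_average)   # blurred edge
print(edge_adaptive)  # edge kept sharp along the diagonal
```

On this test pattern, plain line averaging yields two half-level pixels across the edge, whereas the direction-dependent selection places a sharp step at column 2, midway between the steps in the neighboring lines.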
Also the decision, i.e., the optimal interpolation angle, can be low-pass filtered to improve the consistency of the interpolation angle. Finally, the edge dependent interpolation can be combined with (motion adaptive or motion compensated) temporal interpolation to improve the interpolation quality of near horizontal edges.

Implicit Detection in Nonlinear Interpolation Filters

Many nonlinear interpolation methods have been described. Most popular is the class of order statistical filters. Combinations with linear (bandsplitting) filters are known, optimizing the interpolation for individual spectrum parts. We will limit ourselves to some basic examples here.

An illustration of a basic inherently adapting filter is shown in Figure 54.6. The line to be interpolated is found as the median of the spatially neighboring lines ($a$ and $b$) and the corresponding line ($c$) from the previous field:

\[
F_{\mathrm{int}}(\mathbf{x}, n) = \operatorname{median}[a, b, c] = \operatorname{median}\!\left[ F\!\left(\mathbf{x} + \begin{pmatrix} 0 \\ 1 \end{pmatrix},\, n\right),\; F\!\left(\mathbf{x} - \begin{pmatrix} 0 \\ 1 \end{pmatrix},\, n\right),\; F(\mathbf{x}, n - 1) \right] \tag{54.9}
\]

with:

\[
\operatorname{median}(X, Y, Z) =
\begin{cases}
X, & \left( Y \le X \le Z \,\vee\, Z \le X \le Y \right) \\
Y, & \left( X < Y \le Z \,\vee\, Z \le Y < X \right) \\
Z, & (\text{otherwise})
\end{cases} \tag{54.10}
\]

FIGURE 54.6: Sequential scan conversion with three-tap vertical-temporal median filtering. The thin lines show which pixels are input for the median filter.

The inherent adaptation to edges is understood as follows. In case of a temporal edge (i.e., motion) larger than the spatial edge (i.e., vertical detail), the difference between $a$ and $b$ is relatively small compared to their difference with $c$. Therefore, an intra-field interpolation results ($a$ or $b$ is copied). In case of a non-moving vertical edge, the difference between $a$ and $b$ will be relatively large compared to the difference between $c$ and $a$ or $b$. In this case, the inter-field interpolation ($c$ is copied) is most likely.

It is possible to combine edge detectors with non-linear filters, e.g., a so-called weighted median filter.
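The three-tap vertical-temporal median of Eqs. (54.9) and (54.10) is small enough to sketch directly. The function names and the sample values below are illustrative assumptions; the case analysis in `median3` follows Eq. (54.10) literally.

```python
# Sketch of three-tap vertical-temporal median de-interlacing, Eqs. (54.9)-(54.10).

def median3(x, y, z):
    """Median of three values, written as the case analysis of Eq. (54.10)."""
    if y <= x <= z or z <= x <= y:
        return x
    if x < y <= z or z <= y < x:
        return y
    return z

def vt_median_line(line_below, line_above, line_prev_field):
    """Interpolate one missing line: a and b come from the current field,
    c is the co-sited line from the previous field."""
    return [median3(a, b, c)
            for a, b, c in zip(line_below, line_above, line_prev_field)]

# Motion: the previous field differs strongly from both neighbors,
# so one of the current-field neighbors is copied (intra-field result).
moving = median3(50, 60, 200)

# Stationary vertical detail: the neighbors differ strongly and the
# previous-field pixel lies between them, so it is copied (inter-field result).
stationary = median3(10, 200, 120)

print(moving, stationary)
```

The two calls mirror the two cases discussed in the text: with a large temporal difference the filter returns a value from the current field, and with a large vertical difference it returns the pixel from the previous field.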
In a weighted median filter, the (integer) weight given to a sample indicates the number of times its value is included in the input of the filter to the ranking stage. An increase of this weight increases the chance this sample value is selected as the median. It therefore provides a method, using the output of an edge detector with uncertainties, to statistically improve the performance of the interpolation.

We will again use Fig. 54.5 to identify the location of the pixels used in the interpolation. The output value for the pixel position indicated with $X$ results as:

\[
F_{\mathrm{int}}(\mathbf{x}, n) = \operatorname{median}\!\left[ A, B, C, D, E, F,\; \alpha \cdot X_{-1},\; \beta \cdot \frac{B + E}{2} \right], \quad (\alpha, \beta \in \mathbb{N}) \tag{54.11}
\]

with:

\[
X_{-1} = F(\mathbf{x}, n - 1), \quad A = F\!\left(\mathbf{x} - \begin{pmatrix} 1 \\ 1 \end{pmatrix},\, n\right), \quad B = F\!\left(\mathbf{x} - \begin{pmatrix} 0 \\ 1 \end{pmatrix},\, n\right), \; \ldots \tag{54.12}
\]

as illustrated in Fig. 54.5. The weighting ($\alpha$ and $\beta$) implies that an assumed "important" pixel is fed more than once to the median calculating circuit:

\[
\alpha \cdot A = \underbrace{A, A, A, \ldots, A}_{\alpha \text{ times}} \tag{54.13}
\]

The combination arises if a motion detector is used to control the weighting factors of the pixel from the previous field and that of the value found by line averaging. A large value of $\alpha$ increases the probability of field insertion, while a large $\beta$ causes an increased probability of line averaging.

Although the examples in this section are limited to de-interlacing, it should be noted that proposals exist for field rate conversion as well.

Algorithms with a Motion Estimator

The idea to interpolate picture content in the direction in which it is most correlated can be extended to a three-dimensional case. This results in an interpolation along the motion trajectory. Figure 54.7 defines the motion trajectory as the line that connects identical picture parts in a sequence [...]...
... prefiltering the video information prior to the motion estimation, but this introduces inaccuracies in detailed picture parts. If the prefiltering and the block size are adapted separately for every step in the search procedure, we arrive at the hierarchical block-matching algorithms, dealt with in the next subsection.

54.6 Motion Estimation and Scanning Format Conversion

In situations where motion vectors are ...

... restrictions, motion compensated interpolation techniques for field rate upconversion and de-interlacing provide the most advanced option. However, they require nontrivial algorithms to measure object displacements between consecutive images. These motion estimation methods therefore shall be discussed more extensively in the next section.

54.5 Motion Estimation

This section provides an overview of motion estimation ...

... $n$ (54.16)

where the DFD is defined as:

\[
\mathrm{DFD}\!\left(\mathbf{x}, \mathbf{D}^{i-1}, n\right) = F(\mathbf{x}, n) - F\!\left(\mathbf{x} - \mathbf{D}^{i-1},\, n - 1\right) \tag{54.17}
\]

and:

\[
\mathbf{D}^{i} = \begin{pmatrix} D_x^{i} \\ D_y^{i} \end{pmatrix} \tag{54.18}
\]

As before, $n$ stands for the field or picture number. The constant $\alpha$ is positive and determines the speed of convergence and the accuracy of the estimate. The value of $\alpha$ is limited to a maximum, since instability or a noisy estimation result can occur for higher values. Equation (54.16) ...

FIGURE 54.7: Identical picture parts of successive images lie on the motion trajectory. Its projection in the image plane is the motion vector.

... of pictures. The projection of this motion trajectory between two successive pictures on the image plane, called the motion vector, is also shown in this figure. Not all temporal information changes can be described adequately as object velocities: e.g., fades and ...

... The estimators applicable for scanning format conversion require additional constraints which are discussed in the last part of this section.

54.5.1 Pel-Recursive Estimators

The category of pel-recursive motion estimators can be derived from iterative methods that use a previously calculated motion vector $\mathbf{D}^{i-1}$ to find the result vector $\mathbf{D}^{i}$ according to:

\[
\mathbf{D}^{i} = \mathbf{D}^{i-1} + \text{update} \tag{54.15}
\]

Several algorithms based ...

\[
\ldots\; \mathbf{D}\!\left(\mathbf{X} - \begin{pmatrix} X \\ 0 \end{pmatrix},\, n\right) - \mathbf{C} \;+\; \beta \cdot \mathbf{D}\!\left(\mathbf{X} - \begin{pmatrix} 0 \\ Y \end{pmatrix},\, n\right) - \mathbf{C} \tag{54.28}
\]

(where the values of $\alpha$ and $\beta$ determine the smoothness and it is proposed to adapt their value in the neighborhood of edges in the image), or implicit through hierarchy or recursion, which will be discussed separately. Again, both classes can be combined.

54.6.1 Hierarchical Motion Estimation

Hierarchical motion estimators realize a consistent velocity ...

... more than two steps. In sub-band coding terminology, a resolution pyramid is built and coarse vectors are estimated on the low frequency band. The result is used as a prediction for a more accurate estimate at the next sub-band, which contains higher frequencies, etc. At the top of the pyramid, the signal is strongly prefiltered and sub-sampled. The bandwidth of the filter increases and the sub-sampling factors ...

... $\mathbf{D}(\mathbf{X}, n)$. To find $\mathbf{D}(\mathbf{X}, n)$, a number of candidate vectors $\mathbf{C}$ are evaluated applying an error measure $\varepsilon(\mathbf{C}, \mathbf{X}, n)$ to quantify block similarity. Figure 54.8 illustrates the procedure.

FIGURE 54.8: Block of size $X \times Y$ in current field $n$ and trial block in search area $SA(\mathbf{X})$ in previous field $n - 1$, shifted over candidate vector $\mathbf{C}$.

More formally, $CS^{\max}$ is defined as the set of candidate vectors $\mathbf{C}$, describing all possible ...

... inherent smoothness constraint, very coherent and close to true-motion vector fields, most suitable for scanning format conversion.

References

[1] Engstrom, E.W., A study of television image characteristics. Part Two. Determination of frame frequency for television in terms of flicker characteristics, Proc. of the I.R.E., 23(4), 295-310, 1935.
[2] van den Enden, A.W.M. and Verhoeckx, N.A.M., Discrete-Time Signal ...

... previous image:

\[
CS^{\max} = \left\{ \mathbf{C} \;\middle|\; -N \le C_x \le +N,\; -M \le C_y \le +M \right\} \tag{54.21}
\]

where $N$ and $M$ are constants limiting $SA(\mathbf{X})$. Furthermore, a block $B(\mathbf{X})$ centered at $\mathbf{X}$ and of size $X \times Y$, consisting of pixel positions $\mathbf{x}$ in the present field $n$, is now considered:

\[
B(\mathbf{X}) = \left\{ \mathbf{x} \;\middle|\; X_x - X/2 \le x \le X_x + X/2 \;\wedge\; X_y - Y/2 \le y \le X_y + Y/2 \right\} \tag{54.22}
\]

The displacement vector $\mathbf{D}(\mathbf{X}, n)$ resulting from the block-matching process is a candidate vector ...
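The candidate set of Eq. (54.21) and the block of Eq. (54.22) suggest a straightforward full-search sketch. The preview text only states that an error measure $\varepsilon(\mathbf{C}, \mathbf{X}, n)$ quantifies block similarity; the sum of absolute differences used below is an assumption, as are the function names, the top-left block addressing (the text centers the block at $\mathbf{X}$), and the choice to skip candidates that would read outside the previous field. The sign convention follows Eq. (54.17), i.e., the current block is compared with the previous field at $\mathbf{x} - \mathbf{C}$.

```python
# Sketch of full-search block matching over the candidate set of Eq. (54.21).
# Error measure (SAD) and all names are assumptions for illustration.

def sad(cur, prev, bx, by, bw, bh, cx, cy):
    """Sum of absolute differences between the block in the current field and
    the block in the previous field shifted over candidate (cx, cy)."""
    total = 0
    for y in range(by, by + bh):
        for x in range(bx, bx + bw):
            total += abs(cur[y][x] - prev[y - cy][x - cx])
    return total

def full_search(cur, prev, bx, by, bw, bh, n_max, m_max):
    """Evaluate every candidate with -N <= Cx <= N and -M <= Cy <= M and
    return (best_vector, best_error). Out-of-frame candidates are skipped."""
    height, width = len(prev), len(prev[0])
    best_vec, best_err = None, None
    for cy in range(-m_max, m_max + 1):
        for cx in range(-n_max, n_max + 1):
            inside = (0 <= bx - cx and bx + bw - cx <= width and
                      0 <= by - cy and by + bh - cy <= height)
            if not inside:
                continue
            err = sad(cur, prev, bx, by, bw, bh, cx, cy)
            if best_err is None or err < best_err:
                best_vec, best_err = (cx, cy), err
    return best_vec, best_err

# Synthetic test data: the current field is the previous one displaced by (2, 1).
W, H = 16, 16
prev = [[x + 100 * y for x in range(W)] for y in range(H)]
cur = [[prev[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(W)]
       for y in range(H)]

vector, error = full_search(cur, prev, bx=6, by=6, bw=4, bh=4, n_max=3, m_max=3)
print(vector, error)
```

On this synthetic ramp the only candidate with zero error is (2, 1), which is the displacement used to build the current field. Real picture material gives no such guarantee of a unique minimum.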

Date posted: 27/01/2014, 03:20
