Face-track extraction from tracked points

To group the faces detected in a video shot into face-tracks, so that each face-track contains many different facial expressions of a single character, one can follow approaches used in contemporary face recognition, for example computing the similarity between faces in an eigenface subspace or comparing color histograms. Alternatively, one of the simplest approaches is to use the position and size of the detected faces: if the positions and sizes of two faces in two consecutive frames are close enough, the faces probably belong to the same character. However, this heuristic often fails when the face changes substantially, for example under strong facial expressions, or when the face boundary is located inaccurately because of face detector errors. Another approach, proposed by Sivic et al. [26], tracks face regions and links them together to form groups. This method is accurate but computationally expensive. To reduce the computational cost while maintaining accuracy, Everingham et al. [7] proposed using tracked points obtained from the Kanade-Lucas-Tomasi (KLT) tracker [25]. For each pair of faces in different frames, the tracked points that pass through both faces are counted; if the number of shared points is large enough relative to the total number of points in the two faces, the two faces are grouped into the same face-track.
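The pairwise criterion of Everingham et al. [7] described above can be sketched in Python as follows. This is a minimal illustration, not their implementation: it assumes the KLT trajectories are already available as per-frame point positions, represents faces as axis-aligned boxes, interprets "the total number of points in the two faces" as the size of the union of the two point sets, and uses a hypothetical threshold `ratio=0.5`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned face box in pixels: top-left corner (x, y), width w, height h."""
    x: float
    y: float
    w: float
    h: float

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h


# One KLT trajectory: frame index -> (x, y) position of the point in that frame.
Track = dict[int, tuple[float, float]]


def points_in_face(tracks: list[Track], frame: int, box: Box) -> set[int]:
    """Indices of the trajectories that are alive in `frame` and lie inside `box`."""
    return {
        i for i, tr in enumerate(tracks)
        if frame in tr and box.contains(*tr[frame])
    }


def same_person(tracks: list[Track], frame_a: int, box_a: Box,
                frame_b: int, box_b: Box, ratio: float = 0.5) -> bool:
    """Link two faces when enough trajectories pass through both of them.

    The shared count is divided by the number of distinct points in the two
    faces (their union); `ratio` is a hypothetical threshold.
    """
    pa = points_in_face(tracks, frame_a, box_a)
    pb = points_in_face(tracks, frame_b, box_b)
    total = len(pa | pb)
    if total == 0:
        return False
    return len(pa & pb) / total >= ratio
```

Applied to every pair of faces in a shot, this test needs a number of comparisons that grows quadratically with the number of faces, which is the cost that the method of Ngo et al. [16] reduces.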

Although the method above works well in most situations, face-tracks can become fragmented because tracked points are very sensitive to illumination changes, occlusion, and false face detections. Ngo et al. [16] proposed a new method that overcomes these obstacles (see Figure A-7). Specifically, the authors made the following improvements:

• A flash detector identifies frames affected by camera flashes, and these frames are removed from the frame sequence before faces are grouped. To detect flash frames, the authors apply a simple method based on the difference in brightness between consecutive frames: if the brightness of a frame rises sharply compared with that of its neighbors, the frame is considered a flash frame and is removed from the video sequence.

• A list L of interest points detected in the face regions of all frames is maintained. L is initialized with the interest points detected in the face regions of the first frame. In each subsequent frame, new points that lie in the face regions of that frame and are not yet in the list are added. The points in the list are tracked until they are lost. In this way, the method handles partial occlusion and the appearance of new faces well.

• For each detected face, a list of indexes referring to points in L is kept. Given the set of face-tracks found in the previous frames, each face in the current frame is compared with every face-track to determine which face-track it belongs to. A single face is compared with a face-track by counting the points that pass through both the newly detected face and the last face of the face-track. If the number of shared points is large relative to the total number of points in the two faces, the face is added to the face-track. In this way, the proposed method can handle false face detections and also avoids comparing every pair of faces in different frames, as Everingham et al. [7] do.
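The flash-frame filter in the first improvement can be sketched as below, assuming grayscale frames given as rows of 0-255 pixel values and using mean gray level as "brightness". The threshold `jump` and the rule "brighter than every neighbor by more than `jump`" are hypothetical choices for illustration; the text does not state the exact criterion used by Ngo et al. [16].

```python
def mean_brightness(frame: list[list[int]]) -> float:
    """Average gray level of a frame given as rows of 0-255 pixel values."""
    total = sum(sum(row) for row in frame)
    count = sum(len(row) for row in frame)
    return total / count


def flash_frames(brightness: list[float], jump: float = 40.0) -> set[int]:
    """Indices of frames brighter than every neighbor by more than `jump`.

    `jump` is a hypothetical threshold on the 0-255 scale. The first and
    last frames are compared with their single neighbor only.
    """
    n = len(brightness)
    flagged = set()
    for i, b in enumerate(brightness):
        neighbors = [brightness[j] for j in (i - 1, i + 1) if 0 <= j < n]
        if neighbors and all(b - nb > jump for nb in neighbors):
            flagged.add(i)
    return flagged


def remove_flash_frames(frames: list[list[list[int]]], jump: float = 40.0):
    """Drop flash frames, keeping (original index, frame) pairs for the rest."""
    levels = [mean_brightness(f) for f in frames]
    flashes = flash_frames(levels, jump)
    return [(i, f) for i, f in enumerate(frames) if i not in flashes]
```

Because a frame is flagged only when it is brighter than both neighbors, a flash that lasts two or more consecutive frames would not be caught by this simple rule.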

Experiments on many different long video sequences, containing 340,844 faces in total, showed that the method of Ngo et al. [16] is robust and efficient. Its result (94.17%) clearly outperforms that of Everingham et al. (81.19%) [7].

• Input:

- Frames in one shot: FrameSet = {Frame_i, i = 1, …, N}, where N is the number of frames in the shot.

- Detected faces: FaceSet = {Face_{i,j}, i = 1, …, N; j = 1, …, N_i}, where N_i is the number of faces in Frame_i.

• Output:

- Face-tracks: FaceTrackSet = {FaceTrack_i, i = 1, …, M}, where each FaceTrack_i = {Face_{m,n}} is a set of detected faces.

Step 1:

- Set FaceTrackSet = Empty.

- Set TrackedPointList = Empty.

Step 2:

- Detect flash frames and remove them from FrameSet.

Step 3:

- Find the first frame Frame_k in FrameSet that contains at least one face.

- Set CurrentFrame = Frame_k.

- Create one face-track for each face of CurrentFrame and add it to FaceTrackSet.

- Call the KLT tracker to find the interest points P_i for the faces of CurrentFrame.

- Add the interest points P_i to TrackedPointList.

- For each face-track, update the indexes of the points in TrackedPointList that lie inside the last face of this face-track.

- Set PreviousFrame = CurrentFrame.

Step 4:

- Get the next frame of FrameSet and set it as CurrentFrame.

- Call the KLT tracker to find the correspondence between the points in PreviousFrame and CurrentFrame.

- For each face in CurrentFrame:

  If this face can be grouped into a face-track FaceTrack_t of FaceTrackSet, add this face to FaceTrack_t.

  Else, create a new face-track containing this face and add it to FaceTrackSet.

- Call the KLT tracker to find interest points for the faces of CurrentFrame, and add to TrackedPointList only the points that do not already exist in it.

- For each face-track, update the indexes of the points in TrackedPointList that lie inside the last face of this face-track.

- If CurrentFrame is the last frame of FrameSet (Frame_N), STOP.

- Set PreviousFrame = CurrentFrame.

Step 5:

- If the points in TrackedPointList have degraded, go to Step 3; otherwise, go to Step 4.

Figure A-7. The face-track extraction algorithm of Ngo et al.
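Steps 3 and 4 of Figure A-7 can be sketched in Python as below. This is a simplified sketch, not the authors' implementation: the KLT tracker and the interest-point detector are passed in as callables (`track_points` and `detect_points` are hypothetical names) rather than implemented; "does not exist in TrackedPointList" is interpreted as "farther than `min_dist` pixels from every point already in L"; the shared-point ratio uses the union of the two point sets with a hypothetical threshold `ratio`; and the stop and degradation checks of Step 5 are omitted. Because the face-track set starts empty, the first frame that contains faces (Step 3) is handled by the general loop.

```python
import math
from dataclasses import dataclass, field
from typing import Callable

Box = tuple[float, float, float, float]  # face box (x, y, width, height)
Point = tuple[float, float]


def inside(p: Point, box: Box) -> bool:
    x, y, w, h = box
    return x <= p[0] < x + w and y <= p[1] < y + h


@dataclass
class FaceTrack:
    faces: list[tuple[int, Box]] = field(default_factory=list)  # (frame, box)
    last_ids: set[int] = field(default_factory=set)  # ids in L inside last face


def extract_face_tracks(
    faces_per_frame: list[list[Box]],
    track_points: Callable[[dict[int, Point], int], dict[int, Point]],
    detect_points: Callable[[int, list[Box]], list[Point]],
    ratio: float = 0.5,
    min_dist: float = 2.0,
) -> list[FaceTrack]:
    """Group faces into face-tracks by matching each face with the last face
    of every face-track through shared tracked points."""
    L: dict[int, Point] = {}  # TrackedPointList: point id -> current position
    next_id = 0
    tracks: list[FaceTrack] = []

    def ids_in(box: Box) -> set[int]:
        return {pid for pid, p in L.items() if inside(p, box)}

    for t, faces in enumerate(faces_per_frame):
        if t > 0:
            # Stand-in for KLT: returns surviving points at their new positions.
            L = track_points(L, t)
        for box in faces:
            current = ids_in(box)
            best, best_score = None, ratio
            for tr in tracks:
                if tr.faces[-1][0] == t:  # already extended in this frame
                    continue
                union = current | tr.last_ids
                if not union:
                    continue
                score = len(current & tr.last_ids) / len(union)
                if score >= best_score:
                    best, best_score = tr, score
            if best is None:  # no face-track matches: start a new one
                best = FaceTrack()
                tracks.append(best)
            best.faces.append((t, box))
        # Add only interest points that are not already (nearly) in L.
        for p in detect_points(t, faces):
            if all(math.dist(p, q) >= min_dist for q in L.values()):
                L[next_id] = p
                next_id += 1
        # Refresh the index lists of face-tracks whose last face is in frame t.
        for tr in tracks:
            frame, box = tr.faces[-1]
            if frame == t:
                tr.last_ids = ids_in(box)
    return tracks
```

Because each face is compared only with the last face of each face-track, the number of comparisons per frame grows with the number of face-tracks rather than with all earlier faces, which is the saving over the pairwise scheme of Everingham et al. [7] noted above.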

Figure A-8. New interest points are generated when a new face appears. The new face appears in the middle frame.

REFERENCES

[1] Arandjelovic, O. and Zisserman, A. (2005), Automatic face recognition for film character retrieval in feature-length films, Proc. CVPR, vol. 1, pp. 860-867.

[2] Arman, F., Hsu, A., and Chiu, M.-Y. (1994), Image processing on encoded video sequences, Multimedia Systems, 1(5), pp. 211-219.

[3] Berg, T. L. et al. (2004), Names and faces in the news, Proc. CVPR, vol. 2, pp. 848-854.

[4] Berg, T. L., Berg, A. C., Edwards, J., and Forsyth, D. (2004), Who's in the Picture?, Neural Information Processing Systems (NIPS).

[5] Boreczky, J. S. and Rowe, L. A. (1996), Comparison of video shot boundary detection techniques, Journal of Electronic Imaging, 5(2), pp. 122-128, April.

[6] Collins, B., Deng, J., Li, K., and Fei-Fei, L. (2008), Towards scalable dataset construction: An active learning approach, Proceedings of the 10th European Conference on Computer Vision, pp. 86-98.

[7] Everingham, M., Sivic, J., and Zisserman, A. (2006), "Hello! My name is... Buffy" - Automatic Naming of Characters in TV Video, Proc. British Machine Vision Conf.

[8] Freund, Y. and Schapire, R. E. (1999), A Short Introduction to Boosting, Journal of Japanese Society for Artificial Intelligence, 14(5), pp. 771-780, September.

[9] Hadid, A. and Pietikäinen, M. (2004), From still image to video-based face recognition: An experimental analysis, Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition (FGR04), pp. 813-818.

[10] Hampapur, A., Jain, R., and Weymouth, T. (1994), Digital video segmentation, Proc. ACM Multimedia 94, pp. 357-364, San Francisco, CA.

[11] Indyk, P. and Motwani, R. (1998), Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the 30th Symposium on Theory of Computing, pp. 604-613.

[12] Le, D. D. and Satoh, S. (2008), Unsupervised Face Annotation by Mining the Web, Proc. IEEE International Conference on Data Mining (ICDM08), pp. 383-392, Dec., Pisa, Italy.

[13] Little, T. D. C., Ahanger, G., Folz, R. J., Gibbon, J. F., Reeve, F. W., Schelleng, D. H., and Venkatesh, D. (1993), A digital on-demand video service supporting content-based queries, Proc. ACM Multimedia 93, pp. 427-436, Anaheim, CA.

[14] Mensink, T. and Verbeek, J. (2008), Improving people search using query expansions: How friends help to find people, ECCV08, vol. II, pp. 86-99, Marseille, France, Oct.

[15] Nagasaka, A. and Tanaka, Y. (1992), Automatic video indexing and full video search for object appearances, Visual Database Systems II, E. Knuth and L. Wegner, Eds., pp. 113-127, Elsevier Science Publishers.

[16] Ngo, T. D., Le, D. D., Satoh, S., and Duong, D. A. (2008), Robust face track finding in video using tracked points, Proc. International Conference on Signal-Image Technology & Internet-Based Systems (SITIS08), pp. 59-64, Bali, Indonesia.

[17] Ojala, T. et al. (2002), Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE TPAMI, 24(7), pp. 971-987.

[18] Ozkan, D. and Duygulu, P. (2006), A graph based approach for naming faces in news photos, Proc. Intl. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 1477-1482.

[19] Petersohn, C. (2004), Fraunhofer HHI at TRECVID 2004: Shot Boundary Detection System, TREC Video Retrieval Evaluation Online Proceedings, TRECVID.

[20] Ramanan, D., Baker, S., and Kakade, S. (2007), Leveraging archival video for building face datasets, Proc. International Conference on Computer Vision (ICCV), pp. 1-8.

[21] Rowley, H. A., Baluja, S., and Kanade, T. (1996), Neural Network-Based Face Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), pp. 23-38.

[22] Shahraray, B. (1995), Scene change detection and content-based sampling of video sequences, Digital Video Compression: Algorithms and Technologies, Proc. SPIE 2419, pp. 2-13.

[23] Satoh, S. (2000), Comparative evaluation of face sequence matching for content-based video access, Proc. of the 4th Intl. Conf. on Automatic Face and Gesture Recognition (FG2000), pp. 163-168, Grenoble, France, Mar.

[24] Satoh, S. and Katayama, N. (1999), An efficient implementation and evaluation of robust face sequence matching, Proceedings of the 10th International Conference on Image Analysis and Processing, pp. 266-271, Venice, Italy, Sept.

[25] Shi, J. and Tomasi, C. (1994), Good features to track, Proc. CVPR, pp. 593-600.

[26] Sivic, J., Everingham, M., and Zisserman, A. (2005), Person spotting: video shot retrieval for face sets, Proc. CIVR, pp. 226-236.

[27] Swanberg, D., Shu, C. F., and Jain, R. (1993), Knowledge guided parsing and retrieval in video databases, Storage and Retrieval for Image and Video Databases, Proc. SPIE 1908, pp. 173-187.

[28] Ueda, H., Miyatake, T., and Yoshizawa, S. (1991), IMPACT: an interactive natural-motion-picture dedicated multimedia authoring system, Proc. CHI 1991, pp. 343-350, ACM, New York.

[29] Viola, P. and Jones, M. (2001), Rapid object detection using a boosted cascade of simple features, Proc. CVPR, vol. 1, pp. 511-518.

[30] Zabih, R., Miller, J., and Mai, K. (1995), A feature-based algorithm for detecting and classifying scene breaks, Proc. ACM Multimedia 95, pp. 189-200, San Francisco, CA.

[31] Zhang, H. J., Kankanhalli, A., and Smoliar, S. W. (1993), Automatic partitioning of full-motion video, Multimedia Systems, 1(1), pp. 10-28.

[32] Zhou, Z.-H., Chen, K.-J., and Dai, H.-B. (2006), Enhancing relevance feedback in image retrieval using unlabeled data, ACM Transactions on Information Systems, 24(2), pp. 219-244.

[33] Zhou, X. S. and Huang, T. S. (2003), Relevance feedback for image retrieval: a comprehensive review, Multimedia Systems, 8(6), pp. 536-544.

[34] Zhou, Z.-H., Semi-Supervised Learning with Application to Image Retrieval (slides), LAMDA.

[35] Zhu, X. (2006), Semi-supervised learning literature survey, Computer Science TR 1530, University of Wisconsin-Madison, February.
