Research on improving techniques for detecting and replacing objects in video

MINISTRY OF EDUCATION AND TRAINING
THÁI NGUYÊN UNIVERSITY
UNIVERSITY OF INFORMATION AND COMMUNICATION TECHNOLOGY

LÊ ĐÌNH NGHIỆP

RESEARCH ON IMPROVING TECHNIQUES FOR DETECTING AND REPLACING OBJECTS IN VIDEO

Major: Computer Science
Code: 48 01 01

DOCTORAL THESIS IN COMPUTER SCIENCE

SCIENTIFIC SUPERVISORS:
Assoc. Prof. Dr. Phạm Việt Bình
Assoc. Prof. Dr. Đỗ Năng Toàn

THÁI NGUYÊN - 2020

DECLARATION

The author declares that this thesis is the author's own research. The research results and conclusions presented in the thesis are truthful and are not copied from any source in any form. All source material consulted is cited and listed in the references in accordance with regulations.

Thái Nguyên, 28 October 2020
Thesis author
Lê Đình Nghiệp

ACKNOWLEDGMENTS

First of all, I would like to express my respect and deep gratitude to Assoc. Prof. Dr. Phạm Việt Bình and Assoc. Prof. Dr. Đỗ Năng Toàn, who guided and supported me and created the best conditions for me to complete this thesis.

I sincerely thank Assoc. Prof. Dr. Phạm Thế Anh for his valuable academic comments and research experience, and for his help throughout the work on this thesis.

I sincerely thank the leadership of the University of Information and Communication Technology, the Faculty of Information Technology, the Department of Computer Science, and the doctoral student management office of the University of Information and Communication Technology, Thái Nguyên University, and especially Dr. Đàm Thanh Phương, for creating favorable conditions for me to complete this thesis.

I thank the Board of Rectors of Hồng Đức University, my colleagues in the Office of Quality Assurance and Testing, the lecturers of the Faculty of Information Technology and Communication of Hồng Đức University, and the staff of the Information Technology Institute, Vietnam National University, Hanoi, who encouraged me and helped me in my work so that I had time to concentrate on the research and on writing this thesis.

In particular, I would like to express my deep gratitude to my father, mother, wife, children, and brothers and sisters, who have always given me warm affection, shared the difficult moments of life with me, and encouraged and helped me throughout the research. This thesis is also a gift that I respectfully dedicate to the members of my family.

Thank you very much!

TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGMENTS
LIST OF ABBREVIATIONS AND SYMBOLS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
    Urgency of the topic
    Research objectives of the thesis
    Research subjects and scope of the thesis
    Contributions of the thesis
    Research methods and content
    Structure of the thesis
CHAPTER 1. OVERVIEW OF THE PROBLEM OF DETECTING AND REPLACING OBJECTS IN VIDEO
    1.1 Overview of video and the problem of detecting and replacing objects in video
        1.1.1 Overview of video
        1.1.2 The problem of replacing objects in video
        1.1.3 Basic concepts
            1.1.3.1 Object localization in video
            1.1.3.2 Object shape recognition in video
            1.1.3.3 Object detection in video
            1.1.3.4 Object segmentation
            1.1.3.5 Video inpainting
            1.1.3.6 Object replacement in video
        1.1.4 Challenges of the object replacement problem
    1.2 Overview of the techniques used in video object replacement systems
        1.2.1 Object localization
            1.2.1.1 Based on feature points
            1.2.1.2 Based on object part models
            1.2.1.3 Based on convolutional neural networks
            1.2.1.4 Detecting advertising objects
        1.2.2 Object shape recognition
            1.2.2.1 Vector quantization
            1.2.2.2 Product quantization
            1.2.2.3 Distance measures
            1.2.2.4 Shape recognition based on ANN search
        1.2.3 Video completion techniques
            1.2.3.1 Sampling-based video inpainting
            1.2.3.2 Image inpainting using DCNNs in 2D space
            1.2.3.3 Video inpainting using DCNNs in 3D space
    Chapter 1 conclusions
CHAPTER 2. OBJECT DETECTION IN VIDEO
    2.1 Object localization in video
        2.1.1 Overview of the YOLO object localization model
        2.1.2 The improved object localization model YOLO-Adv
            2.1.2.1 Improving the loss function
            2.1.2.2 Improving the network architecture
            2.1.2.3 Feature extraction
        2.1.3 Evaluation of the improved model
            2.1.3.1 Test data
            2.1.3.2 Evaluation metrics
            2.1.3.3 Implementation environment
            2.1.3.4 Evaluation
    2.2 Object shape recognition
        2.2.1 The PSVQ indexing model
        2.2.2 ANN search based on hierarchical clustering
        2.2.3 Evaluation
            2.2.3.1 Data and test system configuration
            2.2.3.2 Evaluation of PSVQ coding quality
            2.2.3.3 Evaluation of search speed with PSVQ
            2.2.3.4 Evaluation of the hierarchical clustering search algorithm combined with PSVQ
    Chapter 2 conclusions
CHAPTER 3. OBJECT REPLACEMENT AND VIDEO COMPLETION
    3.1 Object segmentation
        3.1.1 Instance segmentation techniques
        3.1.2 The instance segmentation model
            3.1.2.1 Generating region masks
            3.1.2.2 Instance segmentation with Mask R-CNN
        3.1.3 Experimental results of the segmentation model
    3.2 The video completion model
        3.2.1 Architecture of the V-RBPconv model
        3.2.2 The RBPconv network architecture
        3.2.3 Loss function
        3.2.4 Evaluation of the video completion model
            3.2.4.1 Experimental environment
            3.2.4.2 Qualitative comparison results
            3.2.4.3 Quantitative comparison results
    Chapter 3 conclusions
CONCLUSIONS AND FUTURE WORK
LIST OF SCIENTIFIC PUBLICATIONS RELATED TO THE THESIS
REFERENCES
APPENDICES

LIST OF ABBREVIATIONS AND SYMBOLS

Abbreviation | English term | Meaning
ANN | Approximate Nearest Neighbor | Approximate nearest neighbor
ADC | Asymmetric distance computation | Asymmetric distance computation
AVI | Audio Video Interleave | Multimedia file containing both audio and video
CAM | Class Activation Map | Class activation map
CPU | Central processing unit | Central processing unit
CNN | Convolution Neural Network | Convolutional neural network
DCNN | Deep Convolution Neural Network | Deep convolutional neural network
FID | Frechet Inception Distance | Frechet distance
FVI | Free-form video inpainting | Video completion/reconstruction with masks
FCN | Fully Convolutional Network | Fully convolutional network
GAN | Generative Adversarial Networks | Generative adversarial network
GPU | Graphics processing unit | Graphics processing unit
HD | High Definition | High-definition standard
HOG | Histogram of oriented gradients | Histogram of gradient orientations
IoU | Intersection over Union | Overlap ratio of two bounding boxes
IVFADC | Inverted file index Asymmetric distance computation | Inverted-list index with ADC
LPIPS | Learned Perceptual Image Patch Similarity | Similarity measure between image patches
MSE | Mean square error | Mean squared error
MPEG | Moving Picture Experts Group | Moving Picture Experts Group
NMS | Non-Maxima Suppression | Suppression of non-maximum points
NTSC | National Television System Committee | National Television System Committee
PRM | Peak Response Mapping | Peak response mapping
PSNR | Peak signal-to-noise ratio | Peak signal-to-noise ratio
PAL | Phase Alternation Line | Phase-alternating color television system
PQ | Product quantization | Product quantization
PSL | Peak Stimulation Layer | Peak stimulation layer
PSVQ | Product sub-vector quantization | Product quantization of sub-vector groups
RGB | Red, Green, Blue | RGB color system
RoI | Region of Interest | Region containing the object
R-CNN | Region-based Convolutional Neural Networks | Convolutional neural network based on region proposals
SIFT | Scale-Invariant Feature Transform | Scale-invariant feature transform
SSD | Single Shot Detector | SSD detector
SURF | Speeded up robust features | SURF features
SD | Standard Definition | Standard definition
SSIM | Structural Similarity Index | Structural similarity index
VGG | Visual Geometry Group | Visual Geometry Group
YOLO | You only look once | Network that looks at the object only once

LIST OF TABLES

Table | Title
2.1 | Hardware specifications for the YOLO-Adv experiments
2.2 | Performance on the FlickrLogos-47 dataset
2.3 | Comparison of the mAP of object localization models on the FlickrLogos-32 dataset
2.4 | Feature datasets
2.5 | Parameters used to build the quantizers
3.1 | Comparison of the results of the model used with other methods that use different mask-generation methods in training
3.2 | Quantitative results on the Places2 dataset for the models CA, Pconv, EC and RBPConv
3.3 | Quantitative results on the FVI dataset for the models EC, CombCN, 3Dgated and V-RBPConv
A.1 | Number of objects in the training and test sets of the FlickrLogos-47 dataset

LIST OF SCIENTIFIC PUBLICATIONS RELATED TO THE THESIS

[CT1] Lê Đình Nghiệp, Phạm Việt Bình, Đỗ Năng Toàn, Phạm Thu Hà, Trần Văn Huy (2019), "Cải tiến kiến trúc mạng Yolo cho bài toán nhận dạng logo" [Improving the Yolo network architecture for the logo recognition problem], TNU Journal of Science and Technology, vol. 200, no. 07, pp. 199-205.
[CT2] The-Anh Pham, Van-Hao Le, Dinh-Nghiep Le (2018), "A review of feature indexing methods for fast approximate nearest neighbor search", 5th NAFOSTED Conference on Information and Computer Science (NICS), pp. 372-377.
[CT3] Van-Hao Le, The-Anh Pham, Dinh-Nghiep Le (2019), "Hierarchical product quantization for effective feature indexing", ICT, 26th International Conference on Telecommunications, pp. 386-390.
[CT4] The-Anh Pham, Dinh-Nghiep Le, Thi-Lan-Phuong Nguyen (2019), "Product sub-vector quantization for feature indexing", Journal of Computer Science and Cybernetics, vol. 35, no. 11, pp. 69-83.
[CT5] Lê Đình Nghiệp, Phạm Việt Bình, Đỗ Năng Toàn, Hoàng Văn Thi (2019), "Hoàn thiện vùng phá hủy hình dạng ảnh sử dụng kiến trúc mạng thặng dư nhân chập phần" [Completing damaged regions in images using a residual network architecture with partial convolution], TNU Journal of Science and Technology, vol. 208, no. 15, pp. 19-26.
[CT6] Dinh-Nghiep Le, Van-Thi Hoang, Van-Hao Le, The-Anh Pham (2020), "A study on parameter tuning for optimal indexing on large scale datasets", Journal of Science and Technology on Information and Communications, CS 01, No

APPENDICES

A. The FlickrLogos-47 test dataset

FlickrLogos-47 is an extended and corrected version of FlickrLogos-32, a dataset widely used for logo image retrieval. Because FlickrLogos-32 was designed for retrieving logos in images, its main weakness is that its object-level annotations are not complete or detailed enough for logo detection. In addition, FlickrLogos-32 annotates each image with a single logo instance belonging to a single label, even though the image may contain several instances of that logo or several different logos. This makes sense for image retrieval but is a limitation for recognition.

Figure A.1 Annotations of FlickrLogos-32 (top) and FlickrLogos-47 (bottom), shown as bounding boxes

FlickrLogos-47 fills in the missing annotations, separates the symbol and the accompanying text of a logo and labels them individually, and adds many further samples to overcome the limitations of FlickrLogos-32. The number of classes is raised to 47 by adding images and by splitting some of the classes in FlickrLogos-32: brands in FlickrLogos-32 that consist of both a symbol and text are split into separate classes in FlickrLogos-47. Each image in FlickrLogos-32 contains a logo of a single class, whereas an image in FlickrLogos-47 may contain several instances of the same logo class or of different logos. The noise images in FlickrLogos-32 are removed from FlickrLogos-47. Another difference is the variety of sizes in FlickrLogos-47; in particular, many images contain small logos, which makes recognition harder (Figure A.1).

Figure A.2 Some example images from the FlickrLogos-47 dataset

Because FlickrLogos-47 is re-annotated and an image may contain several logo instances belonging to several classes, the assignment of images to the training and test sets also had to change, since an image must not belong to both sets at once. The training set now consists of 833 images and the test set of 1402 images.

One of the biggest challenges in detecting objects on FlickrLogos-47 is that logo instances appear at many different scales, with large differences between scales. Many logo instances are relatively small, and small object instances are usually much harder to recognize than large ones. The smallest logo instance in the training images is 15px long and the largest is 834px long; the average length is 99px. Image sizes in FlickrLogos-47 also vary widely; the largest image is 1024x768px. Some example images are shown in Figure A.2. The number of objects per class is given in Table A.1.

Class | Train | Test | Class | Train | Test
Adidas (Symbol) | 37 | 104 | Adidas (Text) | 34 | 71
Aldi | 38 | 88 | Apple | 30 | 47
Becks (Symbol) | 52 | 98 | Becks (Text) | 54 | 118
BMW | 29 | 51 | Carlsberg (Symbol) | 30 | 92
Carlsberg (Text) | 40 | 112 | Chimay (Symbol) | 45 | 79
Chimay (Text) | 56 | 83 | CocaCola | 62 | 91
Corona (Symbol) | 32 | 54 | Corona (Text) | 35 | 59
DHL | 51 | 93 | Erdinger (Symbol) | 48 | 70
Erdinger (Text) | 33 | 50 | Esso (Symbol) | 32 | 63
Esso (Text) | 8 | 34 | FedEx | 36 | 60
Ferrari | 29 | 44 | Ford | 30 | 47
Fosters (Symbol) | 33 | 99 | Fosters (Text) | 43 | 98
Google | 33 | 50 | Guinness (Symbol) | 37 | 80
Guinness (Text) | 38 | 103 | Heineken | 63 | 103
HP | 43 | 75 | Milka | 89 | 275
nVidia (Symbol) | 40 | 97 | nVidia (Text) | 40 | 92
Paulaner (Symbol) | 48 | 69 | Paulaner (Text) | 30 | 63
Pepsi (Symbol) | 57 | 194 | Pepsi (Text) | 54 | 140
Rittersport | 87 | 202 | Shell | 34 | 66
Singha (Symbol) | 26 | 56 | Singha (Text) | 26 | 57
Starbucks | 43 | 65 | Stellaartois (Symbol) | 43 | 72
Stellaartois (Text) | 33 | 66 | Texaco | 33 | 56
Tsingtao (Symbol) | 39 | 91 | Tsingtao (Text) | 49 | 95
UPS | 34 | 57 | Total | 1936 | 4032

Table A.1 Number of objects in the training and test sets of the FlickrLogos-47 dataset. For each object class, the training set accounts for about 33% of the total number of objects in that class.

B. The Darknet-53 network architecture

Repeat | Type | Filters | Size | Output
- | Convolution | 32 | 3x3 | 256 x 256
- | Convolution | 64 | 3x3/2 | 128 x 128
1x | Convolution | 32 | 1x1 | -
1x | Convolution | 64 | 3x3 | -
1x | Residual | - | - | 128 x 128
- | Convolution | 128 | 3x3/2 | 64 x 64
2x | Convolution | 64 | 1x1 | -
2x | Convolution | 128 | 3x3 | -
2x | Residual | - | - | 64 x 64
- | Convolution | 256 | 3x3/2 | 32 x 32
8x | Convolution | 128 | 1x1 | -
8x | Convolution | 256 | 3x3 | -
8x | Residual | - | - | 32 x 32
- | Convolution | 512 | 3x3/2 | 16 x 16
8x | Convolution | 256 | 1x1 | -
8x | Convolution | 512 | 3x3 | -
8x | Residual | - | - | 16 x 16
- | Convolution | 1024 | 3x3/2 | 8 x 8
4x | Convolution | 512 | 1x1 | -
4x | Convolution | 1024 | 3x3 | -
4x | Residual | - | - | 8 x 8
- | Avgpool | - | Global | -
- | Connected | - | 1000 | 1000
- | Softmax | - | - | -

C. Details of the RBPconv network architecture

Input: image (512 x 512 x 3)
Output: image (512 x 512 x 3)

Layer | Operations | Output size
1 | ERB(64) | 512 x 512 x 64
2 | ERB(128); max-pooling 2x2, stride = 2 | 256 x 256 x 128
3 | ERB(256); max-pooling 2x2, stride = 2 | 128 x 128 x 256
4 | ERB(512); max-pooling 2x2, stride = 2 | 64 x 64 x 512
5 | ERB(512); max-pooling 2x2, stride = 2 | 32 x 32 x 512
6 | ERB(512); max-pooling 2x2, stride = 2 | 16 x 16 x 512
7 | ERB(512); max-pooling 2x2, stride = 2 | 8 x 8 x 512
8 | ERB(512); max-pooling 2x2, stride = 2 | 4 x 4 x 512
9 | ERB(512); max-pooling 2x2, stride = 2 | 2 x 2 x 512
10 | ERB(1024); max-pooling 2x2, stride = 2 | 1 x 1 x 1024
11 | DRB(512); up-conv 2x2, stride = 2; concatenate (layer 11, layer 9) | 2 x 2 x 512; 2 x 2 x 1024
12 | DRB(512); up-conv 2x2, stride = 2; concatenate (layer 12, layer 8) | 4 x 4 x 512; 4 x 4 x 1024
13 | DRB(512); up-conv 2x2, stride = 2; concatenate (layer 13, layer 7) | 8 x 8 x 512; 8 x 8 x 1024
14 | DRB(512); up-conv 2x2, stride = 2; concatenate (layer 14, layer 6) | 16 x 16 x 512; 16 x 16 x 1024
15 | DRB(512); up-conv 2x2, stride = 2; concatenate (layer 15, layer 5) | 32 x 32 x 512; 32 x 32 x 1024
16 | DRB(512); up-conv 2x2, stride = 2; concatenate (layer 16, layer 4) | 64 x 64 x 512; 64 x 64 x 1024
17 | DRB(256); up-conv 2x2, stride = 2; concatenate (layer 17, layer 3) | 128 x 128 x 256; 128 x 128 x 512
18 | DRB(128); up-conv 2x2, stride = 2; concatenate (layer 18, layer 2) | 256 x 256 x 128; 256 x 256 x 256
19 | DRB(64); up-conv 2x2, stride = 2; concatenate (layer 19, layer 1) | 512 x 512 x 64; 512 x 512 x 128
20 | DRB(3) | 512 x 512 x 3

Excerpts from the thesis:

... indexing for the object detection problem. Research on improving the DCNN networks used in the object replacement and video completion phase. CHAPTER 2. OBJECT DETECTION IN VIDEO. Object detection consists of two processes: localization ...

... recognizing the shape of the objects found in the video. Issue 3: Study and apply object segmentation techniques to extract the visible region of the object. Issue 4: Study and improve video reconstruction/completion techniques ...

... the problem of detecting and replacing objects in video, aiming for high performance in both speed and accuracy. Improve the models used to detect objects in video, comprising localization and recognition of the object's shape. Research on improving techniques ...
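The layer listing in Appendix C can be cross-checked with a short shape trace. The sketch below is only an illustration, not the thesis implementation: it assumes that each ERB/DRB block changes only the channel count, that every 2x2 max-pooling with stride 2 halves the spatial size, that every 2x2 up-convolution with stride 2 doubles it, and that concatenation stacks channels. The names `trace_shapes`, `ENCODER_CHANNELS` and `DECODER_CHANNELS` are hypothetical.

```python
# Illustrative shape trace for the encoder-decoder listing in Appendix C.
# Assumption: ERB/DRB blocks only set the channel count; pooling and
# up-convolution change the spatial size; concatenation adds channels.

ENCODER_CHANNELS = [64, 128, 256, 512, 512, 512, 512, 512, 512, 1024]  # layers 1-10
DECODER_CHANNELS = [512, 512, 512, 512, 512, 512, 256, 128, 64]        # layers 11-19


def trace_shapes(size=512):
    """Return {layer: (height, width, channels)} at the end of each layer."""
    shapes = {}
    for layer, channels in enumerate(ENCODER_CHANNELS, start=1):
        if layer > 1:
            size //= 2  # layers 2-10 end with 2x2 max-pooling, stride 2
        shapes[layer] = (size, size, channels)

    for offset, channels in enumerate(DECODER_CHANNELS):
        layer = 11 + offset
        skip = 9 - offset  # layer 11 pairs with layer 9, ..., layer 19 with layer 1
        size *= 2          # 2x2 up-convolution, stride 2
        skip_h, skip_w, skip_c = shapes[skip]
        if (skip_h, skip_w) != (size, size):
            raise ValueError(f"layer {layer}: skip connection size mismatch")
        shapes[layer] = (size, size, channels + skip_c)  # channel concatenation

    shapes[20] = (size, size, 3)  # final DRB(3) produces the RGB output
    return shapes


if __name__ == "__main__":
    for layer, shape in trace_shapes().items():
        print(f"layer {layer:2d}: {shape[0]} x {shape[1]} x {shape[2]}")
```

Under these assumptions, the sizes after concatenation in layers 11 to 19 and the final 512 x 512 x 3 output agree with the listing, which suggests the skip connections pair each decoder layer with the encoder layer of the same spatial size.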
