DEPARTMENT OF SCIENCE AND TECHNOLOGY OF HO CHI MINH CITY
HO CHI MINH CITY YOUTH UNION
YOUNG SCIENCE AND TECHNOLOGY CREATIVE INCUBATOR PROGRAM

ACCEPTANCE REPORT
(Revised according to the comments of the Acceptance Council, 04/01/2018)

RESEARCH AND DEVELOPMENT OF AN INTELLIGENT VIDEO SEARCH SYSTEM BASED ON VISUAL INFORMATION

PRINCIPAL INVESTIGATOR: NGUYỄN VINH TIỆP
HOST INSTITUTION: CENTER FOR SCIENCE AND TECHNOLOGY DEVELOPMENT FOR YOUTH

Head of the host institution (full name, signature, seal)
Principal investigator (full name and signature): Nguyễn Vinh Tiệp
Director of the Department of Science and Technology
Chair of the Review Council

Contents

1 Introduction
  1.1 Overview
  1.2 Motivation for the project
  1.3 Objectives of the project
  1.4 Structure of the report
2 Related work
  2.1 Image representation with local features
    2.1.1 Image matching with local features
    2.1.2 The bag-of-words model and object retrieval in images
      2.1.2.1 The bag-of-words model in text retrieval
      2.1.2.2 The bag-of-visual-words (BOW) model in image object retrieval
    2.1.3 Geometric constraint verification
    2.1.4 Improving recall: query expansion and feature augmentation
    2.1.5 Combining the methods
  2.2 Image representation with features extracted from deep neural networks
    2.2.1 Convolutional Neural Networks
    2.2.2 High-level feature representation for object retrieval in images
3 Fusing the BOW model with object detection for instance search
  3.1 Introduction
  3.2 Related work
  3.3 Test data and evaluation methodology
  3.4 The object retrieval system
    3.4.1 System overview
    3.4.2 Object detection with the DPM algorithm
  3.5 Fusing the BOW model with a neural-network-based object detector
    3.5.1 Experiments and results
    3.5.2 Combining BOW and DPM
    3.5.3 Combining BOW and DPM with an adaptive coefficient
  3.6 Fusing the BOW model with object detection by exploiting the relation between feature points and object proposals
    3.6.1 Comparison of the proposed method with state-of-the-art methods
    3.6.2 Comparison with the results of TRECVID INS participating teams
  3.7 Conclusion
4 Interactive systems
  4.1 Introduction
  4.2 Building the dataset
  4.3 Architecture of the application system
  4.4 Experiments on the self-collected dataset
    4.4.1 Web application interface for object search
    4.4.2 Feature extraction and index construction
    4.4.3 Experiments and evaluation on the self-collected dataset
  4.5 Potential applications
    4.5.1 Tourism and product information lookup systems
    4.5.2 An image-recall support tool for social network users
5 Conclusion
  5.1 Achieved results
  5.2 Directions for future work
  5.3 Publications
A Published papers
References

ABSTRACT

Intelligent interactive systems have to solve many problems related to their input data channels, such as images, audio, and other sensors. This project focuses on several problems of object instance search in large image and video collections. The task has many practical applications, such as image search, surveillance, brand management, and advertising. However, it also poses many challenges related to the way users formulate queries and to the type of object being searched for. When the query is an example image, users may be interested in the whole scene or in objects of very different sizes. When searching for a specific object instance, for example a small object with little texture, the assumptions of the bag-of-visual-words (BOW) model are violated. Even advanced post-processing techniques for the BOW model, such as geometric constraint verification and query expansion, do not solve this problem. We therefore propose a constraint-verification method that fuses the BOW model (a bottom-up approach) with object detection (a top-down approach). Our main contribution is to effectively exploit the relationship between the positions of visual words and the positions of the object instance proposals estimated by an object detector. While developing the algorithms for visual object instance search, we also built accompanying interactive systems to illustrate the interaction ideas and the potential practical applications, such as a tourism, culture, and product information lookup system, and a recommendation system that helps social network users recall related images.
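The fusion idea sketched in the abstract can be pictured as a late combination of a bottom-up BOW similarity score and a top-down detector confidence for each candidate frame. The Python sketch below is only an illustration under assumptions, not the report's actual formula: the function name `fuse_scores`, the adaptive weight, and the constants are invented to show how the two scores might be blended.

```python
# Illustrative sketch (not the report's exact formula): late fusion of a
# bottom-up BOW similarity score with a top-down detector confidence.
# The adaptive weight trusts the detector more when the BOW evidence is weak,
# e.g. for small, texture-poor query objects.

def fuse_scores(bow_score, det_score, n_matched_words, alpha_max=0.9):
    """Combine BOW and detector scores for one video frame.

    bow_score        -- similarity from the BOW model, assumed in [0, 1]
    det_score        -- confidence of the object detector, assumed in [0, 1]
    n_matched_words  -- number of query visual words matched in the frame
    alpha_max        -- upper bound on the BOW weight (assumed value)
    """
    # Adaptive coefficient: more matched visual words -> trust BOW more.
    alpha = alpha_max * n_matched_words / (n_matched_words + 10.0)
    return alpha * bow_score + (1.0 - alpha) * det_score

if __name__ == "__main__":
    # A frame with few matched words relies mostly on the detector.
    print(fuse_scores(bow_score=0.12, det_score=0.80, n_matched_words=3))
    # A frame with many matched words relies mostly on BOW.
    print(fuse_scores(bow_score=0.75, det_score=0.40, n_matched_words=120))
```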
List of Tables

3.1 Comparison of the fusion methods with fixed coefficients
3.2 Effect of the choice of input features for the neural network on the retrieval results
3.3 Comparison of the proposed method with state-of-the-art methods on the INS2013 and INS2014 datasets
3.4 Comparison of the proposed method with state-of-the-art methods on the INS2013 and INS2014 datasets
3.5 Notation of the configurations
3.6 Effect of each component of the formula on the final accuracy
4.1 Some domains, objects, and the corresponding numbers of collected videos in the self-built dataset
4.2 Some domains, objects, and the corresponding numbers of collected videos in the self-built dataset

List of Figures

1.1 Operational model of an object retrieval system on a large video database
1.2 Example of a query with a given example image. The object of interest can be the whole image (a, b, c) or a part of the image (the regions outlined in red in d, e, f)
2.1 Classification of representation features by their ability to describe objects in images. From low to high: low-level, high-level, and semantic-level features, corresponding to the ability to represent independent parts, independent objects, and relations between objects
2.2 The burstiness phenomenon: features belonging to the same visual word "burst" within an image. Image taken from [24]
2.3 Left: query images in which the object of interest is marked with a rectangle. Middle: result images returned by the BOW model; they are relatively sharp and show the object as fully as in the query. Right: result images found with the AQE method but not found by the BOW model; they are usually small or partially occluded compared with the query. Image taken from [11]
2.4 Similarity scores plotted in decreasing order for two different representation vectors, BOW (top) and GIST (bottom). The BOW feature gives good results (AP = 0.9083) while the GIST feature performs considerably worse (AP = 0.0025). The BOW curve has an "L" shape: the similarity score drops quickly when moving from relevant to irrelevant images, whereas the GIST curve decreases slowly and there is little difference between two consecutive top-ranked images
2.5 One stage of a CNN
2.6 A deep CNN consisting of several stages and fully connected layers at the end
2.7 The CNN architecture used in the ImageNet Classification 2012 challenge
3.1 Example query objects in the TRECVID INS dataset. The query objects are small, have little texture, are captured from different viewpoints, and are marked with a purple outline
3.2 Cases in which geometric constraint verification selects the wrong object. Each example is a pair of video frames: the left image contains the query object circled in red, and the right image is taken from a video clip ranked highly after geometric verification. (a) The Mercedes logo has contours similar to a chair in the right-hand frame. (b) A police hat has the same pattern as a tie. (c) An 'F'-shaped necklace is confused with a cluttered background region. (d) The characters of the query logo are confused with a leaflet in the reference video frame
3.3 Object retrieval system on a large video database
3.4 Comparison of the fusion method using adaptive coefficients with the baseline average-fusion method on the INS2013 and INS2014 query sets
3.5 Four types of visual word pairs exploiting position information relative to object proposal bounding boxes
3.6 Impact of the top K shots on the performance of the system
3.7 Comparison of the proposed method with all 74 configurations of the teams participating in TRECVID INS 2013
3.8 Comparison of the proposed method with all 50 configurations of the teams participating in TRECVID INS 2014
4.1 Some images of videos collected from internet sources
4.2 System architecture
4.3 Web application interface for object search
4.4 Result interface of the web application
PREFACE

Project title: Research and development of an intelligent video search system based on visual information
Principal investigator: Nguyễn Vinh Tiệp
• Host institution: Center for Science and Technology Development for Youth
• Duration: 12 months
• Approved budget: 80 million VND
• Allocated budget: according to Notice No. /TB-SKHCN

Objective: research and develop a software system that lets users search a large video collection for the video segments containing information corresponding to a given real-world image or object.

Project contents:
1. Content 1: Survey of research directions in retrieving videos from large video collections using visual information — presented in Chapter 2 of this report.
2. Content 2: Propose a processing pipeline and algorithms for retrieving videos from large video collections using visual information — presented in Chapter 3.
3. Content 3: Implement the proposed algorithms and run experiments to evaluate their accuracy and efficiency on standard international benchmark datasets — presented in Chapter 3.
4. Content 4: Experiments on the standard TRECVID INS dataset — presented in Chapter 3.
5. Content 5: Build a video dataset and carry out experiments on this dataset — presented in Chapter 4.

Chapter 1. Introduction

1.1 Overview

Intelligent interactive applications are receiving more and more attention because of their practical uses in everyday life. These systems use input channels that mimic the human senses: vision, audition, gustation, olfaction, and so on. Among them, the visual channel is the most widely used and has many applications, for example in systems related to augmented reality. Such systems require solving the problem of image-based information retrieval. At the same time, with the development of recording devices, from professional cameras to consumer devices such as mobile phones and compact cameras, the amount of image and video data shared on web portals and social networks keeps growing. This inevitably creates the need to search images accurately and within a reasonable time, a problem of interest not only to the research community but also to industry. Large organizations and research groups around the world have built large video databases for different tasks, for example EVVE [37], Hollywood2 [39], and TRECVID [46], which collect video data from sources such as YouTube, Hollywood movies, and the BBC News channel. Among them, the TREC Video Retrieval Evaluation (TRECVID) is a prestigious benchmark organized annually by the US National Institute of Standards and Technology (NIST). It attracts large companies such as Nikon, IBM, KDDI, Kitware, AT&T, SRI, NTT, and NHK, as well as universities and research institutes such as INRIA, NII, CMU, CU, TITECH, UvA, and PKU. The video collections of these datasets amount to hundreds of gigabytes of storage and hundreds of hours of footage.

Appendix A. Published papers

Fig. Proposed system for searching based on semantic description.

– Main Objects: We extract objects in regions that users may be interested in using a saliency map [5]. To classify objects in such regions, we use the VGG-16 network proposed by Simonyan and Zisserman [6]. We sample an original video frame into overlapping 224 × 224 patches, then feed them to our pre-trained feedforward network. Feature maps from the output activation are aggregated with an average pooling approach, and the five objects with the highest scores are used to represent the video frame (a simplified sketch of this step follows this list).
– Scene Attributes: We can use descriptions of scenes, e.g. indoor, outdoor, building, park, kitchen, etc., to query video shots. We use the state-of-the-art method [3] to extract scene attributes. This method was trained on the MIT scene and SUN attribute datasets.
– Object Relationships: We may need to express a complicated query with dense relationships between objects. To deal with this problem, we propose to use a Convolutional Neural Network–Recurrent Neural Network (CNN-RNN) [7] to generate sentences from the detected objects.
– Metadata: The metadata of a video, such as its title, content summary, or tags, may be available. Such data often reflects the main topic of a video but does not provide many details. In some cases, we can exploit such information to improve the performance of our system by combining it with the other semantic concepts mentioned above.
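As a rough illustration of the "Main Objects" step, the sketch below splits a frame into overlapping 224 × 224 patches, scores each patch, average-pools the class scores, and keeps the five highest-scoring classes. `classify_patch` is a stub standing in for the pre-trained VGG-16 network, and the stride and class count are assumed values, not taken from the paper.

```python
# Sketch of the "Main Objects" step: overlapping 224x224 patches, patch-level
# classification, average pooling of the scores, top-5 class selection.
import numpy as np

NUM_CLASSES = 1000  # assumed, e.g. ImageNet classes for VGG-16

def classify_patch(patch):
    """Stub classifier: returns a probability vector over NUM_CLASSES."""
    rng = np.random.default_rng(abs(hash(patch.tobytes())) % (2 ** 32))
    scores = rng.random(NUM_CLASSES)
    return scores / scores.sum()

def main_objects(frame, patch=224, stride=112, top_k=5):
    """Average-pool patch-level class scores and return the top_k class ids."""
    h, w = frame.shape[:2]
    scores = np.zeros(NUM_CLASSES)
    n = 0
    for y in range(0, max(h - patch, 0) + 1, stride):
        for x in range(0, max(w - patch, 0) + 1, stride):
            scores += classify_patch(frame[y:y + patch, x:x + patch])
            n += 1
    scores /= max(n, 1)                        # average pooling over patches
    return np.argsort(scores)[::-1][:top_k]    # five most likely objects

if __name__ == "__main__":
    fake_frame = np.zeros((480, 640, 3), dtype=np.uint8)
    print(main_objects(fake_frame))
```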
2.2 Action detection from static images

A video frame contains not only objects but also relationships and actions of objects. To solve a complex retrieval task corresponding to a query sentence that involves actions and/or object relationships, we propose an action detector which describes relationships between objects or actions of an object. In this way, we can search video shots based on complex sentences that contain actions. To learn the model for action detection, we use the end-to-end network proposed in YOLO [2]. The figure below illustrates an example of a handshaking action; the action is represented in a small region of a video frame. As most action datasets do not point out exactly which regions contain actions, unlike object datasets such as ImageNet or Pascal VOC, we manually create our own action dataset with about 100 popular actions.

Fig. Action detection example.

2.3 Building the inverted index

After extracting semantic features, the search task becomes a text-based retrieval one. This stage indexes the semantic text returned from the previous stage. A standard tf-idf scheme is used to calculate the weight of each word. In the online search stage, the system computes similarity scores between the query text and the video semantic features using the inverted index structure.
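A minimal sketch of such a tf-idf inverted index follows, with toy data in place of the real per-frame semantic words; the weighting is the textbook tf-idf formula, which may differ in detail from the system's implementation.

```python
# Toy tf-idf inverted index over per-shot semantic words (illustration only).
import math
from collections import Counter, defaultdict

docs = {                                 # shot id -> extracted semantic words
    "shot_001": ["dog", "bicycle", "park", "dog"],
    "shot_002": ["car", "street", "night"],
    "shot_003": ["dog", "person", "street"],
}

N = len(docs)
df = Counter(w for words in docs.values() for w in set(words))
idf = {w: math.log(N / df[w]) for w in df}

inverted = defaultdict(list)             # word -> [(shot id, tf-idf weight)]
for doc_id, words in docs.items():
    tf = Counter(words)
    for w, f in tf.items():
        inverted[w].append((doc_id, (f / len(words)) * idf[w]))

def search(query):
    """Accumulate tf-idf weights of query words per shot and rank shots."""
    scores = defaultdict(float)
    for w in query.lower().split():
        for doc_id, weight in inverted.get(w, []):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search("dog on the street"))
```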
3.1 High-saliency object filtering

One of the difficult problems in video search is that there can be too many objects in a video frame, so we do not know which objects users really focus on. To tackle this problem, we propose to use high-saliency object detection: we focus only on high-saliency objects in a video frame. By selecting regions of interest from complex scenes, we can reduce noise in the result. The figure below shows an example of a saliency map, which highlights the regions of an image that users may be interested in.

Fig. Example of a saliency map.

3.2 Locally Regional Object Proposal

Fig. Objects detected by the YOLO object detector and indexed by dividing the video frame into a 7 × 7 grid.

Finding an exact scene requires specific and discriminative cues, and one of the most discriminative cues in a scene is its object instances. Therefore, we propose a Locally Regional Object Proposal to search by object instances. First, we extract objects from each video frame with YOLO (version 2) [2], one of the state-of-the-art object detectors. The YOLO network was trained on the COCO dataset [8], which comprises 80 concepts. We use YOLO to obtain the bounding box of each object in a video frame. Second, we perform soft indexing by dividing the video frame into a regular 7 × 7 grid. Each object detected by the YOLO detector lies in one or several cells, and we index it with two attributes: the object name and the cells containing it. For instance, in the figure above each detected object covers several cells; the dog is indexed with the cells (2,1), (2,2), (3,1), (3,2), (4,1), (4,2), (5,1), (5,2), (6,1), and (6,2), and the bicycle and the truck are indexed similarly. Note that rows are numbered from top to bottom and columns from left to right. To search for a video frame containing an object, we create a sketch on a blank 7 × 7 grid that represents the objects in the video frame of interest: for each object, we mark all the cells in the grid that may contain it. Each cell in the sketch can contain multiple objects. Since all the data was indexed beforehand, we can search for a video frame quickly. In this way, we can search for a frame containing an object (of one of the 80 classes) at a given position.
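The grid-based soft indexing can be sketched as follows. The detections, frame size, and query marks below are made up stand-ins for real YOLO output, and the sketch uses 0-indexed cells while the worked example above is 1-indexed; it only illustrates the mapping from boxes to cells and the cell-wise lookup.

```python
# Sketch of locally regional object proposal indexing: map each detected box
# to the 7x7 grid cells it overlaps, then answer sketch queries by cell lookup.
GRID = 7

def box_to_cells(box, frame_w, frame_h, grid=GRID):
    """Return the grid cells (row, col) covered by box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cw, ch = frame_w / grid, frame_h / grid
    cols = range(int(x1 // cw), min(int(x2 // cw), grid - 1) + 1)
    rows = range(int(y1 // ch), min(int(y2 // ch), grid - 1) + 1)
    return {(r, c) for r in rows for c in cols}

# frame id -> list of (class name, bounding box); pretend YOLO produced these
detections = {
    "frame_17": [("dog", (30, 120, 200, 430)), ("bicycle", (180, 60, 600, 400))],
    "frame_42": [("truck", (300, 40, 630, 300))],
}

index = {}                                  # (class, cell) -> set of frame ids
for fid, objs in detections.items():
    for name, box in objs:
        for cell in box_to_cells(box, 640, 480):
            index.setdefault((name, cell), set()).add(fid)

def query(sketch):
    """sketch: list of (class name, cell) marks; a frame must satisfy all of them."""
    hits = [index.get(mark, set()) for mark in sketch]
    return set.intersection(*hits) if hits else set()

# "A dog somewhere in the lower-left cells of the frame"
print(query([("dog", (3, 0)), ("dog", (3, 1))]))
```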
4 Post-processing

4.1 Color-based filtering

We target the Known-Item Search scenario, in which users search for a short video segment that they know either visually or from a textual description. Based on a rough color sketch, we can retrieve all video frames whose color distribution is similar to that of the sketch. To do this, we consider using Color-Based Searching [9]. The retrieval model is based on feature signatures, a flexible image descriptor capturing distinct color regions in video key-frames.
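The filtering idea (not the feature-signature model itself) can be illustrated with plain RGB histograms; the histogram resolution and the acceptance threshold below are arbitrary assumptions made only for the sketch.

```python
# Simplified colour filter: keep key-frames whose colour distribution is close
# to the user's rough colour sketch (RGB histograms stand in for signatures).
import numpy as np

def color_histogram(image, bins=8):
    """Normalised joint RGB histogram of an HxWx3 uint8 image."""
    hist, _ = np.histogramdd(
        image.reshape(-1, 3), bins=(bins, bins, bins), range=[(0, 256)] * 3
    )
    return hist.ravel() / hist.sum()

def histogram_intersection(h1, h2):
    return float(np.minimum(h1, h2).sum())   # 1.0 means identical distributions

def filter_frames(frames, sketch_image, threshold=0.3):
    """Keep frame ids whose histogram intersects the colour sketch enough."""
    sketch_hist = color_histogram(sketch_image)
    kept = []
    for fid, img in frames.items():
        if histogram_intersection(color_histogram(img), sketch_hist) >= threshold:
            kept.append(fid)
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = {f"frame_{i}": rng.integers(0, 256, (120, 160, 3), dtype=np.uint8)
              for i in range(3)}
    sketch = np.full((120, 160, 3), 200, dtype=np.uint8)   # mostly bright sketch
    print(filter_frames(frames, sketch))
```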
4.2 Instance search

In AVS tasks, the output of the system is a ranked list of shots that are relevant to a given verbal expression. After using semantic concepts to retrieve an initial ranked list, we propose to extend the query using an instance search system. In this paper, we use a framework that leverages the advantages of a local-feature-based representation model and a deep-feature-based object detector [10]. In the TRECVID INS task, our method achieves about 42.42% MAP.
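MAP figures such as the 42.42% above are means of per-query average precision over ranked lists. The snippet below shows the textbook computation of AP and MAP for illustration only; TRECVID uses its own evaluation tooling and inferred-AP variants.

```python
# Generic average precision / mean average precision over ranked lists.
def average_precision(ranked_ids, relevant_ids):
    """AP of one query: mean of precision@k over ranks k that hit a relevant item."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

if __name__ == "__main__":
    queries = [
        (["s3", "s1", "s7", "s2"], {"s1", "s2"}),   # AP = (1/2 + 2/4) / 2 = 0.5
        (["s5", "s6"], {"s5"}),                     # AP = 1.0
    ]
    print(mean_average_precision(queries))          # 0.75
```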
Conclusion

We propose a new hybrid method that takes advantage of semantic concept detectors, an action detector working on static images, and a locally regional object proposal. To filter out irrelevant shots, we use color-based signatures and the spatial information of concepts. We also propose an instance search panel to expand the query and improve the recall of the system.

References

[1] Cobârzan, C., Schoeffmann, K., Bailer, W., Hürst, W., Blazek, A., Lokoc, J., Vrochidis, S., Barthel, K.U., Rossetto, L.: Interactive video search tools: a detailed analysis of the video browser showdown 2015. Multimedia Tools and Applications 76(4) (Feb 2017) 5539–5571
[2] Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
[3] Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using Places database. In: Advances in Neural Information Processing Systems 27. Curran Associates, Inc. (2014) 487–495
[4] Patterson, G., Xu, C., Su, H., Hays, J.: The SUN attribute database: Beyond categories for deeper scene understanding. International Journal of Computer Vision 108(1–2) (2014) 59–81
[5] Liu, N., Han, J.: DHSNet: Deep hierarchical saliency network for salient object detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
[6] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
[7] Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: Fully convolutional localization networks for dense captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
[8] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. Springer International Publishing, Cham (2014) 740–755
[9] Blazek, A., Lokoc, J., Skopal, T.: Video retrieval with feature signature sketches. In: Similarity Search and Applications – 7th International Conference, SISAP 2014, Los Cabos, Mexico, October 29–31, 2014, Proceedings (2014) 25–36
[10] Nguyen, V.T., Le, D.D., Salvador, A., Caizhi-Zhu, Nguyen, D.L., Tran, M.T., Duc, T.N., Duong, D.A., Satoh, S., Giró-i-Nieto, X.: NII-HITACHI-UIT at TRECVID 2015. In: TRECVID 2015 Workshop, Gaithersburg, MD, USA (2015)

Video Indexing, Search, Detection, and Description with Focus on TRECVID
Tutorial, ICMR '17, June 6–9, 2017, Bucharest, Romania

George Awad, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, USA, gawad@nist.gov
Vinh-Tiep Nguyen, University of Science, Vietnam National University HCMC, Ho Chi Minh City, Vietnam, nvtiep@fit.hcmus.edu.vn
Duy-Dinh Le, University of Information Technology, Vietnam National University HCMC, Ho Chi Minh City, Vietnam, duyld@uit.edu.vn
Georges Quénot, Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France, Georges.Quenot@imag.fr
Chong-Wah Ngo, Department of Computer Science, City University of Hong Kong, Hong Kong, China, cscwngo@cityu.edu.hk
Cees Snoek, University of Amsterdam, Amsterdam, The Netherlands, cgmsnoek@uva.nl
Shin'ichi Satoh, National Institute of Informatics, Japan, satoh@nii.ac.jp

ABSTRACT

There has been a tremendous growth in video data in the last decade. People are using mobile phones and tablets to take, share, or watch videos more than ever before. Video cameras are around us almost everywhere in the public domain (e.g. stores, streets, public facilities, etc.). Efficient and effective retrieval methods are critically needed in different applications. The goal of TRECVID is to encourage research in content-based video retrieval by providing large test collections, uniform scoring procedures, and a forum for organizations interested in comparing their results. In this tutorial, we present and discuss some of the most important and fundamental content-based video retrieval problems, such as recognizing predefined visual concepts, searching in videos for complex ad-hoc user queries, searching by image/video examples in a video dataset to retrieve specific objects, persons, or locations, detecting events, and finally bridging the gap between vision and language by looking into how systems can automatically describe videos in natural language. A review of the state of the art, current challenges, and future directions, along with pointers to useful resources, will be presented by different regular TRECVID participating teams. Each team will present one of the following tasks.

CCS CONCEPTS: • Information systems → Information retrieval

KEYWORDS: TRECVID, Semantic Indexing, Multimedia Event Detection, Video Search, Instance Search, Video Description

ACM Reference format: George Awad, Duy-Dinh Le, Chong-Wah Ngo, Vinh-Tiep Nguyen, Georges Quénot, Cees Snoek, and Shin'ichi Satoh. 2017. Video Indexing, Search, Detection, and Description with Focus on TRECVID. In Proceedings of ICMR '17, June 6–9, 2017, Bucharest, Romania. DOI: http://dx.doi.org/10.1145/3078971.3079044

Zero-example (0Ex) Video Search (AVS). The TRECVID AVS task models the end-user search use case, where the user is looking for segments of video containing persons, objects, activities, locations, etc., and combinations of the former. Zero-example (0Ex) is basically text-to-video search, where queries are described in text and no visual example is given. Such a search paradigm depends heavily on the scale and accuracy of concept classifiers in interpreting the semantic content of videos. The general idea is to annotate and index videos with concepts during offline processing, and then retrieve videos whose concepts match the query description [12, 13]. 0Ex video search started at the very beginning of TRECVID in 2003, growing from around twenty concepts to currently more than ten thousand classifiers. The queries have also evolved from finding a specific thing (e.g., find shots of an airplane taking off) to detecting complex and generic events (e.g., wedding shower) [18], while the dataset size has expanded yearly from less than 200 hours to more than 5,000 hours of video [17]. This tutorial session will give an overview of the AVS task [1] and the 0Ex search paradigm, with topics in the development of concept classifiers, indexing and feature pooling, query processing and concept selection, and video recounting. Interesting problems to be discussed include how to determine the number of concepts for query answering, and how to identify query-relevant fragments for feature pooling and video recounting. An overview of the methods used by AVS task participants in 2016 will be presented, and a 0Ex baseline system with a few thousand concept classifiers (from the SIN and ImageNet concept banks), built on the Multimedia Event Detection (MED) and AVS datasets, will be introduced and shared in the public domain.
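The 0Ex idea of matching a textual query against offline concept annotations can be sketched as below. The shot and concept scores are invented, and real systems perform far more careful concept selection (synonyms, embeddings, classifier reliability); this is only a toy illustration of the indexing-then-matching flow.

```python
# Toy zero-example (0Ex) search: shots are annotated offline with
# concept-classifier scores; a text query is answered by summing the scores
# of the concepts whose names appear in the query.

# shot id -> {concept name: classifier confidence}, produced offline
shot_concepts = {
    "shot_a": {"airplane": 0.91, "sky": 0.80, "runway": 0.55},
    "shot_b": {"kitchen": 0.76, "person": 0.88, "cooking": 0.64},
    "shot_c": {"airplane": 0.35, "person": 0.71, "crowd": 0.62},
}

def zero_example_search(query, top=3):
    """Rank shots by the summed scores of concepts mentioned in the query."""
    terms = set(query.lower().split())
    scored = []
    for shot, concepts in shot_concepts.items():
        score = sum(conf for name, conf in concepts.items() if name in terms)
        if score > 0:
            scored.append((shot, round(score, 3)))
    return sorted(scored, key=lambda kv: -kv[1])[:top]

print(zero_example_search("find shots of an airplane taking off from a runway"))
```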
Semantic INdexing (SIN). The TRECVID SIN task [3] ran from 2010 to 2015 and evaluated methods and systems for automatic content-based video indexing. The task was defined as follows: given a test collection, a reference shot segmentation, and concept definitions, return for each target concept a list of at most 2000 shot IDs from the test collection, ranked according to their likelihood of containing the target. This tutorial session will give an overview of the SIN task, followed by the description of two main approaches: a "classical" one based on engineered features, classification, and fusion, and a deep learning-based one [4]. A baseline implementation built by the LIG team and the IRIM group will be introduced and shared.

Instance Search (INS). The TRECVID INS task [2] aims at exploring technologies that efficiently and effectively search and retrieve specific objects from videos given visual examples. The task focuses especially on finding "instances" of an object, person, or location, unlike finding objects of specified classes as in the SIN task or ad-hoc video search. This tutorial section will give an overview of the INS task, followed by a standard pipeline including short-list result generation with the bag-of-visual-words technique [20], handling of geometric information and context, efficiency management such as inverted indexing, and so on [11, 19]. A baseline implementation built by the NII team will be introduced and shared.

Multimedia Event Detection (MED). This session will highlight recent research towards detection of events, like 'working on a woodworking project' and 'winning a race without a vehicle', when video examples to learn from are scarce or even completely absent. In the first part of the session we consider the scenario where on the order of ten to a hundred examples are available. We provide an overview of supervised classification approaches to event detection, relying on shallow and deep feature encodings, as well as semantic encodings on top of convolutional neural networks predicting concepts and attributes [9]. As events become more and more specific, it is unrealistic to assume that ample examples to learn from will be commonly available [15, 16]. That is why we turn our attention to retrieval approaches in the second part. The key to event recognition when examples are absent is to have a lingual video representation. Once the video is represented in a textual form, standard retrieval metrics can be used. We cover video representation learning algorithms that emphasize concepts, social tags, or semantic embeddings [7, 10, 14]. We will detail how these representations allow for accurate event retrieval and are also able to translate and summarize events in video content, even in the absence of training examples.

Video to Text (VTT). This tutorial session considers the challenge of matching or generating a sentence for a video. The major challenge in video-to-text matching is that the query and the retrieval set instances belong to different domains, so they are not directly comparable. Videos are represented by audiovisual feature vectors which have a different intrinsic dimensionality, meaning, and distribution than the textual feature vectors used for the sentences. As a solution, many works aim to align the two feature spaces so they become comparable. We will discuss solutions based on low-level, mid-level, and high-level alignment for video-to-text matching [5, 6, 8]. The goal of video-to-text generation is to automatically assign a caption to a video. We will cover state-of-the-art approaches relying on recurrent neural networks atop a deep convolutional network, and highlight recent innovations inside and outside the network architectures. Examples will be illustrated in the context of the new TRECVID video-to-text (VTT) pilot task.

REFERENCES

[1] George Awad, Jonathan Fiscus, Martial Michel, David Joy, Wessel Kraaij, Alan F. Smeaton, Georges Quénot, Maria Eskevich, Robin Aly, and Roeland Ordelman. 2016. TRECVID 2016: Evaluating video search, video event detection, localization, and hyperlinking. In Proceedings of TRECVID, Vol. 2016.
[2] George Awad, Wessel Kraaij, Paul Over, and Shin'ichi Satoh. 2017. Instance search retrospective with focus on TRECVID. International Journal of Multimedia Information Retrieval 6 (2017), 1–29.
[3] George Awad, Cees G. M. Snoek, Alan F. Smeaton, and Georges Quénot. 2016. [Invited Paper] TRECVid semantic indexing of video: A 6-year retrospective. ITE Transactions on Media Technology and Applications 4 (2016), 187–208.
[4] Mateusz Budnik, Efrain-Leonardo Gutierrez-Gomez, Bahjat Safadi, Denis Pellerin, and Georges Quénot. 2016. Learned features versus engineered features for multimedia indexing. Multimedia Tools and Applications (2016), 1–18.
[5] Jianfeng Dong, Xirong Li, Weiyu Lan, Yujia Huo, and Cees G. M. Snoek. 2016. Early embedding and late reranking for video captioning. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 1082–1086.
[6] Jianfeng Dong, Xirong Li, and Cees G. M. Snoek. 2016. Word2VisualVec: Image and video to sentence matching by visual feature prediction. In ArXiv.
[7] Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek. 2014. VideoStory: A new multimedia embedding for few-example recognition and translation of events. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 17–26.
[8] Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek. 2015. Discovering semantic vocabularies for cross-media retrieval. In Proceedings of the 5th ACM International Conference on Multimedia Retrieval. ACM, 131–138.
[9] Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek. 2017. Video2vec embeddings recognize events when examples are scarce. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
[10] Amirhossein Habibian and Cees G. M. Snoek. 2014. Recommendations for recognizing video events by concept vocabularies. Computer Vision and Image Understanding 124 (2014), 110–122.
[11] Duy-Dinh Le, S. Phan, V. Nguyen, C. Zhu, D. M. Nguyen, T. D. Ngo, S. Kasamwattanarote, P. Sebastien, M. Tran, D. A. Duong, and Shin'ichi Satoh. 2014. National Institute of Informatics, Japan at TRECVID 2014. In TRECVID.
[12] Yi-Jie Lu, Phuong Anh Nguyen, Hao Zhang, and Chong-Wah Ngo. 2017. Concept-based interactive search system. In International Conference on Multimedia Modeling. Springer, 463–468.
[13] Yi-Jie Lu, Hao Zhang, Maaike de Boer, and Chong-Wah Ngo. 2016. Event detection with zero example: select the right and suppress the wrong concepts. In Proceedings of the 2016 ACM International Conference on Multimedia Retrieval. ACM, 127–134.
[14] Masoud Mazloom, Efstratios Gavves, and Cees G. M. Snoek. 2014. Conceptlets: Selective semantics for classifying video events. IEEE Transactions on Multimedia 16 (2014), 2214–2228.
[15] Masoud Mazloom, Xirong Li, and Cees G. M. Snoek. 2016. TagBook: A semantic video representation without supervision for event detection. IEEE Transactions on Multimedia 18 (2016), 1378–1388.
[16] Pascal Mettes, Dennis C. Koelma, and Cees G. M. Snoek. 2016. The ImageNet shuffle: Reorganized pre-training for video event detection. In Proceedings of the 2016 ACM International Conference on Multimedia Retrieval. ACM, 175–182.
[17] Xiao-Yong Wei, Yu-Gang Jiang, and Chong-Wah Ngo. 2011. Concept-driven multi-modality fusion for video search. IEEE Transactions on Circuits and Systems for Video Technology 21 (2011), 62–73.
[18] Hao Zhang, Yi-Jie Lu, Maaike de Boer, Frank ter Haar, Zhaofan Qiu, Klamer Schutte, Wessel Kraaij, and Chong-Wah Ngo. 2015. VIREO-TNO @ TRECVID 2015: multimedia event detection. In Proc. of TRECVID.
[19] Cai-Zhi Zhu, Hervé Jégou, and Shin'ichi Satoh. 2013. Query-adaptive asymmetrical dissimilarities for visual object retrieval. In Proceedings of the IEEE International Conference on Computer Vision. 1705–1712.
[20] Cai-Zhi Zhu and Shin'ichi Satoh. 2012. Large vocabulary quantization for searching instances from videos. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval. ACM, 52.

References
[1] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR '12, pages 2911–2918, 2012.
[2] R. Arandjelović and A. Zisserman. All about VLAD. In CVPR, pages 1578–1585, 2013.
[3] Artem Babenko and Victor S. Lempitsky. Aggregating deep convolutional features for image retrieval. CoRR, abs/1510.07493, 2015.
[4] Artem Babenko, Anton Slesarev, Alexander Chigorin, and Victor S. Lempitsky. Neural codes for image retrieval, pages 584–599. Springer International Publishing, Cham, 2014.
[5] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In ECCV, pages 404–417. Springer, 2006.
[6] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary robust independent elementary features. In ECCV, pages 778–792. Springer, 2010.
[7] Song Cao and Noah Snavely. Graph-based discriminative learning for location recognition. In CVPR, pages 700–707, 2013.
[8] Yang Cao, Changhu Wang, Zhiwei Li, Liqing Zhang, and Lei Zhang. Spatial-bag-of-features. In CVPR 2010, pages 3352–3359, June 2010.
[9] Danqi Chen and Christopher Manning. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, Doha, Qatar, October 2014.
[10] O. Chum, A. Mikulik, M. Perdoch, and J. Matas. Total recall II: Query expansion revisited. In CVPR '11, pages 889–896, 2011.
[11] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, 2007.
[12] E. J. Crowley and A. Zisserman. The state of the art: Object retrieval in paintings using discriminative regions. In British Machine Vision Conference, 2014.
[13] G. E. Dahl, Dong Yu, Li Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Trans. Audio, Speech and Lang. Proc., 20(1):30–42, January 2012.
[14] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR '05, pages 886–893, 2005.
[15] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
[16] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, September 2010.
[17] Asja Fischer and Christian Igel. Training restricted Boltzmann machines: An introduction. Pattern Recognition, 47(1):25–39, 2014.
[18] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[19] Ross Girshick. Fast R-CNN. In ICCV, 2015.
[20] A. Gordoa, J. A. Rodríguez-Serrano, F. Perronnin, and E. Valveny. Leveraging category-level labels for instance-level image retrieval. In CVPR 2012, pages 3045–3052, June 2012.
[21] Petr Gronat, Guillaume Obozinski, Josef Sivic, and Tomas Pajdla. Learning and calibrating per-location classifiers for visual place recognition. In CVPR, pages 907–914, 2013.
[22] Hervé Jégou and Ondřej Chum. Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In ECCV 2012, pages 774–787. Springer, 2012.
[23] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV '08, Part I, pages 304–317. Springer-Verlag, 2008.
[24] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. On the burstiness of visual elements. In CVPR, 2009.
[25] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Improving bag-of-features for large scale image search. International Journal of Computer Vision, 87(3):316–336, 2010.
[26] Herve Jegou, Hedi Harzallah, and Cordelia Schmid. A contextual dissimilarity measure for accurate and efficient image search. In CVPR 2007, pages 1–8. IEEE, 2007.
[27] Hervé Jégou and Andrew Zisserman. Triangulation embedding and democratic aggregation for image search. In CVPR, Columbus, United States, June 2014.
[28] Hongwen Kang, Martial Hebert, and Takeo Kanade. Image matching with distinctive visual vocabulary. In WACV 2011, pages 402–409. IEEE, 2011.
[29] Jan Knopp, Josef Sivic, and Tomas Pajdla. Avoiding confusing features in place recognition. In ECCV, pages 748–761. Springer, 2010.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[31] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR '06, volume 2, pages 2169–2178. IEEE Computer Society, 2006.
[32] Duy Dinh Le, Cai-Zhi Zhu, Sang Phan, Sabastien Poullot, Duc Anh Duong, and Shin'ichi Satoh. National Institute of Informatics, Japan at TRECVID 2013. In TRECVID, Orlando, Florida, USA, 2013.
[33] Honglak Lee, Peter Pham, Yan Largman, and Andrew Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in Neural Information Processing Systems 22, pages 1096–1104. Curran Associates, Inc., 2009.
[34] Stefan Leutenegger, Margarita Chli, and Roland Y. Siegwart. BRISK: Binary robust invariant scalable keypoints. In ICCV 2011, pages 2548–2555. IEEE, 2011.
[35] Ting Liu, Charles Rosenberg, and Henry A. Rowley. Clustering billions of images with large scale nearest neighbor search. In WACV '07, pages 28–28. IEEE, 2007.
[36] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, November 2004.
[37] J. Delhumeau, M. Douze, J. Revaud, and H. Jégou. EVVE dataset.
[38] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[39] Marcin Marszalek, Ivan Laptev, and Cordelia Schmid. Hollywood2 dataset.
[40] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference, pages 36.1–36.10. BMVA Press, 2002. doi:10.5244/C.16.36.
[41] Krystian Mikolajczyk and Cordelia Schmid. An affine invariant interest point detector. In ECCV, pages 128–142. Springer, 2002.
[42] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors. Int. J. Comput. Vision, 60(1):63–86, October 2004.
[43] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, 2005.
[44] Eva Mohedano, Kevin McGuinness, Noel E. O'Connor, Amaia Salvador, Ferran Marques, and Xavier Giro-i-Nieto. Bags of local convolutional features for scalable instance search. In ICMR '16, pages 327–331. ACM, 2016.
[45] David Nister and Henrik Stewenius. Scalable recognition with a vocabulary tree. In CVPR '06, volume 2, pages 2161–2168. IEEE, 2006.
[46] Paul Over, Jon Fiscus, Greg Sanders, David Joy, Martial Michel, George Awad, Alan Smeaton, Wessel Kraaij, and Georges Quénot. TRECVID 2014 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2014. NIST, USA, 2014.
[47] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
[48] Michal Perdoch, Ondrej Chum, and Jiri Matas. Efficient representation of local geometry for large scale object retrieval. In CVPR 2009, pages 9–16, 2009.
[49] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[50] James Philbin, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
[51] Danfeng Qin, Stephan Gammeter, Lukas Bossard, Till Quack, and Luc Van Gool. Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors. In CVPR 2011, pages 777–784. IEEE, 2011.
[52] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops 2014, pages 512–519, 2014.
[53] Ali Sharif Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. Visual instance retrieval with deep convolutional networks. CoRR, abs/1412.6574, 2014.
[54] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[55] Edward Rosten, Reid Porter, and Tom Drummond. Faster and better: A machine learning approach to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):105–119, 2010.
[56] Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH 2014, pages 338–342, 2014.
[57] Gerard Salton and Chris Buckley. Improving retrieval performance by relevance feedback. Readings in Information Retrieval, 24(5):355–363, 1997.
[58] Amaia Salvador, Xavier Giro-i-Nieto, Ferran Marques, and Shin'ichi Satoh. Faster R-CNN features for instance search. In CVPR Workshops, 2016.
[59] Grant Schindler, Matthew Brown, and Richard Szeliski. City-scale location recognition. In CVPR 2007, pages 1–7. IEEE, 2007.
[60] Xiaohui Shen, Zhe Lin, J. Brandt, S. Avidan, and Ying Wu. Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking. In CVPR 2012, pages 3013–3020, June 2012.
[61] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, volume 2, pages 1470–1477, October 2003.
[62] Henrik Stewénius, Steinar H. Gunderson, and Julien Pilet. Size matters: exhaustive geometric verification for image retrieval. In ECCV 2012, pages 674–687. Springer, 2012.
[63] Engin Tola, Vincent Lepetit, and Pascal Fua. A fast local descriptor for dense matching. In CVPR 2008, pages 1–8. IEEE, 2008.
[64] Giorgos Tolias and Yannis S. Avrithis. Speeded-up, relaxed spatial matching. In ICCV 2011, pages 1653–1660, 2011.
[65] Giorgos Tolias and Hervé Jégou. Local visual query expansion: Exploiting an image collection to refine local descriptors. PhD thesis, INRIA, 2013.
[66] Akihiko Torii, Josef Sivic, and Tomas Pajdla. Visual localization by linear combination of image descriptors. In ICCV Workshops 2011, pages 102–109. IEEE, 2011.
[67] Akihiko Torii, Josef Sivic, Tomas Pajdla, and Masatoshi Okutomi. Visual place recognition with repetitive structures. In CVPR, pages 883–890, 2013.
[68] Panu Turcot and D. Lowe. Better matching with fewer features: The selection of useful features in large database recognition problems. In ICCV Workshop on Emergent Issues in Large Amounts of Visual Data (WS-LAVD), volume 4, 2009.
[69] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1582–1596, 2010.
[70] Ji Wan, Dayong Wang, Steven Chu Hong Hoi, Pengcheng Wu, Jianke Zhu, Yongdong Zhang, and Jintao Li. Deep learning for content-based image retrieval: A comprehensive study. In MM '14, pages 157–166. ACM, 2014.
[71] Xin-Jing Wang, Lei Zhang, and Ce Liu. Duplicate discovery on billion internet images. In CVPR Workshops, pages 429–436, 2013.
[72] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. E. Hinton. On rectified linear units for speech processing. In ICASSP, Vancouver, 2013.
[73] S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas. Query specific rank fusion for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4):803–815, April 2015.
[74] Wei Zhang and Chong-Wah Ngo. Searching visual instances with topology checking and context modeling. In ICMR '13, pages 57–64. ACM, 2013.
[75] Yimeng Zhang, Zhaoyin Jia, and Tsuhan Chen. Image retrieval with geometry-preserving visual phrases. In CVPR '11, pages 809–816, 2011.
[76] L. Zheng, S. Wang, and Q. Tian. Lp-norm IDF for scalable image retrieval. IEEE Transactions on Image Processing, 23(8):3604–3617, August 2014.
[77] Liang Zheng, Shengjin Wang, Lu Tian, Fei He, Ziqiong Liu, and Qi Tian. Query-adaptive late fusion for image search and person re-identification. In CVPR, June 2015.
[78] Yan-Tao Zheng, Ming Zhao, Yang Song, Hartwig Adam, Ulrich Buddemeier, Alessandro Bissacco, Fernando Brucher, Tat-Seng Chua, and Hartmut Neven. Tour the world: building a web-scale landmark recognition engine. In CVPR 2009, pages 1085–1092. IEEE, 2009.
[79] Zhiyuan Zhong, Jianke Zhu, and Steven C. H. Hoi. Fast object retrieval using direct spatial matching. IEEE Trans. Multimedia, 17(8):1391–1397, 2015.
[80] Wengang Zhou, Ming Yang, Houqiang Li, Xiaoyu Wang, Yuanqing Lin, and Qi Tian. Towards codebook-free: Scalable cascaded hashing for mobile image search. IEEE Trans. Multimedia, 16(3):601–611, 2014.
[81] Xiao Zhou, Cai-Zhi Zhu, Qiang Zhu, S. Satoh, and Yu-Tang Guo. A practical spatial re-ranking method for instance search from videos. In ICIP 2014, pages 3008–3012, October 2014.
[82] Cai-Zhi Zhu, Herve Jegou, and Shin'ichi Satoh. Query-adaptive asymmetrical dissimilarities for visual object retrieval. In ICCV 2013, pages 1705–1712. IEEE, 2013.
