International Journal of Pattern Recognition and Artificial Intelligence
Vol. 29, No. (2015) 1555010 (37 pages)
© World Scientific Publishing Company
DOI: 10.1142/S0218001415550101
Int J Patt Recogn Artif Intell Downloaded from www.worldscientific.com by UNIVERSITY OF OTAGO on 07/02/15 For personal use only

Keyword Visual Representation for Image Retrieval and Image Annotation

Nhu Van Nguyen*,†,‡, Alain Boucher*,†,§ and Jean-Marc Ogier*,¶
*Lab L3I, University of La Rochelle, La Rochelle, France
†IFI, MSI; IRD, UMI 209 UMMISCO, Vietnam National University, Hanoi, Vietnam
‡nhu-van.nguyen@univ-lr.fr
§alain.boucher@univ-lr.fr
¶Jean-Marc.Ogier@univ-lr.fr

Received December 2013
Accepted 13 April 2015
Published 24 June 2015

Keyword-based image retrieval is more comfortable for users than content-based image retrieval. Because of the lack of semantic description of images, image annotation is often used a priori by learning the association between the semantic concepts (keywords) and the images (or image regions). This association issue is particularly difficult but interesting, because it can be used not only for annotating images but also for multimodal image retrieval. However, most of the association models are unidirectional, from image to keywords. In addition, existing models rely on a fixed image database and prior knowledge. In this paper, we propose an original association model which provides image-keyword bidirectional transformation. Based on the state-of-the-art Bag of Words model for image representation, and including a strategy of interactive incremental learning, our model works well with a zero-or-weak-knowledge image database and evolves from it. Objective quantitative and qualitative evaluations of the model are proposed in order to highlight the relevance of the method.

Keywords: Image retrieval; image annotation; incremental learning; user interaction.

1 Introduction

Among existing image retrieval systems, there are two main categories of methods used to
search images: methods based on textual information and those based on visual information. The first category relies on the textual metadata of each image to search by keywords. The second relies on the content of each image to search for images similar to those given by the user. The text and the visual content correspond to different semantic levels. The text carries more semantic information, while the visual content is more perceptual. These two types of information are complementary and provide very different ways to search for images.

‡Corresponding author.

Searching by keywords generally provides better results than searching by content in terms of response time and accuracy. Moreover, forming a query by examples is more difficult than forming a query by keywords, since the user must provide example images that are not always available and not always representative of all the user's intentions. This is a major problem in Content-Based Image Retrieval (CBIR) systems. However, to perform a query by keywords, annotations must be available for the images. This approach requires a priori annotation of the image database, a very laborious, time-consuming and often subjective task. Search by content therefore becomes necessary when text annotations are missing or incomplete. Furthermore, content-based retrieval can potentially improve accuracy even if prior textual annotation exists, particularly through the informational content of the images.

In the context of specialized applications for image retrieval, there are many types of complex systems in which the image database evolves in real time, generally starting with zero or weak knowledge. This hypothesis is true for many supervision or control applications, e.g. video surveillance, video monitoring, natural disaster management. In most of these
applications, the image volume is not fixed, and the database typically contains old images and new incoming images, images which may already be processed/indexed, and images waiting to be processed and indexed.

In this type of application, we have identified three possible query types for image retrieval: (1) example image, (2) keyword, (3) example image and keyword combined. The last two types raise many problems, especially when all or part of the image database is not annotated, which makes that part inaccessible through textual queries. Moreover, most automatic knowledge learning methods give poor subjective performance. In this applicative context, the interaction between users, domain experts and the system can be used to improve the system knowledge, simply by clicking on a few relevant/irrelevant images.

In this work, therefore, we study (1) the interactive learning of associations between visual features and keywords and (2) the use of these associations for two applications. Using these associations, we can propagate annotations to unlabeled images in the database (application: image annotation). The associations also help to represent a textual query by visual features, and therefore give our system the ability to retrieve unlabeled images or new incoming images by textual query (application: image retrieval by textual query or by mixed image-text query).

From a user's perspective, our work focuses on a user-oriented image retrieval system. The main objective of an image retrieval system is to provide users with effective tools for browsing and searching, and it is essential for the system design to be centered on the human/user. We believe that understanding the user's intentions plays a key role in an image retrieval system. We have identified three levels of interaction between users and the image retrieval system:

Level 0: The user has no well-defined intention at the beginning and just wants to explore the collection of images and to select images that he likes.
Level 1: The user is very clear about what he wants from the system. During an exploration and retrieval session, he provides feedback to the system in order to reach a satisfying final result quickly.
Level 2: The user is very clear about what he wants from the system. By providing feedback, he wants the system to learn the knowledge and memorize it for use in future sessions.

In our work, we suppose that the user is an expert who has knowledge about the data and the domain of the system. From long-term user relevance feedback, the system knowledge can be learnt from the early life of the system without prior knowledge.

From a scientific perspective, two main issues are studied in this paper. The first issue is to link content-based and keyword-based image retrieval. Low-level content-based retrieval is coupled with high-level keyword-based queries from the user. The second issue concerns non-annotated or poorly annotated image databases. Usually, an image retrieval system works without knowledge or with a priori knowledge. With the proposed learning method, the system knowledge can be constructed from zero. The knowledge of the system is based on the annotations (images represented by textual keywords) but also on the visual content, in which keywords are translated into visual features (text represented by images).

We present and discuss related works in Sec. 2. We propose an original model, named BoK for "Bag of KVRs" (Keyword Visual Representations), which represents associations between semantic concepts and visual features on top of the well-known Bag of Words (BoW) model [24] (Sec. 3). An interactive and incremental learning method is proposed for building the associations (Sec. 4). This model not only helps to improve the performance of image annotation and image retrieval but also allows retrieving images
using textual queries in non-annotated or poorly annotated image databases (Sec. 6).

2 Related Work

The co-occurrence model proposed by Mori et al. represents the first approach for associations between text and image [20]. First, images of the training set are divided into regions that inherit all the keywords of the original images they come from. Visual descriptors are then extracted from each region. All descriptors are clustered into a number of groups, each of which is represented by its center of gravity. Last, the probability of each keyword for each of the region groups can be measured.

Duygulu et al. [5] proposed a translation model to represent the relationships between text and content. In their view, visual features and text are two languages that can be translated from one to the other. Using a translation table that holds estimated probabilities of translation between image regions and keywords, an image is annotated by choosing the most probable keyword for each of its regions.

Barnard et al. [1] have extended the translation model of Duygulu et al. [5] to a hierarchical model. It combines the "aspect" model [9], which builds a joint distribution of documents and features, with a soft clustering model [1], which maps documents into clusters. Images and text are generated by nodes arranged in a tree structure. The nodes generate image regions using a Gaussian distribution, and keywords using a multinomial distribution.

Jeon et al. [11] suggested improvements to the results of Duygulu et al. [5] by introducing a language generation model, called the Cross-Media Relevance Model (CMRM). First, they use the same process as Duygulu et al. for calculating the representation of images (represented by blobs). Then, Duygulu et al. made the assumption that there is a one-to-one correspondence between regions and words, while Jeon et al.
assume that a set of blobs is related only to a set of words. Thus, instead of seeking a probabilistic translation table, CMRM simply calculates the probability of observing a set of blobs and keywords in a given image.

Lavrenko et al. [13] showed that modeling features directly with a Continuous-space Relevance Model (CRM) avoids the loss of information caused by quantizing features to produce the dictionary in the CMRM model [11]. Using continuous probability density functions to estimate the probability of observing a particular region in an image, they showed that, on the same dataset, the model is much more effective than the models proposed by Duygulu et al. [5] and Jeon et al. [11].

Some studies have attempted to use the LSA technique for combining visual and textual features, including Hofmann [9] and Monay and Gatica-Perez [19], who applied Probabilistic Latent Semantic Analysis to automatic image annotation. With this approach, text and visual features are both considered as "terms". It assumes that each term may come from a number of latent topics, and that each image can contain multiple topics.

In the transformation model [15], the text query is automatically converted into visual representations for image retrieval. First, the relationships between text and images are learnt from a set of images annotated with text descriptions. A transmedia dictionary, similar to a bilingual dictionary, is built from the training set.

Chang and Chen [3] propose the opposite, which is to translate an image query into a text query. Starting from both textual and visual queries, the authors transform the visual queries into textual queries, and thus acquire new textual queries. They then apply text retrieval techniques to the initial textual queries and to the new textual queries constructed from the visual query. Finally, they merge the results.
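To make the co-occurrence idea that opens this section more concrete, the following minimal Python sketch estimates the probability of each keyword given a region cluster from simple counts. It is an illustration under stated assumptions, not the implementation of Mori et al.: the function name `keyword_probabilities`, the two input dictionaries and the toy data are hypothetical, and the clustering of region descriptors is assumed to have been done beforehand.

```python
from collections import Counter, defaultdict


def keyword_probabilities(region_clusters, image_keywords):
    """Estimate P(keyword | cluster) from co-occurrence counts.

    region_clusters: dict image_id -> list of cluster ids (one per region)
    image_keywords:  dict image_id -> list of keywords of that image
    Both structures are hypothetical; they stand in for a clustered,
    annotated training set.
    """
    counts = defaultdict(Counter)
    for image_id, clusters in region_clusters.items():
        keywords = image_keywords.get(image_id, [])
        for cluster in clusters:
            # Each region inherits all the keywords of its source image.
            for keyword in keywords:
                counts[cluster][keyword] += 1

    probabilities = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        # Normalize the counts of one cluster into a distribution.
        probabilities[cluster] = {k: n / total for k, n in counter.items()}
    return probabilities


# Toy data (assumed): three images, regions already assigned to clusters.
region_clusters = {"img1": [0, 1], "img2": [0, 2], "img3": [0]}
image_keywords = {
    "img1": ["sky", "sea"],
    "img2": ["sky", "helicopter"],
    "img3": ["sky"],
}
table = keyword_probabilities(region_clusters, image_keywords)
```

In this toy data, cluster 0 occurs in all three images, so "sky" accounts for three of the five labels it inherits and receives the largest share of its probability mass.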
Recently, nearest neighbor methods, which treat image annotation as an image retrieval problem, have received more attention. Makadia et al. [17] introduce a baseline technique that transfers keywords to an image from its nearest neighbors. A combination of basic distances, called Joint Equal Contribution (JEC), is computed on low-level image features to find the nearest neighbors; the keywords are then assigned using a greedy label transfer mechanism. A more complex nearest-neighbor-type model called TagProp is proposed by Guillaumin et al. [8]. The model combines a weighted nearest-neighbor approach with metric learning capabilities in a discriminative framework, which allows the integration of metric learning by directly maximizing the log-likelihood of the tag predictions in the training set. In Ref. 29, the authors propose to use both similar and dissimilar images together with a group-sparsity-based feature selection method. The paper provides an effective way to select features in the image annotation task, which had not been well investigated before.

2.1 Remaining issues

In the existing models presented above, we have identified some remaining issues that have driven the approach presented in the next section.

Image segmentation. Many of the above models use image segmentation (co-occurrence model, translation model, CMRM and transformation model). Image segmentation is a very difficult task in the field of image processing. There is no general solution, and it is often combined with domain knowledge to solve the problem effectively for a given domain. Performance that depends on image segmentation is therefore fragile in general.

Unavailability of the transformation from text to images. The transformation from text to images is very useful for retrieving non-annotated images with a textual query. We can search an image database which is not annotated, or only partially annotated, by using a textual query if we turn it into a visual query. Most methods of text/image
association are designed to transform low-level visual features into keywords (image annotation). Only the transformation model [15] proposes the conversion from text to images. This model provides a visual representation of keywords using mutual information between text and blobs, offering the ability to annotate images and to search for images by text. However, one disadvantage of this model is its reliance on image segmentation.

Constraints on the availability of knowledge. Whatever the text/image association model, the existence of a priori knowledge, often represented as annotations, is absolutely essential for the learning phase. This phase is extremely time consuming for the user, and particularly complex for specialized applications. The association learning phase is mostly off-line, and the problem is notably difficult when the knowledge evolves, for example, to integrate new knowledge. The performance of these models is not easy to improve. Developing approaches to enrich the system knowledge is therefore particularly crucial, and these approaches should be possible without requiring off-line calculations.

2.2 Our proposed approach

As part of this work, we aim to answer the issues raised in the previous section. First, in order to avoid depending on the quality of a segmentation phase, we work in a context without segmentation. In order to support image retrieval by textual queries independently of any manual annotation, we propose to add the bidirectional transformation between text and image. Finally, we place ourselves in a system with incremental knowledge learning, which requires no special knowledge at the beginning of the life of the system. This constraint seems essential, but also realistic, because most applications do not have specialized
knowledge in their early life.

In our model, text/image associations are learnt by an incremental learning method via relevance feedback, without any knowledge at first. Unlike other models where prior knowledge is available, in our system knowledge comes from user interactions. Therefore, our system knowledge is progressively improved over time through interactions, without requiring any off-line learning stage.

We first summarize here the assumptions on which we rely for the development of our model and its context of use:
the work is done on a large image database without prior knowledge;
the volume of images is not fixed, and new incoming images are added over time;
the system knowledge is based on the annotation of images (images represented by text keywords) plus some learnt representation of keywords (text represented by visual features);
interactive learning can be done in a reinforcement and/or incremental way;
the interaction between the users, the domain experts and the system to improve overall system knowledge should be done through simple clicks on relevant/irrelevant images;
there are very few training data (reinforcement/incremental);
the number of images clicked at each interaction must be low (maximum 20);
image annotation propagation is performed in real time.

Table 1. Comparison of existing text/image association models with our proposed model described in this paper.

System                             Image-to-Text   Text-to-Image   Multimodal Retrieval   Knowledge Source
Co-occurrence model [20]           Yes             No              No                     a priori
Translation model [1,5]            Yes             No              No                     a priori
Latent Semantic Analysis [9,19]    Yes             No              Yes                    a priori
Transformation model [3,15]        Yes             Yes             Yes                    a priori + WordNet
Our model                          Yes             Yes             Yes                    Interaction

Table 1 gives a comparison between all the presented text/image association models and our model.

3 A Bidirectional Association Model between Text and Image

In this section, we propose an association model between text and image, referred to as the "BoK model" (Fig. 3), where a KVR is a possible visual representation for a keyword. To avoid image segmentation, we use the well-known BoW model to represent images. In this model, the image is not represented using regions but using points of interest (detailed in the next section). Another reason for using the BoW model is its effectiveness, confirmed by current research trends on this model [24,25,27].

3.1 Keyword Visual Representation

KVR is an upgraded definition of the Bag of Words (Bag of visual Words) representation [14,24]. In the BoW representation of an image, visual words are based on image features such as interest points, regions, etc. A dictionary of visual words is built using a clustering method over a large set of features, where each visual word represents a group of similar features. An image is represented as a histogram of visual words (a BoW). In our work, we use the SIFT descriptor [16]; the dictionary is constructed using the k-means method and the BoW representation is based on the TF × IDF weighting scheme.

Basically, a KVR is a BoW representation of a region (or a set of similar regions) corresponding to a keyword. For example, let us consider a BoW $V_I$ containing all the $n$ visual words $v_i$ of an image $I$:

$V_I = (v_1, v_2, \ldots, v_n), \quad v_i \in I.$

Let us suppose that this image $I$ has $N$ different regions $R_1, \ldots, R_N$ which correspond to $N$ keywords (objects) $K_1, \ldots, K_N$. Then we can divide $V_I$ into $N$ different BoWs $V_I^1, \ldots, V_I^N$, respectively corresponding to $K_1, \ldots, K_N$:

$V_I^1 \cup V_I^2 \cup \cdots \cup V_I^N = V_I.$

We consider $V_I^i$ as a possible visual representation for keyword $K_i$ (a KVR of $K_i$). The KVR construction is easy if regions are available (i.e. image segmentation, see Fig. 1). However, due to the problem of image segmentation, our approach tries to find the representative visual words (KVR) of each concept in an image without segmenting it. The construction of KVR and BoK in our
system is presented in the following sections.

3.2 BoK model

A KVR is a BoW representation of a region corresponding to a keyword. Considering the fact that there exist several visual representations for one keyword, a keyword can correspond to several regions of the image, as one can see in Fig. 1. In Fig. 2, the word "sky" can be interpreted as three different types of sky: clear sky (blue), sky with clouds (white) and sunset (red sky). In our model, the BoK is created with the assumption that a keyword matches one or more different image regions. A keyword is then represented by a BoK in which a KVR (a bag of visual words) corresponds to a set of similar regions.

Fig. 1. Image with different regions corresponding to different concepts (Sky, Helicopter, Human, Sea). Each region can be represented by a Bag of visual Words, or a possible KVR for the concept.

Fig. 2. The BoK representation. The keyword "Sky" could be one of three types: "Clear", "Cloudy" and "Sunset". The keyword "Sky" is then represented by a bag containing three corresponding KVRs.

To construct a KVR from a set of similar regions $S_r$, we construct the BoW which includes the most frequent visual words in $S_r$. In fact, not all possible visual words have a real meaning, and only a number of them are really significant for characterizing a concept. In order to tackle this problem, we can use two methods. First, we can use a simple threshold to identify the most frequent visual words; this method is used in our experimentation. To be more robust, we can rely on the Zipf distribution of visual word frequencies in $S_r$. This method is presented in our contribution in Ref. 22.

The visual
similarity between two keywords, or between a keyword and an image, is the visual similarity between two Bags of KVRs. To compare two Bags of KVRs, we define a similarity function as follows. Consider two Bags of KVRs $B_1$, $B_2$:

$B_1 = (\mathrm{KVR}_1^1, \mathrm{KVR}_1^2, \ldots, \mathrm{KVR}_1^{k_1}), \quad B_2 = (\mathrm{KVR}_2^1, \mathrm{KVR}_2^2, \ldots, \mathrm{KVR}_2^{k_2}),$

where $\mathrm{KVR}_i^j$ is the $j$th KVR of $B_i$, and $k_1$, $k_2$ are the numbers of KVRs in $B_1$ and $B_2$. The visual similarity between $B_1$ and $B_2$ is defined as:

$\mathrm{Sim}_{visual}(B_1, B_2) = \max_{i,j} \mathrm{Sim}_{visual}(\mathrm{KVR}_1^i, \mathrm{KVR}_2^j) \qquad (1)$

with $i = 1, \ldots, k_1$ and $j = 1, \ldots, k_2$. The visual similarity of two KVRs, or in other words of two BoWs, is the one presented in the BoW model above.

In Fig. 3, the BoK representation can be summarized as follows:
(1) Each image is represented by a bag of visual words (BoW model).
(2) Each group of similar regions of the images in a category is represented by a BoW that we call a KVR.
(3) With the assumption that a keyword matches one or more regions in the images, a keyword is represented by a BoK.

The BoK representation is used for image annotation or image retrieval. It is particularly effective when image annotations in the database are insufficient, a situation in which textual queries could not otherwise be represented by visual features (transforming keywords into images is the second issue raised in the state of the art, see Sec. 2.1).

The BoK is a transformation model, like the model of Ref. 15. While that model [15] uses mutual information and image segmentation to transform a textual query into a visual query, our model takes advantage of the efficiency and simplicity of the BoW model to represent the textual query by visual features, which in this case is the BoW representation. Thus, our model benefits from the efficiency of the BoW model and avoids the problem of image segmentation.

Fig. 3. The BoK model for the text/image association (our
contribution in red, dashed lines) based on the existing BoW representation (blue, continuous lines) (color online).

3.3 KVR operators

A KVR is constructed from a set of similar regions which is updated during the use of the system. Therefore, during the learning of the BoK through interactions with users, KVRs in a Bag can be merged into a new KVR, or a KVR can be divided into two KVRs. These actions are based on the similarity between KVRs, which is tested by the EQUAL operator between two KVRs. For manipulating KVRs, we define four operators: ADD, EQUAL, MERGE and SPLIT. The four operators are used by the incremental/reinforcement learning of the BoK model in the next section. The operators ADD, MERGE and SPLIT are used for the revision of the BoK model, while the EQUAL operator is used for the analysis of the BoK model. We describe in detail the proposed algorithm and the use of operators in the next section.

ADD. The ADD operator is used to add a KVR into a bag as long as it is different from all existing KVRs in the bag. This condition is verified by the operator EQUAL below.

EQUAL. The EQUAL operator is used to determine whether two KVRs are considered close enough to be combined. We propose to use a standard equivalent

Algorithm
Input: A visual query Q_v or a textual query Q_t
Output: BoK updated
Begin
  Step 1: CBIR
  Step 2: Relevance feedback (to the satisfaction of the user)
    User assigns N images with keyword M
    Cluster N images into k groups
    Compute the k new sub-queries using Rocchio's technique
      Sub_query_i = Modify_query(Q, technique_Rocchio)
      KVR_new_i = the last sub-query Sub_query_i
  Step 3: Update (analysis and revision)
    For each new KVR: KVR_new
      For all KVRs of keyword M: KVR_existed
        If EQUAL(KVR_new, KVR_existed)
          KVR_merge = MERGE(KVR_new, KVR_existed)
          If KVR_merge can be divided
            SPLIT(KVR_merge)
        Else
          ADD(KVR_new)
  Step 4: Back to Step 1
End

5 Application of BoK

5.1 Automatic annotation propagation

Annotation propagation is used to label non-annotated images with keywords. While manual annotation requires a lot of effort from users, annotation propagation can be performed automatically. In our system, annotation propagation is updated when Bags of KVRs are updated. Thus, when a KVR of a keyword K is updated, the similarities between the keyword K and the non-annotated images are calculated. When annotating images, we state the hypothesis that only a few keywords (up to five) have significant meaning regarding the content of the image in general. We have limited the automatic annotation of images to a maximum of five keywords per image (it can be fewer but no more). So if the keyword K is among the five most relevant to the image (the closest in the KVR space), then the keyword K is assigned to this image, and the sixth keyword is omitted. Thus, each image always keeps its most relevant keywords, at most five, as annotation. The annotation improves as the BoK of the keywords evolves.

We have improved the annotation propagation by incorporating the correlations between the keywords in the similarity function between a KVR and an image. In our case, the similarity between a KVR and an image is calculated based on the correlation between the keywords. The correlation between two keywords $K_1$ and $K_2$ is calculated as the probability $P(K_1, K_2)$ that these two keywords are present as annotations of the same image. These probabilities of a pair of keywords can be calculated by learning from a set of training data or by using an ontology such as WordNet.(a)

Based on the integration of correlations between keywords, the similarity between an image and a KVR is calculated as follows:

(1) The nearest keyword $K_1$ will have the following similarity with the KVR:

$\mathrm{Sim}(K_1, \mathrm{KVR}) = \mathrm{Sim}_{visual}(K_1, \mathrm{KVR}).$

The similarity $\mathrm{Sim}_{visual}(K_1, \mathrm{KVR})$ is calculated using Eq. (1).

(2) The next keyword $K_n$ will have the similarity:

$\mathrm{Sim}(K_n, \mathrm{KVR}) = \mathrm{Sim}_{visual}(K_n, \mathrm{KVR}) \prod_{i=1}^{n-1} P(K_i, K_n).$

5.2 Image retrieval using textual query

The above-mentioned problems of image retrieval, namely missing annotations and query formation, can be solved by using the BoK model. In our system, we can perform image retrieval using textual queries even though textual information is not initially available in the database. By taking advantage of user interaction, the system builds its knowledge, in other words annotations at different levels, which changes over time and allows users to continue using the system. We can automatically transform a textual query into a visual query by using the BoK (Fig. 8). Thus, for a partially annotated image database, or a new database without knowledge/annotation, we can use textual queries as soon as the first KVRs are built (that is to say, from the partial annotation of the early interactions). Initially, for an image database without annotation, with few interactions conducted and therefore few KVRs built, the results are obviously not the best, but they have the merit of existing and of making the system usable from scratch. In addition, these results improve gradually with the use of the system (that is to say, with the construction and refinement of KVRs). As more interaction sessions are done, more images are manually annotated, more annotations are propagated to the image database, and better results are obtained from textual querying.

Fig. 8. CBIR by using BoK model.

(a) WordNet: http://wordnet.princeton.edu/

5.3 Large-scale datasets

Our work is set in the context of specialized applications, in which the image database evolves in real time. With images arriving continuously,
the image database could become very large. Thanks to incremental learning, our approach can handle large-scale datasets. A major advantage of our approach is that no off-line (a priori) learning is necessary. Our system is indeed able to acquire and update knowledge incrementally (from zero) and continuously, without requiring off-line learning. This is different from other models, in which the off-line learning must be repeated each time knowledge is added or changed. However, to update the KVRs of a keyword in our incremental learning method, the SPLIT operator may be executed using an image clustering step, which leads to slow computation if the number of images associated with the keyword is too large. We may avoid this by using an adaptive clustering which takes the pre-computed information into account.

6 Experimentation Protocol

To evaluate our BoK model, we present here the experimentation protocol that we used. We evaluate the evolution of knowledge and the performance of image annotation and image retrieval, in order to give an overall evaluation of our model. First, we present the difficulties of establishing an objective protocol for performance evaluation, and the methodology used to evaluate our model on significant experiments. Then, the next section presents the experimental protocol.

6.1 Difficulties

The first difficulty is the "dynamic" character. The principle of our model is reinforcement learning through interactions between the users and the machine. Knowledge of the system is learnt from users/experts during the interactive image retrieval/exploration process. To evaluate our model, we need to perform interactions between the users/experts and the machine. This task requires substantial effort and time, and may not be objective for the experimentation.

Another difficulty concerns the image database. In our system, there is no boundary between the training set and the test set. With the assumption of the "dynamic" character of the image database, the volume is not
fixed. New images arrive constantly, making it difficult to simulate the base. In addition, because it is "dynamic", our learning is not comparable to existing approaches. To our knowledge, no other image retrieval and annotation system has the same "dynamic" character as our system.

6.2 Experimental protocol

As mentioned above, it is difficult to evaluate our system because of its "dynamic" character. No established experimental protocol exists for this setting, which leads us to propose our own. In the following sections, we first discuss the cross-validations for the experimentation and the retrieval scenarios in our system. Then, we propose the use of a pseudo-interactive evaluation method based on agents. These agents are used to simulate the interactions between the users and the machine. Finally, two types of evaluation are discussed: the evaluation of the evolution of knowledge (the "dynamic" character) and the evaluation of the use of BoK (the "static" character).

6.2.1 Cross-validations for the experimentation

In our system, the image database is dynamic, since we assume that new images arrive in real time. At time t, knowledge is learnt from the data already in the database; fresh images arriving after t are added later, which divides the base into two parts. We call the first part the "initial part" and the second the "new part". The initial part is used to learn the knowledge, and the new part is used only for evaluation. The Corel30K base is used in our experimentation.

Our system is evaluated by performing four cross-validations, corresponding to different situations that the system is likely to encounter. In these situations, a part of the image database already has annotations and the rest does not. For these four situations, we arbitrarily define the
initial part as 20%, 50%, 75% and 95% of the Corel30K base. Learning is done through image retrieval on these proportions of the base. The remaining 80%, 50%, 25% and 5% are treated as new images arriving in the system (the new part). Our experiment is based on 1000 retrieval sessions. Considering the normal use of the system by multiple users, 1000 sessions are only a small fraction of the life of a system that is used in the long term (for example, 1000 sessions may correspond to a few days or weeks). We then define the four cross-validations, which represent four conditions of use of the system. In each case, the learning process is repeated three times (with 1000 sessions of image retrieval each time), and we then calculate the mean.

(1) Experiment 1: The "initial part" of the image database is 20% of the Corel30K base. Users perform the interactive image retrieval on this part. The remaining 80% of the Corel30K base is considered as new images coming into the system (the "new part"). In this scheme of using the system, the user builds knowledge on few images, and many new images arrive without any knowledge.
(2) Experiment 2: The "initial part" of the image database is 50% of the Corel30K base. The remaining 50% of the Corel30K base is considered as the new part.
(3) Experiment 3: The "initial part" of the image database is 75% of the Corel30K base. The remaining 25% of the Corel30K base is considered as the new part.
(4) Experiment 4: The "initial part" of the image database is 95% of the Corel30K base. The remaining 5% of the Corel30K base is considered as the new part. In this scheme of using the system, the user builds knowledge on many images and only a few new images arrive without any knowledge.

In reality,
situations as diverse as experiments 1 to 4 are quite possible, and the system may even move over time from one situation to another. Early in the life of the system, there is no annotation, either on the initial part or on the new part. In other words, there is no prior knowledge. In our system, with the assumption that new images come into the system in real time, the image database increases in size. However, in the context of this experimentation, we do not simulate the evolution of the number of images. We only use the new part to observe the performance of annotation propagation and the use of BoK on new images arriving in the system.

We use the keywords of Corel30K as the ground truth. There are 1036 keywords if we consider only keywords with more than 10 annotated images. As we consider only the keywords present in both the initial part and the new part, we have 929 keywords for experiment 1, 957 keywords for experiment 2, 951 keywords for experiment 3 and 935 keywords for experiment 4.

6.2.2 Pseudo interactive evaluation method

To evaluate our system, we must carry out many user interactions, which is difficult to do and subjective. To avoid involving actual users and to obtain a more automated and, above all, repeatable evaluation, we propose a pseudo-interactive evaluation methodology based on agents, described below. We propose to use software agents to replace human users and to perform automated interactions. The manual annotations of Corel30K are used as the knowledge of users/experts for the software agents. Different types of agents are used to simulate the interactions, and also the evaluation (Fig. 9):

(1) an agent "System" to perform all basic operations such as CBIR and Keyword-Based Image Retrieval (KBIR); this is our proposed system.
(2) an agent "Human" to specify queries (in the form of selected images in the initial part and/or a textual query using the simulated knowledge base) and automatic interactions (based on the
simulated knowledge base). This agent replaces the human users.
(3) an agent "Evaluation" to collect data and produce the evaluation results.

We evaluate the evolution of the knowledge of the system and the use of BoK for image annotation and image retrieval. We evaluate the use of BoK on both parts of the image database, the initial part and the new part, for each experiment. The evaluation of the initial part gives us the quality of the knowledge of the system at this time, while the evaluation of the new part gives us the efficiency of this knowledge for future retrieval.

Fig. 9. Agents to simulate the interactions. The section in red (Knowledge and Agent "Human") is the simulation of the user interaction. The section in black (Agent "Evaluation") is the system evaluation process.

An interaction session (relevance feedback) is defined as a maximum of 20 "clicks" for the agent "Human", each "click" being an image indicated as relevant or irrelevant for the current query.

7. Results

In this section, we discuss the results of the evaluation on the two main characters of the system: the "dynamic" character (the evolution of knowledge) and the "static" character (the use of knowledge). Given the dynamic nature of our system, it is important to evaluate the evolution of knowledge. In seeking information and exploring the image database, users interact with the system and thereby enrich its knowledge base. The dynamically learnt knowledge is of three types:

(1) manual annotation: Images are assigned keywords by users/experts during the interaction.
(2) propagated annotation: Other images in the database are dynamically assigned keywords using BoK.
(3) "image level" knowledge: Visual representations of keywords, or in other words, the BoK of keywords.

We observe the quantity and quality of knowledge
over time, which means:

Quantity: The number of manual annotations, the number of propagated annotations and the number of BoKs. These numbers show the evolution of the volume of knowledge. We can observe the relationship between the evolution of the volume of knowledge and the evolution of the quality of knowledge.

Quality: We evaluate the quality of the knowledge related to the annotations and of the "image level" knowledge, that is to say the BoK of keywords. The precision of the propagated annotation is used to evaluate the knowledge of annotation, and the precision of image retrieval by keywords (using the BoK of keywords) is used to evaluate the "image level" knowledge. We evaluate on two parts of the image database: existing images of the system (the initial part) and newly arriving images (the new part).

After each session of interactive image retrieval (with relevance feedback), the information listed in the summary table of experimentations is calculated. We evaluate the evolution over 1000 sessions.

Table. Summary of experimentations.

Experimentation                 Database                                   Description
Number of learnt keywords       Cross-validations of Corel30K              Evolution of the number of keywords learnt over time
Number of annotations           Cross-validations of Corel30K              Evolution of the number of manual/propagated annotations over time
Quality of image annotations    Cross-validations of Corel30K              Evolution over time of the performance of propagated annotation
The quality of BoK              Cross-validations of Corel30K              Evolution of BoK over time
Use of BoK: image annotation    Corel5K with 1000 simulated interactions   Comparison with other methods of image annotation
Use of BoK: image retrieval     Corel5K with 1000 queries                  Comparison with other methods of image retrieval

7.1 Quantitative evaluation

7.1.1 The number of keywords learnt

In the curves presented in Fig. 10, the number of keywords learnt grows quickly at the beginning (dashed arrow) and, after a certain point around t = 330, increases more slowly (continuous arrow). At the end of the experiment (1000 retrieval sessions), we see from the figure that the maximum number of learnt keywords is about 400. As a reminder, depending on the experiment, there are between 929 and 957 keywords in total. The number of keywords learnt is about 45% of the keywords of the Corel30K database.

Fig. 10. Number of keywords learnt over time (iterations).

In our experiments, we distinguish two types of keywords: "major" and "minor" ones. A "major" keyword is associated with many images, while a "minor" keyword is associated with only a few images in the database. For example, in Fig. 11, we find major keywords in Corel30K (left of Fig. 11) such as "sky", "close up" and "horses", and minor keywords (right of Fig. 11) such as "glasses", "soldiers" and "highway". This abstract definition is used only to analyze some results qualitatively, so we do not need any threshold to distinguish these two types of keywords.

Fig. 11. The distribution of keywords in Corel30K. Some keywords are associated with many images (major keywords, left), while others are associated with only a few images (minor keywords, right).

In Fig. 10, we see that half of the keywords are learnt in the first third of the experimentation time. Most of the remaining time is used to improve the BoK quality of these keywords. This improvement in performance is illustrated by the image annotation and image retrieval in Sec. 5.2.

7.1.2 Discussion

The amount of knowledge learnt grows faster early on, and this rate of increase decreases over time as the system is used. The amount of knowledge
learnt depends primarily on the number of interactions, that is, on how frequently the system is used: the more the system is used, the more knowledge is learnt. The distribution of keywords over the database also influences the amount of knowledge learnt. If the images in the database have a wide range of keywords, the amount of knowledge learnt is larger than in the case of a small range of keywords.

7.2 Qualitative evaluation

7.2.1 Quality of annotation

The annotation propagation using the BoK of keywords is evaluated by the precision and recall of annotation. The evaluation is based on the four experiments presented above. The annotation propagation is performed on both parts, initial and new, in each experiment. The image annotation in the new part illustrates the effectiveness of the use of BoK for new images that arrive in the system. The annotation propagation on the initial part helps to observe the influence of the number of initial images in the system (the usage of the system) on the quality of knowledge.

The initial part contains 6000 (20%), 15,000 (50%), 23,000 (75%) and 29,500 (95%) images respectively for experiments 1, 2, 3 and 4 (Sec. 6.2.1). The images in this database are distributed over all the categories of Corel30K. The four conditions of use of the system, corresponding to the four experiments, show different performances of the annotation propagation over time (Fig. 12).

Figure 12 shows the evolution of the precision of annotation propagation in our system in the four experiments. In experiment 1 (20–80%), the results might be less significant because the initial part is small for that experimentation. However, the results for the three other experimentations are similar and more significant. Among them, one experiment gives slightly better annotation propagation than experiment 2 but lower than the other one. The difference is small because the number of relevant images per keyword (about 400 keywords learnt) is small compared to the total number of images in the database.
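The precision and recall of annotation propagation used above can be illustrated with a small sketch. The text does not give the exact formulas, so this example assumes micro-averaging over (image, keyword) pairs, with the ground-truth keywords playing the role of the Corel30K annotations. The function name, the data layout and the toy data are hypothetical and serve only as an illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout: image identifier -> set of keywords assigned to it.
Annotations = Dict[str, Set[str]]


def propagation_precision_recall(
    propagated: Annotations, ground_truth: Annotations
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (image, keyword) pairs.

    Assumption: a propagated pair is correct when the same keyword is
    attached to the same image in the ground truth.
    """
    predicted = {(img, kw) for img, kws in propagated.items() for kw in kws}
    relevant = {(img, kw) for img, kws in ground_truth.items() for kw in kws}
    correct = len(predicted & relevant)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: three images, keywords borrowed from the examples in the text.
ground_truth = {"img1": {"sky", "horses"}, "img2": {"sky"}, "img3": {"building"}}
propagated = {"img1": {"sky"}, "img2": {"sky", "building"}, "img3": set()}

precision, recall = propagation_precision_recall(propagated, ground_truth)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, two of the three propagated pairs are correct and two of the four ground-truth pairs are found. A per-keyword (macro) average would be an equally plausible reading of the evaluation and would weight minor keywords more heavily.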
The annotation precision increases rapidly at first (the dashed arrow in Fig. 12), while the Bags of KVRs of the major keywords are learnt. In the following period (the continuous arrow in Fig. 12), the BoKs of major keywords are improved and other minor keywords are learnt, while the annotation precision increases more slowly. We note that it is difficult to propagate the annotation of minor keywords because the number of images involved is very small compared to the total number of images in the database. In summary, the annotation precision of the initial part improves over time in all experiments. This means that the BoK gets better as the system is used.

Fig. 12. The evolution of precision for annotation propagation on the initial part. The annotation precision increases rapidly at first (dashed arrow), where the Bags of KVRs of major keywords are learnt (knowledge is increasing rapidly). In the second stage (continuous arrow), the Bags of KVRs of major keywords are improved (knowledge is improved), and other minor keywords are learnt (knowledge increases less rapidly than in the first phase). Meanwhile, the annotation precision increases more slowly.

Figure 13 shows the recall of annotation of the initial part in the four experiments. These results further confirm that the annotation of the initial part, or in other words the knowledge of the system, gets better over time as the system is used. When we use the system on a small image database, the quality of knowledge is more reliable than when using the system on a large image database.

The new part contains respectively 24,000 (80%), 15,000 (50%), 7500 (25%) and 1500 (5%) images for experiments 1, 2, 3 and 4. The images in the database are distributed randomly over all categories of Corel30K. As in the case of the initial part, the time and the number
of images of the new part influence the performance of the annotation propagation. Figures 14 and 15 illustrate the performance of annotation propagation in the new part for the four experiments. The performance of annotation propagation in the new part is lower than in the initial part because a certain percentage of the manual annotations of the initial part is used to propagate the annotation. This proportion is respectively 0.3, 0.2, 0.16 and 0.15 in experiments 1, 2, 3 and 4 (these are the ratios between the number of manual annotations and the number of annotations in the ground truth of the initial part).

Fig. 13. Evolution of the recall of annotation propagation on the initial part.

Fig. 14. The evolution of the precision of the annotation propagation on the new part.

Fig. 15. The evolution of the recall of the annotation propagation on the new part.

As in the initial part, annotation propagation in the new part improves rapidly during the period when the first major keywords are learnt. An example is shown at t = 250 (Figs. 14 and 15), where the precision and recall in one experiment increase suddenly. This means that at time t = 250, the BoK of a major keyword is learnt. This keyword (the keyword "building", with 2373 images) is propagated to the images of the new part. In this case, we can notice that if the system is used intensively relative to the number of incoming images (as in experiment 4), the propagated annotations are more reliable than when the system is used little relative to the number of incoming images (as in experiment 1).

7.2.2 Propagation of annotation in the Corel30K database

In this section, we evaluate annotation propagation in the four
experiments on the same image database. Corel30K (the fusion of the initial part and the new part for the four experiments) is used to propagate annotation. In experiments 1, 2, 3 and 4, the BoKs of keywords are learnt by simulating user knowledge based on different image data (20%, 50%, 75% and 95% of Corel30K). This means that the BoKs of keywords are learnt from different sources of knowledge (in terms of the amount of knowledge). The influence of these differences across the four experiments is evaluated by annotation propagation on the same image database: Corel30K.

Figures 16 and 17 illustrate the performance of annotation propagation on Corel30K for the four experiments. In the first phase (up to time t = 280), one experiment gives the best performance, while the others provide almost the same results. The reason is probably that the knowledge used is still small compared to the total amount of knowledge. In the remaining time, the amount of knowledge begins to influence the performance of the propagation, which is better in two of the experiments than in the other two. However, the differences are not large, because all learning in the four experiments is performed by simulating the same number of interactive queries (1000). In fact, the influence of the amount of knowledge on the propagation is confirmed over time. As time advances, more knowledge is

Fig. 16. The evolution of the precision of annotation propagation using Corel30K. Results are similar for all four experiments. In the second step, the amount of knowledge begins to influence the performance of the propagation.

Fig. 17. Evolution of the recall of annotation propagation on Corel30K.
used for learning, and therefore annotation propagation is better.

We can observe that different conditions of use of the system yield different results. With intensive use of the system, that is to say with many interactions, the result is better. This is confirmed by the improved performance of the annotation propagation and of the image retrieval over time. When the system is used on many images (experiment 4), the result is better than when it is used on few images (experiment 1). Experiment 4 works best, and this is certainly because the system is used extensively on many images. However, in the other conditions, the system still performs well, and the performance improves with time.

7.2.3 Quality of BoK

Image retrieval by BoK is used to evaluate the "image level" knowledge. We check whether the BoKs of keywords are well learnt by analyzing the precision of image retrieval. Figure 18 shows the average precision of image retrieval by BoK. The precision reported is the mean, over all keywords, of the average precision of image retrieval. One experiment gives the best precision and another the worst. In general, the precision of image retrieval increases over time; in other words, the BoKs improve over time. However, it should be noted that there are times when the precision decreases slightly. This means that the BoK of a keyword does not always improve during learning. This problem is probably due to the way images are combined for the construction of the KVRs. Our learning approach considers that the clustering is good because the number of examples is small. Conversely, the clustering can give bad results in cases where the examples are

Fig. 18. Evolution of the precision of image retrieval by BoK on the initial part. In general, the precision gets better over time; in other words, the KVR bags improve over time.
too diverse.

Figure 19 illustrates the precision of image retrieval by BoK on the new part. Although the precision gets better over time, it remains lower in the case of a large image database, as in all the previous evaluations. In the case of experiment 1, the system works properly although there are few interactions between the users and the system. In experiment 4, the knowledge evolves faster than the number of incoming images, and we see that in this case the system works very well.

7.3 Applications of BoK

7.3.1 Image annotation

In this section, we evaluate the evolution of our system over time, which is difficult to compare with other works. We propose to compare our results for annotation propagation with other methods of automatic annotation. The annotation propagation is performed based on the BoK representation. Although Bags of KVRs are ...

Fig. 19. Evolution of the precision of image retrieval by Bags of KVRs on the new part.
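The retrieval precision reported in Sec. 7.2.3 is described as the mean over all keywords of the average precision of image retrieval. The sketch below shows one common way to compute such a score. It assumes the standard non-interpolated average precision, which the text does not state explicitly. The function names, the ranked lists and the relevance sets are hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision of one ranked result list."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, image in enumerate(ranked, start=1):
        if image in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant image
    return total / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], relevance: Dict[str, Set[str]]
) -> float:
    """Mean of the per-keyword average precisions."""
    scores = [
        average_precision(ranked, relevance.get(keyword, set()))
        for keyword, ranked in rankings.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: one ranked list of images per keyword query.
rankings = {"sky": ["a", "b", "c", "d"], "horses": ["c", "a", "d", "b"]}
relevance = {"sky": {"a", "c"}, "horses": {"d"}}

map_score = mean_average_precision(rankings, relevance)
print(f"mean average precision = {map_score:.3f}")
```

In this toy case, the "sky" query scores (1/1 + 2/3) / 2 and the "horses" query scores 1/3, so the mean is 7/12. Because every keyword contributes equally to the mean, a poorly learnt BoK for a minor keyword lowers the score as much as one for a major keyword.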