Ph.D. Thesis

Supervisors: Georges Quénot and Philippe Mulhem
Start date: 01/11/2010
Date of defense: 01/11/2010

Title: Using context for semantic indexing of image and video documents

Automated indexing of image and video documents is a difficult problem because of the “distance” between the arrays of numbers encoding these documents and the concepts (e.g. people, places, events or objects) with which we wish to annotate them. Methods exist for this, but their results are far from satisfactory in terms of generality and accuracy. They generally operate by supervised or semi-supervised learning: the system learns to recognize concepts from positive and negative examples; it “generalizes” from these examples. Existing methods typically use a single set of such examples and treat it as uniform. This is not optimal because the same concept may appear in various contexts, and its appearance may differ greatly from one context to another. The context may be: the type of broadcast (television news, fiction, entertainment, advertising, etc.); the date, place, country or culture of broadcasting or production; or the modalities present or absent (for documents in black and white and/or without sound, for instance). The context may generally be regarded as another concept or as a set of other concepts. The concepts and the relations between them can be represented in ontologies. A relationship within an ontology can be interpreted as indicating whether the related elements are likely to appear together in an image or in a video shot, and this information can be used for their automatic annotation.

The proposed subject concerns the use of context to improve the performance of classifiers. The main idea is to consider, for each concept to be recognized, a number of contexts in which it may appear and to train a classifier for each of these contexts. During recognition, the appropriate classifier is used according to the identified context. Alternatively, a weighted combination (fusion) of classification results can be used if we only have probabilities of being in a given context. Such an approach presents several difficulties. The first is the identification of the context during recognition: in some cases, it may be known explicitly (from metadata, for example) but, in general, it is itself another concept, which also has to be recognized. The second difficulty is the need for a very large total volume of training data so that, for each context, there are enough examples to properly train a classifier. There is also the added complexity of simultaneously managing the tuning of multiple classifiers for each concept. The third difficulty concerns merging the outputs of the different classifiers in the frequent case where the context actually present during recognition is uncertain. The implementation may be based on networks of operators (feature extractors, classifiers and fusion modules), on ontologies to manage relationships between concepts, and on active learning for automatic training data collection.
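
As an illustration only, the two ways of using per-context classifiers described above (selecting the classifier of the identified context, or fusing all classifier outputs weighted by context probabilities) could be sketched as follows. This is a minimal sketch under stated assumptions, not the method to be developed in the thesis: the function names `fuse_scores` and `select_score`, the context labels and all numeric values are hypothetical, and each per-context classifier is assumed to output a score in [0, 1] that can be read as a probability.

```python
from typing import Dict


def fuse_scores(context_probs: Dict[str, float],
                concept_scores: Dict[str, float]) -> float:
    """Soft fusion: sum over contexts of P(context) * score(concept | context)."""
    total = sum(context_probs.values())
    if total <= 0:
        raise ValueError("context probabilities must have a positive sum")
    # Normalise so that the context weights sum to 1 before mixing.
    return sum((p / total) * concept_scores[c]
               for c, p in context_probs.items())


def select_score(context_probs: Dict[str, float],
                 concept_scores: Dict[str, float]) -> float:
    """Hard selection: use only the classifier of the most probable context."""
    best_context = max(context_probs, key=context_probs.get)
    return concept_scores[best_context]


# Hypothetical values for one video shot and one target concept.
# context_probs: estimated probability of each context for this shot.
# concept_scores: output of the concept classifier trained for each context.
context_probs = {"news": 0.7, "fiction": 0.2, "advertising": 0.1}
concept_scores = {"news": 0.9, "fiction": 0.4, "advertising": 0.1}

print(fuse_scores(context_probs, concept_scores))    # approximately 0.72
print(select_score(context_probs, concept_scores))   # score of the "news" classifier
```

When the context is known explicitly (from metadata, for example), the same fusion reduces to hard selection, since all the weight falls on a single context. The sketch does not address the other difficulties listed above, in particular how the context probabilities are estimated or how enough training examples are collected per context.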

The developed methods will be evaluated in national and international campaigns such as TRECVID (http://www-nlpir.nist.gov/projects/...). The work will be carried out within the Quaero program (http://www.quaero.org). This will, among other things, give access to a large volume of annotated image and video data.