TRECVID 2010 Collaborative annotation

Authors: Georges Quénot (LIG), Andy Tseng (LIG), Bahjat Safadi (LIG) and Stéphane Ayache (LIF).
Last revision: 6 May 2010.

We (LIG, Laboratoire d'Informatique de Grenoble, and LIF, Laboratoire d'Informatique Fondamentale de Marseille) are organizing the collaborative annotation for TRECVID [1] 2010, as was done in 2003 [2], 2005 [3], 2007, 2008 and 2009 [4].

We have done some work on active learning and its relation to the corpus annotation problem. Part of this work is described in a paper published in Signal Processing: Image Communication [5]. This work indicates that it is possible to annotate only a small, carefully chosen fraction of a training collection (typically between 15 and 20% of the samples) and still reach the same system performance (or even better) as if the whole collection had been annotated. This was confirmed on the TRECVID 2007 collaborative annotation [4], though the optimal annotation fraction was found to be between 35 and 50% (possibly because of the small size of the TRECVID 2007 development collection).

1. Active learning

How does it work? A set of samples is available for training but is not annotated yet (this is currently the case for the development set of TRECVID 2010, where the samples are keyframes/subshots). A system for concept detection is also available and can be trained from samples annotated as positive or negative for the concept to be detected. The annotation of the training set is partial and incremental.

The principle of using active learning for annotation is to let the system select the samples that are potentially the most informative for its own training. Several strategies can be considered; the most popular ones select the most probable or the most uncertain samples. If several systems are available, it is also possible to select the samples on which they disagree.
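
As an illustration, here is a minimal sketch of these selection strategies (not the actual LIG/LIF implementation), assuming each unlabeled sample has a score in [0, 1] produced by the current classifier for the target concept; all names and values below are illustrative.

  # Sketch of the selection strategies mentioned above (illustrative only).
  # 'scores' maps a sample identifier to the classifier score in [0, 1].

  def select_most_probable(scores, k):
      """Pick the k samples with the highest positive probability."""
      return sorted(scores, key=scores.get, reverse=True)[:k]

  def select_most_uncertain(scores, k):
      """Pick the k samples whose score is closest to the decision boundary (0.5)."""
      return sorted(scores, key=lambda s: abs(scores[s] - 0.5))[:k]

  def select_by_disagreement(multi_scores, k):
      """With several systems, pick the k samples on which their scores differ most."""
      spread = {s: max(v) - min(v) for s, v in multi_scores.items()}
      return sorted(spread, key=spread.get, reverse=True)[:k]

  scores = {"shot1_1_RKF": 0.91, "shot2_3_RKF": 0.48, "shot4_2_RKF": 0.07}
  print(select_most_probable(scores, 2))    # ['shot1_1_RKF', 'shot2_3_RKF']
  print(select_most_uncertain(scores, 2))   # ['shot2_3_RKF', 'shot1_1_RKF']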

Since the system is used to select the samples whose annotations it will then be trained on, the process can only work iteratively, and there is a "cold start" problem. In the present case, the cold start will be handled using previous years' annotations (on the TRECVID 2007 development collection) and judgements (on the TRECVID 2007 test collection) when available, or using LSCOM annotations (on the TRECVID 2005 development collection), depending upon the target concept (feature). Once the process is started, the system becomes better and better at selecting good samples for annotation at each iteration [4][5].

Compared to the TRECVID 2007-2009 collaborative annotations using active learning, the innovation this year is the use of relations between annotated concepts. The principle is that if a shot is labeled as positive for Adult, it will automatically be labeled as positive for Person, and if a shot is labeled as negative for Person, it will automatically be labeled as negative for Adult, Male_Person, Female_Person, Teenagers, etc. Special care will be taken when choosing the annotations to be done and their order so that this effect makes each annotation as efficient as possible. The 130 concepts of TRECVID 2010 have been selected so that they cover as many previous TRECVID HLFs as possible, comply as much as possible with the LSCOM ontology, and are linked by a number of generic-specific relations.
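
As a sketch of how this propagation can work (the relation fragment below is only an example, not the actual TRECVID 2010 relation set):

  # Sketch of label propagation through generic-specific ("implies") relations.
  # The relation fragment below is an example, not the full TRECVID 2010 set.

  IMPLIES = {                    # specific concept -> more generic concept
      "Adult": "Person",
      "Male_Person": "Person",
      "Female_Person": "Person",
      "Teenagers": "Person",
  }

  def ancestors(concept):
      """Concepts implied, directly or transitively, by `concept`."""
      out = []
      while concept in IMPLIES:
          concept = IMPLIES[concept]
          out.append(concept)
      return out

  def descendants(concept):
      """Concepts that imply `concept`, directly or transitively."""
      return [c for c in IMPLIES if concept in ancestors(c)]

  def propagate(concept, label):
      """Positive labels go up the hierarchy, negative labels go down."""
      if label == "Positive":
          return {c: "Positive" for c in ancestors(concept)}
      if label == "Negative":
          return {c: "Negative" for c in descendants(concept)}
      return {}

  print(propagate("Adult", "Positive"))    # {'Person': 'Positive'}
  print(propagate("Person", "Negative"))   # Adult, Male_Person, Female_Person, Teenagers -> Negative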

The number of concepts is significantly higher this year (130 versus 20) and the number of shots is also higher (119,685 versus 43,616). The raw number of annotations needed for complete coverage is therefore about 18 times larger. With the combination of active learning and the use of relations, we expect to produce an annotation that will be partial but cover a significant and well-chosen fraction. The use of relations will be transparent to the annotators: for each annotation, they will still have to focus on a single concept.
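
For reference, this factor comes from the ratio of the two annotation grids: (130 × 119,685) / (20 × 43,616) = 15,559,050 / 872,320, i.e. about 17.8.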

2. Application to TRECVID 2010 collaborative annotation

We propose to use this approach to perform a partial annotation of the TRECVID 2010 development set in the context of a collaborative annotation effort. That effort will use a system similar to the one used for TRECVID 2007, 2008 and 2009, with a lighter workload since only a fraction of the training set will be annotated. As in the previous collaborative annotations, the annotations will be available before the TRECVID 2010 workshop to all teams that participated in the annotation, and to everybody afterwards.

We will provide a web interface for the collaborative annotation. The active learning process will be transparent to the annotators; they will simply encounter more positive samples than in a full or random annotation (at least in the beginning). In practice, the concept detection system will be re-trained continuously with the latest available annotations, and the next samples to be annotated will be re-evaluated and re-sorted each time the system has been re-trained.
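
In pseudocode terms, the cycle behind the interface looks roughly like this (a sketch with placeholder functions, not the actual annotation-server code):

  # Sketch of the continuous re-train / re-sort cycle (placeholders only).
  import random

  def train_detector(annotations):
      # Placeholder: a real system trains a concept classifier on the labels so far.
      return annotations

  def score(detector, sample):
      # Placeholder: a real detector returns its positive-class probability.
      return random.random()

  def ask_annotator(sample):
      # Placeholder for the web interface asking a human for a judgment.
      return random.choice(["Positive", "Negative", "Skip"])

  def annotation_cycle(unlabeled, annotations, batch_size=25):
      while unlabeled:
          detector = train_detector(annotations)               # latest labels
          scores = {s: score(detector, s) for s in unlabeled}
          # re-sort: the most uncertain samples (closest to 0.5) come first
          batch = sorted(unlabeled, key=lambda s: abs(scores[s] - 0.5))[:batch_size]
          for sample in batch:
              annotations[sample] = ask_annotator(sample)
              unlabeled.discard(sample)

  annotation_cycle({"shot1_1_RKF", "shot2_3_RKF", "shot4_2_RKF"}, {})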

We plan to set up the annotation at the beginning of May. We will take advantage of ASR/MT. The annotation could run for 2 to 4 weeks, and annotations on sets of increasing size could be delivered periodically in the meantime.

We produced the master shot segmentation using the LIG shot boundary detection system [6] and extracted one keyframe per shot. There are 119,685 shots/keyframes in the TRECVID 2010 development set. Each participating team will have to produce at least 30,000 annotations. At an average of 2 seconds per annotation (the exact value varies: in some cases, one may see almost at once that there is no positive among 25 images; in other cases, one may quite often have to play the subshot to check its content), this corresponds to about 17 hours of full-time work.

Each registered participant is asked to do 30,000 annotations. This applies to participants that are a single research team or organization. Participants representing several research teams or organizations are asked to do more: 30,000 for the first research team (about two to three full days of work) plus 15,000 for each additional research team. Of course, participants are welcome to do more than the minimum required.

We recommend spreading your annotation effort over sessions of 30 to 60 minutes, once or twice a day. Annotations can be done until the end of June, though we advise doing them earlier or as soon as possible. Participants that have completed their 30,000 share can download the latest version of the annotation at any time. A limited version will be available for other participants so that they can start developing and testing their systems even if they have not completed their share yet.

3. Participants

If you are interested in participating in this collaborative annotation, please send an email to Andy.Tseng@imag.fr with a copy to Georges.Quenot@imag.fr indicating your team and the contact person.

Please note that participating in the annotations and/or getting them is only open to registered TRECVID participants who have signed the Sound and Vision licence agreement (i.e. who are listed in the "tv10.who.what" file). Access to the annotation system is password protected. Registered participants can access the annotation system at http://mrim.imag.fr/tvca2010/al.html.

47 teams are currently registered for the collaborative annotation:

Team | Contact person
Aalto University School of Science and Technology | Mats Sjöberg
AT&T Labs | Eric Zavesky
Beijing University of Posts and Telecommunications | Zhicheng Zhao
Brno University of Technology | Michal Hradis
Columbia University | Yu-Gang Jiang
Deutsches Forschungszentrum für Künstliche Intelligenz | Adrian Ulges
Dublin City University | Colum Foley
École Nationale d'Ingénieurs de Sfax (ENIS) | Anis Benammar
Florida International University | Fausto Fleites
France Telecom Orange Labs, Beijing | Kun Tao
Fudan University, China | Hong Lu
Fuzhou University | Jianjun Huang
GDR ISIS IRIM group | Franck Thollard
IBM Watson Research Center | Lexing Xie
Informatics and Telematics Institute, Greece | Vasileios Mezaris
Information and Communication Engineering, Xi'an Jiaotong University | Zhe Wang
Informedia Digital Video Library (Carnegie Mellon University) | Alexander Hauptmann
INRIA Willow | Rachid Benmokhtar
Institute of Image Comm. and Inf. Proc., Shanghai Jiao Tong University | Xiaokang Yang
JOANNEUM RESEARCH Forschungsgesellschaft mbH & Vienna University of Technology | Werner Bailer
Kobe University | Kimiaki Shirahama
Laboratoire d'informatique de Grenoble (LIG) | Georges Quénot
Laboratoire d'informatique fondamentale de Marseille (LIF) | Stéphane Ayache
Université Sud Toulon Var (LSIS) | Hervé Glotin
Multimedia Understanding Group, Aristotle University of Thessaloniki | Christos Diou
National Cheng Kung University | Chien-Li Chou
National Institute of Informatics, Japan | Duy-Dinh Le
National Taiwan University | Guan-Long Wu
NHK (Japan Broadcasting Corp.) Science and Technical Research Laboratories | Yoshihiko Kawai
Peking University | Yuan Feng
Politecnico Di Milano - Department of Electronics and Information | Ahmed Ghozia
Quaero consortium | Hazim Ekenel
Ritsumeikan University, Japan | Ai Danni
RMIT University School of CS&IT | James Thom
Shanghai Jiaotong University-IS | Jiang Chengming
The University of Electro-Communications, Japan | Keiji Yanai
Tokyo Institute of Technology & Georgia Institute of Technology | Koichi Shinoda
TÜBİTAK Space Technologies Research Institute | Ahmet Saracoglu
Universidad Autónoma de Madrid | Javier Molina
Universidad Carlos III de Madrid | Iván González Díaz
University of Amsterdam | Cees Snoek
University of Illinois at Urbana-Champaign & NEC Laboratories America | Mert Dikmen
University of Marburg | Markus Mühling
University of Sheffield | Usman Ghani
UPS - IRIT - SAMoVA | Hervé Bredin
VIREO at City University of Hong Kong | Shiai Zhu
Waseda University | Ong Kok Meng

4. Annotations download

The final version of the 2007 collaborative annotation can be downloaded from http://mrim.imag.fr/tvca2007/ann.tgz.
The final version of the 2008 collaborative annotation can be downloaded from http://mrim.imag.fr/tvca2008/ann.tgz.
The final version of the 2009 collaborative annotation can be downloaded from http://mrim.imag.fr/tvca2009/ann.tgz.

The latest version of the 2010 collaborative annotation can be downloaded from http://mrim.imag.fr/tvca2010/ann.html.

If you are a registered group and have done the minimum amount of annotations assigned, you will get the latest version of the full set of annotations. You can update it as frequently as you wish as the collaborative annotation progresses.

If you have not yet completed the minimum amount of annotations assigned, you will get a restricted version of the annotation set: up to 100 positive and 100 negative samples for each feature/concept. This is intended to allow you to develop and test your systems even if you have not completed your annotations yet. If you have already completed your annotations and want to access the restricted version, simply enter a random group/password combination.

Annotations are given in the order in which they were selected by the active learning process (those predicted as most useful first).

The annotations are delivered as a gzip-compressed Unix tar archive (".tgz"). They are dynamically generated at each download request, and this may take some time. The format is the same as in 2005, 2007, 2008 and 2009:

 Each line represents a judgment. For a given feature and shot there will likely be
 more than one judgment. It is up to you how you use these multiple annotations.

 Each line contains the following information:

  toolname annotationSite featurename moviename keyframename judgment(Skip/Positive/Negative)

The shotname can be derived from the keyframename by removing the "_RKF" or "_NRKF_#" part. There may be several keyframes/subshots per shot for 2007, 2008 and 2009.

There is currently only one judgment per keyframe and possibly several judgments per shot.
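
For example, a line of the archive could be parsed as follows (a sketch; the field values shown are made up):

  # Sketch of a reader for one annotation line (field order as described above).

  def parse_annotation_line(line):
      tool, site, feature, movie, keyframe, judgment = line.split()
      # The shot name is the keyframe name without its "_RKF" / "_NRKF_#" suffix.
      shot = keyframe.split("_NRKF_")[0].split("_RKF")[0]
      return {"tool": tool, "site": site, "feature": feature, "movie": movie,
              "keyframe": keyframe, "shot": shot, "judgment": judgment}

  example = "toolX siteY Person movie42 shot42_7_RKF Positive"   # illustrative values
  print(parse_annotation_line(example)["shot"])                  # shot42_7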

5. Milestones (2010)

April 19: Announcement of the Collaborative Annotation project and first version of the Collaborative Annotation Web site.
April 26: Annotation system available.
June 20: Annotation system closed.
June 21: Final version of the annotation available.

6. Status on June 3rd, 2010

About 2,595K annotations have been made so far, of which about two thirds by TRECVID participants and one third with the support of the Quaero programme. Many groups did much more than the required minimum (30,000 annotations in total) and we thank them very much.

About 2,270K sample/concept pairs have been annotated at least once; about 232K have been annotated at least twice and about 92K at least three times (in order to improve the annotation quality).

All these are direct annotations. When generating the final list of annotations, we now also include indirect annotations derived from the set of relations mentioned above. Direct and indirect annotations are then merged; if a conflict occurs and there is a majority (e.g. 2P and 1N), we keep the majority label, and if there is no majority, the sample is marked as S (skipped). Meanwhile, conflicting annotations are selected in priority for another judgment in the collaborative annotation. Indirect annotations are now propagated for both "implies" and "excludes" relations.
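
A minimal sketch of the merging rule just described (counts and labels are illustrative):

  # Sketch of the merging rule for the direct and indirect judgments of one
  # sample/concept pair: keep the majority label, mark 'S' (skipped) on a tie.
  from collections import Counter

  def merge_judgments(judgments):
      counts = Counter(judgments).most_common()
      (top, n), rest = counts[0], counts[1:]
      if rest and rest[0][1] == n:      # no majority, e.g. 1P and 1N
          return "S"
      return top

  print(merge_judgments(["P", "P", "N"]))   # majority -> 'P'
  print(merge_judgments(["P", "N"]))        # tie -> 'S'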

After this step, there is a total of 24K skipped, 196K positive and 5,424K negative labels. A 100% annotation would correspond to 119,685 times 130, or about 15,559K annotations. We have therefore annotated over one third of what can be annotated, and within that fraction the active learning approach helped us to capture a significant proportion of the positive samples. Though we have no clear idea of the exact proportion, previous experiments on TRECVID collaborative annotations completed to 100% (or almost) suggest that it might be over 70%. Moreover, the labeled negative samples have been selected as "close to" the positive ones. More information about our active learning process can be found in [4] (ECIR'2008; see figure 5 for the rate of discovery of the positive samples in TV 2007; the same curve was found for TV 2008 and 2009).
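
For reference, the totals above add up to 24K + 196K + 5,424K = 5,644K labels, i.e. roughly 36% of the 15,559K possible annotations.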

Finally, a number of videos have been removed from the TV 2010 collection, most of them because of inappropriate content (see iacc.1.dropped). All annotations corresponding to these videos will be removed from the released annotation, and keyframes/shots from them should not be displayed for annotation. For various practical reasons, this has not been implemented yet, but it will be soon.

Important: we plan to close the collaborative annotation on June 15 so that it remains stable long enough for the development and validation process by the participants. If you want to get the collaborative annotation, please complete your share by that date. However, if possible, please continue to spread your annotation work over the remaining period.

7. Acknowledgments

This work is supported by the Quaero programme.

For the 2007-2009 collaborative annotations, the shot and master shot segmentations were provided by Fraunhofer HHI [7]. Automatic speech transcription was provided by the University of Twente [8]. Machine translation of the University of Twente speech transcription was provided by Queen Mary, University of London. Alternate automatic speech transcription was provided by LIMSI [9].

8. Contacts

Georges.Quenot@imag.fr
Bahjat.Safadi@imag.fr
Andy.Tseng@imag.fr
Stephane.Ayache@univmed.fr

References

[1] Smeaton, A. F., Over, P., and Kraaij, W. 2004. TRECVID: evaluating
    the effectiveness of information retrieval tasks on digital video.
    In Proceedings of the 12th Annual ACM international Conference on
    Multimedia (New York, NY, USA, October 10 - 16, 2004). MULTIMEDIA '04.
    ACM Press, New York, NY, 652-655.
    DOI= http://doi.acm.org/10.1145/1027527.1027678
[2] C.-Y. Lin, B. L. Tseng and J. R. Smith, "Video Collaborative Annotation
    Forum: Establishing Ground-Truth Labels on Large Multimedia Datasets,"
    NIST TREC-2003 Video Retrieval Evaluation Conference, Gaithersburg, MD,
    November 2003.
    URL: http://www-nlpir.nist.gov/projects/tvpubs/papers/ibm.final.paper.pdf
[3] Timo Volkmer, John R. Smith, Apostol (Paul) Natsev, Murray Campbell, 
    Milind Naphade, "A web-based system for collaborative annotation of large 
    image and video collections", In Proceedings of the 13th ACM international 
    Conference on Multimedia, Singapore, 6-11 November, 2005 
[4] Stéphane Ayache and Georges Quénot, "Video Corpus Annotation using
    Active Learning", 30th European Conference on Information Retrieval
    (ECIR'08), Glasgow, Scotland, 30th March - 3rd April, 2008
    URL: http://mrim.imag.fr/georges.quenot/articles/ecir08.pdf
[5] Stéphane Ayache and Georges Quénot, "Evaluation of Active Learning
    Strategies for Video Indexing", Signal Processing: Image Communication,
    Vol 22/7-8 pp 692-704, August-September 2007.
    DOI: http://dx.doi.org/10.1016/j.image.2007.05.010
[6] Stéphane Ayache, Georges Quénot, and Jérôme Gensel,
    "CLIPS-LSR Experiments at TRECVID 2006",
    TREC Video Retrieval Evaluation Online Proceedings, TRECVID, 2006
    URL: http://mrim.imag.fr/georges.quenot/articles/trec06.pdf
[7] C. Petersohn. "Fraunhofer HHI at TRECVID 2004:  Shot Boundary Detection
    System", TREC Video Retrieval Evaluation Online Proceedings, TRECVID, 2004
    URL: http://www-nlpir.nist.gov/projects/tvpubs/tvpapers04/fraunhofer.pdf
[8] Marijn Huijbregts, Roeland Ordelman and Franciska de Jong, Annotation
    of Heterogeneous Multimedia Content Using Automatic Speech
    Recognition. in Proceedings of SAMT, December 5-7 2007, Genova, Italy
[9] Julien Despres, Petr Fousek, Jean-Luc Gauvain, Sandrine Gay, Yvan Josse,
    Lori Lamel, and Abdel Messaoudi. Modeling Northern and Southern
    Varieties of Dutch for STT. In Interspeech'09, pages 96-99, Brighton,
    UK, September, 2009.