We designed and implemented a video semantic indexing system as part of the TRECVID challenge in 2012. Our system combines image processing and machine learning techniques to automatically tag video shots with the concepts that appear in them. This page documents our work at a high level; if you are interested in more details, please refer to our technical paper.
The steep rise in the availability of online video content over the last decade is hardly breaking news. Today, YouTube reports that 72 hours of video are uploaded to its servers every minute. The Internet has changed the rules of mass content distribution, and consequently video has become ubiquitous: from the catalogs of professional broadcasters to prosumer content and personal archives, anybody can make video content widely available at nearly no cost. On the flip side, it is impossible to navigate these catalogs as efficiently as one would zap through broadcast TV channels.
As with other types of online media, video platforms depend heavily on search engines to help users find relevant content. Most commercial video search engines retrieve videos based on textual tags, descriptions, and transcripts. This approach breaks down in important cases, such as when the searched keyword is not mentioned in the text, or when the accompanying text is in a different language. While professional video producers (e.g., news channels) can invest the resources needed to generate thorough text metadata for their catalogs, this is typically beyond the skill and resources of the prosumer and amateur markets. In order to scale our ability to navigate video databases, it is critical to develop tools that can automatically detect and annotate concepts in visual content.
A video semantic indexing system involves several components that need to be carefully designed to work together. Figure 1 shows an overview of our system design. Training data consists of video frames grouped into shots, together with annotations for the occurrence of a pre-defined set of concepts within each shot. Our system works with a set of 50 concepts including, e.g., landscape, computer, and news studio. Annotations, when available, indicate whether a concept is present (denoted as P in the examples in Figure 1) or absent (denoted as N) in the video shot. An annotation of a given concept for a shot may also be missing (denoted as M), indicating that we do not know whether the concept occurs in the shot.
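To make the P/N/M annotation scheme concrete, here is a minimal sketch of one way such shot-level annotations could be represented and filtered before training. The shot IDs, concept names, and dictionary layout are illustrative assumptions, not the actual dataset format:

```python
# Hypothetical annotation table: each shot maps a concept to
# "P" (present), "N" (absent), or "M" (missing / unknown).
annotations = {
    "shot_0001": {"landscape": "P", "computer": "N", "news_studio": "M"},
    "shot_0002": {"landscape": "N", "computer": "P", "news_studio": "P"},
}

def training_pairs(annotations, concept):
    """Yield (shot_id, label) pairs usable for training, skipping 'M'."""
    for shot_id, labels in annotations.items():
        tag = labels.get(concept, "M")
        if tag != "M":  # missing annotations contribute no training signal
            yield shot_id, 1 if tag == "P" else 0

print(list(training_pairs(annotations, "landscape")))
# → [('shot_0001', 1), ('shot_0002', 0)]
```

Skipping the M entries, rather than treating them as negatives, reflects that a missing annotation carries no information about whether the concept occurs.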
Figure 1. Overview of our video semantic indexing system.
In the feature extraction step, we compute a summary representation of the video frame which captures the important visual characteristics of the frame (see our technical paper for details). We feed the representations of several thousand video shots, together with their concept annotations, into an algorithm that learns the relationship between these two types of data and constructs a classifier. Given a new image (i.e., test data), the classifier estimates the likelihood that it contains a concept, without relying on annotations.
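The train-then-score pipeline described above can be sketched as follows. This is a simplified stand-in, assuming generic 128-dimensional feature vectors (here random numbers) and an off-the-shelf logistic-regression classifier; the features and learning algorithm we actually used are described in our technical paper:

```python
# Sketch of the training/scoring pipeline with placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for extracted frame features: one 128-d vector per shot,
# paired with binary labels derived from the P/N annotations.
X_train = rng.normal(size=(200, 128))
y_train = rng.integers(0, 2, size=200)  # 1 = concept present, 0 = absent

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# For a new (test) shot, the classifier outputs the likelihood that the
# concept appears — no annotation of the test shot is needed.
X_test = rng.normal(size=(1, 128))
likelihood = clf.predict_proba(X_test)[0, 1]
print(f"P(concept present) = {likelihood:.3f}")
```

One such classifier is trained per concept, so scoring a shot against all 50 concepts amounts to running 50 independent classifiers over its feature vector.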
Below, you will find a link for each concept. For each concept, you can visualize 2,000 keyframes, each representing a shot. These are the 2,000 shots most likely (according to our algorithms) to contain the concept, ordered and named by rank (i.e., file 0001.jpg corresponds to the shot most likely to contain the concept, file 0002.jpg to the second most likely, and so on).
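The rank-based file naming can be sketched as follows; the shot IDs and scores are made up for illustration:

```python
# Hypothetical classifier scores for a handful of shots.
scores = {"shot_a": 0.91, "shot_b": 0.42, "shot_c": 0.77}

# Sort shots by descending score and name keyframes by rank:
# 0001.jpg is the most likely shot, 0002.jpg the second, etc.
ranked = sorted(scores, key=scores.get, reverse=True)
filenames = {shot: f"{rank:04d}.jpg" for rank, shot in enumerate(ranked, start=1)}

print(filenames)
# → {'shot_a': '0001.jpg', 'shot_c': '0002.jpg', 'shot_b': '0003.jpg'}
```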
All results are represented here by a single keyframe, so they may not be useful for assessing the performance of classes that require viewing the full shot (e.g., 'cheering', 'dancing', 'throwing', among others). For the precise definition of each concept, please refer to this file.