The automatic analysis and indexing of multimedia content in general domains are important for a variety of multimedia applications. This thesis investigates the problem of semantic concept detection in general videos, focusing on two advanced directions: multi-concept learning and multi-modality learning.
Semantic concept detection refers to the task of assigning to an input video sequence one or multiple labels indicating the presence of one or multiple semantic concepts in the video sequence. Much of the prior research work deals with the problem in an isolated manner, i.e., a binary classifier is constructed using feature vectors from the single visual modality to classify whether or not a video contains a specific concept. However, multimedia videos comprise information from multiple modalities (both visual and audio). Each modality brings some information about the other, and their simultaneous processing can uncover relationships that are otherwise unavailable when considering the modalities separately. In addition, real-world semantic concepts do not occur in isolation. The context information is useful for enhancing detection of individual concepts.
This thesis explores multi-concept learning and multi-modality learning to improve semantic concept detection in general videos, i.e., videos that have general content and are captured in uncontrolled conditions. For multi-concept learning, we propose two methods, built on the frameworks of two-layer Context-Based Concept Fusion (CBCF) and single-layer multi-label classification, respectively. The first method represents the inter-conceptual relationships by a Conditional Random Field (CRF). The inputs of the CRF are initial detection probabilities from independent concept detectors. Through inference with concept relations in the CRF, we obtain updated concept detection probabilities as outputs. To avoid the difficulty of designing compatibility potentials in the CRF, a discriminative cost function aiming at class separation is directly minimized. We further extend this method to study an interesting "20 questions problem" for semantic concept detection, where user interaction is incorporated to annotate a small number of key concepts for each data sample, which are then used to improve detection of the remaining concepts. To this end, an active CBCF approach is proposed that can choose the most informative concepts for the user to label. The second multi-concept learning method does not explicitly model concept relations but optimizes multi-label discrimination for all concepts over all training data through a single-layer joint boosting algorithm. By sharing "good" kernels among different concepts, the accuracy of individual detectors can be improved; by joint learning of common detectors across different classes, the kernels required and the computational complexity for detecting individual concepts can be reduced.
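As a rough illustration of the two-layer data flow and the "20 questions" interaction described above, the following minimal Python sketch may help. It is not the boosted CRF of the thesis: the second layer is replaced by a simple logistic update, "most informative" is approximated by the binary entropy of the initial probability, and the concept names, the `weights` table, and all function names are hypothetical choices made only for this example.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with P(present) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def pick_concept_to_label(probs, already_labeled=()):
    """Assumed selection rule: ask the user about the most uncertain concept."""
    candidates = {c: p for c, p in probs.items() if c not in already_labeled}
    return max(candidates, key=lambda c: binary_entropy(candidates[c]))

def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p / (1.0 - p))

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fuse_with_context(probs, weights, user_labels=None):
    """Second-layer update (a stand-in, not the CRF inference of the thesis).

    Each concept's initial score is shifted by a weighted sum of the other
    concepts' evidence, where user labels (0.0 or 1.0) override the
    first-layer probabilities.
    """
    user_labels = user_labels or {}
    evidence = dict(probs)
    evidence.update(user_labels)
    updated = {}
    for target in probs:
        if target in user_labels:
            updated[target] = float(user_labels[target])
            continue
        context = sum(
            weights.get(target, {}).get(other, 0.0) * (evidence[other] - 0.5)
            for other in probs
            if other != target
        )
        updated[target] = _sigmoid(_logit(probs[target]) + context)
    return updated

# First-layer outputs of independent detectors (made-up numbers).
probs = {"beach": 0.5, "sky": 0.9, "indoor": 0.2}
# Hypothetical pairwise relations: positive = tend to co-occur.
weights = {
    "beach": {"sky": 2.0, "indoor": -1.0},
    "sky": {"beach": 2.0},
    "indoor": {"beach": -1.0},
}

query = pick_concept_to_label(probs)
updated = fuse_with_context(probs, weights, user_labels={query: 1.0})
```

In this toy run the user is asked about the concept whose detector is least certain, and confirming it raises the score of the positively related concept while lowering the score of the negatively related one.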
For multi-modality learning, we develop methods with two strategies: global fusion of features or models from multiple modalities, and construction of a local audio-visual atomic representation to enforce a moderate-level audio-visual synchronization. Two algorithms are developed for global multi-modality fusion, i.e., the late-fusion audio-visual boosted CRF and the early-fusion audio-visual joint boosting. The first method is an extension of the above two-layer CBCF multi-concept learning approach, where the inputs of the CRF include independent concept detection probabilities obtained by using visual and audio features individually. The second method is an extension of the above single-layer multi-label classification approach, where both visual-based kernels and audio-based kernels are shared by multiple concepts through the joint boosting multi-label concept detector. These two proposed methods naturally combine multi-modality learning and multi-concept learning to exert the power of both for enhancing semantic concept detection. To analyze moderate-level audio-visual synchronization in general videos, we propose to generate a local audio-visual atomic representation, i.e., the Audio-Visual Atom (AVA). We track visually consistent regions in the video sequence to generate visual atoms. At the same time, we locate audio onsets in the audio soundtrack to generate audio atoms. Then visual atoms and audio atoms are combined to form AVAs, on top of which joint audio-visual codebooks are constructed. The audio-visual codebooks capture the co-occurring audio-visual patterns that are representative of different individual concepts, and accordingly can improve concept detection.
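The AVA pipeline above (visual atoms, audio atoms, combination, codebook) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: atoms are paired when an audio onset falls inside the time span of a tracked region, an AVA is the concatenation of the two feature vectors, and the codebook is given in advance and applied by nearest-codeword assignment. None of these choices is taken from the thesis, and the field names and toy features are hypothetical.

```python
import math

def pair_atoms(visual_atoms, audio_atoms):
    """Form AVAs by concatenating features of temporally overlapping atoms.

    Assumption: a visual atom and an audio atom are combined when the audio
    onset time lies within the visual atom's [start, end] span.
    """
    avas = []
    for v in visual_atoms:
        for a in audio_atoms:
            if v["start"] <= a["onset"] <= v["end"]:
                avas.append(v["feature"] + a["feature"])  # list concatenation
    return avas

def nearest_codeword(vector, codebook):
    """Index of the codeword closest to `vector` in Euclidean distance."""
    return min(range(len(codebook)), key=lambda k: math.dist(vector, codebook[k]))

def codebook_histogram(avas, codebook):
    """Normalized histogram of codeword assignments for one video."""
    counts = [0] * len(codebook)
    for ava in avas:
        counts[nearest_codeword(ava, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]

# Toy data: two tracked regions and two audio onsets (made-up features).
visual_atoms = [
    {"start": 0.0, "end": 1.0, "feature": [0.1, 0.2]},
    {"start": 1.2, "end": 2.0, "feature": [0.9, 0.8]},
]
audio_atoms = [
    {"onset": 0.5, "feature": [0.0]},
    {"onset": 1.5, "feature": [1.0]},
]
# Hypothetical joint audio-visual codebook with two codewords.
codebook = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

avas = pair_atoms(visual_atoms, audio_atoms)
histogram = codebook_histogram(avas, codebook)
```

The resulting histogram is a fixed-length video descriptor that a concept classifier could consume; how the codebook itself is learned is left out of this sketch.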
The contributions of this thesis can be summarized as follows. (1) An in-depth study of jointly detecting multiple concepts in general domains, where concept relationships are hard to compute. (2) The first system to explore the "20 questions" problem for semantic concept detection, by incorporating users' interactions and taking into account joint detection of multiple concepts. (3) An in-depth investigation of combining audio and visual information to enhance the detection of generic concepts. (4) The first system to explore the localized joint audio-visual atomic representation for concept detection under challenging conditions in general domains.