Quickly accessing the contents of a video is challenging for users, particularly for unstructured video, which contains no intentional shot boundaries, no chapters, and no apparent edited format. We approach this problem in the domain of lecture videos through the use of machine learning, which gathers semantic information about the videos, and through user interface design, which enables users to fully utilize this new information.
First, we use machine learning techniques to gather the semantic information. We develop a system for rapid automatic semantic tagging that applies a heuristic feature-selection algorithm, Sort-Merge, to large initial heterogeneous low-level feature sets (cardinality greater than 1,000).
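To give a flavor of this style of selection, the following is a minimal sketch of a sort-then-merge wrapper: rank features by individual predictive score, then greedily merge top candidates while the joint score improves. The function names, the scoring interface, and the toy scorer are all illustrative assumptions, not the actual Sort-Merge implementation.

```python
# Hypothetical sketch of a sort-then-merge style feature selection wrapper.
# score_fn evaluates a candidate feature subset; higher is better.

def sort_merge_select(features, score_fn, max_size=5):
    """Pick a small predictive subset from a large feature pool."""
    # Sort phase: rank features by their individual predictive score.
    ranked = sorted(features, key=lambda f: score_fn([f]), reverse=True)

    # Merge phase: greedily add features while the joint score improves.
    selected = [ranked[0]]
    best = score_fn(selected)
    for f in ranked[1:]:
        if len(selected) >= max_size:
            break
        trial = score_fn(selected + [f])
        if trial > best:
            selected.append(f)
            best = trial
    return selected, best

# Toy scorer (invented for illustration): 'a' is strongly predictive,
# 'b' is redundant with 'a', and 'c' adds a small independent signal.
def toy_score(subset):
    s = set(subset)
    score = 0.9 * ('a' in s) + 0.1 * ('c' in s)
    if 'b' in s and 'a' not in s:
        score += 0.5
    return score
```

With the toy scorer, the merge phase skips the redundant feature 'b' because it adds no joint improvement, keeping only 'a' and 'c'.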
We explore applying Sort-Merge to heterogeneous feature sets through two methods, early fusion and late fusion, each of which takes a different approach to handling the different kinds of features in the heterogeneous set. We determine the most predictive feature sets for key-frame filters such as “has text”, “has computer source code”, or “has instructor motion”. Specifically, we explore the usefulness of Haar Wavelets, Fast Fourier Transforms, Color Coherence Vectors, Line Detectors, Ink Features, and Pan/Tilt/Zoom detectors. For evaluation, we introduce a “keeper” heuristic for feature sets, which provides a method of comparing performance against a baseline.
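The early/late distinction can be sketched schematically: early fusion concatenates the heterogeneous feature blocks into one vector per sample before training a single model, while late fusion trains one model per feature type and combines their scores. The `train_fn` interface and the averaging rule here are illustrative placeholders, not the system's actual pipeline.

```python
# Schematic contrast of early vs. late fusion over heterogeneous feature
# blocks. feature_blocks is a list of blocks, one per feature type; each
# block is a list of per-sample feature vectors (plain Python lists).

def early_fusion(feature_blocks, train_fn):
    """Concatenate feature blocks per sample, then train one model
    on the joined vectors."""
    joined = [sum(blocks, []) for blocks in zip(*feature_blocks)]
    return train_fn(joined)

def late_fusion(feature_blocks, train_fn):
    """Train one model per feature type; fuse by averaging scores."""
    models = [train_fn(block) for block in feature_blocks]

    def fused_predict(sample_blocks):
        # One feature vector per block for a single sample.
        scores = [m(b) for m, b in zip(models, sample_blocks)]
        return sum(scores) / len(scores)

    return fused_predict
```

The trade-off is that early fusion lets one learner model cross-feature interactions, while late fusion keeps each feature type's learner simple and combines them only at decision time.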
Second, we create a user interface that lets users make use of the semantic tags gathered through our computer vision and machine learning process. The interface is integrated into an existing video browser, which detects shot-like boundaries and presents a multi-timeline view. The content within shot-like boundaries is represented by frames, to which our new interface applies the generated semantic tags. Specifically, we make accessible the semantic concepts of 'text', 'code', 'presenter', and 'person motion'. The tags are detected in the simulated shots using the filters generated with our machine learning approach and are displayed in a user-customizable multi-timeline view. We also generate tags based on ASR-generated transcripts that have been limited to the words appearing in the index of the course textbook. Each occurrence is aligned with the simulated shots, and each spoken word becomes a tag analogous to the visual concepts. A full Boolean algebra over the tags enables new composite tags such as 'text or code, but no presenter.'
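A Boolean algebra over per-shot tag sets can be evaluated as a small recursive query, illustrated below for the composite tag 'text or code, but no presenter' from the text. The shot data and the query AST encoding are invented for demonstration; only the tag vocabulary comes from the system described above.

```python
# Illustrative Boolean tag algebra over shots. Each shot carries a set
# of semantic tags; a query is a small AST of and/or/not over tag tests.

shots = [
    {"id": 1, "tags": {"text", "presenter"}},
    {"id": 2, "tags": {"code"}},
    {"id": 3, "tags": {"text"}},
    {"id": 4, "tags": {"presenter", "person motion"}},
]

def matches(tags, query):
    """Evaluate a query AST: ('tag', name) | ('not', q) | ('and'|'or', q1, q2)."""
    op = query[0]
    if op == "tag":
        return query[1] in tags
    if op == "not":
        return not matches(tags, query[1])
    if op == "and":
        return matches(tags, query[1]) and matches(tags, query[2])
    if op == "or":
        return matches(tags, query[1]) or matches(tags, query[2])
    raise ValueError(f"unknown operator: {op}")

# Composite tag: 'text or code, but no presenter'
query = ("and",
         ("or", ("tag", "text"), ("tag", "code")),
         ("not", ("tag", "presenter")))
hits = [s["id"] for s in shots if matches(s["tags"], query)]
```

Here shots 2 and 3 match: each contains 'text' or 'code' without a 'presenter' tag.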
Finally, we quantify the effectiveness of our features and our browser through user studies, both observational and task-driven. We find that users with the full suite of tools performed a search task in 60% of the time required by users without access to tags. We find that when users are asked to perform search tasks, they follow a nearly fixed pattern of accesses, alternating between the use of tags and Keyframes, or between the use of Word Bubbles and the media player. Based on user behavior and feedback, we redesigned the interface to spatially group components that are used together, removed unused components, and redesigned the display of Word Bubbles to match that of the Visual Tags. We found that users strongly preferred the Keyframe tool, as well as both kinds of tags. Users either found the algebra very useful or not useful at all.