Project Details

Automatic Alignment of Text-to-Video for Semantic Multimedia Analysis

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term from 2014 to 2018
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 252286362
 
In this project, we aim to explore rich textual descriptions of video data (TV series and movies), which open up myriad possibilities for multimedia analysis and understanding and for obtaining weak labels for popular computer vision tasks. We wish to focus on two forms of text: plot synopses and books. The former are obtained via crowdsourcing and describe an episode or movie in summarized form. In contrast, books (from which the video is adapted) provide detailed descriptions of the story and the visual world the author wishes to portray. While text in the form of subtitles and transcripts has been used successfully to automate person identification [Everingham 2006] and to obtain samples for action recognition [Laptev 2008], these text sources are limited in their potential for understanding or obtaining rich descriptions of the story.

To use the plot synopses, we will first align the sentences of the synopsis to shots in the video (WP2). We propose to use anchors, primarily person identities, to guide the alignment (a minimal alignment sketch follows this summary). We aim to solve two main challenges associated with this task: possible non-linearity of the plot synopsis, and the need to skip shots that no sentence describes.

In contrast to plot synopses, the first step we take in analyzing books is to align chapters to their corresponding video shots (WP3). We can expect that some dialogues in a book match those used in its video adaptation. This allows us to automatically identify characters and learn person models in a second step (see the dialogue-matching sketch below), and also facilitates fine-grained alignment within a chapter.

The alignment can be improved by knowing more about the scenes or objects present in the shots. We will investigate this interconnected behaviour of labels and anchors in WP4, first in an iterative manner, and then by jointly modeling the two tasks of obtaining weak labels and performing alignment.

We divide the applications into two types: (i) obtaining labels from the text sources and (ii) video-related applications. From plot synopses, we will specifically aim to obtain weak labels for places or scenes (WP5-P1). We will also explore tasks such as summarization, indexing, and retrieval (WP5-P2). For example, a coherent video summary based on the story (rather than on low-level features) can be generated by first running a text summarizer on the plot, followed by selection of the shots aligned to the retained sentences (see the pipeline sketch below). Indexing the descriptions for keywords can also enable easy browsing through the video. From books, we wish to exploit dialogues to obtain supervision for person identification, and the rich descriptions surrounding the dialogues to learn attributes of characters, scenes, and objects (WP5-P1). Another interesting application is to automatically find differences between books and their video adaptations (WP5-P2).
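To make the WP2 sentence-to-shot alignment concrete, the following is a minimal sketch in Python, assuming a precomputed sentence-shot similarity matrix sim (for example, how many of the characters identified in shot j are also named in sentence i, following the person-id anchors above). It is a simple monotonic dynamic program that lets shots be skipped; it deliberately ignores the non-linearity challenge, and all names and interfaces are illustrative rather than the project's actual method.

    import numpy as np

    def align_sentences_to_shots(sim):
        """Monotonic alignment of synopsis sentences (rows) to shots
        (columns) maximizing total similarity.  Shots that no sentence
        describes are skipped automatically.  Returns assign, where
        assign[i] is the shot chosen for sentence i and the sequence
        is non-decreasing (monotonic story order)."""
        n_sent, n_shot = sim.shape
        dp = np.full((n_sent, n_shot), -np.inf)   # dp[i, j]: best score with sentence i at shot j
        back = np.zeros((n_sent, n_shot), dtype=int)
        dp[0] = sim[0]                            # the first sentence may start at any shot
        for i in range(1, n_sent):
            best_j = 0                            # running argmax of dp[i-1, 0..j]
            for j in range(n_shot):
                if dp[i - 1, j] > dp[i - 1, best_j]:
                    best_j = j
                dp[i, j] = sim[i, j] + dp[i - 1, best_j]
                back[i, j] = best_j               # predecessor shot <= j enforces monotonicity
        assign = [int(np.argmax(dp[-1]))]         # backtrack from the best final shot
        for i in range(n_sent - 1, 0, -1):
            assign.append(int(back[i, assign[-1]]))
        return assign[::-1]

    # Toy example: 3 sentences, 6 shots; shots 1, 3 and 4 are skipped.
    sim = np.array([[2., 1., 0., 0., 0., 0.],
                    [0., 0., 3., 1., 0., 0.],
                    [0., 0., 0., 0., 1., 4.]])
    print(align_sentences_to_shots(sim))          # -> [0, 2, 5]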
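One way to turn the expected dialogue overlap of WP3 into supervision for person identification is fuzzy string matching between quoted book dialogue and timed subtitle lines: a confident match transfers the book's speaker attribution to the subtitle's time window, where face tracks can then be collected as weakly labeled examples. Below is a sketch using Python's standard difflib; the data layout, normalization, and threshold are assumptions for illustration only, not the project's pipeline.

    import difflib
    import re

    def match_dialogues(book_quotes, subtitles, thresh=0.8):
        """book_quotes: list of (speaker, quote) pairs parsed from the book.
        subtitles: list of (start, end, line) with times in seconds.
        Returns (start, end, speaker, score) tuples marking time windows
        in which the named character is probably speaking."""
        norm = lambda s: re.sub(r"[^a-z' ]", "", s.lower())
        labels = []
        for speaker, quote in book_quotes:
            q = norm(quote)
            for start, end, line in subtitles:
                score = difflib.SequenceMatcher(None, q, norm(line)).ratio()
                if score >= thresh:               # confident textual match
                    labels.append((start, end, speaker, score))
        return labels

    labels = match_dialogues(
        [("Tyrion", "A mind needs books as a sword needs a whetstone.")],
        [(512.0, 515.5, "A mind needs books like a sword needs a whetstone.")])
    print(labels)   # speaker label transferred to the 512.0-515.5 s window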
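The story-driven summarization described for WP5-P2 then reduces to a two-stage pipeline over the alignment. A minimal sketch, where summarize is any hypothetical extractive text summarizer returning the indices of retained sentences, and alignment maps each sentence index to the shot indices it was aligned to (e.g. wrapping the output of the first sketch):

    def story_driven_summary(plot_sentences, alignment, shots, summarize, ratio=0.3):
        """Summarize the plot text first, then keep exactly the shots that
        are aligned to the retained sentences, in original story order."""
        kept = summarize(plot_sentences, ratio)   # indices of retained sentences
        shot_ids = sorted({s for i in kept for s in alignment[i]})
        return [shots[s] for s in shot_ids]

Because the selection operates on the plot sentences rather than on low-level video features, the resulting summary stays coherent with the story line, as intended above.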
DFG Programme Research Grants
 
 
