The immense and constantly growing number of videos calls for efficient automated processing mechanisms for multimedia content. This is a real challenge due to the huge Semantic Gap between what computers can automatically interpret from audio and video signals and what humans can comprehend based on cognition, knowledge, and experience. Low-level features, which correspond to local and global characteristics of audio and video signals, as well as their aggregates and statistics, such as histograms derived from low-level features, can be captured by low-level feature descriptors. Such automatically extractable descriptors, for example dominant color and motion trajectory, are suitable only for a limited range of applications (e.g., machine learning-based classification), and are not directly connected to sophisticated human-interpretable, high-level descriptors, such as the concepts depicted in a video.
To narrow the Semantic Gap, feature extraction and analysis can be complemented by machine-interpretable background knowledge formally grounded in description logics. The depicted concepts and their spatial relationships are usually described in RDF, which expresses machine-readable statements in the form of subject-predicate-object triples (RDF triples), e.g., scene-depicts-person. The formal definitions of the depicted concepts and relationships are derived from controlled vocabularies, ontologies, commonsense knowledge bases, and Linked Open Data (LOD) datasets. Regions of Interest (RoIs) can be annotated using media fragment identifiers. The temporal annotation of actions and video events can be performed using temporal description logics and rule-based mechanisms. The fusion of these descriptors, including descriptors of different modalities, enables the machine-interpretable spatiotemporal annotation of complex video scenes.
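The annotation scheme above can be sketched in plain Python (a minimal illustration only; a real system would use an RDF library and established vocabularies rather than the hypothetical example.org names used here):

```python
# Represent RDF triples as (subject, predicate, object) tuples.
# The subject is a spatiotemporal Region of Interest addressed with a
# W3C Media Fragments URI: seconds 10-20 of the video, within a
# 320x240-pixel box whose top-left corner is at pixel (160, 120).
roi = "http://example.org/video.mp4#t=10,20&xywh=160,120,320,240"

# Hypothetical vocabulary namespace, used here for illustration only.
EX = "http://example.org/vocab#"

# A tiny in-memory graph holding the scene-depicts-person statement.
graph = {
    (roi, EX + "type", EX + "Scene"),
    (roi, EX + "depicts", EX + "Person"),  # scene-depicts-person
}
```

The media fragment syntax (`#t=start,end` for the temporal dimension, `&xywh=x,y,w,h` for the spatial one) lets a single URI address exactly the region and interval being annotated, so the triple attaches to the fragment rather than to the whole video.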
Based on these structured annotations, various inference tasks can be performed to enable the automated interpretation of video scenes (e.g., video frame interpretation via abductive reasoning, or video event recognition via reasoning over temporal DL axioms). The structured annotations can also be queried efficiently, both manually and programmatically, using the powerful SPARQL query language, although high-level concept mapping usually requires human supervision and judgment, and fully automatic annotation requires further research. Research results for high-level concept mapping in constrained video domains, such as medical, news, and sports videos, are already promising. Application areas include video understanding, content-based video indexing and retrieval, automated subtitle generation, clinical decision support, and automated music and movie recommendation engines.
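As a rough illustration of how such annotations can be queried, the sketch below evaluates a single SPARQL-style triple pattern over an in-memory set of triples. A real deployment would hold the triples in an RDF store and use a full SPARQL engine; the identifiers are hypothetical and the matcher only mimics one basic graph pattern:

```python
# Hypothetical example.org identifiers, for illustration only.
EX = "http://example.org/vocab#"
graph = {
    ("http://example.org/video.mp4#t=10,20", EX + "depicts", EX + "Person"),
    ("http://example.org/video.mp4#t=45,52", EX + "depicts", EX + "Car"),
}

def match(graph, pattern):
    """Evaluate one triple pattern; None plays the role of a SPARQL variable.

    Roughly corresponds to: SELECT ?s WHERE { ?s ex:depicts ex:Person }
    """
    s, p, o = pattern
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which media fragments depict a person?
hits = match(graph, (None, EX + "depicts", EX + "Person"))
```

A full SPARQL engine generalizes this idea: a query is a set of such triple patterns whose shared variables are joined, plus filters and solution modifiers, evaluated over the annotation graph.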