Bridging the Semantic Gap


A common machine learning approach to video scene categorization is to extract low-level features, such as color and texture, or a bag of visual words (BoVW, also known as a bag of visual features, BoVF), and feed them to a classifier, such as a support vector machine (SVM), to infer higher-level information about the visual content. However, low-level video features alone are inadequate for representing video semantics, because an automatically extracted feature, such as a color distribution, does not capture the meaning of the visual content. This huge discrepancy between automatically extracted, machine-processable low-level video features and manually created and/or evaluated, sophisticated high-level annotations is called the Semantic Gap.
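To make the pipeline concrete, the following is a minimal sketch, assuming hypothetical frame paths and scene labels, of the approach described above: a global low-level feature (an HSV color histogram) is extracted from representative video frames and passed to an SVM classifier.

```python
# A minimal sketch (not from the text) of the low-level-feature + SVM pipeline.
# Frame paths and scene labels below are hypothetical placeholders.
import cv2
import numpy as np
from sklearn.svm import SVC

def color_histogram(image_path, bins=(8, 8, 8)):
    """Normalized 3-D HSV color histogram: a typical low-level feature."""
    image = cv2.imread(image_path)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    cv2.normalize(hist, hist)
    return hist.flatten()

# Hypothetical training data: representative frames and their scene labels.
train_frames = ["beach_01.jpg", "beach_02.jpg", "city_01.jpg", "city_02.jpg"]
train_labels = ["beach", "beach", "city", "city"]

X = np.array([color_histogram(p) for p in train_frames])
y = np.array(train_labels)

# The SVM learns a mapping from color distributions to scene categories;
# the features themselves carry no explicit meaning, which is the Semantic Gap.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([color_histogram("unknown_frame.jpg")]))
```

A BoVW variant of the same pipeline would replace the global histogram with local descriptors quantized against a learned visual vocabulary, but the classifier and the limitation are the same: the inferred label is not grounded in the semantics of the scene.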
Due to the growing popularity of video resources, bridging the Semantic Gap is highly desirable, but it poses many challenges that require further research. Approaches that attempted to bridge, or at least narrow, the Semantic Gap in video understanding over the past decade have failed; these include, but are not limited to, the introduction of “core” multimedia ontologies, rule-based automated reasoning, and new algorithms for local feature extraction.