Multimodal Compressed Domain Video Analysis


A fast and simple approach to content-based information retrieval is to partially process videos in the compressed domain, without transcoding them. Although desirable, compressed domain analysis is not widely used for videos encoded with state-of-the-art schemes, such as H.264/MPEG-4 AVC and H.265/HEVC, mainly because I frames are spatially predicted and because these codecs use an integer approximation of the Discrete Cosine Transform. New algorithms are needed to address these issues and to enable compressed domain information retrieval from videos that use highly efficient codecs. This is an increasingly important field that needs more research.
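
To make the obstacle more concrete, the following minimal Python sketch applies the 4x4 forward core transform of H.264/AVC first to a block of pixels and then to a prediction residual. It is an illustration under simplifying assumptions, not a decoder: quantization and scaling are ignored, the helper names (core_transform, residual) are my own, and the "predicted" block is a hypothetical stand-in for real intra prediction from neighbouring samples.

```python
# Forward 4x4 core transform matrix used by H.264/AVC.
# The scaling that makes it approximate a DCT is folded into quantization,
# which this sketch leaves out.
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    """Return the transpose of a 4x4 matrix."""
    return [list(row) for row in zip(*m)]


def core_transform(block):
    """Compute Y = CF * X * CF^T for a 4x4 block X."""
    return matmul(matmul(CF, block), transpose(CF))


def residual(block, prediction):
    """Element-wise difference between a block and its prediction."""
    return [[block[i][j] - prediction[i][j] for j in range(4)]
            for i in range(4)]


# A flat block of brightness 120 and a hypothetical intra prediction of 118.
flat = [[120] * 4 for _ in range(4)]
predicted = [[118] * 4 for _ in range(4)]

pixel_coeffs = core_transform(flat)
residual_coeffs = core_transform(residual(flat, predicted))

print("DC of pixel block:   ", pixel_coeffs[0][0])
print("DC of residual block:", residual_coeffs[0][0])
```

If the transform were applied to the pixels directly, the DC coefficient would be proportional to the block's average brightness. In an intra-predicted block it is applied to the residual, so the DC coefficient only reflects how far the block deviates from its prediction, and brightness cannot be read from the coefficients alone.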

One of my research aims is to develop a common mechanism for compressed domain analysis that covers moving object tracking, face detection and tracking, crowd flow segmentation, video classification, indexing and retrieval, and human action recognition in standard video formats, such as MPEG-2 and H.264/AVC. The input video stream is processed and wrapped in a standard MPEG-4 container without transcoding (which cannot be achieved simply by installing FFmpeg), retaining the original video quality and eliminating or delaying additional CPU-intensive computations. This mechanism would be complemented by a multimodal recording and filtering software framework designed for machine learning, adaptation, and tagging of high-level semantics, such as emotions. The automatically identified high-level descriptors can be semantically enriched and represented in RDF using popular serializations, such as RDF/XML or Turtle. The near-real-time (NRT) characteristics of the system make it suitable for a variety of applications, from surveillance and occupational health and safety (OHS) to robot vision.
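
As a rough illustration of the last step, the sketch below serializes one detected high-level descriptor as Turtle. Everything specific in it is an assumption made for the example: the example.org namespace, the terms ex:VideoSegment, ex:depicts, ex:startTime and ex:endTime, and the Descriptor class are hypothetical and do not come from an existing vocabulary or from the framework described above.

```python
from dataclasses import dataclass

# Hypothetical namespace used only for this example.
EX = "http://example.org/video#"
XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass
class Descriptor:
    """A high-level concept detected in a time interval of a video."""
    segment_id: str   # local name of the segment resource
    concept: str      # local name of the detected concept, e.g. an emotion
    start_s: float    # interval start in seconds
    end_s: float      # interval end in seconds


def to_turtle(d: Descriptor) -> str:
    """Serialize a single descriptor as a small Turtle document."""
    return (
        f"@prefix ex: <{EX}> .\n"
        f"@prefix xsd: <{XSD}> .\n"
        "\n"
        f"ex:{d.segment_id} a ex:VideoSegment ;\n"
        f"    ex:depicts ex:{d.concept} ;\n"
        f'    ex:startTime "{d.start_s:.2f}"^^xsd:decimal ;\n'
        f'    ex:endTime "{d.end_s:.2f}"^^xsd:decimal .\n'
    )


doc = to_turtle(Descriptor("segment1", "Happiness", 12.5, 15.0))
print(doc)
```

In practice an RDF library and an established multimedia ontology would be preferable to hand-built strings; plain string formatting is used here only to keep the example self-contained.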