Automatically extracted low-level features are inefficient for unconstrained video scene interpretation and complex event recognition, which makes formally grounded, high-level symbolic scene interpretation a viable extension for activity recognition in video streams. Video scene interpretation can be formalized in first-order logic (FOL) as well as in description logics (DLs), the logical underpinnings of the DL flavor of the Web Ontology Language (OWL), for example by mapping atomic concepts to geometric primitives. Machine learning algorithms can be combined with OWL ontologies for scene interpretation, where the ontologies provide the concepts, properties, and relationships of the depicted knowledge domain. OWL 2 ontologies can be extended with Semantic Web Rule Language (SWRL) rules to represent complex video events. While ontology-based video scene interpretation is promising, it poses several research challenges: deductive reasoning alone is insufficient for automated video scene interpretation, complex video event interpretation cannot be modeled as classification, and DL ontologies are inherently monotonic, so previously drawn conclusions cannot be retracted when new observations arrive.
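To illustrate how a SWRL-style rule might capture a complex video event, the following is a minimal sketch in plain Python. It is not an OWL reasoner and uses no ontology library. The predicates (Person, Vehicle, approaches, enters, boards), the individuals (p1, p2, v1), and the rule itself are hypothetical and chosen only for illustration. The loop only ever adds facts, which mirrors the monotonic behavior mentioned above.

```python
# Ground facts, as might be asserted after object detection and tracking.
# Every predicate and individual name here is a hypothetical example.
facts = {
    ("Person", "p1"),
    ("Person", "p2"),
    ("Vehicle", "v1"),
    ("approaches", "p1", "v1"),
    ("enters", "p1", "v1"),
    ("approaches", "p2", "v1"),
}


def boarding_rule(known):
    """SWRL-style rule (illustrative only):
    Person(?p) ^ Vehicle(?v) ^ approaches(?p, ?v) ^ enters(?p, ?v)
        -> boards(?p, ?v)
    """
    inferred = set()
    for fact in known:
        if fact[0] != "enters":
            continue
        _, p, v = fact
        if (
            ("Person", p) in known
            and ("Vehicle", v) in known
            and ("approaches", p, v) in known
        ):
            inferred.add(("boards", p, v))
    return inferred


def forward_chain(initial, rules):
    """Apply the rules until no new fact is derived.

    Facts are only added, never retracted, so the result is monotonic.
    """
    closure = set(initial)
    while True:
        new = set()
        for rule in rules:
            new |= rule(closure)
        new -= closure
        if not new:
            return closure
        closure |= new


closure = forward_chain(facts, [boarding_rule])
print(sorted(f for f in closure if f[0] == "boards"))
# prints [('boards', 'p1', 'v1')]
```

Only p1 satisfies every atom in the rule body, so a single boards fact is derived; p2 approaches the vehicle but never enters it. Retracting an inferred event after a later contradicting observation would require a non-monotonic mechanism that this sketch deliberately does not attempt.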