Common Low-Level Visual Descriptors
The most common perceptual categories of visual descriptors for still images are color, texture, and shape. Image sequences add one more dimension of perceptual saliency to these: motion.
- Color Descriptors. The Dominant Color Descriptor specifies a set of dominant colors for an image (typically 4–6 colors), and records the percentage of image pixels that each color covers, as well as the variance and spatial coherence of the colors. The Color Structure Descriptor encodes local color structure by sliding a structuring element over all locations in an image and counting how often each color occurs within the element. The Color Layout Descriptor represents the spatial distribution of colors. The Scalable Color Descriptor is a compact color histogram descriptor represented in the HSV color space and encoded using a Haar transform.
- Texture Descriptors. The Homogeneous Texture Descriptor characterizes regional texture using local spatial frequency statistics extracted by Gabor filter banks. The Texture Browsing Descriptor represents a perceptual characterization of texture as a vector of regularity, coarseness, and directionality. The Edge Histogram Descriptor represents the local edge distribution of an image as a histogram that captures the frequency and directionality of brightness changes in the image.
- Shape Descriptors. The Region-Based Shape Descriptor represents the distribution of all interior and boundary pixels that constitute a shape by decomposing the shape into a set of basis functions with various angular and radial frequencies using the angular radial transform, a two-dimensional complex transform defined on a unit disk in polar coordinates. The Contour-Based Shape Descriptor represents a closed two-dimensional object or region contour in an image or video. The 3D Shape Descriptor is a representation-invariant description of three-dimensional mesh models, expressing local geometric attributes of 3D surfaces in the form of shape indices calculated over a mesh using a function of the two principal curvatures.
- Motion Descriptors. The Camera Motion Descriptor represents global motion parameters that characterize a video scene at a particular time in terms of professional video camera movements, including movement along the optical axis (dolly forward/backward), horizontal and vertical rotation (panning, tilting), horizontal and vertical transverse movement (tracking, booming), change of the focal length (zooming), and rotation around the optical axis (rolling). The Motion Activity Descriptor indicates the intensity and direction of motion, and the spatial and temporal distribution of activities. The Motion Trajectory Descriptor represents the displacement of objects over time in the form of spatiotemporal localization, with positions given relative to a reference point and described as a list of vectors. The Parametric Motion Descriptor describes the global motion of video objects using a classic parametric model (translational, scaling, affine, perspective, quadratic).
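To make the color descriptors above more concrete, the following is a minimal Python sketch of a color histogram quantized in HSV space, followed by one level of a Haar-style sum/difference transform and a ranking of the most populated bins. It is only an illustration of the general idea behind histogram-based and dominant-color descriptions, not the normative MPEG-7 extraction procedure. The function names (`hsv_histogram`, `haar_step`, `dominant_bins`) and the bin counts are assumptions chosen for this example.

```python
import colorsys


def hsv_histogram(pixels, h_bins=4, s_bins=2, v_bins=2):
    """Count RGB pixels (tuples of 0-255 ints) per quantized HSV bin.

    The bin counts are illustrative assumptions, not standardized values.
    """
    hist = [0] * (h_bins * s_bins * v_bins)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a component equal to 1.0 falls into the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1
    return hist


def haar_step(values):
    """One unnormalized Haar level over an even-length list.

    Returns the pairwise sums (coarse part) followed by the pairwise
    differences (detail part).
    """
    sums = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    diffs = [values[i] - values[i + 1] for i in range(0, len(values), 2)]
    return sums + diffs


def dominant_bins(hist, top=2):
    """Return (bin index, fraction of pixels) for the most populated bins."""
    total = sum(hist)
    ranked = sorted(range(len(hist)), key=lambda i: hist[i], reverse=True)
    return [(i, hist[i] / total) for i in ranked[:top]]


# Toy "image": three pure red pixels and one pure blue pixel.
pixels = [(255, 0, 0)] * 3 + [(0, 0, 255)]
hist = hsv_histogram(pixels)
print(dominant_bins(hist))
print(haar_step(hist))
```

Keeping only the coarse half of the Haar output gives a shorter, lower-resolution histogram, which is the intuition behind a scalable representation: the same description can be compared at different levels of detail.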
Common Low-Level Audio Descriptors
The most common audio descriptors are the following:
- Temporal Audio Descriptors. The energy envelope descriptor represents the root mean square of the mean energy of the audio signal, which is suitable for silence detection. The zero crossing rate descriptor represents how many times the signal amplitude undergoes a change of sign, which is used to differentiate periodic signals from noisy signals, for example to determine whether the audio content is speech or music. The temporal waveform moments descriptor represents characteristics of the waveform shape, including temporal centroid, width, asymmetry, and flatness. The amplitude modulation descriptor describes the tremolo of a sustained sound (in the frequency range 4–8 Hz) or the graininess or roughness of a sound (in the range 10–40 Hz). The autocorrelation coefficient descriptor represents the spectral distribution of the audio signal over time, which is suitable for musical instrument recognition.
- Spectral Audio Descriptors. The spectral moments descriptor corresponds to core spectral shape characteristics, such as spectral centroid, spectral width, spectral asymmetry, and spectral flatness, which are useful for determining sound brightness and music genre, and for categorizing music by mood. The spectral decrease descriptor describes the average rate of spectral decrease with frequency. The spectral roll-off descriptor represents the frequency below which a predefined percentage (usually 85–99%) of the total spectral energy is present, which is suitable for music genre classification. The spectral flux descriptor represents the dynamic variation of spectral information, computed either as the normalized correlation between consecutive amplitude spectra or as the derivative of the amplitude spectrum. The spectral irregularity descriptor describes the amplitude difference between adjacent harmonics, which is suitable for the precise characterization of the spectrum, such as for describing individual frequency components of a sound. The formant parameter descriptors represent the spectral peaks of the sound spectrum of the voice, and are suitable for phoneme and vowel identification.
- Cepstral Audio Descriptors. Cepstral features are used for speech and speaker recognition and for music modeling. The most common cepstral descriptors are the mel-frequency cepstral coefficient descriptors, which are based on the mel scale, an approximation of the perceived pitch of pure tones, and are calculated using a discrete cosine transform of the log energy in predefined mel-spaced frequency bands.
- Perceptual Audio Descriptors. The loudness descriptor represents the impression of sound intensity. The sharpness descriptor, which corresponds to a spectral centroid, is typically estimated using a weighted centroid of specific loudness. The perceptual spread descriptor characterizes the timbral width of a sound, and is calculated as the relative difference between the specific loudness and the total loudness.
- Specific Audio Descriptors. The odd-even harmonic energy ratio descriptor represents the energy proportion carried by odd and even harmonics. The descriptors of octave band signal intensities represent the power distribution of the different harmonics of music. The attack duration descriptor represents how quickly a sound reaches full volume after it is activated, and is used for sound identification. The harmonic-noise ratio descriptor represents the ratio between the energy of the harmonic component and the noise component, and enables the estimation of the amount of noise in the sound. The fundamental frequency descriptor, also known as the pitch descriptor, represents the inverse of the period of a periodic sound.
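Several of the temporal and spectral descriptors listed above reduce to short computations on a single frame of samples. The following Python sketch illustrates four of them: root mean square energy, zero crossing rate, spectral centroid, and spectral roll-off. It assumes a mono frame given as a list of floats and uses a naive discrete Fourier transform that is only practical for short frames. The function names, the 85% roll-off fraction, and the test tone are assumptions made for this example, and exact definitions vary between toolkits.

```python
import cmath
import math


def rms_energy(frame):
    """Root mean square of the samples (basis of an energy envelope)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..n/2, computed naively in O(n^2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(mags, sample_rate, n):
    """Magnitude-weighted mean frequency in Hz (n is the frame length)."""
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total


def spectral_rolloff(mags, sample_rate, n, fraction=0.85):
    """Lowest bin frequency at which the cumulative spectral energy
    reaches the given fraction of the total."""
    energy = [m * m for m in mags]
    threshold = fraction * sum(energy)
    cumulative = 0.0
    for k, e in enumerate(energy):
        cumulative += e
        if cumulative >= threshold:
            return k * sample_rate / n
    return (len(mags) - 1) * sample_rate / n


# Illustrative input: a 1 kHz sine sampled at 8 kHz, 64 samples long.
sample_rate, n = 8000, 64
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate + 0.3)
         for t in range(n)]
mags = magnitude_spectrum(frame)
print(rms_energy(frame), zero_crossing_rate(frame))
print(spectral_centroid(mags, sample_rate, n),
      spectral_rolloff(mags, sample_rate, n))
```

For a pure tone, the centroid and the roll-off frequency both sit at the tone frequency. For noisy or broadband content they move upward and the zero crossing rate rises, which is why these values are useful for separating periodic from noisy signals.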