Learning-Based Local Visual Representation and Indexing

by: Rongrong Ji, Yue Gao, Ling-Yu Duan, Hongxun Yao, Qionghai Dai

Elsevier Reference Monographs, 2015

ISBN: 9780128026205, 128 pages

Format: PDF, ePUB, OL

Copy protection: DRM

Windows PC, Mac OSX for all DRM-capable eReaders; Apple iPad, Android tablet PCs; Apple iPod touch, iPhone, and Android smartphones; online reading for: Windows PC, Mac OSX, Linux

Price: 28.95 EUR

Chapter 2

Interest-Point Detection


Beyond Local Scale


Abstract


In this chapter, we propose a novel approach that introduces contextual cues into the detection stage, cues that remain unexploited in the existing literature. Our motivation is that, so far, only local cues have been used for detection, which ignores the contextual statistics of neighboring points. As a result, it is hard to detect the real “interest” points at a higher scale. Furthermore, we consider the possibility of feeding supervised information back to guide the detection stage. To achieve these goals, we introduce a context-aware semi-local (CASL) feature detector framework. This framework lifts the interest-point detector from the traditional local scale to a “semi-local” scale, which enables the detection of more meaningful and discriminative features.

Keywords

local feature detector

semi-local

scale space

DoCG

CASL

2.1 Introduction


In recent years, local interest points, also known as local features or salient regions, have been widely used in a large variety of computer vision tasks, ranging from object categorization, location recognition, and image retrieval to video analysis and scene reconstruction. The use of local interest points typically involves two consecutive steps, called the detector and descriptor steps. The detector step discovers and locates the areas where interest points reside in a given image, e.g., corners and junctions. Such areas should contain strong signal changes in more than one dimension, and should be repeatably identifiable among images captured with different viewing angles, cameras, lighting conditions, etc. To provide repeatable and discriminative detection, many local feature detectors have been proposed; for instance, Harris-Affine [80], Hessian-Affine [81], MSER [29], and DoG [8].

The descriptor step involves providing a robust characterization of the detected interest points. The goal of this description, in combination with the previous detection operation, is to provide good invariance to variations in scales, rotations, and (partially) affine image transformations. Over the past decade, various representative interest-point descriptors have also been proposed in the literature; for instance, SIFT [9], GLOH [10], shape context [19], RIFT [20], MOP [21], and learning-based (MSR) descriptors [22, 23].

In this chapter, we skip the basic concepts of how typical detectors and descriptors work. Instead, we discuss in detail a fundamental issue: the detector scale. Generally speaking, the detector operation provides scale invariance to a certain degree, thanks to scale space theory. However, our concern here is whether detection should operate at the scale of the “isolated” interest point or at a higher scope, for example, by investigating the spatial correlation and co-occurrence among these local interest points. So far, the detector phase of each local feature has been treated in an isolated manner [8, 29, 80, 81]; that is, each salient region is detected and located separately. We term the detector proposed in this chapter the Context-Aware Semi-Local (CASL) detector.

However, one interesting observation is that, in many computer vision applications, the interest points are processed together in the subsequent recognition or classification steps. In other words, in these subsequent operations, the recognition or classification system investigates their joint statistics. It is therefore natural to expect that the links (or middle-level representation) among such “local” interest points are even more important. Inspiration also comes from the study of the human visual system [82]. For instance, it has been discovered that the contextual statistics of simple cell responses in the V1 cortex (which can be simulated by local filters such as Gabor filters) are integrated into complex cells in V2 [83] to produce semi-local stimuli in visual representation.

Therefore, it is natural to raise the question: “Can the context of correlative local interest points be integrated to detect more meaningful and discriminative features?”

In this chapter, we explore the “context” issue of local feature detectors, which refers to both the spatial and the inter-image co-occurrence statistics of local features. We first review related work on how local feature context can benefit visual representation and recognition. In general, work in this field can be subdivided into three groups:

 Methods that combine spatial contexts to build local descriptors [19, 85–89]. For example, the shape context idea [19] adopted spatially nearby shape primitives to co-describe local shape features by radial and directional bin division. Lazebnik et al. [87] presented a semi-local affine part description, which used geometric construction and supervised grouping of local affine parts to model 3D objects. Bileschi et al. [89] proposed a contextual description that adds continuity, symmetry, closure, and repetition to low-level descriptors, with the C1 feature as its basic detector part.

 Methods that combine spatial contexts to refine the recognition classifiers. The most representative work is that of Torralba et al. [90–92] in context-aware recognition, which aims to integrate spatial co-occurrence statistics into recognition classifier outputs to co-determine object labels. In addition to these two groups, there are also recent works in context-aware similarity matching [35, 93], which adopt similarities from spatial neighbors to filter out contextually dissimilar patches. It has been shown that combining the contextual statistics of the local descriptors can largely improve the overall discriminability.

 Methods related to biologically inspired models that filter out non-informative regions with limited saliency [94]. Serre et al. [95] presented a biologically inspired model that mirrors the mechanisms of the V1 and V2 cortices, in which an “S1-C1-S2-C2” framework is proposed to extract features.

The proposed feature detector consists of two steps. First, at a given scale, the correlation among spatially nearby local interest points is modeled with a Gaussian kernel to build a contextual response. We simulate Gaussian smoothing and call the output of this step the contextual Gaussian estimator. Then, following the difference of Gaussians setting, we derive the differences between nearby scales, which are aggregated into a difference of contextual Gaussians (DoCG) field. We show that the DoCG field can highlight contextually salient regions, as well as discriminate foreground regions from background clutter to reveal visual attention [94, 96]. The proposed “semi-local” detector is built over the peaks in the DoCG field, which are located using mean shift search [37]. Each located peak is described with a context-aware semi-local descriptor, which also ensures invariance to scales, rotations, and affine transformations.
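The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's exact formulation: the contextual response is assumed to be a sum of isotropic Gaussians centered on the detected interest points, the scale ratio `k = 1.6` is borrowed from common difference-of-Gaussians practice, and the mean shift weights are assumed to be the magnitude of the DoCG response at each point. All function names are hypothetical.

```python
import math

def contextual_gaussian(points, x, y, sigma):
    """Assumed contextual response: sum of Gaussians centered on interest points."""
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return sum(
        norm * math.exp(-((px - x) ** 2 + (py - y) ** 2) / (2.0 * sigma ** 2))
        for px, py in points
    )

def docg(points, x, y, sigma, k=1.6):
    """Assumed DoCG: contextual response at scale k*sigma minus that at sigma."""
    return (contextual_gaussian(points, x, y, k * sigma)
            - contextual_gaussian(points, x, y, sigma))

def mean_shift_peak(points, weights, start, bandwidth, iters=100, tol=1e-6):
    """Weighted mean shift with a Gaussian kernel, started from `start`."""
    x, y = start
    for _ in range(iters):
        sw = sx = sy = 0.0
        for (px, py), w in zip(points, weights):
            d2 = (px - x) ** 2 + (py - y) ** 2
            g = w * math.exp(-d2 / (2.0 * bandwidth ** 2))
            sw += g
            sx += g * px
            sy += g * py
        if sw == 0.0:
            break  # no point has any influence here; stop searching
        nx, ny = sx / sw, sy / sw
        if math.hypot(nx - x, ny - y) < tol:
            return nx, ny
        x, y = nx, ny
    return x, y

# Toy input: a dense cluster of four points and a sparser pair far away.
points = [(10, 10), (11, 10), (10, 11), (11, 11), (40, 40), (41, 40)]
# Assumption: weight each point by the magnitude of its DoCG response.
weights = [abs(docg(points, px, py, sigma=2.0)) for px, py in points]
peak = mean_shift_peak(points, weights, start=(12.0, 12.0), bandwidth=3.0)
print(peak)
```

With normalized Gaussians, the response at a dense cluster is larger at the finer scale, so the DoCG value there is negative; this sketch therefore uses its magnitude as the weight. Started near the dense cluster, the search settles at that cluster's center, which is the kind of "semi-local" peak the text describes.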

We further extend our semi-local interest-point detector from the unsupervised case to the supervised case. In the literature, this is related to visual pattern mining [71, 72, 97] and learning-based descriptors [22, 23]. To the best of our knowledge, our work is the first to target “learning-based interest-point detection.” This is achieved by integrating category learning into the kernels and weights of our mean shift search to discover category-aware discriminative features.

This chapter serves as the basis for the entire book on learning-based local visual representation and indexing. First, at the initial local interest-point detection stage, it is beneficial to embed spatial nearest neighbor cues as context to design a better detector, which can serve as a good initial step for the subsequent unsupervised or supervised codebook learning. Second, the learned “semi-local” interest-point detector can be used in many other application scenarios, where it can be combined with other related techniques to directly improve retrieval, as will be detailed later in this chapter. For more details on the innovations of this chapter, please refer to our publication in ACM Transactions on Intelligent Systems and Technology (2012).

2.2 Difference of Contextual Gaussians


We first introduce how to build the local detector context from the spatial co-occurrence statistics of local interest points at multiple scales. Generally speaking, the “base” detector can be any of the current approaches [81]. As an example, here we use difference of Gaussians [9] as the base detector.
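To make the base detector concrete, the following is a minimal pure-Python sketch of a difference-of-Gaussians response on a small intensity image. It is an illustration under assumptions rather than the chapter's implementation: the scale ratio `k = 1.6`, the kernel truncation at three standard deviations, the edge clamping, and all function names are choices made here for the example, and `sigma` stands for the scale parameter.

```python
import math

def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian kernel, truncated at 3*sigma (an assumption)."""
    radius = max(1, int(math.ceil(3.0 * sigma)))
    values = [math.exp(-(i * i) / (2.0 * sigma * sigma))
              for i in range(-radius, radius + 1)]
    total = sum(values)
    return [v / total for v in values]

def blur(image, sigma):
    """Separable Gaussian blur with edge clamping; image is a list of rows."""
    kernel = gaussian_kernel_1d(sigma)
    r = len(kernel) // 2
    h, w = len(image), len(image[0])
    # Horizontal pass.
    rows = [[sum(kernel[j + r] * image[y][min(max(x + j, 0), w - 1)]
                 for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]
    # Vertical pass.
    return [[sum(kernel[j + r] * rows[min(max(y + j, 0), h - 1)][x]
                 for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def difference_of_gaussians(image, sigma, k=1.6):
    """Blur at scale k*sigma minus blur at scale sigma."""
    coarse = blur(image, k * sigma)
    fine = blur(image, sigma)
    return [[coarse[y][x] - fine[y][x] for x in range(len(image[0]))]
            for y in range(len(image))]

# Toy intensity image: one bright 3x3 blob on a dark background.
size = 21
image = [[0.0] * size for _ in range(size)]
for y in range(9, 12):
    for x in range(9, 12):
        image[y][x] = 1.0

dog = difference_of_gaussians(image, sigma=1.0)
# The location with the strongest response magnitude is the candidate interest point.
best = max((abs(dog[y][x]), (x, y)) for y in range(size) for x in range(size))
print(best[1])
```

On this toy image the strongest response magnitude falls at the blob center, which is the behavior a blob-like base detector is expected to show. A practical detector would search for extrema across a stack of scales rather than at a single scale.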

2.2.1 Local Interest-Point Detection


For a target image, we first define the scale space L at scale δ by applying a Gaussian convolution kernel to the intensity component in its HSI color space. As shown in Equation (2.1), I(x,y) denotes the intensity of pixel (x,y) and G(x,y,δ) is a Gaussian kernel applied to...