Vision for recognition
Reading
 Nelson, From visual homing to object recognition, in Aloimonos (ed.) Visual Navigation, Lawrence Erlbaum Associates, 1997
 Borotschnig, Appearance-based object recognition, Image and Vision Computing, v. 18, 2000.
In the Fermüller-Aloimonos Active Vision Paradigm hierarchy of competencies, the motion competencies come first, followed by the shape competencies. Object recognition is one of the latter.
Nelson presents a procedure for extending the associative-memory-based approach he used for homing, a motion competency, to OR. This is consistent with the Active Vision Paradigm's insistence that competencies are layered on, and dependent on, one another. This is where we will start our look at methods for implementing shape competencies.
Nelson's associative-memory-based OR
There are two broad classes of 3D OR methods:
 Model- or object-based
 Memory- or view-based
Object-based methods use 3D match criteria, while view-based methods use 2D matching. Earlier this semester we discussed examples of both types: object-based methods built on volumetric, polyhedral, superquadric and sweep representations, and view-based methods including aspect graphs and characteristic views. The Goad algorithm is also view-based, since its match criteria are 2D even though the stored templates are 3D.
In the spirit of active vision's emphasis on making the link between percept (sensory data) and behavior as direct as possible, Nelson chooses a memory-based approach. In his view, we see, then we act. The model-based approaches require an intermediate step of some complexity: we see (2D images), then we model (build a 3D model for matching purposes), then we act.

Advantages of memory-based approaches:
 Consistent with motion competencies;
 Behavior rather than recovery oriented;
 Less dependent on segmentation and preprocessing;
 Faster online;
 Consistent with learning;
 OR becomes pattern recognition.

Some problems with memory-based approaches are:
 Memory load;
 Occlusion not easily handled;
 Clutter not easily handled.
Nelson's two-stage associative memory approach is meant to address all three of these problems.
The basic scheme involves these steps:
Offline:
 For each training view of each object, determine and store the values of visible higher level features called keys or key features. The keys should be local rather than global.
 Over the entire training dataset, determine and store the relative frequency of association of each key with each hypothesis (object class, position, pose and scale).
Online:
 Extract primitive features, combine into key features.
 Pass the key features through an associative memory designed to recall corresponding object classes and configurations (location and pose in image space).
 Add geometric information about the key features to produce the set of feasible hypotheses called the evidentiary set.
 Pass the evidentiary set through a second associative memory which associates each hypothesis with the evidentiary set using a voting procedure based on the probabilities.
So the first pass takes the key features as input and outputs the evidentiary set: it narrows the hypotheses to be explored further to only those which are realistically possible given the keys and geometry. The second pass then takes these hypotheses as input and judges which is most likely.
Here's how this procedure deals with the three main problems with memory-based 3D OR mentioned above:
 Memory load: keys are semi-invariant, highly compressed object representations;
 Occlusion: keys are local and the final OR decision will be based on a voting procedure not requiring a full view;
 Clutter: voting procedures inherently tolerate feasible candidates with few votes.
Semi-invariance means that the key features change only slowly as the pose and scale of the object change in image space. For instance, the color of a given segment of the object would be semi-invariant, while the angle at a vertex would not be.
Note that this method is consistent with learning, since the probabilities can be adjusted as new training data becomes available.
Eg: Trying to tell dogs from cats.
We have an image, and find three key features in it. We pass each through the AM and find the first two are associated with a dog seen from two certain points on the viewsphere (poses), and the third with a cat in a different pose. Three feasible hypotheses are constructed; this is the evidentiary set: {(dog, configuration1), (dog, configuration2), (cat, configuration3)}, where configuration = (location, pose). The geometry is used to estimate the distance of the animal, thus its location.
These three are passed through the second-pass AM, which tells us that the evidence is voting 2:1 for dog. That is the output of the procedure.
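The two-pass procedure in this example can be sketched as a toy program; the key names, associations and vote counts below are hypothetical illustrations, not Nelson's actual data structures.

```python
from collections import Counter

# Pass 1: a toy associative memory mapping key features to feasible
# (object, configuration) hypotheses. All entries are hypothetical.
key_memory = {
    'key1': ('dog', 'configuration1'),
    'key2': ('dog', 'configuration2'),
    'key3': ('cat', 'configuration3'),
}

observed_keys = ['key1', 'key2', 'key3']

# The evidentiary set: the feasible hypotheses recalled by the keys.
evidentiary_set = [key_memory[k] for k in observed_keys]

# Pass 2: tally one vote per supporting key for each object class.
votes = Counter(obj for obj, cfg in evidentiary_set)
winner = votes.most_common(1)[0][0]
```

In Nelson's scheme the votes would be weighted using the stored relative frequencies of association; a plain count is used here for brevity.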
The intended mode of operation for the active vision system is that the keys and probabilities of the keys given the object class and configuration are determined offline in advance. Online, the system takes a new image and computes the votes for each of the feasible hypotheses (the evidentiary set). If the vote for the most likely hypothesis is not above the decision threshold, another image is acquired and the resulting votes combined. This process is repeated until a winning feasible hypothesis is found, or until no feasible hypothesis remains. In the latter case object recognition failure is declared.
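The accumulate-until-threshold loop just described can be sketched as follows; the per-image vote streams and the threshold value are made up for illustration.

```python
# Sketch of the online mode of operation: accumulate votes for the
# feasible hypotheses over successive images until one crosses the
# decision threshold, or give up when the views run out.
def recognize(image_votes, threshold):
    """image_votes: iterable of {hypothesis: votes} dicts, one per image."""
    totals = {}
    for votes in image_votes:
        for hyp, v in votes.items():          # combine with earlier views
            totals[hyp] = totals.get(hyp, 0) + v
        best = max(totals, key=totals.get)
        if totals[best] >= threshold:
            return best                        # winning feasible hypothesis
    return None                                # recognition failure declared

# Hypothetical vote tallies from two successive images.
views = [{'dog': 2, 'cat': 1}, {'dog': 3, 'cat': 1}]
result = recognize(views, threshold=5)
```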
Implementation and tests
Nelson gives results for his method in a "toy" environment, just as he did for homing. The goal is to recognize children's blocks, seven simple polyhedral objects: triangle, square, trapezoid, pentagon, hexagon, plus sign, 5-pointed star. The background was uniform, contrast was high, and there was no clutter (except from other blocks in certain experiments).
Two kinds of keys were tried: 3-chains and curve patches.
 3-chains: make polyline approximations to the visible bounding contours. Combine 3 touching line segments and characterize the chain by its four vertices. Each such set of four vertices is a 3-chain, so each 3-chain feature is 8D: the x-y image coordinates of each of the four vertices.
 curve patches: define the set of longest, strongest edges as base curves. For each, find the locations along the base curves where other nearby edges intersect them. Each such set of locations, together with the angle measures at the intersections, is a curve patch.
Note that both these classes of keys have the desired properties of semiinvariance and locality of reference. They are semiinvariant because small changes in viewpoint or image distortion only lead to small changes in the feature values; they are local because they depend only on the properties of a small region of the image.
Eg: curve patches on one edge of a six-sided nonconvex polytope.
The feature vector for this curve patch would be 6D: one location parameter in [0,1] and one angle measure for each of the three intersection points.
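As a concrete instance of such a feature vector; the location and angle values below are hypothetical.

```python
# A hypothetical 6-D curve-patch feature: for each of three intersection
# points along a base curve, one normalized location in [0,1] and one
# intersection angle (radians). The numeric values are made up.
intersections = [(0.20, 1.1), (0.55, 0.9), (0.80, 1.6)]  # (location, angle)

# Flatten to the 6-D key feature vector.
curve_patch = [value for point in intersections for value in point]
```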
Online, the selected key feature set is computed from the image and each key feature is associated with a set of [object × configuration]-labelled images in the database. Hypotheses are formed (object × configuration × scale × 3D location) for each such association, and votes taken. For the tests done in this paper, only a single view was used; that is, evidence is accumulated over keys but not over views.
A set of training views was taken over approximately a hemisphere of view angles, with variable numbers of views for each object (max 60 for trapezoid, min 12 for hexagon).
Based on tests with a single visible object in the FOV, Nelson approximates that his method is at least 90% accurate (he claims no errors in the tests).
With one object and small portions of others, the complete object was correctly recognized each time. For objects with severe clutter and some occlusion (presence of a robot end-effector in the FOV) the results were significantly poorer (10%-25% in tests).
Nelson suggests the cause was poor low-level key extraction, not memory crosstalk due to spurious keys produced by the clutter.
As a final note, let's measure the Nelson approach against the active vision and animate vision paradigms. How would it score?

Active vision:
 Vision is not recovery.
 Vision is active, i.e. the current camera model is selected in light of current state and goals.
 Vision is embodied and goaldirected.
 Representations and algorithms minimal for task.

Animate vision: Anthropomorphic features emphasized, incl.
 Binocularity
 Fovea
 Gaze control
 Use fixation frames and indexical reference
Nelson's approach perhaps strives for too much universality across embodiments and tasks, and does not incorporate the characteristics of human vision sufficiently. It might be considered a bridge between the old paradigm and the new.
Active object recognition
When we try to recognize objects that appear to resemble several object classes, it is natural to seek to view them first from one viewpoint, and if this is not enough to disambiguate, choose another viewpoint. An object recognition scheme based on planning and executing a sequence of changes in the camera parameters, and combining the data from the several views, is called active object recognition.
If the sequence of views is planned in advance and followed regardless of the actual data received, this is offline active object recognition.
If we choose the next camera parameters in light of the previous data received, this is online active object recognition.
 Eg: A robot with camera-in-hand moves to 4 preplanned positions and captures 4 views of a workpiece before a welding robot computes its welding trajectory.
 Eg: A camera on a lunar rover sees something which may be a shadow or a crevasse. So it moves closer to get another look from a different ground-plane approach angle.
Any online active object recognition system must have five distinct, fundamental capabilities:
 Using the current camera parameters, acquire an image, determine the most likely recognition hypotheses and give them each a score. This is the recognition module.
 Combine scores from all views of the object thus far into an aggregate score for each of the most likely hypotheses. This is the data fusion module.
 Decide if the decision threshold has been reached, and if so, declare a recognition. This is the decision module.
 If no decision can be reached based on the current view sequence, select the next camera parameters. This is the view planning module.
 Servo the camera platform to the next desired position in parameter space. This is the gaze control module.
So for active OR we observe-fuse-decide-plan-act. Note that some of these modules are logically independent, such as the decision and gaze control modules, so we can "mix and match" their designs. Others, like the fusion module and the planning module, are logically connected and should be selected together.
Before discussing Borotschnig's implementation of the observe-fuse-decide-plan-act paradigm for active vision, we should note that there are other ways of categorizing the tasks involved in formulating an active task-oriented response to a set of perceptions. Among the most frequently cited is the OODA (Observe-Orient-Decide-Act) loop, which has been the basis of much of the so-called C^{3}I automation applications of the US military (command, control, communications, intelligence).
[OODA loop diagram courtesy www.airpower.maxwell.af.mil]
Borotschnig's Appearance-based active OR
A good example of a complete active OR system is the one designed by a group from the Technical University at Graz.
1. The recognition module
They base their recognition logic on Principal Component Analysis (PCA). PCA is a quite general procedure, not limited to image processing, for projecting a numerical vector down into a lower-dimensional subspace while keeping the most important properties of the vector. It is called a "subspace" method. What defines PCA is that it is the most accurate subspace method in the least-squares sense.
PCA
Given a set of n-dimensional vectors X, the goal of PCA is to specify an ordered set of n orthonormal vectors \(E=\{e_j\}\) in \(R^n\) such that each \(x_i\) in X can be expressed as a linear combination of the \(e_j\)'s
\(x_i = a_{i1}e_1+a_{i2}e_2+...+a_{im}e_m+...+a_{in}e_n\)
and so that for all \(m<n\), if \(x_i\) is approximated by just its first m terms,
\(x_i' = a_{i1}e_1+a_{i2}e_2+...+a_{im}e_m\)
the total mean-square error over the entire dataset
\(\sum_i (x_i-x_i')^2\)
is minimum over all choices of the ordered set E.
The orthonormal basis E for \(R^n\) is called the set of principal components of the set X.
Thus \(e_1\) is the basis for the best one-dimensional space to project X down into in the MS sense, \((e_1,e_2)\) spans the best 2-dimensional space, and in general \((e_1,...,e_m)\) spans the best m-dimensional space (\(m<n\)) to project X into in the MS sense.
The principal components answer the question: if you had to approximate each x as a linear combination of just m fixed vectors, which m vectors would you choose? In other words, what is the best way to approximate a given set of vectors from \(R^n\) by projection into a lower-dimensional space \(R^m\)?
 Eg: Three vectors \(X=\{x_1,x_2,x_3\}\) in 2 dimensions. \(e_1\) is the first principal component, \(e_2\) the second.
There is no other vector that is better than \(e_1\) in the sense of minimizing the total error
\(\text{MSE}=\frac{1}{3}(\text{error}_1^T\,\text{error}_1+\text{error}_2^T\,\text{error}_2+\text{error}_3^T\,\text{error}_3)\)
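This least-squares property can be checked numerically on a small example like the one above; the three 2-D vectors below are hypothetical (and happen to be zero-mean), and numpy's eigendecomposition of the scatter matrix yields the principal components.

```python
import numpy as np

# Three hypothetical 2-D vectors (rows x_1, x_2, x_3); their mean is zero.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-3.0, -3.0]])
Xc = X - X.mean(axis=0)          # subtract the mean vector
Q = Xc.T @ Xc                    # 2x2 scatter (covariance) matrix
vals, vecs = np.linalg.eigh(Q)   # eigh returns ascending eigenvalues
order = np.argsort(vals)[::-1]   # reorder: largest eigenvalue first
vals, vecs = vals[order], vecs[:, order]
e1 = vecs[:, 0]                  # first principal component

# Project each vector onto e1 and measure the mean-square error;
# the total squared residual equals the discarded eigenvalue mass.
proj = np.outer(Xc @ e1, e1)     # 1-D reconstructions a_i e_1
mse = np.mean(np.sum((Xc - proj) ** 2, axis=1))
```

No other unit vector gives a smaller MSE: the residual mass equals the smaller eigenvalue of the scatter matrix, which is what makes \(e_1\) optimal.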
Applying PCA to the active OR context
First consider the offline (training) steps. For a selected set of objects \(\{o_i\}\) to be recognized, and of viewpoints \(\{p_j\}\), we acquire images. Each of these images is then normalized to the same size and average brightness.
Now let's rewrite each image as a vector, say using raster ordering, to create image vectors \(x_{ij}\) of dimension N. So for instance, for a 512x512 image, N=512*512=262144. There will be \(N_x\) = (number of objects) × (number of viewpoints) such image vectors \(x_{ij}\).
Let \(\bar{x}\) be the average image vector
\(\bar{x}=(1/N_x)(\sum_{ij} x_{ij})\)
Define the matrix X by concatenating the vectors \((x_{ij}-\bar{x})\) as column vectors. Then the covariance matrix of the image set \(\{x_{ij}\}\) is defined as
\(Q = XX^T\)
Note that dim(Q)=N×N, so for instance in the 512x512 image case with N=262144, Q is a 262144x262144 matrix with \(2^{18} \times 2^{18}=2^{36}\) or about 68 gigaelements.
The principal components of the image set \(\{x_{ij}\}\) are then the eigenvectors of Q, ordered in descending order of the magnitudes of their eigenvalues.
Since Q is symmetric and positive semidefinite (\(x^T Q x \ge 0\) for all x), its eigenvectors can be chosen orthonormal, and its eigenvalues will be nonnegative.
The fractional mean square error in projecting \(\{x_{ij}\}\) into the subspace spanned by the first m principal components is
\(\text{FMSE} = \sum_{i>m} \lambda_i^2 \Big/ \sum_{j \le N} \lambda_j^2\)
where \(\lambda_1, \lambda_2, ..., \lambda_N\) are the eigenvalues in descending order (largest first).
Eg: Consider an N=4 pixel image set with eigenvalues
\(\lambda_1=7, \lambda_2=3, \lambda_3=2, \lambda_4=1\).
Then if we project the images into the 1D subspace spanned by just the first principal component (first eigenvector), the fractional MS error will be
\(\text{FMSE} = (3^2+2^2+1^2)/(7^2+3^2+2^2+1^2) = 14/63 = 0.222\)
So the fractional average error will be \(\sqrt{\text{FMSE}}\) or 0.471. Thus projecting from 4D to 1D we will lose about half the original image information.
Note: the projection of a vector x into the 1D subspace spanned by the eigenvector \(e_1\) is the best approximation of x as \(a e_1\), where the scalar coefficient a minimizes the error \(x - a e_1\).
If we project into the 2D subspace spanned by the first two eigenvectors,
\(\text{FMSE} = (2^2+1^2)/(7^2+3^2+2^2+1^2) = 5/63 = 0.0794\)
and the fractional average error will be \(\sqrt{0.0794}\) or 0.28. So we will have lost about a quarter of the image information in projecting into 2D.
The projection of a vector x into the 2D subspace spanned by the eigenvectors \(e_1\) and \(e_2\) is the best approximation of x as \(a_1 e_1 + a_2 e_2\), where the coefficients \(a_1\) and \(a_2\) minimize the error \(x - (a_1 e_1 + a_2 e_2)\).
Based on the eigenvalues (\(\lambda\)'s) we compute offline, we can determine the FMSE using only the first k principal components, for k = 1, 2, ..., until the FMSE is sufficiently low for our error tolerance.
We select this k and this becomes the dimension of the principal component subspace we will use for approximating image vectors, the subspace spanned by the first k eigenvectors \(\{e_1,...,e_k\}\), also called the principal components of the training data set.
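The k-selection step can be sketched with the example eigenvalues 7, 3, 2, 1 from above, using the squared-eigenvalue FMSE convention of these notes; the error tolerance is an assumed value.

```python
# Offline eigenvalues, largest first (the worked example above).
lams = [7.0, 3.0, 2.0, 1.0]
total = sum(l * l for l in lams)

def fmse(k):
    """Fractional MSE when keeping only the first k principal components."""
    return sum(l * l for l in lams[k:]) / total

tol = 0.10   # assumed error tolerance
k = next(k for k in range(1, len(lams) + 1) if fmse(k) <= tol)
```

With this tolerance the search stops at the first k whose FMSE falls below 0.10, which for this spectrum is the 2D subspace.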
So given any image vector x, dim(x)=N, we represent x by the coefficient vector g, dim(g)=k, with the approximation
\(x \approx \sum_{i=1}^{k} g_i e_i\)
where the \(g_i\) are the scalar weights of the various eigenvectors in approximating x in the reduced space of dimension k<<N, which are given by the inner products of the pattern x with the eigenvectors:
\(g_i = x^Te_i\)
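The projection step is just k inner products; a minimal numpy sketch, with a made-up 3-D "image" vector and a trivial orthonormal basis standing in for the first k eigenvectors.

```python
import numpy as np

# Columns are the first k = 2 orthonormal eigenvectors (toy values).
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
x = np.array([3.0, 4.0, 5.0])    # an N = 3 "image" vector
g = E.T @ x                      # g_i = x^T e_i, the k inner products
x_hat = E @ g                    # reconstruction: sum_i g_i e_i
```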
Now back to Borotschnig's active object recognition scheme. Besides the projection of a given vector into a selected k-dimensional principal component subspace, the other thing we must approximate offline is the probability density function of g for each object \(o_i\) and pose \(p_j\), \(P(g|o_i,p_j)\).
This is done by taking a set of sample images with fixed \((o_i,p_j)\) but varying lighting conditions, and fitting the resulting set of g-values by a Gaussian distribution.
\(P(g|o_i,p_j)\) can be used through Bayes' Law to determine posterior probabilities of object-pose hypotheses
\(P(o_i,p_j|g) = P(g|o_i,p_j)P(o_i,p_j)/P(g)\)
and the posterior object probabilities
\(P(o_i|g) = \sum_j P(o_i,p_j|g)\)
Thus, the most likely hypotheses \((o_i,p_j)\) and their scores \(P(o_i,p_j|g)\) can be determined by the recognition module online by projecting x down to g and using the offline probability database to get the hypothesis probabilities.
Marginalizing (summing the joint probabilities) over the poses yields the posterior probabilities of each object class given the image data g.
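A minimal numeric sketch of these two formulas, with hypothetical likelihood values and a uniform prior over two objects and two poses:

```python
# Hypothetical likelihoods P(g|o,p) for two objects x two poses.
lik = {('o1', 'p1'): 0.6, ('o1', 'p2'): 0.2,
       ('o2', 'p1'): 0.1, ('o2', 'p2'): 0.1}
prior = 1.0 / len(lik)                      # uniform prior P(o_i, p_j)

pg = sum(l * prior for l in lik.values())   # evidence P(g)
post = {h: l * prior / pg for h, l in lik.items()}   # P(o_i,p_j|g)

p_obj = {}                                  # marginalize over poses
for (o, p), v in post.items():
    p_obj[o] = p_obj.get(o, 0.0) + v        # P(o_i|g)
```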
2. Data Fusion Module:
Using this Bayesian approach, data fusion becomes straightforward. Given a set of n image vectors \(\{x_1,...,x_n\}\) which project to their PC coefficient vectors \(\{g_1,...,g_n\}\),
\(P(o_i|g_1..g_n)=P(g_1..g_n|o_i)P(o_i)/P(g_1..g_n)\)
Let's determine this overall probability recursively, assuming class-conditional independence:
\(= P(g_1..g_{n-1}|o_i)P(g_n|o_i)P(o_i)/P(g_1..g_n)\)
\(= P(o_i|g_1..g_{n-1})P(o_i|g_n)/(P(o_i)D)\)
where
\(D = P(g_1..g_n)/(P(g_1..g_{n-1})P(g_n))\)
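The fusion rule can be checked with a small sketch that multiplies per-view likelihoods into the prior and renormalizes, which is equivalent to the recursion above under class-conditional independence; the likelihood values are hypothetical.

```python
# Recursive Bayesian fusion over views: after each view, multiply the
# running posterior by the new likelihood P(g_t|o_i) and renormalize.
priors = {'o1': 0.5, 'o2': 0.5}
views = [{'o1': 0.7, 'o2': 0.3},    # P(g_1|o_i), hypothetical
         {'o1': 0.6, 'o2': 0.4}]    # P(g_2|o_i), hypothetical

post = dict(priors)
for lik in views:
    post = {o: post[o] * lik[o] for o in post}   # fold in new evidence
    z = sum(post.values())
    post = {o: v / z for o, v in post.items()}   # renormalize
```

The decision module would then simply compare the largest fused posterior against the threshold T.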
3. Decision module:
Decide \(o_i\) at time n if
\(i=\arg\max_j P(o_j|g_1..g_n)\)
and
\(P(o_i|g_1..g_n) > T\)
where T is a preselected threshold. So for instance if you require P(error)<5% you would set T=0.95+V, where V>0 is chosen to correct for sampling error in estimating the various probabilities.
4. Where do we look next?
Define the uncertainty in the object recognition task at time n as the entropy
\(H(O|g_1..g_n) = -\sum_i P(o_i|g_1..g_n) \log P(o_i|g_1..g_n)\)
of the probability distribution of the random variable O, where \(O=o_i\) if and only if the object class is i.
Now for each candidate change in the camera parameters we can determine the expected reduction in the entropy. This expected reduction depends on \(\{P(o_i,p_j|g_1..g_n)\}\), the current hypothesis probabilities. For instance, if there are two very probable hypotheses and there is a way to change camera parameters to eliminate one or the other, that next look will have a low expected entropy.
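A quick sketch of the entropy measure, showing that a sharply peaked posterior (the goal of a good next view) has lower entropy than an ambiguous one; the posterior values are hypothetical.

```python
import math

def entropy(posterior):
    """Shannon entropy H(O) of a posterior over object classes."""
    return -sum(p * math.log(p) for p in posterior.values() if p > 0)

# Two equally likely hypotheses vs. one dominant hypothesis.
ambiguous = {'o1': 0.5, 'o2': 0.5}   # maximal uncertainty
resolved  = {'o1': 0.9, 'o2': 0.1}   # a disambiguating view's outcome
```

View planning scores each candidate view by the entropy its predicted posteriors would have, weighted by how likely those posteriors are, and picks the view with the lowest expected value.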
Their approach to view planning is to select the next view (camera parameters) which minimizes the expected entropy. See (9)-(11) in the paper for details.
5. Gaze control module
A control strategy to implement the physical motion, i.e. the view planning outcome. Not of interest here.
Strengths and Limitations of the Borotschnig approach
Strengths
A major strength is that it is a complete end-to-end active vision OR system based on well-understood mathematical tools (PCA, Bayesian scoring, entropy).
Another strength is that it is domain-independent, and the database can be incrementally updated. In addition, Borotschnig's procedure does not require segmentation preprocessing, so the matching of test data is fast.
Limitations
A major problem with applying PCA to OR is that it is highly sensitive to object location in the image plane. Shifting the image radically changes the PCs.
Principal component analysis guarantees that the images in the training set, and images very similar to them, can be recognized using just a few principal components. But if an object is shifted in an image, its MS error compared to the corresponding image in the training set will be large, the principal components will not be effective at capturing the test image, and OR based on them will likely fail. Position counts!
In general, PCA is not invariant under affine transformations, of which pure translation is the most difficult to avoid. Rotation and scaling must also either be prevented by tightly controlling the imaging environment, or be compensated somehow.
PCA is not invariant under lighting changes either. So large scale lighting changes will confound a PCA OR scheme.
Finally, it should be noted that PCA is a global rather than local matching procedure. This leads to problems with occlusion and clutter as Nelson discusses in the context of local vs. global keys.
These difficulties can each be addressed, but together probably limit the application of the method to highly constrained domains, like automated part inspection or "toy" problem domains.
A good approach to (relatively) unconstrained, naturalvision active OR has yet to be discovered. This is a great challenge for the next generation of researchers in the computer vision field!