7. Passive 3-D Object Recognition

We now set aside our foray into low-level active vision to complete our discussion of the standard (passive) paradigm approaches to 3-D and motion vision.

These are the higher-level passive operations, which are more AI-dependent, as opposed to the lower-level passive operations such as stereopsis, which are more signal-processing dependent and which we have discussed thus far.

Required reading: Sonka, Ch. 10.

Most of our presentation of 3-D OR will be descriptive rather than analytic. That is, we will not go into detail concerning the representations and algorithms. The only exceptions will be the Goad algorithm and the aspect graph method, both of which will be discussed towards the end of this section.


In our consideration of Shape Recovery in Sec. 3, we saw that the paradigm of "vision-for-recovery" is very ambitious and in most general cases the results have been limited and perhaps disappointing.

Here we consider a somewhat less ambitious, though still very difficult, task: 3-D Object Recognition (OR). It is easier than shape recovery because you do not have to recover an object in order to recognize it.


Suppose there are m hypotheses (object classes) in the Frame Of Discernment (FOD), the set of all object classes you can declare. OR is the association of an object from the scene with one of the hypotheses 1...m from the frame of discernment, ie. the declaration of one class label as the true hypothesis.

Typically ("open system assumption"), one of the hypotheses is called the null hypothesis, the hypothesis that states that the object does not belong to any of the other m-1 classes. In a closed system the null hypothesis is not included and it is assumed all objects in the scene can be correctly associated with one (or more) of the hypotheses.


Distinct hypotheses may require physically distinct object classes, such as class 1: vehicles, class 2: people, class 3: clouds, etc. Or the hypotheses may be combinations of object physical characteristics and other properties.


How are objects declared to belong to class i rather than class j? OR always involves comparisons. These comparisons may be of two general types:

  1. Comparing descriptions of the object features with features of archetype objects of each class;
  2. Comparing object class construction rules with the manner in which the object to be recognized may be constructed.

The first is called recognition by analysis in which features of the object are matched against features of each of the stored archetypes (except the null hypothesis), and the best match, if it is above a preset threshold, determines the classification decision. If no class matches sufficiently, and the null hypothesis is available (open-world assumption), the null hypothesis is chosen.
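As a rough illustration of recognition by analysis, here is a minimal sketch (in Python); the class names, feature values, similarity measure and threshold are all hypothetical, not taken from any particular system:

    import numpy as np

    def classify_by_analysis(object_features, archetypes, threshold=0.5):
        """Recognition by analysis: match the object's feature vector against the
        stored class archetypes; return the best-matching class if its similarity
        exceeds the threshold, otherwise the null hypothesis (open-world case)."""
        best_class, best_score = None, -np.inf
        for class_label, archetype_features in archetypes.items():
            # Similarity here is a negative exponential of Euclidean distance;
            # any other similarity measure could be substituted.
            score = np.exp(-np.linalg.norm(object_features - archetype_features))
            if score > best_score:
                best_class, best_score = class_label, score
        return best_class if best_score >= threshold else "null hypothesis"

    # Toy example with two hypothetical classes and 3-element feature vectors.
    archetypes = {"vehicle": np.array([1.0, 0.2, 5.0]),
                  "person":  np.array([0.3, 1.8, 1.7])}
    print(classify_by_analysis(np.array([0.9, 0.25, 4.8]), archetypes))  # "vehicle"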


The second is called recognition by synthesis in which the constructive procedures for each class and the object are compared. Again, the best match determines the classification decision.

In either case, to make the comparisons we need to have data concerning the object and data concerning objects representative of each class in the frame of discernment to compare.


Describing 3-D shapes

Descriptions can be complete or partial. 3-D recovery is a complete description, ie. a different description for each distinct object. Complete descriptions are also called representations.


For 3-D OR we seldom work with complete descriptions, for several reasons:

  1. Limited view angles and image sets;
  2. Hard to compute;
  3. Contain more detail than needed for OR matching.

Instead, we work with partial representations, ie. models. Typically, we select a class of models and within this class find the parameter values (features) that best describe the object and that best describe the archetypes. These feature vectors may then be matched.

Models are never complete, ie. many distinct objects are represented by the same model. Models represent classes of objects. But often models are unique (each object has a unique best model fit). Unique models are well suited for OR use.


Volumetric (space-filling) models

Consider a discrete binary image. Each pixel has 2 possible values: true (white, background) and false (black, object). Similarly, we may define a discrete binary 3-D scene model which consists of a grid of voxels (volume elements). If voxel (i,j,k)=0, that means that the corresponding small spatial cube is filled by object rather than background.

For finite voxel size, since many distinct objects have the same "staircased" appearance, this creates a partial representation, or model, of the object. In the limit, as the voxel size approaches 0, voxel models become complete representations. Voxel models come with all the analytical tools of signal processing. They tend to be high-dimensional and require lots of storage, particularly non-binary voxel models.
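A minimal sketch of a binary voxel model (here voxelizing a sphere on a small grid; the grid size and radius are arbitrary, and the 0-for-object convention follows the text above):

    import numpy as np

    n = 32                                     # voxels per side of the grid
    grid = np.ones((n, n, n), dtype=np.uint8)  # 1 = background, 0 = object

    # Set to 0 every voxel whose centre lies inside a sphere of radius 10
    # placed at the centre of the grid.
    i, j, k = np.indices((n, n, n))
    centre = (n - 1) / 2.0
    inside = (i - centre)**2 + (j - centre)**2 + (k - centre)**2 <= 10**2
    grid[inside] = 0

    print("object voxels:", np.count_nonzero(grid == 0))  # staircased sphere volume
    print("ideal volume :", 4.0 / 3.0 * np.pi * 10**3)    # gap shrinks as voxel size -> 0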


Volume primitives approach

Another way to fill space is with unions and intersections of volume primitives, simple 3-D solids. By using smooth-surfaced primitives such as spheres and ellipsoids, or non-right-angle primitives such as parallelepipeds and pyramids, the 3-D staircasing in voxel models can be avoided. Different classes of volume primitives may be employed.

Note that voxels are cubic volume primitives which are combined through union to form space-filling models, so voxel modelling is a special case of the volume primitive approach.


Polyhedral models

The single volume primitive used to create 3-D polyhedral models is the half-space, ie. the set of points x in \(R^3\) satisfying

\(\alpha^T x + \beta \geq 0\)

for some \(\alpha\) in \(R^3\) and scalar \(\beta\). Only intersections of them are required for convex objects. General non-convex objects can be decomposed into a union of convex parts, and so can be modeled by unions of such intersected half-spaces.

One nice feature of polyhedral models is that they can be characterized by 4 scalars (a,b,c,d) for each facet (bounding plane), since \(ax + by + cz + d \geq 0\) defines a half-space. Another is that linear programming can be used to determine vertices and characteristics.
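As a sketch, a convex polyhedral model can be stored as one (a,b,c,d) row per facet, with a point-inclusion test that simply checks every half-space inequality; the unit-cube facet values below are only an illustration:

    import numpy as np

    # Each row (a, b, c, d) defines the half-space a*x + b*y + c*z + d >= 0.
    # These six rows model the unit cube as the intersection of its half-spaces.
    facets = np.array([[ 1,  0,  0, 0],   # x >= 0
                       [-1,  0,  0, 1],   # x <= 1
                       [ 0,  1,  0, 0],   # y >= 0
                       [ 0, -1,  0, 1],   # y <= 1
                       [ 0,  0,  1, 0],   # z >= 0
                       [ 0,  0, -1, 1]],  # z <= 1
                      dtype=float)

    def inside(point, facets):
        """True if the point satisfies every facet's half-space inequality."""
        p = np.append(point, 1.0)             # homogeneous coordinates (x, y, z, 1)
        return bool(np.all(facets @ p >= 0))

    print(inside([0.5, 0.5, 0.5], facets))    # True: interior point
    print(inside([1.5, 0.5, 0.5], facets))    # False: violates x <= 1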

The obvious weak point of polyhedral modelling is the proliferation of artificial facet edges and, at the intersection of facet edges, artificial vertices.


Eg: Polyhedral model of teapot, simple shading.

Source: http://www.frontiernet.net/~imaging/fo_arch.html



Some kind of smoothing is obviously necessary to make the model reasonably realistic and accurate.


Constructive geometry models

Also known as constructive solid geometry. The volume primitives are (typically) sphere, cylinder, cone, cuboid (or prism) and half-space. Sometimes ellipsoids, paraboloids and other simple forms are added to the set of primitives.

These models are surprisingly expressive, but it is often difficult to determine the smallest set of primitives that achieves a given permissible deviation. The expressiveness of a modelling scheme is the range of different types of objects which it can accurately represent.


Super-quadrics

Quadric solids are defined by

\(ax^2 + by^2 + cz = d\) (Type I quadric)

or

\(ax^2 + by^2 + cz^2 = d\) (Type II quadric)

For instance, if a and b are of the same sign then the Type I quadric is called a paraboloid; otherwise it is a hyperbolic paraboloid. If a, b, c and d are all of the same sign then the Type II quadric is called an ellipsoid.


Superquadrics are generalizations of quadrics with additional parameters to make them less smooth and symmetric than quadrics. See Sonka Fig. 10-14 p 526 for some super-ellipsoids. But note the errors in the exponents in eqn 10-23 p 525. The standard superellipsoid implicit equation reads

\[ \left[\left(\frac{x}{a_1}\right)^{\frac{2}{\varepsilon_2}} + \left(\frac{y}{a_2}\right)^{\frac{2}{\varepsilon_2}}\right]^{\frac{\varepsilon_2}{\varepsilon_1}} + \left(\frac{z}{a_3}\right)^{\frac{2}{\varepsilon_1}} = 1 \]


A description of superquadrics and development of some of their mathematical properties can be found here.

Superquadrics have many interesting mathematical properties, and may be usefully generalized in various ways. They are much more expressive than quadrics, at the cost of two additional parameters (6 vs. 4) in the models.
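A small sketch of the superellipsoid inside-outside function corresponding to the implicit equation above (the parameter values are illustrative): F < 1 inside the surface, F = 1 on it, F > 1 outside.

    import numpy as np

    def superellipsoid_F(x, y, z, a1, a2, a3, eps1, eps2):
        """Inside-outside function of a superellipsoid."""
        term_xy = (np.abs(x / a1) ** (2.0 / eps2) +
                   np.abs(y / a2) ** (2.0 / eps2)) ** (eps2 / eps1)
        return term_xy + np.abs(z / a3) ** (2.0 / eps1)

    # eps1 = eps2 = 1 gives an ordinary ellipsoid; values near 0.1 give a
    # box-like solid; values near 2 give an octahedron-like solid.
    print(superellipsoid_F(0.0, 0.0, 0.0, 1, 1, 1, 0.3, 0.3))   # 0.0 -> inside
    print(superellipsoid_F(1.0, 0.0, 0.0, 1, 1, 1, 0.3, 0.3))   # 1.0 -> on the surface
    print(superellipsoid_F(1.0, 1.0, 1.0, 1, 1, 1, 2.0, 2.0))   # 3.0 -> outside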


Sweep representations

Given a 3-D object and a curvilinear path or "spine" entirely contained within it, the object may be described by defining, at each point along the spine, the 2-D shape of the cross-section of the object at that point. This is called a sweep representation, or generalized cylindrical representation.


Sweep-represented objects need not have axes of symmetry, or any symmetry whatsoever. But they work best for "spaghetti-like" objects, that is, objects for which the change in the cross-section along a properly selected spine is relatively slow and smooth.

In the case of hand-like objects with several protruding parts, the spine must divide into multiple skeleton-like extensions. This complicates sweep representations significantly.


Surface models

The alternative to space-filling models is to model the boundaries of the object: surface models. Just as there are many ways to model boundaries in 2-D, ie. 1-D curves in a plane, there are many ways to model curvilinear 2-D manifolds in 3-space.

There are two important problems to be solved in computing a surface model: what class of manifolds to use, and how and where they will intersect as 1-D curves in 3-D.


The simplest volumetric model was the polyhedron, a solid 3-D object with planar boundaries. If we think of the elements of a polyhedron model not as half-spaces but as the facets that bound them, this model class can also be considered a surface model. However, if we are describing a polyhedron by quantifying its facets we need to specify its edges and vertices.

The edges can be characterized by equating pairs of half-space boundaries, and the vertices found by solving triples of these equations.



Eg: Consider the unit cube defined by the six half-space inequalities \(x \geq 0\), \(y \geq 0\), \(z \geq 0\), \(1-x \geq 0\), \(1-y \geq 0\), \(1-z \geq 0\), numbered in that order. The half-space boundaries are these inequalities considered as equalities (e.g. x=0). The first two facets intersect where x=0 and y=0, which is the 1-D straight line

\((x,y,z) = (0,0,z)\) for all z

The vertices for this edge between facets x=0 and y=0 are the points on all of the first three half-space boundaries x=0, y=0, z=0,

\((x,y,z) = (0,0,0)\)

and the first two and the sixth,

\((x,y,z) = (0,0,1)\)


In this example, the 6 half-space inequalities define the volumetric model, while the 6 corresponding equalities, together with the 12 edge equations and 8 vertices, define the corresponding polyhedral boundary model. Clearly, the cube requires more constraints to be described as a surface model than a volumetric model. But the additional boundary parameters in surface models permit modelling of non-convex bodies, which cannot be done with intersections of half-spaces.
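A sketch of the edge/vertex computation for the unit-cube example: each candidate vertex is the solution of three facet-plane equations, kept only if it satisfies all six half-space inequalities.

    import numpy as np
    from itertools import combinations

    # Unit-cube facets as half-spaces a*x + b*y + c*z + d >= 0 (as above).
    facets = np.array([[ 1, 0, 0, 0], [-1, 0, 0, 1],
                       [ 0, 1, 0, 0], [ 0,-1, 0, 1],
                       [ 0, 0, 1, 0], [ 0, 0,-1, 1]], dtype=float)
    A, d = facets[:, :3], facets[:, 3]

    vertices = []
    for i, j, k in combinations(range(len(facets)), 3):
        M = A[[i, j, k]]
        if abs(np.linalg.det(M)) < 1e-9:        # parallel/dependent planes: no single point
            continue
        p = np.linalg.solve(M, -d[[i, j, k]])   # intersection of the three boundary planes
        if np.all(A @ p + d >= -1e-9):          # keep it only if it lies on the polyhedron
            vertices.append(tuple(np.round(p, 9)))

    print(sorted(set(vertices)))                # the 8 cube corners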


Delaunay Triangulation

Given a set of points on the surface of a 3-D object, or surface point cloud, these points can be used as the vertices of contiguous triangles whose union models the surface of the object. This process is called surface triangulation and the most common strategy is Delaunay Triangulation:

  1. Select 3 points from the point cloud and find the unique circle they lie on. Extend that circle to the unique sphere that has that circle as a great circle.
  2. If there are no other points in the surface point cloud which are in the interior of that sphere, make these three points vertices of a triangular facet of the model.
  3. Repeat steps 1 and 2 for all sets of 3 points.

A problem with the algorithm given above is that, like polyhedral volumetric modelling, it requires additional constraints to handle non-convex objects. The unconstrained algorithm above returns the convex hull of the point cloud.

Delaunay Triangulation can also be extended to non-convex objects by first decomposing the point cloud into a union of point clouds for each of which the convex hull models the object, then performing the algorithm on each separately and merging the resulting convex polyhedra.
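A minimal sketch of the convex-hull behaviour noted above, assuming scipy is available: the convex hull of a point cloud comes back as a triangulated surface, which is exactly what the unconstrained algorithm produces.

    import numpy as np
    from scipy.spatial import ConvexHull

    rng = np.random.default_rng(0)
    points = rng.standard_normal((200, 3))    # a synthetic surface point cloud

    hull = ConvexHull(points)
    # hull.simplices holds one row of three vertex indices per triangular facet.
    print("triangular facets:", len(hull.simplices))
    print("first facet vertex indices:", hull.simplices[0])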


There are numerous other triangulation schemes (e.g. constrained triangulation, Morse triangulation) but Delaunay Triangulation is still the most commonly used in practice.


Object-based OR using volumetric or surface models

Having considered several types of 3-D models, we now turn to the question of how we are to use a 3-D model for OR. As suggested earlier, general pattern recognition procedures most often have these steps:

  1. Segment the object to be recognized from the scene;
  2. Compute a set of attributes describing the segmented object;
  3. Hypothesize an object class (together with any necessary view parameters);
  4. Compare the object attributes with those of the hypothesized archetype, and accept or reject the hypothesis.



By a hypothesis, we mean a declaration of the object class, together with any necessary view parameters, ie. a declaration of what we are looking at and how we are looking at it.


In methods employing the recognition-by-analysis approach, the "set of attributes" are quantitative or categorical mappings from the model called features. Note: a categorical mapping is a mapping into a non-numerical set which need not be endowed with a similarity or distance metric.


Object-centered coordinate systems assume that the object being viewed is located at the origin of the world coord system \((x_w,y_w,z_w)=(0,0,0)\) and that it has a specific known pose (3-D rotation relative to the world coord axes).


The canonical position of the object in world coords is often taken to be such that a selected vertex is at the origin as shown, or such that the centroid of the object is at the origin. The canonical pose is such that the major axes of the object are oriented with the world coord axes, or principal edges are aligned as much as possible.


The Goad algorithm

If we use 3-D polyhedral models, the Goad algorithm can be used to implement the OR procedure.

Assume the class library consists of polyhedral archetypes, one for each class. Take the feature set for each member of the class library to be the complete set of object edges, each represented by an ordered pair of points in world coords.

Then e1=(v1,v3), e2=(v1,v6), ... e12=(v2,v4)


The camera is assumed to be a unit distance away from the object to be classified, and records an image. That image is processed to give a polyhedral model and thence a set of edges (represented by pairs of vertices). The location p on the unit sphere is not known, nor is the optical axis, nor the camera image plane rotation q.

The optical axis degree of freedom can be removed by translating the object to the center of the image, orienting the optical axis to the world coord origin.

Thus, a hypothesis will consist of an object class together with (p,q), the view location and image plane rotation extrinsic camera parameters.

Note: do not confuse the 3-D view location vector p and rotation q here with the (p,q) of the gradient space defined for shape recovery.


The standard form of Goad's algorithm assumes the object is opaque. Thus, given view sphere location p, we must consider that only some edges will be visible. The set of p's from which a given edge e is visible is called the visibility locus of e, and denoted L(e).

The basic idea of Goad's algorithm is to seek an object label, p and q for which edges in the object image and edges in an image computed from the class library match up as much as possible for p in L(e).


The algorithm proceeds by hypothesize-and-verify over increasing edge sets with backtracking. At a given iteration, we have our current hypothesis \((\omega_i, p, q)\) where \(\omega\) is the hypothesized object class, and the set of edges that we have matched with this hypothesis up to this iteration. We select an unmatched edge in the object image, and based on the intersections of L(e) for all matched edges, determine a region where the matching edge must lie in the \((\omega_i , p, q)\) pre-stored image.

If there is one there, we have verified the hypothesis so far, and we continue by seeking to match another edge that has not yet been matched. If there fails to be one, ie. no edge in the \((\omega_i , p, q)\) pre-stored image which is in the intersections of L(e) for matched edges, the hypothesis \((\omega _i , p, q)\) is rejected and we backtrack through (p,q) space.


The algorithm terminates with success when there are no unmatched edges in the object image, and the current hypothesis \((\omega_i , p, q)\) is declared. It terminates in failure when we have backtracked through all \((\omega_i , p, q)\) for the current \(\omega_i\).

In this case the object is not of class \(\omega_i\), and we move on to the next class. Note that since there are a finite number of hypotheses, and each polyhedral object image has a finite number of edges, the algorithm has guaranteed convergence.
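A much-simplified sketch of the hypothesize-and-verify loop: it omits the incremental visibility-locus pruning and the ordering of the (p,q) backtracking, and the edge representation, tolerance and library contents are all illustrative.

    import numpy as np

    def edges_match(e1, e2, tol=0.1):
        """Two image edges (each a pair of 2-D endpoints) match if their
        endpoints agree to within a tolerance, in either endpoint order."""
        a, b = np.asarray(e1, float), np.asarray(e2, float)
        return np.allclose(a, b, atol=tol) or np.allclose(a, b[::-1], atol=tol)

    def recognize(image_edges, library):
        """library maps (class, p, q) hypotheses to the edge set predicted from the
        stored polyhedral archetype at view location p and image rotation q.
        Returns the first hypothesis whose predicted edges account for every
        image edge, otherwise the null hypothesis."""
        for hypothesis, predicted_edges in library.items():
            unused = list(predicted_edges)
            ok = True
            for e in image_edges:
                match = next((u for u in unused if edges_match(e, u)), None)
                if match is None:       # no consistent edge: reject and move on (backtrack)
                    ok = False
                    break
                unused.remove(match)    # each predicted edge may be matched only once
            if ok:
                return hypothesis
        return "null hypothesis"

    # Toy library with one class seen from one (p, q); edges given as endpoint pairs.
    library = {("cube", "p0", "q0"): [((0, 0), (1, 0)), ((1, 0), (1, 1))]}
    print(recognize([((0.02, 0.0), (1.0, 0.03)), ((1.0, 0.0), (1.0, 1.0))], library))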


2-D view-based methods

There are 3 distinct strategies we can employ for matching in a 3-D OR problem:

  1. 3-D model-based matching: the library stores 3-D archetypes. The image object is recovered, ie. a 3-D model is produced, and the size, position and pose of the image object relative to the archetype are hypothesized. Matching is done in \(R^3\).
  2. Hybrid matching: the library stores 3-D archetypes. Based on the 2-D view (image), but without recovery, the pose, size and location are hypothesized and used to create a synthetic view. Matching is done in \(R^2\). Goad's algorithm is an example of hybrid matching.
  3. 2-D view-based matching: the library stores a set of 2-D views of the object. Based on the 2-D image object, the best stored 2-D view is selected and matching is done in \(R^2\).

Full 3-D model-based matching is seldom attempted for practical applications except in very narrow domains (eg. blocks-world type). The problem is the difficulty of recovery.

Hybrid matching is sometimes used for non-time-critical applications. There is a way to implement 2-D matching which does not require creating synthetic views online, in which a set of such views is precompiled.


To perform view interpolation, that is, the rendering of a synthetic view from a view location which has not been stored, an important fact is that a given object point in one view may be mapped into its location in another view using a simple transformation (affine in the case of orthographic projection, trilinear in the case of perspective projection). This can permit "tuning" the view location before matching, as long as the set of visible object points has not changed between views.

Eg:

Note that only edges and vertices visible in a given image can be mapped using such methods. Features visible from an unstored pose but not visible from a stored pose will be missed in the "tuned" image.
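A sketch of the orthographic case described above: a 2-D affine map is fitted by least squares to point correspondences between two stored views and then used to predict where a point falls in the "tuned" view; the point coordinates are illustrative.

    import numpy as np

    def fit_affine(src, dst):
        """Least-squares 2-D affine transform taking src points to dst points."""
        src, dst = np.asarray(src, float), np.asarray(dst, float)
        A = np.hstack([src, np.ones((len(src), 1))])     # homogeneous source coords
        M, *_ = np.linalg.lstsq(A, dst, rcond=None)      # solves A @ M ~= dst
        return M.T                                       # 2x3 affine matrix

    def apply_affine(M, pts):
        pts = np.asarray(pts, float)
        return np.hstack([pts, np.ones((len(pts), 1))]) @ M.T

    # Image locations of the same four object points in two stored views.
    view1 = [(0, 0), (1, 0), (1, 1), (0, 1)]
    view2 = [(0.1, 0.05), (0.95, 0.1), (1.0, 0.9), (0.15, 0.85)]

    M = fit_affine(view1, view2)
    print(apply_affine(M, [(0.5, 0.5)]))   # predicted location of an intermediate point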


Most of the work on view-centric approaches assumes that the object models are polyhedral, both library views and object views. Polyhedra have many nice properties for relating views around the view sphere.

  1. They can be easily characterized in a given view by sets of vertices and edges;
  2. Polyhedral edges are straight in any view;
  3. For most small changes of viewpoint, there is no change in the set of visible object points.

If the 3-D polyhedron is convex, then there is a fourth important simplifying property:

  4. Every vertex and edge of the view corresponds to a vertex or edge of the object; there are no "false" vertices.

Eg:

Note that a view of a non-convex polyhedron may contain "false vertices," that is, vertices of the view that do not correspond to object vertices.


The most direct approach to 2-D view-based OR would be to store views equally spaced around the view sphere, for instance at the centers of the 20 triangular facets of the icosahedron. Or, if a larger set is desired, at the centers of the 80 triangles obtained by dividing each of these 20 triangles into 4 triangles whose vertices are the original triangle vertices plus the midpoints of the three sides.
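A sketch of the subdivision step just described: each spherical triangle is split into four using its edge midpoints, with the new vertices pushed back onto the unit view sphere; the seed triangle here is illustrative rather than an actual icosahedron facet.

    import numpy as np

    def subdivide(triangle):
        """Split one spherical triangle (3 unit vectors) into 4 via edge midpoints,
        re-projected onto the unit sphere."""
        a, b, c = (np.asarray(v, float) for v in triangle)
        mid = lambda u, v: (u + v) / np.linalg.norm(u + v)
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        return [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]

    seed = (np.array([1.0, 0, 0]), np.array([0, 1.0, 0]), np.array([0, 0, 1.0]))
    level1 = subdivide(seed)
    level2 = [t for tri in level1 for t in subdivide(tri)]
    print(len(level1), len(level2))        # 4, 16 view-locus triangles

    # View locations can be taken as the normalised centroids of the triangles.
    centroids = [sum(t) / np.linalg.norm(sum(t)) for t in level2]
    print(len(centroids))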

A problem with such a method is that equal sized view loci may over-model certain aspects of the object and under-model others, making reliable recognition hard or impossible.


Eg:

Consider the union of the view loci of the two vertices at the bottom of the slot. If there is not a viewpoint on the view sphere in that union, we would get no information about the structure of the bottom of the slot. That union would be a small proportion of the surface area of the view sphere, so having a viewpoint in there would not be very likely. Then, if there were two object classes that differed only in what's going on in that slot, we could never distinguish them since their set of stored views would be identical.


Characteristic views

In the late 1970's and early 1980's, groups led by J.J. Koenderink and H. Freeman developed the ideas of characteristic views and the aspect graph. These ideas have made 2-D view-centric approaches the dominant choice for 3-D OR ever since.

Let us limit ourselves to employing polyhedral models. Any view of a polyhedral model, either perspective or orthographic, reduced to a line drawing, is called an aspect of the model. An aspect is identified (uniquely specified) by the set of its visible facets.

Two aspects are said to be adjacent if they differ by only a single facet. A nondirected graph in which the aspects are nodes and two aspects are connected by an edge iff they are adjacent views is called an aspect graph.
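A minimal sketch of this definition: aspects are represented by their sets of visible facets, and two aspects are joined by an aspect-graph edge when the sets differ by exactly one facet (the aspect sets below are illustrative).

    from itertools import combinations

    def adjacent(aspect_a, aspect_b):
        """Adjacent iff the visible-facet sets differ by exactly one facet."""
        return len(set(aspect_a) ^ set(aspect_b)) == 1

    # Illustrative aspects, each identified by its visible facets (numbered 1..6).
    aspects = [frozenset(s) for s in ({1}, {1, 5}, {1, 2}, {1, 2, 5})]

    edges = [(a, b) for a, b in combinations(aspects, 2) if adjacent(a, b)]
    for a, b in edges:
        print(sorted(a), "<->", sorted(b))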


Every view of a given object always has both topological and geometric characteristics.

Topological
graph structure of the view when considered as an undirected graph (Image Structure Graph ISG), ie. number of vertices and their interconnectivity.
Geometrical
lengths of edges, size of angles.

Call a view "stable" if a small neighborhood around the viewpoint contains only views with the same topological structure, ie. where the same aspect is viewed. Then a contiguous stable set of views bounded by unstable viewpoints is called a characteristic view domain, and the aspect being viewed in that domain is called a characteristic view. A characteristic view class is a set of characteristic view domains which are topologically equivalent, ie. have identical ISGs.


For convex polyhedra, the characteristic view locus boundaries correspond to view loci formed by the facets of the polyhedron projected onto the view sphere.


To determine the aspect graph for a convex polyhedral object, extend the facets to complete planes in \(R^3\), and note how these planes partition \(R^3\) into cells. All cell boundaries are planes. Each cell except the object itself corresponds to an aspect of the polyhedron.

The nodes in the aspect graph are the aspects of the polyhedron. Aspects are adjacent if they are connected by a cell face. Each pair of adjacent aspects are connected by an edge in the aspect graph. That completes the definition of the aspect graph for polyhedra.


Let's number the aspects, beginning with the nine aspects above the object. Starting from the leftmost and backmost aspect, let's number them in raster order looking down at the object. Continue in that fashion for the eight aspects neither above nor below the object, then the nine below. So for instance aspect 2 is above and behind, while aspect 14 is directly to the right of the object.


Let's also number the object facets. Call the one nearest to us facet 1, facet 2 to its right, facet 3 behind, facet 4 to its left, 5 above, 6 below.

We are ready to construct the aspect graph. It will have 26 nodes, each labelled by a line drawing of the object as seen from that aspect. Each of the extended planes has 8 cell faces connecting aspects, for a total of 48 edges.
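A sketch that checks those counts: the six extended facet planes cut space into 27 cells, indexed by whether a cell lies below, between or above each pair of parallel planes; dropping the object cell and linking cells that share a cell face reproduces the 26 nodes and 48 edges.

    from itertools import product

    # Each cell is indexed by (i, j, k) with i, j, k in {0, 1, 2}: below, between,
    # or above the pair of parallel facet planes on each axis. (1,1,1) is the object.
    cells = [c for c in product(range(3), repeat=3) if c != (1, 1, 1)]

    def share_face(a, b):
        """Two cells share a cell face iff they differ by 1 in exactly one index."""
        diffs = sorted(abs(x - y) for x, y in zip(a, b))
        return diffs == [0, 0, 1]

    edges = [(a, b) for i, a in enumerate(cells) for b in cells[i + 1:] if share_face(a, b)]
    print(len(cells), "aspects,", len(edges), "aspect-graph edges")   # 26 aspects, 48 edges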


For instance, aspect 1 looks like

and has edges connecting this node to aspects number 2, 4 and 10. So this part of the aspect graph would look like

Note that the aspect graph nodes do not need to contain line drawings; they are shown here only for illustration.


For a typical example, many of the aspects are topologically equivalent. That is, if we made an ISG of each aspect, in which the visible facets were nodes and adjacent facets were connected with edges, these ISGs would be isomorphic.

For our example, there are only 3 topologically distinct aspects, ie. 3 non-isomorphic aspect-topology graphs: 1-facet, 2-facets, 3-facets, and thus three characteristic views.



Adjacent views, those linked in the aspect graph, have the property that there is a continuous path of stable view locations connecting them. That is, a viewer can move from one aspect to the other without encountering any other aspects, and can do so even with small perturbations to the path. Thus no single-facet view is linked to a 3-facet view: in moving from a one-facet view, one of the two additional facets would be encountered first, and a path on which both first appear at exactly the same point is unstable.


Construction of the aspect graph for non-convex polyhedra is more difficult. Freeman calls the partitioning planes described so far type A planes.

There are type B planes formed by an edge and a non-adjacent vertex, sometimes called edge-vertex planes.

Type B plane


The Aspect Graph Match Method for 3-D OR

Aspect graphs are used in 3-D OR by segmenting out the object to be recognized from the image, mapping it to a 2-D line drawing (ie. an aspect) and thence to an Image Structure Graph (ISG). The ISG is an undirected unattributed graph in which the nodes of the graph are the vertices of the aspect, and the links are its edges. We look for a match with the ISGs of the aspects of each class aspect graph.

If there are one or more matches to aspects within a single aspect graph, a recognition is declared. If no matches anywhere, the null hypothesis is declared.

If there are matches in two or more aspect graphs, a second view can be taken from an adjacent viewing position and the process repeated. Candidate classes must have aspects that match both views, with an edge in the aspect graph between them.

The process is continued until all matching aspect graphs have been found. If none, declare OR failure (null hypothesis). If one or more, proceed to geometric matching step. Aspect graph matching does not consider geometry.
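A sketch of the topological matching step, assuming the networkx library is available; the ISGs below are toy graphs whose nodes stand for aspect vertices and whose links stand for aspect edges.

    import networkx as nx

    def isg(links):
        """Build an Image Structure Graph from a list of (vertex, vertex) links."""
        g = nx.Graph()
        g.add_edges_from(links)
        return g

    # ISG of the segmented image object (a toy 4-vertex aspect).
    object_isg = isg([(0, 1), (1, 2), (2, 3), (3, 0)])

    # ISGs of the aspects stored in one class's aspect graph.
    class_aspects = {"aspect_1": isg([("a", "b"), ("b", "c"), ("c", "a")]),
                     "aspect_2": isg([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])}

    matches = [name for name, g in class_aspects.items()
               if nx.is_isomorphic(object_isg, g)]
    print(matches or "no match in this aspect graph")   # ['aspect_2']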


Note: There is a graph related to the ISG called the Planar Structure Graph. The nodes of the PSG are the visible vertices of the object, the edges are its visible edges. While any two aspects which are topologically equivalent have the same PSG's, two aspects can have the same PSG's without being topologically equivalent (the same is not true of ISG's). Eg:

These two views are not topologically equivalent even though their PSGs are identical:

To be topologically equivalent, two views need to have isomorphic PSGs and be related by rotation, translation and perspective projection in \(R^3\).


Finally, note that views that match topologically (perfectly matching aspect graphs) need not match geometrically. So a final step must be done after topological aspect graph matching to determine whether the object view geometries match up as well. Thus we should check for matches in angles and proportions.

Eg:

The topologies of these views match but not their geometries. There is no 3-D polyhedron such that Views A and B are both views of that object.


Batlle on natural object OR from outdoor scenes

Required reading: Batlle et al, A review on strategies for recognizing natural objects in color images of outdoor scenes, Image and Vision Computing v18 (2000) 515-530.

We close our look at 3-D OR with a general review of OR constrained to an important problem domain: recognizing objects in unstructured outdoor scenes.

This is a typical 3-D domain: it is constrained in terms of object classes, but there is great variability in the appearance of objects. The number of object classes in the 20 or so systems surveyed ranged up to 14.

Typical: {bushes, car, fence, grass, road, roof, shadow, window}.


Outdoor scenes are challenging because the appearance can change radically in seconds due to cloud movement. In such circumstances, color is very important for segmentation, and all the systems tested use color.

Natural objects tend to have complex and highly variable shapes that make them poor candidates for accurate shape modelling (eg. trees, rocks, grass) whether 2-D view-based or 3-D object-based. Most frequently they are modelled by a feature vector capturing a few coarse shape descriptors (eg. aspect ratio, major axis) together with texture and color features.

An important part of OR in such scenes is adjacency rules: for instance, the sky cannot be under a rock, or a rock high in a tree. This is an example of the use of object relationship or context rules in OR.
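A toy sketch of such an adjacency (context) rule check; the rule table and labels are hypothetical, not drawn from any of the surveyed systems.

    # Hypothetical context rules: (upper_label, lower_label) pairs that are forbidden,
    # ie. the upper region may not sit directly above the lower one.
    forbidden_above = {("rock", "sky"), ("road", "sky"), ("grass", "sky")}

    def violates_context(upper_label, lower_label):
        """True if hypothesising upper_label directly above lower_label breaks a rule."""
        return (upper_label, lower_label) in forbidden_above

    print(violates_context("rock", "sky"))    # True: the sky cannot be under a rock
    print(violates_context("sky", "rock"))    # False: sky above a rock is fine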


The surveyed algorithms are of three basic types: top-down, bottom-up and hybrid.

Top-down
begin with a small set of hypotheses, employ hypothesize-and-verify to aid segmentation. Use heterogeneous knowledge-based procedures such as context rules and object-part rules for both segmentation and matching.
Bottom-up
begin with pixel values, use data-driven procedures, uniform classes of archetype models and image object models. Only hypothesize at final matching step.
Hybrid
do some bottom-up processing (eg. non-purposive low-level segmentation), then top-down processing (eg. class-specific merging high-level segmentation rules).

Top-down systems can handle very complex scenes due to their flexibility, with different procedures for different hypotheses. But they are not good at handling or learning unmodelled objects, or at problems with many object classes.

Bottom-up systems can handle unmodelled objects well, since they generate descriptions of all objects encountered, not just objects of known classes. But they have great difficulty with complex scenes and are mostly limited in practical use to "blocks-world" type problems.

Hybrid systems combine the discrimination power of top-down systems with the robustness of bottom-up systems. After initial non-purposive segmentation, the segments and their adjacencies can be used to order hypotheses for purposive continuations. The descriptions derived for new classes can be the basis for learning.


Main takeaway from Batlle's paper: current computer vision systems working in natural (not artificially constrained) outdoor environments are not yet capable of recognizing lexicons of hundreds or thousands of distinct object types reliably. Small lexicons can be handled, but the scale-up problem has not yet been solved.

There has been some progress in natural scene labelling since the paper was written in 2000, but I think the conclusion is still valid.