Purposive vision paradigms

Departing for a time from approaches based on the Marr paradigm of task-independent, disembodied "vision as recovery", we will next consider the newer "active" approach to extracting useful information from 3-D and motion image sequences. Different research groups give this approach several names, each emphasizing a different aspect:

We will begin by reviewing three important papers which lay out the conceptual schemes, or paradigms, that underlie these newer approaches. They have much in common, as we will see. Together they define a common view of how to improve on the passive, recovery-based paradigm that dominated computer vision for so many years.


Reading
Fermuller and Aloimonos, Vision and action, Image and Vision Computing v 13 (1995) 725-744.

The two basic theses of this paper:

  1. What is vision? It is not recovery (the construction of a quantitative 3-D world model of all possible object identifications and localizations). It is the process of acquiring task-directed visual information and then acting. Vision exists to support behavior.
  2. Vision is intimately linked with the observing entity, its goals and physical capabilities. It is goal-directed behavior which can be understood only in the context of the physical "details" of the observer. Vision is embodied and goal-directed.

E.g.: The vision of a flying insect will have different representations, algorithms and decision processes than that of a human. The same holds for a human versus a robot welder, or a robot welder versus a Mars rover vehicle.


In a nutshell, a common element of these recent research approaches is that the goal of vision is not to see but to do. Aspects of vision which, in a given embodiment and task environment, do not support action (behaviors) need not be implemented.

A quick review of the Marr paradigm

David Marr, a British mathematician and neuroscientist on the faculty of MIT, proposed, in a very influential 1982 book "Vision: a computational investigation into the human representation and processing of visual information," that vision is a process of creating a full and accurate world model of the visually accessible world through quantitative recovery.

Vision can be understood, Marr proposed, outside the specifics of task or embodiment. There is "vision," according to Marr, not "visions."


His paradigm decomposed the study of vision into three hierarchical levels: theory, algorithm, and representation.

The algorithms for tracking, for instance, should be similar in a frog hunting insects and in a mother deer watching her fawns run through a meadow.


Marr believed it was very important to separate the three hierarchical levels and to analyze them top-down and independently. First determine the theory for some operation, then see what algorithms satisfy that theory, then determine the representations that are most efficient for those algorithms.

At the theory level, Marr defined three stages in the process of transforming data on the retina into visual awareness of the 3-D world:

The primal sketch extracts key features, such as edges, colors and segments. The 2 1/2-D sketch overlays the primal sketch with secondary features: shading, texture, etc. The 3-D world model is the fully recovered scene.


Where are the problems in the Marr paradigm?

We next turn to the active vision approach of Fermuller and Aloimonos to see if these problems can be addressed through their paradigm.


Fermuller-Aloimonos: Modular View of an active vision system

Each visual competency is a skill needed to complete some task or set of tasks. The purposive representations are data structures that are as simple as possible while still effectively supporting a linked competency.

E.g.: Obstacle avoidance is a basic visual competency. A purposive representation for this competency needs to include some measure of visual proximity.
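As a concrete illustration, here is a minimal Python sketch (my own, not from the paper) of such a purposive representation: a coarse grid of visual-proximity values from which the competency needs nothing more than a steering decision. The grid contents and the threshold are assumptions made for the example.

    # A minimal sketch of a purposive representation for obstacle avoidance:
    # a coarse 2-D grid of visual-proximity values (larger = nearer), reduced
    # directly to a steering decision rather than to a full 3-D map.
    import numpy as np

    def steer_from_proximity(proximity, threshold=0.5):
        """proximity: 2-D array of nearness estimates, normalized to [0, 1].
        Returns 'ahead', 'left', or 'right'."""
        h, w = proximity.shape
        left = proximity[:, : w // 2].mean()    # average nearness, left half of view
        right = proximity[:, w // 2 :].mean()   # average nearness, right half of view
        if max(left, right) < threshold:        # nothing close enough to matter
            return "ahead"
        return "right" if left > right else "left"   # turn away from the nearer side

    # Example: something looming on the left of an 8x8 proximity grid
    grid = np.zeros((8, 8))
    grid[:, :3] = 0.9
    print(steer_from_proximity(grid))   # -> 'right'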


The Fermuller-Aloimonos paradigm laid out in this paper is a "synthetic" approach: start by defining the simplest goals and competencies, embody them, link them with learning. Then add a new layer of competencies on top of the existing set for a given embodiment.

Competencies

These should be understood as skills for recognition, not recovery.

This distinction is very important, for it puts the qualitative procedures of pattern recognition at the center of the action, not the quantitative procedures of metric mapping. So the right question for determining representations and algorithms is: what do we need to know about the object to label it properly and guide us to take the right action?


Their "Hierarchy of Competencies"


Note that the Motion Competencies precede the Form Competencies. In their view, we need to be able to understand motion before we can understand shape. The basic principle of the form competencies is that shapes can be represented locally and qualitatively.

E.g.:


Finally, there are the Space Competencies

These are at the top of the hierarchy of competencies, since they involve interpretations of spatio-temporal arrangements of shapes.


Reading
Ballard, Animate Vision, Artificial Intelligence v 48 (1991) 57-86

Dana Ballard, working at the University of Rochester, had a similar conception of how to get past the recovery roadblock. He called his paradigm animate vision, that is, vision with strong anthropomorphic features. These include

By refusing to study monocular systems in blocks-world settings, Ballard emphasized that interaction with the environment produced additional constraints that could be used to solve vision problems.


Point 1 appears to have moved to the right, since a' > a and b' < b (the right disparity of the left eye has increased, the left disparity of the right eye has decreased). It's the opposite for Point 2.


So to tell which of two points is farther away in depth, fixate one and move your head laterally. See whether the other point moves in the same or the opposite direction.

Using kinetic depth, the problem of depth estimation becomes much simpler (compared to shape-from-shading or other shape-from methods) with a lateral-head movement because we can add a constraint to each object point P (in fixation coordinates)

\(z_P = k ||dP||/||dC||\)

where \(z_P\) is the depth of point P, \(dP\) is the displacement vector of P in the fixation plane, and \(dC\) is the displacement vector of the camera projected into the fixation plane.
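A small Python sketch of this constraint (the function names and the calibration constant k are my own, not Ballard's): estimate \(z_P\) from the displacement magnitudes, and use the direction of the point's motion relative to the camera's motion for the fixate-and-move test described above.

    # Sketch of the kinetic-depth constraint z_P = k * ||dP|| / ||dC||.
    # dP, dC are the displacement vectors defined in the text; k is an
    # assumed calibration constant.
    import numpy as np

    def kinetic_depth(dP, dC, k=1.0):
        """Depth estimate z_P = k * ||dP|| / ||dC|| for one tracked point."""
        dP, dC = np.asarray(dP, float), np.asarray(dC, float)
        return k * np.linalg.norm(dP) / np.linalg.norm(dC)

    def moves_with_camera(dP, dC):
        """True if P's image motion has a positive component along the camera motion.
        (With fixation held, motion with the camera typically means the point is
        farther than the fixation point; motion against it, nearer.)"""
        return float(np.dot(dP, dC)) > 0.0

    dC = (0.02, 0.0)                       # camera slides 2 cm to the right
    p1, p2 = (0.004, 0.0), (-0.002, 0.0)   # image displacements of two tracked points
    for dP in (p1, p2):
        print(kinetic_depth(dP, dC), moves_with_camera(dP, dC))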


Animate vision paradigm
Visual processing is built on animate behaviors and anthropomorphic features. Fixation frames and indexical reference lead to compact representations.
Fixation frame
means placing the origin of the coordinate system at the point you are looking at, or "fixating."
Indexical reference
means indexing views by their most important features, discarding unnecessary visual detail.

Advantages mentioned by Ballard:

Note that a fixation frame has its coordinate center at the fixation point, while an object-centered frame keeps its coordinates centered on the object even as your fixation point changes.
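A small illustrative sketch (an assumed setup, not code from the paper) of what a fixation frame means in practice: points are re-expressed with the origin at the current fixation point and axes aligned with the gaze, so the representation stays relative to where the system is looking.

    # Re-express world points in a fixation frame: translate so the fixation
    # point is the origin, then rotate into gaze-aligned axes.
    import numpy as np

    def to_fixation_frame(points_world, fixation_world, R_gaze=np.eye(3)):
        """R_gaze maps world axes to gaze-aligned axes (identity by default)."""
        P = np.asarray(points_world, float) - np.asarray(fixation_world, float)
        return P @ R_gaze.T

    # Two object points expressed relative to a fixation point at (1, 0, 3)
    pts = [(1.2, 0.1, 3.5), (0.8, -0.2, 2.9)]
    print(to_fixation_frame(pts, (1.0, 0.0, 3.0)))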



Visual behaviors

Just as lateral head movement simplifies depth estimation, other motions simplify other visual processes. These motions are called visual behaviors.

So the animate vision approach is that for different tasks and different embodiments we should engage different visual behaviors. Each distinct visual behavior then evokes distinct compatible representations and constraints.


Visual behaviors can use very simple representations, for instance color and texture maps.

The visual behavior of texture homing can be done, for instance, by finding the sign of the gradient of the match score between the present view and the desired texture as fixation changes. That one-bit representation is then used to guide future saccades (eye movements).
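A rough sketch of texture homing along these lines, with assumed helper names and a normalized-correlation match score standing in for whatever score an actual system would use: only the one-bit sign of the change in score survives to drive the next saccade.

    # Texture homing sketch: compare the match score before and after a small
    # fixation shift and keep only the sign of the change.
    import numpy as np

    def match_score(view, target):
        """Normalized correlation between the current view patch and the target texture."""
        v = (view - view.mean()) / (view.std() + 1e-9)
        t = (target - target.mean()) / (target.std() + 1e-9)
        return float((v * t).mean())

    def saccade_direction(score_before, score_after):
        """One-bit decision: keep moving the same way only if the match improved."""
        return "continue" if score_after > score_before else "reverse"

    rng = np.random.default_rng(0)
    target = rng.random((16, 16))
    off_view = rng.random((16, 16))                    # fixation far from the target texture
    near_view = target + 0.1 * rng.random((16, 16))    # fixation shifted toward it
    print(saccade_direction(match_score(off_view, target),
                            match_score(near_view, target)))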

Most visual processes can be divided into "what" and "where" tasks: recognize and locate. Visual behaviors make both "what" and "where" searches more efficient.


Reading
Aloimonos, Weiss and Bandyopadhyay, "Active Vision", Int. J. of Computer Vision, v 1 (1988), 333-356

While this paper was not the first to consider the combination of perception and intentional behavior that became known as active vision, it probably had more impact than any of the other early papers because of its strong claims for this area. These claims are summarized in Table 1 from the paper:

Problem               | Passive Observer                                                       | Active Observer
Shape from shading    | Ill-posed; needs regularization; no unique solution (nonlinear)        | Well-posed, linear, stable
Shape from contour    | Ill-posed; hard to regularize                                          | Well-posed; unique solutions (monocular and binocular)
Shape from texture    | Ill-posed; needs added assumptions                                     | Well-posed
Structure from motion | Unstable, nonlinear                                                    | Stable, quadratic
Optic flow            | Ill-posed; needs regularization, and regularization is dangerous here | Well-posed; unique solution

The paper goes on to justify each of these claims by mathematical analysis. We will not have time to discuss the analysis, but you should at least read it over to familiarize yourself with the arguments.

Their bottom line:

... controlled motion of the viewing parameters uniquely computes shape and motion.

In the years since this paper was published, active vision has become the dominant paradigm in some important vision application areas (e.g. robotics, autonomous vehicle navigation) because some of the claims above have proven out in practice.

But there are hidden difficulties also... for instance, how do you build an intelligent camera platform to execute desired saccades accurately in real time?


Note: The definition of a well-posed problem in applied mathematics, due to the French mathematician Hadamard, is one which has these three properties:

  1. It has a solution
  2. The solution is unique
  3. The solution depends continuously on data and parameters

He argues persuasively that only well-posed problems have solutions which are of practical value in the real world. Any problem which is not well-posed is said to be ill-posed.

E.g.:

  1. \(x^2+a=0;\ x \in \mathbb{R};\ -2<a<-1\), is ill-posed (multiple solutions)
  2. \(x^2+a=0; x \in \mathbb{R}; +1<a<+2\), is ill-posed (no solutions)
  3. \(y=x+1, y=ax;\ x,y \in \mathbb{R};\ 0<a<+2\), is ill-posed (solution not continuous in a)
  4. \(y=x+1, y=ax; x,y \in \mathbb{R}; +2<a<+3\), is well-posed
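
To see why example 3 fails Hadamard's conditions while example 4 satisfies them, solve the two lines for their intersection:

\[ x + 1 = ax \;\Rightarrow\; x = \frac{1}{a-1}, \qquad y = \frac{a}{a-1}. \]

On \(0<a<2\) the denominator passes through zero: there is no solution at \(a=1\), and the solution blows up as \(a \to 1\), so it cannot depend continuously on \(a\). On \(2<a<3\) the denominator stays between 1 and 2, so the solution exists, is unique, and varies continuously with \(a\).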