CSE668 Principles of Animate Vision Spring 2011
5. Purposive vision paradigms
After the Marr paradigm of task-independent, disembodied
"vision as recovery", we will next consider the newer
"active" approach to extracting useful information from
3-D and motion image sequences. Different research groups
give this approach several names, each emphasizing
different aspects:
* purposive vision
* animate vision
* active vision
We will begin by reviewing three important papers which
lay out the conceptual schemes, or paradigms, that underlie
these
newer approaches. They have much in common, as we will
see. Together they define a common view of how to improve on
the passive, recovery-based paradigm that dominated
computer vision for so many years.
CSE668 Sp2011 Peter Scott 07-01
Fermuller and Aloimonos, "Vision and action," Image and
Vision Computing v 13 (1995) 725-744. Required reading.
The two basic theses of this paper:
1. What is vision?
It is not recovery (the construction of a quantitative
3-D world model of all possible object identifications
and localizations). It is the process of acquiring
task-directed visual information and then acting.
Vision exists to support behavior.
2. Vision is intimately linked with the observing entity,
its goals and physical capabilities. It is goal-directed
behavior which can be understood only in the context of
the physical "details" of the observer. Vision is
embodied and goal-directed.
Eg: The vision of a flying insect will have different
representations, algorithms and decision processes
than that of a human. Likewise, the vision of a human
and a robot welder, or a robot welder and a Mars
rover vehicle.
In a nutshell, a common element of these recent research
approaches is that the goal of vision is not to see but to
do. Aspects of vision which, in a given embodiment and
task environment, do not support action (behaviors) should
not be supported.
A quick review of the Marr paradigm
David Marr, a British mathematician and neuroscientist
on the faculty of MIT, proposed, in a very influential
1982 book "Vision: a computational investigation into
the human representation and processing of visual
information," that vision is a process of creating a full
and accurate world model of the visually accessible world
through quantitative recovery. Vision can be understood,
Marr proposed, outside the specifics of task or
embodiment.
There is "vision," according to Marr, not "visions."
His paradigm decomposed the study of vision into three
hierarchical levels:
* Theory
* Algorithms
* Representations
The algorithms for tracking, for instance, in a frog
hunting insects should be similar to those of a mother
deer watching her fawns running through a meadow.
Marr believed it was very important to separate the
three hierarchical levels and to analyze them top-down
and independently. First determine the theory for some
operation, then see what algorithms satisfy that theory,
then determine the representations that are most
efficient for those algorithms.
At the theory level, Marr defined three stages in the
process of transforming data on the retina into visual
awareness of the 3-D world:
* Creation of a "primal sketch"
* Mapping the primal sketch into a "2 1/2-D sketch"
* Mapping the 2 1/2-D sketch into a 3-D world model
The primal sketch extracts key features, such as edges,
colors and segments. The 2 1/2-D sketch overlays the
primal sketch with secondary features: shading, texture,
etc. The 3-D world model is the fully recovered scene.
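As a toy illustration (the function below is invented for these notes, not Marr's actual machinery), the edge features of a primal sketch can be sketched as thresholded local intensity differences:

```python
# Toy "primal sketch" edge feature: flag pixels where the horizontal
# intensity difference exceeds a threshold. Marr's actual primal sketch
# is far richer (blobs, bars, terminations, zero-crossings).

def edge_map(image, threshold):
    """Return a binary map of horizontal intensity steps in a 2-D grid."""
    edges = []
    for row in image:
        edge_row = [0] * len(row)
        for x in range(1, len(row)):
            if abs(row[x] - row[x - 1]) > threshold:
                edge_row[x] = 1
        edges.append(edge_row)
    return edges

# A dark-to-bright step at column 2 in both rows:
img = [[0, 0, 9, 9],
       [0, 0, 9, 9]]
print(edge_map(img, 5))  # [[0, 0, 1, 0], [0, 0, 1, 0]]
```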
Where are the problems in the Marr paradigm?
* Disembodied: inefficient, ineffective
* Hierarchical: non-adaptive, non-feedback, non-learning
* Quantitative recovery: non-robust, unstable
We next turn to the active vision approach of Fermuller and
Aloimonos to
see if these problems can be addressed through
their paradigm.
Fermuller-Aloimonos' Modular View of an active vision system
Each visual competency is a skill needed to complete some
task or set of tasks. The purposive representations are
data structures as simple as they can be while still
effectively supporting a linked competency.
Eg: Obstacle avoidance is a basic visual competency.
A purposive representation for this competency needs
to include some measure of visual proximity.
The Fermuller-Aloimonos paradigm laid out in this paper
is a "synthetic" approach: start by defining the simplest
goals and competencies, embody them, link them with
learning. Then add a new layer of competencies on top
of the existing set for a given embodiment.
Competencies
These should be understood as skills for recognition,
not recovery.
Eg: For obstacle avoidance we need to recognize
which object points are getting closer to us,
threatening us. We don't need to know how many
mm away they are or their quantitative velocity.
This distinction is very important, for it puts the
qualitative procedures of pattern recognition at the
center of the action, not the quantitative procedures
of metric mapping. So the right question for determining
representations and algorithms is: what do we need to
know about the object to label it properly and guide us
to take the right action?
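This recognition-not-recovery point can be made concrete with a minimal looming detector (a hypothetical sketch, not from the paper): an object point threatens us when its apparent image size keeps growing, and no metric depth or velocity is ever computed.

```python
# Qualitative obstacle cue: an object is approaching if its apparent
# (image) size grows frame over frame. Only a sign is extracted --
# no mm, no metric velocity.

def is_looming(apparent_sizes):
    """True if apparent size grew on every consecutive frame pair."""
    return all(b > a for a, b in zip(apparent_sizes, apparent_sizes[1:]))

print(is_looming([10.0, 11.5, 13.2]))  # True: expanding, take avoiding action
print(is_looming([10.0, 9.8, 9.1]))    # False: receding, ignore
```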
Their "Hierarchy of Competencies":
Motion Competencies
* Egomotion estimation
* Object translational direction
* Independent motion detection
* Obstacle avoidance
* Tracking
* Homing
Form (Shape) Competencies
* Segmentation
* 2-D boundary shape detection
* Localized 2-1/2-D qualitative patch maps
* Multi-view integrated ordinal depth maps
Note that the Motion Competencies precede the Form
Competencies. In their view, we need to be able to
understand motion before we can understand shape.
The basic principle of the form competencies is that
shapes can be represented locally and qualitatively.
Finally, there are the Space Competencies:
* View-centered recognition and localization
* Object-centered recognition and localization
* Action-centered recognition and localization
* Object relationship awareness
These are at the top of the hierarchy of competencies,
since they involve interpretations of spatio-temporal
arrangements of shapes.
Ballard, "Animate Vision," Artificial Intelligence v 48
(1991) 57-86. Required reading.
Dana Ballard, working at U of Rochester, had a similar
conception of how to get past the recovery roadblock.
He called his paradigm animate vision, that is, vision
with strong anthropomorphic features. These include
* Binocularity
* Foveas
* Gaze control
By refusing to study monocular systems in blocks-world
settings, Ballard emphasized that interaction with the
environment produced additional constraints that could
be used to solve vision problems.
Eg: Computation of kinetic depth (motion parallax).
This requires binocularity and gaze control.
Move the camera laterally while maintaining
fixation on a point X. Points in front of X
move in the opposite direction to points behind X,
with apparent velocities proportional to depth
difference with X.
Point 1 appears to have moved to the right, since
a' > a and b' < b (right disparity of left eye has
increased, left disparity of right eye decreased).
It's the opposite for Point 2.
So to tell which of two points is farther away
in depth, fixate one and move your head laterally.
See whether the other point moves in the same or
the opposite direction.
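That same-or-opposite rule reduces to the sign of one product. A minimal sketch (the function and coordinate convention are assumptions of these notes): with fixation maintained, a point moving against the head motion is nearer than the fixation point, and a point moving with it is farther.

```python
# Ordinal depth from motion parallax under fixation: compare the image
# motion of a point P with the direction of the head/camera motion.
# Opposite signs -> P is in front of the fixation point (nearer);
# same sign -> P is behind it (farther).

def ordinal_depth(camera_dx, point_dx):
    """Classify P relative to the fixation point from one lateral move."""
    if point_dx == 0:
        return "at fixation depth"
    return "in front" if camera_dx * point_dx < 0 else "behind"

print(ordinal_depth(+1.0, -0.3))  # in front: moved against the camera
print(ordinal_depth(+1.0, +0.3))  # behind: moved with the camera
```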
Using kinetic depth, the problem of depth estimation
becomes much simpler (compared to shape-from-shading
or other shape-from methods) with a lateral head
movement, because we can add a constraint to each
object point P (in fixation coordinates):

zP = k ||dP|| / ||dC||

where zP is the depth of point P, dP is the displacement
vector of P in the fixation plane, and dC is the
displacement vector of the camera projected into the
fixation plane.
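Given the constraint above, each depth estimate is a single ratio of displacement magnitudes. A minimal sketch, with the calibration constant k assumed known:

```python
import math

# Kinetic depth under fixation: zP = k * ||dP|| / ||dC||, where dP is
# the displacement of point P in the fixation plane and dC the camera
# displacement projected into that plane. k is a calibration constant.

def kinetic_depth(dP, dC, k=1.0):
    """Depth of P in fixation coordinates from one lateral camera move."""
    return k * math.hypot(*dP) / math.hypot(*dC)

# A point displaced twice as far as the camera sits at depth 2k:
print(kinetic_depth(dP=(0.4, 0.3), dC=(0.2, 0.15)))  # 2.0
```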
Animate vision paradigm: Visual processing is built
on animate behaviors and anthropomorphic features.
Fixation frames and indexical reference lead to
compact representations. Note: fixation frame means
placing the origin of the coordinate system at the
point you are looking at or "fixating;" indexical
reference means indexing views by their most important
features,
discarding unnecessary visual detail.
Advantages mentioned by Ballard:
* More efficient for search
* Can represent controlled camera motions well
* Can use fixation and object-centered frames
Note that a fixation frame has its coordinate center
at the fixation point, while an object-centered frame
has its coordinates centered on the object even as
your fixation point changes.
* Can use qualitative rather than quantitative
representations and algorithms
Eg: for collision avoidance, only have to worry about
objects "in front of" other objects. Don't need
a metric scale.
* Can isolate a region of interest preattentively.
Eg: Using head-motion induced blur, can find a fixation
ROI easily (object points near the fixation point
in three-space).
* Exocentric frames (object, fixation) are efficient
for recognition and tracking.
* Foveal cameras yield low-dimensional images for
given FOV and maximum resolution.
* Active vision permits use of the world as a
memory cache.
Eg: You see a book on the desk. Rather than memorize
its title, you just need to recall where the book
is. When you need the title, you can "look it up"
in the cache by gazing at it once again.
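The cache idea can be sketched in a few lines (all names here are hypothetical, and the `world` dict stands in for the physical scene): memory holds only where to look, and the detail is fetched by gazing again.

```python
# World as a memory cache: remember only where the book is; read its
# title off the world by "gazing" at that location again.

world = {("desk", "left corner"): {"title": "Vision"}}  # stands in for the scene

gaze_cache = {}  # internal memory: object name -> location only

def notice(obj, location):
    gaze_cache[obj] = location  # cheap to store: where, not what

def look_up(obj, attribute):
    location = gaze_cache[obj]         # recall where to look
    return world[location][attribute]  # re-gaze to read the detail

notice("book", ("desk", "left corner"))
print(look_up("book", "title"))  # Vision
```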
Eg: You want to decide if an aircraft you are looking
at matches a photograph. You quickly scan the
aircraft, noting where the insignia is, where the
engine intakes are, where the cockpit is. Then
you look at the photograph, memorize the insignia,
then go to your "indexically referenced" aircraft
view and compare the insignias. Repeat for the
other key features, indexed by location.
Visual behaviors
Just as lateral head movement simplifies depth estimation,
other motions simplify other visual processes. These
motions are called visual behaviors.
Eg: Fixate on a point you are trying to touch, and hold
the fixation while you move your finger. If the
relative disparity decreases, you are getting closer.
Can think of this exploratory finger motion as
sampling a fixation frame relative disparity map.
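The reaching behavior above needs only the sign of one scalar over time. A minimal sketch with hypothetical disparity samples:

```python
# Reaching under fixation: sample the relative disparity between finger
# and fixated target as the finger moves. Monotone decrease -> closing in.

def getting_closer(disparity_samples):
    """True if relative disparity shrank on every sampled finger move."""
    return all(b < a for a, b in zip(disparity_samples, disparity_samples[1:]))

print(getting_closer([4.0, 2.6, 1.1]))  # True: finger approaching target
print(getting_closer([4.0, 4.5, 5.0]))  # False: moving away
```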
So the animate vision approach is that for different tasks
and different embodiments we should engage different
visual behaviors. Each distinct visual behavior then
evokes distinct compatible representations and constraints.
Eg: Fixation frame relative disparity map for hand-eye
coordination vs. indexical reference map for object
recognition.
Visual behaviors can use very simple representations, for
instance color and texture maps.
Eg: Texture homing. Fixate a region with a given texture.
Eg: Color homing, edge homing.
The visual behavior of texture homing can be done, for
instance, by finding the sign of the gradient of the match
score between the present view and the desired texture as
fixation changes, then using that one-bit representation
to guide future saccades (eye movements).
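A sketch of that one-bit homing loop, under the simplifying assumptions of a 1-D fixation axis and a given match-score function (both invented here):

```python
# Texture homing by gradient sign: probe the texture match score on
# either side of the current fixation and saccade one step toward the
# better side. Each step uses a single bit of information.

def home(match, x, step=1.0, iters=20):
    """Move fixation x uphill on the match score, one sign bit at a time."""
    for _ in range(iters):
        sign = 1 if match(x + step) > match(x - step) else -1
        x += sign * step
    return x

def score(x):
    """Toy 1-D match score peaking at the target texture, x = 7."""
    return -(x - 7.0) ** 2

print(home(score, x=0.0))  # fixation settles within one step of 7
```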
Can divide most visual processes into "what" and "where"
tasks: recognize and locate. Visual behaviors make both
what and where searches more efficient.
* What: indexical reference
* Where: low-dimensional visual search representations
Aloimonos, Weiss and Bandyopadhyay, "Active Vision,"
Int. J. of Computer Vision, v 1 (1988), 333-356.
Required reading.
While this paper was not the first to consider the
combination of perception and intentional behavior
that became known as active vision, it probably
had more impact than any of the other early papers
because of its strong claims for this area. These
claims are summarized in Table 1 from the paper:
Problem                 Passive Observer              Active Observer
-------                 ----------------              ---------------
Shape from shading      Ill-posed, needs to be        Well posed, linear,
                        regularized. Even then        stable.
                        no unique soln (nonlin).
Shape from contour      Ill-posed, hard to            Well posed, unique
                        regularize.                   solns (mono, bino).
Shape from texture      Ill-posed, needs added        Well posed.
                        assumptions.
Structure from motion   Unstable, nonlinear.          Stable, quadratic.
Optic flow              Ill-posed, needs to be        Well posed, unique
                        regularized.                  solution.
                        Regularization dangerous.
The paper goes on to justify each of these claims by
mathematical analysis. We will not have time to discuss
the analysis, but you should at least read it over to
familiarize
yourself with the arguments.
Their bottom line: "... controlled motion of the
viewing parameters uniquely computes shape and
motion."
In the 15 years since this paper was published, the area
of active vision has become the dominant paradigm in
some important vision application areas (eg. robotics,
autonomous vehicle navigation) because some of the
claims above have proven out in practice. But there are
hidden difficulties also... for instance, how do you
build an intelligent camera platform to execute desired