# Introduction

In this course, we are interested in vision in The Real World: a world of moving 3D objects and scenes, in which imaging systems are not ideal, and objects, noise, illumination, and motion are all difficult to constrain in advance.

We are not interested in single static images, 2-D Flat Worlds, Blocks Worlds, or other such constrained worlds. This eliminates many approaches and algorithms.

## The standard model of vision: Vision for Recovery

An imaging model is a mapping from $$\mathbb{R}^3$$, the 3D scene, to $$\mathbb{R}^2$$, the image plane: it maps a 3D scene into a 2D image of that scene.

The standard model of vision is that the purpose of vision is to invert the imaging model. That is, given an image, recover (reconstruct) the scene. Determine the shapes and locations of all objects in the scene.
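A minimal sketch of such an imaging model is perspective (pinhole) projection, where a 3D point $(X, Y, Z)$ maps to image coordinates $(fX/Z, fY/Z)$. The focal length and the sample points below are illustrative assumptions, not values from the text:

```python
import numpy as np

def project(points_3d, f=1.0):
    """Pinhole imaging model: map 3D scene points to the 2D image plane.

    points_3d: (N, 3) array of points in camera coordinates (Z > 0).
    f: focal length (assumed value; any positive scalar works).
    """
    points_3d = np.asarray(points_3d, dtype=float)
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([f * X / Z, f * Y / Z], axis=1)

# Two distinct 3D points on the same ray project to the same image point:
# the mapping is many-to-one, which is why inverting it (recovery) is hard.
p = project([[1.0, 2.0, 2.0], [2.0, 4.0, 4.0]])
```

Inverting `project` is exactly the recovery problem: from `p` alone, either input point (and infinitely many others along the ray) is a valid answer.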

There are attractive elements to the "Vision As Recovery" approach:

• It is compatible with all cameras and scenes.
• Recovery supports any narrower task.
• Recovery uses visual data maximally.
• It is objectively assessable and quantifiable.

Because recovery is so general and its errors quantifiable, it has been the standard model for understanding what vision "is" for a long time.

This approach is called passive vision. We do not actively choose images or goals. In this course we will consider passive vision for 3D and motion recovery.

• Early passive vision:
  • 3D imaging models: projective geometry, stereopsis, epipolar geometry.
  • Shape recovery: shape from shading and other shape-from algorithms, illumination and reflection (radiometry), correspondence.
• Late passive vision:
  • 3D object recognition: 3D object-centered and 2D view-centered representations, indexing and matching.
  • Motion analysis: optical flow, structure from motion, passive egomotion, tracking.

As we review this literature, we will find that the general problem of scene recovery from passive imagery is far from satisfactorily solved. Only a few very special cases of this approach have succeeded:

• Blocks World object recognition systems;
• Autonomous vehicle navigation systems operating slowly on structured roadways;
• Robots in controlled environments.

Why is recovery so difficult?

• The imaging model is many-to-one, so recovery is not well posed: the inverse mapping is one-to-many, underconstrained, non-robust, and sensitive to noise. Eg: the Necker cube — is the cube tilted up or down?

• The imaging model has many parameters. Eg: intrinsic camera parameters, extrinsic parameters, illumination parameters, surface reflectance parameters, etc. They are hard to identify accurately, and motion and shape parameters are hard to separate.
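To make the parameter count concrete, even the geometric part of a standard pinhole camera model, intrinsics $K$ plus extrinsics $[R \mid t]$, already carries around a dozen parameters before illumination or reflectance enter. All numeric values below are illustrative assumptions:

```python
import numpy as np

# Intrinsic parameters (assumed values): focal lengths and principal point.
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Extrinsic parameters (assumed values): a small rotation about the Y axis
# plus a translation of the camera relative to the scene.
theta = np.deg2rad(10.0)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.0, 2.0])

# Full imaging model: x ~ K (R X + t), followed by division by depth.
X = np.array([0.0, 0.0, 3.0])   # a 3D scene point
x_cam = R @ X + t               # scene point in camera coordinates
u, v, w = K @ x_cam             # homogeneous image coordinates
pixel = np.array([u / w, v / w])
```

Recovery must estimate all of these (plus illumination and reflectance) from images alone, and small errors in one group of parameters can masquerade as changes in another, which is the motion/shape separation problem noted above.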

Given these difficulties, perhaps we should look at the best existing systems, namely natural ones, for clues to alternative approaches.

• Q: Does biological vision strive for recovery?
• A: Almost never! Biological vision is designed to support specific behaviors, not to recover every detail of everything it sees. It is purposive.
• Eg: Frog waiting for an insect to fly by. It does not need to recover the scene, just to detect moving objects and estimate distance to them.
• Eg: Bee flying to the hive. It does not need to recover the scenes it confronts on the way, it just needs to recognize a few landmarks and do obstacle avoidance.
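The frog example can be made concrete with a purposive detector that recovers nothing about the scene: simple frame differencing flags moving objects, which is all the behavior requires. The threshold and the toy frames below are assumptions for illustration:

```python
import numpy as np

def moving_object_mask(prev_frame, frame, threshold=10):
    """Purposive 'frog' detector: flag pixels whose brightness changed.

    No scene recovery at all -- just frame differencing, enough to
    trigger a behavior. threshold is an assumed tuning parameter.
    """
    diff = np.abs(frame.astype(int) - prev_frame.astype(int))
    return diff > threshold

# Toy 8x8 frames: a bright 'insect' moves one pixel to the right.
prev = np.zeros((8, 8), dtype=np.uint8)
prev[3, 2] = 255
curr = np.zeros((8, 8), dtype=np.uint8)
curr[3, 3] = 255
mask = moving_object_mask(prev, curr)
```

The detector says nothing about shape, reflectance, or layout; it answers only the question the frog's behavior poses, which is the purposive point.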

## Purposive vision

Recovery is too difficult, and it produces much information that is never needed; vision as recovery is wasteful of important resources. So here is an alternative to Vision As Recovery: vision exists not to recover the scene but to support specific behaviors and tasks. We refer to this as Purposive Vision. From this point of view, all representations, algorithms, and strategies should be task-dependent, not aimed at recovery.

## Vision vs. Visions

Thus there is not one "vision" but many "visions." Vision for a cheetah chasing its prey should use quite different algorithms from vision for an ant seeking food, or for a cheetah returning to its den. Those interested in purposive vision do not ask "how, in general, do things see?" but rather "how does vision-enabled system X support task Y?" X can be a cheetah, a human, or a CCD camera linked to a computer; Y can be egomotion estimation, obstacle avoidance, object detection, tracking, etc.

So purposive vision is about selecting representations, algorithms and strategies which fit with:

• A specific goal, task or behavior.
• A given embodiment.
• A given set of environmental constraints.

• Eg: goal: homing;
  • embodiment: bee with multifaceted bee eyes;
  • environmental constraints: can fly up to 5 mph; must fly at low altitudes also occupied by trees and bushes; may sustain attack by reptiles and spiders.
• Eg: behavior: walking;
  • embodiment: human being;
  • environmental constraints: path is uneven, with roots and rocks to trip over; path is difficult to see in places; must divide attention between the footfall area and the area ahead to stay on the path.
• Eg: task: Scud missile interception;
  • embodiment: anti-missile missile with onboard forward-looking camera;
  • environmental constraints: ballistic target, chaff, very high-speed intercept.

It was the dream of the Vision As Recovery scientists to devise one set of algorithms, representations and schemes for all vision. But it is hard to imagine that the same algorithm that optimizes use of a bee eye in homing will be useful to guide a walking human or an anti-missile missile. Each has its own needs, each requires an algorithm optimized for those needs.

Purposive vision is also active vision, in the sense that it is linked with selection of future views and integration of vision with dynamic behaviors. In addition to early and late passive vision, in this course we will consider:

• Early active vision (active vision for navigation): egomotion estimation, obstacle avoidance, visual servoing, homing.
• Late active vision (active vision for recognition and tracking): active object recognition, active tracking.