CSE668 Principles of Animate Vision Spring 2011
   

6. Vision for navigation



In this section of the course we will consider the "motion

competencies," those animate vision skills associated with

relative motion between the camera and objects in the

scene.
 

Motion competencies include:
 

    * egomotion estimation

    * obstacle avoidance

    * visual servoing

    * tracking

    * homing
 

With only about two weeks to discuss all the motion

competencies, we cannot go into much depth but rather

we will take a representative approach to each of these

problems and read the paper and describe the method.
 

CSE668 Sp2011 Peter Scott 08-01



Egomotion estimation
 

In the Fermuller-Aloimonos hierarchy, motion competencies

are the most basic and among them, egomotion estimation

comes first. There is good reason for both of these

choices.
 

Motion competencies are the most basic, they argue, because

they are the simplest and most primitive in natural vision.

Recall that the snail has a single photodetector and senses

changes in brightness as the possible shadow of a looming predator.
 

Egomotion estimation comes first because it is necessary

to separate egomotion from independent motion of objects

in the scene in order to do any other motion competency.
 

Eg: Cannot effectively avoid an obstacle unless you know

    if it is approaching you, or you are approaching it.
 

Eg: Cannot devise a good homing strategy on a moving

    object unless you know how it is moving.

CSE668 Sp2011 Peter Scott 08-02



 

One way to do egomotion estimation is to determine

the 3-D locations of several stationary feature

points in an image sequence and then apply the

Structure-from-motion Theorem (Sonka p. 510).

With as few as three projections of four points,

we can recover both the 3-D rigid displacement

vectors (shape) and the motion parameters, the

translation and rotation vectors. Since the points

are stationary, the camera motion will be the

negative of the perceived object motion.
 

Problems: need a calibrated camera, must solve the

correspondence problem for each feature point, and

must be sure that the feature points are stationary.
 

Eg: You are sitting on a train in the station and the

    train next to you starts to glide slowly backward.

    Are you moving forward or it moving backward?
 

CSE668 Sp2011 Peter Scott 08-03



 

In contrast, active vision approaches do not recover the

3-D structure of the scene in order to determine egomotion.

They are based on the apparent motion field created as the

camera moves about while fixating on scene points. Apparent

motion of a point in an image sequence is the motion of

that point measured in image coordinates.
 

Eg: If you are fixated on a scene point right along

    your egomotion translational velocity vector, and

    you are not rotating the camera as you move, the

    apparent motion field is purely radial.

    So, if you can produce radial optical flow, then the

    egomotion must be pure translation along the optical

    axis of the system, which points directly at the

    focus of expansion (FOE). If you are moving away from

    the object, the FOE becomes the focus of contraction

    (vanishing point).

      
 

CSE668 Sp2011 Peter Scott 08-04



 
 

Using the above, it's easy to estimate egomotion parameters

in world coordinates, like "I'm moving straight towards

the middle of the closest face of that rectangular object."

Note that we can't make a quantitative estimate of the

magnitude of our egomotion vector, ie. our egomotion speed,

but only its direction. We would need to know how big the

objects in our FOV are in order to determine speed as well.


 

    Eg: Situation 1: you are a distance d from an object

        O oriented with pose p and are moving directly

        towards it with a velocity vector e.

        Situation 2: you are a distance 2d from an object

        O' which is identical to O except twice as

        big, is oriented with pose p and you are moving

        directly towards it with a velocity vector 2e.

        You cannot distinguish these situations.

Note that in this example you cannot quantify speed, but

you can quantify time-to-collision, which is the same for

both situations. We will consider this in studying the

problem of obstacle avoidance.
 

CSE668 Sp2011 Peter Scott 08-05


 

Determining apparent motion  


We consider only the case of stationary objects in which

the optical axis is aligned with the egomotion velocity

vector (i.e., you are looking where you are going).

 

The fixation frame is the coordinate system with origin

at O', the point we are looking at, and the plane containing

O' and perpendicular to the optical axis is called the

fixation plane.

CSE668 Sp2011 Peter Scott 08-05a


 

Let P=(x,y,z) be a point in camera coords, and let r be its

distance from the optical axis. Define the view angle θ as

the angle at the camera origin O between the optical axis

(toward the fixation point O') and the ray O-P, measured in

the O-O'-P plane. In this plane,

        tan θ = r/z

        d(tan θ) = dθ/cos^2 θ

        d(r/z) = (-r/z^2) dz = (-(tan θ)/z) dz

Then    dθ/dz = (cos^2 θ)(-(tan θ)/z)

              = -(cos θ)(sin θ)/z = -sin(2θ)/(2z)

CSE668 Sp2011 Peter Scott 08-05b


 

Let the egomotion speed be s = -dz/dt. Then

      dθ/dz = dθ/(-s dt), so

      dθ/dt = (s/(2z)) sin(2θ)


This tells us how fast the view angle θ of a point will be

changing as we move in the direction we are looking.

Note that it depends on the ratio s/z, and on the

view angle θ itself.

CSE668 Sp2011 Peter Scott 08-05c


 

Eg: Consider a stick with endpoints P and Q (camera coords

in meters, P nearer the camera than Q), seen while walking

with speed s = 1 m/s.

For P, θ = arc tan(sqrt(x^2+y^2)/z) = .955 rad.

    dθ/dt = (s/(2z))(sin 1.91) = .472 rad/sec

For Q, θ = arc tan(sqrt(x^2+y^2)/z) = .340 rad.

    dθ/dt = (s/(2z))(sin 0.68) = 0.08 rad/sec

CSE668 Sp2011 Peter Scott 08-05d


 

So after a short period of time, the apparent angular motion

of P will be .472/.08, or about 6 times that of Q.

Why is this?

P is closer to the camera, so its apparent motion will

be greater.
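
The stick's endpoint coordinates are not reproduced here, but the numbers
above can be checked directly from dθ/dt = (s/(2z)) sin(2θ). Below is a
minimal Python sketch that does so, assuming (hypothetically) the endpoints
P = (1,1,1) and Q = (1,1,4) in meters, which are consistent with the view
angles quoted above.

    import math

    def view_angle_rate(x, y, z, s=1.0):
        """Return (theta, dtheta/dt) for a scene point (x, y, z) in camera
        coords, for egomotion speed s along the optical axis."""
        r = math.hypot(x, y)             # distance from the optical axis
        theta = math.atan2(r, z)         # view angle: tan(theta) = r/z
        return theta, (s / (2.0 * z)) * math.sin(2.0 * theta)

    # Hypothetical endpoints consistent with the numbers in the notes.
    for name, (x, y, z) in [("P", (1.0, 1.0, 1.0)), ("Q", (1.0, 1.0, 4.0))]:
        theta, rate = view_angle_rate(x, y, z, s=1.0)
        print(name, round(theta, 3), round(rate, 3))
    # P 0.955 0.471    Q 0.34 0.079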

CSE668 Sp2011 Peter Scott 08-05e


 

View angle and (u,v)

 
For uo=vo=0, there is a simple proportionality between

the radius of a point in the z-plane and the radius on

the image plane:


By similar triangles, rz/ri = z/f or ri = (f/z) rz.

If in addition there is no skew (intrinsic parameter b=0)

then the image coordinates are related to the camera

coordinates by the linear relationship

        (u,v) = (f/z) (x,y)

CSE668 Sp2011 Peter Scott 08-05f




The relationship between ri and θ is gained by noting

that tan θ = rz/z, so

        ri = f tan θ,   θ = arc tan(ri/f)

        dri/dt = f (1/cos^2 θ)(s/(2z))(2 sin θ cos θ)

               = (s f tan θ)/z

The apparent radial motion of an image point is proportional

to the egomotion speed s and to the tangent of the view

angle θ, and inversely proportional to the depth z. This

of course assumes the fixation point is stationary in world

coords, and you are looking where you are going (optical

axis and egomotion velocity vector are collinear).


Eg: For uo=vo=b=0, f=50mm, what is the image radius ri

and view angle of the scene point (x,y,z) = (100,10,3000)

(units:mm).

   (u,v) = (f/z)(x,y) = (50/3000)(100,10) = (1.67mm, .17mm)

    ri = sqrt(1.67^2 + .17^2) = 1.68mm

    θ = arc tan(1.68/50) = .0336 rad ≈ 2°.
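
Here is a minimal sketch that reproduces the arithmetic of this example
and, under the additional assumption of an egomotion speed of s = 1 m/s
(not part of the example), evaluates the radial flow dri/dt as well.

    import math

    f = 50.0                        # focal length, mm
    x, y, z = 100.0, 10.0, 3000.0   # scene point in camera coords, mm

    u, v = (f / z) * x, (f / z) * y       # image coordinates, mm
    ri = math.hypot(u, v)                 # image radius, mm
    theta = math.atan(ri / f)             # view angle, rad

    s = 1000.0                            # assumed egomotion speed, mm/s
    dri_dt = s * f * math.tan(theta) / z  # radial flow, mm/s

    print(u, v)         # 1.67  0.17
    print(ri, theta)    # 1.68 mm, 0.0336 rad (about 2 degrees)
    print(dri_dt)       # about 0.56 mm/s of outward radial motion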

CSE668 Sp2011 Peter Scott 08-05g


 

The Barth-Tsuji algorithm

Required reading: Barth and Tsuji, Egomotion determination through an intelligent gaze control strategy, IEEE Trans Systems, Man and Cybernetics, v 23 (1993), 1424-1432.

 

  This algorithm for estimating egomotion is based on the

previously stated idea that assuming pure translation,

there is no apparent motion along the optical axis when

the optical axis is aligned with the translational

velocity vector. There is no radial motion because the

tangent of the view angle is zero, and no other motion

because all apparent motion is radial in this case.

CSE668 Sp2011 Peter Scott 08-05h


 

Barth and Tsuji assume there is a single mobile camera.

The algorithm proceeds by specifying a sequence of

sudden changes of the optical axis of the system,

called saccades, which are designed to bring the optical

axis into alignment with the translational velocity

vector. Once they are in alignment, then
 

    * translational egomotion velocity vector is aligned

      with the optical axis and thus its direction is

      known;
 

    * rotational egomotion velocity vector is the negative

      of the apparent motion along the optical axis.
 

The latter follows from the fact that since there is no

translational component, all motion at the FOE must be

due to rotational movement of the camera.
 

CSE Sp2011 Peter Scott 08-06



The logic of the saccade calculation is split into


two parts: first consider which direction to saccade


in, then how large a saccade to make in that direction. 

1. Saccade direction:

Suppose at time t you are maintaining fixation at the

scene point Z. Assume that close (in image coordinates)

to Z there are scene points which are at different

distances to the camera than Z. Then in order to move

towards alignment of the optical axis with the velocity

vector, you must saccade in the opposite direction of

apparent motion of the scene points which are closer to

the camera than Z.
 
 

CSE Sp2011 Peter Scott 08-07



 

How do we know which points are closer? Remember from our

previous discussion of motion parallax that points behind

the fixation point move in the same direction as the

egomotion translation producing the motion parallax, while points

in front move in the opposite direction.

CSE Sp2011 Peter Scott 08-08



 

Why does saccading in the opposite direction of close

points work?

One way to recognize the closer points, F+, is to

note that during the maneuver the camera head must

rotate either cw or ccw to maintain fixation at Z.

Points that appear to rotate in the same direction

are called F+ and opposite, F-.
 

Another way, if camera head rotation is not known, is to

note that average velocity in F+ is higher (equal

displacement of a point nearer the camera results

in greater displacement on the image plane). This

works as long as the points in F+, F- are equally

distributed around the fixation point, or can be

guaranteed if a single point in F+ is closer than

half the distance to the fixation point.


 

CSE Sp2011 Peter Scott 08-09


 

2. Saccade magnitude:


How much do you rotate? In the paper, it is shown that

if you use the iterative rule
 

        delta(θ) = -K(|F+|-|F-|)
 

where F+ is the average apparent motion of the set of

points near the fixation point in image space which

are moving in the same direction as the camera is

rotating in order to maintain fixation on Z, and F-

the average for the opposite, then for small

displacements along the velocity vector, this is

well-approximated by
 

        delta(θ) ~ -K' sin(θ)
 

which converges to zero for suitable choice of K'.

Here θ measures the angle between optical axis and

velocity vector.
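
The direction and magnitude rules can be combined into a single update
step. Below is a minimal sketch of that step, not the paper's
implementation: it assumes we are given the apparent image-plane motions
of points tracked near the fixation point and the image direction induced
by the fixation-maintaining camera rotation, and the gain K is a tuning
choice.

    import numpy as np

    def saccade_update(motions, rot_dir, K=0.5):
        """One saccade step in the spirit of Barth-Tsuji.

        motions : (N, 2) apparent image-plane motions of points near fixation
        rot_dir : unit 2-vector, the image direction in which points appear
                  to move due to the rotation that maintains fixation
        Returns (delta_theta, unit saccade direction in the image)."""
        motions = np.asarray(motions, dtype=float)
        rot_dir = np.asarray(rot_dir, dtype=float)
        rot_dir = rot_dir / np.linalg.norm(rot_dir)

        along = motions @ rot_dir            # component along rotation sense
        f_plus = motions[along > 0]          # closer points (same sense)
        f_minus = motions[along <= 0]        # farther points (opposite sense)

        avg = lambda m: np.mean(np.linalg.norm(m, axis=1)) if len(m) else 0.0
        delta_theta = -K * (avg(f_plus) - avg(f_minus))

        # Saccade opposite the mean apparent motion of the closer points.
        mean_plus = f_plus.mean(axis=0) if len(f_plus) else rot_dir
        return delta_theta, -mean_plus / np.linalg.norm(mean_plus)

Iterating this step drives θ, the angle between the optical axis and the
translational velocity vector, toward zero, per the -K' sin(θ)
approximation above.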
 

CSE Sp2011 Peter Scott 08-10



 

Finally, as mentioned earlier, once you have alignment

(or near-alignment) between the translational velocity

vector and the optical axis, you can determine the

yaw and pitch rotational speeds by observing the apparent

rotational velocity of points along the optical axis.
 

Yaw is rotation about the vertical axis, roll is rotation

about the optical axis, and pitch about the other

horizontal axis.

CSE Sp2011 Peter Scott 08-11



 
 

Eg: Suppose that at t, we have alignment and are fixated

on a point on the optical axis. As the camera

moves in yaw (rotation about the y axis) and pitch

(rotation about the x axis) to maintain fixation, the

negative of these angular rates of change are the

rotational velocities in these directions. Note that the

rotational velocity in the roll direction (rotation about

the optical axis) cannot be determined from the fixation

point alone. However, it can easily be determined from the

apparent tangential motion of all the other points.
 

CSE Sp2011 Peter Scott 08-12



 

Note how natural this animate vision algorithm is. We

seek to look in the direction we are moving. We get

clues for where to saccade next by a sequence of fixations

during which we observe the motion parallax. Once we

are looking where we are going, we both know where we

are going (translational motion) and how our head is

twisting (rotational motion) from the view along the

foveal axis.
 

CSE Sp2011 Peter Scott 08-13



 
 
 

As a final note, let's measure the Barth-Tsuji scheme

against the active vision and animate vision paradigms.

How would it score?
 

Active vision:

    * Vision is not recovery.

    * Vision is active, ie. current camera model selected

    in light of current state and goals.

    * Vision is embodied and goal-directed.

    * Representations and algorithms minimal for task
 

Animate vision: Anthropomorphic features emphasized, incl.

    * Binocularity

    * Fovea

    * Gaze control

    * Use fixation frames and indexical reference
 

CSE668 Sp2011 Peter Scott 08-14



 

Collision avoidance 


Required reading: Kundur and Raviv, Active vision-based control schemes for autonomous navigation tasks, Pattern Recognition v 33 (2000), 295-308.

 

If the camera platform (robot, vehicle, etc.) is to move

in the real world of objects which are also moving, and

of stationary objects whose locations are not mapped in

advance, we must detect collision threats and react to

them. This behavior is called collision avoidance. An

equivalent term is obstacle avoidance. We will use the

former term, as do Kundur and Raviv (KR).
 

The recovery-based approach to collision avoidance would

be to recover the relative shape and motion parameters of

each object surface in the FOV and determine which, if any,

will collide with the camera platform or pass too close

to it for comfort. Then revise egomotion parameters to

protect against collision while maintaining task awareness.
 

CSE668 Sp2011 Peter Scott 08-15



 

The problem with this approach is that the representations

and algorithms are vastly over-complicated for the task.

We don't have to recover an accurate shape model and pose

and translational and rotational components of motion of

the person we are about to pass in the hallway; we just

have to make sure we are out of his way as we pass.
 

Details of shape and motion of threat objects are

irrelevant once they have been recognized as threat

objects and coarse estimates of their time-to-collision

have been made. For non-threat objects, shape and motion

are of no consequence (in the context of collision avoidance).
 

CSE668 Sp2011 Peter Scott 08-16



 

Recognizing threats
 

How do we accomplish obstacle avoidance when we are out

walking? We are constantly on the lookout for objects

that are fairly near to us and which we are approaching;

these represent threats. Once identified, each threat

can be given our attention, from the most threatening

on down.
 

Consider a threat map, which is a mapping from (u,v),

the current image, into R1, the degree of collision threat

associated with each pixel. How could we determine such

a useful map? One way would be to estimate the time-

to-collision tc with each visible object surface patch.

Then the inverse of tc could be the threat metric.

CSE668 Sp2011 Peter Scott 08-17



 

Threat map construction for stationary scenes
 

In the case where the only scene motion is due to egomotion,

this paper shows a strategy which starts from "looking

where we are going" and ends with the computation of

time-to-collision tc for each object pixel in the scene.

Thus we get the threat map in a straightforward fashion,

and the threat map is all you need for collision avoidance.

CSE668 Sp2011 Peter Scott 08-17a



Note that "looking where you are going" is the output

of the algorithm described by Barth and Tsuji. This

"visual behavior" can only be accomplished if you have

mastered the motion competency of qualitative egomotion

estimation. So the output of egomotion estimation is

a precondition of obstacle avoidance. This is an example

of Fermuller and Aloimonos' concept that competencies

build in a sequential fashion.

CSE668 Sp2011 Peter Scott 08-18



Let O be an object point in the scene. Consider the plane

containing O and the optical axis z of the camera. Let

the coordinates of this plane be (r,z), where z is the

optical axis and r is the radial distance of O from

the optical axis. 

By similar triangles,
 

        f/ri = zo/ro,  ri = f ro/zo

        dri/dt = d/dt (f ro/zo) = (-f ro/zo^2) dzo/dt
 

Substituting in ri,
 

        dri/dt = (ri)(-dzo/dt)/zo
 

CSE668 Sp2011 Peter Scott 08-19



 
 

Now if the egomotion velocity dzo/dt, assumed to lie

along the optical axis, is maintained, the point O

will collide with the z=0 zero-depth plane at time
 

        tc = zo/(-dzo/dt)
 

Thus the time-to-collision tc for O can be computed from

the optical flow at O's image point ri as
 

        tc = ri/(dri/dt)
 

Now if we take the inverse of tc to be the threat metric,

we can compute the threat map pixel by pixel using the

radial optical flow dri/dt,
 

        1/tc = (dri/dt)/ri
 

Thus we can do obstacle avoidance in a stationary scene

by "worrying about" objects or surface patches which are

moving rapidly away from the FOE.


Note that this can also be written


        1/tc = (dri/dt)/ri = d(ln ri)/dt


though we will not use this form in the present context.
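
Given the radial flow field (the radius ri of each pixel from the FOE and
its radial flow dri/dt), the threat map follows in one line. A minimal
sketch, assuming those two arrays are already supplied by some optical
flow estimator:

    import numpy as np

    def threat_map(ri, dri_dt, eps=1e-6):
        """Per-pixel threat metric 1/tc = (dri/dt)/ri for a stationary
        scene viewed while translating along the optical axis.

        ri     : image radius of each pixel measured from the FOE (pixels)
        dri_dt : radial flow away from the FOE (pixels/s)
        Returns 1/tc (1/s); larger values mean a more imminent collision."""
        ri = np.maximum(np.asarray(ri, dtype=float), eps)  # guard the FOE
        return np.asarray(dri_dt, dtype=float) / ri

    # Eg: a surface patch 40 pixels from the FOE flowing outward at
    # 8 pixels/s has 1/tc = 0.2 /s, ie. time-to-collision tc = 5 s.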

CSE668 Sp2011 Peter Scott 08-20



Object avoidance based on the visual threat cue (VTC)
 

A few equations above we wrote

        dri/dt = (ri)(-dzo/dt)/zo

Dividing by ri,

       (dri/dt)/ri = (-dzo/dt)/zo

So 1/tc we computed above can be rewritten
 

        1/tc = (-dR/dt)/R
 

where (changing notation to follow the KR paper)

R is the depth range of the object point. Note: ri 

is the radius in the image plane at which the object

point is seen, R is the z-coordinate value of the object

point in camera coordinates.


The authors next modify this 1/tc threat metric in the

following two ways. First, let's define a desired minimum

clearance z=Ro. This corresponds to a "personal space" or

"no-fly zone" we wish to maintain in front of the camera

coordinate origin to keep the camera platform safe from

an unexpected maneuver. Then the time at which the object

will penetrate this zone is

           to = (R-Ro)/(-dR/dt)


So to is the time-to-minimum-clearance.

CSE668 Sp2011 Peter Scott 08-21



Second, let's scale the resulting threat metric

 

        1/to = (-dR/dt)/(R-Ro)
 

by Ro/R, which measures how close the threat object is to

the desired minimum clearance. Time-to-minimum-clearance

to assumes that the object will continue on its current

course. But if it is close, it might speed up or head

right at us, so we should take its proximity into

account also. Combining the time-to-minimum-clearance

threat factor and the proximity-threat factor, we get

the Kundur-Raviv (KR) Visual Threat Cue (VTC)
 

        VTC = (Ro/R)((-dR/dt)/(R-Ro))
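
A minimal sketch of the VTC exactly as written above. In practice R and
dR/dt are not recovered (that is the point of the IQM proxy below), so
treat this as a restatement of the formula, with illustrative numbers.

    def visual_threat_cue(R, dR_dt, Ro):
        """Kundur-Raviv VTC = (Ro/R) * (-dR/dt) / (R - Ro), for R > Ro.

        R     : current range to the object point
        dR_dt : range rate (negative when closing)
        Ro    : desired minimum clearance"""
        return (Ro / R) * (-dR_dt) / (R - Ro)

    # Eg: closing at 2 m/s on a point 6 m away with a 1 m clearance zone:
    # to = (6-1)/2 = 2.5 s, and VTC = (1/6)(1/2.5) = 0.067 per second.
    print(visual_threat_cue(R=6.0, dR_dt=-2.0, Ro=1.0))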

CSE668 Sp2011 Peter Scott 08-22



 

If we could construct a threat map using the VTC, we could

use it as above. But it involves the range R to the object.

If we need to compute that from the image, this is another

recovery-based method, not a true active vision method.
 

To get around that, KR define an apparently

unrelated quantity which can be computed directly from

an image without recovery: the Image Quality Metric (IQM),

a mapping from a specified region of the field of view

to the reals, computed as follows:
 

IQM: Image region Rho -> Real number
 

    1. define an image region Rho;

    2. define a window W;

    3. for each point (x,y) in Rho, sum absolute

       differences in gray levels between (x,y) and all

       other points in W(x,y);

    4. sum the results of 3. over all (x,y) in Rho;

    5. normalize by the total number of terms being summed.
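
The IQM can be written down directly from steps 1-5. A minimal sketch,
assuming a grayscale image stored as a 2-D array and a square window of
half-width w (the paper's exact window shape and normalization may differ):

    import numpy as np

    def iqm(image, region, w=2):
        """Image Quality Metric over a region given as (row, col) pixels.

        For each pixel in the region, sum the absolute gray-level
        differences between that pixel and every pixel in its
        (2w+1)x(2w+1) window, then normalize by the number of terms."""
        img = np.asarray(image, dtype=float)
        H, W = img.shape
        total, count = 0.0, 0
        for (r, c) in region:
            win = img[max(0, r - w):min(H, r + w + 1),
                      max(0, c - w):min(W, c + w + 1)]
            total += np.abs(win - img[r, c]).sum()
            count += win.size
        return total / count if count else 0.0

The d(IQM)/dt used below is then just the difference of iqm(...) values
computed over the same (tracked) region in successive frames, divided by
the frame interval.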
 
 

CSE668 Sp2011 Peter Scott 08-23



 
    So the IQM is a measure of how variable the image is over

spatial scales limited to the chosen window size. This

measure is averaged over a selected image region Rho. We can

think of the IQM as an indicator of how sharply in focus the

object segment within the region Rho is.
 

      KR claim: d(IQM)/dt is a proxy for the VTC.
 

By a proxy, we mean it can be used to "stand for" the VTC,

ie. it is a reasonably accurate model of the VTC. Clearly,

as an object gets closer you can see its textured surface

better, and thus the IQM increases. As the old saying goes,

 "Don't fire until you can see the whites of their eyes."


In other papers referenced in the KR paper,

they give some empirical evidence for this claim.
 

CSE668 Sp2011 Peter Scott 08-24



 
 
 

The basic action of their obstacle avoidance algorithm

is to segment out a candidate image region and measure

d(IQM)/dt with Rho = segment. Based on prior training

(fuzzy system, neural network, etc.) a control scheme

reacts to the proxy VTC by stopping, moving away, etc.
 

How good an active-animate method is the KR algorithm?

That remains to be seen as they extend it to more

natural environments (see paper). But other than employing

indexical reference and being active, it is suspect on all

other grounds (goal-seeking, embodied, binocular, foveal,

gaze control).
 

CSE668 Sp2011 Peter Scott 08-25


Visual servoing 


Required reading: [15] Vassallo et al, Visual servoing and appearance for navigation, Robotics and Autonomous Systems, v 31 (2000), 87-97.

 

In control theory, a servomechanism is a system including

a controlled plant with measurable outputs, a controller,

and a feedback loop between them, which controls the

plant outputs to follow a desired or reference trajectory.

Servoing is the process of following that trajectory.
 


Visual servoing is the control of egomotion guided by

visual image sequences.
 

CSE668 Sp2011 Peter Scott 08-26


Examples of visual servoing:
 

    * Moving the end-effector of a robot so that

      it can pick up a machine part on a conveyor belt

      (robotic hand-eye coordination).
 

    * Pursuit and interception of an incoming missile

      by an anti-missile missile.
 

    * Guiding a welding robot to lay the weld along

      a visible seam between sheet metal parts.
 

    * Maintaining an autonomous vehicle in the center

      of the right lane of a highway even as the road

      curves and turns, guided by visual data.
 

    * Visual navigation along a desired path, for

      instance, following a corridor.
 

CSE668 Sp2011 Peter Scott 08-27



 

Visual servoing is a class of motion competencies, built

on egomotion estimation, that are essential for autonomous

movement using vision sensors.
 

Note that visual servoing is often combined with other

visual competencies in developing useful behaviors. For

instance, we can home to a point adjacent to a conveyor

belt, then visually track a part moving along the belt,

then perform hand-eye coordination to pick it up, then

home to a parts bin, then do hand-eye to position the

hand above the bin, then release the part. Along the

way, we have done frequent egomotion estimation, homing,

and perhaps obstacle avoidance as well as visual servoing.

They are the motion competencies necessary to accomplish

this vision-supported behavior.
 

CSE668 Sp2011 Peter Scott 08-28



 

Vassallo et al develop a nice example of visual servoing

embedded within a larger visual navigation system in

their  paper "Visual servoing and appearance for

navigation" (see reading list for full citation).
 

Their task scenario: A robot must be programmed to find

and enter selected offices and rooms in a one-floor

office space.
 

Design option 1: No visual servoing (no camera needed).

Memorize the floorplan. Then from a known initial location

and heading, one can compute the full sequence of control

signals to get you wherever you want to go "with your

eyes closed."
 

Design option 2: Full visual servoing using a camera and

appearance-based control. Offline training: for each

possible position of the robot and orientation of the

optical axis, store an image. Then each time a new image

is registered, search the library of images for best match.

Turn and move the robot based on its current position,

orientation and the desired destination.
 
 

CSE668 Sp2011 Peter Scott 08-29



Neither of these proves a satisfactory design option. Problems:
 

DO1. Small initial condition errors will produce increasing

navigational errors as you proceed. Also, cannot react

to obstacles, determine closed/open doors, etc.
 

DO2. Large set of images to be stored and at each frame

input, searched for best match. So, high computational

complexity and large storage requirements.
 

Vassallo et al propose a compromise (DO3) in which we store

a small set of indexed landmark images. When we start

the robot, it performs a visual servoing behavior we will

call "corridor-following." As each new image comes in,

we compare it with a single image, the first landmark

image in the route. When we arrive at the first

landmark image location, we update our location

data, replace the image with the next indexed landmark

image and proceed. This is landmark-based navigation.
 
 

CSE668 Sp2011 Peter Scott 08-30



 

So DO3 combines pure visual servoing during corridor-

following with appearance-based (view-based localization)

navigation at a sparse set of landmarks.
 

The landmark-image matching is done by straight image

correlation. How is the corridor-following done?
 

Servoing to the vanishing point
 

We can follow the corridor by locating the ground plane

vanishing point and turning the robot to home in on it.

                     Schematic of an image of the corridor
 
 

CSE668 Sp2011 Peter Scott 08-31



 
 

              Vanishing point is the intersection of the corridor-ground plane edges.

CSE668 Sp2011 Peter Scott 08-32



 
 

To find the corridor-ground plane edges:

    1. Use Sobel operators to find pixels with both

       significant horizontal and vertical gradients;

    2. Separate into positive and negative gradient

       orientation pixel sets (left and right edges).

    3. Erode with diagonal line mask to remove pixels

       not on diagonal lines.

    4. Find the positive and negative straight lines

       with the largest number of pixel "hits."
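
A simplified sketch of these steps: least-squares line fits over the two
gradient-orientation sets stand in for the erosion-and-voting of steps 3
and 4, and the gradient threshold is an arbitrary choice.

    import numpy as np

    def corridor_vanishing_point(gray, thresh=30.0):
        """Estimate the corridor vanishing point (u, v) in image coords:
        keep pixels with strong horizontal AND vertical gradients, split
        them by gradient-orientation sign (left vs right corridor edge),
        fit a line to each set, and intersect the two lines."""
        gy, gx = np.gradient(np.asarray(gray, dtype=float))
        strong = (np.abs(gx) > thresh) & (np.abs(gy) > thresh)

        lines = []
        for sign in (+1, -1):                   # the two edge orientations
            ys, xs = np.nonzero(strong & (np.sign(gx * gy) == sign))
            if len(xs) < 2:
                return None                     # one edge not found
            lines.append(np.polyfit(xs, ys, 1)) # least-squares y = a*x + b

        (a1, b1), (a2, b2) = lines
        if np.isclose(a1, a2):
            return None                         # edges parallel in the image
        u = (b2 - b1) / (a1 - a2)               # intersection of the lines
        return u, a1 * u + b1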
 

CSE668 Sp2011 Peter Scott 08-33



 

In the paper it is shown that the horizontal image

coordinate of the vanishing point ixv satisfies (eqn 7)
 

           
 

where Su is an intrinsic camera model parameter and

cp is the cosine of the pitch (downlooking angle). θ

is the deviation between the robot heading and the

direction of the corridor. So controlling ixv to zero

(centering the vanishing point in the image) is the

same as driving θ to zero (orienting the robot

to follow the corridor).
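
Since centering the vanishing point is equivalent to zeroing θ, the
corridor-following controller can be as simple as proportional feedback
on ixv. A minimal sketch; the gain k, the saturation limit, and the sign
convention are choices made here, not values from the paper.

    def steering_command(ixv, k=0.01, max_rate=0.5):
        """Turn-rate command (rad/s) that drives the vanishing point to
        the image center; ixv is its horizontal coordinate measured from
        the center (pixels)."""
        omega = -k * ixv                              # proportional feedback
        return max(-max_rate, min(max_rate, omega))   # saturate the command

    # Eg: vanishing point 80 pixels off-center -> saturated 0.5 rad/s turn
    # back toward it (which way that is depends on the frame conventions).
    print(steering_command(80.0))    # -0.5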
 

CSE668 Sp2011 Peter Scott 08-34


Visual homing 


Required reading: [16] Nelson, From visual homing to object recognition, in Aloimonos (ed.) Visual Navigation, Lawrence Erlbaum Associates, 1997.

 

As was pointed out in the Vassallo paper, their scheme

combining visual servoing and landmark memory can form the

basis for visual homing, movement to a desired location

using visual data.
 

Visual homing is important in several applications

contexts:
 

    * Robotic tool positioning

    * Robotic or autonomous vehicle navigation

    * Material handling

    * Object set manipulation, eg. assembly tasks
 

We will study an approach to visual homing presented in

the paper by Randal Nelson, "From visual homing to object

recognition" (see above for full citation). As the

title implies, Nelson's approach has implications for

active vision approaches to the shape competencies whose

review will form the last part of this course.
 

CSE668 Sp2011 Peter Scott 08-35



 

The central questions in visual homing are: given

the goal (homing to a specific location),
 

    * What is my current state (location and pose)?
 

    * Given my current state, what motion control

      actions should I take?
 

The second question is a control engineering question

tied to the mechanics of the mobile platform and is

outside the scope of this course. We will simply assume

that the motion control problem can be solved, and

concentrate our attention on using visual data for

state determination.
 

CSE668 Sp2011 Peter Scott 08-36



 

Visual associative memory
 

State determination can be done based on visual recovery.

This passive approach is inefficient, however. Nelson

advocates the use of a pattern-recognition based procedure

called a visual associative memory. In general, an

associative memory is a memory addressed by partial

content.
 

Eg:
 

    Phone number 645-23?? -> 645-2394

Eg:

    Who do I know with a big nose and blond hair? That

    must be Bob.
 

A visual associative memory is a store of many 2-D images,

and given a new image, addresses the most similar image.
 

CSE668 Sp2011 Peter Scott 08-37



 

Let Ak be the kth feature of a pattern, with Ak = {ak},

where each ak is a definite value for that feature. Define

a pattern as an m-tuple P=(a1,...,am) from the pattern

space A=A1xA2x .. Am (x is Cartesian product). So a

pattern is a tuple of feature values from the pattern space.
 

Define a similarity function s: AxA -> R (R is the set

of real numbers) which compares patterns. There are many

such mappings which can be used as similarity functions,

for instance the norm-induced correlation similarity

measures similarity between Px and Py as
 

    s1(Px,Py) = ||Px-Py||/sqrt(||Px||*||Py||)
 

where ||*|| is a selected norm over A.
 
 

CSE668 Sp2011 Peter Scott 08-38



 
 

For his visual associative memory, Nelson uses the

similarity measure
 

            s(Px,Py) = 1 - d(Px,Py)/m
 

where d is the cardinality of the set of features that

are not identical, and m is the dimension of the pattern

space. d is sometimes called the "Hamming Distance."

 

Eg: Px = (0.384,2,green,0,to the left  of);

    Py = (0.383,2,green,0,to the right of)
 

    s(Px,Py) = 1 - 2/5 = 3/5
 

So s measures the fraction of features that are identical. 


 Nelson selects this similarity measure because it can be

used for categorical variables, those for which there are

only labels without requiring any ordering function, as

well as numerical variables.
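
This similarity measure is one line of code. A minimal sketch that
reproduces the example above:

    def similarity(px, py):
        """s(Px, Py) = 1 - (number of differing features)/m. Works for
        categorical and numerical features alike, since it only tests
        feature values for equality."""
        assert len(px) == len(py)
        d = sum(1 for a, b in zip(px, py) if a != b)   # Hamming distance
        return 1.0 - d / len(px)

    px = (0.384, 2, "green", 0, "to the left of")
    py = (0.383, 2, "green", 0, "to the right of")
    print(similarity(px, py))    # 0.6, ie. 3/5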
 

CSE668 Sp2011 Peter Scott 08-39



 

Suppose we have n patterns M={P} stored in our associative

memory. Then when presented with an input (index pattern)

Px, the associative memory's output will be the Py in M

which best matches Px
 

    max_M s(Px,P) = s(Px,Py)
 

if the similarity exceeds a threshold t
 

    s(Px,Py) >= t

If the best match does not exceed t, we return "no match."
 

CSE668 Sp2011 Peter Scott 08-40



 

Here is how we will use an associative memory for visual

homing.
 

    1. Store a set M of n reference patterns (camera views

       mapped down to their feature values) taken at

       known states (locations and poses). Label one or

       more of them as goal states.

    2. For each reference pattern, store the value of the

       desired control action from that state.

    3. For each new incoming image, map to its feature set

       and pass that feature set through the associative

       memory.

            a. If there is a matched pattern,

               perform the reference pattern control

               action.

            b. If there is no match, randomly change state

               and go to 3.

    4. Repeat 3 until the goal state has been reached.
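
Pulling steps 1-4 together, the homing loop itself is short. Everything
below is a sketch: capture, extract_features, act, and random_move are
stand-ins for whatever camera interface, feature extractor, and platform
controls are actually used, and the threshold t is a design choice.

    def home(memory, goal_labels, capture, extract_features, act,
             random_move, similarity, t=0.5, max_steps=1000):
        """Associative-memory homing loop.

        memory      : list of (label, reference_pattern, control_action)
        goal_labels : set of labels designated as goal states"""
        for _ in range(max_steps):
            px = extract_features(capture())
            # Best match in the associative memory (steps 3a/3b).
            scored = [(similarity(px, ref), label, action)
                      for (label, ref, action) in memory]
            score, label, action = max(scored, key=lambda s: s[0])
            if score < t:
                random_move()          # no match: change state randomly
            elif label in goal_labels:
                return True            # goal state reached
            else:
                act(action)            # perform the stored control action
        return False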
 
 

CSE668 Sp2011 Peter Scott 08-41



 

Eg: Want to home to a point X in front of the cubic object

with the camera optical axis perpendicular to the

closest face. Assume camera can translate horizontally

(not vertically) and rotate in yaw and pitch but not roll.

These are reasonable DOFs for a mobile ground plane robot.

Pick a sampling grid on the ground plane and a sampling

grid in pose space (camera yaw x pitch). For each point

in this 4-D discrete state space, take an image. Define a

control action for each point as "take a step towards the

point X, and orient towards the center of any visible

object."
 
 

CSE668 Sp2011 Peter Scott 08-42



 

We next need to specify the visible features of this scene

that define the pattern space. There are many ways to do

this; Nelson selects a vertex feature scheme.


Here is how a binarized version of his vertex-finder

algorithm works.


Divide the image space into a KvxKh grid of cells. For a

given image, use a vertex-finder algorithm to determine

the cells in which vertices are determined to be

visible. Assign cell (i,j) to have a feature value a=1

if there is a vertex there, otherwise a=0. The pattern

corresponding to the goal state X will look like
 

                000000000000000000000000000
                000000000000000000000000000
                000000000100000100000000000
                000000000000000000000000000
                000000000000000000000000000
                000000000000000000000000000
                000000000100000100000000000
                000000000000000000000000000
                000000000000000000000000000

If we are at a position like Y, the index pattern would be
 
 

                000000000000000000000000000
                000000000000000000000000000
                000000000001010010000000000
                000000000000000000000000000
                000000000000000000000000000
                000010000000000010000000000
                000000000001010000000000000
                000000000000000000000000000
                000000000000000000000000000

and, if we have a reference pattern from approximately

that same pose and location, it would have similarity 1.

The action attached to that pattern would be something

like "move one step in a direction 30 degrees ccw from

optical axis and orient the camera to the centroid of

the visible vertices."


In this example we used a binary feature space. But in

general some or all features could be numerical.

 

 

CSE668 Sp2011 Peter Scott 08-43



In order for this scheme to work well, two conditions

should be satisfied:
 

1. Match existence condition: for each x in C, the

   space of competence, there is a y in the set R

   of reference patterns such that s(Px,Py)>=t. In

   other words, there is at least one match.
 

2. Match uniqueness condition: for all x and y,

   s(Px,Py)>=t implies ||x-y||<r(x). r(x) is the

   desired accuracy of the state determination. In

   other words, there is at most one match.
 
 

If these conditions are met, then we will never have

to pick a state randomly, nor confuse two states which

are far apart. If they are not met, this algorithm may

never converge. For instance if no y in R is visible

from any point we might randomly move to, we will keep

making random viewpoint changes forever.


Note that ideally both these conditions are met, but

even if they are not, the algorithm may still converge.

 
 

CSE668 Sp2011 Peter Scott 08-44



 

Nelson defines his vertex-finder algorithm as follows.

Divide the image into a GxG grid. Using an edge operator,

find the cells that have strong edges in them, and the

dominant edge direction in each such strong-edge grid cell.

Quantize that direction into d values, eg. d=8 for

D={N,NE,E,SE,S,SW,W,NW}. Then the pattern for an image

consists of Px = (a1,a2,...,aG^2) where Ak = D U {0}, U is

the union operator and 0 is the null edge direction.
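
A minimal sketch of this feature extractor, assuming a grayscale image,
G=5 and d=8 as in the Tinytown test below, a simple gradient-magnitude
threshold to decide which cells hold a strong edge, and the direction at
the strongest-gradient pixel standing in for the cell's dominant edge
direction (all of these are implementation choices, not from the paper).

    import numpy as np

    def edge_direction_pattern(gray, G=5, d=8, thresh=30.0):
        """Return a G*G tuple of quantized edge directions in 1..d,
        with 0 meaning no strong edge in that cell."""
        img = np.asarray(gray, dtype=float)
        gy, gx = np.gradient(img)
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx)

        H, W = img.shape
        pattern = []
        for i in range(G):
            for j in range(G):
                r0, r1 = i * H // G, (i + 1) * H // G
                c0, c1 = j * W // G, (j + 1) * W // G
                m = mag[r0:r1, c0:c1]
                if m.max() < thresh:
                    pattern.append(0)            # null edge direction
                else:
                    a = ang[r0:r1, c0:c1][np.unravel_index(m.argmax(), m.shape)]
                    pattern.append(1 + int((a + np.pi) / (2 * np.pi) * d) % d)
        return tuple(pattern)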
 
 

CSE668 Sp2011 Peter Scott 08-45



 

In implementing such a system, a basic design question

is how many reference patterns M must be stored in order

to satisfy the two match conditions above. By setting G

and/or t low, we can reduce card(M) and still satisfy the match

existence condition throughout C, but the lower G or t is

set, the more likely the match uniqueness condition will

be violated somewhere in C.
 

By setting G or t high, we are less likely to violate

the uniqueness condition but at the expense of the

cardinality of M.
 

CSE668 Sp2011 Peter Scott 08-46



 

Define the recognition neighborhood R(y) of a reference

pattern Py as the largest connected set of similar points

containing y (similar points are pairs (x,y) for which

s(Px,Py)>=t, ie. points that would be labelled by the

same reference pattern). As long as U R(y) = C, where the

union is taken over all reference patterns M={Py}, the match

existence condition will be met throughout C.
 

CSE668 Sp2011 Peter Scott 08-47


This is a 2-D example of a design satisfying the existence

condition but not the uniqueness condition. For instance,

points near x are similar to several reference points,

some of which are far apart (eg. y2, y9). To satisfy the

uniqueness condition, we need

        ||yi-yj||>r => R(yi) INT R(yj) = null.


 
 

CSE668 Sp2011 Peter Scott 08-48



In equation (9) in the paper, Nelson estimates the

spurious match probability ps in terms of the volume

in pattern space of the recognition neighborhood:
 

        ps ~ pf vtot/vrn

where pf is the far-match probability (probability

that two points far apart picked randomly will match

d(x,y)>=r(x), s(Px,Py)>=t), vrn is the average volume

of a recognition neighborhood, and vtot the total

volume of the domain of competence C.
 

Roughly, vtot/vrn measures the number of far recognition

neighborhoods a point x might find a spurious match in

at random, and pf the probability of a far-match there.
 
 

CSE668 Sp2011 Peter Scott 08-49



Eg: Consider a building divided into N identical rooms,

    and we must home into the center of one selected room.

    Then pf is proportional to N, and thus so is ps.
 

Nelson performed some tests of his ideas using a 2-D

motion space (x-y translation, no rotation) corresponding

to "flying" above "Tinytown" with a downlooking camera.

He set G to 5, t to 0.5 and took 120 reference images

over a region of competency of roughly four "camerasfull"

of the Tinytown model. Eight edge directions were used.
 

CSE668 Sp2011 Peter Scott 08-50



 

He defined a homing action, a step direction to home

correctly (move towards the goal state) for each

reference image, and tested to see if goal states would

be reached.
 

Tests were done by defining a sampling grid on C with

4x as many points as reference points, so test points

were separated by half a reference point separation.

For most tested points in C, there was a match and the

reference pattern matched was the correct one, ie.

a non-spurious match which therefore pointed in the right

direction (towards the goal state). For some tested

points there was no match, and for one there was a

spurious match. This means the homing sequence would

always work, as long as randomized moves were made

in the absence of a match.
 

CSE668 Sp2011 Peter Scott 08-51



Comments:

    * Small 2-D test (4 camerasfull, 5x5 grid). Beware

      of combinatorial explosion when scaling toy problems up!

    * Other features can be used, including features not

      limited to grid cell definitions.

    * Can distribute reference points more densely as

      the goal state is approached for fine motor

      control (eg. docking).

    * Same associative memory principle can be used for

      object recognition.
 

CSE668 Sp2011 Peter Scott 08-52