08

Vision for navigation

In this section of the course we will consider the "motion competencies," those animate vision skills associated with relative motion between the camera and objects in the scene.

Motion competencies include:

egomotion estimation
obstacle avoidance
visual servoing
tracking
homing

We cannot go into much depth but rather we will take a representative approach to each of these problems and read the paper and describe the method.

Egomotion estimation

In the Fermuller-Aloimonos hierarchy, motion competencies are the most basic and among them, egomotion estimation comes first. There is good reason for both of these choices.

Motion competencies are the most basic, they argue, because they are the simplest and most primitive in natural vision. The snail has a single photodetector and senses changes in brightness as possible shadow of a predator looming.

Egomotion estimation comes first because it is necessary to separate egomotion from independent motion of objects in the scene in order to do any other motion competency.

E.g.: Cannot effectively avoid an obstacle unless you know if it is approaching you, or you approaching it.
E.g.: Cannot devise a good homing strategy on a moving object unless you know how it is moving.

One way to do egomotion estimation is to determine the 3-D locations of several stationary feature points in an image sequence and then apply the Structure-from-motion Theorem (Sonka p. 510). With as little as three projections of four points, we can recover both the 3-D rigid displacement vectors (shape) and the motion parameters, the translation and rotation vectors. Since the points are stationary, the camera motion will be the negative of the perceived object motion.

Problems: need a calibrated camera, must solve the correspondence problem for each feature point, and must be sure that the feature points are stationary.

E.g.: You are sitting on a train in the station and the train next to you starts to glide slowly backward. Are you moving forward or it moving backward?

In contrast, active vision approaches do not recover the 3-D structure of the scene in order to determine egomotion.

They are based on the apparent motion field created as the camera moves about while fixating on scene points.

Apparent motion: motion of a point in an image sequence measured in image coordinates.

E.g.: If you are fixated on an scene point right along your egomotion translational velocity vector, and you are not rotating the camera as you move, the apparent motion field is purely radial.

So, if you can produce radial optical flow, then the egomotion must be pure translational along the optical axis of the system, which points directly at the focus of expansion (FOE).If you are moving away from the object, the FOE becomes the focus of contraction (vanishing point).

Using the above, its easy to estimate egomotion parameters in world coordinates, like "I'm moving straight towards the middle of the closest face of that rectangular object."

Note that we can't make a quantitative estimate of the magnitude of our egomotion vector, i.e. our egomotion speed, but only its direction. We would need to know how big the objects in our FOV are in order to determine speed as well.

E.g.: Situation 1: you are a distance d from an object O oriented with pose p and are moving directly towards it with a velocity vector e. Situation 2: you are a distance 2d from an object O' which is identical to O except twice as big, is oriented with pose p and you are moving directly towards it with a velocity vector 2e. You cannot distinguish these situations.

Note that in this example you cannot quantify speed, but you can quantify time-to-collision, which is the same for both situations. We will consider this in studying the problem of obstacle avoidance.

Determining apparent motion

We consider only the case of stationary objects in which the optical axis is aligned with the egomotion velocity vector (i.e., you are looking where you are going).

Fixation frame: coordinate system shown with origin at O', the point we are looking at
Fixation plane: plane containing O' and perpendicular to the optical axis

Let P=(x,y,z) be a point in camera coords. Define the view angle θ as the plane angle shown below in the O-O'-P plane. In this plane,

$\tan θ = r/z$

$d(\tan θ) = d θ /(\cos^2 θ)$

$d(r/z) = -r/z^2 dz = -\tan θ /z dz$

Then

$dθ/dz = (\cos ^2 θ)(-\tan θ /z) = (-\cos θ)(\sin θ)/z = -\sin(2 θ)/(2z)$

Let the egomotion speed be s = -dz/dt. Then

$d θ /dz = -d θ /(s dt) $

and

$d θ /dt = (s/(2z))(\sin(2 θ))$

This tells us how fast the view angle θ of a point will be changing as we move in the direction we are looking. Note that it depends on the ratio s/z, and on the view angle θ itself.

E.g.: The image of a stick (camera coords of endpoints shown) which we see while walking with speed 1 m/s is (units: m)

For P,

$θ = \arctan \sqrt{x^2+y^2}$ = .955 rad.

$d θ /dt = (s/2z)(\sin 1.91)$ = .472 rad/sec

For Q,

$θ = \arctan \sqrt{x^2+y^2}$ = .340 rad.

$d θ /dt = (s/2z)(\sin 0.68)$ = 0.08 rad/sec$

So after a short period of time, the image will be

The apparent motion of P in the image plane will be .472/.08, or about 6 times that of Q. Why is this? P is closer to the camera, so its apparent motion will be greater.

View angle and (u,v)

For $u_0=v_0=0$, there is a simple proportionality between the radius of a point in the z-plane and the radius on the image plane:

By similar triangles, $r_z/r_i = z/f$ or $r_i = (f/z) r_z$.

If in addition there is no skew (intrinsic parameter b=0) then the image coordinates are related to the camera coordinates by the linear relationship (u,v) = (f/z) (x,y)

The relationship between $r_i$ and θ is gained by noting that $ \tan θ = r_z/z$, so $r_i = f\tan θ$

$θ = \arctan r_i/f$

$dr_i/dt = f (1/\cos^2 θ)(s/2z)(2\sin θ \cos θ) = (s f\tan θ )/z$

The apparent radial motion of an image point is proportional to the egomotion velocity s and to the tangent of the view angle θ. This of course assumes the fixation point is stationary in world coords, and you are looking where you are going (optical axis and egomotion velocity vector are collinear).

E.g.: For $u_0=v_0=b=0$, f=50mm, what is the image radius $r_i$ and view angle of the scene point (x,y,z) = (100,10,3000) (units:mm). (u,v)=(f/z)(x,y)= (50/3000)(100,10)=(1.67mm,.17mm)

$r_i = \sqrt{1.67^2+.17^2}$ = 1.68mm

$θ = \arctan 1.68/50$ = .0336 rad = 2^o.

The Barth-Tsuji algorithm

Reading: Barth and Tsuji, Egomotion determination through an intelligent gaze control strategy, IEEE Trans Systems, Man and Cybernetics, v 23 (1993), 1424-1432

This algorithm for estimating egomotion is based on the previously stated idea that assuming pure translation, there is no apparent motion along the optical axis when the optical axis is aligned with the translational velocity vector.

There is no radial motion because the tangent of the view angle is zero, and no other motion because all apparent motion is radial in this case.

Barth and Tsuji assume there is a single mobile camera The algorithm proceeds by specifying a sequence of sudden changes of the optical axis of the system, called saccades, which are designed to bring the optical axis into alignment with the translational velocity vector. Once they are in alignment, then

translational egomotion velocity vector is aligned with the optical axis and thus its direction is known;

rotational egomotion velocity vector is the negative of the apparent motion along the optical axis.

The latter follows from the fact that since there is no translational component, all motion at the FOE must be due to rotational movement of the camera.

The logic of the saccade calculation is split into two parts: first consider which direction to saccade in, then how large a saccade to make in that direction.

1. Saccade direction:

Suppose at time t you are maintaining fixation at the scene point Z. Assume that close (in image coordinates) to Z there are scene points which are at different distances to the camera than Z. Then in order to move towards alignment of the optical axis with the velocity vector, you must saccade in the opposite direction of apparent motion of the scene points which are closer to the camera than Z.

How do we know which points are closer? Remember from our previous discussion of motion parallax that points behind the fixation point move in the same direction as the egomotion translation producing the motion parallax, points in front move in the opposite direction.

Why does saccading in opposite direction of close points work?

One way to recognize the closer points, F+, is to note that during the maneuver the camera head must rotate either cw or ccw to maintain fixation at Z. Points that appear to rotate in the same direction are called F+ and opposite, F-.

Another, if camera head rotation is not known, is to note that average velocity in F+ is higher (equal displacement of a point nearer the camera results in greater displacement on the image plane). This works as long as the points in F+, F- are equally distributed around the fixation point, or can be guaranteed if a single point in F+ is closer than half the distance to the fixation point.

2. Saccade magnitude:

How much do you rotate? In the paper, it is shown that if you use the iterative rule

$ \delta θ = -K(|F+|-|F-|)$

where F+ is the average apparent motion of the set of points near the fixation point in image space which are moving in the same direction as the camera is rotating in order to maintain fixation on Z, and F- the average for the opposite, then for small displacements along the velocity vector, this is well-approximated by

$\delta θ \approx -K' \sin(θ)$

which converges to zero for suitable choice of K'. Here θ measures the angle between optical axis and velocity vector.

Finally, as mentioned earlier, once you have alignment (or near-alignment) between the translational velocity vector and the optical axis, you can determine the yaw and pitch rotational speeds by observing the apparent rotational velocity of points along the optical axis.

Yaw is rotation about the vertical axis, roll is rotation about the optical axis, and pitch about the other horizontal axis

E.g.: Suppose that at t, we have alignment and are fixated to a point on the optical axis. As the camera moves in yaw (rotation about the y axis) and pitch (rotation about the x axis) to maintain fixation, the negative of these angular rates of change are the rotational velocities in these directions. Note that the rotational velocity in the roll direction (rotation about the optical axis) cannot be determined. However, this can be easily determined from the apparent tangential motion of all other points.

Note how natural this animate vision algorithm is. We seek to look in the direction we are moving. We get clues for where to saccade next by a sequence of fixations during which we observe the motion parallax. Once we are looking where we are going, we both know where we are going (translational motion) and how our head is twisting (rotational motion) from the view along the foveal axis.

As a final note, lets measure the Barth-Tsuji scheme against the active vision and animate vision paradigms. How would it score?

Active vision:
- Vision is not recovery.
- Vision is active, ie. current camera model selected in light of current state and goals.
- Vision is embodied and goal-directed.
- Representations and algorithms minimal for task
Animate vision: Anthropomorphic features emphasized, incl.
- Binocularity
- Fovea
- Gaze control
- Use fixation frames and indexical reference

Collision avoidance

Reading: Kundur and Raviv, Active vision-based control schemes for autonomous navigation tasks, Pattern Recognition v 33 (2000), 295-308

If the camera platform (robot, vehicle, etc.) is to move in the real world of objects which are also moving, and of stationary objects whose locations are not mapped in advance, we must detect collision threats and react to them. This behavior is called collision avoidance.

An equivalent term is obstacle avoidance. We will use the former term, as do Kundur and Raviv (KR).

The recovery-based approach to collision avoidance would be to recover the relative shape and motion parameters of each object surface in the FOV and determine which, if any, will collide with the camera platform or pass too close to it for comfort. Then revise egomotion parameters to protect against collision while maintaining task awareness.

The problem with this approach is that the representations and algorithms are vastly over-complicated for the task. We don't have to recover an accurate shape model and pose and translational and rotational components of motion of the person we are about to pass in the hallway, we just have to make sure we are out of his way as we pass.

Details of shape and motion of threat objects are irrelevant once they have been recognized as threat objects and coarse estimates of their time-to-collision have been made. For non-threat objects, shape and motion are of no consequence (in the context of collision avoidance).

Recognizing threats

How do we accomplish obstacle avoidance when we are out walking? We are constantly on the lookout for objects that are pretty near to us and which we are approaching, these represent threats. Once identified, each threat can be given our attention, from the most threatening on down.

Consider a threat map, which is a mapping from (u,v), the current image, into R^1, the degree of collision threat associated with each pixel. How could we determine such a useful map? One way would be to estimate the time- to-collision $t_c$ with each visible object surface patch.

Then the inverse of $t_c$ could be the threat metric.

Threat map construction for stationary scenes

In case that the only scene motion is due to egomotion, this paper shows a strategy which starts from "looking where we are going" and ends with the computation of time-to-collision $t_c$ for each object pixel in the scene. Thus we get the threat map in a straightforward fashion, and the threat map is all you need for collision avoidance.

Note that "looking where you are going" is the output of the algorithm described by Barth and Tsuji. This "visual behavior" can only be accomplished if you have mastered the motion competency of qualitative egomotion estimation. So the output of egomotion estimation is a precondition of obstacle avoidance. This is an example of Fermuller and Aloimonos' concept that competencies build in a sequential fashion.

Let O be an object point in the scene. Consider the plane containing O and the optical axis z of the camera. Let the coordinates of this plane be (r,z), where z is the optical axis and r is the radial distance of O from the optical axis.

By similar triangles,

$f/r_i = z_0/r_0 , r_i = f r_0 /z_0$

$dr_i /dt = d/dt (f r_0 /z_0) = (-f r_0 /z_0 ^2) dz_0 /dt$

Substituting in r_i,

$dr_i /dt = (r_i)(-dz_0 /dt)/z_0$

Now if the egomotion velocity $dz_0/dt$, assumed to lie along the optical axis, is maintained, the point O will collide with the z=0 zero-depth plane at time

$t_c = z_0 /(-dz_0 /dt)$

Thus the time-to-collision $t_c$ for O can be computed from the optical flow at O's image point $r_i$ as

$t_c = r_i /(dr_i /dt)$

Now if we take the inverse of $t_c$ to be the threat metric, we can compute the threat map pixel by pixel using the radial optical flow $dr_i/dt$,

$1/t_c = (dr_i /dt)/r_i$

Thus we can do obstacle avoidance in a stationary scene by "worrying about" objects or surface patches which are moving rapidly away from the FOE. Note that this can also be written

$1/t_c = (dr_i /dt)/r_i = d(\ln r_i)/dt$

though we will not use this form in the present context.

Object avoidance based on the visual threat cue (VTC)

A few equations above we wrote

$dr_i /dt = (r_i)(-dz_0 /dt)/z_0$

Dividing by $r_i$,

$(dr_i /dt)/r_i = (-dz_0 /dt)/z_0$

So $1/t_c$ we computed above can be rewritten

$1/t_c = (-dR/dt)/R$

where (changing notation to follow the KR paper) R is the depth range of the object point. Note: $r_i$ is the radius in the image plane at which the object point is seen, R is the z-coordinate value of the object point in camera coordinates.

The authors next modify this $1/t_c$ threat metric in two following ways. First, lets define a desired minimum clearance $z=R_0$. This corresponds to a "personal space" or "no-fly zone" we wish to maintain in front of the camera coordinate origin to keep the camera platform safe from an unexpected maneuver. Then the time at which the object will penetrate this zone is

$t_0 = (R-R_0)/(-dR/dt)$

So $t_0$ is the time-to-minimum-clearance.

Second, lets scale the resulting threat metric

$1/t_0 = (-dR/dt)/(R-R_0)$

by $R_o/R$, which measures how close the threat object is to the desired minimum clearance. Time-to-minimum-clearance $t_0$ assumes that the object will continue on its current course. But if it is close, it might speed up or head right at us, so we should take its proximity into account also. Combining the time-to-minimum-clearance threat factor and the proximity-threat factor, we get the Kundur-Raviv (KR) Visual Threat Cue (VTC).

$VTC = (R_0/R)((-dR/dt)/(R-R_0))$

If we could construct a threat map using the VTC, we could use it as above. But it involves the range R to the object. If we need to compute that from the image, this is another recovery-based method, not a true active vision method.

To get around that, KR define an apparently unrelated quantity which can be computed directly from an image without recovery, the Image Quality Metric IQM, is a mapping from a specified region of the field of view to the reals as follows:

IQM: Image region $\rho$ -> Real number

define an image region $\rho$;
define a window W;
for each point (x,y) in $\rho$, sum absolute differences in gray levels between (x,y) and all other points in W(x,y);
sum the results of 3. over all (x,y) in $\rho$;
normalize by the total number of terms being summed.

So the IQM is a measure of how variable the image is over spatial scales limited to the chosen window size. This measure is averaged over a selected image region $\rho$. We can think of the IQM as an indicator of how sharply in focus the object segment surrounding that pixel location is.

KR claim: d(IQM)/dt is a proxy for the VTC.

By a proxy, we mean it can be used to "stand for" the VTC, i.e. it is a reasonably accurate model of the VTC. As an object gets closer you can see its textured surface better, thus IQM increases.

In other papers referenced in the KR paper, they give some empirical evidence for this claim.

The basic action of their obstacle avoidance algorithm is to segment out a candidate image region and measure d(IQM)/dt with $\rho$ = segment. Based on prior training (fuzzy system, neural network, etc.) a control scheme reacts to the proxy VTC by stopping, moving away, etc.

How good an active-animate method is the KR algorithm? That remains to be seen as they explore into more natural environments (see paper). But other than employing indexical reference and being active, it is suspect on all other grounds (goal-seeking, embodied, binocular, foveal, gaze control).

Visual servoing

Reading: Vassallo et al, Visual servoing and appearance for navigation, Robotics and Autonomous Systems, v 31 (2000), 87-97.

In control theory, a servomechanism is a system including a controlled plant with measurable outputs, a controller, and a feedback loop between them, which controls the plant outputs to follow a desired or reference trajectory.

Servoing: process of following that trajectory

Visual servoing: is the control of egomotion guided by visual image sequences.

Examples of visual servoing:

Moving the end-effector of a robot so that it can pick up a machine part on a conveyor belt (robotic hand-eye coordination).
Pursuit and interception of an incoming missile by an anti-missile missile.
Guiding a welding robot to lay the weld along a visible seam between sheet metal parts.
Maintaining an autonomous vehicle in the center of the right lane of a highway even as the road curves and turns guided by visual data.
Visual navigation along a desired path, for instance, following a corridor.

Visual servoing is a class of motion competencies, built on egomotion estimation, that are essential for autonomous movement using vision sensors.

Note that visual servoing is often combined with other visual competencies in developing useful behaviors. For instance, we can home to a point adjacent to a conveyor belt, then visually track a part moving along the belt, then perform hand-eye coordination to pick it up, then home to a parts bin, then do hand-eye to position the hand above the bin, then release the part. Along the way, we have done frequent egomotion estimation, homing, and perhaps obstacle avoidance as well as visual servoing. They are the motion competencies necessary to accomplish this vision-supported behavior.

Vassalo et al develop a nice example of visual servoing embedded within a larger visual navigation system in their paper "Visual servoing and appearance for navigation"

Their task scenario: A robot must be programmed to find and enter selected offices and rooms in a one-floor office space.

Design option 1: No visual servoing (no camera needed). Memorize floorplan. Then from known initial location and heading can compute the full sequence of control signals to get you wherever you want to go "with your eyes closed."

Design option 2: Full visual servoing using a camera and appearance-based control. Offline training: for each possible position of the robot and orientation of the optical axis, store an image. Then each time a new image is registered, search the library of images for best match. Turn and move the robot based on its current position, orientation and the desired destination.

Neither of these prove satisfactory design options. Problems:

DO1. Small initial condition errors will produce increasing navigational errors as you proceed. Also, cannot react to obstacles, determine closed/open doors, etc.

DO2. Large set of images to be stored and at each frame input, searched for best match. So, high computational complexity and large storage requirements.

Vassalo et al propose a compromise (DO3) in which we store a small set of indexed landmark images.

When we start the robot, it performs a visual servoing behavior we will call "corridor-following." As each new image comes in, we compare it with a single image, the first landmark image in the route. When we arrive at the first landmark image location, we update our location data, replace the image with the next indexed landmark image and proceed. This is landmark-based navigation.

So DO3 combines pure visual servoing during corridor- following with appearance-based (view-based localization) navigation at a sparse set of landmarks.

The landmark-image matching is done by straight image correlation. How is the corridor-following done?

Servoing to the vanishing point

We can follow the corridor by locating the ground plane vanishing point and turning the robot to home in on it.

Schematic of an image of the corridor

Vanishing point is the intersection of the corridor-ground plane edges.

To find the corridor-ground plane edges:

Use Sobel operators to find pixels with both significant horizontal and vertical gradients;
Separate into positive and negative gradient orientation pixel sets (left and right edges).
Erode with diagonal line mask to remove pixels not on diagonal lines.
Find the positive and negative straight lines with the largest number of pixel "hits."

Sobel Operator

Sobel is discrete differentiation operator, convolution with two kernels to give approximations to the horizontal and vertical derivative $G_x$ and $G_y$ of the image A. The kernels are separable.

Separable Kernel: Kernel can be express as the outer product of two vectors

$K_x = [1, 2, 1]^T [-1, 0, 1]$

$K_y = K_x^T$

$G_x = K_x \ast A$

$G_y = K_y \ast A$

Alternatively

$G_x = [1,2,1]^T * ([-1,0,1] * A)$

$G_y = [-1,0,1]^T * ([1,2,1] * A)$

Gradient magnitude is $\sqrt{G_x^2 + G_y^2}$ and direction is $\arctan (G_x/G_y)$

What is the link between convolution and correlation?

In the paper it is shown that the horizontal image coordinate of the vanishing point $^ix_v$ satisfies (eqn 7)

where $S_u$ is an intrinsic camera model parameter and $c_p$ is the cosine of the pitch (downlooking angle). θ is the deviation between the robot heading and the direction of the corridor. So controlling $^ix_v$ to zero (centering the vanishing point in the image) is the same as driving θ to zero (orienting the robot to follow the corridor).

Visual homing

Reading: Nelson, From visual homing to object recognition, in Aloimonos (ed.) Visual Navigation, Lawrence Erlbaum Associates, 1997

As was pointed out in the Vassallo paper, their scheme combining visual servoing and landmark memory can form the basis for visual homing, movement to a desired location using visual data.

Visual homing is important in several applications contexts:

Robotic tool positioning
Robotic or autonomous vehicle navigation
- Docking for recharge
Material handling
Object set manipulation, e.g. assembly tasks

We will study an approach to visual homing presented in the paper by Randal Nelson, "From visual homing to object recognition" (see above for full citation). As the title implies, Nelson's approach has implications for active vision approaches to the shape competencies.

The central questions in visual homing are: given the goal (homing to a specific location),

What is my current state (location and pose)?
Given my current state, what motion control actions should I take?

The second question is a control engineering question tied to the mechanics of the mobile platform and is outside the range of this course. We will simply assume that the motion control problem can be solved, and concentrate our attention on using visual data for state determination.

Visual associative memory

State determination can be done based on visual recovery. This passive approach is inefficient, however. Nelson advocates the use of a pattern-recognition based procedure called a visual associative memory. In general, an associative memory is a memory addressed by partial content.

E.g.: Phone number 645-23?? -> 645-2394
E.g.: Who do I know with a big nose and blond hair? That must be Bob.

A visual associative memory is a store of many 2-D images, and given a new image, addresses the most similar image.

Let $A_k$ be the kth feature of a pattern, with $A_k = \{a_k\}$, where each $a_k$ is a definite value for that feature. Define a pattern as an m-tuple $P=(a_1,...,a_m)$ from the pattern space $A=A_1 \times A_2 \times .. A_m$ ($\times$ is Cartesian product). So a pattern is a set of feature values from the pattern space.

Define a similarity function $s: A \times A \rightarrow R$ (R is the set of real numbers) which compares patterns. There are many such mappings which can be used as similarity functions, for instance the norm-induced correlation similarity measures similarity between $P_x$ and $P_y$ as

$s_1(P_x,P_y) = ||P_x-P_y||\sqrt{||P_x||*||P_y||}$

where ||*|| is a selected norm over A.

For his visual associative memory, Nelson uses the similarity measure

$s(P_x,P_y) = 1 - d(P_x,P_y)/m$

where d is the cardinality of the set of features that are not identical, and m is the dimension of the pattern space. d is sometimes called the "Hamming Distance."

Hamming Distance: Number of positions at which the symbols of two equal length strings differ

$P_x$ = (0.384,2,green,0,to the left of);

$P_y$ = (0.383,2,green,0,to the right of)

$s(P_x,P_y)$ = 1 - 2/5 = 3/5

So s measures the fraction of features that are identical.

Nelson selects this similarity measure because it can be used for categorical variables, those for which there are only labels without requiring any ordering function, as well as numerical variables.

Suppose we have n patterns M={P} stored in our associative memory. Then when presented with an input (index pattern) $P_x$, the associative memory's output will be the $P_y$ in M which best matches $P_x$

$\max_M s(P_x,P) = s(P_x,P_y)$

if the similarity exceeds a threshold t

$s(P_x,P_y) >= t$

If the best-match does not exceed t we return "no match."

Here is how we will use an associative memory for visual homing.

Store a set M of n reference patterns (camera views mapped down to their feature values) taken at known states (locations and poses). Label one or more of them as goal states.
For each reference pattern, store the value of the desired control action from that state.
For each new incoming image, map to its feature set and pass that feature set through the associative memory.
1. If there is a matched pattern, perform the reference pattern control action.
2. If there is no match, randomly change state and go to 3.
Repeat 3 until the goal state has been reached.

E.g.: Want to home to a point X in front of the cubic object with the camera optical axis perpendicular to the closest face. Assume camera can translate horizontally (not vertically) and rotate in yaw and pitch but not roll. These are reasonable DOFs for a mobile ground plane robot.

Pick a sampling grid on the ground plane and a sampling grid in pose space (camera yaw x pitch). For each point in this 4-D discrete state space, take a image. Define a control action for each point as "take a step towards the point X, and orient towards the center of any visible object."

We next need to specify the visible features of this scene that define the pattern space. There are many ways to do this, Nelson selects a vertex feature scheme. Here is how a binarized version of his vertex-finder algorithm works. Divide the image space into a $K_v$ by $K_h$ grid of cells. For a given image, use a vertex-finder algorithm to determine the cells in which vertices are determined to be visible. Assign cell (i,j) to have a feature value a=1 if there is a vertex there, otherwise a=0. The pattern corresponding to the goal state X will look like

...........................
...........................
.........1.....1...........
...........................
...........................
...........................
.........1.....1...........
...........................
...........................

If we are at a position like Y, the index pattern would be

...........................
...........................
...........1.1..1..........
...........................
...........................
....1...........1..........
...........1.1.............
...........................
...........................

and, if we have a reference pattern from approximately that same pose and location, it would have similarity 1. The action attached to that pattern would be something like "move one step in a direction 30 degrees ccw from optical axis and orient the camera to the centroid of the visible vertices."

In this example we used a binary feature space. But in general some or all features could be numerical.

In order for this scheme to work well, two conditions should be satisfied:

Match existence condition: for each x in C, the space of competence , there is a y in the set R of reference patterns such that s(P_x,P_y)>=t. In other words, there is at least one match.
Match uniqueness condition: for all x and y, s(P_x,P_y)>=t implies ||x-y||<r(x). r(x) is the desired accuracy of the state determination. In other words, there is at most one match.

If these conditions are met, then we will never have to pick a state randomly, nor confuse two states which are far apart. If they are not met, this algorithm may never converge. For instance if no y in R is visible from any point we might randomly move to, we will keep making random viewpoint changes forever.

Note that ideally both these conditions are met, but even if they are not, the algorithm may still converge.

Nelson defines his vertex-finder algorithm as follows. Divide the image into a GxG grid. Using an edge operator, find the cells that have strong edges in them, and the dominant edge direction in each such strong-edge grid cell. Quantize that direction into d values, eg. d=8 for D={N,NE,E,SE,S,SW,W,NW}. Then the pattern for an image consists in $P_x = (a_1,a_2,..,a_{G^2})$ where $A_k={D \cup 0}$, $\cup$ is the union operator and 0 is the null edge direction.

In implementing such a system, a basic design question is how many reference patterns M must be stored in order to satisfy the two match conditions above. By setting G and/or t low, we can reduce card(M) and still satisfy the match existence condition throughout C, but the lower G or t is set, the more likely the match uniqueness condition will be violated somewhere in C.

By setting G or t high, we are less likely to violate the uniqueness condition but at the expense of the cardinality of M.

Define the recognition neighborhood R(y) of a reference pattern $P_y$ as the largest connected set of similar points containing y (similar points are pairs (x,y) for which $s(P_x,P_y)>=t$), i.e. points that would be labelled by the same reference pattern). As long as $\cup_i R(y_i)=C$ for all reference patterns $M=\{P_y\}$ the match existence condition will be met throughout C.

This is a 2-D example of a design satisfying the existence condition but not the uniqueness condition. For instance, points near x are similar to several reference points, some of which are far apart (e.g. $y_2$, $y_9$). To satisfy uniqueness condition, we need

$||y_i-y_j|| > r \Rightarrow R(y_i) \cap R(y_j) = \emptyset$.

In equation (9) in the paper, Nelson estimates the spurious match probability $p_s$ in terms of the volume in pattern space of the recognition neighborhood:

$p_s \sim p_f v_{tot}/v_{rn}$

where $p_f$ is the far-match probability (probability that two points far apart picked randomly will match $d(x,y)>=r(x)$, $s(P_x,P_y)>=t$), $v_{rn}$ is the average volume of a recognition neighborhood, and $v_{tot}$ the total volume of the domain of competence C.

Roughly, $v_{tot}/v_{rn}$ measures the number of far recognition neighborhoods a point x might find a spurious match in at random, and $p_f$ the probability of a far-match there.

E.g.: Consider a building divided into N identical rooms, and we must home into the center of one selected room Then $p_f$ is proportional to N, and thus so is $p_s$.

Nelson performed some tests of his ideas using a 2-D motion space (x-y translation, no rotation) corresponding to "flying" above "Tinytown" with a downlooking camera. He set G to 5, t to 0.5 and took 120 reference images over a region of competency of roughly four "camerasfull" of the Tinytown model. Eight edge directions were used.

He defined a homing action, a step direction to home correctly (move towards the goal state) for each reference image, and tested to see if goal states would be reached.

Tests were done by defining a sampling grid on C with 4x as many points as reference points, so test points were separated by half a reference point separation. For most tested points in C, there was a match and the reference pattern matched was the correct one, i.e. a non-spurious match which therefore pointed in the correct direction (towards the goal state). For some tested points there was no match, and for one there was a spurious match. This means the homing sequence would always work, as long as randomized moves were made in the absence of match.

Comments:

Small 2-D test (4 cameras full, 5x5 grid). Beware of combinatorial explosion when scaling toy problems up!
Other features can be used, including features not limited to grid cell definition
Can distribute reference points more densely as the goal state is approached for fine motor control (e.g. docking).
Same associative memory principle can be used for object recognition.