CSE668 Principles of Animate Vision Spring 2006
 

Systems and Control

           

Required reading: [23] Maki, Nordlund and Eklundh, Attentional scene segmentation: Integrating depth and motion from phase, Computer Vision and Image Understanding, vol. 78 (2000), pp. 351-373.


We conclude our study of active and animate vision

this semester with a look at how the various skills

(motion, OR, etc.) can be pulled together into a

goal-directed system, and how the behavior of that

system can be controlled.



An important element of any active vision system is

gaze control, that is, control of the field of view by

manipulation of the optical axis and other extrinsic

camera model parameters, and control of the regions of

interest (ROI) within the current field of view which

require further processing.

 


CSE668 Sp2006 Peter Scott 10-01


 
Even without moving your eyes, you can change the locus of attention from
one region of the field of view to another. This is called attentional gaze
control, and it is the subject of the Maki paper. They liken it to cognitively
"shining a spotlight" on a selected portion of the retinal image without any
change in the camera model.



For artificial active vision systems, it is essential to be able to select,
effectively and quickly, those ROI's that need further analysis. The
identification of these ROI's is attentional scene segmentation, i.e.,
segmenting out interesting parts of the scene for further processing.

CSE668 Sp2006 Peter Scott 10-02


 

    The approach to attentional scene segmentation proposed in this paper is
    a cue integration approach. Three visual cues which together define the
    most important ROI's for additional consideration and processing are
    computed, masks defining their locations and extents within the field of
    view are determined, and their intersection is taken. These regions
    correspond to the ROI's for "attention."

CSE668 Sp2006 Peter Scott 10-03



   
    Why do we need attentional segmentation?


    Both in natural and in computer vision, in general

    there are two kinds of processing of raw image

    brightness data: preattentive and attentive
        
    processing.


   
Preattentive processing: pre-processing which occurs

    automatically and uniformly over the field of view

    independent of space or time.


   
Attentive processing: processing of ROI's determined by the vision system's
    attention control mechanism to merit additional "attention," i.e., analysis.

   

CSE668 Sp2006 Peter Scott 10-04




    E.g.: There is a parked car 100 meters ahead in the periphery of our
    field of view as we are walking in a mall parking lot. If we are simply
    doing obstacle avoidance, a stationary distant car does not trigger our
    attention control mechanism; we need not attend to it. But if we are
    trying to find our car in the lot, we will need to process that
    information further, probably by foveating to it.


    Like everything else in an active vision system, the

    attentive control decision process is both task and

    embodiment dependent.

CSE668 Sp2006 Peter Scott 10-05



   
    Often the result of attentive scene segmentation is to identify ROI's we
    need to foveate to. But another use is to trigger reactive
    (stimulus-response) rather than cognitive
    (stimulus-analysis-planning-response) behavior.

    E.g.: we are walking through the woods and are about to collide with an
    overhanging branch which suddenly appears in our FOV. We attentively
    segment out the branch and react to its radial motion and location,
    ducking away from it without foveating to it if the threat is immediate.
    Note: recall the threat metric

                    1/t_c = (dr_i/dt) / r_i

    and that foveation requires at least 100 ms for the human eye.
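
    As a rough numeric illustration of this trade-off (not from the paper;
    the values below are invented), the following Python sketch compares the
    time-to-contact implied by the threat metric against a 100 ms foveation
    latency:

        # Threat metric 1/t_c = (dr_i/dt) / r_i, with made-up values.
        r_i = 20.0        # current radial image extent of the branch (pixels)
        dr_dt = 400.0     # rate of radial expansion (pixels per second)

        inv_tc = dr_dt / r_i          # 1/t_c = 20 per second
        t_c = 1.0 / inv_tc            # time to contact = 0.05 s = 50 ms

        FOVEATION_LATENCY = 0.100     # roughly 100 ms for the human eye

        if t_c < FOVEATION_LATENCY:
            print("react (duck) without foveating: t_c = %.0f ms" % (t_c * 1000))
        else:
            print("enough time to foveate and analyze: t_c = %.0f ms" % (t_c * 1000))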

CSE668 Sp2006 Peter Scott 10-06




    Preattentive processing produces, at each time t and at each point (u,v)
    in the image space, a vector of values for the preattentive image
    parameters at that point. Each component can be considered a preattentive
    cue for attentive segmentation. Maki et al select three: image flow,
    stereo disparity, and motion detection. There are many other preattentive
    cues that might have been selected: edges, colors, color gradients,
    region sizes, etc. They pick these three because they believe the most
    important things to attend to in general, i.e., pre-attentively, for
    further processing are nearby objects in motion relative to the viewer.


CSE668 Sp2006 Peter Scott 10-07


 




Attentional integration of selected preattentive cues


       

CSE668 Sp2006 Peter Scott 10-08




    Stereo disparity is the difference between where a given point in world
    coordinates shows up on the two image planes. From stereo disparity we
    can determine relative depth. For this cue, attentional attractiveness
    corresponds to low relative depth. Maki et al compute a dense stereo
    disparity map by using a phase-based method. There are other ways to do
    this, including the sparse feature-based and dense correlation-based
    methods we discussed earlier.

CSE668 Sp2006 Peter Scott 10-09



   
    Optical flow may be used to determine motion in the plane perpendicular
    to the optical axis. Under orthographic projection, optical flow is
    completely insensitive to depth motion, and for distant objects,
    perspective projection and orthographic projection are very similar.
    Maki et al choose to measure only the 1-D horizontal component of flow
    in the plane perpendicular to the optical axis because they can reuse
    the 1-D depth-recovering disparity algorithm to do this.


   

CSE668 Sp2006 Peter Scott 10-10




    Motion detection is employed to segment regions where

    motion is occurring, and together with optical flow then

    used to determine the constancy of that motion in image

    regions for purposes of multi-object segmentation.

CSE668 Sp2006 Peter Scott 10-11




    Stereo disparity cue


    Stereo disparity measures relative depth. Image points

    of equal disparity are at the same depth.

CSE668 Sp2006 Peter Scott 10-12


 

   
    We see that depth is not a function of r_t, just of the stereo baseline
    2r_0 and the disparity. So for a given intrinsic stereo camera model,
    disparity measures relative depth.
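
    For reference, under the standard parallel-axis pinhole stereo geometry
    (an assumption here; the slide's own figure and symbols are not
    reproduced), depth Z, focal length f, baseline 2r_0 and disparity d are
    related by

                    Z = f (2 r_0) / d

    so image points of equal disparity lie at equal depth, and for a fixed f
    the depth scale is set by the baseline alone.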


    The algorithm for determining the disparity at image point (x,y) used in
    Maki et al uses the phase difference at a known frequency to determine
    spatial difference.

        1. Using a Fourier wavelet-like kernel, compute the magnitude and
        phase of the complex response V(x) of the horizontal frequency
        component at image location x. Repeat for the left and right images,
        giving V_l(x) and V_r(x).
        2. Disparity(x) = ((arg V_l(x) - arg V_r(x)) / 2*pi) * T
        3. T = 2*pi/w, where w ~ d/dx (arg V(x)) is the local spatial
        frequency.


    For details see the earlier paper by Maki et al [24].
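
    A minimal 1-D Python sketch of this phase-difference idea is given below,
    assuming a complex Gabor kernel stands in for the "Fourier wavelet-like"
    kernel; the parameter values, the local-frequency estimate, and the
    fallback to the tuning frequency are my own choices, not the paper's:

        import numpy as np

        def gabor_response(row, wavelength=8.0, sigma=6.0):
            # Complex response V(x) of a 1-D Gabor kernel along an image row;
            # np.abs() gives the magnitude and np.angle() the phase.
            x = np.arange(-3 * sigma, 3 * sigma + 1)
            omega = 2.0 * np.pi / wavelength
            kernel = np.exp(-x**2 / (2 * sigma**2)) * np.exp(1j * omega * x)
            return np.convolve(row, kernel, mode="same")

        def phase_disparity(left_row, right_row, wavelength=8.0):
            # Disparity(x) = (arg V_l(x) - arg V_r(x)) / w, with w the local
            # spatial frequency (equivalently (dphi / 2*pi) * T, T = 2*pi/w).
            vl = gabor_response(left_row, wavelength)
            vr = gabor_response(right_row, wavelength)
            dphi = np.angle(vl * np.conj(vr))          # wrapped phase difference
            w = np.gradient(np.unwrap(np.angle(vl)))   # local frequency estimate
            w = np.where(np.abs(w) > 1e-3, w, 2.0 * np.pi / wavelength)
            return dphi / w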

CSE668 Sp2006 Peter Scott 10-13



    Optical flow cue

      
    Suppose we determine V_t1(x) and V_t2(x), where t1 and t2 are successive
    image acquisition times and both responses are taken by the same camera.
    Then the "disparity" measured as above corresponds to image flow in the
    horizontal direction. In this case T corresponds to a temporal rather
    than a spatial period as before.
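
    In other words, the routine sketched above can be reused unchanged. As a
    hypothetical usage (frame_t1, frame_t2 and row are placeholder names, not
    from the paper):

        # Horizontal image flow for one scanline of a single camera, obtained
        # by feeding two successive frames to the phase-based "disparity" sketch.
        flow_row = phase_disparity(frame_t1[row], frame_t2[row])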

CSE668 Sp2006 Peter Scott 10-14



   
    A problem with this common method for disparity and optical flow is that
    it may give unreliable point estimates of these two cues. They define a
    confidence parameter (certainty map) C(x) which is the product of the
    correlation coefficient between the two V's and their combined strength
    (geometric mean magnitude). The idea is that x's where the V's are
    similar and strong are reliable locations for using their associated
    cues to determine attentional masks.
CSE668 Sp2006 Peter Scott 10-15



   
    Having now mapped both disparity and optical flow

    separately over the field of view, we need to segment

    out regions that are identified by these cues as requiring

    further processing. This of course is task dependent. Maki

    et al define two generic task modes, pursuit mode and

    saccade mode. Pursuit mode includes visual servoing, shape

    from motion, and docking. Saccade mode includes visual

    search, object detection, and object recognition.

CSE668 Sp2006 Peter Scott 10-16




    Pursuit mode
 

    This is also called tracking mode, where the task is to

    continue to attend to a moving target that has already

    been identified. The procedure to create the disparity mask

    Td(k) and image flow mask Tf(k) at time k is the same:


    1. Predict the change P(k) from k-1 to k in the disparity

    (flow) target value based on a simple linear update of

    P(k-1).

    2. Let the selected disparity (flow) value at k be the

    sum of that at k-1 plus P(k).

    3. Find the histogram peak nearest to the selected value and define that
    histogram mode as the target values. Note: the histograms are
    confidence-weighted.

   
    Finally, the target mask Tp(k) is defined as the

    intersection of the disparity mask Td(k) and image flow

    mask Tf(k). This guarantees that the target we will attend

    to is both in the depth range we predict and has the

    horizontal motion we predict.
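
    A hedged sketch of one pursuit-mode step for a single cue follows; the
    function name and signature are hypothetical, and the bin count, peak
    detection and back-projection window are my own choices:

        import numpy as np

        def pursuit_cue_mask(cue_map, conf_map, prev_value, prev_delta, bins=64):
            # Step 1: predict the change P(k) by a simple linear update of P(k-1).
            predicted_delta = prev_delta
            # Step 2: selected cue value at k = value at k-1 plus P(k).
            selected = prev_value + predicted_delta
            # Step 3: confidence-weighted histogram; pick the peak nearest the
            # selected value and back-project that mode as the cue mask.
            hist, edges = np.histogram(cue_map, bins=bins, weights=conf_map)
            centers = 0.5 * (edges[:-1] + edges[1:])
            peaks = [i for i in range(1, bins - 1)
                     if hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1] and hist[i] > 0]
            peak = (min(peaks, key=lambda i: abs(centers[i] - selected))
                    if peaks else int(np.argmax(hist)))
            target_value = centers[peak]
            mask = np.abs(cue_map - target_value) <= (edges[1] - edges[0])
            return mask, target_value, target_value - prev_value

        # Pursuit target mask: Tp(k) = Td(k) & Tf(k), i.e. the region that is
        # both in the predicted depth range and moving with the predicted
        # horizontal flow.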

CSE668 Sp2006 Peter Scott 10-17



    Saccade mode


    Here the task is to identify the region containing the

    most important object that you are not currently attending

    to. The authors admit that importance is definitely task

    dependent, but stick with their assumption that close

    objects in motion are the most important ones.


    1. Eliminate the current target mask from the images and

    compute confidence-weighted disparity histograms.


    2. Back-project the high disparity (low depth) peaks,

    again eliminating current target mask locations.


    3. Intersect with motion mask.


    Finally, the least-depth region in motion is the output

    saccade mode target mask.
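
    A corresponding saccade-mode sketch (hypothetical names again; the
    disparity, confidence, motion and current-target masks are assumed to be
    arrays of the same shape, with the masks boolean):

        import numpy as np

        def saccade_target_mask(disp_map, conf_map, motion_mask,
                                current_target_mask, bins=64):
            # 1. Exclude the current target and form a confidence-weighted histogram.
            valid = ~current_target_mask
            hist, edges = np.histogram(disp_map, bins=bins,
                                       weights=np.where(valid, conf_map, 0.0))
            centers = 0.5 * (edges[:-1] + edges[1:])
            # 2. Back-project the highest-disparity (least-depth) populated peak.
            occupied = np.nonzero(hist > 0)[0]
            if occupied.size == 0:
                return np.zeros_like(motion_mask, dtype=bool)
            peak = occupied[-1]
            in_peak = (np.abs(disp_map - centers[peak]) <= (edges[1] - edges[0])) & valid
            # 3. Intersect with the motion mask: the least-depth region in motion.
            return in_peak & motion_mask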

CSE668 Sp2006 Peter Scott 10-18



    Choosing between pursuit and saccade at time k


    At each time k, we may choose to continue pursuit of the

    current target in the pursuit map Tp(k) (if there is one),

    or saccade to the new one in the saccade map Ts(k).

   
    There are many criteria that could be employed to

    choose between these alternatives. Some take their

    input directly from the two maps, others reach back into

    the data used to compute them. For instance:


    1. Pursuit completion: pursue until a task-completion

    condition is met. For instance, object recognition or

    interception.


    2. Vigilant pursuit: pursue for a fixed period of

    time, then saccade, then return to pursuit.


    3. Surveillance: saccade until all objects with certain

    characteristics within a set depth sphere have been
   
    detected.

CSE668 Sp2006 Peter Scott 10-19



    Consistent with their depth-motion framework, Maki et al

    suggest two different pursuit-saccade decision criteria:



    Depth-based: pursue until the saccade map object is closer

    than the pursuit map object.


    Duration-based: pursue for a fixed length of time.
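
    As a sketch, these two rules could be combined into a simple arbiter like
    the following (the function name, arguments and thresholds are
    illustrative only, not the authors'):

        def choose_mode(pursuing, pursuit_depth, saccade_depth,
                        pursuit_time, max_pursuit_time):
            # Depth-based rule: saccade when the saccade-map object is closer.
            # Duration-based rule: saccade after pursuing for a fixed time.
            if not pursuing:
                return "saccade"
            if saccade_depth < pursuit_depth:
                return "saccade"
            if pursuit_time >= max_pursuit_time:
                return "saccade"
            return "pursue"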


    Clearly, neither of these is sufficient. The depth-based

    criterion is not task or embodiment dependent. We may let

    an important target "get away" by saccading to a closer

    object that is, for instance, moving away from us. And bad

    things can happen to us during the duration-based fixed

    time that we are not alert to obstacles and threats.

   

CSE668 Sp2006 Peter Scott 10-20



    Example from their paper:

 

CSE668 Sp2006 Peter Scott 10-21


 

CSE668 Sp2006 Peter Scott 10-22


 

 

    As a final note, let's measure the Maki et al attentive

    gaze control approach against the active vision and

    animate vision paradigms.

    Active vision: Compute what you need as you need it

    * Representations and algorithms minimal for task.

    * Vision is not recovery.

    * Vision is active, i.e., the current camera model is selected
    in light of current state and goals.

    * Vision is embodied and goal-directed.


 

    Animate vision: Anthropomorphic features emphasized

    * Binocularity: NA

    * Foveal (multiresolution) focal plane geometry

    * Gaze control, both between and within eye movements

    * Use fixation frames and indexical reference
 

 

CSE668 Sp2006 Peter Scott 10-23


   
    Presentations

   
    Begin next class April 20.

   
    Format: 20 min PowerPoint (or equivalent) presentation followed by 5 min
    of questions and answers. Please email your presentation to me by
    5:00 PM on the day before you are to give your talk. I will post links
    to these files on the course home page, and you will be able to open
    them from the classroom PC and projector.

   
    Suggestions:


    1. No more than 4-5 bullets on any single slide.

    2. Illustrate with graphics wherever possible.

    3. Do at least one dry run to determine how many slides you are likely
    to cover in 20 min. Make sure you will have time to get to the critical
    slides.

CSE668 Sp2006 Peter Scott 10-24




    4. Have a core set of slides (15-25) and a supplementary

    set (0-10+) which you can cover if you have time.

    5. Your core set should include clear statements of:


        * your problem

        * where it fits: 3-D/motion? passive/active?

        * your basic approach

        * the literature and your relation to it

        * what you have produced to date

        * what your minimal goals for the project include

        * what you will do if you have additional time


    It may include anything else as well, but these items

    are valuable in framing your project.

    6. Assume audience is knowledgeable in computer vision

    but not necessarily in the exact area of your project.

    7. Oral presentation: nervousness, which everyone feels, tends to make
    you speak quickly. Speak at a moderate pace and be careful not to rush,
    particularly if you speak English with an accent.

    8. Mention at the beginning of your presentation if you

    want questions as they occur to people, or to hold

    questions until you are done.

       

CSE668 Sp2006 Peter Scott 10-25