CSE473/573 Fall 2010 Lecture Charts  
 

Chapter 8: Image Understanding


Recalling the image processing - computer vision

chain discussed earlier this semester, we have

arrived at the final step: image understanding.


 

CSE 573 Fa2010 Peter Scott 08-01


 
 
 

Image understanding is the task of extracting complete

semantic and contextual information from a scene. This

information is usually task-dependent.
 

Eg:






    Image understanding for an air-to-air missile

    would require object classification: high-performance

    fighter, possibly detecting the national insignia.

    Image understanding for a "scientifically aware"

    machine vision system would require further detecting

    the plume and interpreting it as signalling that the

    aircraft just broke the sound barrier.
 

CSE 573 Fa2010 Peter Scott 08-02

In general, image understanding requires semantic labelling

of all significant objects in the scene, and semantic

labelling of their relationships. This scene data together

with domain knowledge is then used to derive the image

understanding.
 

Note: semantic labelling is labelling an object or image

      part by its meaning. Eg: classifying an object, or

      noting an object to be a threat. Index labelling is

      labelling an object or image part by an index code

      without meaning, eg. enumerating connected segments.
 

Semantic labelling of objects and relationships is the

key step. The methods for doing semantic labelling include

all the methods discussed in Chapter 7: Object Recognition

plus some more: rule-based systems, semantic networks,

reinforcement learning systems, etc. In this chapter we

will survey some of these methods as applied to image

understanding.

CSE 573 Fa2010 Peter Scott 08-03

Control strategies:
 

Both image data and domain knowledge tend to be large

and weakly structured databases. So the control strategy,

that is, the order in which datasets from these databases

are selected, referenced and accessed, is very important

to the usefulness of an image understanding scheme.
 

    Parallel and serial control schemes: often, tasks

can be performed in parallel. If the right hardware

is available to take advantage of parallelization,

efficiency can be improved.
 

        Eg: Divide scene into regions of interest, search

            each ROI in parallel for interesting objects.
 

Sometimes the algorithm is inherently serial, however.
 

        Eg: Find the likeliest candidate to match a

    desired object. Accept it or, if rejected, move to

    the next most likely.
 

CSE 573 Fa2010 Peter Scott 08-04

    Top-down and bottom-up control schemes:
 

Bottom-up control is a sequence of transformations

beginning with the original image data, at each step

using more compact representations of image attributes

until desired semantics can be recognized. This

corresponds to a natural process of summarizing data

on multiple levels.
 

    Eg: Pixels -> connected segments -> image regions

        -> objects -> object arrangements -> understanding.
 

Bottom-up control is said to be data-driven, since the

decision trees are governed by the image data itself.

Bottom-up control moves up through levels of increasing

coarseness (decreasing resolution) until matching metrics

can be applied and the scene understood.
 

CSE 573 Fa2010 Peter Scott 08-05

Top-down control is a sequence of transformations

beginning with the most highly summarized image

representations available, and descending through

representations containing increasing image detail.

At each step, image structures  are matched against a

library of model-based template structures of the same

coarseness (level of resolution) to determine likely

candidates for further descent and investigation.
 

At each stage of descent, sub-goals are set and

executed. Top-down control is goal oriented.
 

CSE 573 Fa2010 Peter Scott 08-06

    Eg: We are searching for all roads in a satellite

        image. Bottom-up, we perform edge enhancement

        on the original image, then edge thinning and

        connection, then find all linear features of

        a minimum length, then connect them with angle

        constraints, etc. Top-down, we take a low-res

        version of the image, find all linear features,

        search each one of those at increasing resolution

        to verify/falsify the road hypothesis.
 

Top-down control is model-based (model-driven). Decisions

are governed at every stage by the model attributes we

are interested in finding in the image. In the case of

bottom-up, data-driven algorithms, domain knowledge

becomes significant only in the final stages.
 

CSE 573 Fa2010 Peter Scott 08-07


 
Other control strategies include hybrids incorporating

both bottom-up and top-down stages (see for instance

the coronary artery border detection algorithm on p. 371),

and non-hierarchical approaches such as the blackboard control

scheme. The blackboard is an open message-passing center

in which task results can be written and read, and

autonomous agents called daemons can communicate. For

instance, there can be an edge-enhancing daemon and a

segmentation daemon, along with daemons whose job is

to identify the existence of specific states or objects

in the scene.

CSE 573 Fa2010 Peter Scott 08-07

Snakes
 

One class of model-driven control strategies is based

on active contour models, models of bounding contours

of regions whose shapes can be incrementally adjusted to

fit the image data. Active contour models which adjust

their shape consistent with minimizing a "shape energy"

functional are called snakes.
 

    Eg: Snake model for the outline of a face. Let the

    shape energy be the sum of a term penalizing the snake

    for regions of sharp bending, and a term rewarding the

    snake for moving to pixels of high gradient.

CSE 573 Fa2010 Peter Scott 08-08

    Using a gradient descent algorithm on the shape energy,

    convergence to a local minimum would proceed in steps

    such as those illustrated on the slide.

CSE 573 Fa2010 Peter Scott 08-09

The energy function to be minimized can take many forms,

but often consists of a sum of terms reflecting three

kinds of considerations:
 

    1. Internal energy: we typically prefer smoother,

       straighter snakes to those with kinks, sharp

       curvatures, and jump discontinuities.
 

    2. Image forces: we may want to reward snakes that

       tend to move to bright areas (dark areas),

       that tend to track image edges, or move to

       terminations (end pixels of lines or regions).
 

    3. Constraint forces: require the snake to not be

       longer than a certain length, not enclose more

       than a certain area, not be nearer a given image

       feature than a certain distance, etc.
 

CSE 573 Fa2010 Peter Scott 08-10


 

For instance, if we wished the snake to be no more than

l0 in length, we could have a term of the form


            max( 0, [l(s) - l0]^3 )


as a constraint force. So any snake with length > l0

will be sharply penalized.
 

Let v(s) = [x(s), y(s)] be a point along a snake, for

s in [0,1]. s=0 corresponds to an arbitrary starting point

along the snake, and a complete circuit of the snake is

made as s ranges from 0 to 1, with v(0) = v(1).


Expressing the energy functional as an integral (the

standard snake form; Econ collects the constraint terms),


    Esnake* = integral over s in [0,1] of

        [ Eint(v(s)) + Eimage(v(s)) + Econ(v(s)) ] ds


    Eint = ( alpha(s)|dv/ds|^2 + beta(s)|d^2v/ds^2|^2 ) / 2


The first term in Eint penalizes "stretch," or fast

movement of the parameter s along the snake, and is

called the elasticity term. The second penalizes sharp

bends or high curvature, and is the stiffness term.

CSE 573 Fa2010 Peter Scott 08-11


 

The image force Eimage can take many task-dependent forms.

A common generic form is to express it as a linear

combination of line, edge and termination terms


    Eimage = w_line E_line + w_edge E_edge + w_term E_term
 

If we wish to attract the snake to existing edges, for

instance, we can use the specific form
 

        Eedge = -| grad f(x(s),y(s)) |^2
 

Forms for the other terms are given in the text on p376.

CSE 573 Fa2010 Peter Scott 08-12


 

Using the calculus of variations, a necessary condition

for the minimal energy snake can be derived called the

Euler-Lagrange condition. This is a partial differential

equation which can be solved iteratively using standard

PDE methods.
 

Alternatively, the first variation (denoted dEsnake*), of

Esnake* with respect to specific variations dv(s) of the

function v(s),  can be computed using standard variational

arguments. dEsnake* tells us how much Esnake* will change

if we make a small change dv(s) in v(s). The first

variation is the function-space equivalent of the first

derivative. Selecting that dv(s) which minimizes the

first variation dEsnake* we have a steepest descent

algorithm for snake construction.
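
A minimal numerical sketch of such a steepest-descent snake
(not the text's derivation; the synthetic disk image, the
weights alpha and beta, and the step size are illustrative
assumptions):

    import numpy as np

    def snake_step(v, Fx, Fy, alpha=0.5, beta=0.2, step=0.1):
        """One descent step for a closed snake v (N x 2 array of x,y)."""
        # Elasticity force: second difference along the snake.
        d2 = np.roll(v, -1, axis=0) - 2*v + np.roll(v, 1, axis=0)
        # Stiffness force: fourth difference penalizes bending.
        d4 = (np.roll(v, -2, axis=0) - 4*np.roll(v, -1, axis=0) + 6*v
              - 4*np.roll(v, 1, axis=0) + np.roll(v, 2, axis=0))
        internal = alpha * d2 - beta * d4
        # Sample the external force field at the nearest pixel.
        xi = np.clip(v[:, 0].round().astype(int), 0, Fx.shape[1] - 1)
        yi = np.clip(v[:, 1].round().astype(int), 0, Fx.shape[0] - 1)
        external = np.stack([Fx[yi, xi], Fy[yi, xi]], axis=1)
        return v + step * (internal + external)

    # Synthetic image: a bright disk; the snake should settle on its edge.
    yy, xx = np.mgrid[0:100, 0:100]
    f = ((xx - 50)**2 + (yy - 50)**2 < 20**2).astype(float)
    gy, gx = np.gradient(f)
    E_edge = -(gx**2 + gy**2)            # low energy on strong edges
    Fy, Fx = np.gradient(-E_edge)        # descent force = -grad E_edge
    t = np.linspace(0, 2*np.pi, 60, endpoint=False)
    v = np.stack([50 + 35*np.cos(t), 50 + 35*np.sin(t)], axis=1)
    for _ in range(500):
        v = snake_step(v, Fx, Fy)        # contour creeps onto the edge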

CSE 573 Fa2010 Peter Scott 08-13


 

Snake growing is an alternative procedure for finding the

best snake in the minimum-energy sense. It substitutes

many "easy" or small, well-conditioned snake problems for

a single "hard," badly-conditioned one. The procedure is:
 

    1. Get an initial snake.

    2. Measure the energy density (integrand of Esnake*)

       along the snake and eliminate areas of high density.

    3. Grow the remaining segments by extending them along

       their natural tangents.

    4. Solve the snake problem for each extended segment.

    5. Merge the snake segments.

CSE 573 Fa2010 Peter Scott 08-14


 
 

 
Point Distribution Models
 

The idea is to model a class of objects by the

locations of a set of points along its boundary,

together with the observed variations of this

pointset in a training dataset. Then when a new

instance of this object class is detected, its

set of boundary points is found by fitting to the

boundary pointset of the model (PDM). The points

in the object's PDM are connected to form the

estimate of the object boundary.
 

CSE 573 Fa2010 Peter Scott 08-15


 

The first step in this shape description process is to

compute the PDM for the object class. The PDM consists

of two things: the vector of average locations of the

boundary points, and a set of vectors which describe

the most common variations of the boundary points from

the average, ie. the set of eigenvectors.
 

For each of the M object images in our training dataset,

mark N "landmark" points around its boundary.

Let {(xij,yij),i=1..M,j=1..N} be this set of points.
 
 

CSE 573 Fa2010 Peter Scott 08-16

    Eg: Training set for PDM for handwritten "T"

CSE 573 Fa2010 Peter Scott 08-17



To construct our PDM from the training dataset, first

the affine transformation that best matches xi with

x1 is found for each i. The average of the M aligned

training atoms is found, and is then aligned with x1.

Each of the aligned atoms is then realigned to match

this mean. The mean is then recomputed and the process

iterated. When converged, we have the aligned training

set {x_hat1...x_hatM}.
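
A hedged numpy sketch of this alignment loop; for brevity a
similarity transform (translation, rotation and scale via
orthogonal Procrustes) stands in for the text's general affine
fit, and shapes is an M x N x 2 array of landmark coordinates:

    import numpy as np

    def align(src, dst):
        """Least-squares similarity transform of shape src onto dst."""
        sc, dc = src.mean(axis=0), dst.mean(axis=0)
        s0, d0 = src - sc, dst - dc
        # Optimal rotation from the SVD of the cross-covariance.
        U, _, Vt = np.linalg.svd(s0.T @ d0)
        A = U @ Vt
        scale = np.sum(d0 * (s0 @ A)) / np.sum(s0**2)
        return scale * s0 @ A + dc

    def align_training_set(shapes, iters=10):
        aligned = np.array([align(s, shapes[0]) for s in shapes])
        for _ in range(iters):
            mean = aligned.mean(axis=0)
            mean = align(mean, shapes[0])   # re-anchor the mean to x1
            aligned = np.array([align(s, mean) for s in aligned])
        return aligned                      # {x_hat1 ... x_hatM}

Anchoring the recomputed mean to the first shape keeps the
iteration from drifting or shrinking as it converges.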
 
 

CSE 573 Fa2010 Peter Scott 08-18




Computing the average of the aligned training set
 

        x_bar = (1/M)(x_hat1 + x_hat2 + ... + x_hatM)
 

x_bar is the expected shape of the object. But the

object class shows variations from instance to instance

which have to be modelled as well. Let
 

        dxi = x_hati - x_bar
 

be the vector of variations or "delta" of the ith

training atom from the mean point set. Then the

2Nx2N covariance matrix of the training dataset can

be estimated as the average outer product of the

deltas,
 

     S = (1/M)(dx1 dx1^T + dx2 dx2^T + ... + dxM dxM^T)
 

CSE 573 Fa2010 Peter Scott 08-19



Now S shows us the expected variation within the class.

In directions in which S is large, there is a lot of

variability in the training dataset, while in directions

in which S is small, there is little variation.
 

This information is captured in the set of 2N

ordered eigenvalues and eigenvectors of S,


            S p_i = lam_i p_i


where lam1 > lam2 > ... > lam2N are the eigenvalues and

{p_i, i=1..2N} are the associated eigenvectors.
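
In numpy the whole computation is a few lines; a sketch with
random stand-ins for the aligned, flattened 2N-vectors:

    import numpy as np

    M, N = 20, 32                        # e.g. 20 shapes, 32 landmarks
    rng = np.random.default_rng(0)
    X = rng.normal(size=(M, 2*N))        # stand-in for aligned shapes
    x_bar = X.mean(axis=0)               # expected shape
    dX = X - x_bar                       # deltas from the mean
    S = dX.T @ dX / M                    # 2N x 2N covariance estimate
    lam, P = np.linalg.eigh(S)           # eigenpairs, ascending order
    order = np.argsort(lam)[::-1]        # reorder so lam1 >= lam2 >= ...
    lam, P = lam[order], P[:, order]     # column P[:, i] is p_(i+1)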
 

CSE 573 Fa2010 Peter Scott 08-20


    Eg: Consider a 2-D random vector x with mean mu

    and covariance matrix S (numerical example given

    on the slide).

CSE 573 Fa2010 Peter Scott 08-20a



Any vector x can be represented as a linear combination

of the eigenvectors, with those corresponding to the

largest eigenvalues on average carrying the most weight.

The lower-indexed eigenvectors correspond to directions

of frequent distortion of the model class from its

mean PDM vector x_bar, the higher ones to minor

variations and noise. So the principal components

are determined by truncating the set of eigenvectors

at t<<2N and the resulting fit is
 

        x = x_bar + (b1 p1 + b2 p2 + ... + bt pt)
 

where the bi's are weight coefficients which tell how

much of each of the "modes of variation" p1..pt has

been detected in a given instance of the object.
 

For details of the calculation of the bi, see sec 8.3.
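
A hedged sketch of the fit: because the kept eigenvectors are
orthonormal, the least-squares weights are just projections of
the deviation from the mean onto the leading modes (sec 8.3
gives the full treatment):

    import numpy as np

    rng = np.random.default_rng(1)
    M, N, t = 20, 32, 5                  # shapes, landmarks, kept modes
    X = rng.normal(size=(M, 2*N))        # stand-in aligned shapes
    x_bar = X.mean(axis=0)
    dX = X - x_bar
    lam, P = np.linalg.eigh(dX.T @ dX / M)
    P = P[:, np.argsort(lam)[::-1]][:, :t]   # t leading eigenvectors
    x = X[0]                             # a "new" instance to explain
    b = P.T @ (x - x_bar)                # mode weights b1..bt
    x_fit = x_bar + P @ b                # x ~ x_bar + b1 p1 + .. + bt pt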
 

CSE 573 Fa2010 Peter Scott 08-21






 

Pattern recognition methods in image understanding
 

Object recognition from feature vectors was discussed in

Ch. 7. If we assign a feature vector to each pixel, the

same minimum distance and statistical techniques can be

used to answer image understanding questions as well.
 

Eg: Supervised classification of multi-spectral data.

    Observing the same patch of the Earth's surface

    from a satellite using multiple images taken at

    m different wavelengths (IR, UV, visible), let

    X(i,j) be, for each pixel location (i,j), an

    m-vector of brightness values. Suppose we have a

    training dataset of m-vectors corresponding to

    classes wheat, corn, soybeans, trees, water, grass.

    We can do minimum-distance classification of each

    pixel, then postprocess with a sequence of closing

    morphological operations to eliminate scattered

    mislabellings.
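
A hedged numpy sketch of the per-pixel minimum-distance step
(the class means would come from the training dataset; random
stand-ins are used here):

    import numpy as np

    m, H, W, K = 4, 64, 64, 6            # bands, image size, classes
    rng = np.random.default_rng(2)
    X = rng.normal(size=(H, W, m))       # one m-vector per pixel
    means = rng.normal(size=(K, m))      # one trained mean per class
    # Distance from every pixel vector to every class mean.
    d = np.linalg.norm(X[:, :, None, :] - means[None, None, :, :], axis=-1)
    labels = d.argmin(axis=2)            # H x W map of class indices

The closing operations would then be applied to each class's
binary label mask to absorb the scattered mislabellings.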
 

CSE 573 Fa2010 Peter Scott 08-22


 

Eg  (continued): Each class of the training dataset can

    be used to estimate the class-conditional

    probabilities p(X|wheat), p(X|corn) etc. For

    instance, for each class, we can assume Gaussian

    statistics and take the sample mean as the estimated

    Gaussian mean vector, and the sample mean outer

    product of the deviations from it as the estimated

    covariance matrix


            Mu = (1/m)(X1 + ... + Xm)

            Psi = (1/m)[(X1-Mu)(X1-Mu)^T + ... + (Xm-Mu)(Xm-Mu)^T]

    where there are m training atoms for this class.
 

CSE 573 Fa2010 Peter Scott 08-23


 

    The required class-conditional probabilities

    p(X|wheat), etc. then take on the usual Gaussian

    form (for feature dimension m)


        p(X|w) = (2 pi)^(-m/2) |Psi|^(-1/2)

                 exp{ -(1/2)(X - Mu)^T Psi^(-1) (X - Mu) }
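
A sketch of estimating Mu and Psi for one class and evaluating
this density; the training vectors are random stand-ins:

    import numpy as np

    rng = np.random.default_rng(3)
    Xw = rng.normal(size=(100, 4))       # 100 training m-vectors, m=4
    Mu = Xw.mean(axis=0)
    D = Xw - Mu
    Psi = D.T @ D / len(Xw)              # sample covariance

    def gaussian_pdf(x, Mu, Psi):
        m = len(Mu)
        diff = x - Mu
        quad = diff @ np.linalg.solve(Psi, diff)
        norm = (2*np.pi)**(m/2) * np.sqrt(np.linalg.det(Psi))
        return np.exp(-0.5 * quad) / norm

    p = gaussian_pdf(Xw[0], Mu, Psi)     # class-conditional density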

CSE 573 Fa2010 Peter Scott 08-24


 

Unsupervised (clustering) techniques can also be used

on the pixel feature vectors to segment them into image

regions whose feature values are similar.
 
 

Once individual pixels are labelled, post-processing

techniques can be used to complete the image analysis.

For instance, using a rule-based system we may decide

that small regions of corn completely contained within

larger regions of soybeans should be relabelled soybeans.
 
 

CSE 573 Fa2010 Peter Scott 08-25


 

An interesting algorithm due to Kittler and Foglein uses

contextual information as the input for post-processing.

The context of a pixel is defined as the set of values of

the pixels in a neighborhood surrounding it. The basic

idea is to label a pixel by the class label wr with the

highest posterior probability given the feature vectors

of that pixel and those of all pixels in its context.
 

    1. Define a mask which specifies the shape and

       size of the neighborhood for each pixel.

    2. For each image pixel, find the context feature

       vector psi by concatenating all features of all

       pixels in its neighborhood.

    3. Using the training dataset, estimate the class-

       conditional probabilities p(psi|wr) and prior

       probabilities for each wr.

    4. Use Bayes' Law to determine which posterior

       probability p(wr|psi) is greatest. Label the

       pixel at the origin of the mask accordingly.
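
A hedged sketch of steps 1-4; the mask is given as (dy, dx)
offsets, and each entry of class_models is assumed to evaluate
an estimated p(psi|wr), e.g. the Gaussian form above:

    import numpy as np

    def classify_contextual(img, offsets, class_models, priors):
        """img: H x W gray image; offsets: the mask as (dy, dx) pairs."""
        H, W = img.shape
        labels = np.zeros((H, W), dtype=int)
        for y in range(H):
            for x in range(W):
                # Step 2: context vector psi = concatenated mask values
                # (border pixels clamped for simplicity).
                psi = np.array([img[min(max(y + dy, 0), H - 1),
                                    min(max(x + dx, 0), W - 1)]
                                for dy, dx in offsets])
                # Step 4: argmax of p(psi|wr) p(wr), which by Bayes' Law
                # is proportional to the posterior p(wr|psi).
                scores = [model(psi) * prior
                          for model, prior in zip(class_models, priors)]
                labels[y, x] = int(np.argmax(scores))
        return labels

With offsets = [(0,-3), (0,-2), (0,-1), (0,0), (0,1), (0,2), (0,3)]
the context is a 7-pixel horizontal strip, matching the
horizontally biased example that follows.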
 

CSE 573 Fa2010 Peter Scott 08-26


 

Eg: Let's use a mask which biases contexts in the

    horizontal direction, marking the pixel in the mask

    whose context is being defined as x0 (the mask and a

    sample image corner are shown on the slide).

    Assume in this example pixels are specified by a

    scalar gray value in the range 0 to 255. Then the

    context feature vector psi(1,2) is the vector

    [55 3 21 72 78 26 87]^T, while

    psi(2,4) = [77 87 83 82 90 78 36]^T, etc.
 

CSE 573 Fa2010 Peter Scott 08-27


 
 

    Assume there are just two classes here, w1 and w2.

    We can gather all the w1 training data and compute the

    sample mean mu1 and sample covariance matrix Psi1, and,

    assuming Gaussian statistics, we have p(psi|w1) from

    the general Gaussian form (with n=7). Repeat for w2,

    yielding p(psi|w2). A count of the training data in

    each class gives estimates p(w1), p(w2) (eg. if there

    are 100 w1 and 200 w2 training data, we estimate

    p(w1)=1/3, p(w2)=2/3).
 
 

CSE 573 Fa2010 Peter Scott 08-28


 
 

    Then to label pixel (1,2), we just compare the products

        p([55 3 21 72 78 26 87]T | w1) p(w1)

        p([55 3 21 72 78 26 87]T | w2) p(w2)

    and choose the label which agrees with the larger. To

    label pixel (2,4), compare

        p([77 87 83 82 90 78 36]T | w1) p(w1)

        p([77 87 83 82 90 78 36]T | w2) p(w2)
 
 

CSE 573 Fa2010 Peter Scott 08-29

Scene labelling and constraint propagation
 

Image understanding is the task of extracting semantic

and contextual information from a scene. That is,

labelling objects and determining the significance

of their arrangement in the scene.
 
 

Image understanding differs from object recognition

in that object recognition uses the properties

of individual objects to do labelling, while ignoring

context. Image understanding requires labelling all

objects in the scene consistently, that is, making

sure the labels agree with both the properties of

individual objects and the permissible relationships

between labelled object classes. Specified properties

and relationships are constraints that can be applied

to the labels.
 
 

A choice of object labels for the objects in a scene

that does not violate any property or relationship

constraint is said to be consistent. Scene labelling

is the discovery of a consistent set of labels for

all objects in a scene.
 

CSE 573 Fa2010 Peter Scott 08-30

 
The scene labelling problem:


Given:

 

    1. Set of objects in the scene R={Ri,i=1..N}

       with specified properties and relationships

    2. Set of labels O={Oj,j=1..M}

    3. Set of unary relations (properties) associated

       with each label Pj={pj1,...,pjJ, j=1...M}

    4. Set of binary or m-ary relations associated

       with pairs or m-tuples of labels,
 

find a consistent labelling, that is, a labelling of

each object in R such that the labels agree with all

the specified properties and relationships.
 

    CSE 573 Fa2010 Peter Scott 08-31

Eg: Scene with:

        1. 9 objects R1..R9

            R1, R4, R7, R8, R9 are all oval, rest

            circular. R3 is in R2 is in R1; R6 is

            in R5 is in R4. R8 is below R7. R7 is

            between R1 and R4.
 

        2. Set of labels: O1=face, O2=mouth, O3=nose,

           O4=eye, O5=iris, O6=pupil.

        3. Set of properties: P1={oval, square},

           P2={oval, horizontal}, P3={oval, vertical},

           P4={oval}, P5={circular}, P6={circular, dark}.

        4. Set of relations: (Ok in O1) for k=2..6,

           (O6 in O5), (O5 in O4), (O2 below O3),

           (O3 on centerline of O1),

           (O3 between O4 and O4).

CSE 573 Fa2010 Peter Scott 08-32

One approach to scene labelling is discrete relaxation.

    1. Assign all labels to each object which agree with

       the object properties (unary relations).

    2. Set i,j=1. For each object i=1,N,

          a. For assigned label j, check consistency

             with all relations.

          b. If inconsistent with one or more, discard

             that label.

          c. If no labels remain, stop.

          d. Repeat 2a., 2b. for all j.

    3. Repeat 2. for all i.

    4. Repeat 2.-3. until no further changes in label

       sets for any object.
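
A compact sketch of this procedure; unary_ok(i, l) and
binary_ok(i, l, j, m) are placeholders for the domain's
property and relationship checks:

    def discrete_relaxation(objects, labels, unary_ok, binary_ok):
        # Step 1: every label consistent with the unary relations.
        L = {i: {l for l in labels if unary_ok(i, l)} for i in objects}
        changed = True
        while changed:                   # step 4: iterate to a fixed point
            changed = False
            for i in objects:            # steps 2-3: sweep objects, labels
                for l in list(L[i]):
                    # Keep l only if every other object retains some
                    # label compatible with it.
                    ok = all(any(binary_ok(i, l, j, m) for m in L[j])
                             for j in objects if j != i)
                    if not ok:
                        L[i].discard(l)
                        changed = True
                if not L[i]:
                    return None          # no consistent labelling exists
        return L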

CSE 573 Fa2010 Peter Scott 08-33

Eg (continued):

    1. Check unary relations:

       R1: face|mouth|nose|eye
       R2: iris|pupil
       R3: iris|pupil
       R4: face|mouth|nose|eye
       R5: iris|pupil
       R6: iris|pupil
       R7: face|mouth|nose|eye
       R8: face|mouth|nose|eye
       R9: face|mouth|nose|eye

    2. Check other relations:

       R1: R2 in R1 => face|eye
           R7 between R1 and R4 => eye
       R2: R3 in R2 => iris
       R3: R3 in R2 => pupil
       R4: R5 in R4 => face|eye
           R7 between R1 and R4 => eye
       R5: R6 in R5 => iris
       R6: R6 in R5 => pupil
       R7: R7 between R1 and R4 => nose
       R8: R8 below R7 => mouth
       R9: face|mouth|nose|eye
 

CSE 573 Fa2010 Peter Scott 08-34


 
 
 

    There are no empty label sets so a consistent

    labelling is still possible. Repeat the check

    against all non-unary relations. Find no change

    so stop. We have a unique labelling of everything

    except the face-object. We need one more relation in

    our knowledge base concerning the found objects

    in this image, such as "R7 is in R9."
 
 

CSE 573 Fa2010 Peter Scott 08-35


 

Sometimes it can take many iterations for constraints

to propagate from local label eliminations to more

distant label eliminations. For instance, the chain "if

this is not a bird then that is not a wing, and then

that is not a feather" may take three iterations if the

bird object is checked last and the wing object next

to last.
 

Discrete relaxation results in consistent labelling

wherever that is possible. Sometimes, however, due to

occlusion, noise, error in computing object properties,

etc. the only consistent labelling is quite unlikely to

be correct. Sometimes we would prefer an inconsistent

but more likely labelling.
 

CSE 573 Fa2010 Peter Scott 08-36


 

Eg: Analyzing an old photograph taken in Antarctica

    in 1910. There is an object in the sky, about as

    large as the sun, but with property "oval." The

    discrete relaxed solution would be to label the

    object as a blimp. But how likely is that? Looking

    at the photo, we are more likely to interpret the

    object as the sun blurred by movement of the camera

    during the exposure, considering the oval property

    to be an error. This would be an example of

    probabilistic rather than discrete relaxation.
 
 

CSE 573 Fa2010 Peter Scott 08-37


 

Probabilistic relaxation
 

Define the support qj for a labelling ti=wk of object Ri

due to an interacting object Rj with label tj as
 

    qj(ti=wk) = sum_over_l{r(ti=wk,tj=wl) P(tj=wl)}
 

The total support for ti of Ri is
 

    Q(ti=wk) = sum_over_j{cij*qj(ti=wk)}
 

where {cij} are the strength of interaction coefficients.

We iteratively relax the "probabilities" P(ti=wk) using
 

  P(ti=wk) <= Q(ti=wk)*P(ti=wk)/sum_over_l{Q(ti=wl)*P(ti=wl)}

where <= is the replacement operator.
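
A sketch of one relaxation pass in numpy; P[i, k] holds P(ti=wk),
r[i, k, j, l] the compatibilities r(ti=wk, tj=wl), and c[i, j]
the interaction strengths (random toy values here):

    import numpy as np

    def relax_step(P, r, c):
        # q[i, j, k] = sum_l r[i, k, j, l] P[j, l]: support from R_j.
        q = np.einsum('ikjl,jl->ijk', r, P)
        Q = np.einsum('ij,ijk->ik', c, q)      # total support Q(ti=wk)
        new = Q * P                            # reweight current beliefs
        return new / new.sum(axis=1, keepdims=True)   # renormalize

    rng = np.random.default_rng(4)             # toy: 3 objects, 2 labels
    P = rng.random((3, 2)); P /= P.sum(axis=1, keepdims=True)
    r = rng.random((3, 2, 3, 2))               # r[i, k, j, l]
    c = rng.random((3, 3)); np.fill_diagonal(c, 0.0)  # no self-support
    for _ in range(20):
        P = relax_step(P, r, c)                # iterate until stable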

 
 

CSE 573 Fa2010 Peter Scott 08-38


 

There are many variations of this basic relaxation

procedure (see text p. 462-3). They all start by setting

initial label probabilities for each object in the scene

by determining image object unary and m-ary relations,

then estimating the answers to questions such as
 

    How likely is an iris to be square?

    How likely is a UFO to appear in this old photo?
 

The label probabilities are then all iteratively changed

(relaxed) using a rule such as the above. When no further

changes occur, the most likely labels for each object

are accepted, regardless of their consistency.
 
 

CSE 573 Fa2010 Peter Scott 08-39


 

Finally, relaxation can be avoided by using an

interpretation tree. This is typically used in

conjunction with depth-first search with

backtracking to find a consistent labelling.
 

Eg: Scene has three objects, each may be an aircraft,

    bird or cloud.


 

    CSE 573 Fa2010 Peter Scott 08-40



 
 
 

    Construct interpretation tree. For N objects and

    M labels, tree has N+1 levels, each node has M

    descendants, for a total of M^N interpretations (leaves).

    Working left to right, from root node descend to

    first consistent label (no violation of unary

    property for object 1). Then descend to first

    consistent label from that node (no violation of

    any binary property involving objects 1 and 2).

    Continue until leaf reached or no consistent

    label on next level. If leaf, you are done. If

    no consistent label, backtrack one level and

    repeat. If no new unexplored paths, declare

    failure. Comment: M^N is a lot of leaves!
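
A sketch of this depth-first search with backtracking; the toy
consistency predicates at the bottom are stand-ins for the real
unary and binary checks:

    def interpret(objects, labels, unary_ok, binary_ok, partial=None):
        partial = partial or []
        i = len(partial)                 # index of next object to label
        if i == len(objects):
            return partial               # leaf reached: done
        for l in labels:                 # descend to consistent labels
            if not unary_ok(objects[i], l):
                continue
            if all(binary_ok(objects[j], partial[j], objects[i], l)
                   for j in range(i)):
                found = interpret(objects, labels, unary_ok, binary_ok,
                                  partial + [l])
                if found is not None:
                    return found         # success propagates upward
        return None                      # dead end: backtrack

    # Toy run with the slide's labels; here no two objects may share one.
    labels = ['aircraft', 'bird', 'cloud']
    print(interpret([0, 1, 2], labels,
                    unary_ok=lambda i, l: True,
                    binary_ok=lambda i, li, j, lj: li != lj))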


 
 

CSE 573 Fa2010 Peter Scott 08-41


 
 
 
   

CSE473/573 Fall 2010

Summary of Topics for Final Exam




In the first half of the semester we worked on low-level

(image processing) steps in the image processing -

computer vision paradigm. In the second half we

concentrated on higher-level (computer vision)

steps in which the mappings were between abstract

data structures, not pixel brightness maps, and task

definitions and domain knowledge were increasingly

important.
 

CSE573 Fa2010 FinRev-01


Chapter 11,
Morphology: sections 4-7 only

  How can morphology be applied to analyze images?


   
Gray level morphology

    Skeletons

    Granulometry

CSE573 Fa2010 FinRev-02


Chapter 6: Shape representation and description

 

  How do we "write down" the shape of a segment,

  region or object so we can perform object

  recognition?
 

    Region labelling

    Contour-based representations and descriptors

        Chain code

        Geometric descriptors: perimeter, area, etc.

        Sequences: syntactic, polygonal, etc.

        Scale space, B-splines

        Invariants

    Region-based representations and descriptors

        Scalar descriptors: area, Euler number, etc.

        Moments

        Convex hull, extreme points

        Skeleton graphs

        Geometric primitives
 

CSE573 Fa2010 FinRev-03


 

Chapter 7: Object recognition
 

  Given the shape description of an object and a

  database of domain knowledge of object classes

  and their characteristics, object recognition is

  the assignment of a class label to the object.
 

    Knowledge representation: parameter vectors,

        probability fns, graphs, etc.

    Object recognition by analysis

        Nearest neighbor classification

        Statistical pattern recognition

            Decision boundaries, discrimination fns.

            Bayes Law

            Minimum error classification

            Minimum-average-loss classification

            Learning of statistical parameters

        Neural nets

            Artificial neurons

            Learning as search over weight space

            Gradient descent, backpropagation
 

CSE573 Fa2010 FinRev-04


 

  Object recognition by synthesis

        Syntactic (grammatical) classification

            Grammars

            Regular grammars and finite state auto.

            Parsing regular grammars

        Graph matching

            (Sub-)graph isomorphisms: node assoc'n fn

            The assignment graph: maximal cliques
 
 

CSE573 Fa2010 FinRev-05


Chapter 8: Image understanding

 

  Extract the information from the image which you

  need. Recognized objects must be related and

  characterized, ie. the scene contents semantically

  labelled.

 
    Control strategies   

    Contextual classification

    Constraint propagation

        Discrete relaxation

        Probabilistic relaxation

        Interpretation tree
 

CSE573 Fa2010 FinRev-06