Chapter 8: Image Understanding
Recalling the image processing - computer vision
processing chain discussed earlier this semester, we
have arrived at the final step: image understanding.
Image understanding is the task of extracting complete
semantic and contextual information from a scene. The
information required is usually task-dependent.
Eg: Image understanding for an air-to-air missile
would require object classification: high-performance
fighter, possibly detecting the national insignia.
Image understanding for a "scientifically aware"
machine vision system would further require detecting
the plume and interpreting it as signalling that the
aircraft has just broken the sound barrier.
Image understanding generally requires the detection and
classification of all significant objects in the scene,
and semantic labelling of their relationships. This scene
data, together with domain knowledge, is then used to
derive the image understanding.
Note: semantic labelling is labelling an object or image
part by its meaning, eg. classifying an object, or
noting an object to be a threat. Index labelling is
labelling an object or image part by an index code
without meaning, eg. enumerating connected segments.
Semantic labelling of objects and relationships is the
key step. The methods for doing semantic labelling include
all the methods discussed in Chapter 7: Object Recognition
plus some more: rule-based systems, semantic networks,
reinforcement learning systems, etc. In this chapter we
will survey some of these methods as applied to image
understanding.
Both image data and domain knowledge tend to be large,
weakly structured databases. So the control strategy,
that is, the order in which datasets from these databases
are selected, referenced and accessed, is very important
to the usefulness of an image understanding scheme.
Parallel and serial control schemes: often, tasks
can be performed in parallel. If the right hardware
is available to take advantage of parallelization,
efficiency can be improved.
Eg: Divide the scene into regions of interest, search
each ROI in parallel for interesting objects.
Sometimes, however, the algorithm is inherently serial.
Eg: Find the likeliest candidate to match a
desired object. Accept it or, if rejected, move to the
next most likely.
Bottom-up control is a sequence of transformations
beginning with the original image data, at each step
using more compact representations of image attributes
until desired semantics can be recognized. This
corresponds to a natural process of summarizing data
on multiple levels.
Eg: Pixels -> connected segments -> image regions
-> objects -> object arrangements -> understanding.
Bottom-up control is said to be data-driven, since the
decisions at each step are governed by the image data itself.
Bottom-up control moves up through levels of increasing
coarseness (decreasing resolution) until matching metrics
can be applied and the scene understood.
Top-down control is the reverse: a sequence of transformations
beginning with the most highly summarized image
representations available, and descending through
representations containing increasing image detail.
At each step, image structures are matched against a
library of model-based template structures of the same
coarseness (level of resolution) to determine likely
candidates for further descent and investigation.
At each stage of descent, sub-goals are set and
executed. Top-down control is goal-oriented.
Eg: Suppose we want to find all roads in an aerial
image. Bottom-up, we perform edge enhancement
on the original image, then edge thinning and
connection, then find all linear features of
a minimum length, then connect them with angle
constraints, etc. Top-down, we take a low-resolution
version of the image, find all linear features, and
search each one of those at increasing resolution to
verify or falsify the road hypothesis.
Top-down control is model-based (model-driven). Decisions
are governed at every stage by the model attributes we
are interested in finding in the image. In the case of
bottom-up, data-driven algorithms, domain knowledge
becomes significant only in the final stages.
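To make the contrast concrete, here is a minimal Python/NumPy sketch of top-down control for the road example, under assumptions of my own: brightness values in [0, 1], a crude block-average pyramid, and placeholder detectors find_linear_features and verify_hypothesis standing in for whatever feature detectors a real system would use. Only the control flow (hypothesize at coarse resolution, verify at full resolution) is the point.

import numpy as np

def downsample(img, factor):
    """Crude block-average downsampling (stand-in for a proper image pyramid)."""
    h, w = img.shape
    h2, w2 = h // factor * factor, w // factor * factor
    return img[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor).mean(axis=(1, 3))

def find_linear_features(img, thresh):
    """Hypothetical detector: return candidate line segments ((r0, c0), (r1, c1)).
    Here: a trivial placeholder that proposes bright rows as horizontal segments."""
    rows = np.where(img.mean(axis=1) > thresh)[0]
    return [((r, 0), (r, img.shape[1] - 1)) for r in rows]

def verify_hypothesis(img, seg, factor, thresh):
    """Top-down step: map a coarse-level hypothesis back to the full-resolution
    image and test it there (here: mean brightness along the scaled segment)."""
    (r0, c0), (r1, c1) = seg
    r0, c0, r1, c1 = (factor * v for v in (r0, c0, r1, c1))
    n = max(abs(r1 - r0), abs(c1 - c0)) + 1
    rr = np.linspace(r0, r1, n).astype(int)
    cc = np.linspace(c0, c1, n).astype(int)
    return img[rr, cc].mean() > thresh

def top_down_roads(img, factor=4, thresh=0.5):
    """Goal-driven control: hypothesize roads at coarse resolution,
    keep only hypotheses that survive verification at full resolution."""
    coarse = downsample(img, factor)
    hypotheses = find_linear_features(coarse, thresh)
    return [seg for seg in hypotheses if verify_hypothesis(img, seg, factor, thresh)]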
Other control strategies include hybrids incorporating
both bottom-up and top-down stages (see for instance
the coronary artery border detection algorithm on p. 371),
and non-hierarchical approaches such as the blackboard
control scheme. The blackboard is an open message-passing
center in which task results can be written and read, and
through which autonomous agents called daemons can
communicate. For instance, there can be an edge-enhancing
daemon and a segmentation daemon, along with daemons whose
job is to identify the existence of specific states or
objects in the scene.
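A toy sketch (not from the text) of the blackboard idea in Python: the blackboard is a shared dictionary of named results, and each daemon fires when its inputs are present, until quiescence. The daemons here only pass strings around; real daemons would compute edge maps, segmentations, and so on.

# Minimal blackboard sketch: the blackboard is a shared dict of named results,
# and each daemon declares what it needs and what it produces.
def edge_daemon(bb):
    if "image" in bb and "edges" not in bb:
        bb["edges"] = "edge map computed from " + bb["image"]

def segment_daemon(bb):
    if "edges" in bb and "segments" not in bb:
        bb["segments"] = "regions grown from " + bb["edges"]

def object_daemon(bb):
    if "segments" in bb and "objects" not in bb:
        bb["objects"] = "objects recognized in " + bb["segments"]

def run_blackboard(bb, daemons, max_rounds=10):
    """Fire daemons repeatedly until no one writes anything new (quiescence)."""
    for _ in range(max_rounds):
        before = len(bb)
        for d in daemons:
            d(bb)
        if len(bb) == before:      # nothing new was posted this round
            break
    return bb

blackboard = {"image": "original image"}
# daemons can be registered in any order; the data dependencies sort themselves out
run_blackboard(blackboard, [object_daemon, segment_daemon, edge_daemon])
print(blackboard)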
One class of model-driven control strategies is based
on active contour models, models of bounding contours
of regions whose shapes can be incrementally adjusted to
fit the image data. Active contour models which adjust
their shape so as to minimize a "shape energy"
functional are called snakes.
Eg: Snake model for the outline of a face. Let the
shape energy be the sum of a term penalizing the snake
for regions of sharp bending, and a term rewarding the
snake for moving to pixels of high gradient. Convergence
to a local minimum then proceeds in successive steps,
the snake closing in on the high-gradient face outline.
The exact form of the shape energy functional is
task-dependent, but it often consists of a sum of terms
reflecting three kinds of considerations:
1. Internal energy: we typically prefer smoother,
straighter snakes to those with kinks, sharp
curvatures, and jump discontinuities.
2. Image forces: we may want to reward snakes that
tend to move to bright areas (dark areas),
that tend to track image edges, or move to
terminations (end pixels of lines or regions).
3. Constraint forces: require the snake to not be
longer than a certain length, not enclose more
than a certain area, not be nearer a given image
feature than a certain distance, etc.
For instance, if we wished the snake to be no longer
than l_0 in length, we could include a constraint term
of the form

    max(0, [l(s) - l_0]^3)

so any snake with length greater than l_0 is sharply
penalized.
Let v(s) = [x(s), y(s)] be a point along a snake, for
s in [0,1]. s = 0 corresponds to an arbitrary starting
point along the snake, and a complete circuit of the
snake is made as s ranges from 0 to 1, with v(0) = v(1).
Expressing the energy functional as an integral over the
snake,

    E*_snake = integral from 0 to 1 of
               [E_int(v(s)) + E_image(v(s)) + E_con(v(s))] ds

where E_int is the internal energy, E_image the image
force, and E_con the constraint force. The internal
energy is commonly written

    E_int = alpha(s) |dv/ds|^2 + beta(s) |d^2v/ds^2|^2

The first term penalizes stretching, that is, rapid change
of v(s) with respect to movement of the parameter s along
the snake, and is called the elasticity term. The second
penalizes sharp bends or high curvature, and is the
stiffness term.
The image force E_image can take many task-dependent forms.
A common generic form is to express it as a linear
combination of line, edge and termination terms

    E_image = w_line E_line + w_edge E_edge + w_term E_term

If we wish to attract the snake to existing edges, for
instance, we can use the specific form

    E_edge = -| grad f(x(s), y(s)) |^2
Forms for the other terms are given in the text on p376.
Using the calculus of variations, a necessary condition
for the minimal-energy snake, called the Euler-Lagrange
condition, can be derived. This is a partial differential
equation which can be solved iteratively using standard
PDE methods.
Alternatively, the first variation (denoted dE*_snake) of
E*_snake with respect to a specific variation dv(s) of the
function v(s) can be computed using standard variational
arguments. dE*_snake tells us how much E*_snake will change
if we make a small change dv(s) in v(s). The first
variation is the function-space equivalent of the first
derivative. Selecting the dv(s) which minimizes the first
variation dE*_snake, we have a steepest descent algorithm
for snake construction.
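As an illustration of the energy terms above, here is a sketch of a simple greedy discrete snake in Python/NumPy. This is not the variational (Euler-Lagrange or steepest descent) solution discussed in the text; it merely moves each snake point within a small window to whichever position lowers the combined elasticity, stiffness and edge energy. It assumes the image is a grayscale NumPy array and the snake is an N x 2 array of (row, col) points.

import numpy as np

def gradient_magnitude_sq(img):
    """|grad f|^2 by central differences; -|grad f|^2 serves as E_edge."""
    gy, gx = np.gradient(img.astype(float))
    return gx**2 + gy**2

def snake_step(pts, grad2, alpha=0.1, beta=0.1, w_edge=1.0, radius=1):
    """One greedy pass: move each snake point within a small window to the
    position of lowest local energy (elasticity + stiffness + edge term)."""
    h, w = grad2.shape
    new_pts = pts.copy()
    n = len(pts)
    for i in range(n):
        prev_pt, next_pt = new_pts[(i - 1) % n], pts[(i + 1) % n]
        best, best_e = pts[i], np.inf
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                cand = pts[i] + np.array([dy, dx])
                y, x = int(cand[0]), int(cand[1])
                if not (0 <= y < h and 0 <= x < w):
                    continue
                e_elastic = alpha * np.sum((cand - prev_pt) ** 2)      # stretching
                e_stiff = beta * np.sum((prev_pt - 2 * cand + next_pt) ** 2)  # bending
                e_image = -w_edge * grad2[y, x]                        # reward high gradient
                e = e_elastic + e_stiff + e_image
                if e < best_e:
                    best, best_e = cand, e
        new_pts[i] = best
    return new_pts

def run_snake(img, pts, iters=100, **kw):
    """Iterate greedy passes until the snake stops moving (a local minimum)."""
    pts = np.asarray(pts)
    grad2 = gradient_magnitude_sq(img)
    for _ in range(iters):
        nxt = snake_step(pts, grad2, **kw)
        if np.array_equal(nxt, pts):
            break
        pts = nxt
    return pts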
Snake growing is an alternative procedure for finding the
best snake in the minimum-energy sense. It substitutes
many "easy" or small, well-conditioned snake problems for
a single "hard," badly-conditioned one. The
procedure is:
1. Get an initial snake.
2. Measure the energy density (integrand of Esnake*)
along the snake and eliminate areas of high density.
3. Grow the remaining segments by extending them along
their natural tangents.
4. Solve the snake problem for each extended segment.
5. Merge the snake segments.
Point Distribution Models
The idea is to model a class of objects by the
locations of a set of points along its boundary,
together with the observed variations of this
point set over a training dataset. Then when a new
instance of this object class is detected, its set
of boundary points is found by fitting the model's
boundary point set (the PDM) to the new instance.
The fitted points are then connected to form the
estimate of the object boundary.
The first step in this shape description process is to
compute the PDM for the object class. The PDM consists
of two things: the vector of average locations of the
boundary points, and a set of vectors which describe
the most common variations of the boundary points from
the average, ie. the set of eigenvectors.
For each of M object images in our training dataset,
mark N "landmark" points around the boundary. Let
{(x_ij, y_ij), i = 1..M, j = 1..N} be this set of points.
Eg: Training set for PDM for handwritten "T"
To construct our PDM from the training dataset, first
collect the landmarks of the i-th image into a vector
x_i = (x_i1, y_i1, ..., x_iN, y_iN)^T. The affine
transformation that best matches x_i with x_1 is found
for each i. The average of the M aligned training atoms
is computed and is then aligned with x_1. Each of the
aligned atoms is then realigned to match this mean, the
mean is recomputed, and the process iterated. When
converged, we have the aligned training set
{x_hat_1, ..., x_hat_M}.
Computing the average of the aligned training set,

    x_bar = (1/M)(x_hat_1 + x_hat_2 + ... + x_hat_M)

x_bar is the expected shape of the object. But the
object class shows variations from instance to instance
which have to be modelled as well. Let

    dx_i = x_hat_i - x_bar

be the vector of variations or "delta" of the i-th
training atom from the mean point set. Then the
2N x 2N covariance matrix of the training dataset can
be estimated as the average outer product of the deltas,

    S = (1/M)(dx_1 dx_1^T + dx_2 dx_2^T + ... + dx_M dx_M^T)
Now S shows us the expected variation within the class.
In directions in which S is large, there is a lot of
variability in the training dataset, while in directions
in which S is small, there is little variation.
This information is captured in the set of 2N ordered
eigenvalues and eigenvectors of S,

    S p_i = lam_i p_i

where lam_1 >= lam_2 >= ... >= lam_2N are the eigenvalues
and {p_i, i = 1..2N} are the associated eigenvectors.
Eg: Consider a 2-D random vector x with mean mu and
covariance matrix S. The eigenvectors of S point along
the principal axes of the scatter of x about mu, and the
eigenvalues give the variance of x along each of those axes.
Any vector x can be represented as a linear combination
of the eigenvectors, with those corresponding to the
largest eigenvalues on average carrying the most weight.
The lower-indexed eigenvectors correspond to directions
of frequent distortion of the model class from its
mean PDM vector x_bar, the higher ones to minor
variations and noise. So the principal components
are determined by truncating the set of eigenvectors
at t << 2N, and the resulting fit is

    x ~= x_bar + b_1 p_1 + b_2 p_2 + ... + b_t p_t

where the b_i are weight coefficients which tell how much
of each of the "modes of variation" p_1, ..., p_t has been
detected in a given instance of the object.
For details of the calculation of the bi, see
sec 8.3.
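A compact sketch of the PDM computation in Python/NumPy, assuming the training shapes have already been aligned as described above and stacked as rows (x1, y1, ..., xN, yN). It forms x_bar and the covariance S, keeps the t largest eigenvectors as the modes of variation, and projects a new shape onto the model to obtain the b_i.

import numpy as np

def build_pdm(shapes, t):
    """Point distribution model from M aligned training shapes.

    shapes : array of shape (M, 2N), each row an aligned landmark vector
    t      : number of modes of variation to keep (t << 2N)
    Returns the mean shape x_bar, the 2N x t matrix P of principal modes,
    and their eigenvalues.
    """
    shapes = np.asarray(shapes, dtype=float)
    x_bar = shapes.mean(axis=0)
    dx = shapes - x_bar                       # deltas from the mean shape
    S = dx.T @ dx / len(shapes)               # 2N x 2N covariance estimate
    lam, P = np.linalg.eigh(S)                # eigh: S is symmetric
    order = np.argsort(lam)[::-1]             # sort by decreasing eigenvalue
    return x_bar, P[:, order[:t]], lam[order[:t]]

def fit_pdm(x, x_bar, P):
    """Project a new shape onto the model: b = P^T (x - x_bar);
    the reconstruction is x ~= x_bar + P b."""
    b = P.T @ (x - x_bar)
    return b, x_bar + P @ b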
Pattern recognition methods in image understanding
Object recognition from feature vectors was discussed in
Ch. 7. If we assign a feature vector to each pixel, the
same minimum distance and statistical techniques can be
used to answer image understanding questions
as well.
Eg: Supervised classification of multi-spectral data.
Observing the same patch of the Earth's surface
from a satellite using multiple images taken at
m different wavelengths (IR, UV, visible), let
X(i,j) be, for each pixel location (i,j), an
m-vector of brightness values. Suppose we have a
training dataset of m-vectors corresponding to the
classes wheat, corn, soybeans, trees, water, grass.
We can do minimum distance classification of each
pixel, then postprocess with a sequence of closing
morphological operations to eliminate scattered
mislabellings.
Eg (continued): Each class of the training dataset can
be used to estimate the class-conditional
probabilities p(X|wheat), p(X|corn), etc. For
instance, for each class we can assume Gaussian
statistics and take the sample mean as the estimated
Gaussian mean vector and the sample covariance (the
mean outer product of the deviations from the mean)
as the estimated covariance matrix:

    Mu  = (1/K)(X_1 + ... + X_K)
    Psi = (1/K)[(X_1 - Mu)(X_1 - Mu)^T + ... + (X_K - Mu)(X_K - Mu)^T]

where there are K training atoms for this class.
The required class-conditional probabilities
p(X|wheat), etc. then take on the usual m-variate
Gaussian form

    p(X|class) = (2 pi)^(-m/2) |Psi|^(-1/2)
                 exp(-(1/2)(X - Mu)^T Psi^(-1) (X - Mu))
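A sketch of this per-pixel Gaussian classification in Python/NumPy. The dict-of-training-arrays interface and the equal priors (so that maximum posterior reduces to maximum likelihood) are my assumptions for illustration; the morphological post-processing step is omitted.

import numpy as np

def fit_gaussian(X):
    """Sample mean and covariance for one class; X is (K, m): K training
    pixels, m spectral bands."""
    mu = X.mean(axis=0)
    d = X - mu
    psi = d.T @ d / len(X)
    return mu, psi

def log_gaussian(X, mu, psi):
    """log p(X | class) under an m-variate Gaussian, for each row of X."""
    m = len(mu)
    d = X - mu
    inv = np.linalg.inv(psi)
    maha = np.einsum("ij,jk,ik->i", d, inv, d)    # (X-Mu)^T Psi^-1 (X-Mu)
    _, logdet = np.linalg.slogdet(psi)
    return -0.5 * (m * np.log(2 * np.pi) + logdet + maha)

def classify_pixels(image, classes):
    """image: (H, W, m) multispectral cube; classes: dict name -> (K, m)
    training pixels. Returns an (H, W) array of class indices (equal priors,
    so maximum likelihood)."""
    H, W, m = image.shape
    X = image.reshape(-1, m).astype(float)
    scores = np.stack([log_gaussian(X, *fit_gaussian(T)) for T in classes.values()])
    return scores.argmax(axis=0).reshape(H, W)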
Unsupervised (clustering) techniques can also be used
on the pixel feature vectors to segment them into image
regions whose feature values are similar.
Once individual pixels are labelled, post-processing
techniques can be used to complete the image analysis.
For instance, using a rule-based system we may decide
that small regions of corn completely contained within
larger regions of soybeans should be
relabelled soybeans.
An interesting algorithm due to Kittler and Foglein uses
contextual information as the input for post-processing.
The context of a pixel is defined as the set of values of
the pixels in a neighborhood surrounding it. The basic
idea is to label a pixel by the class label wr with the
highest posterior probability given the feature vectors
of that pixel and those of all pixels in its
context.
1. Define a mask which specifies the shape and
size of the neighborhood for each pixel.
2. For each image pixel, find the context feature
vector psi by concatenating all features of all
pixels in its neighborhood.
3. Using the training dataset, estimate the class-
conditional probabilities p(psi|wr) and prior
probabilities for each wr.
4. Use Bayes' Law to determine which posterior
probability p(wr|psi) is greatest. Label the pixel
at the origin of the mask accordingly.
Eg: Let's use a mask which biases contexts in the
horizontal direction. Mark the pixel in the mask
whose context is being defined as x0.
Assume in this example that pixels are specified by a
scalar gray value in the range 0 to 255, and suppose
the upper left-hand corner of the image has the gray
values shown in the accompanying figure. Then the
context feature vector psi(1,2) is the vector
[55 3 21 72 78 26 87]^T, while
psi(2,4) = [77 87 83 82 90 78 36]^T, etc.
Assume there are just two classes here, w1 and w2.
We can gather all the w1 training data and compute the
sample mean mu1 and sample covariance matrix Psi1, and,
assuming Gaussian statistics, we have p(psi|w1) from
the general Gaussian form (with n = 7). Repeat for w2,
yielding p(psi|w2). A count of the training data in
each class gives estimates of the priors p(w1), p(w2)
(eg. if there are 100 w1 and 200 w2 training data, we
estimate p(w1) = 1/3, p(w2) = 2/3).
Then to label pixel (1,2), we just compare the products

    p([55 3 21 72 78 26 87]^T | w1) p(w1)
    p([55 3 21 72 78 26 87]^T | w2) p(w2)

and choose the label which agrees with the larger. To
label pixel (2,4), compare

    p([77 87 83 82 90 78 36]^T | w1) p(w1)
    p([77 87 83 82 90 78 36]^T | w2) p(w2)
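A sketch of this contextual labelling rule in Python/NumPy. The horizontally biased mask offsets are my guess at a 7-pixel mask like the one in the example, and the Gaussian parameters and priors are assumed to have been estimated from training contexts beforehand; border handling is ignored.

import numpy as np

# Mask offsets (row, col) relative to the pixel x0 being labelled --
# a hypothetical horizontally biased 7-pixel mask.
MASK = [(0, 0), (0, -3), (0, -2), (0, -1), (0, 1), (0, 2), (0, 3)]

def context_vector(img, r, c):
    """psi: gray values of all mask pixels, concatenated into one vector.
    Assumes the mask stays inside the image (no border handling)."""
    return np.array([img[r + dr, c + dc] for dr, dc in MASK], dtype=float)

def log_gaussian(x, mu, psi):
    """log of the n-variate Gaussian density at x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(psi)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + d @ np.linalg.inv(psi) @ d)

def label_pixel(img, r, c, params, priors):
    """Bayes rule: pick the class w maximizing p(psi | w) p(w).
    params: dict class -> (mu, covariance); priors: dict class -> p(w)."""
    psi_vec = context_vector(img, r, c)
    scores = {w: log_gaussian(psi_vec, mu, S) + np.log(priors[w])
              for w, (mu, S) in params.items()}
    return max(scores, key=scores.get)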
Scene labelling and constraint propagation
Image understanding is the task of extracting semantic
and contextual information from a scene. That is,
labelling objects and determining the significance of
their arrangement in the scene.
Image understanding differs from object recognition
in that object recognition uses the properties of
individual objects to do labelling, while ignoring
context. Image understanding requires the labelling of
all objects in the scene consistently, that is, making
sure the labels agree with both the properties of the
individual objects and the permissible relationships
between the labelled object classes. Specified
properties and relationships are constraints on the
labels which can be applied.
A choice of labels for the objects in a scene that does
not violate any property or relationship constraint is
said to be consistent. Scene labelling is the discovery
of a consistent set of labels for all objects in a scene.
Given:
1. A set of objects in the scene R = {R_i, i = 1..N}
with specified properties and relationships,
2. A set of labels O = {O_j, j = 1..M},
3. A set of unary relations (properties) associated
with each label, P_j = {p_j1, ..., p_jJ}, j = 1..M,
4. A set of binary or m-ary relations associated with
pairs or m-tuples of labels,
find a consistent labelling, that is, a labelling of
each object in R such that the labels agree with all
the specified properties and relationships.
Eg: Scene with:
1. 9 objects R1..R9. R1, R4, R7, R8 and R9 are oval;
R2, R3, R5 and R6 are circular. R3 is in R2 is in R1;
R6 is in R5 is in R4. R8 is below R7. R7 is between
R1 and R4.
2. Set of labels: O1=face, O2=mouth, O3=nose,
O4=eye, O5=iris, O6=pupil.
3. Set of properties: P1={oval, square},
P2={oval, horizontal}, P3={oval, vertical},
P4={oval}, P5={circular}, P6={circular, dark}.
4. Set of relations: (Ok in O1) for k=2..6,
(O6 in O5), (O5 in O4), (O2 below O3),
(O3 on centerline of O1),
(O3 between O4 and O4).
One approach to scene labelling is discrete relaxation.
1. Assign to each object all labels which agree with
the object's properties (unary relations).
2. For each object i = 1..N and each assigned label j:
a. Check the label's consistency with all relations.
b. If it is inconsistent with one or more, discard
that label.
c. If no labels remain for object i, stop: no
consistent labelling exists.
3. Repeat 2. until there are no further changes in the
label sets of any object.
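A sketch of discrete relaxation in Python. The predicates unary_ok and relation_ok are placeholders for however the task's unary and m-ary relations are actually encoded; only the label-discarding control loop is spelled out.

def discrete_relaxation(objects, labels, unary_ok, relation_ok):
    """Discrete relaxation scene labelling.

    objects     : list of object ids
    labels      : list of candidate labels
    unary_ok(obj, lab)          -> True if lab agrees with obj's properties
    relation_ok(obj, lab, sets) -> True if lab for obj is consistent with at
                                   least one remaining label of every object
                                   it is related to ('sets' maps obj -> set)
    Returns a dict obj -> set of surviving labels, or None if some object
    ends up with an empty label set (no consistent labelling exists).
    """
    # 1. Keep every label consistent with the unary relations.
    sets = {o: {l for l in labels if unary_ok(o, l)} for o in objects}

    changed = True
    while changed:                      # 3. iterate until nothing changes
        changed = False
        for o in objects:               # 2. sweep over objects and labels
            for lab in list(sets[o]):
                if not relation_ok(o, lab, sets):
                    sets[o].discard(lab)
                    changed = True
            if not sets[o]:
                return None             # empty label set: stop
    return sets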
Eg (continued):
1. Check unary relations:
R1: face|mouth|nose|eye
R2: iris|pupil
R3: iris|pupil
R4: face|mouth|nose|eye
R5: iris|pupil
R6: iris|pupil
R7: face|mouth|nose|eye
R8: face|mouth|nose|eye
R9: face|mouth|nose|eye
2. Check the other relations:
R1: R2 in R1 => face|eye; R7 between R1 and R4 => eye
R2: R3 in R2 => iris
R3: R3 in R2 => pupil
R4: R5 in R4 => face|eye; R7 between R1 and R4 => eye
R5: R6 in R5 => iris
R6: R6 in R5 => pupil
R7: R7 between R1 and R4 => nose
R8: R8 below R7 => mouth
R9: face|mouth|nose|eye
There are no empty label sets, so a consistent
labelling is still possible. Repeat the check against
all non-unary relations; finding no change, we stop.
We have a unique labelling of everything except the
face object. We need one more relation in our knowledge
base concerning the found objects in this image, such
as "R7 is in R9."
Sometimes it can take many iterations for constraints
to propagate from local label eliminations to more
distant label eliminations. For instance, "if this is
not a bird then that is not a wing, and then that is
not a feather" may take three iterations if the bird
object is checked last and the wing object next to last.
Discrete relaxation results in a consistent labelling
wherever that is possible. Sometimes, however, due to
occlusion, noise, error in computing object properties,
etc., the only consistent labelling is quite unlikely to
be correct. Sometimes we would prefer an inconsistent
but more likely labelling.
Eg: Analyzing an old photograph taken in Antarctica
in 1910. There is an object in the sky, about as
large as the sun, but with the property "oval." The
discrete relaxed solution would be to label the
object as a blimp. But how likely is that? Looking
at the photo, we are more likely to interpret the
object as the sun blurred by movement of the camera
during the exposure, considering the oval property
to be an error. This would be an example of
probabilistic rather than discrete relaxation.
Probabilistic relaxation
Define the support q_j for a labelling t_i = w_k of object
R_i due to an interacting object R_j with label t_j as

    q_j(t_i = w_k) = sum over l of r(t_i = w_k, t_j = w_l) P(t_j = w_l)

The total support for t_i of R_i is

    Q(t_i = w_k) = sum over j of c_ij q_j(t_i = w_k)

where {c_ij} are the strength-of-interaction coefficients.
We iteratively relax the "probabilities" P(t_i = w_k) using

    P(t_i = w_k) <= Q(t_i = w_k) P(t_i = w_k) /
                    sum over l of [Q(t_i = w_l) P(t_i = w_l)]

where <= is the replacement operator.
There are many variations of this basic relaxation
procedure (see text pp. 462-3). They all start by setting
initial label probabilities for each object in the scene
by determining image object unary and m-ary relations,
then estimating the answers to questions such as:
How likely is an iris to be square?
How likely is a UFO to appear in this old photo?
The label probabilities are then all iteratively changed
(relaxed) using a rule such as the one above. When no
further changes occur, the most likely label for each
object is accepted, regardless of consistency.
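A vectorized sketch of the relaxation update in Python/NumPy, with the simplifying assumption that the compatibility r(w_k, w_l) is the same matrix for every object pair (in general it may depend on the pair i, j).

import numpy as np

def relax_step(P, c, r):
    """One probabilistic relaxation update.

    P : (N, M) current label probabilities, P[i, k] = P(t_i = w_k)
    c : (N, N) interaction strengths c_ij
    r : (M, M) label compatibilities r(w_k, w_l), assumed the same for all pairs
    """
    q = P @ r.T                       # q[j, k] = sum_l r[k, l] * P[j, l]
    Q = c @ q                         # Q[i, k] = sum_j c[i, j] * q[j, k]
    new_P = Q * P
    return new_P / new_P.sum(axis=1, keepdims=True)   # renormalize per object

def relax(P, c, r, iters=100, tol=1e-6):
    """Iterate the update until the probabilities stop changing."""
    for _ in range(iters):
        new_P = relax_step(P, c, r)
        if np.max(np.abs(new_P - P)) < tol:
            break
        P = new_P
    return P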
Finally, relaxation can be avoided by using an
interpretation tree. This is typically used in
conjunction with depth-first search with
backtracking to find a consistent labelling.
Eg: Scene has three objects, each may be an aircraft,
bird or cloud.
Construct the interpretation tree. For N objects and
M labels, the tree has N+1 levels, each node has M
descendents, and there are a total of M^N
interpretations (leaves). Working left to right, from
the root node descend to the first consistent label
(no violation of a unary property for object 1). Then
descend to the first consistent label from that node
(no violation of any binary property involving objects
1 and 2). Continue until a leaf is reached or no
consistent label exists on the next level. If a leaf,
you are done. If no consistent label, backtrack one
level and repeat. If no new unexplored paths remain,
declare failure. Comment: M^N is a lot of leaves!
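A sketch of the interpretation-tree search in Python: plain depth-first search with backtracking, where consistent is a user-supplied predicate that checks the unary and binary constraints on a partial labelling.

def interpretation_tree(objects, labels, consistent):
    """Depth-first search with backtracking over the interpretation tree.

    objects    : list of N objects
    labels     : list of M labels
    consistent(partial) -> True if the partial assignment (a list of labels
                 for objects[0..len(partial)-1]) violates no unary or binary
                 constraint checked so far.
    Returns the first consistent complete labelling, or None (failure).
    """
    def descend(partial):
        if len(partial) == len(objects):
            return partial                      # reached a leaf: done
        for lab in labels:                      # left-to-right over children
            candidate = partial + [lab]
            if consistent(candidate):
                result = descend(candidate)
                if result is not None:
                    return result
        return None                             # no child works: backtrack

    return descend([])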
CSE473/573 Fall 2010
Summary of Topics for Final Exam
First half of the semester we worked on low-level
(image processing) steps in the image-processing -
computer vision paradigm. The second half we
concentrated on higher-level (computer vision)
steps in which the mappings were between abstract
data structures, not pixel brightness maps, and task
definitions and domain knowledge were increasingly
important.
Chapter 11, Morphology: sections 4-7 only
How can morphology be applied to analyze images?
Gray level morphology
Skeletons
Granulometry
Chapter 6: Shape representation and description
How do we "write down" the shape of a segment,
region or object so we can perform object
recognition?
Region labelling
Contour-based representations and descriptors
Chain code
Geometric descriptors: perimeter, area, etc.
Sequences: syntactic, polygonal, etc.
Scale space, B-splines
Invariants
Region-based representations and descriptors
Scalar descriptors: area, Euler number, etc.
Moments
Convex hull, extreme points
Skeleton graphs
Geometric primitives
Chapter 7: Object recognition
Given the shape description of an object and a
database of domain knowledge of object classes
and their characteristics, object recognition is
the assignment of a class label to
the object.
Knowledge representation: parameter vectors,
probability functions, graphs, etc.
Object recognition by analysis
Nearest neighbor classification
Statistical pattern recognition
Decision boundaries, discrimination functions
Bayes Law
Minimum error classification
Minimum-average-loss classification
Learning of statistical parameters
Neural nets
Artificial neurons
Learning as search over weight space
Gradient descent, backpropagation
Object recognition by synthesis
Syntactic (grammatical) classification
Grammars
Regular grammars and finite state automata
Parsing regular grammars
Graph matching
(Sub-)graph isomorphisms: node association functions
The assignment graph: maximal cliques
Chapter 8: Image understanding
Extract the information from the image which you
need. Recognized objects must be related and
characterized, ie. the scene contents semantically
labelled.
Control strategies
Contextual classification
Constraint propagation
Discrete relaxation
Probabilistic relaxation
Interpretation tree