06

Natural Vision

Reading: Levine Ch 3 - files/Levine_Ch3.pdf

This lecture covers animate vision principles and systems. Animate vision refers to the physiological and anatomical principles of vision in animal species. Since the animate vision approach to the design of computer vision systems is inspired by natural (biological) vision, it would do us well to look at how vision works in humans and other animals. So that is what we will do.

Biological vision systems take as input energy from a range of frequencies in the electromagnetic spectrum (usually, but not exclusively, the "visible spectrum" of 400 to 700 nm wavelengths), and produce behaviors as output.

You see a car coming, so you stop crossing the street.
You recognize your friend and you smile.
A snail detects a moving shadow so it gathers itself into its shell and stops.

Because the output of vision is behavior, it is difficult to do input-output (blackbox) analysis. Behavior is complicated, and is the outcome of more than just the visual or general perceptual cues that impinge on the animal.

We will describe some elements of the anatomy and the physiology of vision.

Anatomy: description of the physical parts,
Physiology: description of the functioning of these parts.

A biological vision system can be divided into three major parts:

The eye, that part which receives the visible energy input and converts it to an electrical signal that the animal's nervous system can work with;
The optic nerve, which transmits the coded signals to the central nervous system;
The brain, which interprets the vision signals, integrates them with other processes such as cognitive and motor, and produces behaviors.

The eye

The eye serves as the camera for the vision system. Here we will describe some of its anatomy and physiology.

Please refer to the Levine reading, which has some excellent detailed graphics I cannot reproduce here.

The action of the eye is to focus an image onto the retina. This is the "image plane." If the eye were a digital camera, this is where the CCD chip would be.

If it were an analog camera, this is where we would place the film. The retina, covering a bit more than the back half of the spherical eyeball, is clearly not a plane. It should really be called the "image hemisphere" rather than the "image plane."

In order to be focused onto the retina, light rays in any system other than a pinhole camera must be bent.

An image is said to be focused when all the light radiating from a given point (x,y,z) in world coordinates irradiates only a single point (u,v) on the image plane (or image hemisphere in the case of the eye).
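The one-to-one mapping from a world point (x,y,z) to an image point (u,v) can be illustrated with the ideal pinhole projection, which is in focus by construction. This is a minimal sketch, not taken from the notes: it assumes camera-centered coordinates with z along the optical axis, a hypothetical focal length parameter, and a virtual image plane in front of the pinhole so that the image is not inverted.

```python
def project_pinhole(point, focal_length):
    """Project a world point (x, y, z) to image coordinates (u, v).

    Assumptions (not from the notes): camera-centered coordinates,
    z measured along the optical axis, and a virtual image plane at
    distance focal_length in front of the pinhole.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    u = focal_length * x / z
    v = focal_length * y / z
    return (u, v)


# Every point along one ray through the pinhole lands on the same (u, v).
near = project_pinhole((1.0, 2.0, 4.0), focal_length=2.0)
far = project_pinhole((2.0, 4.0, 8.0), focal_length=2.0)
print(near, far)
```

A lens system has to bend rays to achieve the same single-point mapping for light leaving the world point in many directions, which is why focusing is needed once the aperture is larger than a pinhole.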

In the eye, this focusing function is done by the cornea and anterior chamber (where the aqueous humor is) and the lens. Together they form a multielement or compound lens.

This is a diagram of a compound lens in a camera. Normally, the lens elements are glued together and the focal plane, where the image is formed, is flat.

A camera needs a diaphragm to control the size of the aperture (hole) through which light travels. The human eye's diaphragm is the iris, whose aperture, the pupil, is controlled by muscle fibers. There are sphincter (tangential) muscles to stop the eye down, and dilator (radial) muscles to open up the pupil when more light is needed.

An imperfect lens can produce distortions on the image surface, and the human eye is no exception. These distortions are not corrected optically; it is the brain that "cleans up" the image so that it appears undistorted. The eye's spatial bandwidth limitation, however, cannot be compensated.

E.g.: You look at a stripe pattern. As it gets farther and farther away, at some point you can no longer see the individual stripes. The finest stripe spacing that can still be resolved is the spatial bandwidth, expressed in lines/mm.

The retina

The image is formed on a thin layer called the retina.

The retina covers more than 180 degrees around the interior surface of the eyeball (\(>2\pi\) steradians), allowing us to see a few degrees towards the back while we are looking forward.

The retina consists of five layers of cells arranged radially. One layer, the photoreceptive layer, is devoted to electrooptical transduction. There are two kinds of photoreceptors in that layer: rods and cones.

Cones: photopic color-sensitive photoreceptors.
Rods: scotopic brightness photoreceptors.

Photopic vision is normal day vision. Scotopic vision is night vision.

Both rods and cones are unevenly distributed along the retina. Cones are packed closely together in the fovea, or optical center of the retina. Their density falls off with distance from the fovea.

Rods are absent from the central part of the fovea, are dense in the para-fovea, the area surrounding the fovea, and their density also falls off with distance from the fovea thereafter. The area far away from the fovea is called the peri-fovea or peripheral retina.

The fovea is a depression in the retina about 1.5 mm in diameter, and subtends about 5.2 degrees of visual angle. Its central part, only 0.3 mm across, is where there is the maximum density of cones and thus maximum spatial acuity (resolution, bandwidth). This highest-acuity part of the visual field, the part impinging on the center of the fovea, is about the size of your thumbnail viewed at arm's length, roughly 1.0 degrees of visual angle. That's all we can see in great detail; the rest is blurred.
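As a rough check on these figures, the visual angle subtended by a retinal patch can be estimated from its size and its distance from the eye's optical center. The sketch below assumes a distance of about 17 mm, a common reduced-eye approximation that is not given in the notes; the function name is illustrative only.

```python
import math

# Assumption (not from the notes): the retina lies about 17 mm behind the
# eye's optical center, a common "reduced eye" approximation.
NODAL_DISTANCE_MM = 17.0


def visual_angle_deg(size, distance):
    """Plane angle in degrees subtended by an extent `size` at `distance`.

    Both arguments must use the same length unit.
    """
    return math.degrees(2.0 * math.atan(size / (2.0 * distance)))


fovea_deg = visual_angle_deg(1.5, NODAL_DISTANCE_MM)          # whole fovea
central_fovea_deg = visual_angle_deg(0.3, NODAL_DISTANCE_MM)  # central part
print(f"fovea: {fovea_deg:.2f} deg, central fovea: {central_fovea_deg:.2f} deg")
```

Under this assumption the 1.5 mm fovea comes out at roughly 5 degrees and the 0.3 mm central part at roughly 1 degree, in line with the values quoted above.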

The spatial resolution in the periphery is about two orders of magnitude less than that in the central fovea.

Concerning visual angle: When an angle is measured in degrees or radians rather than steradians, it refers to a plane angle, not a solid angle. The solid angle subtended in \(R^3\) due to a plane angle in \(R^2\) is computed by rotating the arms of the angle around their bisector to form a cone.

In the case in point, the plane is that defined in space by the three points \(x_1\): top of fingernail held at arm's length, \(x_2\): center of fovea, \(x_3\): bottom of fingernail. Then 1 degree is the plane angle formed by the connected line segments \(x_1 \to x_2 \to x_3\) at the vertex \(x_2\).
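Rotating a plane angle \(\theta\) about its bisector sweeps out a circular cone whose solid angle is \(\Omega = 2\pi(1 - \cos(\theta/2))\) steradians. This formula is standard geometry rather than something stated in the notes; the short sketch below evaluates it, with an illustrative function name.

```python
import math


def cone_solid_angle_sr(plane_angle_deg):
    """Solid angle (steradians) of the cone obtained by rotating a plane
    angle of `plane_angle_deg` degrees about its bisector."""
    half_angle = math.radians(plane_angle_deg) / 2.0
    return 2.0 * math.pi * (1.0 - math.cos(half_angle))


hemisphere_sr = cone_solid_angle_sr(180.0)   # a full hemisphere
thumbnail_sr = cone_solid_angle_sr(1.0)      # the 1-degree central field
print(hemisphere_sr, thumbnail_sr)
```

A 180-degree plane angle gives exactly \(2\pi\) steradians, matching the hemisphere figure used for the retina, while the 1-degree central field corresponds to only a tiny fraction of a steradian.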

The photoreceptors contain dyes, or photopigments, that absorb photons and change their transmissivity. There are three dyes in cones, just one in rods. Light passing through these dyes strikes a photosensitive membrane within the rod or cone, which absorbs photons and builds up a transmembrane potential. When this TMP reaches threshold, the cell "fires off" an action potential. The action potential is an electrical pulse which is then communicated to other cells by wiring structures called synapses.

Retinal neurons

There are four other layers of cells in the retina that preprocess the electrical signals that originate in the photoreceptors and prepare them for transmission to the central nervous system via the optic nerve. They are all nerve cells, or neurons.

Horizontal cells: interface with rods and cones.
Bipolar cells: blend together the outputs of several horizontals.
Amacrine cells: create horizontal filtering, e.g. center-surround filters.
Ganglion cells: interface with the optic nerve.
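To give a feel for what a center-surround filter does, here is a minimal one-dimensional sketch that models it as a difference of two Gaussians (a narrow excitatory center minus a broad inhibitory surround). The difference-of-Gaussians form, the kernel radius, and the two widths are illustrative assumptions, not values from the notes or a model of actual retinal circuitry.

```python
import math


def center_surround_kernel(radius=6, sigma_center=1.0, sigma_surround=3.0):
    """1-D difference-of-Gaussians kernel (assumed parameters).

    Each Gaussian is normalized to unit sum, so the kernel sums to zero
    and gives no response to uniform illumination.
    """
    offsets = range(-radius, radius + 1)
    center = [math.exp(-x * x / (2 * sigma_center ** 2)) for x in offsets]
    surround = [math.exp(-x * x / (2 * sigma_surround ** 2)) for x in offsets]
    c_sum, s_sum = sum(center), sum(surround)
    return [c / c_sum - s / s_sum for c, s in zip(center, surround)]


def apply_filter(signal, kernel):
    """Slide the kernel over the signal, keeping only fully covered positions."""
    r = len(kernel) // 2
    return [
        sum(kernel[j] * signal[i + j - r] for j in range(len(kernel)))
        for i in range(r, len(signal) - r)
    ]


kernel = center_surround_kernel()
flat = [1.0] * 40                    # uniform brightness
edge = [0.0] * 20 + [1.0] * 20       # dark-to-bright step

flat_response = apply_filter(flat, kernel)
edge_response = apply_filter(edge, kernel)
print(max(abs(v) for v in flat_response), max(abs(v) for v in edge_response))
```

The uniform input produces essentially zero output, while the step produces a clear response near the edge. This is the sense in which such filtering enhances contrast and discards redundant information.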

There is strong evidence for hardwired preprocessing at the retinal level to accomplish various "preattentive" operations, including:

Contrast enhancement;
Motion detection;
Elimination of redundant information;
Anti-aliasing;
Noise suppression.

For our purposes, it is enough to say that the signals transmitted to the brain are preprocessed to enhance important image features and suppress noise and distortion.

Preattentive means processing that is always done regardless of what part (if any) of the visual data we are actually paying attention to.

An important fact is that the optic nerve contains about \(10^6\) fibers, while there are 10-100 \(\times 10^6\) photoreceptors.

So another important property of the retina is data compression.
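The implied compression ratio follows directly from the two counts quoted above. The sketch below only does that arithmetic; the variable names are illustrative.

```python
# Counts quoted in the notes.
optic_nerve_fibers = 1_000_000          # about 10^6 fibers
photoreceptors_low = 10_000_000         # 10 x 10^6 photoreceptors
photoreceptors_high = 100_000_000       # 100 x 10^6 photoreceptors

# Photoreceptors per optic nerve fiber at each end of the range.
ratio_low = photoreceptors_low / optic_nerve_fibers
ratio_high = photoreceptors_high / optic_nerve_fibers
print(f"compression ratio between {ratio_low:.0f}:1 and {ratio_high:.0f}:1")
```

So on these figures the retina reduces the number of channels by one to two orders of magnitude before the signal leaves the eye.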

Pathways beyond the retinal optic nerve

The left and right eye branches of the optic nerve merge beneath the brain in the optic chiasm, where fibers from the two left-half images merge, and likewise the two right-half images merge. This merging supports stereo vision.

The two halves of the optic nerve then project to the superior colliculus and the visual or striate cortex.

The superior colliculus is common to man and lower animals; the cortex is the part of the brain in which higher cognitive processing occurs. More primitive visual functions, like deciding eye movements, occur in the superior colliculus, while abstract image understanding occurs in the striate cortex and the brain regions it projects to.

Neurons and neuronal signal processing

Neural nets have been on the scene a very long time, perhaps a half billion years. We are just getting around to engineering them artificially.

Information processing in the human brain takes place in a vast interconnection of specialized cells called neurons.

When the cell body (soma) reaches threshold potential, an action potential is fired down the axon.

When the action potential reaches a synapse, or connection with another neuron, it causes a neurotransmitter to flow into the synaptic cleft.

Neurotransmitters include the organic chemicals dopamine, acetylcholine, serotonin and norepinephrine. They each drive different types of neurons.

The neurotransmitter causes the post-synaptic membrane to become more negative (inhibitory synapse) or more positive (excitatory synapse). The net effect of all the post-synaptic membrane stimuli is to change the soma potential closer to or further from threshold.
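The summation described above can be sketched as a very simple threshold unit: post-synaptic contributions are added to a resting potential, and the neuron fires if the result reaches threshold. The resting and threshold values (-70 mV and -55 mV) and the function names are illustrative assumptions, and the model ignores timing, leakage, and spatial effects.

```python
# Illustrative values only (not from the notes).
RESTING_MV = -70.0
THRESHOLD_MV = -55.0


def soma_potential(psp_mv, resting_mv=RESTING_MV):
    """Soma potential after summing post-synaptic potentials (in mV).

    Positive entries model excitatory synapses, negative entries
    model inhibitory synapses.
    """
    return resting_mv + sum(psp_mv)


def fires(psp_mv, threshold_mv=THRESHOLD_MV, resting_mv=RESTING_MV):
    """True if the summed input brings the soma to threshold."""
    return soma_potential(psp_mv, resting_mv) >= threshold_mv


excitatory_only = [6.0, 6.0, 6.0]             # pushes the soma to -52 mV
with_inhibition = [6.0, 6.0, 6.0, -5.0]       # pulled back to -57 mV
print(fires(excitatory_only), fires(with_inhibition))
```

The same excitatory drive that triggers an action potential on its own fails to do so once an inhibitory input moves the soma potential further from threshold.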

In a human brain,

There are 10-100 billion neurons;
There are an average of 1-10 thousand synapses per neuron (roughly 10-1,000 trillion synapses altogether, taking the products of the range endpoints);
Action potentials are fired asynchronously (no central clock signal);
No information is conveyed in the shape of the action potential, only in its presence or absence. That is, channel coding is binary.
The maximum action potential rate is about 200-500/sec for a given neuron.
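Multiplying the endpoints of the neuron and synapses-per-neuron ranges quoted in the list gives the span of the total synapse count those ranges imply. The sketch below is plain arithmetic on the quoted figures, with illustrative variable names.

```python
# Ranges quoted in the list above.
neurons_low, neurons_high = 10e9, 100e9                 # 10-100 billion neurons
synapses_per_neuron_low, synapses_per_neuron_high = 1e3, 10e3   # 1-10 thousand

total_low = neurons_low * synapses_per_neuron_low
total_high = neurons_high * synapses_per_neuron_high
print(f"total synapses: {total_low:.0e} to {total_high:.0e}")
```

The endpoints work out to \(10^{13}\) and \(10^{15}\) synapses, so order-of-magnitude estimates of the total depend strongly on which end of each range is taken.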

Why binary channel coding? Since an action potential is an "all-or-nothing" response, it can be regenerated at full strength as it travels, so noise does not accumulate along the way. This would not be possible with analog channel coding.

Symbol coding in natural neural nets is pulse frequency modulation. That is, the information in a neural signal is conveyed in the rate of pulses (action potentials) per unit time.
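Pulse frequency modulation can be sketched as follows: a value is encoded as the rate of identical pulses, and decoded by counting pulses over a time window. This is a toy illustration with evenly spaced pulses and hypothetical function names; real spike trains are irregular. The 500/sec ceiling is taken from the upper end of the range quoted above.

```python
MAX_RATE_HZ = 500.0  # upper end of the firing-rate range quoted in the notes


def encode_rate(rate_hz, duration_s):
    """Return evenly spaced pulse times (seconds) for the requested rate.

    Rates are clipped to [0, MAX_RATE_HZ]; every pulse is identical, so
    only the timing carries information.
    """
    rate_hz = max(0.0, min(rate_hz, MAX_RATE_HZ))
    n_pulses = int(rate_hz * duration_s)
    return [k / rate_hz for k in range(n_pulses)]


def decode_rate(pulse_times, duration_s):
    """Estimate the encoded rate by counting pulses in the window."""
    return len(pulse_times) / duration_s


weak = encode_rate(20.0, duration_s=1.0)
strong = encode_rate(100.0, duration_s=1.0)
saturated = encode_rate(1000.0, duration_s=1.0)   # clipped to the maximum
print(decode_rate(weak, 1.0), decode_rate(strong, 1.0), decode_rate(saturated, 1.0))
```

A stronger stimulus is represented by more pulses per unit time, not by larger pulses, and stimuli beyond the maximum firing rate can no longer be distinguished from one another.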

Summary

This review of the human visual system is admittedly superficial. Our goal in this course is to understand how to synthesize and analyze computer vision systems for 3-D and motion, not human vision. We cannot take the time to do anything but scratch the surface.

But even the surface will have things to teach us. As we will see, many of the basic design principles exhibited by animate vision systems can be suitably migrated to artificial vision systems with great benefit to the computer vision designer.