
Model for the computation of self-motion in biological systems

John A. Perrone

Vision Group, Human Interface Research Branch, National Aeronautics and Space Administration, Ames Research Center, MS 262-2, Moffett Field, California 94035-1000; Department of Psychology, Stanford University, Stanford, California 94305

Received October 22, 1990; revised manuscript received August 13, 1991; accepted August 18, 1991

I present a method by which direction- and speed-tuned cells, such as those commonly found in the middle temporal area of the primate brain, can be used to analyze the patterns of retinal image motion that are generated during observer movement through the environment. For pure translation, the retinal image motion is radial in nature and expands out from a point that corresponds to the direction of heading. This heading direction can be found by the use of translation detectors that act as templates for the radial image motion. Each translation detector sums the outputs of direction- and speed-tuned motion sensors arranged such that their preferred direction of motion lies along the radial direction out from the detector center. The most active detector signifies the heading direction. Rotation detectors can be constructed in a similar fashion to detect areas of uniform image speed and direction in the motion field produced by observer rotation. A model consisting of both detector types can determine the heading direction independently of any rotational motion of the observer. The model can achieve this from the outputs of the two-dimensional motion sensors directly and does not assume the existence of accurate estimates of image speed and direction. It is robust to the aperture problem and is biologically realistic. The basic elements of the model have been shown to exist in the primate visual cortex.

INTRODUCTION

Humans can move rapidly through complex environments while avoiding obstacles in their path. Somehow we are able to extract three-dimensional (3-D) information, such as the structure of the environment and the heading direction, from the two-dimensional (2-D) changing light pattern that is projected onto our retina. The apparent effortlessness with which this navigational ability is carried out by humans as well as many other species contrasts with the underlying complexity of the task. This is even more remarkable if we consider that self-motion (egomotion) is usually not in a straight line and is accompanied by motion of the eyes, such that the resulting image motion is a complex mixture of rotational and translational motion.1-3 How this difficult problem is solved by biological visual systems still remains a largely unanswered question. This paper focuses primarily on the visual information that can be used for self-motion computations, but it is important to keep in mind that other inputs (e.g., vestibular4) also contribute to this task.

Figure 1 depicts the main elements involved in the classical problem of determining self-motion. The observer is usually considered to be freely moving in an unconstrained space, and the task is to infer his or her 3-D motion parameters from just the optical information being projected onto some 2-D image plane. The most general form of this problem considers the case of arbitrary 6-degree-of-freedom observer motion. From image motion alone, the goal is to recover the instantaneous heading direction, any rotation of the observer reference frame about the three axes (yaw, pitch, and roll), and the relative distances of the points in the environment (scaled range map).

The image motion is also usually constrained to arise from a single instant of observer motion, so that only a single velocity vector field is used in the computations.

Efforts to understand the self-motion estimation problem have been boosted in recent years by the desire to construct machine-vision systems for robots and autonomous vehicles that could emulate this ability. These machine-vision applications have to some extent been motivated by theoretical analyses demonstrating that there are many potential sources of navigational information available in image sequences.3,5-11 While several elaborate solutions have been developed in this area, few of them qualify as potential biological systems. The majority of the proposed algorithms were not intended as physiological models, and some authors were careful to stress this.3 The aim was merely to demonstrate that sufficient information was present in the optic flow alone.

There is a long history of attempts to define the possible visual sources of self-motion information that are used by humans. (See Cutting2 and Warren et al.13 for reviews.) A recent series of psychophysical experiments by Warren et al.13 and Warren and Hannon,14,15 as well as some earlier studies,16-18 has helped to define the performance limits of the human observer and has constrained the type of mechanism that could plausibly be utilized in the self-motion estimation process. In addition to these psychophysical experiments, physiological studies of the middle temporal (MT) area of primates19-23 and the medial superior temporal (MST) area24-29 have also served to constrain the type of signal available for self-motion computations in the primate visual system.



Fig. 1. (a) 2-D image-motion information on a projection plane is used to extract the observer's heading direction, any rotation about the three coordinate axes, and the relative distance to points in the XYZ coordinate space. (b) 2-D velocity vector field (flow field) for a 0.25-s instant of pure translatory motion through a volume of points randomly distributed in XYZ space. The heading direction is 10° to the right of center and 5° up. The vectors represent the 2-D image motion that is produced by the observer translation, and they all radiate out from one point. This is called the focus of expansion (FOE), and it is marked by a square.

The receptive fields of the cells in area MST that respond preferentially to the type of image motion generated during self-motion tend to be large and integrate information from nearly the entire visual field. Their responses to mixed rotational and translational motions are complex, and their patterns of responses are only just starting to be classified.26-30 One of the benefits of a model such as the one proposed in this paper is to help to guide the current physiological experiments involving areas such as MST, in which visual, vestibular, and eye-movement signals seem to interact in a complicated fashion.31-33

Many of the analytical approaches to the self-motion problem assume the existence of an accurate 2-D velocity vector field, i.e., a metrical readout of speed and direction at each point in the visual field. For instance, local differential motion techniques for recovering self-motion11 rely on accurate image-motion estimates, since this method is based on taking velocity vector differences in a small local region of the image. Any technique that necessitates finding vector differences, either locally or over larger areas of the image plane,14,16 is inherently reliant on good initial estimates of the flow vectors. Unfortunately the relatively broadly tuned direction- and speed-tuned cells that have thus far been found in area MT21,22 and other visual areas of the primate brain do not meet the often stringent requirement for precise local speed measurements

dictated by some algorithms. However, I acknowledge that such units could exist in other unexplored areas or that speed information could be precisely coded in some elaborate and as yet unexplained manner. Even if units are found that explicitly code the speed and the direction of image motion, they will always be vulnerable to the aperture problem.37-40 For generation of a sufficiently dense flow field, the motion sensors need to be small relative to the size of most moving contours in an image, and so only the component of the motion orthogonal to the direction of the contour can be registered. This is called the aperture problem. As a result, the image-motion estimates will be perturbed from the true image-motion direction by contours oriented in a variety of directions. The image speed will be reduced by the cosine of the angle between the true direction and the direction of the contour normal. The resulting edge velocity vectors can therefore deviate by as much as ±90° from the true flow vectors and represent poor input for self-motion estimation algorithms that rely on the detection of small vector differences. There are techniques for overcoming the aperture problem,33,34,45 but these are usually effective only for the case of uniform 2-D motion with no discontinuities present in the flow field. They are unsuitable for the type of image motion that results from 3-D self-motion.46 Many of these techniques tend to smooth out the vector fields at object boundaries and where depth discontinuities are present, thus degrading an important source of information for many classes of self-motion estimation algorithm. The aperture problem could be minimized to a certain extent by giving greater weight to the outputs of end-stopped cells that respond best to line terminators and isolated points.47 However, there will usually be some areas of the visual field that suffer from the aperture problem.

Some techniques for measuring image motion track image features such as corners and can thus minimize the aperture problem (see Aggarwal and Nandhakumar48 for a review). Unfortunately these methods tend to be computationally intensive, suffer from the correspondence problem,48 and are not good candidates for biological motion processing. In contrast, techniques for measuring edge motion have been well studied, and several biologically feasible approaches have been proposed; see, e.g., Refs. 49-52.


However, without a solution to the aperture problem, the outputs of such motion sensors are useless for the majority of existing methods of self-motion estimation. It would seem at this point that the computation of self-motion is an impossible task. Yet obviously many biological visual systems have developed a solution. The present paper presents a model of how a biological visual system can solve the self-motion estimation problem by using 2-D sensors with well-known properties of cells in area MT of the primate brain. Area MT contains well-characterized motion-sensitive cells that are retinotopically organized in cortical columns. They are speed tuned and direction selective,19-23 and these properties are incorporated into the 2-D motion sensors used in the model. I show how the properties of the 2-D motion sensors can be used to bypass the aperture problem and hence overcome one of the main obstacles of the traditional approaches. Cells in area MST of the primate cortex have large receptive fields and are sensitive to the type of full-field image motion that results from self-motion.26-30 They are the putative processors of self-motion information and respond to radial expansion patterns, to full-field uniform motion, and to combinations of both. These properties are incorporated into the translation, the rotation, and the heading detectors of the model.

I show how the 2-D motion sensors can be connected into simple networks that can recover the heading direction and pick up any rotational motion of the observer. Because area MT provides a major projection to area MST,53 the model represents a theoretical framework for understanding how MST receptive fields might be built up from MT cells.

1. TRANSLATION DETECTORS

A. Focus of Expansion

Figure 1(b) shows a typical flow field, or velocity vector field, generated during an instant of pure translational motion through a random field of points distributed throughout XYZ space. The velocity vectors represent the image motion that would result from the specified observer motion. The velocity vectors all radiate out from one point on the image plane. This point is referred to as the focus of expansion54 (FOE). Knowing the location of the FOE on the image plane is equivalent to knowing one's heading direction; the FOE is the position in the image plane through which the heading vector passes. Knowledge of the location of the FOE also permits the determination of the relative distance to the points in the field and the time to impact.5,55 However, the FOE is not a reliable source of heading information.56-58 Although experiments by Warren et al.13 have shown that the FOE can be located precisely (75% correct thresholds of less than 1°) when a stimulus generated by points at a range of distances is used, problems occur if the observer rotates as well as translates. For instance, both curvilinear paths and eye movements during pure translation add a rotational component to the simple radial expansion pattern, and there is no longer a focus of expansion at the heading point. A new zero-velocity point, or singularity, is created elsewhere in the field.55,56 Since primates will normally fixate a stationary point during motion (see, e.g., Refs. 59-61), the zero-velocity point will be at the point of fixation and is not a reliable indication of heading.
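The geometry behind the flow field in Fig. 1(b) is easy to make concrete. The short sketch below is my own illustration rather than code from the paper: it assumes a pinhole projection with unit focal length, an observer-centered frame with Z along the line of sight, and arbitrary speed units, and the function and variable names are likewise assumptions.

```python
import numpy as np

def translational_flow(points, T, f=1.0):
    """Image positions and velocities for static 3-D points seen by an
    observer translating with velocity T = (Tx, Ty, Tz); pinhole projection
    with focal length f.  Every velocity vector radiates from the FOE."""
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    x, y = f * X / Z, f * Y / Z                      # image positions
    u = (x * T[2] - f * T[0]) / Z                    # horizontal image velocity
    v = (y * T[2] - f * T[1]) / Z                    # vertical image velocity
    return np.stack([x, y], axis=1), np.stack([u, v], axis=1)

# Example: heading 10 deg to the right and 5 deg up, as in Fig. 1(b),
# through a random cloud of points in front of the observer.
rng = np.random.default_rng(0)
pts = rng.uniform([-20.0, -20.0, 10.0], [20.0, 20.0, 60.0], size=(200, 3))
d = np.array([np.tan(np.radians(10.0)), np.tan(np.radians(5.0)), 1.0])
T = d / np.linalg.norm(d)                            # unit-speed translation
pos, vel = translational_flow(pts, T)
foe = (T[0] / T[2], T[1] / T[2])                     # focus of expansion (f = 1)
```

With f = 1 the FOE sits at (Tx/Tz, Ty/Tz), and every velocity vector points directly away from it, which is the radial pattern exploited by the translation detectors described next.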


I will sidestep this issue for the time being and consider how a biological system could make use of the focus of expansion when no rotations are present. The challenge is to devise a mechanism by which a biological system could locate the focus of expansion in moving retinal image sequences during pure translational motion [e.g., Fig. 1(b)] by using only cell types that have been found to exist in the primate visual system.

B. Locating the Focus of Expansion

At first glance the estimation of the position of the FOE seems trivial. We could select any two noncollinear vectors and find their point of intersection. However, this assumes some form of triangulation ability on the part of the visual system, and, more importantly, it ignores the aperture problem discussed in the Introduction. Because of the aperture problem, we cannot expect as input the perfectly accurate flow vectors such as those depicted in Fig. 1(b).46 The 2-D motion information will be noisy, and so we require a scheme that integrates velocity information over a large area of the visual field and not just a small subset of the possible motion inputs. This notion is supported by psychophysical evidence that showed a drop in heading-estimation performance with a decrease in the number of dots in a display.15 Several computer vision algorithms for locating the FOE from motion information over the whole image have been proposed.62-65 Furthermore, there have been several suggestions made for expansion detectors and looming detectors of various types.22,23,27,57,66-70 The existence of such units in area MST of the primate brain has been confirmed by several physiological studies.24-29 However, for the most part, attempts at specifying possible physiological implementations of these detectors are rare, and only a qualitative description of how such units would operate has been presented so far.

I will first describe a detector designed to pick up the radial pattern of image motion that is generated during forward translation of an observer. The basic elements of these detectors are the speed- and direction-tuned cells commonly found in area MT of the primate brain.19,21,22 Directionally selective cells respond maximally to motion across their receptive fields in one specific preferred direction. If the motion is not along this direction, their output is reduced by an amount dependent on the directional bandwidth of the cell. A wide range of bandwidths seems to be present in area MT, but the average is approximately 90° for the half-height full bandwidth (see Fig. 9 of Ref. 22). As a convenient model of the directional tuning of these cells, I used the following Gaussian function, normalized so that the maximum output is 1.0 when the image motion is along the preferred direction for the sensor [see Figs. 2(a) and 2(b)]:

O_d(x) = {exp[-0.5(x/30)²] - δ} / (1 - δ),    (1)

where O_d is the output (direction response) and x is the difference (in degrees) between the direction of the image motion (θ) and the preferred direction of the motion sensor (φ). δ determines the amount of inhibitory output from the unit, and it is set to a value of 0.05. Hence there is a small amount of inhibition for absolute angle


Fig. 2. Direction- and speed-tuning functions used to model motion-sensitive cells in visual area MT of the primate brain. (a) Direction-tuning curve. (b) Direction-tuning curve in polar plot form. (c) Speed-tuning curve. The horizontal axis represents the ratio of the image speed to the optimum speed for the sensor and is plotted on a log scale. (d) The model uses a small set of speed-tuned sensors at each location to span a wide range of possible image speeds.

differences (|θ - φ|) greater than 90°. Figure 2(b) shows the more common directional polar plot form of the curve. The exact values for the parameters of the tuning curve are not critical. No attempt was made to optimize the model in this respect.

The cells in area MT also tend to respond maximally when the motion across their receptive fields falls within a fairly narrow band of speeds.21 For the cells sampled by Maunsell and Van Essen,21 the range of preferred speeds found across different cells was wide (0.5-250°/s), but this range can be spanned by a relatively small number of cells (see Fig. 6 of Ref. 21). The speed tuning of MT cells is also modeled by using the following Gaussian:

O_s(x) = exp(-0.5x²),    (2)

where x is the difference between the actual speed of the image motion over the sensor (r) and the optimum speed for the sensor (R) in octave (log2) units; i.e., x = log2(r/R). The standard deviation of the Gaussian is therefore 1.0 octave, and the maximum output is again normalized to 1.0. As with the direction-tuning curves, the exact parameter values are not critical.

As was pointed out above, these sensors do not produce an output linearly related to the speed of the image motion across their receptive fields. The model uses the output, or the activity, of these sensors directly, not an estimate of the image speed and direction. The vector flow fields used in the simulations in this paper represent the actual image motion that occurs given a specified observer motion. This image motion results in a particular output in the simulated cells, and it should be emphasized that it is this output that is used by the model, not the velocity vectors themselves. There is currently no physiological evidence for cells that can provide the velocity vector inputs required by the majority of other approaches to the self-motion estimation problem.

Although the model does not require an accurate estimate of the image speed at each location in the image, it does need at least one sensor at that location to respond to the motion. To ensure this response, we need to assume just that each location is sampled by a range of sensors, each tuned to a different band of preferred speeds. For modeling purposes and for simplicity, I have sampled each image location with just four speed ranges [see Fig. 2(d)]. The direction and the speed responses of the local motion sensors are assumed to be separable, with the total output being the product of the direction-tuning output and the speed-tuning output.
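The two tuning curves and the separable sensor can be written compactly. The sketch below follows the reconstructed Eqs. (1) and (2) above; the function names, the wrap-around of the direction difference, and the particular four preferred speeds (read from Fig. 2(d)) are my own choices for illustration rather than details from the paper.

```python
import numpy as np

DELTA = 0.05        # inhibitory offset (delta) in Eq. (1)
SIGMA_DIR = 30.0    # direction-tuning width used in Eq. (1), in degrees

def direction_response(theta, phi):
    """Eq. (1): response of a sensor with preferred direction phi (deg) to
    image motion in direction theta (deg); 1.0 at the preferred direction,
    slightly inhibitory for large direction differences."""
    x = (np.asarray(theta) - phi + 180.0) % 360.0 - 180.0   # wrap to [-180, 180)
    return (np.exp(-0.5 * (x / SIGMA_DIR) ** 2) - DELTA) / (1.0 - DELTA)

def speed_response(r, R):
    """Eq. (2): response of a sensor with preferred speed R to image speed r,
    with the difference expressed in octave (log2) units."""
    return np.exp(-0.5 * np.log2(np.asarray(r) / R) ** 2)

def sensor_output(theta, r, phi, R):
    """Separable 2-D motion sensor: product of direction and speed tuning."""
    return direction_response(theta, phi) * speed_response(r, R)

# Four preferred speeds per image location, following Fig. 2(d) (deg/s).
SPEED_CHANNELS = np.array([1.5, 3.0, 6.0, 12.0])
```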

We can combine these motion-sensitive sensors to construct a translation detector that responds selectively to the radial expansion pattern surrounding a potential FOE. Consider the detector, shown in Fig. 3, that sums the outputs of speed- and direction-selective cells that are aligned in a radial fashion. This detector samples a possible candidate heading direction (α, β) represented by a particular image location (x, y). Like MST neurons, the detector sums motion information from over the full visual field.24-29 Each image location is sampled by sensors with a range of speed preferences (only one such set is shown in the figure), but each with a preferred direction aligned with the radial direction out from (x, y). The maximum output from the set of motion sensors at a given location is summed into the total activity for the translation detector. The rationale for this is presented in Subsection 1.C. Motion directed toward the FOE results in the subtraction of a small amount of activity from the total activity of the detector (inhibition).

Consider the response of such a translation detector if it were located at a position such that (x, y) coincided with the position of the FOE in the visual field. We will ignore for now the presence of the aperture problem. Since by definition the image motion will be radially directed out from the FOE, the direction of image motion will be optimum for all the sensors making up the detector. The speed at each location could fall anywhere within a wide range, since it is a function of the distance to the point in the environment generating the image motion and of the image location relative to the position of the FOE. However, the speed will fall within the range of one of our sensors located at that position, and it will produce an output close to its maximum. We would therefore expect a large total response from the detector as a whole. If, on the other hand, the translation detector were not centered on the FOE, then some of the image motion would not be in the optimal direction of all the sensors, and the detector response would be less than the maximum. Therefore a smaller total response would be generated in a translation detector not located at the FOE. For a better understanding of the principle behind this mechanism, it is useful to think of the detector in Fig. 3 as a template placed at different positions over the flow field in Fig. 1(b).

In order to sample the full possible range of heading directions, we obviously need to have many detectors. What is a sensible way of sampling this space? Human and animal locomotion is usually constrained to forward movement over a ground plane, and anomalous perceptual effects can occur when the motion is outside these constraints (e.g., backward motion71,72). Therefore, for modeling purposes, only the forward hemisphere of possible heading directions is sampled. There is some psychophysical evidence showing that heading-estimation performance deteriorates as the true heading-direction angle increases relative to the straight-ahead direction.18


However, for simplicity, the heading-direction space is sampled in the model by using equal, arbitrary 5° steps of azimuth (α) and elevation (β). Both azimuth and elevation angles span a range from -85° to +85°, thus generating a 35 × 35 array of translation detectors. The task of finding the FOE becomes one of finding the detector with the greatest total activity. If the aperture problem did not exist, there would always be just one peak in the distribution of activity across the array. The FOE is, by definition, a single point from which the image motion radiates. In the presence of the aperture problem, there is a small likelihood that more than one peak in the distribution could arise. However, it is shown below that, as long as there exist a sufficient number of motion vectors in the field, the peak is always well defined and lies close to the true position of the FOE. The computations for finding the activity of a given translation detector can be summarized as follows: the detector activity at (α_i, β_i) is

Σ_j max_k {O_d(θ_j - φ_j) O_s[log2(r_j/R_k)]},    (3)

where m is the total number of motion sensors responding to the image motion represented by the flow field (the sum is over j = 1, ..., m), n is the number of speed-tuned motion sensors at each image location representing each direction φ_j (the maximum is taken over these sensors, k = 1, ..., n; n = 4 in the following simulations), O_d is the direction tuning of the motion sensors [see Eq. (1)], and O_s is the speed tuning of the motion sensors [see Eq. (2)].
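Combining Eq. (3) with the 35 × 35 array of candidate headings gives a simple maximization procedure. The sketch below reuses the sensor functions and speed channels from the previous sketch; how a candidate heading (azimuth, elevation) maps to an image position, and the helper names, are my own assumptions.

```python
import numpy as np

# Candidate headings: azimuth and elevation from -85 deg to +85 deg in 5 deg
# steps, i.e., the 35 x 35 array of translation detectors described above.
AZIMUTHS = np.arange(-85, 86, 5)
ELEVATIONS = np.arange(-85, 86, 5)

def detector_activity(pos, vel, az, el, f=1.0):
    """Eq. (3): activity of the translation detector tuned to heading (az, el).
    pos and vel are (m, 2) arrays of image positions and 2-D motion vectors;
    each location contributes the best output over its speed channels."""
    foe = np.array([f * np.tan(np.radians(az)), f * np.tan(np.radians(el))])
    radial = pos - foe
    phi = np.degrees(np.arctan2(radial[:, 1], radial[:, 0]))   # preferred directions
    theta = np.degrees(np.arctan2(vel[:, 1], vel[:, 0]))       # actual motion directions
    r = np.linalg.norm(vel, axis=1)                            # actual image speeds
    per_channel = [direction_response(theta, phi) * speed_response(r, R)
                   for R in SPEED_CHANNELS]
    return np.max(per_channel, axis=0).sum()      # max over k (speeds), sum over j

def estimate_heading(pos, vel):
    """Heading estimate = centre of the most active detector in the array."""
    activity = np.array([[detector_activity(pos, vel, az, el)
                          for az in AZIMUTHS] for el in ELEVATIONS])
    i, j = np.unravel_index(np.argmax(activity), activity.shape)
    return AZIMUTHS[j], ELEVATIONS[i]
```

Running estimate_heading on the output of the translational-flow sketch above should return the grid point nearest the simulated heading, which is the sense in which the most active detector signals the heading direction.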

In theoretical terms the translation-detector approach for finding the FOE involves maximizing the function given on the right-hand side of Eq. (3) across the possible heading directions (α_i, β_i).


Fig. 3. Translation detector made up of direction-selective and speed-tuned motion sensors. The detector sums the output from sensors over the full field of view. The preferred direction of each sensor is aligned with the radial direction from a candidate heading direction (α, β) represented by the position (x, y) on the image plane. Each radial direction is represented by a range of speed-tuned sensors (only one such set is shown), and the maximum output from the set is summed into the total detector activity. Motion toward the center produces negative output and is subtracted from the total activity of the detector.



Fig. 4. (a) Test input for the translation-detector array. The field of view is 80° × 80°, and the simulated heading (square) is in a direction 10° to the right and 5° below the line of sight (cross). The ground plane is located at 30 units below eye level, and the observer speed is 40 units/s. (b) Edge velocity vector field for simulating the aperture problem. (c) Perspective view of the translation-detector array with the activity of each detector plotted in the vertical direction. The most active detector is the one tuned to the heading direction of α = 10°, β = -5°. Note that the FOE does not have to be within the image plane to be detected by this system. (d) Output of translation detectors in response to the edge velocity vector field input.

The method could also be expressed as the minimization of an error function. Similar approaches to the FOE localization problem have been used by Jain62 and by Lawton.64

C. Testing the Translation-Detector Array

The input shown in Fig. 4(a) was used to test the translation-only model. This example will also serve to introduce certain graphical conventions that will be used throughout the remainder of this paper. This input represents the flow field that would be generated by an observer moving over a ground plane toward two hills, heading in the direction of the square. The observer or the camera is looking in a direction 10° to the left of the heading direction and 5° above it, looking in the direction of the cross.

In other words, his or her heading direction relative to this line of sight is α = 10°, β = -5°. Figure 4(c) shows the output of the translation-detector array, with the activity of each detector shown in the vertical dimension. The peak of the distribution occurs in the detector corresponding to a heading direction of α = 10° and β = -5°. The model has therefore successfully located the FOE. However, this is a simple example, and many other schemes could be used to find the same solution.

A more challenging example is shown in Fig. 4(b). This time the aperture problem has been simulated, with the direction of each vector in Fig. 4(a) being perturbed by a random amount drawn from a uniform distribution ranging over ±85°.


The length of each vector is reduced by the cosine of the perturbation angle. This is an extreme case and simulates the aperture problem at all points in the field and the existence of a wide range of oriented contours in the scene. In reality there is always a certain proportion of corners and points that do not suffer from the aperture problem. The activity distribution across the translation-detector array for this input is shown in Fig. 4(d). The peak of the distribution is not so prominent as in the true-flow case, but the correct heading direction was still found. The network of translation detectors is capable of determining the heading direction with edge velocity vectors instead of the true flow vectors.

The input field shown in Fig. 4(b) would be problematic for models of self-motion estimation that rely on accurate estimates of the true motion vectors.

One simple property of the edge velocity vector fields permits the model to determine the heading despite the aperture problem: the edge-normal vectors are constrained to lie within ±90° of the true motion direction. The broad directional tuning of the motion sensors making up the translation detectors ensures that the motion sensors still respond to these off-axis motions, albeit at a lower level of output.

When summed over a large number of sensors, these reduced outputs can dominate the response. The importance of having a set of speed ranges present at each location now becomes apparent. The aperture problem results in not only a perturbation of the direction of the flow vector but also a reduction in speed. By selection of the maximum output from a set of speed-tuned cells, the falloff in response becomes a function of the directional error alone. This is desirable because, for the most part, speed is an unreliable source of heading information, since it is a function of the unknown distance of the points in the scene. Motion directions within ±90° of the true direction still generate responses in the appropriate translation detector. If the perturbations of direction around the true direction are randomly distributed, the responses will sum when taken in sufficient numbers (see Fig. 5) to produce a peak response in the true direction. Motion directions outside the bandwidth of the sensors will tend to cancel because of the inhibitory outputs generated for these directions. If only a small number of motion vectors are used as input or if the distribution of edge orientations is biased in some way, some errors will occur in the heading estimation.
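The edge-vector fields used in Figs. 4(b) and 5 can be produced by perturbing each true flow vector in exactly this way. The helper below is my own sketch of that perturbation (reusing the sensor functions defined earlier) and of the point that the maximum over speed channels leaves mainly a direction-dependent falloff.

```python
import numpy as np

def edge_vector(true_vec, max_angle=85.0, rng=None):
    """Simulate the aperture problem for one flow vector: rotate its direction
    by a random angle drawn uniformly from +/- max_angle degrees and shrink its
    length by the cosine of that angle (the edge-normal component)."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.radians(rng.uniform(-max_angle, max_angle))
    rot = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return np.cos(a) * (rot @ np.asarray(true_vec, dtype=float))

# Because the detector takes the maximum over its speed channels, the response
# of the sensor bank to an edge vector depends mainly on the direction error,
# not on the cosine reduction in speed.
true = np.array([2.0, 0.0])                         # 2 deg/s motion to the right
edge = edge_vector(true)
theta = np.degrees(np.arctan2(edge[1], edge[0]))    # perturbed direction
speed = np.linalg.norm(edge)                        # reduced speed
best = max(direction_response(theta, 0.0) * speed_response(speed, R)
           for R in SPEED_CHANNELS)                 # approx. direction_response(theta, 0)
```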

For a determination of the robustness of the technique, a simulation was performed in which the number of vectors in the field was manipulated. Points were randomly positioned within a 40° × 40° image area and assigned a random depth value from 50 to 500 units. A heading direction was randomly selected on each trial to lie somewhere within the 40° × 40° image area. For each trial, two responses were generated; the first used the true optic flow as the input, and the second used velocity vectors perturbed by the aperture problem as input. The length of the vectors was reduced by the cosine of the perturbation angle. The heading error was computed as the angle between the true heading and the model estimate. The mean errors for 200 simulated trials are presented for each condition in Fig. 5. This figure shows that only a modest number of edge velocity vectors are required for the heading to be determined to within the accuracy limits of the translation-detector sampling array. For the conditions used in the simulation, the number was close to 40, but changes to the parameters of the model (e.g., the bandwidth of the sensors) will have an influence on this value. If the aperture problem is not present and the true velocity vectors are available, then accuracy remains high down to the minimum number of vectors tested.

The model therefore offers a robust means of determining the heading direction during pure translation. It could be implemented as a parallel network of connections among a particular set of direction- and speed-selective motion-sensitive cells that have already been shown to exist within the MT area of the primate brain. In fact the retinotopic and columnar organization of the MT area73 appears ideal for minimizing the length of the connections required by such a network. It should be noted that, since the model determines heading direction relative to eye direction, one also needs to have knowledge of eye position if the heading direction relative to the head is to be found. However, certain areas of cortex have been postulated to be involved in similar transformations from retinal to spatial coordinates.74

The network described above is useful for determining heading direction in situations in which the eye or the camera is fixed in a particular direction relative to the path of motion. It assumes that the line of sight will remain in the same direction during the instant of motion that generated the flow field. However, primates do not always gaze in one direction while translating through the environment. In fact, the optokinetic reflex and the vestibulo-ocular reflex (VOR) (e.g., see Refs. 59-61 and 75) both act to stabilize gaze onto a stationary point during translational movements and hence add a rotational component to the translational flow field.


Fig. 5. Testing the ability of the translation detectors to determine heading in the presence of the aperture problem. The simulated observer motion was at 40 units/s in a random direction lying somewhere in the 40° × 40° image area. For the edge-motion simulations the velocity vectors were perturbed by a random angle drawn from a uniform distribution with a range of ±85°. The plot shows mean heading errors from 200 simulations as a function of the number of vectors in the input field. The dashed line indicates the resolution of the detector array (5°). The error bars represent the standard error of the mean.



Motion along curvilinear paths also introduces a rotation to the flow field. If the viewer actively suppresses the VOR during curvilinear motion or if the VOR is not completely compensatory, a combined rotation-translation flow field is produced. It is time, therefore, to consider the effect of eye rotations and curvilinear trajectories on self-motion perception.

2. ROTATION DETECTORS

A. Detecting Rotation

Figure 6(a) represents a theoretical flow field for an observer moving through a corridor. This is a somewhat artificial situation, but it is useful for showing certain properties of the flow field. The motion is in the direction of the line of sight and parallel to the sides of the corridor. Thus the FOE is in the center of the frame. Figure 6(b) is for the same corridor, but it represents the flow field for rotation of the line of sight at 9°/s about the Y axis (yaw) while the observer is stationary. This flow field conveys an important property of rotationally induced flow fields: the lengths of the flow vectors are independent of the distance to the points.1,3 This means that the direction and the length of the vectors can be precomputed from knowledge of just the rotation rate. Figure 6(c) is the flow field that would result from a simultaneous combination of the two motions shown in Figs. 6(a) and 6(b) (equal to the vector sum of the corresponding vectors). Even though the actual instantaneous heading is still in the direction specified by the square at the center of the frame, there is no FOE at this position. A new zero-velocity point has been produced to the right. Although it is not strictly a true FOE (the vectors are not all directed exactly radially outward), it is sufficiently like one to produce a peak in the translation-detector array. The translation-only model generates an incorrect heading estimate of α = 20° and β = 0° for this input flow field. However, this problem can be rectified.
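Both facts used here, that the rotational component carries no depth term and that the combined field of Fig. 6(c) is the vector sum of the two components, can be seen in the standard flow equations for a rotating line of sight. The sketch below builds on the earlier translational-flow sketch; the sign conventions and the reuse of the random point cloud (rather than the corridor of Fig. 6) are my own simplifications.

```python
import numpy as np

def rotational_flow(pos, yaw_rate, pitch_rate, f=1.0):
    """Image velocities produced by rotating the line of sight about the
    vertical (yaw) and horizontal (pitch) axes, rates in radians/s.  Depth
    does not appear: the rotational flow is the same for near and far points."""
    wy, wx = yaw_rate, pitch_rate
    x, y = pos[:, 0], pos[:, 1]
    u = -wy * (f + x ** 2 / f) + wx * x * y / f
    v = wx * (f + y ** 2 / f) - wy * x * y / f
    return np.stack([u, v], axis=1)

# Combined field in the spirit of Fig. 6(c): the vector sum of a translational
# field (observer speed 32 units/s along the line of sight) and a 9 deg/s yaw
# rotation.  Reuses translational_flow and the points pts from the first sketch.
pos, trans = translational_flow(pts, np.array([0.0, 0.0, 32.0]))
combined = trans + rotational_flow(pos, yaw_rate=np.radians(9.0), pitch_rate=0.0)
```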


B. Removing the Effect of Rotation

If the rotational flow field is the result of an eye rotation, then knowledge of the eye-rotation rate could in theory be used to decompose a flow field such as that shown in Fig. 6(c). An extraretinal oculomotor signal could be used somehow to cancel the optical effect of the rotation.76 Such signals are available within the primate brain.32,77 There is, however, evidence against this approach. Psychophysical experiments show that the decomposition of translational and rotational fields is possible, at least in some circumstances, without an extraretinal signal.14,15,78,79 However, Warren and Hannon15 did notice a dependence on eye movements in one of their heading-estimation tasks. Therefore the model must deal effectively with rotation by using only visual information, although the role of eye movements cannot be excluded.

Fig. 6. (a) Pure translation in the direction α = 0°, β = 0° along a corridor of regularly spaced points. The observer speed is 32 units/s, and the ground plane is 30 units below eye level. The field of view is 80° × 80°. (b) Flow field produced by pure rotation of the eye or camera about the vertical axis at 9°/s. (c) Combined flow field for simultaneous translation and rotation. The instantaneous heading is still in the direction of the square, but a singularity, or pseudo-FOE, is generated to the right of center.

A common method for avoiding the effect of rotation is to take local vector differences, such that a vector-difference field is used to determine heading instead of the original flow vectors.10,36 If vector differencing is carried out in a small local neighborhood with the use of points at different depths, the effect of the rotation is removed, since the rotation component of the vector is independent of depth and common to all the vectors.3,80


Fig. 7. Rotation detectors. (a) Rotation detector tuned to full-field motion to the right (0°). It is made of directionally selective motion sensors with a 0° preferred direction and just one common speed preference. The output of sensors tuned to the left (180°) is subtracted from the 0° output. (b) Rotation detectors tuned to the full 360° of possible directions are used, and the range is sampled in 5° steps. Five sets of detectors, each tuned to a different preferred speed, are incorporated into the model.

As long as certain restrictions are met, the heading can be found independently of the rotation with the use of this principle.10 However, as was mentioned in the Introduction, these techniques rely on the existence of an accurate velocity vector field and would be severely challenged by the type of edge motion shown above in Fig. 4(b). Furthermore, recent psychophysical experiments by Perrone and Stone79 brought into question the appropriateness of such local vector-differencing techniques as models of human self-motion perception.

It is, however, possible to detect the rotation component of the flow field despite the presence of the translational component. The rotation is detected independently of the translation first, rather than the translation being detected independently of the rotation, as is the case in the above vector-differencing methods.

C. Detecting Rotation Independently of Translation

Notice in Fig. 6(c) that the distant points contain mainly rotation information. Except for some confined spaces, this will be the normal state of affairs for most environments. The image motion resulting from forward translation falls off rapidly with the distance to points in the world. Points on the ground farther than approximately 12-16 eye-height units away generate image speeds of only approximately 0.2°/s. These points are approximately 4° below the horizon, and, given that there are often many distant features above eye level, there is usually a large part of the visual field that does not register the translational movement.
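As a rough check on these figures (my own back-of-the-envelope calculation, not taken from the paper): for a point on the ground plane a horizontal distance d eye heights ahead of an observer moving forward at v eye heights per second, the depression angle below the horizon is θ = arctan(1/d), so the point's image speed is

dθ/dt = v/(d² + 1) ≈ v/d²  (radians per second).

With v of the order of 1 eye height/s (the simulations of Figs. 4 and 6 correspond to roughly 1.1-1.3 eye heights/s), d = 12-16 gives image speeds of roughly 0.2°-0.5°/s and depression angles of about 4°, in line with the values quoted above.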


Also, the translational component of the image speed is reduced (as a sine function of the eccentricity angle) for points close to the focus of expansion. We can capitalize on these properties of the motion flow fields to create a detector designed to register the rotation field, even in the presence of translation. The detectors need to sample motion from a large area of the visual field to pick up the full-field motion that results from rotations, and they need to sum the activity from motion sensors tuned to the same direction. The speed tuning for each motion sensor making up the detector network also needs to be identical for all positions in the visual field. The characteristic properties of a rotation-induced flow field are the unidirectional nature of the motion and the uniformity of the image speeds (if the distortion introduced by the use of flat projection planes is ignored). In contrast, image motion with a large translation component tends to be multidirectional and, for most scenes, exhibits a range of image speeds. Given that we need to detect differing amounts of rotation, we require a range of detectors, each tuned to a different level of rotation rate. This requirement means that each detector receives input from motion sensors tuned to only one range of speeds. The decision as to how many speed ranges to sample is somewhat arbitrary but should be tailored to the amounts of rotation observers are likely to encounter during normal locomotion. Except when stated, the model simulations in this paper use sensors tuned to the following range of speeds: 1.5°, 3.0°, 6.0°, 9.0°, and 12.0°/s.

We also need detectors tuned to a range of directions. In the model I chose to use only combinations of Y-axis rotations (yaw) and X-axis rotations (pitch). Furthermore, for simplicity the model uses an arbitrary, isotropic sampling of rotation directions corresponding to a uniform set of image-motion directions. Thus the model has detectors tuned to cover the full 360° of directions about the center of the field, with 5° steps between them. An obvious refinement would be for the range of directions sampled to be clustered around the axes corresponding to the main body rotations that are apt to be experienced by the observer or to coincide with the axes of the semicircular canals.81 In this way a small basis set of directions could be used to encode the whole range of possible rotations.

Figure 7(a) shows such a detector, designed to pick up Y-axis (yaw) rotation to the left. This rotation generates image motion to the right. Other rotation detectors would be tuned to different directions and would represent different combinations of yaw and pitch. The detector shown in Fig. 7(a) sums the activity from motion sensors that have their peak responses to rightward motion at a given speed. The total output of the directionally selective sensors tuned to 0° and to a particular preferred speed is summed over the whole field. The total output for sensors tuned to 180° (but with the same speed preference as the 0° sensors) is also summed and subtracted from the total 0° output to give a net output for the 0°-direction rotation detector. The sensitivity of the detector is improved by subtracting any activity from motion sensors with a preferred direction that is 180° from the primary direction of the sensors in the detector.


The subtraction has the desirable feature of producing low levels of activity in the rotation detector when the field contains motion in many directions, as is the case during forward translation. For the rotation detectors, the model assumes directional- and speed-tuning curves and output mechanisms of the 2-D sensors identical to those used for the translation detectors. The procedure for determining the different levels of total activity is also similar to that described for the translation detectors. For each direction γ (where γ covers the range 0°-360° in 5° steps), the output for each sensor in the field is calculated on the basis of the image speed relative to the optimum speed for the sensor and the difference between the sensor's preferred direction and the image-motion direction. The speed and the direction outputs are multiplied together to give the combined activity for that sensor, and that activity is summed with the activity from all the other sensors in the field to give a total output level for the rotation detector. This is repeated for the other rotation detectors, tuned to different directions and rotation rates. In all, the total activity for each of 72 directions with five speed levels at each direction is calculated. A subset of these is shown in Fig. 7(b). Again, the columnar organization of motion-sensitive cells found in area MT73 is ideal for implementing the rotation detectors. The cells tuned to the opposite direction appear to lie in adjacent columns and so are ideally located for setting up the opponency required by the detector network.
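The procedure just described can be summarized in a few lines. The sketch below reuses the tuning functions from the earlier sketches; the helper names and the normalized-response threshold (a criterion discussed in Subsection 2.D below) are my own assumptions.

```python
import numpy as np

ROT_DIRECTIONS = np.arange(0, 360, 5)               # 72 preferred directions (deg)
ROT_SPEEDS = np.array([1.5, 3.0, 6.0, 9.0, 12.0])   # preferred speeds (deg/s)

def rotation_detector_outputs(vel):
    """Total activity of every rotation detector for a field of 2-D motion
    vectors: for each preferred direction and speed, sum the matching sensor
    outputs over the whole field and subtract the outputs of sensors tuned to
    the opposite (180 deg) direction at the same preferred speed."""
    theta = np.degrees(np.arctan2(vel[:, 1], vel[:, 0]))
    r = np.linalg.norm(vel, axis=1)
    out = np.zeros((len(ROT_DIRECTIONS), len(ROT_SPEEDS)))
    for i, gamma in enumerate(ROT_DIRECTIONS):
        for k, R in enumerate(ROT_SPEEDS):
            same = direction_response(theta, gamma) * speed_response(r, R)
            oppo = direction_response(theta, gamma + 180.0) * speed_response(r, R)
            out[i, k] = same.sum() - oppo.sum()
    return out

def detected_rotation(vel, threshold=0.2):
    """Most active rotation detector, normalized by the number of vectors;
    the threshold value here is an assumption for illustration, not a figure
    taken from the paper."""
    out = rotation_detector_outputs(vel)
    i, k = np.unravel_index(np.argmax(out), out.shape)
    if out[i, k] / len(vel) < threshold:
        return None                                  # no significant rotation
    return ROT_DIRECTIONS[i], ROT_SPEEDS[k]
```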

D. Testing the Rotation Detectors

The rotation-detector network was applied to the input flow fields shown in Figs. 6(a), 6(b), and 6(c). The outputs are shown in Figs. 8(a), 8(b), and 8(c), respectively. Starting with the pure translation input [Fig. 6(a)], the output of each rotation detector has been represented in Fig. 8(a). The horizontal axis represents the main direction tuning for the detectors. The vertical axis gives the total activity for each rotation detector. For the case of pure translation [Fig. 6(a)], the amount of rotation activity picked up by the detectors is small and barely registers when shown on a graph normalized relative to the output for the full rotation field. Figure 8(b) shows the rotation-detector output for the pure rotation input shown in Fig. 6(b). The different curves show the outputs for four sets of speed preferences. We can see that the most activity occurs in the detector tuned to 180° (to the left) and to a speed of 9°/s. Thus the set of rotation detectors has correctly registered the rotation in this case. However, this pure rotation input [Fig. 6(b)] is a trivial test for the rotation detectors. The important test case is shown in Fig. 6(c). When the combined translation-rotation field is run through the rotation-detector network, the output shown in Fig. 8(c) results. The peak response is correctly found to be 180° for direction and 9°/s for speed.

Since the output level for each detector is also determined by the number of vectors in the field (the number of motion sensors responding), we need some means of deciding what constitutes a significant level of rotation activity. One possibility is to divide the peak response in the distribution by the total number of vectors in the field to obtain a peak normalized response that could then be compared with some threshold level. For the Fig. 6(a) input, the peak normalized response is only 0.004 unit/vector, and for Fig. 6(b) it is 0.98 unit/vector. For the combined translation-rotation input the peak normalized response is 0.49 unit/vector. This result shows that it is possible to detect the rotation component even in the presence of the translation field.

This example may represent a special case, however, and so a more rigorous approach to testing the rotation detectors was adopted. We need to know just how robust this method is and whether it also works in the presence of the aperture problem. A study of the ability of the rotation detectors to detect rotation in the presence of translation and under a variety of conditions was carried out (see Fig. 9). An important parameter that will affect the estimation ability of the rotation detectors is the amount of depth in the scene. This was manipulated along with the rate of rotation in the following simulations. A fixed number of points (200) were randomly distributed over a 40° × 40° image plane and randomly assigned a depth value in the range of (a) 500, (b) 1000, or (c) 2000 units. The forward simulated motion was always 40 units/s, in the direction α = 0°, β = 0°. The true rotation was always about the Y axis (yaw) and on its own produced image motion in the 180° direction (leftward). Rotation detection is most difficult for low rates of rotation, and so, for the study of the performance


Fig. 8. Output of rotation detectors. The arrows indicate the peak output direction. (a) Output for the pure translation flow field shown in Fig. 6(a). (b) Output for the pure rotation field shown in Fig. 6(b). (c) Output for the combined translation-rotation field shown in Fig. 6(c). The rotation detectors successfully picked up the rotation component of the motion in the presence of the translation vector field.

[Fig. 9: true-flow-vector and edge-flow-vector conditions, with scene depth ranges of 500, 1000, and 2000 units.]
