Journal of Intelligent Learning Systems and Applications, 2010, 2, 212-220
doi:10.4236/jilsa.2010.24024 Published Online November 2010 (http://www.scirp.org/journal/jilsa)
Copyright © 2010 SciRes JILSA
Object Identification in Dynamic Images Based on
the Memory-Prediction Theory of Brain Function
Marek Bundzel, Shuji Hashimoto
Shuji Hashimoto Laboratory, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
Email: Marek.Bundzel@gmail.com, shuji@waseda.jp
Received August 17th, 2010; revised September 20th, 2010; accepted September 25th, 2010.
ABSTRACT
In 2004, Jeff Hawkins presented a memory-prediction theory of brain function, and later used it to create the Hierar-
chical Temporal Memory model. Several of the concepts described in the theory are applied here in a computer vision
system for a mobile robot application. The aim was to produce a system enabling a mobile robot to explore its envi-
ronment and recognize different types of objects without human supervision. The operator has means to assign names
to the identified objects of interest. The system presented here works with time-ordered sequences of images. It utilizes a
tree structure of connected computational nodes similar to Hierarchical Temporal Memory and memorizes frequent
sequences of events. The structure of the proposed system and the algorithms involved are explained. A brief survey of
the existing algorithms applicable in the system is provided and future applications are outlined. Problems that can
arise when the robot’s velocity changes are listed, and a solution is proposed. The proposed system was tested on a
sequence of images recorded by two parallel cameras moving in a real world environment. Results for mono- and ste-
reo vision experiments are presented.
Keywords: Memory Prediction Framework, Mobile Robotics, Computer Vision, Unsupervised Learning
1. Introduction
This work focuses on visual data processing for use in
autonomous mobile robotics. All explanations and ex-
amples are oriented accordingly.
Humans (indeed many animals) possess the ability to
visually recognize objects in their environment thanks to
their highly developed brains. Attempts to build a robust, multipurpose system with this ability for robots have so far failed. Learning what the environment consists of is the first step in the development of intelligent behavior.
The work described here aims to create a visual rec-
ognition system for a mobile robot. The training process
of the system is somewhat similar to the way a child
learns about the world. A child sees things and learns
about the existence of various categories of objects. No
adult trains a child to see. An adult tells the child the
names of some objects of interest so they can be referred
to in communication. The child does not learn about all
objects or their alternative appearances at once, but ra-
ther gradually increases its knowledge.
The system described here operates similarly. First,
the system collects visual data recorded at a steady
frame rate while moving around various objects. Unsu-
pervised learning is applied to identify entities in the
environment. A human operator can assign names to the
entities found. The possible advantage of such hu-
man-machine interaction is that it may be less demand-
ing than creating extensive training sets describing all
objects of interest. On the other hand, there is no direct
means to attract the attention of the system to particular
objects, and therefore the objects identified can differ
from those the operator would ideally like to obtain. The
criterion of the training is the frequency of occurrence of
spatial-temporal patterns. Therefore, anything that fre-
quently appears in the sensory input can be isolated as
an object. The unsupervised learning mechanism has
some features of the memory-prediction theory of brain
function [1].
2. Memory-Prediction Theory of Brain
Function
The memory-prediction theory of brain function was
created by Jeff Hawkins and described in the book [1].
The underlying, basic idea is that the brain is a mecha-
nism predicting the future and that hierarchical regions
of the brain predict their future input sequences.
The theory is motivated by the observed fact that the
mammalian neocortex is remarkably uniform in appear-
ance and structure. Principally, the same hierarchical
structures are used for a wide range of behaviors, and if
necessary, the regions of the neocortex normally used for
one function can learn to perform a different task. The
memory-prediction framework provides a unified basis
for thinking about the adaptive control of complex be-
havior.
Hawkins made several assumptions [1]:
- patterns from different senses are equivalent inside the brain;
- the same biological structures are used to process the sensory inputs;
- a single principle (a feedback/recall loop) underlies processing of the patterns.
According to [1], discovering frequent temporal se-
quences is essential for the functioning of the brain. Pat-
terns coming from different senses are structured in both
space and time. What Hawkins considers one of the most
important concepts is that: “the cortex’s hierarchical
structure stores a model of the hierarchical structure of
the world” [1].
In the process of vision, the information moving up the
hierarchy starts as low-level retinal signals. Gradually,
increasingly complex information is extracted (presence
of sub-objects, motions, and eventually the presence of
specific objects and their corresponding behaviors). The
information moving down the hierarchy carries details
about the recognized objects and their expected behavior.
The patterns on the lower levels of the hierarchy change
quickly, and on the upper levels they change slowly.
Representations on the lower levels are spatially specific
while they become spatially invariant on the upper lev-
els.
The theory has given rise to a number of software
models aiming to simulate this common algorithm using
a hierarchical memory structure. These include an early
model [2] that uses Bayesian Networks and which served
as the foundation for later models like Hierarchical
Temporal Memory [3] or an open source project Neo-
cortex by Saulius Garalevicius [4].
3. Hierarchical Temporal Memory
Hierarchical Temporal Memory (HTM) is a machine
learning model developed by Jeff Hawkins and Dileep
George of Numenta, Inc. HTM models some of the
structural and algorithmic properties of the neocortex.
HTMs are similar to Bayesian Networks, but differ from most in the way that time, hierarchy, action and attention are used. The authors of [3] consider the abilities to discover and to infer causes to be the two most important capabilities of HTM.
HTMs are organized as a tree-shaped hierarchy of
(computational) nodes. The outputs of nodes at one level
become the inputs to the next level in the hierarchy.
Nodes at the bottom of the hierarchy receive input from a
portion of the sensory input. There are more nodes in the
lower levels and fewer in the higher levels of the hierar-
chy. The output of HTM is the output of the top node.
A node works in two modes. In training mode, the
node consecutively groups spatial patterns and identifies
frequently observed temporal patterns. The grouping of
spatial patterns is performed by an algorithm for cluster
analysis. Spatial patterns are assigned, based on their spatial similarity, to groups that are fewer in number than the possible patterns, so resolution in space is lost at each hierarchy level. A mechanism must be provided to determine the probability that a new input belongs to the
identified spatial groups. Information on the membership
of the consecutive inputs to the spatial groups is recorded
in a time sequence and provides a basis for identification
of the frequently observed temporal patterns using a
Temporal Data Mining algorithm.
Although [1] and [3] emphasize the importance of finding and using frequent sequences, HTM, as initially implemented and published on Numenta’s website, appears to store only the information on spatial patterns that frequently appear together and to discard the sequential information. This data structure is usually referred to as a frequent itemset, e.g. [5]. A later HTM-based system using a sequence memory is described in [6]. In [6], a frequent sequence means a subsequence frequently occurring in a longer sequence. This is also known as a frequent episode, e.g. [7]. The length of the stored frequent temporal patterns can be fixed or variable, depending on the algorithms used and the user settings.
In recognition mode, the node is confronted with con-
secutive inputs. Each input is assigned to one of the
stored spatial groups using the provided mechanism.
Then, the node combines this information with its previ-
ous state information and assigns a probability that the
current input is part of the stored frequent temporal pat-
terns. The output of the node is the probability distribu-
tion over the set of the stored frequent temporal patterns.
In Numenta’s implementation of HTM, the output of
the HTM’s top node is matched with a name defined in a
training set using supervised learning, for example, a
Support Vector Machine.
4. Description of the Proposed System
4.1. Functions and Structure
Similarly to HTM, the proposed system is a hierarchy of
computational nodes, grouped into layers. A layer is a
two dimensional rectangular grid of nodes. A node N is
identified by indices l,x,y (l is the index of the node’s
layer, x,y are the node’s coordinates within the layer).
Sensory data (either raw or preprocessed) forms the bot-
tom layer’s input matrix. The sensory data is image data from a single camera or from two parallel color cameras, though only grayscale images were used here. The preprocessing can include any filtering or image processing algorithm considered beneficial for the application.
The receptive field of a node is a rectangular portion
of its layer’s input matrix, defined by width and height.
The receptive fields of the nodes within a layer do not
overlap and together they cover the input matrix. The
receptive field of a node in the bottom layer in the
stereoscopic setup is formed as shown in Figure 1.
The stimuli in the receptive field of a node at time t form a vector RF_{l,x,y,t}. The ordering of the elements of the portion of the layer’s input matrix corresponding to the receptive field of a node into a vector is arbitrary, but must remain constant.
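The tiling of a layer’s input matrix into receptive-field vectors can be sketched as follows (a minimal illustration; the function name, the row-major flattening order and the NumPy representation are our own choices, not from the paper):

```python
import numpy as np

def receptive_fields(input_matrix, rf_h, rf_w):
    """Split a layer's input matrix into non-overlapping receptive
    fields of rf_h x rf_w elements and flatten each into a vector.
    The flattening order (here row-major) is arbitrary but must
    remain constant."""
    H, W = input_matrix.shape
    assert H % rf_h == 0 and W % rf_w == 0, "fields must tile the matrix"
    fields = {}
    for y in range(H // rf_h):
        for x in range(W // rf_w):
            patch = input_matrix[y * rf_h:(y + 1) * rf_h,
                                 x * rf_w:(x + 1) * rf_w]
            fields[(x, y)] = patch.ravel()   # the vector RF_{l,x,y,t}
    return fields

# A 240x320 grayscale frame tiled by 16x32 fields gives 15x10 = 150 nodes.
frame = np.zeros((240, 320))
rfs = receptive_fields(frame, 16, 32)
```

In the stereoscopic setup the corresponding patches from the two cameras would additionally be joined as in Figure 1; that step is omitted here.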
Output of the nodes of a layer forms the input matrix
for the layer above. The top layer contains a single node.
Output of the top node represents the output of the sys-
tem. An example of the process is given in Figure 2.
A node operates in training and recognition modes.
Training of a node is performed in two stages. The node
performs spatial grouping of the training input patterns
appearing in its receptive field by means of an algorithm
for cluster analysis (clustering). Cluster analysis is a
deeply researched domain. A survey of clustering algo-
rithms [8] provides information on categorization of the
algorithms and illustrates their applications on some da-
tasets.
The number of the identified spatial groups (clusters)
reflects the structural complexity of the input data. The
parameters of the spatial grouping algorithm of the nodes
in separate layers are likely to require different settings.
The spatial grouping algorithm must provide a mecha-
nism for categorization of a novel input.
K-means clustering [9] was used in this work. The si-
milarity measure was Euclidean distance. The training
Figure 1. Forming a receptive field of a node in the bottom layer in the stereoscopic setup (example).
Figure 2. Example of a two-layer hierarchy of nodes in one time step. Layer 1 contains a single node; therefore the input matrix of Layer 1 and the receptive field of the node in Layer 1 are identical.
patterns for the cluster analysis in a node N are represented by a set {RF_{t_0}, RF_{t_0+1}, ..., RF_{t_end}}. For example, if N is in the bottom layer, the training set will contain data representing the patterns which were appearing over time in the portion of the image data covered by the receptive field of N. The algorithm produces a set of k centroids C = {C_0, C_1, ..., C_{k-1}}, where k is set by the user. The centroids are vectors with the same number of elements as the receptive field vector of the node.
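The spatial grouping step can be sketched with a plain k-means implementation (the function names and the toy data are ours; any clustering algorithm providing a categorization mechanism could be substituted):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means with Euclidean distance: returns k centroids
    C_0, ..., C_{k-1} with the same dimensionality as the inputs."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every training pattern to its nearest centroid ...
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # ... and move each centroid to the mean of its patterns
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C

def categorize(rf, C):
    """Index of the spatial group whose centroid is closest to rf."""
    return int(((C - rf) ** 2).sum(axis=1).argmin())

# two well-separated clusters of 1-D "receptive field" vectors
X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [5.1]])
C = kmeans(X, 2)
```

The `categorize` helper is the required mechanism for deciding which identified spatial group a novel input belongs to.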
After the spatial groups are identified, the node processes the training patterns {RF_{t_0}, RF_{t_0+1}, ..., RF_{t_end}} ordered in time, starting with the oldest. Each training pattern is assigned to exactly one spatial group. In this work, each training pattern RF_t is assigned to the spatial group that has the closest centroid in terms of Euclidean distance. The index of the winning spatial group, w_t ∈ {0, ..., k-1}, is appended to a time-ordered list S if w_t ≠ w_{t-1}. The node ignores repeating states both in training and recognition modes for the reasons explained in Section 4.2.
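The assignment of training patterns to winning groups and the construction of S can be sketched as follows (a toy illustration with 1-D centroids; all names are ours):

```python
import numpy as np

def build_state_list(patterns, C):
    """Assign each training pattern RF_t to the spatial group with the
    closest centroid (Euclidean distance) and record the index of the
    winning group in a time-ordered list S -- but only when it differs
    from the previous winner, so repeating states are ignored."""
    S = []
    for rf in patterns:
        w = int(((C - rf) ** 2).sum(axis=1).argmin())
        if not S or w != S[-1]:
            S.append(w)
    return S

# three centroids; a pattern stream that lingers near each in turn
C = np.array([[0.0], [1.0], [2.0]])
stream = np.array([[0.1], [0.0], [0.9], [1.1], [1.0], [2.0], [1.9]])
S = build_state_list(stream, C)
```

Although the stream contains seven patterns, S records only the three transitions between states.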
The time-ordered list of indices (a sequence of indices) S represents the training data for a Temporal Data Mining algorithm searching for frequent episodes within S. w_t represents the state of the receptive field of the node at time t, and S represents the recording of the transitions between the states. The Temporal Data Mining algorithm used in this work is described in [10]. It is based on the frequent episode discovery framework [7]. It searches for frequent episodes with variable length. The frequent episodes identified by N are stored in a list E of lists E_0, ..., E_{Ne-1}, where Ne is the number of the identified frequent episodes. The user determines the minimal length of the frequent episodes to be stored. It is ensured that shorter episodes are not contained in longer episodes, because that would create undesired ambiguity.
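As a rough stand-in for the episode discovery of [7,10], a brute-force miner over contiguous subsequences illustrates the idea; the thresholds, names and pruning rule here are illustrative assumptions, not the published algorithm:

```python
from collections import Counter

def frequent_episodes(S, min_len, max_len, min_support):
    """Count every contiguous subsequence of S with length in
    [min_len, max_len]; keep those occurring at least min_support
    times, then drop any episode contained in a longer kept one."""
    counts = Counter()
    for L in range(min_len, max_len + 1):
        for i in range(len(S) - L + 1):
            counts[tuple(S[i:i + L])] += 1
    kept = [ep for ep, c in counts.items() if c >= min_support]
    kept.sort(key=len, reverse=True)   # longest first

    def contained(short, long_):
        n = len(short)
        return any(long_[i:i + n] == short
                   for i in range(len(long_) - n + 1))

    E = []
    for ep in kept:
        if not any(contained(ep, longer) for longer in E):
            E.append(ep)   # shorter episodes inside longer ones are pruned
    return E

# the state sequence 0,1,2 repeats three times
E = frequent_episodes([0, 1, 2, 0, 1, 2, 0, 1, 2, 3], 2, 3, 3)
```

Only the longest frequent episode (0, 1, 2) survives; its frequent sub-episodes (0, 1) and (1, 2) are pruned to avoid the ambiguity described above.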
There is a range of Temporal Data Mining algorithms related to mining for frequent temporal patterns (e.g. [7,11,12]), sequence matching (absolute and approximate, e.g. [14]) and sequence clustering that can be applied in a system like the one described here. Useful surveys on temporal data mining techniques can be found in [9] and [10].
Operation of N in recognition mode is divided into two consecutive stages. First, a novel input RF_t is categorized into one of the spatial groups identified in the training process, RF_t → w_t. If w_t ≠ w_{t-1}, w_t is appended to the list BS (the buffer stack) and the oldest item of BS is deleted. A constant length of BS is thus maintained. BS can be seen as a short-term memory because it records the recent changes of states of the receptive field of the node. The length of BS is defined by the user. The elements of BS are initialized to -1 at the start of the algorithm. -1 does not appear in the stored frequent episodes; therefore, BS cannot be found in any of them before it is filled with valid values after start or after reset.
Second, in the given time step, the node tries to find which of the frequent episodes stored in E contains BS (in direct or reverse order). The purpose is to recognize whether the sequence of the recent changes in the receptive field has been frequently observed before. The output of N at time t is a binary vector O_t. The elements of O_t correspond to the stored frequent episodes. If E_i contains BS in the given time step, the i-th element of O_t is set to 1; otherwise it is set to 0. If E_i is shorter than BS, the corresponding number of older items in BS is ignored and the matching is performed with the shortened buffer stack.
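The matching stage can be sketched as follows (a simplified stand-in with our own function names; the direct/reverse containment test is implemented naively):

```python
def output_vector(BS, E):
    """O_t has one element per stored frequent episode E_i. Element i
    is 1 when E_i contains the buffer stack BS (newest items last) as a
    contiguous run, in direct or reverse order. If E_i is shorter than
    BS, the oldest items at the front of BS are ignored and the
    shortened buffer is matched."""
    def contains(ep, seq):
        n = len(seq)
        return any(list(ep[i:i + n]) == list(seq)
                   for i in range(len(ep) - n + 1))

    O = []
    for ep in E:
        q = BS[-len(ep):] if len(ep) < len(BS) else BS
        O.append(1 if contains(ep, q) or contains(ep, q[::-1]) else 0)
    return O
```

Because -1 never occurs in a stored episode, a buffer that still holds its initial values produces an all-zero output, as required.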
There are several conditions modifying the behavior of a node in recognition mode. The node can be active (flag A = 1) or inactive (flag A = 0), with nodes initially starting with A = 1. The conditions are checked in each time step. If w_t = w_{t-1}, the counter T_idle is incremented by 1 and O_t will be equal to O_{t-1}. If T_idle exceeds a user-defined timeout constant T_out, A is set to 0 and the elements of O_t are set to 0. The node remains inactive until there is a significant change in its input (w_t ≠ w_{t-1}). If that happens, the node is reset: A is set to 1, T_idle is set to 0 and the elements of BS are set to -1. This is to avoid unrelated events lying further apart in time being considered one event by a node. For example, if only a portion of the robot’s vision field is changing, the nodes processing the unchanging portion will turn inactive. This also reduces the computational load.

Table 1. Algorithms used during training (T) and recognition (R).

Algorithm | Mode | Comments
Data preprocessing | T, R | e.g. Gabor filtering, normalizing to unit length etc.
Clustering | T | Parameters set for each layer separately
Categorization | T, R | RF_t → w_t
Temporal data mining | T | Parameters set for each layer separately
Sequence matching | T, R | Finding BS in E
Name assignment | T | Putting a human-readable label on the objects found
Table 1 summarizes the algorithms used. The calculation of the output O_t of a node in recognition mode in one time step can be expressed in pseudocode as follows:

{w_{t-1}, BS, T_out, A have assigned values}
w_t ← Categorize(RF_t, C)  {categorize current input using the centroids}
if w_t ≠ w_{t-1} then
  if A = 0 then
    A ← 1
  end if
  T_idle ← 0
  Push(BS, w_t)  {append w_t to BS, delete oldest element of BS}
  O_t ← FindInEpisodes(BS, E)  {find which episodes contain BS}
else
  if A = 0 then
    O_t ← O_{t-1}
  else
    T_idle ← T_idle + 1
    if T_idle > T_out then
      A ← 0
      SetAllElements(BS, -1)  {set all elements of BS to -1}
      SetAllElements(O_t, 0)  {set all elements of O_t to 0}
    else
      O_t ← O_{t-1}
    end if
  end if
end if
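The pseudocode above can be turned into a runnable sketch; the class and parameter names are ours, and the categorization and episode-matching routines are passed in as stand-ins:

```python
class Node:
    """One node in recognition mode, following the pseudocode above.
    categorize maps a receptive-field input to a spatial-group index;
    find_in_episodes maps the buffer stack to the binary output vector
    (both supplied by the caller)."""

    def __init__(self, categorize, find_in_episodes, bs_len, t_out):
        self.categorize = categorize
        self.find = find_in_episodes
        self.BS = [-1] * bs_len       # buffer stack (short-term memory)
        self.w_prev = None            # w_{t-1}
        self.A = 1                    # active flag
        self.T_idle = 0
        self.T_out = t_out
        self.O = []                   # O_{t-1}

    def step(self, rf):
        w = self.categorize(rf)
        if w != self.w_prev:                    # significant change
            self.A = 1                          # (re)activate the node
            self.T_idle = 0
            self.BS = self.BS[1:] + [w]         # push w_t, drop oldest
            self.O = self.find(self.BS)
        elif self.A == 1:                       # unchanged, still active
            self.T_idle += 1
            if self.T_idle > self.T_out:        # idle for too long
                self.A = 0
                self.BS = [-1] * len(self.BS)   # -1 matches no episode
                self.O = [0] * len(self.O)
        # if inactive and unchanged: O_t = O_{t-1}
        self.w_prev = w
        return self.O
```

Clearing BS only on deactivation suffices here: since -1 never occurs in a stored episode, a freshly reactivated node cannot produce spurious matches until its buffer refills with valid values.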
When BS is found to be part of a stored frequent episode, a prediction of the future inputs already resides in the remaining part of the frequent episode. The prediction can, for example, be used to reduce ambiguity in the categorization of the incoming input if it is noisy. This feature was not used here, however.
The user defines the structure of the hierarchy (i.e. the number of layers and the dimensions of the receptive fields for nodes in each layer) and the settings of the training algorithms for each layer. The layers can be trained simultaneously, but it is more suitable to train the layers consecutively, starting with the bottom layer. In this way, it is ensured that a layer about to be trained is getting meaningful input.
In order to simplify the learning process and to in-
crease generality, a modified training approach can be
used. Instead of training the nodes of a layer separately, a master node N_master is trained using the data from the receptive fields of all nodes in the layer. The training set of the master node is:
{RF_{L,0,0,t_0}, ..., RF_{L,0,0,t_end}, ..., RF_{L,m-1,n-1,t_0}, ..., RF_{L,m-1,n-1,t_end}}   (1)
where L is the index of the layer of m by n nodes being trained. This implies that the receptive fields of all nodes must have the same dimensions. When N_master is trained, a copy of it replaces the nodes at all positions within the layer:

N_{L,i,j} ← N_master,  for all i = 0, 1, ..., m-1; j = 0, 1, ..., n-1   (2)
This rests on the assumption that objects can potentially appear in any part of the image even though they were not recorded that way in the training images; the advantage is that each node will be able to recognize all objects identified in the input data. In this work, the modified training approach was applied to each layer of the hierarchy.
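Under the modified approach, training a layer reduces to pooling data and copying one node. A minimal sketch (the function names and the dictionary representation of a layer are our own):

```python
import copy

def train_layer_with_master(rf_data_per_node, train_node):
    """Modified training: pool the receptive-field data of every node
    in the layer (Equation (1)), train one master node on the pooled
    set and install a copy of it at every position (Equation (2)).
    rf_data_per_node maps a position (x, y) to that node's time-ordered
    receptive-field vectors; train_node is a stand-in for the
    clustering + episode-mining training described above. All receptive
    fields must have the same dimensions."""
    pooled = [rf for data in rf_data_per_node.values() for rf in data]
    master = train_node(pooled)
    # independent copies, so per-node run-time state (BS, A, ...) stays separate
    return {pos: copy.deepcopy(master) for pos in rf_data_per_node}
```

Deep copies are used so that, in recognition mode, each position can keep its own buffer stack and activity flag while sharing the learned centroids and episodes.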
After the unsupervised learning of the system is com-
pleted, the operator assigns names to the objects the sys-
tem has identified. The output of the top layer (node) of
the hierarchy is a binary vector. All images from the
training set for which a particular element of the binary
vector is non-zero are grouped and presented to the op-
erator. The operator decides whether the group contains
a majority of pictures of an object of interest. This is
done for all elements of the output vector. When the sys-
tem is tested on novel visual data, ideally, the elements
of the output vector should respond to the same type of
objects as in the training set.
4.2. Domain Related Problems
The problems of the application of the described system
in a mobile robot are largely related to the balance which
must be achieved between the robot’s velocity, the scan-
ning frequency, the dimensions of the receptive fields of
the nodes in the bottom layer and the measure of the dis-
cretization of the input space into spatial groups (the
number of spatial groups). In this work, the parameters were set based on logical assumptions and/or trial and error.
Let us assume the robot is moving forward. The objects in its vision field appear larger as the robot approaches and leave the vision field sideways as the robot passes. Let us assume there is an object whose features are consecutively assigned to three different spatial groups A, B and C as it moves in the vision field. Assuming a constant frame rate and image processing, the observed sequence may be ... A A A B B B C C C ..., or ... A B C ..., or ... A C ..., depending on the velocity of the robot. However, it is desirable that the object be characterized by a constant temporal pattern within the range of the robot’s velocities.
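The repeat-suppression mechanism can be illustrated directly (the state labels are hypothetical):

```python
def collapse(seq):
    """Drop immediately repeated states, as the nodes do when they
    ignore repeating states."""
    out = []
    for s in seq:
        if not out or s != out[-1]:
            out.append(s)
    return out

# The same object passing through spatial groups A, B, C at two
# different robot velocities:
slow = list("AAABBBCCC")   # low velocity / high frame rate
fast = list("ABC")         # high velocity / low frame rate
```

Note that the third case from the text, ... A C ..., where a group is skipped entirely, is not repaired by repeat suppression; this is why the velocity must still stay within a workable range.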
To minimize the influence of the changing velocity,
the nodes ignore repeating states. The disadvantage is that objects distinguished only by a variable number of repeating features will be considered a single object type. A lower velocity, a higher scanning frequency and a rougher discretization of the input space increase the frequency of the repeating states in the observed sequence, and vice versa. To ensure optimal performance, these values must be in balance.
The number of the spatial groups to be identified by a
node is relative to the structural complexity of the spatial
data. It is the value to start the tuning with. The scanning
frequency (including processing of the images) is largely
limited by the computational capacity of the control
computer. During training, the robot first collects the
image data without processing them so the scanning fre-
quency can be higher and the robot can move faster. It
should be taken into consideration that the robot’s veloc-
ity will likely have to be reduced when the system enters
recognition mode due to the reduction of the scanning
frequency. The velocity of the robot can be easily
changed but cannot be too low for meaningful operation.
Ideally, there will be frequently repeating states in the
sequence observed by a node regardless of the robot’s
velocity. The robustness of the system will be higher
assuming that no important features are being missed.
This problem is most critical for the nodes in the bottom layer, because the input patterns change more slowly in the higher levels of the hierarchy. Setting the dimensions of the receptive fields of the nodes in the bottom
layer influences how long an object will be sensed by a
node. If the dimensions are too small given the velocity,
the scanning frequency and the discretization, it is more
likely that two unrelated objects will appear in the recep-
tive field of a node in two consecutive time steps. This
means that the identified frequent episodes (if any)
would include features of different objects instead of
including different features or positions of a single object.
The resolution of the recognition may deteriorate below
an acceptable level. On the other hand, if the receptive
fields of the nodes in the bottom layer are too large, it is
more likely that multiple objects will be sensed simulta-
neously by a node. In every time step, the stimuli in the receptive field are assigned to a single spatial group. One
of the sensed objects will thus become dominant. How-
ever, in the following time steps, other objects in the
receptive field may become dominant and the observed
sequence will lose meaning.
If there are multiple objects in the vision field of a robot during operation, the system may isolate a single object, may treat a group of objects as a single object, may not recognize the objects (all elements of the output vector are 0), or may misinterpret the situation (an element of the output vector that is usually active in the presence of a different object will become active). Note that any frequent visual spatio-temporal pattern may be identified as a separate object during training, and not necessarily one a human would choose. No mechanism for covert attention was implemented in the system at this point; the system evaluates the vision field as a whole.
5. Comparison to Similar Systems
The proposed system is closely related to other models
based on the memory-prediction framework ([2-4, 6]). It
has the same internal structure as HTM. The sequence
memory system published by Numenta in 2009 [6] uses
a mechanism to store and recall sequences in an HTM
setting. The Temporal Data Mining technique used in [6] aims to closely map the authors’ proposed biological sequence memory mechanism. In contrast to the proposed system, it enables simultaneous learning and recall. The system proposed here cannot currently utilize this feature, because when a node identifies a new sequence, the dimension of its output vector increases and retraining of the nodes in the layers above is necessary. Neither the proposed system nor [6] provides a means of storing the duration of sequence elements.
The proposed system can be considered an HTM with
a sequence memory. The products published by Numenta
and the proposed system use different algorithms for
cluster analysis and Temporal Data Mining, but the main
difference is that the proposed system is specialized for a
real time computer vision application on a mobile robot.
This required implementation of a mechanism for mini-
mizing the influence of the robot’s changing velocity on
storing and recalling the frequent episodes and usage of
relatively fast methods. In contrast to HTM, the system does not use supervised learning at the top level; a human-machine interface for labeling the categories of the identified objects is used instead. In other words, the proposed system is allowed to isolate objects on its own. The supervision takes the form of communication (although primitive at the moment) instead of typical supervised learning, in which observations must be assigned to predefined categories.
6. Experiments
6.1. Experimental Setup
The experiments were performed on image data recorded
with two identical parallel cameras installed on a mount.
The optical axes of the lenses were parallel to each other (50 mm apart) and to the floor (100 mm above it). The image data was recorded in a portion of office space bounded by furniture (a closet, drawer boxes etc.) and walls. There were three small objects placed in the area: a toy dog, a toy turtle and a toy car. The mount was moved around the obstacles along changing paths at an average speed of approximately 50 mm/s.
The images were taken at 2 fps at a resolution of 320×240 pixels. These values were chosen so that the control computer could process the incoming images in real time with Gabor filtering in use (Gabor filtering is relatively time-consuming). The recorded data set contained a total of 1140 time-ordered images from each camera. The first 60% of the image sequence was used for training and the remaining 40% for validation.
The experiments were performed on monoscopic and
on stereoscopic data. Table 2 summarizes the system
preset values used in the experiments. In both the mono-
scopic and stereoscopic experiments, the system con-
sisted of two layers of nodes. Grayscale image data was
used. Gabor filtering did not significantly influence the results of the system and was therefore omitted to save computational resources. Gabor filtering
significantly reduces the amount of information in the
image, potentially increasing the generalization per-
formance and the robustness of the system. However, in
this experiment the same relatively simple objects were used for both training and validation. Since only the trajectories of the movement and the views of the objects varied, this reduction was deemed unnecessary. Euclidean distance was used to measure the similarity of the spatial patterns, and so changes in the cameras’ exposure settings (over- or under-exposure) or a significant change in lighting would probably negatively influence the matching of the spatial patterns in the bottom layer of the hierarchy. Using Gabor filtering would greatly reduce this negative influence, but the image data was taken in a visually stable environment, a small area with dispersed light. Also, there are simpler methods of exposure correction during preprocessing, such as normalization of the input patterns to unit length.

Table 2. Experimental settings.

Layer Index/Experiment Type | L 0/Mono | L 1/Mono | L 0/Stereo | L 1/Stereo
No. of Nodes | 150 | 1 | 150 | 1
RF dim. | 32x16 | 150x29* | 64x16 | 150x26*
No. of Spatial Groups | 80 | 20 | 100 | 20
BS Length | 3 | 2 | 3 | 2
T_out (time steps) | 10 | 10 | 10 | 10
Freq. Ep. Min. Length | 3 | 1 | 3 | 1

* Value obtained after training layer 0.
The relatively large receptive fields of the nodes in layer 0 ensured that the objects were sensed by a node for several consecutive time steps and that meaningful frequent sequences could be identified. Using a 2 fps frame rate at the given velocity caused rapid relative movement of the objects between consecutive images. As most of the movement is along the horizontal axis, the width of the receptive fields was made larger than their height, so as to capture this kind of movement.
6.2. Experimental Results
Assigning labels to the elements of the output vector of the top node depends, to a certain extent, on the subjectivity of the operator, whose task is to find a common element in sets of pictures. For example, the majority of pictures in a set may show either the toy car, or a corner of the room, or the toy car in a corner. The operator decides whether to label the corresponding output as “the toy car”, “a corner” or “the toy car in a corner”. Therefore, in this case, the emphasis was on the consistency of the results between the training and the testing sets. This means that when the operator assigns a name to certain outputs, these outputs should become active for the same objects on both the training and the testing sets. The use of supervised learning is possible, but it would force the system to find objects of predefined categories; the interest here, however, was to study which objects would attract the attention of the system.
Table 3 summarizes how many frequent episodes were found and the lengths of those episodes. Figure 3 shows examples of the frequent episodes as sequences of cluster centroids. It is obvious that what matters are the changes in luminance rather than changes in texture.
Table 4 summarizes the experimental results. The varia-
tion of the information at the output of layer 1 was
lower, which confirmed expectations based on the mem-
ory-prediction theory. In the monoscopic experiment,
layer 0 transitioned 86 times between the identified spatial
groups (i.e., between states) over the 640 images of the
training set. Typically, layer 0 persisted in a certain state for
several time steps before changing. This is consistent with
the fact that objects remain in the field of view for some
time before the field of view is dominated by a different
object. Altogether, 8 names were assigned. Groups con-
taining an uncharacteristic mixture of objects were
Table 3. Counts of frequent episodes identified in the experiments.

Layer index / Experiment type   L0/Mono   L1/Mono   L0/Stereo   L1/Stereo
Ep. length = 1                     -         20         -          20
Ep. length = 2                     -          3         -           3
Ep. length = 3                     7          1        12           0
Ep. length = 4                    12          0         8           0
Ep. length = 5                    10          0         6           0
Total                             29         24        26          23
Table 4. Experimental results.

Monoscopic Experiment
Object Name     Count Train   Count Test   Corr. Train   Corr. Test   O.*
empty carpet        112           59           85%           81%      8
drawer box           38           15           79%           80%      1
car                  71           42           89%           90%      4
white table          45           28           88%           89%      2
turtle               28           18           82%           83%      1
dark table           88           57           90%           79%      1
el. socket           49           21           71%           81%      1
sliding door         55           34           73%           79%      1
unidentified        198          182            -             -       5

Stereoscopic Experiment
Object Name     Count Train   Count Test   Corr. Train   Corr. Test   O.*
empty carpet        105           52           74%           77%      7
drawer box           33           17           76%           76%      1
car                  50           35           86%           80%      3
white table          45           28           87%           82%      3
turtle               22           15           68%           80%      2
dark table           88           57           83%           75%      1
el. socket           47           19           74%           63%      1
sliding door         55           28           78%           71%      2
unidentified        239          205            -             -       3

* Number of outputs responding to the object.
Figure 3. Frequent episodes 14-19 of cluster centroids found in the single-camera experiment on layer 0.
Figure 4. Typical examples of the objects identified in the
single camera experiment. 1. “empty carpet”, 2. “drawer
box”, 3. “car”, 4. “white table”, 5. “turtle”, 6. “dark table”,
7. “corner with electric socket”, 8. “closet with sliding
door”.
designated as “unidentified”. The correctness of the output
on the testing set had to be established by manually
counting the images of each group containing the given
object; automatic evaluation was not possible because it
was not known in advance which objects the system would identify.
Figure 4 shows typical examples of the objects found.
In the stereoscopic experiment, the appearance of the
centroids identified at layer 0 and the overall behavior
were similar to those of the monoscopic experiment.
Therefore, the operator searched for the same objects so
that a comparison would be possible. The experimental
results indicate that adding the information from the
second camera increased confusion rather than resolution.
This can be attributed to the slightly different settings of
the algorithms or to problems related to the alignment of
the cameras. In this setup, the cameras must be precisely
aligned so that the same object (shifted) appears in
both halves of the aggregated receptive field of a node.
The confusion increased slightly more for objects near
the cameras: if an object is near the cameras, a node
may capture it in only one half of its receptive field,
losing the depth information.
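For illustration, the aggregated stereo receptive field described above can be sketched as follows. The paper does not give implementation details, so the function name, the patch geometry, and the simple side-by-side concatenation of corresponding patches are our assumptions:

```python
import numpy as np

def aggregated_receptive_field(left_img, right_img, row, col, h, w):
    """Build a node's input by placing the same image region from the
    left and right cameras side by side. This assumes the cameras are
    aligned so that a given object appears (shifted) in both halves.
    """
    left_patch = left_img[row:row + h, col:col + w]
    right_patch = right_img[row:row + h, col:col + w]
    return np.hstack([left_patch, right_patch])  # shape (h, 2*w)

# toy example with random "images"
left = np.random.rand(48, 64)
right = np.random.rand(48, 64)
field = aggregated_receptive_field(left, right, 8, 16, 12, 20)
# field.shape == (12, 40): the two camera patches side by side
```

An object close to the cameras may fall outside the patch in one of the two images, which is the failure mode described above.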
The experimental results indicate a strong consistency
between the results on the training and testing sets. If an
object was identified during training, it is very likely that
the same elements of the output vector respond to that
object on the testing set as well. However, a large portion
of the sets remains unidentified, despite the presence of
identified objects in the images.
7. Problems, Discussion and Future Work
One of the main problems of the system is the large num-
ber of user-set parameters that significantly influence its
functioning. Because a human operator is needed to name
the objects, a trial-and-error approach is time-consuming.
Algorithms for the automatic optimization of these
parameters remain to be developed.
Like HTM, the system can be modified for supervised
learning. A supervised learning setup can provide more
data to evaluate the behavior of the system and to de-
velop methods for optimization of the user set parame-
ters.
The algorithm matching BS with the frequent episodes
E searches for an exact match; discretization of the input
space currently provides the only means of generalization.
An algorithm searching for approximate matches should
be employed. Predictions and feedback also remain to be
implemented.
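An approximate matcher of the kind suggested here could, for instance, tolerate a small number of element-wise mismatches between the current state window and a stored episode. The function and threshold below are illustrative, not the paper's implementation; [14] describes a more efficient bit-parallel approach to the same problem:

```python
def approx_match(window, episodes, max_mismatch=1):
    """Return the stored episodes that match the tail of the current
    state window with at most max_mismatch element-wise differences
    (a simple relaxation of the exact match used in the system).
    """
    hits = []
    for ep in episodes:
        if len(ep) > len(window):
            continue
        tail = window[-len(ep):]           # compare against the most recent states
        mismatches = sum(a != b for a, b in zip(tail, ep))
        if mismatches <= max_mismatch:
            hits.append(ep)
    return hits

# toy example: (1, 2, 9) matches exactly, (1, 2, 3) within one mismatch
window = [5, 1, 2, 9]
episodes = [(1, 2, 3), (1, 2, 9), (7, 7, 7)]
```

Generalization then no longer depends solely on the discretization of the input space, since nearby sequences can activate the same episode.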
The brain presumably includes information on the or-
ganism’s own movements when making predictions
about future inputs. In future work, we propose making
Tout inversely proportional to the robot’s velocity, to
prevent the system from turning inactive when the velocity
decreases or the robot stops. Incremental learning is an
important feature of the system that remains to be developed.
Adding new frequent episodes in layer L increases the
dimensionality of the input matrix of the consecutive layer.
Therefore, the layers above L must be retrained, which
requires storing the corresponding training set. This problem
concerns mainly the higher levels of the hierarchy, which
recognize more complex objects. If the lower levels of the
hierarchy are sufficiently trained, the basic objects comprising
the world have been correctly identified, and more complex
objects are usually composed of the same basic objects.
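The velocity-dependent output period proposed above can be sketched as T_out = k / v. The cap t_max below is our own addition to keep the value finite when the robot slows down or stops; the paper only states the inverse proportionality:

```python
def t_out(k, velocity, t_max):
    """Output period inversely proportional to the robot's velocity
    (T_out = k / v), capped at t_max so that a node still emits
    output when the robot slows down or stops. The cap t_max is an
    assumption, not part of the original proposal.
    """
    if velocity <= 0:
        return t_max
    return min(k / velocity, t_max)
```

With this scaling, fast movement yields short output periods (rapidly changing inputs) while slow movement stretches the period instead of silencing the node.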
In addition to learning new sequences, forgetting
should also be considered (as discussed in [4]). An
autonomous robot will presumably explore different en-
vironments and collect a large amount of knowledge.
Matching a current sequence to all of the stored frequent
episodes would become more and more demanding over
time. We propose that frequently used sequences would
be kept active and less frequently used sequences would
be transferred into storage or completely forgotten. If a
sequence is not recognized among the active episodes, the
episodes in storage should be checked for a possible match
and retrieved if necessary. In this way, the robot could keep
active only those learned sequences which are relevant
to the current environment.
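The active/storage scheme proposed above resembles a least-recently-used cache. A minimal sketch (class and method names are illustrative, not from the paper):

```python
from collections import OrderedDict

class EpisodeStore:
    """Keep the most recently used episodes 'active'; demote the rest
    to secondary storage. On a miss in the active set, storage is
    checked and a hit is promoted back to the active set."""

    def __init__(self, active_size):
        self.active = OrderedDict()   # episode -> metadata, in LRU order
        self.storage = {}
        self.active_size = active_size

    def lookup(self, episode):
        if episode in self.active:
            self.active.move_to_end(episode)   # mark as recently used
            return self.active[episode]
        if episode in self.storage:            # retrieve from storage
            self._activate(episode, self.storage.pop(episode))
            return self.active[episode]
        return None                            # genuinely unknown episode

    def add(self, episode, meta=True):
        self._activate(episode, meta)

    def _activate(self, episode, meta):
        self.active[episode] = meta
        self.active.move_to_end(episode)
        while len(self.active) > self.active_size:
            # demote the least recently used episode to storage
            old_ep, old_meta = self.active.popitem(last=False)
            self.storage[old_ep] = old_meta

# toy usage: with room for two active episodes, adding a third
# demotes the oldest one to storage
store = EpisodeStore(active_size=2)
store.add((1, 2))
store.add((3, 4))
store.add((5, 6))
```

Matching then only scans the small active set in the common case, which addresses the growing cost of matching against all stored episodes.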
At this point, there is no mechanism that enables the
system to identify the position of a recognized object in
the image. The system usually fails if there are multiple
objects in the scene. We plan to develop a mechanism for
covert attention, presumably using feedback, to identify
the location of an object, mask it, and search for other
objects in the scene.
The performance of the system deteriorated slightly in
the stereoscopic experiment. The causes of the problems
mentioned in the stereoscopic experiment may be elimi-
nated by using a hierarchy with more layers and inte-
grating the information from separate cameras at higher
levels of the hierarchy, not in layer 0 directly.
The strain on the human operator is high, as are the time
demands. More sophisticated means of human-system
interaction should be developed, possibly using a
mechanism for the grouping of similar images and pre-
senting only model examples to the operator and/or high-
lighting the objects in the scene that have triggered the
response.
8. Conclusions
A system for the unsupervised identification of objects in
image data recorded in real time by moving cameras,
using a novel combination of algorithms, was presented
here. The system has several features described in the
memory-prediction theory of brain function. A solution
was proposed to eliminate the uncertainty associated with
the variable speed of the robot.
The system represents an early attempt to make a ma-
chine that learns largely on its own and only needs occa-
sional advice from a human operator. This approach may
prove more suitable for creating intelligent multipurpose
systems than heavily supervising the learning process.
9. Acknowledgements
This work is supported by the Japan Society for the Promo-
tion of Science and by Waseda University, Tokyo.
REFERENCES
[1] J. Hawkins and S. Blakeslee, “On Intelligence: How a
New Understanding of the Brain Will Lead to the Crea-
tion of Truly Intelligent Machines,” Times Books, 2004,
ISBN 0-8050-7456-2.
[2] D. George and J. Hawkins, “A Hierarchical Bayesian
Model of Invariant Pattern Recognition in the Visual
Cortex,” Proceedings of 2005 IEEE International Joint
Conference on Neural Networks, Vol. 3, 2005, pp. 1812-
1817.
[3] J. Hawkins and D. George, “Hierarchical Temporal
Memory: Concepts, Theory and Terminology,” White
Paper, Numenta Inc., USA, 2006.
[4] S. J. Garalevicius, “Memory-Prediction Framework for
Pattern Recognition: Performance and Suitability of the
Bayesian Model of Visual Cortex,” FLAIRS Conference,
Florida, 2007, pp. 92-97.
[5] S. Laxman and P. S. Sastry, “A Survey of Temporal Data
Mining,” Sadhana - Academy Proceedings in Engineering
Sciences, Vol. 31, No. 2, 2006, pp. 173-198.
[6] J. Hawkins, D. George and J. Niemasik, “Sequence
Memory for Prediction, Inference and Behaviour,”
Philosophical Transactions of the Royal Society B, Vol.
364, No. 1521, 2009, pp. 1203-1209.
[7] H. Mannila, H. Toivonen and A. I. Verkamo, “Discovery
of Frequent Episodes in Event Sequences,” Data Mining
and Knowledge Discovery, Vol. 1, No. 1, 1997, pp. 259-289.
[8] R. Xu and D. Wunsch II, “Survey of Clustering Algo-
rithms,” IEEE Transactions on Neural Networks, Vol. 16,
No. 3, 2005, pp. 645-678.
[9] J. B. MacQueen, “Some Methods for Classification and
Analysis of Multivariate Observations,” Proceedings of
5th Berkeley Symposium on Mathematical Statistics and
Probability, Berkeley, University of California Press,
USA, 1967, pp. 281-297.
[10] D. Patnaik, P. S. Sastry and K. P. Unnikrishnan, “Infer-
ring Neuronal Network Connectivity from Spike Data: A
Temporal Data Mining Approach,” Scientific Program-
ming, Vol. 16, No. 1, 2008, pp. 49-77.
[11] R. Agrawal and R. Srikant, “Mining Sequential Patterns,”
Proceedings of 11th International Conference on Data
Engineering, Taipei, 1995, pp. 3-14.
[12] R. Agrawal, T. Imielinski and A. Swami, “Mining
Association Rules between Sets of Items in Large Data-
bases,” Proceedings of the ACM SIGMOD Conference on
Management of Data, USA, 1993, pp. 207-216.
[13] J. Han, H. Cheng, D. Xin and X. Yan, “Frequent Pattern
Mining: Current Status and Future Directions,” Data
Mining and Knowledge Discovery, Vol. 14, No. 1, 2007,
pp. 55-86.
[14] S. Wu and U. Manber, “Fast Text Searching Allowing
Errors,” Communications of the ACM, Vol. 35, No. 10,
1992, pp. 83-91.