Paper Menu >>
Journal Menu >>
Journal of Intelligent Learning Systems and Applications, 2010, 2, 212-220 doi:10.4236/jilsa.2010.24024 Published Online November 2010 (http://www.scirp.org/journal/jilsa) Copyright © 2010 SciRes JILSA Object Identification in Dynamic Images Based on the Memory-Prediction Theory of Brain Function Marek Bundzel, Shuji Hashimoto Shuji Hashimoto Laboratory, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan Email: Marek.Bundzel@gmail.com, shuji@waseda.jp Received August 17th, 2010; revised September 20th, 2010; accepted September 25th, 2010. ABSTRACT In 2004, Jeff Hawkins presented a memory-prediction theory of brain function, and later used it to create the Hierar- chical Temporal Memory model. Several of the concepts described in the theory are applied here in a computer vision system for a mobile robot application. The aim was to produce a system enabling a mobile robot to explore its envi- ronment and recognize different types of objects without human supervision. The operator has means to assign names to the identified objects of interest. The system presented here works with time ord ered seq u ences of images. It utilizes a tree structure of connected computational nodes similar to Hierarchical Temporal Memory and memorizes frequent sequences of events. The structure of the proposed system and the algorithms involved are explained. A brief survey of the existing algorithms applicable in the system is provided and future applications are outlined. Problems that can arise when the robot’s velocity changes are listed, and a solution is proposed. The proposed system was tested on a sequence of images recorded by two parallel cameras moving in a real world environment. Results for mono- and ste- reo vision experiments are presented. Keywords: Memory Prediction Framework, Mobile Robotics, Computer Vision, Unsupervised Learning 1. Introduction This work focuses on visual data processing for use in autonomous mobile robotics. All explanations and ex- amples are oriented accordingly. Humans (indeed many animals) possess the ability to visually recognize objects in their environment thanks to their highly developed brains. The attempts to build a robust, multipurpose system with this ability for robots have failed so far. Learning what the environment con- sists of is the first step in the development of an intelli- gent behavior. The work described here aims to create a visual rec- ognition system for a mobile robot. The training process of the system is somewhat similar to the way a child learns about the world. A child sees things and learns about the existence of various categories of objects. No adult trains a child to see. An adult tells the child the names of some objects of interest so they can be referred to in communication. The child does not learn about all objects or their alternative appearances at once, but ra- ther gradually increases its knowledge. The system described here operates similarly. First, the system collects visual data recorded at a steady frame rate while moving around various objects. Unsu- pervised learning is applied to identify entities in the environment. Human operator can assign names to the entities found. The possible advantage of such hu- man-machine interaction is that it may be less demand- ing than creating extensive training sets describing all objects of interest. On the other hand, there is no direct means to attract the attention of the system to particular objects, and therefore the objects identified can differ from those the operator would ideally like to obtain. The criterion of the training is the frequency of occurrence of spatial-temporal patterns. Therefore, anything that fre- quently appears in the sensory input can be isolated as an object. The unsupervised learning mechanism has some features of the memory-prediction theory of brain function [1]. 2. Memory-Prediction Theory of Brain Function The memory-prediction theory of brain function was created by Jeff Hawkins and described in the book [1]. The underlying, basic idea is that the brain is a mecha- nism predicting the future and that hierarchical regions of the brain predict their future input sequences. Object Identification in Dynamic Images Based on the Memory-Prediction Theory of Brain Function 213 The theory is motivated by the observed fact that the mammalian neocortex is remarkably uniform in appear- ance and structure. Principally, the same hierarchical structures are used for a wide range of behaviors, and if necessary, the regions of the neocortex normally used for one function can learn to perform a different task. The memory-prediction framework provides a unified basis for thinking about the adaptive control of complex be- havior. Hawkins made several assumptions [1]: patterns from different senses are equivalent in- side the brain the same biological structures are used to process the sensory inputs a single principle (a feedback/recall loop) under- lies processing of the patterns According to [1], discovering frequent temporal se- quences is essential for the functioning of the brain. Pat- terns coming from different senses are structured in both space and time. What Hawkins considers one of the most important concepts is that: “the cortex’s hierarchical structure stores a model of the hierarchical structure of the world” [1]. In the process of vision, the information moving up the hierarchy starts as low-level retinal signals. Gradually, increasingly complex information is extracted (presence of sub-objects, motions, and eventually the presence of specific objects and their corresponding behaviors). The information moving down the hierarchy carries details about the recognized objects and their expected behavior. The patterns on the lower levels of the hierarchy change quickly, and on the upper levels they change slowly. Representations on the lower levels are spatially specific while they become spatially invariant on the upper lev- els. The theory has given rise to a number of software models aiming to simulate this common algorithm using a hierarchical memory structure. These include an early model [2] that uses Bayesian Networks and which served as the foundation for later models like Hierarchical Temporal Memory [3] or an open source project Neo- cortex by Saulius Garalevicius [4]. 3. Hierarchical Temporal Memory Hierarchical Temporal Memory (HTM) is a machine learning model developed by Jeff Hawkins and Dileep George of Numenta, Inc. HTM models some of the structural and algorithmic properties of the neocortex. HTMs are similar to Bayesian Networks, but differ from most in the way that time, hierarchy, action and attention are used. The authors [3] consider the ability to discover and infer causes to be the two most important capabili- ties of HTM. HTMs are organized as a tree-shaped hierarchy of (computational) nodes. The outputs of nodes at one level become the inputs to the next level in the hierarchy. Nodes at the bottom of the hierarchy receive input from a portion of the sensory input. There are more nodes in the lower levels and fewer in the higher levels of the hierar- chy. The output of HTM is the output of the top node. A node works in two modes. In training mode, the node consecutively groups spatial patterns and identifies frequently observed temporal patterns. The grouping of spatial patterns is performed by an algorithm for cluster analysis. Spatial patterns are assigned to groups based on their spatial similarity that are fewer in number than the possible patterns, so resolution in space is lost in each hierarchy level. A mechanism must be provided to de- termine the probability that a new input belongs to the identified spatial groups. Information on the membership of the consecutive inputs to the spatial groups is recorded in a time sequence and provides a basis for identification of the frequently observed temporal patterns using a Temporal Data Mining algorithm. Despite emphasizing the importance of finding and using frequent sequences in [1] and [3], it appears that HTM, as initially implemented and published on the Numenta’s website, stores only the information on spa- tial patterns that appear frequently together and discards the sequential information. This data structure is usually referred to as a frequent itemset, e.g. [5]. An later HTM-based system using a sequence memory is de- scribed in [6]. In [6], a frequent sequence means a sub- sequence frequently occurring in a longer sequence. This is also known as a frequent episode, e.g. [7]. The length of the stored frequent temporal patterns can be fixed or variable, depending on the algorithms used and the user settings. In recognition mode, the node is confronted with con- secutive inputs. Each input is assigned to one of the stored spatial groups using the provided mechanism. Then, the node combines this information with its previ- ous state information and assigns a probability that the current input is part of the stored frequent temporal pat- terns. The output of the node is the probability distribu- tion over the set of the stored frequent temporal patterns. In the Numenta’s implementation of HTM, the output of the HTM’s top node is matched with a name defined in a training set using supervised learning, for example, a Support Vector Machine. 4. Description of the Proposed System 4.1. Functions and Structure Similarly to HTM, the proposed system is a hierarchy of computational nodes, grouped into layers. A layer is a Copyright © 2010 SciRes JILSA Object Identification in Dynamic Images Based on the Memory-Prediction Theory of Brain Function 214 two dimensional rectangular grid of nodes. A node N is identified by indices l,x,y (l is the index of the node’s layer, x,y are the node’s coordinates within the layer). Sensory data (either raw or preprocessed) forms the bot- tom layer’s input matrix. The sensory data is image data from a single or two parallel color cameras, though only grayscale images were used here. The preprocessing can include any filtering or image processing algorithm which will be considered beneficial for the application. The receptive field of a node is a rectangular portion of its layer’s input matrix, defined by width and height. The receptive fields of the nodes within a layer do not overlap and together they cover the input matrix. The receptive field of a node in the bottom layer in the stereoscopic setup is formed as shown in Figure 1. The stimuli in the receptive field of a node at time t forms a vectortyxl RF ,,, . Ordering of the elements of the portion of the layer’s input matrix corresponding to the receptive field of a node into a vector is arbitrary, but must remain constant. Output of the nodes of a layer forms the input matrix for the layer above. The top layer contains a single node. Output of the top node represents the output of the sys- tem. An example of the process is given in Figure 2. A node operates in training and recognition modes. Training of a node is performed in two stages. The node performs spatial grouping of the training input patterns appearing in its receptive field by means of an algorithm for cluster analysis (clustering). Cluster analysis is a deeply researched domain. A survey of clustering algo- rithms [8] provides information on categorization of the algorithms and illustrates their applications on some da- tasets. The number of the identified spatial groups (clusters) reflects the structural complexity of the input data. The parameters of the spatial grouping algorithm of the nodes in separate layers are likely to require different settings. The spatial grouping algorithm must provide a mecha- nism for categorization of a novel input. K-means clustering [9] was used in this work. The si- milarity measure was Euclidean distance. The training Figure 1. Forming a receptive field of a node in the bottom layer in the stereoscopic setup-example Figure 2. Example, two layer hierarchy of nodes in one time step. Layer 1 contains a single node therefore input matrix of the Layer 1 and receptive field of the node in Layer 1 are identical patterns for the cluster analysis in a nde N are repre- sented by a set o end ttt RFRFRF ,...,, 1 00 . For example, if N is in the bottom layer, the training set will contain data representing the patterns which were appearing over time in the portion of the image data covered by the receptive field of N. The algorithm produces a set of k centroids 110 ,...,, k CCCC, where k is set by the user. The centroids are vectors with the same number of elements as the receptive field vector of the node. After the spatial groups are identified, the node proc esses the training patterns - end ttt RFRFRF ,...,, 1 00 ordered in time, starting with the oldest. Each training pattern is assigned to exactly one spatial group. In this work, each training pattern t RF is assigned to the spa- tial group that has the closest centroid in terms of Euclidean distance. The index of the winning spatial group },...,0{, kwwt is appended to a time ordered list S if 1 tt ww . The node ignores repeating states both in training and recognition modes for the reasons explained in Section 4.2. The time ordered list of indices (a sequence of indices) S represents the training data for a Temporal Data Min- ing algorithm searching for frequent episodes within S. t represents the state of the receptive field of the node in time t and S represents the recording of the transitions between the states. The Temporal Data Mining algorithm used in this work is described in [10]. It is based on the frequent episode discovery framework [7]. It searches for frequent episodes with variable length. The frequent episodes identified by N are stored in a list E of lists w 1-Ne0 EE ,..., , where Ne is the number of the identi- fied frequent episodes. The user determines the minimal length of the frequent episode to be stored. It is ensured that the shorter episodes are not contained in the longer episodes because it would create undesired ambiguity. There is a scale of Temporal Data Mining algorithms related to mining for frequent temporal patterns (e.g. [7, Copyright © 2010 SciRes JILSA Object Identification in Dynamic Images Based on the Memory-Prediction Theory of Brain Function 215 11,12]) sequence matching (absolute and approximate, e.g [14]) and sequence clustering that can be applied in a system like the one described here. Useful surveys on temporal data mining techniques can be found in [9] and [10]. Operation of N in recognition mode is divided into two consecutive stages. First, a novel input t RF is categorized into one of the spatial groups identified in the training process t twRF . If 1, t is appended to the list BS (the buffer stack) and the oldest item of BS is deleted. Constant length of BS is thus maintained. BS can be seen as a short term memory be- cause it records the recent changes of states of the recep- tive field of the node. The length of BS is defined by the user. The elements of BS are initialized to -1 at the start of the algorithm. -1 does not appear in the stored fre- quent episodes therefore BS cannot be found in any of them before it is filled with valid values after start or after reset. tt ww w Second, in the given time step, the node tries to find which of the frequent episodes stored in E contains BS (in direct and reverse order). The purpose is to recognize whether the sequence of the recent changes in the recep- tive field has been frequently observed before. The out- put of N in time t is a binary vector t O. The elements of t Ocorrespond to the stored frequent episodes. If Ei con- tains BS in the given time step, the i-th element of t Ois set to 1 otherwise it is set to 0. If Ei is shorter than BS the corresponding number of older items in BS is ignored and the matching is performed with the shortened buffer stack. There are several conditions modifying the behavior of a node in recognition mode. The node can be active (flag A = 1) or inactive (flag A = 0), with nodes initially starting with A = 1. The conditions are checked in each time step. If 1 the counter Tidle is incremented by 1. tt ww t O will be equal to 1t O. If Tidle exceeds a user defined timeout constant Tout, A is set to 0, and the ele- ments of t O are set to 0. The node remains inactive Table 1. Algorithms used during training (T) and recogni- tion (R) Algorithm Mode Comments Data preprocessing T,R e.g. Gabor filtering, normal- izing to unit length etc. Clustering T Parameters set for each layer separately Categorization T,R t twRF Temporal data min- ing T Parameters set for each layer separately Sequence matching T,R finding BS in E Name assignment T Putting a human readable label on the objects found until there is a significant change in its input (1 ttww). If that happens, the node is reset: A is set to 1, Tidle is set to 0 and the elements of BS are set to -1. This is to avoid unrelated events lying further apart in time being consid- ered one event by a node. For example, if only a portion of the robot’s vision field is changing the nodes process- ing the unchanging portion will turn inactive. This also reduces the computational load. Table 1 summarizes the algorithms used. Calculation of the output t O of a node in recognition mode in one time step can be seen in pseudocode as follows: {, BS, Tout, A have assigned values} 1t w 1t wCategorize(t RF ,C) {Categorize current input using the centroids} if 1 tt ww then if A = 0 then 1 A end if T idle 1 Push(BS, wt) {Append wt to BS, delete oldest element of BS} t OFindInEpisodes( BS, E) {Find which episodes contain BS} tt ww 1 tt OO 1 else if A = 0 then 1 tt OO else 1 idleidle TTTT if out idle 0 A SetAllElements(BS, -1) {Set all elements of BS to -1} SetAllElementst O,0) {Set all elements of t O to 0} tt OO 1 else 1 tt OO end if end if end if When BS is found to be part of a stored frequent epi- sode, prediction of the future inputs already resides in the remaining part of the frequent episode. The prediction can be for example used to reduce ambiguity by catego- rization of the incoming input if it is noisy. This feature was not used here, however. The user defines the structure of the hierarchy (i.e. number of layers, dimensions of receptive fields for nodes in each layer) and the setting of training algo- Copyright © 2010 SciRes JILSA Object Identification in Dynamic Images Based on the Memory-Prediction Theory of Brain Function 216 rithms for each layer. The layers can be trained simulta- neously, but it is more suitable to train the layers con- secutively, starting with the bottom layer. In this way, it is ensured that a layer about to be trained is getting meaningful input. In order to simplify the learning process and to in- crease generality, a modified training approach can be used. Instead of training nodes of a layer separately a master node Nmaster is trained using the data from the receptive fields of all nodes in the layer. The training set of the master node is: 0 0 ,0,0,,0,0, ,1,1, ,1,1, {,,, ,,, end end Lt Lt Lm n tLmn t RF RF RF RF ---- } (1) Where L is the index of the layer of m by n nodes being trained. This implies that the receptive fields of all nodes must have the same dimensions. When Nmaster is trained, a copy of it replaces nodes at all positions within the layer: NN ,, 0, 1,,1;0, 1,,1: Li jmaster imjn"=- =- ¬ (2) Based on the assumption that objects can potentially appear in any part of the image although they were not recorded that way in the training images, the advantage is that each node will be able to recognize all objects identified in the input data. In this work, the modified training approach was applied on each layer of the hier- archy. After the unsupervised learning of the system is com- pleted, the operator assigns names to the objects the sys- tem has identified. The output of the top layer (node) of the hierarchy is a binary vector. All images from the training set for which a particular element of the binary vector is non-zero are grouped and presented to the op- erator. The operator decides whether the group contains a majority of pictures of an object of interest. This is done for all elements of the output vector. When the sys- tem is tested on novel visual data, ideally, the elements of the output vector should respond to the same type of objects as in the training set. 4.2. Domain Related Problems The problems of the application of the described system in a mobile robot are largely related to the balance which must be achieved between the robot’s velocity, the scan- ning frequency, the dimensions of the receptive fields of the nodes in the bottom layer and the measure of the dis- cretization of the input space into spatial groups (the number of spatial groups). In this work, the parameters were set based on logical assumption and/or trial and error. Let us assume the robot is in forward movement. The objects in its vision field appear larger as the robot ap- proaches and leave the vision field sideways as the robot passes. Let us assume there is an object whose features are consecutively assigned to three different spatial groups A, B and C as it moves in the vision field. As- suming a constant frame rate and image processing, the observed sequence may be ...→ A → A → A → B → B → B → C → C → C → ..., or ...→ A → B → C → ... or ...→ A → C → ..., depending on the velocity of the robot. However, it is desirable that the object be charac- terized by a constant temporal pattern within the range of the robot’s velocity. To minimize the influence of the changing velocity, the nodes ignore repeating states. The disadvantage is that objects distinguished by variable number of repeat- ing features will be considered a single object type. A lower velocity, higher scanning frequency and rougher discretization of the input space increases the frequency of the repeating states in the observed sequence and vice versa. To ensure optimal performance these values must be in balance. The number of the spatial groups to be identified by a node is relative to the structural complexity of the spatial data. It is the value to start the tuning with. The scanning frequency (including processing of the images) is largely limited by the computational capacity of the control computer. During training, the robot first collects the image data without processing them so the scanning fre- quency can be higher and the robot can move faster. It should be taken into consideration that the robot’s veloc- ity will likely have to be reduced when the system enters recognition mode due to the reduction of the scanning frequency. The velocity of the robot can be easily changed but cannot be too low for meaningful operation. Ideally, there will be frequently repeating states in the sequence observed by a node regardless of the robot’s velocity. The robustness of the system will be higher assuming that no important features are being missed. This problem is most critical with the nodes in the bot- tom layer, because the input patterns are changing slower in the higher levels of the hierarchy. Setting the dimen- sions of the receptive fields of the nodes in the bottom layer influences how long an object will be sensed by a node. If the dimensions are too small given the velocity, the scanning frequency and the discretization, it is more likely that two unrelated objects will appear in the recep- tive field of a node in two consecutive time steps. This means that the identified frequent episodes (if any) would include features of different objects instead of including different features or positions of a single object. The resolution of the recognition may deteriorate below an acceptable level. On the other hand, if the receptive Copyright © 2010 SciRes JILSA Object Identification in Dynamic Images Based on the Memory-Prediction Theory of Brain Function 217 fields of the nodes in the bottom layer are too large, it is more likely that multiple objects will be sensed simulta- neously by a node. In every time step, the stimuli in the receptive field is assigned to a single spatial group. One of the sensed objects will thus become dominant. How- ever, in the following time steps, other objects in the receptive field may become dominant and the observed sequence will lose meaning. If there are multiple objects in the vision field of a ro- bot during operation, the system may separate a single object, a group of objects as a single object, may not recognize the objects (all elements of the output vector are 0) or may misinterpret the situation (an element of the output vector will become active which is usually active in the presence of a different object). Note that any frequent visual spatio-temporal pattern may be iden- tified as a separate object during training, and not neces- sarily as a human would do it. No mechanism for covert attention was implemented to the system at this point; the system evaluates the vision field as a whole. 5. Comparison to Similar Systems The proposed system is closely related to other models based on the memory-prediction framework ([2-4, 6]). It has the same internal structure as HTM. The sequence memory system published by Numenta in 2009 [6] uses a mechanism to store and recall sequences in an HTM setting. The Temporal Data Mining technique used in [6] aims to map closely their proposed biological sequence memory mechanism. In contrast to the proposed system, it enables simultaneous learning and recall. The system proposed here could not utilize this feature now because when a node identifies a new sequence, the dimension of its output vector increases and retraining of the nodes in the layers above is necessary. Neither the proposed sys- tem nor [6] provide means of storing duration of se- quence elements. The proposed system can be considered an HTM with a sequence memory. The products published by Numenta and the proposed system use different algorithms for cluster analysis and Temporal Data Mining, but the main difference is that the proposed system is specialized for a real time computer vision application on a mobile robot. This required implementation of a mechanism for mini- mizing the influence of the robot’s changing velocity on storing and recalling the frequent episodes and usage of relatively fast methods. In contrast to HTM, the system does not supervise learning on the top level. The hu- man-machine interface for labeling the categories of the identified objects is used instead. In other words, the proposed system is allowed to isolate objects on its own. The supervision has a form of communication (although primitive at the moment) instead of typical supervised learning, when observations must be assigned to prede- fined categories. 6. Experiments 6.1. Experimental Setup The experiments were performed on image data recorded with two identical parallel cameras installed on a mount. Optical axes of the lenses were parallel to each other (50mm apart) and to the floor (100mm above). The im- age data was recorded in a portion of office space limited by furniture (closet, drawer boxes etc.) and walls. There were three small objects placed in the area: a toy dog, a toy turtle and a toy car. The mount was moved around the obstacles in changing paths at the average speed of approx 50 mm/s. The images were taken at 2 fps in 320x240 pixels res- olution. These values were chosen so that the control computer can process the incoming images in real time with Gabor filtering being used (Gabor filtering is rela- tively time consuming). The recorded data set contained a total of 1140 time-ordered images from each camera. The first 60% of the image sequence was used for train- ing and the remaining 40% for validation. The experiments were performed on monoscopic and on stereoscopic data. Table 2 summarizes the system preset values used in the experiments. In both the mono- scopic and stereoscopic experiments, the system con- sisted of two layers of nodes. Grayscale image data was used. The use of Gabor filtering did not significantly influence the results of the system and was not used in order to save computational resources. Gabor filtering significantly reduces the amount of information in the image, potentially increasing the generalization per- formance and the robustness of the system. However, in this experiment the same relatively simple objects were used for both training and validation. Since only trajec- tories of the movement and views of the objects varied, this reduction was deemed unnecessary. Euclidean dis- tance was used to measure the similarity of the spatial Table 2. Experimental settings. Layer Index/Experiment Type L 0/ Mono L 1/ Mono L 0/ Stereo L 1/ Stereo No. of Nodes 150 1 150 1 RF dim. 32x16 150x29* 64x16 150x26* No. of Spatial Groups 80 20 100 20 BS Length 3 2 3 2 Tout (Time steps) 10 10 10 10 Freq. Ep. Min. Length 3 1 3 1 * Value obtained after training layer 0. Copyright © 2010 SciRes JILSA Object Identification in Dynamic Images Based on the Memory-Prediction Theory of Brain Function 218 patterns, and so changes in the cameras’ exposure setting (over or under exposure) or a significant change in light- ing would probably influence matching of the spatial patterns in the bottom layer of the hierarchy negatively. Using Gabor filtering would greatly reduce this negative influence, but the image data was taken in a visually sta- ble environment, a small area with dispersed light. Also, there are simpler methods of exposure correction during preprocessing, such as normalization of the input pat- terns to unit length. The relatively large receptive field of the nodes in layer 0 ensured that the objects were sensed by a node for several consecutive time steps and meaningful fre- quent sequences could be identified. Using 2 fps frame rate at the given velocity caused rapid relative movement of the objects between consecutive images. As most of the movement is along the horizontal axis, the width of the receptive fields was larger than the height, so as to capture this kind of movement. 6.2. Experimental Results Assigning labels to the elements of the output vector of the top node depends, to a certain extent, on the subjec- tivity of the operator whose task it is to find a common element in sets of pictures. For example, the majority of pictures in a set show either the toy car or a corner of the room or the toy car in a corner. The operator decides whether to label the corresponding output as “the toy car” or “a corner” or “the toy car in a corner”. Therefore, in this case, the emphasis was on the consistency of the results from the training and on the testing set. This means that when the operator assigns a name to certain outputs, these outputs will become active for the same objects during the training and on the testing set. The use of supervised learning is possible, but it would force the system to find objects of predefined categories. However the interest here was to study what objects would attract the attention of the system. Table 3 summarizes how many frequent episodes were found and lengths of the episodes. Figure 3 shows examples of the frequent episodes as sequences of cluster centroids. It is obvious that what matters are the changes in luminance rather than changes of texture. Table 4 summarizes the experimental results. Varia- tion of the information on the output of layer 1 was lower, which confirmed expectations based on the mem- ory-prediction theory. In the monoscopic experiment, layer 0 transited 86 times between the identified spatial- groups (between states) on the 640 images of the training set. Typically, layer 0 persisted in certain states for sev- eral time steps before it changed. This is consistent with the fact that objects remain in the field of view for some time before the field of view is dominated by a different object. 8 names were assigned altogether. Groups con- taining an uncharacteristic mixture of objects were Table 3. Counts of frequent episodes identified in the ex- periments Layer Index/ Experiment Type L 0/ Mono L 1/ Mono L 0/ Stereo L 1/ Stereo Ep. Length = 1 - 20 - 20 Ep. Length = 2 - 3 - 3 Ep. Length = 3 7 1 12 0 Ep. Length = 4 12 0 8 0 Ep. Length = 5 10 0 6 0 Total 29 24 26 23 Table 4. Experimental results. Monoscopic Experiment Object Name Count Train Count Test Corr. Train Corr. Test O.* empty carpet 112 59 85% 81% 8 drawer box 38 15 79% 80% 1 car 71 42 89% 90% 4 white table 45 28 88% 89% 2 turtle 28 18 82% 83% 1 dark table 88 57 90% 79% 1 el. socket 49 21 71% 81% 1 sliding door 55 34 73% 79% 1 unidentified 198 182 - - 5 Stereoscopic Experiment Object Name Count Train Count Test Corr. Train Corr. Test O.* empty carpet 105 52 74% 77% 7 drawer box 33 17 76% 76% 1 car 50 35 86% 80% 3 white table 45 28 87% 82% 3 turtle 22 15 68% 80% 2 dark table 88 57 83% 75% 1 el. socket 47 19 74% 63% 1 sliding door 55 28 78% 71% 2 unidentified 239 205 - - 3 * number of outputs responding to the object. Figure 3. Frequent episodes 14-19 of cluster centroids found in single camera experiment on lay er 0. Copyright © 2010 SciRes JILSA Object Identification in Dynamic Images Based on the Memory-Prediction Theory of Brain Function 219 Figure 4. Typical examples of the objects identified in the single camera experiment. 1. “empty carpet”, 2. “drawer box”, 3. “car”, 4. “white table”, 5. “turtle”, 6. “dark table”, 7. “corner with electric socket”, 8. “closet with sliding door”. designated as “unidentified”. Correctness of the output on the testing set had to be established by manually counting the images of the group containing the given object. Automatic evaluation was not possible because it was not know which objects will the system identify. Figure 4 shows typical examples of the objects found. In the stereoscopic experiment, the appearance of the centroids identified at layer 0 and the overall behavior was similar to that of the monoscopic experiment. Therefore, the operator searched for the same objects so that comparison would be possible. The experimental results indicate that adding the information from the second camera increased confusion rather than resolution. It can be attributed to the slightly different setting of the algorithms or to problems related to the alignment of the cameras. In this setup, the cameras must be precisely aligned so that the same object (shifted) will appear in both halves of the aggregated receptive field of a node. The confusion increased slightly more for objects near the cameras. If an object is near the cameras, the node may capture it only in one half of its receptive field, loosing the depth information. The experimental results indicate a strong consistency between the results during the training and on the testing sets. If there was an object identified during training, it is very likely that the same elements of the output vector will respond to the object on the testing set as well. However, still a large portion of the sets are not identi- fied, despite the presence of identified objects in the im- ages. 7. Problems, Discussion and Future Work One of the main problems of the system is a large num- ber of user set parameters significantly influencing the functioning of the system. A human operator is needed to name the objects therefore trial and error approach is time consuming. Algorithms for automatic optimization of these parameters are to be developed. Like HTM, the system can be modified for supervised learning. A supervised learning setup can provide more data to evaluate the behavior of the system and to de- velop methods for optimization of the user set parame- ters. The algorithm matching BS with the frequent episodes E searches for absolute match. Discretization of the input space provides the only mean for generalization now. Algorithm searching for approximate matches should be employed. Also, predictions and feedback are to be im- plemented. The brain presumably includes information on the or- ganism’s own movements when making predictions about future inputs. We propose Tout to be inversely pro- portional to the robots’ velocity in the future, to avoid the system turning inactive if the velocity decreases or the robot stops. Incremental learning is an important feature of the system to be developed. Adding new fre- quent episodes in layer L increases dimensionality of the input matrix of the consecutive layer. Therefore the lay- ers above L must be retrained, which requires storing the corresponding training set. The problem considers mainly higher levels of the hierarchy recognizing more complex objects. If the lower levels of the hierarchy are sufficiently trained the basic object comprising the world has been correctly identified. More complex objects are usually comprised from the same basic objects. In addition to learning new sequences also forgetting should be considered (also discussed in [4]). An autonomous robot will presumably explore different en- vironments and collect a large amount of knowledge. Matching a current sequence to all of the stored frequent Copyright © 2010 SciRes JILSA Object Identification in Dynamic Images Based on the Memory-Prediction Theory of Brain Function Copyright © 2010 SciRes JILSA 220 episodes would become more and more demanding over time. We propose that frequently used sequences would be kept active and less frequently used sequences would be transferred into storage or completely forgotten. In the case that a sequence is not recognized the episodes in storage should be checked first for a possible match and retrieved if necessary. In this way, the robot could keep active only those learned sequences which are relevant for the current environment. There is no mechanism that enables the system to identify the position of a recognized object in the image at this point. The system usually fails if there are multi- ple objects in the scene. We plan to develop a mecha- nism for covert attention, presumably using feedback, to identify the location of an object, masking it and search- ing for other objects in the scene. The performance of the system deteriorated slightly in the stereoscopic experiment. The causes of the problems mentioned in the stereoscopic experiment may be elimi- nated by using a hierarchy with more layers and inte- grating the information from separate cameras at higher levels of the hierarchy, not in layer 0 directly. Strain on the human operator is high as are the time demands. More sophisticated means of human-system interaction should be developed, possibly using a mechanism for the grouping of similar images and pre- senting only model examples to the operator and/or high- lighting the objects in the scene that have triggered the response. 8. Conclusions A system for unsupervised identification of objects in image data recorded by moving cameras in real time using a novel combination of algorithms was presented here. The system has some features described in mem- ory-prediction theory of brain function. A solution was proposed for the elimination of uncertainty linked with the variable speed of the robot. The system represents an early attempt to make a ma- chine that learns largely on its own and only needs occa- sional advice from the human operator. It is possible that this approach may be more suitable for creating intelli- gent multipurpose systems than trying to heavily super- vise the learning process. 9. Acknowledgements This work is supported by Japan Society for the Promo- tion of Science and Waseda University, Tokyo. REFERENCES [1] J. Hawkins and S. Blakeslee, “On Intelligence: How a New Understanding of the Brain Will Lead to the Crea- tion of Truly Intelligent Machines,” Times Books, 2004, ISBN 0-8050-7456-2 [2] D. George and J. Hawkins, “A Hierarchical Bayesian Model of Invariant Pattern Recognition in the Visual Cortex,” Proceedings of 2005 IEEE International Joint Conference on Neural Networks, Vol. 3, 2005, pp. 1812- 1817. [3] J. Hawkins and D. George, “Hierarchical Temporal Memory-Concepts,” Theory and Terminology, White Paper, Numenta Inc, USA, 2006. [4] S. J. Garalevicius, “Memory-Prediction Framework for Pattern Recognition: Performance and Suitability of the Bayesian Model of Visual Cortex,” FLAIRS Conference, Florida, 2007, pp. 92-97. [5] S. Laxman and P. S. Sastry, “A Survey of Temporal Data Mining, Sadhana,” Academy Proceedings in Engineering Sciences, Vol. 31, No. 2, 2006, pp. 173-198. [6] J. Hawkins, D. George and J. iemasik, “Sequence Mem- ory for Prediction, Inference and Behaviour,” Philoso- phical Transactions of the Royal Society B, Vol. 364, No. 1521, 2009, pp. 1203-1209. [7] H. Mannila, H. Toivonen and A. I. Verkamo, “Discovery of Frequent Episodes in Event Sequences,” Data Mining Knowledge Discovery, Vol. 1, No. 1, 1997, pp. 259-289. [8] R. Xu and D. Wunsch II, “Survey of Clustering Algo- rithms,” IEEE Transactions on Neural Networks, Vol. 16, No. 3, 2005, pp. 645-678. [9] J. B. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations,” Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, USA, 1967, pp. 281-297. [10] D. Patnaik, P. S. Sastry and K. P. Unnikrishnan, “Infer- ring Neuronal Network Connectivity from Spike Data: A Temporal Data Mining Approach,” Scientific Program- ming, Vol. 16, No. 1, 2008, pp. 49-77. [11] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proceedings of 11th International Conference on Data Engineering, Taipei, 1995, pp. 3-14. [12] R. Agrawal, T. Imielinski and A. Swami, “A Mining Association Rules Between Sets of Items in Large Data- bases,” Proceedings of ACM SIGMOD Conference on Management of Data, USA, 1993, pp. 207-216. [13] J. Han, H. Cheng, D. Xin and X. Yan, Frequent Pattern Mining: Current Status and Future Directions,” Data Mining and Knowledge Discovery, Vol. 14, No. 1. 2007, pp. 55-86. [14] S. Wu and U. Manber, “Fast Text Searching Allowing Errors,” Communications of the ACM, Vol. 35, No. 10, 1992, pp. 83-91. |