J. Intelligent Learning Systems & Applications, 2010, 2, 117-125
doi:10.4236/jilsa.2010.23015 Published Online August 2010 (http://www.SciRP.org/journal/jilsa)

A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts

Todsanai Chumwatana, Kok Wai Wong, Hong Xie
School of Information Technology, Murdoch University, South St, Murdoch, Australia.
Email: {T.Chumwatana, K.Wong, H.Xie}@Murdoch.edu.au

Received March 25th, 2010; revised July 15th, 2010; accepted July 30th, 2010.

ABSTRACT

This paper proposes a non-segmented document clustering method using a self-organizing map (SOM) and the frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and is successful in many applications. However, when applied to non-segmented documents, the challenge is to identify interesting patterns efficiently. There are two main phases in the proposed method: a preprocessing phase and a clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to discover the patterns of interest, called frequent max substrings, which are long and frequent substrings rather than individual words, from the non-segmented texts. These discovered patterns are then used as indexing terms. The indexing terms, together with their numbers of occurrences, form a document vector. In the clustering phase, SOM is used to generate the document cluster map from the feature vectors of frequent max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering Thai text documents, which consist of non-segmented texts, are presented in this paper. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find relevant documents more efficiently.

Keywords: Frequent Max Substring, Self-Organizing Map, Document Clustering

1. Introduction

Document clustering has been an important issue [1] due to the rapid growth in the number of electronic documents. Document clustering, which can sometimes be generalized as text clustering, identifies the similarity of documents and summarizes a large number of documents using key attributes of the clusters. Document clustering uses unsupervised learning techniques and may assist fast information retrieval or filtering [2]. This is because clustering categorizes documents into groups based on the similarity of their term occurrences. Thus clustering can be used to categorize document databases and digital libraries, as well as to provide useful summary information about the categories for browsing purposes. In information retrieval, a typical search on a document database or the World Wide Web can return several thousand documents in response to a user’s query. It is often very difficult for users to identify the documents of interest among such a huge number of documents. Clustering the documents gives the user a clear and easy grasp of the relevant documents, by collecting together documents which are similar to each other and could be relevant to the user’s query.
For text clustering in information retrieval, a document is normally treated as a bag of words, even though a document actually consists of a sequence of sentences and each sentence is composed of a sequence of words. Very often the positions of words are ignored when performing document clustering. Words, also known as indexing terms, and their weights in documents are usually used as the important parameters to compute the similarity of documents [3]. Documents that contain similar indexing terms with similar frequencies will be grouped under the same cluster. This process is straightforward for European languages, where words are clearly delimited by word delimiters such as spaces or other special symbols. European texts are explicitly segmented into word tokens that are used as indexing terms. Many algorithms have been developed to calculate the similarity of documents and to build clusters for fast information retrieval [4]. In contrast, document clustering can be a challenging task for many Asian languages such as Chinese, Japanese, Korean and Thai, because these are non-segmented languages, i.e., a sentence is written continuously as a sequence of characters without explicit word boundary delimiters. Due to this characteristic, the text of a non-segmented document cannot be used directly to calculate similarity. Some preprocessing needs to be performed first to discover keywords in Asian documents before clustering. As a result, most approaches for clustering non-segmented documents consist of two phases: a text mining process to extract the keywords, and a document clustering process to compute the similarity between the input documents.

2. Keyword Extraction

Keywords are usually regarded as an important key to identifying the main content of documents. Most of the semantics are usually carried by nouns, although a sentence in a natural language text is composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives. Keyword extraction is one of the main applications of text mining. The objective of text mining is to exploit useful information or knowledge contained in textual documents [5]. Information Extraction (IE) is an essential task in text mining that describes the process of discovering interesting keywords in unstructured natural-language texts. Most keyword extraction methods proposed in the literature are accomplished by constructing a set of words from given texts. Keywords are then selected from this set of words during the preprocessing step. Many approaches have been proposed to extract keywords from non-segmented documents such as Chinese [6], Japanese [7] or Thai documents [8]. Most techniques are based on word segmentation, which is one of the most widely used information extraction techniques in Natural Language Processing (NLP). However, most word segmentation approaches involve complex language analysis and require long computational time. After keyword extraction is performed, the keywords are transformed into a feature vector of the words that appear in the documents. The term weights (usually term frequencies) of the words are also contained in each feature vector. The vector space model (VSM) has been the standard model for representing documents by the set of their words with frequencies [1]. In the VSM, each document is replaced by a vector over the words. The vector size depends on the number of keywords that appear in the documents. For instance, let $w_{ik}$ be the weight of keyword $k$ appearing in document $i$; then $D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$ is the feature vector for document $i$, where $t$ is the number of unique words over all documents. The size of each feature vector is therefore $t$, as shown in Figure 1.

Figure 1. An example of document vectors in three dimensions

From Figure 1, the similarity between two documents can be computed with one of several similarity measures based on the two corresponding feature vectors, e.g., the cosine measure, the Jaccard measure, and the Euclidean distance measure [9].
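As a concrete illustration of this representation, the following Python sketch (ours, not from the paper) builds term-frequency vectors over a shared vocabulary and compares two documents with the cosine measure. The token lists and vocabulary are hypothetical stand-ins for already-segmented text.

```python
import math
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Map a token list to a term-frequency vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical pre-segmented documents (English stand-ins for segmented text).
d1 = ["travel", "beach", "hotel", "beach"]
d2 = ["travel", "hotel", "island"]
vocab = sorted(set(d1) | set(d2))      # the t unique terms across all documents

D1, D2 = tf_vector(d1, vocab), tf_vector(d2, vocab)
print(cosine_similarity(D1, D2))       # value in [0, 1]; higher means more similar
```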
3. Document Clustering Algorithms

In document clustering there are two main approaches: hierarchical and partitional [10,11,4]. The hierarchical approach produces document clusters by using a nested sequence of partitions that can be represented in the form of a tree structure called a dendrogram. The root of the tree contains one cluster covering all data points, and singleton clusters of individual data points are shown on the leaves of the tree. There are two basic methods for performing hierarchical clustering: agglomerative (bottom up) and divisive (top down) clustering [4]. The advantages of the hierarchical approach are that it can take any form of similarity function, and the hierarchy of clusters allows users to discover clusters at any level of detail. However, this technique may suffer from the chain effect, and its space requirement is at least quadratic, i.e., $O(n^2)$, compared to the k-means algorithm, which requires $O(Iknm)$, where $I$ is the number of necessary iterations, $k$ is the number of clusters, $n$ is the number of documents and $m$ is the dimensionality of the vectors. The partitional approach [12], on the other hand, can be divided into several techniques, e.g., the k-means [13], fuzzy c-means [14] and QT (quality threshold) [15] algorithms. The k-means algorithm is the most widely used among all clustering algorithms because of its efficiency and simplicity. The basic idea of the k-means algorithm is that it separates the given data into $k$ clusters, where each cluster has a center point, also called the centroid, which can be used to represent the cluster. The main advantages of the k-means algorithm are its efficiency and simplicity. Its weaknesses are that it is only applicable to data sets where the notion of the mean is defined, that the number of clusters must be specified by the user, and that it is sensitive to data points that are very far away from the other points, called outliers [1]. Furthermore, the self-organizing map (SOM) [16,17], a member of the family of artificial neural networks, can be used as a clustering algorithm. The self-organizing map is an unsupervised neural network capable of ordering high-dimensional data in such a way that similar inputs are grouped spatially close to each other. To use SOM in document clustering, text documents are described by features with high dimensionality, and SOM-based techniques have been successfully applied to document clustering. Some of the successful applications of SOM in document clustering are described in the next section.
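To make the quoted $O(Iknm)$ cost concrete, here is a minimal k-means sketch; the random initialization and fixed iteration cap are textbook choices of ours, not details taken from the paper.

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Minimal k-means: O(I*k*n*m) for I iterations, k clusters,
    and n vectors of dimensionality m (cf. the complexity quoted above)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)            # initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:                         # assign each vector to its nearest centroid
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            clusters[j].append(v)
        for j, members in enumerate(clusters):    # recompute centroids as cluster means
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, clusters
```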
4. Related Works

Many clustering techniques have been developed and can be applied to clustering English documents. Most of these traditional approaches use documents as the basis for clustering [18,19]. The Vector Space Document (VSD) model is a very widely used data representation model for document clustering [20]. This data model represents each document as a feature vector of the words that appear in the documents. The term weights of the words are also contained in the feature vectors. Similarity measures are used to compute the similarity of two document vectors. An alternative approach to document clustering is phrase-based document clustering. Zamir and Etzioni [21] introduced the notion of phrase-based document clustering. They proposed to use a generalized suffix tree to obtain information about the phrases shared between two documents and to use common phrases to cluster the documents. In [22], Bakus, Hussin and Kamel used a hierarchical phrase grammar extraction procedure to identify phrases in documents and used these phrases as features for document clustering, with the self-organizing map (SOM) as the clustering algorithm. An improvement in clustering performance was demonstrated when using phrases rather than single words as features. Mladenic and Grobelnik used a Naive Bayesian method to classify documents based on word sequences of different lengths [23]. Their experimental results show that using word sequences whose length is no more than 3 words can improve the performance of a text classification system, but when the average length of the word sequences used is longer than 3 words, there is no difference between using word sequences and single words. However, there are not many research works on phrase-based document clustering for Asian languages, primarily because most Asian language texts are non-segmented and it is difficult to separate words and phrases from the non-segmented texts. Most document clustering approaches require a preprocessing stage where word segmentation, stopword removal or semantic analysis is performed. NLP techniques provide good support for this step. Word segmentation is a very important step involved in most NLP processing tasks: a text is separated into a sequence of tokens by using word segmentation techniques. Many approaches have been proposed for Asian languages such as Chinese, Japanese, Korean and Thai. In [24], a Chinese document clustering method using a data mining technique and a neural network model was proposed. The technique is divided into two main parts: a preprocessing part, which provides a Chinese sentence segmentation method, and a clustering part, which adopts a dynamical SOM model with a view to dynamically clustering data. In addition, this method clusters term vectors instead of document vectors. For the Thai language, Kruengkrai and Jaruskulchai proposed a parallel algorithm for clustering text documents based on spherical k-means [25]. They implemented the algorithm on the PIRUN Linux Cluster, a parallel computer using cluster computing technology. Experimental results show that the use of the parallel algorithm can significantly improve clustering performance.
5. A SOM-Based Clustering Using Frequent Max Substrings for Non-Segmented Texts

In this section, we describe a new method that combines Kohonen’s SOM and the frequent max substring technique to organize non-segmented text documents into clusters. SOM is one of the main unsupervised learning methods in the family of artificial neural networks (ANN), first developed by Teuvo Kohonen in 1984 [26]. The SOM can be visualized as a regular two-dimensional array of cells or nodes (neurons). The SOM algorithm defines a mapping from the input vector onto this two-dimensional array of nodes. When an input vector $x(t) \in \mathbb{R}^n$ is given, it is connected to all neurons in the SOM array; each neuron is associated with a model vector $m_i(t) \in \mathbb{R}^n$ that is gradually modified in the learning process. The input vector $x(t) \in \mathbb{R}^n$ is the input data, where $t$ indexes the indexing terms of the input documents. These input data have to be mapped onto all the neurons in the map, which form a two-dimensional network of cells with model vectors $m_i(t) \in \mathbb{R}^n$. In the mapping, the node whose vector $m_i$ is most similar to the input vector $x$ is activated. This node is often called the best-matching node, or the winner. The winner and a number of its neighboring nodes in the SOM array are then turned towards the input vector $x$ according to the learning principle.

The frequent max substring technique is an information extraction technique used to identify terms, called frequent max substrings, in non-segmented texts, where word boundaries are not clearly defined. The technique was first introduced in 2008 [27] and has been proposed as an alternative language-independent keyword extraction technique for non-segmented texts [28,29]. It has also been applied to indexing for non-segmented languages [8,30], as it provides significant benefits in terms of storage space efficiency, which can support the rapid growth in the number of electronic non-segmented documents. The frequent max substrings are the substrings that appear frequently (at a given frequency threshold value $f$) and have maximum length in the given string, so these terms are likely to be the patterns of interest. We extract the set of frequent max substrings, or FMAX, by using the frequent max substring technique. The technique uses the Frequent Suffix Trie (FST) data structure to explore the indexing terms [27]. The FST data structure is similar to the suffix trie, which is an efficient substring enumeration structure; however, the FST enumerates substrings together with their frequency and position information, while the suffix trie enumerates only the substrings themselves. We therefore employ the FST data structure to support the extraction of the frequent max substrings. In this technique, a predetermined frequency threshold is applied to reduce the number of indexing terms. The method uses two reduction rules: 1) a reduction rule using the predetermined frequency to check for extraction termination, and 2) a reduction rule using the superstring definition to reduce the number of indexing terms extracted. The technique also uses a heap data structure to support the computation [27].
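The FST and heap structures above are what make the extraction efficient; the brute-force sketch below illustrates only the definition, under our reading of the superstring reduction rule, namely that a frequent substring is dropped when a longer frequent substring with the same frequency contains it. The threshold f and minimum length are illustrative parameters, not values from the paper.

```python
from collections import defaultdict

def frequent_max_substrings(text, f=2, min_len=2):
    """Brute-force sketch: all substrings with frequency >= f that are not
    covered by a longer substring having the same frequency."""
    freq = defaultdict(int)
    n = len(text)
    for i in range(n):                        # enumerate substrings (O(n^2) sketch;
        for j in range(i + min_len, n + 1):   # the FST and heap make this tractable)
            freq[text[i:j]] += 1
    frequent = {s: c for s, c in freq.items() if c >= f}
    # Superstring reduction: drop s if a longer frequent superstring occurs as often.
    return {s: c for s, c in frequent.items()
            if not any(s in t and c == frequent[t]
                       for t in frequent if len(t) > len(s))}

# A toy unsegmented string; the repeated run "abcabc" surfaces as a term.
print(frequent_max_substrings("xabcabcyabcabcz", f=2))
# -> {'abc': 4, 'abcabc': 2}
```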
In this paper, we use a set of non-segmented documents (Thai documents) as the input to train a map using SOM. The clustering process is as follows. Let $D$ be a document collection consisting of $n$ documents $d_1, d_2, \ldots, d_n$. First, we use the frequent max substring technique [27] to generate a set of frequent max substrings at the given frequency threshold value $f$ from the document collection, to be used as the set of indexing terms for the collection. Assuming the above process produces $m$ frequent max substrings, denoted $\mathrm{FMAX} = (fm_1, fm_2, \ldots, fm_m)$, where $fm_i$ is the $i$th frequent max substring generated from the document collection, these $m$ substrings are used as our indexing terms. We then calculate the weight $w_{ij}$, which represents the frequency of indexing term $fm_i$ occurring in document $d_j$, for each indexing term and each document. Finally, we construct an $m \times n$ matrix of such weights. In this matrix, row $i$ represents the frequencies of occurrence of the $i$th indexing term $fm_i$ in the $n$ documents, while the $j$th column represents the document vector for document $j$, as depicted in Figure 2, which shows an example of a document matrix. In the document matrix, each element $w_{ij}$ is at least $f$ if $fm_i$ occurs in document $d_j$, and 0 if $fm_i$ does not appear in $d_j$, i.e.,

$$w_{ij} = \begin{cases} f_{ij} & \text{if } fm_i \text{ occurs in } d_j \\ 0 & \text{otherwise} \end{cases}$$

where $f_{ij}$ is the frequency of $fm_i$ in $d_j$.

Figure 2. An example of the document matrix at the given frequency threshold value f = 2
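The construction of this matrix can be sketched as follows. Whether the threshold applies per document or across the whole collection is not fully specified in the text, so this sketch applies it per document, making every nonzero entry at least f as in the description above; the function and parameter names are ours.

```python
def document_matrix(fmax_terms, documents, f=2):
    """Build the m x n matrix of weights w_ij: the frequency of frequent max
    substring fm_i in document d_j, or 0 when fm_i does not occur in d_j."""
    matrix = []
    for term in fmax_terms:                  # one row per indexing term fm_i
        row = []
        for doc in documents:                # one column per document d_j
            count = doc.count(term)          # non-overlapping occurrence count
            row.append(count if count >= f else 0)
        matrix.append(row)
    return matrix
```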
After the document matrix is obtained, the document vectors are presented to SOM for clustering. The documents can be labeled onto neurons according to the similarity of their document vectors. Two documents containing the same or similar document vectors will map to the same neuron. In contrast, documents may map to distant neurons if they contain different or non-overlapping frequent max substrings. Furthermore, documents with similar frequent max substrings may map to neighbouring neurons. This means that the neurons form the document clusters, which can be read off by examining the mapped neurons in the document cluster map. Figure 3 depicts the organization of a document map that clusters similar documents into the same neuron, shown as boxes; the fm terms in the boxes represent the content of the documents in the collection.

Figure 3. An example of the document cluster map

After the SOM has been trained, the document clusters are formed by labeling each neuron that contains documents of a similar type. The documents in the same neuron may not contain exactly the same set of frequent max substrings (FMAX), but they usually contain mostly overlapping frequent max substrings. As a result, the document cluster map can be used as a prediction model to generate the different groups of similar documents, and each group can then be assigned a document type by comparing the frequent max substrings of the group with the keywords of each subject area. Figure 4 depicts clustering the documents into different groups by mapping an input vector onto the neurons in the document cluster map to find the document groups of the several types.

Figure 4. Neural network architecture

From Figure 4, the following describes the process of matching an input vector $x$ with the neurons in the document cluster map using SOM. Consider the input vector $x = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^n$ as the input data, whose components correspond to the frequent max substrings of the input documents. This input has to be matched against all neurons in the map, which form a two-dimensional network of cells with model vectors $m_i = [m_{i1}, m_{i2}, \ldots, m_{in}]^T \in \mathbb{R}^n$, as depicted in Figure 5.

Figure 5. Self-organizing map

Each neuron $i$ in the network contains a model vector $m_i$, which has the same number of indexing terms as the input vector $x$. From Figure 5, the input vector $x$ is compared with all model vectors $m_i$ to find the best-matching node, called the winner. The winner is the neuron on the map whose set of frequent max substrings is the same as, or most similar to, that of the input vector $x$ according to some matching criterion, e.g., the Euclidean distance between $x$ and $m_i$. As a result, this method can be used to cluster documents into different groups, which also suggests that it can be used to reduce the search time for relevant documents.
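The winner selection just described, together with the standard Kohonen step that turns the winner and its neighbours towards the input, can be sketched as follows. The learning rate and Gaussian neighbourhood are textbook defaults of ours, not values reported in the paper.

```python
import math

def winner(x, nodes):
    """Index of the best-matching node: smallest Euclidean distance to x."""
    return min(range(len(nodes)), key=lambda i: math.dist(x, nodes[i]))

def som_step(x, nodes, positions, alpha=0.1, radius=1.0):
    """One Kohonen update: move the winner and its grid neighbours toward x.
    alpha (learning rate) and the Gaussian neighbourhood are textbook defaults."""
    c = winner(x, nodes)
    for i, m in enumerate(nodes):
        d2 = sum((p - q) ** 2 for p, q in zip(positions[i], positions[c]))
        h = math.exp(-d2 / (2 * radius ** 2))    # neighbourhood strength h_ci
        nodes[i] = [mj + alpha * h * (xj - mj) for mj, xj in zip(m, x)]
    return c

# A 3x3 map (9 neurons, as in the experiment below) laid out on a 2-D grid,
# with one 35-dimensional model vector per neuron (illustrative zero init).
positions = [(r, c) for r in range(3) for c in range(3)]
nodes = [[0.0] * 35 for _ in positions]
```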
6. Experimental Studies and Comparison Results

In this section, we describe an experiment on clustering non-segmented documents (Thai documents) based on the proposed combination of SOM and the frequent max substring technique. We also compare the proposed technique with SOM-based document clustering using single words as features, and with hierarchical clustering using single words, on the same group of documents. 50 Thai documents were used as the input dataset to train a map. All the Thai documents are from Thai news websites and cover different categories of content: sport, travel, education and political news. The documents have varying lengths. The collection contains 103,287 characters, and the average document length is 2,065 characters, or 78 words per document. The basic statistics of the text collection are shown in Table 1.

Table 1. Basic statistics for the Thai text collection

Category         No. of Docs  No. of Chars  No. of Words  Avg. Chars/Doc  Avg. Words/Doc
Sport news       15           24727         997           1648.46         66.46
Travel news      15           29096         1022          2078.28         73
Political news   15           38017         1398          2534.46         93.2
Education news   5            9445          336           1889            67.2

In our proposed technique, the set of frequent max substrings was first generated by the frequent max substring technique from the document dataset, at the given frequency threshold value of 2, and 35 frequent max substrings, the long and frequently occurring terms in the sport, travel, political and education documents, were used as the set of indexing terms for this document collection. The 50 input documents were then transformed into a document matrix of weighted frequent max substring occurrences. Hence, the 35 indexing terms and 50 input documents form a 50 × 35 matrix, where each document vector is represented by a column of the matrix and the rows of the matrix correspond to the indexing terms. We used this 50 × 35 matrix to train a map using SOM, with the number of neurons set to 9 in the SOM program, as shown in Figure 6. In the experimental study, 9 was chosen as the number of neurons because several numbers of neurons were investigated, and 9 neurons provided the best result among them.

Figure 6. SOM containing 9 neurons and the groups of similar documents from the 50-document Thai collection

From the experimental studies, the groups of political, sport and education documents gave good results, as similar documents of each type were mapped onto the same neuron. It can be observed that some errors occurred within the group of travel documents: the travel documents were mapped onto several neurons due to overlapping terms that appear across different types of documents. Figure 6 shows the map containing the 9 neurons and the 50 Thai documents; each neuron contains a group of similar documents. From Figure 6, the experimental result shows that SOM clusters the 50 documents into 5 neurons on the map, with similar documents grouped into the same neuron, as shown in Table 2.

Table 2. Clustering results of using SOM and the frequent max substring technique

Neuron ID  Row  Column  Document ID
Neuron 5   1    2       Political 1, Education 1, Education 2, Education 3, Education 4, Sport 1, Sport 2, Sport 3, Sport 4, Sport 5, Sport 6, Sport 7, Sport 8, Sport 9, Sport 10, Sport 11, Sport 12, Sport 13, Sport 14, Sport 15, Travel 10
Neuron 2   2    1       Political 2, Political 3, Political 4, Political 5, Political 6, Political 7, Political 8, Political 9, Political 10, Political 11, Political 12, Political 13, Political 14, Political 15
Neuron 4   2    2       Travel 2, Travel 4, Travel 5, Travel 6, Travel 7, Travel 8, Travel 9, Travel 13, Travel 15
Neuron 1   3    1       Travel 12
Neuron 3   3    2       Education 5, Travel 1, Travel 3, Travel 11, Travel 14

As observed from the results, this technique can be used to cluster non-segmented documents into several groups according to their similarity. The accuracy of this technique is up to 83.25%. However, in this experiment we found that the groups of education and sport documents are mapped onto the same neuron (Neuron 5), because they contain mostly overlapping frequent max substrings such as ผลการแขงขัน (competition result), การจัดอันดับ (position ranking), ไดรับรางวัล (getting award), etc. Furthermore, the contents of the documents and the generated indexing terms are also the main factors that affect the accuracy. The content of one document may contain overlapping terms from two different types of documents. For instance, the Education 5, Travel 1, Travel 3, Travel 11 and Travel 14 documents are mapped onto Neuron 3 because they present information on ecotourism and therefore contain overlapping terms from both education and travel documents.

The methodology used to measure the clustering approaches is to compare the groups of document occurrences. In this paper, we compare our technique with SOM-based document clustering using single words as features [31] and with hierarchical document clustering using single words [4,12]. For the SOM-based document clustering using single words, single words were first extracted by Thai word segmentation from the same document dataset, and 35 single words, the most frequently occurring keywords in the education, sport, travel and political documents, together with the 50 input documents, were used to form a 50 × 35 matrix, the same matrix size as in our earlier experiment, to train a map using SOM. From the experimental studies, it can be observed that only the groups of political and travel documents provided fairly good results, as shown in Table 3.

Table 3. Clustering results of SOM-based document clustering using single words

Neuron ID  Row  Column  Document ID
Neuron 2   1    1       Education 1, Sport 1, Sport 2, Sport 5, Sport 6, Political 1, Political 2, Political 3, Political 4, Political 6, Political 8, Political 9, Political 10, Political 11, Political 12
Neuron 5   1    2       Education 3, Education 5, Sport 7, Sport 8, Sport 9, Sport 11, Sport 13, Travel 7
Neuron 6   1    3       Sport 12, Sport 14
Neuron 1   2    1       Sport 15, Travel 15, Political 7
Neuron 4   2    2       Education 2, Sport 3, Sport 4, Travel 1, Travel 2, Travel 3, Travel 4, Travel 5, Travel 6, Travel 8, Travel 9, Travel 10, Travel 11, Travel 12, Travel 13, Travel 14, Political 13, Political 14, Political 15
Neuron 3   3    2       Education 4, Sport 10, Political 5

The group of similar travel documents was mapped onto one neuron, as was the group of political documents. Meanwhile, the education and sport documents were distributed over several neurons, because most single words extracted from education and sport documents are general terms that can be shared by many types of documents. To depict the experimental results, Figure 7 shows the map containing the 9 neurons and the 50 Thai documents, with the documents grouped into the neurons as shown in Table 3.

Figure 7. SOM containing 9 neurons and the groups of similar documents from the 50-document Thai collection, using single words as features

It can be observed that the accuracy of the SOM-based document clustering using single words is 72.21%. Moreover, the experimental results show that SOM-based document clustering using frequent max substrings provides a better result than SOM-based document clustering using single words, because the frequent max substrings describe the content of the documents more specifically than single words, as the frequent max substrings correspond to frequent, long terms rather than individual words. Many other studies have also shown an improvement in clustering performance when using phrases rather than single words as features [21,22].
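The paper does not state the exact formula behind these accuracy figures. One common choice for scoring clusters against known categories is purity, sketched below under that assumption; the cluster labeling here is hypothetical, loosely echoing Table 2’s layout rather than reproducing the full 50-document result.

```python
from collections import Counter

def purity(clusters):
    """Purity: each cluster votes for its majority category, and the score is
    the fraction of documents matching their cluster's majority label.
    `clusters` maps a cluster/neuron id to a list of category labels."""
    total = sum(len(labels) for labels in clusters.values())
    majority = sum(Counter(labels).most_common(1)[0][1]
                   for labels in clusters.values() if labels)
    return majority / total

# Hypothetical labeling for illustration only.
clusters = {
    "Neuron 5": ["sport"] * 15 + ["education"] * 4 + ["political", "travel"],
    "Neuron 2": ["political"] * 14,
}
print(f"{purity(clusters):.2%}")
```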
Additionally, we compare our technique with hierarchical document clustering using single words. Hierarchical document clustering is used because it has been widely used and applied successfully in many applications in the area of document clustering [12]. This method has also been used to perform Thai document clustering. To use the technique with the Thai language, single words were first extracted by Thai word segmentation from the same document dataset discussed earlier. After word segmentation was performed, the single words were transformed into feature vectors of the words that appear in the documents, with the term frequencies of the words contained in each feature vector. The feature vectors of the words were then used to compute the similarity of the documents with the hierarchical clustering approach. In the hierarchical clustering program, the feature vectors of the words with their frequencies were used as the input data, and the number of clusters was set to 9 after trial and error. From the experimental results, the groups of political, travel and education documents provided good results. However, the travel and education documents were grouped into the same cluster (Cluster 1); it can be observed that the travel and education documents share overlapping words that appear across the two types of documents. In addition, some of the travel, education and sport documents were distributed across several small clusters. As shown in Table 4, the hierarchical clustering program clusters the 50 documents into 9 clusters.

Table 4. Clustering results of using the hierarchical clustering approach

Cluster ID  Document ID
Cluster 1   Education 1, Education 2, Education 3, Education 4, Sport 9, Sport 13, Travel 1, Travel 2, Travel 3, Travel 4, Travel 5, Travel 7, Travel 8, Travel 9, Travel 10, Travel 11, Travel 12, Travel 13, Travel 14, Travel 15, Political 15
Cluster 2   Education 5
Cluster 3   Sport 1
Cluster 4   Sport 2, Sport 3, Sport 4, Sport 5, Sport 6, Sport 7, Sport 8, Sport 15
Cluster 5   Sport 10, Sport 11
Cluster 6   Sport 12
Cluster 7   Sport 14
Cluster 8   Travel 6
Cluster 9   Political 1, Political 2, Political 3, Political 4, Political 5, Political 6, Political 7, Political 8, Political 9, Political 10, Political 11, Political 12, Political 13, Political 14
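For reference, the hierarchical baseline just described can be reproduced in outline with SciPy’s agglomerative clustering. The linkage method and distance metric below are assumptions of ours, as the paper does not specify them; only the cluster count of 9 comes from the experiment above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_clusters(doc_vectors, n_clusters=9, method="average"):
    """Agglomerative clustering of word-frequency document vectors; returns a
    cluster id per document. The linkage method and cosine metric are assumed,
    while n_clusters=9 mirrors the trial-and-error setting described above."""
    X = np.asarray(doc_vectors, dtype=float)
    Z = linkage(X, method=method, metric="cosine")   # dendrogram construction
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```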
As can be observed from the results, the accuracy of the proposed method is up to 83.25%, while SOM-based document clustering using single words and the hierarchical clustering approach give accuracies of 72.21% and 79.75% respectively. The hierarchical clustering approach also created many small clusters containing only a few documents. As a result, an improvement was demonstrated when using frequent max substrings rather than single words as features. The proposed technique also does not require any preprocessing to extract the frequent max substrings, whereas SOM-based document clustering using single words and hierarchical document clustering using single words both require word segmentation to extract the single words.

7. Conclusions

This paper describes a non-segmented document clustering method using a self-organizing map (SOM) and the frequent max substring technique to improve the efficiency of information retrieval. We first use the frequent max substring technique to discover the patterns of interest, called frequent max substrings, rather than individual words, from Thai text documents; these frequent max substrings are then used as indexing terms, together with their numbers of occurrences, to form document vectors. SOM is then applied to generate the document cluster map from the document vectors. Experimental studies and comparison results on clustering 50 Thai text documents are presented in this paper. We compared the proposed technique with SOM-based document clustering using single words and with hierarchical document clustering using single words for grouping the documents. From the experimental results, our technique can cluster the 50 Thai documents into different clusters with higher accuracy than both SOM-based document clustering using single words and the hierarchical clustering approach. As a result, the document cluster map generated by our technique can be used to find the documents relevant to a user’s query more efficiently.

REFERENCES

[1] B. Liu, “Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data,” 1st Edition, Springer-Verlag, New York Berlin Heidelberg, 2007.

[2] D. R. Cutting, D. R. Karger, J. O. Pedersen and J. W. Tukey, “Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections,” Proceedings of ACM SIGIR ’92, Copenhagen, 1992, pp. 318-329.
[3] I. Matveeva, “Document Representation and Multilevel Measures of Document Similarity,” Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume: Doctoral Consortium, New York, 2006, pp. 235-238.

[4] M. Steinbach, G. Karypis and V. Kumar, “A Comparison of Document Clustering Techniques,” KDD Workshop on Text Mining, Boston, 2000.

[5] A.-H. Tan, “Text Mining: The State of the Art and the Challenges,” Proceedings of the PAKDD Workshop on Knowledge Discovery from Advanced Databases, Beijing, 1999, pp. 65-70.

[6] Q. L. H. Jiao and H.-B. Jia, “Chinese Keyword Extraction Based on N-Gram and Word Co-occurrence,” 2007 International Conference on Computational Intelligence and Security Workshops (CISW 2007), Harbin, 2007, pp. 124-127.

[7] J. Mathieu, “Adaptation of a Keyphrase Extractor for Japanese Text,” Proceedings of the 27th Annual Conference of the Canadian Association for Information Science (CAIS-99), Sherbrooke, Quebec, 1999, pp. 182-189.

[8] T. Chumwatana, K. W. Wong and H. Xie, “An Automatic Indexing Technique for Thai Texts Using Frequent Max Substring,” 2009 Eighth International Symposium on Natural Language Processing, Bangkok, 2009, pp. 67-72.

[9] R. Feldman and J. Sanger, “The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data,” Cambridge University Press, Cambridge, 2006.

[10] A. K. Jain and R. C. Dubes, “Algorithms for Clustering Data,” Prentice Hall, New Jersey, 1988.

[11] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction to Cluster Analysis,” John Wiley and Sons, New York, 1990.

[12] Y. Zhao and G. Karypis, “Comparison of Agglomerative and Partitional Document Clustering Algorithms,” SIAM Workshop on Clustering High-Dimensional Data and Its Applications, Washington, DC, April 2002.

[13] Z. Huang, “Extensions to the K-Means Algorithm for Clustering Large Datasets with Categorical Values,” Data Mining and Knowledge Discovery, Vol. 2, No. 3, 1998, pp. 283-304.

[14] D. Dembele and P. Kastner, “Fuzzy C-Means Method for Clustering Microarray Data,” Bioinformatics, Vol. 19, No. 8, 2003, pp. 973-980.

[15] L. J. Heyer, S. Kruglyak and S. Yooseph, “Exploring Expression Data: Identification and Analysis of Coexpressed Genes,” Genome Research, Vol. 9, No. 11, 1999, pp. 1106-1115.

[16] C. C. Fung, K. W. Wong, H. Eren, R. Charlebois and H. Crocker, “Modular Artificial Neural Network for Prediction of Petrophysical Properties from Well Log Data,” IEEE Transactions on Instrumentation & Measurement, Vol. 46, No. 6, December 1997, pp. 1259-1263.

[17] D. Myers, K. W. Wong and C. C. Fung, “Self-Organising Maps Use for Intelligent Data Analysis,” Australian Journal of Intelligent Information Processing Systems, Vol. 6, No. 2, 2000, pp. 89-96.

[18] D. R. Hill, “A Vector Clustering Technique,” In: Samuelson, Ed., Mechanized Information Storage, Retrieval and Dissemination, North-Holland, Amsterdam, 1968.

[19] J. J. Rocchio, “Document Retrieval Systems - Optimization and Evaluation,” Doctoral Thesis, Harvard University, Boston, 1966.

[20] G. Salton, A. Wong and C. S. Yang, “A Vector Space Model for Automatic Indexing,” Communications of the ACM, Vol. 18, No. 11, 1975, pp. 613-620.
[21] O. Zamir, “Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results,” Ph.D. Thesis, Computer Science & Engineering, University of Washington, 1999.

[22] J. Bakus, M. F. Hussin and M. Kamel, “A SOM-Based Document Clustering Using Phrases,” Proceedings of the 9th International Conference on Neural Information Processing (ICONIP’02), Vol. 5, 2002, pp. 2212-2216.

[23] D. Mladenic and M. Grobelnik, “Word Sequences as Features in Text-Learning,” Proceedings of the 17th Electrotechnical and Computer Science Conference (ERK-98), Ljubljana, Slovenia, 1998.

[24] K.-H. Tsai, C.-M. Tseng, C.-C. Hsu and H.-C. Chang, “On the Chinese Document Clustering Based on Dynamical Term Clustering,” Asia Information Retrieval Symposium 2005, Jeju Island, October 2005, pp. 534-539.

[25] C. Kruengkrai and C. Jaruskulchai, “Thai Text Document Clustering Using Parallel Spherical K-Means Algorithm on PIRUN Linux Cluster (in Thai),” The 5th National Computer Science and Engineering Conference, Chiang Mai University, Chiang Mai, 2001, pp. 7-9.

[26] T. Kohonen, “Self-Organization and Associative Memory,” Springer Series in Information Sciences, Springer-Verlag, Berlin, 1984, p. 125.

[27] T. Chumwatana, K. W. Wong and H. Xie, “Frequent Max Substring Mining for Indexing,” International Journal of Computer Science and System Analysis (IJCSSA), India, 2008, pp. 179-184.

[28] T. Chumwatana, K. W. Wong and H. Xie, “An Efficient Text Mining Technique,” 9th Postgraduate Electrical Engineering & Computing Symposium (PEECS 2008), Perth, Australia, 2008, pp. 147-152.

[29] T. Chumwatana, K. W. Wong and H. Xie, “Using Frequent Max Substring Technique for Thai Keyword Extraction Used in Thai Text Mining,” 2nd International Conference on Soft Computing, Intelligent System and Information Technology (ICSIIT 2010), Bali, 1-2 July 2010, pp. 309-314.

[30] T. Chumwatana, K. W. Wong and H. Xie, “Thai Text Mining to Support Web Search for E-Commerce,” The 7th International Conference on e-Business 2008 (INCEB 2008), Bangkok, 2008, pp. 66-70.

[31] J. E. Hodges and Y. Wang, “Document Clustering Using Compound Words,” Proceedings of the 2005 International Conference on Artificial Intelligence (ICAI 2005), Las Vegas, Nevada, 2005, pp. 307-313.