J. Intelligent Learning Systems & Applications, 2010, 2, 117-125
doi:10.4236/jilsa.2010.23015 Published Online August 2010 (http://www.SciRP.org/journal/jilsa)
A SOM-Based Document Clustering Using
Frequent Max Substrings for Non-Segmented
Texts
Todsanai Chumwatana, Kok Wai Wong, Hong Xie
School of Information Technology, Murdoch University, South St, Murdoch, Australia.
Email: {T.Chumwatana, K.Wong, H.Xie}@Murdoch.edu.au
Received March 25th, 2010; revised July 15th, 2010; accepted July 30th, 2010.
ABSTRACT
This paper proposes a non-segmented document clustering method that uses a self-organizing map (SOM) and the frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and has been successful in many applications. However, when it is applied to non-segmented documents, the challenge is to identify interesting patterns efficiently. There are two main phases in the proposed method: a preprocessing phase and a clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to discover the patterns of interest, called frequent max substrings, which are long and frequent substrings rather than individual words, from the non-segmented texts. These discovered patterns are then used as indexing terms, which together with their numbers of occurrences form a document vector. In the clustering phase, SOM is used to generate the document cluster map from the feature vectors of frequent max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering Thai text documents, which consist of non-segmented text, are presented in this paper. The results show that the proposed technique can be used for Thai texts, and the document cluster map generated with the method can be used to find relevant documents more efficiently.
Keywords: Frequent Max Substring, Self-Organizing Map, Document Clustering
1. Introduction
Document clustering has been an important issue [1] due
to the rapid growth in the number of electronic docu-
ments. Document clustering, sometimes generalized as text clustering, identifies the similarity of documents and summarizes a large number of documents using key attributes of the clusters. Document clustering uses unsupervised learning techniques and can assist fast information retrieval or filtering [2], because clustering categorizes documents into groups based on the similarity of their term occurrences. Clustering can thus be used to categorize document databases and digital libraries, as well as to provide useful summary information about the categories for browsing purposes. In information retrieval, a typical search on a document database or the World Wide Web can return several thousand documents in response to a user's queries. It is often very difficult for users to identify their documents of interest among such a large number of documents. Clustering the documents enables the user to gain a clear and easy grasp of the relevant documents, since documents that are similar to each other, and thus likely to be relevant to the same queries, are grouped together.
For text clustering in information retrieval, a document is normally treated as a bag of words, even though a document actually consists of a sequence of sentences and each sentence is composed of a sequence of words. Very often the positions of words are ignored when performing document clustering. Words, also known as indexing terms, and their weights in documents are usually used as the key parameters for computing the similarity of documents [3]. Documents that contain similar indexing terms and frequencies are grouped into the same cluster. This process is straightforward for European languages, where words are clearly delimited by spaces or other special symbols: European texts are explicitly segmented into word tokens that can be used directly as indexing terms. Many algorithms have been developed to calculate the similarity of documents and to build clusters for fast information retrieval [4]. In contrast, document clustering can be a challenging task for
many Asian languages such as Chinese, Japanese, Ko-
rean and Thai, because these languages are non-segmented
languages, i.e., a sentence is written continuously as a
sequence of characters without explicit word boundary
delimiters. Due to this characteristic, texts in a non-seg-
mented document cannot be directly used to calculate the
similarity. Some preprocessing needs to be performed
first to discover keywords for Asian documents before
clustering. As a result, most approaches for clustering
non-segmented documents consist of two phases: a text
mining process to extract the keywords, and a document
clustering process to compute the similarity between the
input documents.
2. Keyword Extraction
Keywords are usually regarded as an important key to
identifying the main content of the documents. Most of
the semantics are usually carried by nouns, although a
sentence in a natural language text is composed of nouns,
pronouns, articles, verbs, adjectives, adverbs, and con-
nectives. Keyword extraction is one of the main applica-
tions of text mining. The objective of text mining is to
exploit useful information or knowledge contained in
textual documents [5]. Information Extraction (IE) is an
essential task in text mining that describes a process of
discovering interesting keywords underlying unstruc-
tured natural-language texts. Most keyword extraction methods proposed in the literature work by constructing a set of words from the given texts; keywords are then selected from this set of words during the preprocessing step. Many approaches have been proposed to extract keywords from non-segmented documents such as Chinese [6], Japanese [7] or Thai documents [8]. Most of these techniques are based on word segmentation, one of the most widely used information extraction techniques in Natural Language Processing
(NLP). However, most word segmentation approaches involve complex language analysis and require long computation times. After keyword extraction is performed, the keywords are transformed into a feature vector of the words that appear in each document. The term weights (usually term frequencies) of the words are also contained in each feature vector. The vector space model (VSM) has been a standard model for representing documents as a set of words with their frequencies [1]. In the VSM, each document is replaced by a vector of word weights, whose size depends on the number of keywords that appear in the documents. For instance, let wik be the weight of keyword k appearing in document i, and let Di = (wi1, wi2, …, wit) be the feature vector for document i, where t is the number of unique words across all documents. The feature vector therefore has t dimensions, as shown in Figure 1.
Figure 1. An example of document vectors in three dimensions
As Figure 1 suggests, the similarity between two documents can be computed from their two corresponding feature vectors with one of several similarity measures, e.g., the cosine measure, the Jaccard measure, and the Euclidean distance measure [9].
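As an illustration, the following Python sketch computes these three measures for a pair of hypothetical term-frequency vectors; the vectors and values are made up for illustration, not taken from this paper.

```python
# A minimal sketch of the three similarity measures named above, applied to
# two toy term-frequency document vectors.
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard_similarity(a, b):
    """Jaccard coefficient on the sets of terms with non-zero weight."""
    sa = {i for i, x in enumerate(a) if x}
    sb = {i for i, y in enumerate(b) if y}
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def euclidean_distance(a, b):
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical 3-dimensional document vectors (t = 3), as in Figure 1.
d1 = [2, 0, 3]
d2 = [1, 1, 2]
print(cosine_similarity(d1, d2), jaccard_similarity(d1, d2),
      euclidean_distance(d1, d2))
```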
3. Document Clustering Algorithms
In document clustering, there are two main approaches:
hierarchical and partitional approaches [10,11,4]. The
hierarchical approach produces document clusters by
using a nested sequence of partitions that can be repre-
sented in the form of a tree structure called a dendrogram.
The root of the tree contains one cluster covering all data points, and singleton clusters of individual data points appear at the leaves of the tree. There are two basic methods for performing hierarchical clustering: agglomerative (bottom-up) and divisive (top-down) clustering [4]. The advantages of the hierarchical approach are that it can use any form of similarity function, and the hierarchy of clusters allows users to explore clusters at any level of detail. However, this technique may suffer from the chaining effect, and its space requirement is at least quadratic, O(n^2), compared with the k-means algorithm's O(Iknm), where I is the number of iterations required, k is the number of clusters, n is the number of documents and m is the dimensionality of the vectors.
The partitional approach [12], on the other hand, comprises several techniques, e.g., the k-means [13], fuzzy c-means [14] and QT (quality threshold) [15] algorithms. The k-means algorithm is the most widely used of these clustering algorithms because of its efficiency and simplicity. The basic idea of the k-means algorithm is to separate the given data into k clusters, where each cluster has a centre point, also called the centroid, which can be used to represent the cluster. The main advantages of the k-means algorithm are its efficiency and simplicity. Its weaknesses are that it is only applicable to data sets where the notion of a mean is defined, that the number of clusters must be specified by the user, and that it is sensitive to outliers, data points that lie very far away from all other points [1].
Furthermore, the self-organizing map (SOM) [16,17], from the family of artificial neural networks, can also be used as a clustering algorithm. The self-organizing map is an unsupervised neural network capable of ordering high-dimensional data in such a way that similar inputs are grouped spatially close to each other. In document clustering, text documents are described by high-dimensional feature vectors, and SOM-based techniques have been applied to document clustering with success. Some of these successful applications are described in the next section.
4. Related Works
Many clustering techniques have been developed and can be applied to clustering English documents. Most of these traditional approaches use documents as the basis for clustering [18,19]. The Vector Space Document (VSD) model is a very widely used data representation model for document clustering [20]. This data model represents each document as a feature vector of the words that appear in the documents; the term weights of the words are also contained in the feature vectors. Similarity measures are then used to compute the similarity of two document vectors. An alternative approach is phrase-based document clustering.
Zamir and Etzioni [21] introduced the notion of phrase-based document clustering: they proposed using a generalized suffix tree to obtain information about the phrases shared between two documents and clustering the documents by their common phrases. In [22], Bakus, Hussin and Kamel used a hierarchical phrase grammar extraction procedure to identify phrases in documents and used these phrases as features for document clustering.
self-organizing map (SOM) method was used as the
clustering algorithm. An improvement in clustering per-
formance was demonstrated when using phrases rather
than single words as features.
Mladenic and Grobelnik used a Naive Bayesian method to classify documents based on word sequences of different lengths [23]. Their experimental results show that word sequences of up to three words can improve the performance of a text classification system, but that when the average length of the word sequences exceeds three words, there is no difference between using word sequences and using single words.
However, there has been little research on phrase-based document clustering for Asian languages, primarily because most Asian-language texts are non-segmented and it is difficult to separate words and phrases from non-segmented text. Most document clustering approaches require a preprocessing stage in which word segmentation, stopword removal or semantic analysis is performed, and NLP techniques provide good support for this step. Word segmentation is a very important step in most NLP tasks: a text is separated into a sequence of tokens using word segmentation techniques. Many such approaches have been proposed for Asian languages such as Chinese, Japanese, Korean and Thai.
In [24], a Chinese document clustering method using a data mining technique and a neural network model was proposed. The technique is divided into two main parts: a preprocessing part, which provides a Chinese sentence segmentation method, and a clustering part, which adopts a dynamic SOM model in order to cluster data dynamically. In addition, this method clusters term vectors instead of document vectors.
For the Thai language, Kruengkrai and Jaruskulchai proposed a parallel algorithm for clustering text documents based on spherical k-means [25]. They implemented the algorithm on the PIRUN Linux Cluster, a parallel computer built with cluster computing technology. Their experimental results show that the parallel algorithm can significantly improve clustering performance.
5. A SOM Based Clustering Using Frequent
Max Substrings for Non-Segmented Texts
In this section, we describe a new method that combines Kohonen's SOM with the frequent max substring technique to group non-segmented text documents into clusters. SOM is one of the main unsupervised learning methods in the family of artificial neural networks (ANN) and was first developed by Teuvo Kohonen in 1984 [26]. The SOM can be visualized as a regular two-dimensional array of cells or nodes (neurons). The SOM algorithm
defines a mapping from the input space onto the two-dimensional array of nodes. When an input vector x(t) ∈ Rn is given, it is compared with all neurons in the SOM array: each neuron i is associated with a model vector mi(t) ∈ Rn, which is gradually modified during the learning process. The input vectors x(t) ∈ Rn represent the input documents, where n is the number of indexing terms, and they are mapped onto the two-dimensional network of cells described by the model vectors mi(t) ∈ Rn.
In this mapping, the node whose vector mi is most similar to the input vector x is activated. This node is often called the best-matching node, or the winner. The winner and a number of its neighbouring nodes in the SOM array are then adjusted towards the input vector x according to the learning principle.
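As a sketch of this learning step, the following Python fragment selects the winner by Euclidean distance and moves the winner and its grid neighbours towards the input; the map size, learning rate and neighbourhood radius are assumed values for illustration, not settings from this paper.

```python
# A minimal sketch of one SOM training step: find the best-matching (winner)
# node, then pull the winner and its grid neighbours towards the input vector.
import numpy as np

rng = np.random.default_rng(0)
rows, cols, dim = 3, 3, 35          # assumed 3 x 3 map of 35-dim model vectors
m = rng.random((rows, cols, dim))   # model vectors m_i(t), randomly initialized

def train_step(x, m, alpha=0.5, radius=1.0):
    # Winner: the node whose model vector is closest to x (Euclidean distance).
    dists = np.linalg.norm(m - x, axis=2)
    wr, wc = np.unravel_index(np.argmin(dists), dists.shape)
    for r in range(m.shape[0]):
        for c in range(m.shape[1]):
            # Gaussian neighbourhood: nodes near the winner move the most.
            grid_dist2 = (r - wr) ** 2 + (c - wc) ** 2
            h = np.exp(-grid_dist2 / (2 * radius ** 2))
            m[r, c] += alpha * h * (x - m[r, c])
    return (wr, wc)

x = rng.random(dim)                 # one toy document vector
print("winner node:", train_step(x, m))
```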
The frequent max substring technique is an information extraction technique used to identify terms, called frequent max substrings, from non-segmented texts in which word boundaries are not clearly defined. The technique was first introduced in 2008 [27] and has been proposed as a language-independent alternative for keyword extraction from non-segmented texts [28,29]. It has also been applied to
indexing for non-segmented languages [8,30], as the technique is efficient in terms of storage space, which helps it support the rapid growth in the number of electronic non-segmented documents. The frequent max substring technique identifies indexing terms, known as frequent max substrings, in non-segmented texts where the word boundaries are not clearly defined. The frequent max substrings are the substrings that appear frequently (at a given frequency threshold value f) and have maximal length in the given string, so these terms are likely to be the patterns of interest. We extract the set of frequent max substrings, or FMAX, by using the frequent max substring technique. The technique uses the Frequent Suffix Trie (FST) data structure to explore the indexing terms [27]. The FST data structure is similar to the suffix trie, an efficient substring enumeration structure; however, the FST enumerates substrings together with their frequency and position information, while the suffix trie enumerates only the substrings themselves. We therefore employ the FST data structure to support extraction of the frequent max substrings. In this technique, a predetermined frequency threshold is applied to reduce the number of indexing terms, using two reduction rules: 1) a reduction rule that uses the predetermined frequency to decide when to terminate extraction, and 2) a reduction rule that uses the superstring definition to reduce the number of indexing terms extracted. The technique also uses a heap data structure to support the computation [27].
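The FST-based algorithm with its reduction rules is what the paper actually uses; as a rough, brute-force illustration of the definition only, the following sketch enumerates substrings, keeps those occurring at least f times, and applies an assumed maximality rule that discards a substring when a longer frequent substring occurs with the same frequency.

```python
# A brute-force sketch of frequent max substring extraction. This naive
# O(n^2)-substring version only illustrates the definition; the efficient
# FST-based algorithm is described in [27]. The min_len cut-off and the
# same-frequency superstring rule are assumptions made for this sketch.
from collections import Counter

def frequent_max_substrings(text, f=2, min_len=2):
    # Count every substring of length >= min_len.
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + min_len, len(text) + 1)
    )
    frequent = {s for s, c in counts.items() if c >= f}
    # Superstring reduction: drop s if a longer frequent substring contains
    # it and occurs just as often (s adds no information beyond t).
    return {
        s: counts[s]
        for s in frequent
        if not any(t != s and s in t and counts[t] == counts[s]
                   for t in frequent)
    }

# Toy non-segmented string (not real Thai): 'abcab' repeats twice, so it is
# kept; 'ab' occurs four times, more often than any superstring, so it stays.
print(frequent_max_substrings("abcabxabcab", f=2))
```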
In this paper, we use a set of non-segmented docu-
ments (Thai documents) as an input to train a map using
SOM. We will describe the process of clustering as fol-
lows.
Let D be a document collection consisting of n documents, d1, d2, ..., dn. First, we use the frequent max substring technique [27] to generate a set of frequent max substrings, at a given frequency threshold value f, from the document collection; these are used as the set of indexing terms for the collection. Assuming the above process produces m frequent max substrings, denoted FMAX = (fm1, fm2, ..., fmm), where fmi is the ith frequent max substring generated from the document collection, these m substrings are used as our indexing terms.
We then calculate the weight wij, the frequency of indexing term fmi occurring in document dj, for each indexing term and each document. Finally, we construct an m × n matrix of these weights. In this matrix, row i holds the frequencies of occurrence of the ith indexing term fmi in the n documents, while the jth column is the document vector for document j, as depicted in Figure 2.
Figure 2 shows an example of a document matrix. In a document matrix, each element wij is the frequency of fmi in document dj, which is at least f, if fmi occurs in dj, and 0 otherwise:

wij = fij, if fmi occurs in dj
wij = 0, otherwise

where fij ≥ f denotes the frequency of fmi in document dj.
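A minimal sketch of building this m × n document matrix, with toy indexing terms and documents standing in for real Thai text, might look as follows.

```python
# Build the m x n document matrix described above: w_ij is the count of
# frequent max substring fm_i in document d_j, or 0 if it never occurs.
# The terms and documents below are hypothetical placeholders.
fmax = ["abcab", "xyz"]                      # m = 2 toy indexing terms
docs = ["abcabxabcab", "xyzxyz", "abcxyz"]   # n = 3 toy documents

def count_occurrences(term, doc):
    """Count (possibly overlapping) occurrences of term in doc."""
    return sum(1 for i in range(len(doc) - len(term) + 1)
               if doc[i:i + len(term)] == term)

matrix = [[count_occurrences(fm, d) for d in docs] for fm in fmax]
for fm, row in zip(fmax, matrix):
    print(fm, row)   # row i = frequencies of fm_i across the n documents
```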
After the document matrix is obtained, the document vectors are presented to the SOM for clustering. Documents are assigned to neurons according to the similarity of their document vectors: two documents with the same or similar document vectors will map to the same neuron, whereas documents may map to distant neurons if they contain different or non-overlapping frequent max substrings. Furthermore, documents with similar frequent max substrings may map to neighbouring neurons. This means that document clusters can be formed by examining the mapped neurons in the document cluster map. Figure 3 depicts the organization of a document cluster map that groups similar documents into the same neuron, shown as boxes; the fm entries in the boxes represent the content of the documents in the collection.
Figure 2. An example of the document matrix with the frequency threshold value f equal to 2

Figure 3. An example of the document cluster map
After the SOM has been trained, the document clusters are formed by labeling each neuron that contains documents of a similar type. The documents in the same neuron may not contain exactly the same set of frequent max substrings (FMAX), but they usually contain mostly overlapping frequent max substrings. As a result, the document cluster map can be used as a prediction model that produces groups of similar documents, and each group can then be assigned a document type by comparing the frequent max substrings of the group with the keywords of each subject area. Figure 4 depicts the clustering of the documents into different groups, in which an input data vector is matched against the neurons in the document cluster map to find document groups of several types.
The following describes the process, illustrated in Figure 4, of matching an input vector x against the neurons in the document cluster map using the SOM. Consider the input vector x = [x1, x2, …, xn]T ∈ Rn, whose n components correspond to the frequent max substrings of the input documents. This input vector is matched against all neurons in the map, represented as a two-dimensional network of cells with model vectors mi = [mi1, mi2, …, min]T ∈ Rn, as depicted in Figure 5. Each neuron i in the network contains a model vector mi with the same number of indexing terms as the input vector x.
As shown in Figure 5, the input vector x is compared with every model vector mi to find the best matching node, called the winner. The winner is the neuron on the map whose set of frequent max substrings is the same as, or most similar to, the set of frequent max substrings of the input vector x according to some matching criterion, e.g., the Euclidean distance between x and mi. As a result, this method can be used to cluster documents into different groups, and it can also reduce the search time for relevant documents.
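A minimal sketch of this matching step is shown below; the trained map is replaced by a random stand-in, and the map dimensions are assumptions carried over from the earlier training sketch.

```python
# Match a new document vector against a (stand-in) trained map: the winner
# is the node whose model vector is nearest to x by Euclidean distance.
import numpy as np

def best_matching_node(x, m):
    dists = np.linalg.norm(m - x, axis=2)   # distance from x to every node
    return np.unravel_index(np.argmin(dists), dists.shape)

rng = np.random.default_rng(1)
m = rng.random((3, 3, 35))    # stand-in for a trained 3 x 3 map
x = rng.random(35)            # feature vector of the query document
print("documents mapped to node", best_matching_node(x, m),
      "are candidate relevant documents")
```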
6. Experimental Studies and Comparison
Results
In this section, we describe an experiment on clustering non-segmented documents (Thai documents) using the proposed SOM and frequent max substring technique. We also compare the proposed technique with SOM-based document clustering using single words as features and with hierarchical clustering using single words on the same group of documents. 50 Thai documents were used as the input dataset to train a map. All the Thai documents come from Thai news websites and cover different categories of content: sport, travel, education and political news. The documents have varying lengths: the document set contains 103,287 characters, and the average document length is 2,065 characters, or about 78 words, per document. The basic statistics for the text collection are shown in Table 1.
Figure 4. Neural network architecture
Figure 5. Self-organizing map
Table 1. Basic statistics for the Thai text collection

Category         No. of Docs   No. of Chars   No. of Words   Avg. Chars/Doc   Avg. Words/Doc
Sport news       15            24727          997            1648.46          66.46
Travel news      15            29096          1022           2078.28          73
Political news   15            38017          1398           2534.46          93.2
Education news   5             9445           336            1889             67.2
In our proposed technique, the set of frequent max substrings was first generated from the document dataset by the frequent max substring technique at a frequency threshold value of 2, and 35 frequent max substrings, the long and frequently occurring terms in the sport, travel, political and education documents, were used as the set of indexing terms for this document collection. The 50 input documents were then transformed into a document matrix of weighted frequent max substring occurrences. Hence, the 35 indexing terms and 50 input documents form a 35 × 50 matrix, where each document vector is represented by a column of the matrix and the rows of the matrix correspond to the indexing terms. We used this matrix to train a map using SOM, with the number of neurons set to 9 in the SOM program, as shown in Figure 6.
In the experimental study, the number of neurons was set to 9 because several map sizes were investigated and 9 neurons provided the best result among them. In the experiments, the groups of political, sport and education documents gave good results, as
similar documents of each type were mapped onto the same neuron. However, some errors occurred within the group of travel documents: the travel documents were mapped onto several neurons due to overlapping terms that appeared across different types of documents.
Figure 6 shows the map containing 9 neurons and the 50 Thai documents; each neuron contains a group of similar documents. The experimental results show that the SOM clustered the 50 documents into 5 neurons on the map, with similar documents grouped into the same neuron, as shown in Table 2.
As observed from the results, this technique can be used to cluster non-segmented documents into several groups according to their similarity. The accuracy of this technique reaches 83.25%. However, in this experiment we found that the groups of education and sport documents were mapped onto the same neuron (Neuron 5) because they both contain mostly overlapping frequent max substrings such as ผลการแข่งขัน (competition result), การจัดอันดับ (position ranking), ได้รับรางวัล (getting award), etc. Furthermore, the contents of the documents and the generated indexing terms are also the main factors affecting the accuracy, since the content of one document may share overlapping terms with two different types of documents. For instance, the Education 5, Travel 1, Travel 3, Travel 11 and Travel 14 documents were mapped onto Neuron 3 because they present information on ecotourism and therefore contain overlapping terms from both education and travel documents.
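The paper does not spell out how this accuracy figure is computed; a common convention is cluster purity, where each neuron is labelled with the majority class of its documents and accuracy is the fraction of documents matching their neuron's label. The sketch below assumes that convention, with toy labels.

```python
# Cluster purity: label each cluster by its majority class; accuracy is the
# share of documents whose true class matches their cluster's label.
# The clusters below are toy data, not the actual experimental assignments.
from collections import Counter

def purity(clusters):
    """clusters: list of lists of true class labels, one list per neuron."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return majority / total

clusters = [["sport"] * 14 + ["travel"],
            ["political"] * 15,
            ["travel"] * 8 + ["education"] * 2]
print(f"purity = {purity(clusters):.2%}")
```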
To evaluate the clustering approaches, we compare the groups into which the documents fall. In this paper, we compare our technique with SOM-based document clustering using single words [31] as features and with hierarchical document clustering using single words [4,12]. In the SOM-based document clustering using single words, single words were first extracted by Thai word segmentation from the same document dataset; 35 single words, the most frequently occurring keywords in the education, sport, travel and political documents, and the 50 input documents were used to form a matrix of the same size as in our earlier experiment, which was then used to train a map using SOM. From the experimental studies, it can be observed that only the groups of political and travel documents provided fairly good results, as shown in Table 3.
The group of similar travel documents was mapped onto one neuron, as was the group of political documents. Meanwhile, the education and sport documents were distributed over several neurons, because most single words extracted from education and sport documents are general terms, and such general terms can be shared by many types of documents.
To depict the experimental results, Figure 7 shows the map containing 9 neurons and the 50 Thai documents; the documents were grouped into the neurons as shown in Table 3.

Figure 6. SOM containing 9 neurons and the groups of similar documents from the 50-document Thai collection

Figure 7. SOM containing 9 neurons and the groups of similar documents from the 50-document Thai collection, using single words as features
It can be observed that the accuracy of the SOM-based document clustering using single words is 72.21%. Moreover, from the experimental results, we found that SOM-based document clustering using frequent max substrings gives better results than SOM-based document clustering using single words, because the frequent max substrings describe the content of the documents more specifically than single words do.
Table 2. Clustering results of using SOM and the frequent max substring technique

Neuron ID   Row   Column   Document ID
Neuron 5    1     2        Political 1, Education 1, Education 2, Education 3, Education 4, Sport 1, Sport 2, Sport 3, Sport 4, Sport 5, Sport 6, Sport 7, Sport 8, Sport 9, Sport 10, Sport 11, Sport 12, Sport 13, Sport 14, Sport 15, Travel 10
Neuron 2    2     1        Political 2, Political 3, Political 4, Political 5, Political 6, Political 7, Political 8, Political 9, Political 10, Political 11, Political 12, Political 13, Political 14, Political 15
Neuron 4    2     2        Travel 2, Travel 4, Travel 5, Travel 6, Travel 7, Travel 8, Travel 9, Travel 13, Travel 15
Neuron 1    3     1        Travel 12
Neuron 3    3     2        Education 5, Travel 1, Travel 3, Travel 11, Travel 14
Table 3. Clustering results of SOM-based document clustering using single words

Neuron ID   Row   Column   Document ID
Neuron 2    1     1        Education 1, Sport 1, Sport 2, Sport 5, Sport 6, Political 1, Political 2, Political 3, Political 4, Political 6, Political 8, Political 9, Political 10, Political 11, Political 12
Neuron 5    1     2        Education 3, Education 5, Sport 7, Sport 8, Sport 9, Sport 11, Sport 13, Travel 7
Neuron 6    1     3        Sport 12, Sport 14
Neuron 1    2     1        Sport 15, Travel 15, Political 7
Neuron 4    2     2        Education 2, Sport 3, Sport 4, Travel 1, Travel 2, Travel 3, Travel 4, Travel 5, Travel 6, Travel 8, Travel 9, Travel 10, Travel 11, Travel 12, Travel 13, Travel 14, Political 13, Political 14, Political 15
Neuron 3    3     2        Education 4, Sport 10, Political 5
Frequent max substrings correspond to long, frequently occurring terms rather than individual words, and other studies have likewise shown an improvement in clustering performance when phrases rather than single words are used as features [21,22].
Additionally, we compared our technique with hierarchical document clustering using single words. Hierarchical document clustering was chosen because it is widely used and has been applied successfully in many document clustering applications [12], and it has also been used to cluster Thai documents. To use this technique with the Thai language, single words were first extracted, using Thai word segmentation techniques, from the same document dataset as in the experiment discussed earlier. After word segmentation, the single words were transformed into feature vectors of the words appearing in the documents, with the term frequency of each word contained in the feature vector. These word feature vectors were then used to compute the similarity of the documents with the hierarchical clustering approach. In the hierarchical clustering program, the feature vectors of the words with their frequencies were used as the input data, and the number of clusters was set to 9 after trial and error. From the experimental results, the groups of political, travel and education documents gave good results; however, the travel and education documents were grouped into the same cluster (Cluster 1). It can be observed that the travel and education documents share overlapping words that appear across different types of documents. In addition, some of the travel, education and sport documents were distributed across several small clusters. As Table 4 shows, the hierarchical clustering program clustered the 50 documents into 9 clusters.
As can be observed from the results, the accuracy of the proposed method reaches 83.25%, while the SOM-based document clustering using single words and the hierarchical clustering approach achieve accuracies of 72.21% and 79.75% respectively.
Table 4. Clustering results of using the hierarchical clustering approach

Cluster ID   Document ID
Cluster 1    Education 1, Education 2, Education 3, Education 4, Sport 9, Sport 13, Travel 1, Travel 2, Travel 3, Travel 4, Travel 5, Travel 7, Travel 8, Travel 9, Travel 10, Travel 11, Travel 12, Travel 13, Travel 14, Travel 15, Political 15
Cluster 2    Education 5
Cluster 3    Sport 1
Cluster 4    Sport 2, Sport 3, Sport 4, Sport 5, Sport 6, Sport 7, Sport 8, Sport 15
Cluster 5    Sport 10, Sport 11
Cluster 6    Sport 12
Cluster 7    Sport 14
Cluster 8    Travel 6
Cluster 9    Political 1, Political 2, Political 3, Political 4, Political 5, Political 6, Political 7, Political 8, Political 9, Political 10, Political 11, Political 12, Political 13, Political 14
The hierarchical clustering approach also created many small clusters containing only a few documents. As a result, an improvement was demonstrated when frequent max substrings rather than single words were used as features. The proposed technique also requires no word-segmentation preprocessing to extract the frequent max substrings, whereas both the SOM-based and the hierarchical document clustering using single words require word segmentation to extract the single words.
7. Conclusions
This paper describes a non-segmented document clustering method that uses a self-organizing map (SOM) and the frequent max substring technique to improve the efficiency of information retrieval. We first use the frequent max substring technique to discover patterns of interest, called frequent max substrings, rather than individual words, from Thai text documents; these frequent max substrings, together with their numbers of occurrences, are then used as indexing terms to form document vectors. SOM is then applied to generate the document cluster map from the document vectors. Experimental studies and comparison results on clustering 50 Thai text documents are presented in this paper. We compared the proposed technique with SOM-based document clustering using single words and with hierarchical document clustering using single words. The experimental results show that our technique clusters the 50 Thai documents into different clusters with higher accuracy than either the SOM-based clustering using single words or the hierarchical clustering approach. As a result, the document cluster map generated by our technique can be used to find the documents relevant to a user's query more efficiently.
REFERENCES
[1] B. Liu, “Web Data Mining: Exploring Hyperlinks,
Con-tents, and Usage Data,” 1st Edition, Springer-Verlag,
New York Berlin Heidelberg, 2007.
[2] D. R. Cutting, D. R. Karger, J. O. Pedersen and J. W. Tukey, “Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections,” Proceedings of ACM SIGIR ’92, Copenhagen, 1992, pp. 318-329.
[3] I. Matveeva, “Document Representation and Multilevel Measures of Document Similarity,” Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume: Doctoral Consortium, New York, 2006, pp. 235-238.
[4] M. Steinbach, G. Karypis and V. Kumar, “A Comparison of Document Clustering Techniques,” KDD Workshop on Text Mining, Boston, 2000.
[5] A.-H. Tan, “Text Mining: The state of the art and the
challenges,” Proceedings of the PAKDD Workshop on
Knowledge Discovery from Advanced Databases, Beijing,
1999, pp. 65-70.
[6] Q. L. H. Jiao and H.-B. Jia, “Chinese Keyword Extraction Based on N-Gram and Word Co-occurrence,” 2007 International Conference on Computational Intelligence and Security Workshops (CISW 2007), Harbin, 2007, pp. 124-127.
[7] J. Mathieu, “Adaptation of a Keyphrase Extractor for
Japanese Text,” Proceedings of the 27th Annual Confer-
ence of the Canadian Association for Information Science
(CAIS-99), Sherbrooke, Quebec, 1999, pp. 182-189.
[8] T. Chumwatana, K. W. Wong and H. Xie, “An Automatic Indexing Technique for Thai Texts Using Frequent Max Substring,” 2009 Eighth International Symposium on Natural Language Processing, Bangkok, 2009, pp. 67-72.
[9] R. Feldman and J. Sanger, “The Text Mining Handbook:
Advanced Approaches in Analyzing Unstructured Data,”
Cambridge University Press, Cambridge, 2006.
[10] A. K. Jain and R. C. Dubes, “Algorithms for Clustering
Data,” Prentice Hall, New Jersey, 1988.
[11] L. Kaufman and P. J. Rousseeuw, “Finding Groups in
Data: An Introduction to Cluster Analysis,” John Wiley
and Sons, New York, 1990.
[12] Y. Zhao and G. Karypis, “Comparison of Agglomerative and Partitional Document Clustering Algorithms,” SIAM Workshop on Clustering High-Dimensional Data and Its Applications, Washington, DC, April 2002.
[13] Z. Huang, “Extensions to the K-means Algorithm for
Clustering Large Datasets with Categorical Values,” Data
Mining and Knowledge Discovery, Vol. 2, No. 3, 1998,
pp. 283-304.
[14] D. Dembele and P. Kastner, “Fuzzy C-Means Method for
Clustering Microarray Data,” Bioinformatics, Vol. 19, No.
8, 2003, pp. 973-980.
[15] L. J. Heyer, S. Kruglyak and S. Yooseph, “Exploring Ex-
pression Data: Identification and Analysis of Coexpressed
Genes,” Genome Research, Vol. 9, No. 11, 1999, pp.
1106-1115.
[16] C. C. Fung, K. W. Wong, H. Eren, R. Charlebois and H.
Crocker, “Modular Artificial Neural Network for Predic-
tion of Petrophysical Properties from Well Log Data,”
IEEE Transactions on Instrumentation & Measurement,
Vol. 46, No. 6, December 1997, pp. 1259-1263.
[17] D. Myers, K. W. Wong and C. C. Fung, “Self-organising
Maps Use for Intelligent Data Analysis,” Australian
Journal of Intelligent Information Processing Systems,
Vol. 6 No. 2, 2000, pp. 89-96.
[18] D. R. Hill, “A Vector Clustering Technique,” In: Samuelson, Ed., Mechanized Information Storage, Retrieval and Dissemination, North-Holland, Amsterdam, 1968.
[19] J. J. Rocchio, “Document Retrieval Systems — Optimi-
zation and Evaluation,” Doctoral Thesis, Harvard Univer-
sity, Boston, 1966.
[20] G. Salton, A. Wong and C. S. Yang, “A Vector Space Model for Automatic Indexing,” Communications of the ACM, Vol. 18, No. 11, 1975, pp. 613-620.
[21] O. Zamir, “Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results,” Ph.D. Thesis, Computer Science & Engineering, University of Washington, 1999.
[22] J. Bakus, M. F. Hussin and M. Kamel, “A SOM-Based Document Clustering Using Phrases,” Proceedings of the 9th International Conference on Neural Information Processing (ICONIP ’02), Vol. 5, 2002, pp. 2212-2216.
[23] D. Mladenic and M. Grobelnik, “Word Sequence as Fea-
tures in Text-learning,” Proceedings of the 17th Electro
technical and Computer Science Conference (ERK-98)
Ljubljana, Slovenia, 1998.
[24] K.-H. Tsai, C.-M. Tseng, C.-C. Hsu and H.-C. Chang, “On
the Chinese Document Clustering Based on Dynamical
Term Clustering,” Asia Information Retrieval Symposium
2005, Jeju Island, October 2005, pp. 534-539.
[25] C. Kruengkrai and C. Jaruskulchai, “Thai Text Document Clustering Using Parallel Spherical K-Means Algorithm on PIRUN Linux Cluster (in Thai),” The 5th National Computer Science and Engineering Conference, Chiang Mai University, Chiang Mai, 2001, pp. 7-9.
[26] T. Kohonen, “Self-Organization and Associative Mem-
ory,” Springer Series in Information Sciences, Springer-
Verlag, Berlin, 1984, p. 125.
[27] T. Chumwatana, K. W. Wong and H. Xie, “Frequent Max Substring Mining for Indexing,” International Journal of Computer Science and System Analysis (IJCSSA), India, 2008, pp. 179-184.
[28] T. Chumwatana, K. W. Wong and H. Xie, “An Efficient Text Mining Technique,” 9th Postgraduate Electrical Engineering & Computing Symposium (PEECS 2008), Perth, Australia, 2008, pp. 147-152.
[29] T. Chumwatana, K. W. Wong and H. Xie, “Using Fre-
quent Max Substring Technique for Thai Keyword Ex-
traction used in Thai Text Mining,” 2nd International
Conference on Soft Computing, Intelligent System and
Information Technology (ICSIIT 2010), Bali, 1-2 July
2010, pp. 309-314.
[30] T. Chumwatana, K. W. Wong and H. Xie, “Thai Text
Mining to Support Web Search for E-Commerce,” The
7th International Conference on e-Business 2008 (INCEB
2008), Bangkok, 2008, pp. 66-70.
[31] J. E. Hodges and Y. Wang, “Document Clustering using
Compound Words,” Proceedings of the 2005 Interna-
tional Conference on Artificial Intelligence (ICAI 2005),
Las Vegas, Nevada, 2005, pp. 307-313.