Paper Menu >>
Journal Menu >>
Communications and Network, 2013, 5, 81-85 doi:10.4236/cn.2013.51B019 Published Online February 2013 (http://www.scirp.org/journal/cn) Design and Implementation of a New Chinese Word Segmentation Dictionary for the Personalized Mobile Search* Zhongmin Wang, Jingna Qi, Yan He School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an, China Received 2012 ABSTRACT Chinese word segmentation is the basis of natural language processing. The dictionary mechanism significantly influ- ences the efficiency of word segmentation and the understanding of the user’s intention which is implied in the user’s query. As the traditional dictionary mechanisms can't meet the present situ ation of personalized mobile search, this pa- per presents a new dictionary mechanism which contains the word classification information. This paper, furthermore, puts forward an approach for improving the traditional word bank structure, and proposes an improved FMM segmenta- tion algorithm. The results show that the new dictionary mechanism has made a significant increase on the query effi- ciency and met the user’s individual requirements better. Keywords: Chinese Word Segmentation; Dictionary Mechanism; Natural Language Processing; Personalized Search; Word Classification Information 1. Introduction With the rapid growth of network information resources, the personalized information services have become a hot issue of the contemporary search engine. How to obtain the user's search intention becomes the primary task of the personalized search. At present, many technologies can be used to get users’ interests, such as natural lan- guage understanding, concept-based retrieval, semantic network technology, etc. In order to overcome some of the shortcomings of these common technologies and meet the need of the users’ mobile search, a new diction- ary word segmentation mechanism is proposed which can get the word classification information of the user’s query directly after word segmentation. And these cate- gory messages can be used to infer the user’s search in- tentions. The traditional dictionary mechanisms for Chinese word segmentation are mainly based on binary-seek- by-word, TRIE indexing tree, binary-seek-by-characters [1]. Among the three methods, the mechanism based on binary-seek-by-characters is the improvement of the pre- vious two dictionary mechanisms, which integrates the simple dictionary body of binary-seek-by-word diction- ary mechanism and high-efficiency search processing of TRIE indexing tree dictionary mechanism. Since it still uses the binary-seek-by-word structure, the search range for second word is not reduced, and its efficiency is lim- ited [2]. At present, public dictionary mechanisms have their own strengths, but they are basically improved on the basis of traditional dictionary mechanism. Such as double-character-hash-indexing, multi-character-hash-in- dexing [1,2,4,5]. They are indeed effective ways to im- prove the matching efficiency of the system. But directly using the hash method on words makes hash table or index table difficult to build, and the complex structure of this kind of word bank makes it difficult to maintain. This paper puts forward an efficient dictionary mecha- nism for word segmentation based on the classification information of word. It can get word classification in- formation of the users’ query to provide the basis for the establishment of the user’s interest model. On this basis, this paper also provides an approach for improving the FMM algorithm to fit for the new dictionary mechanism which could greatly enhance the efficiency of the word segmentation. 2. A New Dictionary Mechanism with the Word Classification Information The new dictionary mechanism that contains the word classification information is the improvement of bi- nary-seek-by-characters. It employs a new hash mecha- *This work is supported b y Shaanxi Science & Technology Depa rtment p rojec t (2011K06-26) & national natural science foundation project (61100166) Copyright © 2013 SciRes. CN Z. M. WANG ET AL. 82 nism of the second-character’s area code subsection, which can greatly reduce searching scope of the sec- ond-character and accelerate the process of the system. The new dictionary mechanism could also provide the word classification information after word segmentation which is the basis for the personalized analysis of the user’s intention. 2.1. New Dictionary Architecture The category information of the new dictionary is de- signed as Figure 1, which consists of entertainment, sports, urban, nature, engineering and other 12 kinds of word category. Each category contains the corresponding branches with 3 levels, an d each level is encoded sequen- tially. This encoding method can quickly identify the classification information of the target word. As for the design of the new dictionary with word classification information, firstly process the original linear dictionary according to the category information to get independent linear dictionary based on category in- formation. Then change the classified linear dictionary into the new dictionary according to the hash mechanism of the second-character’s area code subsection. And clas- sification information is added to each entry in this proc- ess. 2.2. Chinese Word Frequency Statistics As [3,7] shown, the number of two-character words is quite larger than the single words and multi-character words. As the number of words that have the same first character is so large, it needs more time to find the sec- ond character using the dichotomizing search mechanism. So reducing the searching scope of the second character can improve the query efficiency greatly. To avoid the problem of too long hash table of the second character caused by double-character-hash-in- dexing, this paper proposes a new method—the hash mechanism of the second-character’s area code subsec- tion. First of all, get the statistical result of distribution range of the second word in the original word bank. Figure 1. Category information of the dictionary. Second, divide second characters into non-uniform sub- section. In this process, entries with the same first char- acter are mapped to different subsection according to their second-character’s area code (Section number is: 1-20). The new mechanism only increases a small amount of storage space, but greatly reduces the range of the second-character inquiry. 2.3. The Construct Procedure of Dictionary During the construction of dictionary, each character in the first-character Hash table is added with an additional data structure which is used to store the information of area code subsection of the second-character. Then the second-character is hashed into 20 sub-tables according to its area code. Through this mechanism, we can get the pointer of second-character indexing table according to the section number of the second-character’s area code. This will greatly reduce the range of inquiry of the sec- ond-character. The construct procedure of the dictionary is shown in Figure 2. Table 1. Chinese word freq uency statistics. The number of characters1 2 3 4 5 The number of words 9919 65891 26352 21699 5124 Figure 2. Construct flow of dictionary. Copyright © 2013 SciRes. CN Z. M. WANG ET AL. Copyright © 2013 SciRes. CN 83 Figure 3. The logical structure of the dictionary. The dictionary mechanism proposed in this paper also integrates the information of the word frequency. During the segmentation, word frequency can be taken into ac- count. And the words with high frequency have the pri- ority to be selected, which may further improve the effi- ciency of the system. 2.4. New Dictionary’s Logical Structure The logical structure of the new dictionary is shown in Figure 3. As shown in this figure, the new dictionary structure consists of four parts: First-character Hash table, Area code subsection of the second-character Hash table, Second-character indexing table, Dictionary text table. 1) First-character Hash table: F_hash First-character Hash table which adds some attribute values is formed on the basis of present mature mecha- nism of First-character Hash table. The structure is shown in Table 2. The first characters (word First) are the Chinese char- acters from GB2312 encoding table. According to the machine code of the Chinese characters, we can quickly locate the first character in the F_hash table. Hash func- tion is as follows: )1xA0(94)0xB0( 21 c coffset (1) c1, c2, represent the high byte and the low byte of the character’s machine code. Offset represents character’s array subscripts in F_hash. In this table, is Word represents whether the first cha- racter is a word. If the first character is a word (the value of is Word is TRUE), then frequency and coding in the table can be used to record the word frequency and their classification respectively. S_hash is a pointer which points to the table of S_hash. This table belongs to the second-character of words which begin with wordFirst. After obtain the second-character’s subsection index, it can navigate to the next unit with the current pointer of s_hash. 2) Area code subsection of the second-character Hash table: S_hash In the process of this stage, all the words in the origi- nal word bank are analyzed statistically to get the fre- quency of each second-character. And then according the frequency, the second-characters are arranged into dif- ferent areas. The second-character with higher frequency is divided into a small interval, and low frequency is di- vided into large interval. By this means, words begin with the same character are divided into non-uniform subsection. It is the non-uniform interval that makes the number of words in each subsection basically achieve uniform distribution. The new mechanism can greatly reduce the inquiry range of the second-character, and can avoid the disadvantages of query efficiency limitation caused by non-uniform distribution of the second-char- acter. The structure of S_hash is shown in Table 3 re- gionIndex (1-20) represents subsection index of sec- ond-character’s area code. s_index is a pointer that points to the table of S_index. 3) Seco nd-charact er indexing ta ble : S_ind ex Table 2. F_hash structural. wordFirstisWord frequency coding s_hash Table 3. S_hash structura. regionIndex s_index Z. M. WANG ET AL. 84 This table stores the second-character of words which meet the table of F_hash and S_hash constraints. Each record in the table consists of 5 parts. Its structure is shown in Table 4 word Sec represents second-character. isWord, frequency, coding are same as the property in First-character Hash table. last_table is a pointer that points to the table of Last_table. 4) Dictionary text table: Last_table Last_table is composed of dynamic arrays which stores all words excluding the first two characters. The structure of this table is shown in Table 5 lastStr repre- sents the remaining string. Frequency, coding is same as the property in First-character Hash table. 3. Improved Fmm Algorithm Based on the new dictionary construction, this paper proposed an improved Former Max Matching algorithm (FMM). In word segmentation, instead of using fixed maximum segmentation length [6-8], the improved FMM algorithm using dynamic length determined by the target words which begin with the current first two characters on query sentence segmentation. This method effectively overcomes the ineffectiveness of length limitation caused by FMM. Compared to FMM, the improved FMM has the fol- lowing two improvements: 1) According to the new dictionary structure, the sec- ond-character W2 is located like the following way. First, obtain region Index depending on area code of W2. Sec- ond, locate second-character’s place in S_hash that cor- responding to W1 according to the value of region Index. Then, we can find the second-character in S_index ac- cordance with region Index. This strategy of sec- ond-character subsection can greatly reduce the search- ing scope of the second-character. 2) After getting the position of the second-character, we can obtain Last_table corresponding to W1W2. Then the improved FMM intercepts the current Chinese string according to the length of words in Last_table and com- pares the intercepted word with th e corresponding words in Last_table. During the process, the improved FMM will select the optimal segmentation according to word length and frequency. Because the segmentation length is dynamically determined by the length of words to be matched, the efficiency is improved greatly, and the shortcoming of low efficiency, caused by fixed segmen- tation length in FMM, is avoided. Table 4. S_index structural. wordSec isWord frequency coding last_table Table 5. S_index structural. 4. Experimental Results By comparing the hash mechanism of the second-char- acter’s area code subsection with binary-seek-by-char- acters dictionary mechanism, the new dictionary mecha- nism is tested from two aspects: time and space com- plexity. In the experiments, the same improved FMM algorithm is used to process the test text (31K) based on the two dictionaries. The experimental results are given in Table 6: As shown in Table 6, the memory space of new dic- tionary mechanism is 13% more than that of the bi- nary-seek-by-characters dictionary mechanism, but the time efficiency is improved by about 20%. Besides, word classification information can be obtained after word segmentation. For example, the input test sentence is: “计算机网络是计算机技术与通信技术结合的产物”, and output result is show n in Figure 4. For the first word of “计算机网络” of the result, it belongs to the classification“031402,031404,032001” which indicates the word belongs to the engineering ap- plication of computer science and technology. Table 6. Time and space consuming comparison of the two dictionary mechanism. Dictionary mechanism Space consuming/Byte Time consuming/ms Binary-seek-by-characters 3553,068 156 New mechanism 4014,108 125 Figure 4. Segmentation test. Copyright © 2013 SciRes. CN Z. M. WANG ET AL. Copyright © 2013 SciRes. CN 85 5. Conclusions The hash mechanism of the second-character’s area code subsection, proposed in this paper, can greatly reduce the searching scope of the second character. It can improve the speed of word segmentation, and realize the im- provement of time efficiency by consuming small space. With the increas e in the n umber of dictionary entries, the superiority of the new dictionary mechanism will be more significant. Meanwhile, the new dictionary inte- grates word classification information effectively which can be used for user’s interest modeling and provides the basis to meet the personalized needs of mobile users. REFERENCES [1] M.-S. Sun and Z.-P. Zuo, “An Experimental Study on Dictionary Mechanism for Chinese Word Segmentation,” Journal of Chinese Information Processing, Vol. 1, 2000, pp. 1-6. [2] W. Yang, L.-Y. Ren and R. Tang, “A Dictionary Mecha- nism for Chinese Word Segmentation Based on the Finite Automata,” 2010 International Conference on Asian Language Processing (IALP), pp. 39-42. [3] Z. X. Li, Z. P. Xu, W. Q. Tang and R. X. Tang, “Ambi- guity Processing in Word Segmenting,” Computer Engi- neering and Applications, Vol. 38, No. 11, 2002, pp. 106-109. [4] Q. Y. Zhang and S. Chai, “Chinese Word Segmentation Dictionary using Two-level Index,” Computer Engineer- ing and Applications, Vol. 19, 2009. [5] Q. H. Li, Y. J. Chen and J. G. Sun, “A New Dictionary Mechanism for Chinese Word Segmentation,” Journal of Chinese Information Processing, Vol. 17, 2003, pp. 13-18. [6] Y. Niu and L. L. Li, “An Improved Chinese Segmentation Algorithm Based on New Dictionary Construction,” In- ternational Conference on Computational Science and Engineering, Vol. 2, 2009, pp. 993-996. [7] A. Choi, C. H. Cheng and Y. L. Ko, “Word Extraction from Chinese Documents by Occurrence Counts,” 1988 International Conference on Computer Processing of Chinese and Oriental Languages, Toronto, Canada, pp. 488-491. [8] H. Y. Cui, “Research 0n an Improved Chinese Segmenta- tion Algorithm based on Word Frequency Statistic,” In- formation Technology, Vol. 04, 2008. |