Using AdaBoost Meta-Learning Algorithm for Medical News Multi-Document Summarization

doi:10.4236/iim.2013.56020

Paper Menu >>

Journal Menu >>

Intelligent Information Management, 2013, 5, 182-190

Published Online November 2013 (http://www.scirp.org/journal/iim)

http://dx.doi.org/10.4236/iim.2013.56020

Open Access IIM

Using AdaBoost Meta-Learning Algorithm for Medical

News Multi-Document Summarization

Mahdi Gholami Mehr

Department of Computer and Information Technology, Malek-Ashtar University of Technology, Tehran, Iran

Email: Mahdi.Gholami.Mehr@gmail.com

Received September 26, 2013; revised October 25, 2013; accepted November 6, 2013

which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Automatic text summarization involves reducing a text document or a larger corpus of multiple documents to a short set

of sentences or paragraphs that convey the main meaning of the text. In this paper, we discuss about multi-document

summarization that differs from the single one in which the issues of compression, speed, redundancy and passage se-

lection are critical in the formation of useful summaries. Since the number and variety of online medical news make

them difficult for experts in the medical field to read all of the medical news, an automatic multi-document summariza-

tion can be useful for easy study of information on the web. Hence we propose a new approach based on machine

learning meta-learner algorithm called AdaBoost that is used for summarization. We treat a document as a set of sen-

tences, and the learning algorithm must learn to classify as positive or negative examples of sentences based on the

score of the sentences. For this learning task, we apply AdaBoost meta-learning algorithm where a C4.5 decision tree

has been chosen as the base learner. In our experiment, we use 450 pieces of news that are downloaded from different

medical websites. Then we compare our results with some existing approaches.

Keywords: Multi-Document Summarization; Machine Learning; Decision Trees; AdaBoost; C4.5; Medical Document

Summarization

1. Introduction

Nowadays there are lots of online medical news on the

web and study of these huge amount of information is

not possible for experts in medical field [1]. Medical

information on the web such as news, articles, clinical

trial reports is an important source to help clinicians in

patient treatment. Usually, clinicians go through author-

written abstracts or summaries available in the medical

domain and then decide whether articles are relevant to

them for in-depth study. Since all types of medical arti-

cles do not come with authors written abstracts or sum-

maries. An automatic muti-document summarization can

be useful for help clinicians or medical students to find

their relevant information on the web.

Text summarization is the process to produce a con-

densed representation of the content of its input for hu-

man consumption. In existing categorization for summa-

rization with respect to the number of input documents,

the summarization is divided into two types, namely sin-

gle and multi documents. Automatic multi document

summarization refers to the production process of a

compressed summary of documents while the content,

readability, and cohesion are maintained [2]. Considering

further complexities of multi document summarization

than single document summarization, we face with some

challenges among them. The most significant ones are as

following [3]:

 The rate of information redundancy is higher in a

group of subject-related texts;

 The need for devoting great attention to the extraction

of unknown perspectives in the documents and cov-

ering all of them;

 Difficulty of producing a highly readable summary

from documents that address the same subject from

different perspectives;

 Difficulty of ordering extractive phrases for produc-

tion of the final summary.

In another categorization based on the type of sum-

mary, if the summarized text was obtained through ex-

tracting some phrases from the original text, the summa-

rization would be “extractive” or “selective”, and if the

text summary is generated after understanding the avail-

able content in the original text, the summarization

M. GHOLAMI MEHR

Open Access IIM

183

would be “abstractive” [4]. Both types of summariza-

tions face different challenges. In extracting method,

main challenges are identifying important sentences in

the text, distinguishing and extracting key words, and

analyzing the main text with the purpose of preparing a

readable and coherent summary. In abstract method, first,

the main text must be understood; then based on the

meaning of the text, a meaningful summary is produced.

In this method, the main challenges are natural language

processing and the analysis of the meaning of the text

with the purpose of comprehension.

In this paper, we present a machine learning based

model for a sentence extraction based, Multi document,

and informative text summarization in the medical do-

main (This work is an improvement of the study pro-

posed in [5]). In our work, we approach automatic text

summarization as a supervised learning task. We treat a

document as a set of sentences, which must be classified

as positive or negative examples of sentences based on

the summary worthiness of sentences where a sentence is

represented by a feature set, which includes a number of

features used in the summarization literature and some

other features specific to the medical domain.

Thus, this summarization task can be formulated as the

classical machine-learning problem of learning from

examples. There are several unusual aspects to this clas-

sification problem. For example, the size of positive ex-

amples in the training set is relatively small compared to

the size of the entire training set because a summary size

is roughly less than one-fourth of the size of the source

document. It has been generally thought that a summary

should be no shorter than 15% and no longer than 35% of

the source text [6].

C4.5 is typically applied to more balanced class dis-

tributions. In our experiment, we found that AdaBoost

improves and performs significantly and uniformly well,

when combined with C4.5. In general, AdaBoost works

by repeatedly running a given weak learning algorithm

on various distributions over the training data, and then

combining the classifiers produced by the weak learner

into a single composite classifier. There seem to be two

separate reasons for the improvement in performance that

is achieved by boosting. The first and better understood

effect of boosting is that it generates a hypothesis whose

error on the training set is small by combining many hy-

potheses whose error may be large (but still better than

random guessing). It seems that boosting may be helpful

to learning problems having either of the following two

properties. The first property, which holds for many

real-world problems, is that the observed examples tend

to have varying degrees of hardness. For such problems,

the boosting algorithm tends to generate distributions that

concentrate on the harder examples, thus challenging the

weak learning algorithm to perform well on these harder

parts of the sample space. The second property is that the

learning algorithm is sensitive to changes in the training

examples so that significantly different hypotheses are

generated for different training sets.

For text summarization applications, we need to rank

sentences based on its summary scores. So, for the sen-

tence ranking, we have to follow a new methodology to

combine decisions (discuss in Section 4).

We adopted and designed ten features to characterize

sentences (taken as basic linguistic units) in the docu-

ments.

The paper is organized as follows. Section 2 provides

related work. In Section 3, we discuss how to extract and

use features. In Section 4, the summarization method has

been discussed. We present the evaluation and the ex-

perimental results in Section 5.

2. Related Work

In this section, we discuss about some previous works

that are used in text summarization. Radev et al. [7] sug-

gested a multi-document extracting summary maker that

extracts the sentences of the summary from several texts.

The extraction is based on center clusters. To increase

coherence, Harry Hilda [8] and Mitra [9] extracted para-

graphs instead of sentences from documents. Knight and

Marcu [10] presented two algorithms for sentence com-

pression which are based on “Noisy Channel” and “De-

cision Tree”. The input of the algorithm of “Decision

Tree” is a long sentence, and the output is supposed to be

a shorter sentence but with more meaningful content.

Barzilay et al. [11] presented an algorithm for the fusion

of information. This algorithm tries to combine similar

sentences of the documents to create a new sentence

based on “Language Generation Technology”. Although

this method simulates human behavior in summarization

process to some extent, it is heavily dependent on exter-

nal sources such as “Dependency Parser”, “Production

Rules”, and etc. Therefore, the portability of this method

is limited. In sentence extracting strategy, clustering is

introduced with the purpose of eliminating redundant

information which is due to using multi documents [12].

But this technique cannot solve the problem of redun-

dancy entirely because some sentences can be in more

than one cluster. To solve this problem some researchers

predefine the number of clusters, or determine a thresh-

old level for similarities. Even if the number of sentences

is predefined, it is probable that the sentences with high-

est score in clusters are not the best sentences. To solve

this problem GA algorithm [13] is used; in this algorithm

the favorable summary is chosen from a group of sum-

maries that are created by combined sentences obtained

from the main documents. Four properties which are

length of sentence, cover criterion, information criterion,

and similarity are used as fitness function of GA algo-

M. GHOLAMI MEHR

Open Access IIM

184

rithm for summarization.

MEAD [6] a popular summarization system ranks

sentences based on its similarity to the centroid, position

in the text, similarity to the first sentence of the article

and length. It uses linear combination of features whose

values are normalized between 0 and 1 for sentence

ranking. Redundancy is removed by a variation of MMR

(Maximal Marginal Relevance) algorithm [14].

Some machine learning approaches to extractive sum-

marization have already been investigated. In [15] sen-

tence extraction is viewed as a Bayesian classification

task. To our knowledge, there are few attempts to use

machine learning algorithm for medical document sum-

marization task. Most of the researchers extend to the

medical domain the summarization techniques already

used in other domains. One of the projects in medical

domain is [16]. MiTAP (MITRE Text and Audio Proc-

essing) monitors infectious disease outbreaks or other

biological threats by monitoring multiple information

sources. The work presented in [17] exploits extractive

techniques, which ranks the extracted sentences accord-

ing to the so-called cluster signature of the document.

The abstracts and full texts from the Journal of the

American Medical Association were used for their ex-

periments. TRESTLE (Text Retrieval Extraction and

Summarization Technologies for Large Enterprises) is a

system, which produces single sentence summaries of

pharmaceutical newsletters [18]. TRESTLE generates

summaries by filling the templates by the Information

Extraction process. The system Helpful Med [19] helps

professional and advanced users to access medical in-

formation on the Internet and in medical related data-

bases. An ontology based summarization approach has

been proposed in [20]. A query based medical informa-

tion summarization system that exploits ontology knowl-

edge has been proposed in [21].

The work presented in [21] uses ontology to expand

query words and assigns scores to sentences based on

number of original keywords (query words) and expand-

ed keywords. Most recently a variation of lexical chain-

ing method [22] called bio-chain [23] is used in bio-

medical text summarization.

Compared to the above-mentioned approaches, we de-

velop a machine learning based model for medical do-

cument summarization that also exploits domain knowl-

edge.

3. Summarization Method

In extractive text summarization approach, the main task

is to identify sentences in a source text, which are rele-

vant to the users while simultaneously reducing informa-

tion redundancy. Sentences are scored based on a set of

features. The top-n highest scoring sentences in a text are

then extracted where n is an upper bound, which is de-

termined by the compression rate. Finally the selected

sentences are presented to the user in their order of ap-

pearance in the original source text [24].

The proposed summarization method consists of three

primary parts that shows in Figure 1.

The preprocessing task includes formatting the docu-

ment, removal of punctuation marks (except dots at the

sentence boundaries).

3.1. Using AdaBoost for Sentence Extraction

We apply a meta-learner called AdaBoost for sentence

extraction, where a C4.5 decision tree [25] has been cho-

sen as the base learner.

The boosting algorithm takes as input a training set of

m examples











S, ,

xyx y where xi is an

instance drawn from some space X and represented in

some manner (typically, a vector of attribute values), and

yi Є Y is the class label associated with xi. In this paper,

we always assume that the set of possible labels Y is of

finite cardinality k.

In addition, the boosting algorithm has access to an-

other unspecified learning algorithm, called the weak

learning algorithm, which is denoted generically as

Weak Learn. The boosting algorithm calls Weak Learn

repeatedly in a series of rounds. On round t, the booster

provides Weak Learn with a distribution Dt over the

training set S. In response, Weak Learn computes a

classifier or hypothesis ht: X → Y which should correctly

classify a fraction of the training set that has large prob-

ability with respect to Dt. That is, the weak learner’s goal

is to find a hypothesis ht which minimizes the (training)

error





Pr t

tiDti i

hx y





 



. Note that this error is

measured with respect to the distribution Dt that was

provided to the weak learner. This process continues for

T rounds, and, at last, the booster combines the weak

hypotheses h1… hT into a single final hypothesis hfin. Still

unspecified are: 1) the manner in which Dt is computed

on each round, and 2) how hfin is computed. Different

boosting schemes answer these two questions in different

ways. AdaBoost uses the simple rule shown in Figure 2.

The initial distribution D1 is uniform over S so D1 (i) =

1/m for all i. To compute distribution Dt+1 from Dt and

the last weak hypothesis ht, we multiply the weight of

example i by some number





0, 1



 if ht classifies xi

correctly, and otherwise the weight is left unchanged.

The weights are then renormalized by dividing by the

normalization constant weights are then renormalized by

dividing by the normalization constant Zt. Effectively,

“easy” examples that are correctly classified by many of

the previous weak hypotheses get lower weight, and

“hard” examples which tend often to be misclassified get

higher weight. Thus, AdaBoost focuses the most weight

on the examples which seem to be hardest for Weak

Learn.

M. GHOLAMI MEHR

Open Access IIM

185

Document Pre

rocessin

Sentence Extraction by

using AdaBoost and c4.5

Summary generation

Figure 1. Summarization method.

 





sequence of m examples

, ,...,,

with lables y1,...,

weak learning algorithm WeakLearn

integer T specifying number of iterations

D1 for all i

Algorithm AdaBoost

Input:

Initialize

xyx ym





:( )

t = 1,2,...,T:

1) Call WekLearn, providing it with the distribution D .

2) Get back a hypothesis h:.

3) Calculate the error of h ()

If 12, set T = t-1 and abo

Do for

ti i

tt t

ih xy

then













  



rt loop.

4) Set 1.

5) Update distribution D:

if h

D1 otherwise

Where Z is a normalization constant

chosen so that D+1will be a distribution.

Outp

tt t

tttii

Di xy





 











:()

the final hypothesis:

hargmaxlog

fin yY th xyt







Figure 2. The algorithm AdaBoost.

The number t



is computed as shown in the figure as

a function of t

. The final hypothesis hfin is a weighted

vote (i.e., a weighted linear threshold) of the weak hy-

potheses. That is, for a given instance x, hfin outputs the

label y that maximizes the sum of the weights of the

weak hypotheses predicting that label. The weight of

hypothesis ht is defined to be



log 1t



so that the

greater weight is given to hypotheses with lower error.

The important theoretical property about AdaBoost is

stated in the following theorem. This theorem shows that

if the weak hypotheses consistently have error only

slightly better than 1/2, then the training error of the final

hypothesis hfin drops to zero exponentially fast. For bi-

nary classification problems, this means that the weak

hypotheses need be only slightly better than random.

Theorem 1: suppose the weak learning algorithm

Weak Learn, when called by AdaBoost, generates hy-

potheses with errors 1,,



, where t

 is as defined

in Figure 1. Assume each 12



 and let





 .

Then the following upper bound holds on the error of

the final hypothesis hfin:





:14 exp 2

fin ii

ih xy







 









Theorem 1 implies that the training error of the final

hypothesis generated by AdaBoost is small. This does

not necessarily imply that the test error is small. How-

ever, if the weak hypotheses are “simple” and T “not too

large”, then the difference between the training and test

errors can also be theoretically bounded.

Traditionally, the component learners are of the same

general form. In our case, all component learners are

decision trees. In general, decision tree induction algo-

rithms have low bias but high variance.

Boosting multiple trees improves performance by re-

ducing variance and this procedure appears to have rela-

tively little impact on bias.

To train a learning algorithm, we need to establish a

set of features and a training corpus of document/extract

pairs. In our work, the main goal is to train a booster of

decision trees with the set of features and the multiple

versions of training set D and combine the decisions of

those trained decision trees to classify a sentence as

summary worthy (positive) or not (negative example).

After completion of training, the trained learning algo-

rithm is tested on unseen instances, which is not part of

training corpus.

3.1.1. Building Corpus

The training and test corpus are built by downloading

medical news form different websites. Then the summa-

ries are manually created for that news. A total of 450

news documents that downloaded from different web-

sites.

For each news, two manual summaries (model sum-

maries) are created by human abstractors. Since summa-

ries are very subjective and user sensitive, for each news

we decide to have three different model summaries cre-

ated by three different human abstractors. Human ab-

stractors are faculty members and postgraduate students

of our institute. Though human abstractors have been

instructed to create abstracts for each article by copying

material from the original text, they freely used their own

sentence construction while summarizing news.

But, to apply machine learning algorithm, we need to

have extracts instead of abstracts because for extracting

features for summary sentences we need to match the

manual summary sentences to the sentences in the origin-

al document. So, for each manually created abstract, we

M. GHOLAMI MEHR

Open Access IIM

186

create an extract by selecting the sentences from the

original document that best match the sentences in the

abstract. The average size of an extract is 25% of the

source document. We choose relatively long medical

news documents in our study because the summarization

becomes more useful for long news documents. The rea-

son behind choosing medical news articles for our ex-

periment is that the medical news does not come with

any abstract or summaries. Though some features of the

medical news articles and general newspaper articles

may overlap, we found that the medical news have some

features such as medical terms, medical cue phrases

which may be absent in the general newspaper articles. It

is a common practice to evaluate a machine learning al-

gorithm using k-fold cross validation, where the entire

dataset is divided into k subsets, and each time, one of

the k subsets is used as the test set and the other k − 1

subsets are put together to form a training set. Thus,

every data point gets to be in a test set exactly once, and

gets to be in a training set k − 1 times. For the evaluation

of the proposed learning based system, the entire dataset

are divided into 2 folds where each fold consists of one

training set of 300 documents and a test set of 150

documents.

3.1.2. Featur es

To characterize the sentences in the medical documents

we have designed and adopted a number of features such

as: centroid overlap, similarity to subject, similarity of

sentences to each other, positive and negative cue phra-

ses, acronyms, sentence position, sentence length, nume-

ral data.

For normalization of each feature value, we divide the

value by the maximum of the scores obtained by the sen-

tences due to the feature. The following we discuss the

features in detail.

 Centroid: this criterion is used to calculate the simi-

larity of the sentences to the central sentence of the

text. The following procedure is used to find the cen-

tral sentence of the text.

 With the use of sentence similarity matrix, total

similarity of a sentence to the rest of the sentences

is calculated with the following method:



SimSenSimMat ,

ijij





SimMat [i,j] shows the similarity of ith sentence to jth

sentence. After calculation, maximum value for SimSeni

is found and the position of the sentence-sentence index

is placed in index-centroid variable.

 With the use of cosine similarity formula, similar-

ity of each sentence to the center is calculated as

follows:





index_centroid

Centroid_SenSim ,

SS

Then the above relation is normalized between 0 and

 Similarity to title: A good summary consists of sen-

tences that are similar to the title [26,27]. This means

that if sentence Si is the most similar sentence to the

title, in comparison to other sentences in the docu-

ment, sentence Si can be considered more important

than other sentences or the most important sentence.

Calculating this criterion is as follows:



title

Title_Sen Sim,

SS

Then the above relation is normalized between 0 and 1.

The more the value (closer to 1), the greater the similar-

ity is.

 Sentence position: A summery which contains the

first and the last sentence of a document is a good

summary [28]. For this reason, the position of the

sentences in a text is very important. The following

formula is used to calculate the position of the sen-

tence in a document:







Pos_Score10.5 Sin1

DocRank ,

 



N is the total number of documents. DocRankj is the

parameter that is used for giving higher score to the first

and the last sentences which have more importance.

Therefore, documents need to be ranked. For ranking, the

number of keywords in each document is divided to the

total number of words in the document. The score of the

document with more keywords is closer to one while

score closer to zero represent fewer keywords.



Num of Keyword

DockRank maxNum of Keyword,

jiiD



 Positive and negative cue phrases in medical do-

main: A sentence gets score of n if it contains n posi-

tive cue phrases and gets score of −n if it contains n

negative cue phrases from our knowledge base.

 Acronyms: A sentence gets a score based on the

number of acronyms it contains. In medical articles,

authors frequently use acronyms for important com-

plex medical terms, perhaps it help them memorize

the things better. So, we consider acronym as an im-

portant feature for medical document summarization

task. If some letters (at least two letters) of a term are

capital, we treat the term as an acronym (gene names,

medical instruments etc.).

 Sentence length long or short sentences are not suit-

able for the summary. Therefore, to calculate this cri-

terion, first the lengths of all sentences in the text are

calculated. Afterward the average length called Lenavg

is calculated. Now, the following formula is used to

score each sentence of the text for the sentence length

M. GHOLAMI MEHR

Open Access IIM

187

criteria:

avg

avg avg

1LenLen

Len

Sen_Len else

Len LenLen













 Numerical data: Numerical data like date and time is

important in news. Hence a sentence gets score (+1) if

it contains numerical data.

3.1.3. Sentence Extrac ti on

Training a learning algorithm for summary sentence ex-

traction requires document sentences to be represented as

feature vectors. For this purpose, we write a computer

program for automatically extracting values for the fea-

tures characterizing the sentences in the documents. For

each sentence in the given document we extract the fea-

ture values from the source document using the measures

discussed in Sub-Section 3.1.2. If the sentence under

consideration is found in both the extracts, extract1 and

extract 2, which are created from the human abstracts

(discussed in 3.1.1), we label the sentence as “Summary

Worthy” sentence. If it is found in one of these extracts,

we label the sentence as “Moderately Summary Worthy”

and if it is not found in any one of these extracts we label

the sentence as “Summary Unworthy”. Thus each sen-

tence vector looks like {<a1 a2 a3 ··· an>, <label>} which

becomes an instance (example) for a base learner C4.5

decision tree, where a1, a2, ···, an, indicate feature values

for a sentence. All the documents in our corpus are con-

verted to a set of instances of the above form. We divide

the entire data set into 3 folds where each fold consists of

one training set corresponding to a set of training docu-

ments and a test set corresponding to a set of test docu-

ments. After preparation of a training set, the multiple

decision trees are trained with the different versions of

the training set and the decisions of those trained deci-

sion trees are combined to classify a sentence as one of

three categories: “Summary Worthy”, “Moderately Sum-

mary Worthy” and “Summary Unworthy”. For each fold,

a model is built from a training set using the boosting

technique and then the learned model is applied to the

test set. For our experiments, we have used Weka (www.

cs.waikato.ac.nz/ml/weka) machine learning tools. Ini-

tially, for each fold, we submit the training data set and

the test data set to Weka. Then we select the option

“boosting” under meta-classifier folder in Weka. We

chose J48 (Weka’s implementation of Quinlan’s C4.5

decision tree) as a base learner and set the number-

of-base learners to the default value which is 10.

Though all the attribute values of the instances in the

training and test sets are continuous, we did not apply

any separate discretization algorithm because C4.5 is

capable of handling continuous attribute values. We con-

figure WEKA in such a way that for each test instance,

we can get the predicted class and the probability esti-

mate for the class. The trained learning algorithm will

assign one of three labels: “Summary Worthy” (SW),

“Moderately Summary Worthy” (MSW), “Summary Un-

worthy” (SU) to a test instance corresponding to a sen-

tence in a test document. It is possible to save the output

in a separate file. We save the output produced by

WEKA in a file and then collect the classification output

for the sentences belonging to each test document. Then

we design a sentence-ranking algorithm based on the

classification output and the probability estimates for the

classes. The algorithm for sentence ranking is given be-

low.

Sentence Ranking Algorithm

Input:

An output file produced by WEKA, which contains the

sentences of a test document with their classifications

and the probability estimates of the classes to which the

sentences belong.

Output: A file containing the ranked sentences

Begin

Read the input file.

 Select those sentences, which have been classified as

“Summary Worthy” (SW) and reorder the selected

sentences in decreasing order of the probability esti-

mates of their classes. Save the selected sentences in

the output file and delete them from the input file.

 Select those sentences, which have been classified as

“Moderately Summary Worthy” (MSW) and reorder

the selected sentences in decreasing order of the

probability estimates of their classes. Save the select-

ed sentences in the output file and delete them from

the input file.

 For the rest of the sentences, which are classified as

“Summary Unworthy”, we order the sentences in in-

creasing order of the probability estimates of the class

labels. In effect, the sentence for which the probabil-

ity estimate is minimum (that is, the sentence is mi-

nimum “Summary Unworthy”) comes at the top. Ap-

pend the ordered sentences to the output file.

 Close the output file.

End of the Algorithm

The sentence-ranking algorithm has three major steps.

At the first step, the sentences, which are classified as

“Summary Worthy”, we undoubtedly select those sen-

tences in the summary.

If the number of sentences selected at Step 1 is less

than the desired number of sentences, we consider those

sentences which are not selected in the summary at the

first step. At the second step, the sentences, which are

classified as “Moderately Summary Worthy”, are con-

sidered. If the number of sentences selected at Step 1 and

Step 2 are less than desired number of sentences, we con-

M. GHOLAMI MEHR

Open Access IIM

188

sider the sentences, which have been classified as “Sum-

mary Unworthy” and order those sentences in increasing

order of the probability estimates of the class labels, that

is, the sentences are ordered from minimum summary

unworthiness (maximum summary worthiness) to maxi-

mum summary unworthiness (minimum summary wor-

thiness). They are selected in this order one by one in the

summary until the desired summary length is reached.

4. Summary Generation

After ranking the sentences, n top ranked sentences are

selected to generate the final summary. Value of n de-

pends on the compression rate. But, the summary pro-

duced in this way may contain some redundant informa-

tion, that is, some sentences in the summary may entail

partially or fully the concept embodied in other sentences.

This restricts the summary to be more informative when

the summary length is a restriction. Moreover, a user

who is used to just looking at first few sentences repre-

senting the same concept will prefer to see something

different information, though marginally less relevant.

To keep the sentences in the summary sufficiently dis-

similar from each other, the diversity based re-ranking

method called Maximal Marginal Relevance (MMR) is a

well-known measure. This approach uses a ranking pa-

rameter that allows the user to slide between relevance to

the query and diversity from the sentences seen so far.

The MMR algorithm is most suitable to apply in query-

focused summarization where the summary will be fo-

cused toward the user’s query. But in our generic sum-

marization environment where only one generic sum-

mary will be produced for a text document, we have used

a variant of the MMR algorithm to remove redundancy in

the summary. This algorithm works as follows:

 Rank the sentences using the ranking algorithm dis-

cussed in Sub-Section 3.1.3.

 Select the top ranked sentence first.

 Select the next sentence from the ordered list and in-

clude into the summary if this sentence is sufficiently

dissimilar to all of the previously selected sentences.

 Continue selecting sentences one by one until the pre-

defined summary length is reached.

The similarity between two sentences is measured us-

ing cosine similarity metric. If the cosine similarity be-

tween two sentences is greater (less) than a threshold, we

say that the sentences are similar (dissimilar). The cosine

similarity between two sentences is measured by the fol-

lowing formula as stated in [29].





Idf-modified-cosine ,

ii ii

xx xyyy

xx yy

tf tfidf

tf idftfidf

 



















where ,S



is the number of occurrences of the word



in the sentence S, idf



inverse document fre-

quency of the word



and i

is the i-th word in the

sentence x and i

y is the i-th word in the sentence y. idf

value of a word is computed on a corpus of documents

using the formula: log(N/df) where N is the number of

documents in the corpus and df is the number of docu-

ments in the corpus that contain the word. Finally, the

sentences selected in the above-mentioned manner are

reordered using text order (sorted in the order in which

they appear in the input texts) to increase the readability

of the summary.

5. Experimental Results

To evaluate our summarization system, 450 medical news

articles have been downloaded from a number of online

medical news sources. From the downloaded articles, the

images and other links are manually removed and only

the news content is considered.

Traditionally, for each system generated summary,

more than one model summaries are used for evaluation

because the human abstractors may disagree with each

other in producing the summary of the document. But,

manual summary creation is a tedious task. In our ex-

periments, we have used two reference summaries for

evaluating a system generated summary.

For system evaluation, we have used precision and re-

call.

Precision and recall: Precision and recall are the

well-known evaluation measures in the information re-

trieval settings. Since our system extracts sentences from

the source document to form a summary, we define pre-

cision and recall as follows:

Precision N



where, N = number of extracted sentences matched with

a reference summary and K = number of sentences ex-

tracted by the system.

Recall N



where, N = number of extracted sentences matched with

a reference summary and M = number of sentences in the

reference summary. Since we have used two reference

summaries for evaluating a system generated summary,

we have compared the system summary to each of the

reference summaries and computed the precision and re-

call. Thus for each system generated summary, we get

one pair of precision and recall values for the first refer-

ence summary and another pair of precision and recall

values for the second reference summary. We define the

average 12

precision

R and the average 12

recall

R as fol-

lows:

M. GHOLAMI MEHR

Open Access IIM

189

Average precision2

Average recall2









where 1

P= average precision of a system, where the

precision is computed by comparing the system gener-

ated summary and the first reference summary for a

document, 2

P= average precision of a system, where

the precision is computed by comparing the system gen-

erated summary and the second reference summary for a

document, 1

P= average recall of a system, where the

recall is computed by comparing the system generated

summary and the first reference summary for a document,

P= average recall of a system, where the recall is

computed by comparing the system generated summary

and the second reference summary for a document.

For evaluating the system using precision and recall,

we set the compression ratio to 15% and 20%. Compres-

sion ratio r% means r percent of the total sentences in the

source documents are extracted as a summary. To meas-

ure the overall performance of the proposed learning

based summarization system, our experimental dataset

consisting of 450 documents are divided into 3 folds for

3-fold cross validation where each fold contains two in-

dependent sets: a training set of 300 documents and a test

set of 150 documents. For each fold, a separate model is

built from 300 documents and the learned model is ap-

plied to the test set of 150 documents. Thus, for each task,

if we consider all three folds, we can get a summary for

each of 450 documents in our corpus. For other systems

such as MEAD and the lead baseline system (which sim-

ply takes the first n words or n sentences of the document)

and Bagging Method to which the proposed system is

compared, we run the systems on the entire 450 docu-

ments in our corpus to collect 450 summaries for each

task.

Table 1 shows the results in terms of precision and re-

call for the compression ratio set to 15% and Table 2

shows the results for the compression ratio set to 20%.

By analyzing Table 1, we find that for 15% summary

generation task, the learning based system performs bet-

ter than the Bagging Method, lead baseline and MEAD,

but MEAD performs worse than the lead baseline and

Bagging method.

Table 2 shows that for 20% summary generation task,

MEAD performs better than the lead baseline whereas

the learning based system performs better than MEAD

and Bagging method.

6. Conclusion

This paper discusses a machine learning based model for

text summarization in medical domain. Most of previous

works on text summarization in the medical domain extends

Table 1. Precision and recall for 15% summary generation

task on the test data set.

Average PrecisionR1R2 Average RecallR1R2

Proposed Method0.69 0.26

Mead 0.54 0.24

Baseline-Lead 0.58 0.25

Bagging Method0.63 0.29

Table 2. Precision and recall for 20% summary generation

task on the test data set

Average PrecisionR1R2 Average RecallR1R2

Proposed Method0.61 0.34

Mead 0.54 0.31

Baseline-Lead 0.47 0.27

Bagging Method0.59 0.35

the various features used in other domains to the medical

domain. In our work, we have combined several medical

domain specific features with some other features used in

the state-of-art summarization approaches. A machine-

learning tool has been used for effective feature combi-

nation. The proposed approach performs better than the

systems it is compared to.

REFERENCES

[1] S. D. Afantenos, V. Karkaletsis and P. Stamatopoulos,

“Summarization from Medical Documents: A Survey,”

Journal of Artificial Intelligence in Medicine, Vol. 33, No.

2, 2005, pp. 157-177.

http://dx.doi.org/10.1016/j.artmed.2004.07.017

[2] I. Mani, “Automatic Summarization,” John Benjamins Pub-

lishing Company, Amsterdam/Philadelphia, 2001.

[3] X. Wan, J. Yang and J. Xiao, “Manifold-Ranking Based

Topic-Focused Multi Document Summarization,” Pro-

ceedings of the 20th International Joint Conference on

Artifical Intelligence, Hyderabad, 6-12 January 2007, pp.

2903-2908.

[4] H. Jing and K. McKeown, “Cut and Paste Based Text

Summarization,” Proceedings of the 1st Meeting of the

North American Chapter of the Association for Computa-

tional Linguistics (NAACL 2000), Seatle, Washington DC,

June 2000.

[5] K. Knight and D. Marcu, “Summarization Beyond Sen-

tence Extraction: A Probablistic Approach to Sentence

Compression,” Artificial Intelligence, Vol. 139, No. 1, 2002,

pp. 91-107.

[6] E. H. Hovy, “Automated Text Summarization,” In: R.

Mitkov, Ed., The Oxford Handbook of Computational

Linguistics, Oxford University Press, Oxford, 2005, pp.

583-598.

http://dx.doi.org/10.1093/oxfordhb/9780199276349.013.0

032

[7] D. R. Radev, H. Jing, M. Sty and D. Tam, “Centroid-

M. GHOLAMI MEHR

Open Access IIM

190

Based Summarization of Multiple Documents,” Journal

of Information Processing and Management, Vol. 40, No.

6, 2004, pp. 919-938.

http://dx.doi.org/10.1016/j.ipm.2003.10.006

[8] H. Hilda, “Cross-Document Summarization by Concept

Classification,” Proceedings of the 25th Annual Interna-

tional ACM SIGIR Conference on Research and Devel-

opment in Information Retrieval, Tampere, 11-15 August

2002, pp. 121-128.

[9] M. Mitra, S. Amit and B. Chris, “Automatic Text Sum-

marization by Paragraph Extraction,” ACL/EACL-97 Work-

shop on Intelligent Scalable Text Summarization, Madrid,

20 July 1997, pp. 31-36.

[10] K. Knight and D. Marcu, “Summarization beyond Sen-

tence Extraction: A Probablistic Approach to Sentence

Compression,” Artifcial Intelligence, Vol. 139, No. 1,

2002, pp. 91-107.

[11] B. Regina, K. R. McKeown and M. Elhadad, “Informa-

tion Fusion in the Context of Multi Document Summari-

zation,” Proceedings of the 37th Annual Meeting of the

Association for Computational Linguistics, Morristown,

20-26 June 1999, pp. 550-557.

[12] J. Manuel and MANA-LOPEZ, “Multi-Document Summa-

rization: An Added Value to Clustering in Interactive Re-

trieval,” ACM Transactions on Information Systems, Vol.

22, No. 2, 2004, pp. 215-241.

[13] Y.-X. He, D.-X. Liu, D.-H. Ji and C. Teng, “MSBGA: A

Multi-Document Summarization System Based on Ge-

netic Algorithm,” Proceedings of the 5th International

Conference on Machine Learning and Cybernetics, Dali-

ban, 13-16 August 2006, pp. 2659-2664.

[14] G. J. Carbonell and J. Goldstein, “The Use of MMR,

Diversity-Based Re-Ranking for Reordering Documents

and Producing Summaries,” Proceedings of the 21st An-

nual International ACM SIGIR Conference on Research

and Development in Information Retrieval, Melbourne,

24-28 August 1998, pp. 335-336.

[15] J. Kupiec, J. O. Pedersen and F. Chen, “A Trainable Do-

cument Summarizer,” Proceedings of the 18th Annual

International ACM SIGIR Conference on Research and

Development in Information Retrieval, Seattle, Washing-

ton DC, 1995, pp. 68-73.

[16] L. D. Day, L. Hirschman, R. Kozierok, S. Mardis, T.

McEntee, et al., “Real Users, Real Data, Real Problems:

The MiTAP System for Monitoring Bio Events,” Pro-

ceedings of the Conference on Unified Science & Tech-

nology for Reducing Biological Threats & Countering

Terrorism (BTR 2002), University of New Mexico, Mex-

ico, 2002, pp. 167-177.

[17] D. B. Johnson, Q. Zou, J. D. Dionisio, V. Z. Liu and W.

W. Chu, “Modeling Medical Content for Automated Sum-

marization,” Annals of the New York Academy of Sciences,

Vol. 980, 2002, pp. 247-258.

[18] R. Gaizauskas, P. Herring, M. Oakes, M. Beaulieu, P. Wil-

lett, H. Fowkes, et al., “Intelligent Access to Text: Inte-

grating Information Extraction Technology into Text

Browsers,” Proceedings of the Human Language Tech-

nology Conference (HLT 2001), San Diego, 2001, pp. 189-

193.

[19] H. Chen, A. Lally, B. Zhu and M. Chau, “HelpfulMed:

Intelligent Searching for Medical Information over the

Internet,” Journal of American Society for Information

Science and Technology (JASIST), Vol. 54, No. 7, 2003,

pp. 683-694. http://dx.doi.org/10.1002/asi.10260

[20] M. Fiszman, T. Rindflesch and H. Kilicoglu, “Abstraction

Summarization for Managing the Biomedical Research

Literature,” Proceedings of the HLT-NAACL Workshop

on Computational Lexical Semantics, Stroudsburg, 2004,

pp. 76-83.

[21] P. Chen and R. Verma, “A Query-Based Medical Infor-

mation Summarization System Using Ontology Knowl-

edge,” Proceedings of the 19th IEEE Symposium on Com-

puter Based Medical Systems, Salt Lake City, 2006, pp.

37-42.

[22] R. Barzilay and M. Elhadad, “Using Lexical Chains for

Text Summarization,” In: I. Mani and M. T. Maybury,

Eds., Advances in Automatic Text Summarization, The

MIT Press, Cambridge, 1999, pp. 111-121.

[23] L. H. Reeve, H. Han and A. D. Brooks, “The Use of Do-

main-Specific Concepts in Biomedical Text Summariza-

tion,” Journal of Information Processing and Manage-

ment, Vol. 43, No. 6, 2007, pp. 1765-1776.

http://dx.doi.org/10.1016/j.ipm.2007.01.026

[24] R. Barzilay, N. Elhadad and K. McKeown, “Sentence

Ordering in Multi-Document Summarization,” Proceed-

ings of the Human Language Technology Conference,

San Diego, 2001, pp. 1-7.

[25] J. R. Quinlan, “C4.5: Programs for Machine Learning,”

Morgan Kaufmann, California, 1993.

[26] G. Salton and C. Buckley, “Term-Weighting Approaches

in Automatic Text Retrieval,” Information Processing

and Management, Vol. 24, No. 5, 1998, pp. 513-523.

http://dx.doi.org/10.1016/0306-4573(88)90021-0

[27] J. Silla, C. Nascimento, G. L. Pappa, A. A. Freitas and C.

A. A. Kaestner, “Automatic Text Summarization with Ge-

netic Algorithm-Based Attribute Selection,” Lecture Notes in

Artificial Intelligence, 2004.

[28] C. Y. Lin and E. Hovy, “Identifying Topics by Position,”

Proceedings of the 5th Applied Natural Language Proc-

essing Conference, 1997, pp. 283-290.

[29] G. Erkan and D. R. Radev, “LexRank: Graph-Based Lexi-

cal Centrality as Salience in Text Summarization,” Jour-

nal of Artificial Intelligence Research (JAIR), Vol. 22,

2004, pp. 457-479.