Journal of Software Engineering and Applications, 2011, 4, 653-665
doi:10.4236/jsea.2011.412077 Published Online December 2011 (http://www.SciRP.org/journal/jsea)

Strategy for Data Stream Processing Based on Measurement Metadata: An Outpatient Monitoring Scenario

Mario Diván1,2, Luis Olsina2, Silvia Gordillo3

1Law and Economy School, Universidad Nacional de La Pampa, Santa Rosa, Argentina; 2Engineering School, National University of La Pampa, General Pico, Argentina; 3LIFIA, Informatics School, National University of La Plata, La Plata, Argentina.
Email: mjdivan@eco.unlpam.edu.ar, olsinal@ing.unlpam.edu.ar, gordillo@lifia.info.unlp.edu.ar

Received October 25th, 2011; revised December 1st, 2011; accepted December 12th, 2011.

ABSTRACT

In this work we discuss SDSPbMM, an integrated Strategy for Data Stream Processing based on Measurement Metadata, applied to an outpatient monitoring scenario. The measures associated with the attributes of the patient (entity) under monitoring come from heterogeneous data sources as data streams, together with metadata associated with the formal definition of a measurement and evaluation project. Such metadata supports patient analysis and monitoring in a more consistent way, facilitating, for instance: 1) the early detection of problems typical of data, such as missing values and outliers; and 2) risk anticipation by means of on-line classification models adapted to the patient. We also performed a simulation using a prototype developed for outpatient monitoring, in order to empirically analyze processing times and variable scalability, which sheds light on the feasibility of applying the prototype to real situations. In addition, we statistically analyze the results of the simulation, in order to detect the components which incorporate more variability into the system.

Keywords: Measurement, Data Stream Processing, C-INCAMI, Statistical Analysis

1. Introduction

Nowadays, there are applications which perform customized processing of data sets generated in a continuous way, in response to queries and/or to adjust their behavior depending on the arrival of new data [1]. Examples of such applications are the vital signs monitoring of patients, the behavioral tracking of financial markets, and air traffic monitoring, among others. In such applications, the arrival of a new datum represents the arrival of a value (e.g. a cardiac frequency, a foreign currency rate, etc.) associated with a syntactical behavior. Frequently, these applications only analyze the number (value) itself without formal and semantic support, disregarding not only the measurement metadata, but also the context in which the phenomenon happens. Therefore, in order to understand the meaning of arriving data and then act accordingly, such applications must necessarily incorporate a logic layer, i.e. procedures and metadata, which transforms and/or interprets the data streams. Given the lack of a clear separation of concerns between the syntactic and semantic aspects of those current applications, very often an expert (e.g., for the outpatient monitoring system, the expert can be a doctor responsible for the monitoring) must intervene in order to interpret the situation. So we argue that, given the state of the art of IT in metadata and semantic processing, the intervention of experts should be minimized as long as the applications can perform the job.
Taking into account the semantic and formal basis for measurement and evaluation (M&E), the C-INCAMI (Context-Information Need, Concept model, Attribute, Metric and Indicator) conceptual framework establishes an ontology that includes the concepts and relationships necessary to specify data and metadata in any M&E project [2,3]. On the other hand, we have envisioned the need of integrating heterogeneous data streams with metadata based on the C-INCAMI framework, in order to allow a more consistent and richer analysis of data sets (measures). As a result, the Strategy for Data Stream Processing based on Measurement Metadata (SDSPbMM) [4,5] was developed.
The main aim of SDSPbMM is to fill the gap between the integration of heterogeneous data sources; the incorporation and processing of metadata for attributes, contextual properties, metrics (for measurement) and indicators (for evaluation); and on-line classifiers that support decision-making processes in a more robust way.

Thus, by using the SDSPbMM approach for the above-mentioned applications, in this paper we particularly present the foundations for developing the outpatient monitoring scenario. We also performed a simulation using a prototype developed for this scenario, in order to empirically analyze processing times and variable scalability, which sheds light on the feasibility of applying the prototype to real situations. To this end, statistical techniques such as descriptive analysis, correlation analysis and principal component analysis are used.

The contributions of this work are manifold: 1) related to metrics: the detection of deviations of metrics' values with respect to their formal definitions, and the identification of missing values and outliers; 2) related to the set of measures: the instant detection of correlations, the identification of variability factors of the system, and the detection of trends in data streams, considering also the contextual situation; and 3) related to the empirical study: the validation of the implemented prototype on a specific domain scenario, i.e. outpatient monitoring, which allows us to determine the feasibility of applying it in real situations.

The quoted contributions represent a step further with regard to our previous works [4,5], because the enhanced prototype now incorporates the on-line classifiers which support proactive decision making, and their interaction with statistical techniques.

Following this introduction, Section 2 points out the main motivation and provides an overview of the C-INCAMI framework and the SDSPbMM approach. Section 3 illustrates the outpatient monitoring scenario, and Section 4 discusses the planning of the study and the analysis of results related to the simulation. Section 5 analyzes the contributions of this research with regard to related work and, finally, Section 6 draws the main conclusions and outlines future work.

2. Fundamentals of SDSPbMM

2.1. Motivation

The SDSPbMM [4] approach proposes a flexible framework in which co-operative processes and components are specialized for data stream management, with the ultimate aim of enabling proactive decision making. In this sense, SDSPbMM allows the automation of data collection and adaptation processes, supporting also the incorporation of heterogeneous data sources; of correction and analysis processes, supporting the early detection of problems typical of data such as missing values, outliers, etc.; and of on-line decision-making processes based on the formal definitions of M&E projects and on current/updated classifiers (see Figure 2, which depicts a view of the SDSPbMM approach).

For instance, to deal with the detection, correction and analysis processes, our proposal uses, in on-line form, statistical techniques such as descriptive analysis, correlation analysis and principal component analysis. In addition to these, other statistical techniques are used to initially validate the work. In a nutshell, we performed a simulation using a prototype developed for the outpatient monitoring scenario, in order to empirically analyze processing times and variable scalability, which sheds light on the feasibility of applying the prototype in real situations.
Before going through the simulation and statistical analysis issues, it is necessary to illustrate the main aspects of the C-INCAMI framework, which is a key part of the SDSPbMM approach.

2.2. C-INCAMI Overview

C-INCAMI is a conceptual framework [2,3] which defines the concepts and their related components for the M&E area in software organizations. It provides a domain (ontological) model defining all the terms, properties, and relationships needed to design and implement M&E processes. It is an approach in which the requirements specification, M&E design, and analysis of results are designed to satisfy a specific information need in a given context. In C-INCAMI, concepts and relationships are meant to be used along all the M&E activities. This way, a common understanding of data and metadata is shared among the organization's projects, leading to more consistent analysis and results across projects.

The SDSPbMM approach totally reuses the C-INCAMI conceptual base, in order to obtain repeatable and consistent data stream processing where the raw data usually come from data sources such as sensors. While C-INCAMI was initially developed for software applications, the involved concepts such as metric, measurement method, scale, scale type, indicator, elementary function, decision criterion, etc., are semantically the same when applied to other domains, as is the case when applied to the outpatient monitoring system for the healthcare domain.

The C-INCAMI framework is structured in six components, namely: 1) M&E project definition; 2) Nonfunctional requirements specification; 3) Context specification; 4) Measurement design and implementation; 5) Evaluation design and implementation; and 6) Analysis and recommendation specification. The components are supported by the ontological terms defined in [3].

The M&E project definition component (not shown in Figure 1) defines and relates a set of project concepts needed to deal with M&E activities, roles and artifacts. It allows defining the terms for a requirements project and its associated measurement and evaluation sub-projects.
Figure 1. C-INCAMI main concepts and relationships for the nonfunctional requirements specification, context specification, measurement design and implementation, and evaluation design and implementation components.

Figure 2. Conceptual schema for the strategy for data stream processing based on measurement metadata.
The Nonfunctional requirements specification component (requirements package in Figure 1) allows specifying the Information Need of any M&E project. The information need identifies the purpose (e.g. "understand", "predict", "monitor", etc.) and the user viewpoint (e.g. "patient", "final user", etc.); in turn, it focuses on a Calculable Concept (e.g. software quality, quality of vital signs, etc.) and specifies the Entity Category to evaluate (e.g. a resource, a product, etc.). A calculable concept can be defined as an abstract relationship between the attributes of an entity and a given information need. It can be represented by a Concept Model, where the leaves of an instantiated model are Attributes. Attributes can be measured by metrics.

For the Context specification component (context package in Figure 1), one key concept is Context, which represents the relevant state of the situation of the entity to be assessed with regard to the information need. We consider Context as a special kind of Entity in which related relevant entities are involved. To describe the context, attributes of the relevant entities are used; these are also Attributes, called Context Properties (see [2] for details).

The Measurement design and implementation component (measurement package in Figure 1) includes the concepts and relationships intended to specify the measurement design and implementation. Regarding measurement design, a Metric provides a Measurement specification of how to quantify a particular attribute of an entity, using a particular Method, and of how to represent its values, using a particular Scale. The properties of the measured values in the scale, with regard to the allowed mathematical and statistical operations and analyses, are given by the scale Type.

Two types of metrics are distinguished. A Direct Metric is one whose values are obtained directly from measuring the corresponding entity's attribute, by using a Measurement Method. On the other hand, an Indirect Metric's value is calculated from other direct metrics' values, following a function specification and a particular Calculation Method. For measurement implementation, a Measurement specifies the activity of using a particular metric description in order to produce a Measure value. Other associated metadata are the data collector name and the timestamp at which the measurement was performed.

The Evaluation design and implementation component (evaluation package in Figure 1) includes the concepts and relationships intended to specify the evaluation design and implementation. It is worth mentioning that the selected metrics are useful for a measurement process as long as the selected indicators are useful for an evaluation process in order to interpret the stated information need. Indicator is the main term, and there are two types of indicators. First, an Elementary Indicator evaluates attributes combined in a concept model. Each elementary indicator has an Elementary Model that provides a mapping function from the metric's measures (the domain) to the indicator's scale (the range). The new scale is interpreted using agreed decision criteria, which help analyze the level of satisfaction reached by each elementary nonfunctional requirement, i.e. by each attribute.
Second, a Partial/Global Indicator evaluates mid-level and higher-level requirements, i.e. sub-characteristics and characteristics in a concept model. Different aggregation models (Global Model) can be used to perform evaluations. The global indicator's value ultimately represents the global degree of satisfaction in meeting the stated information need for a given purpose and user viewpoint. As for the implementation, an Evaluation represents the activity involving a single calculation, following a particular indicator specification (either elementary or global) and producing an Indicator Value.

The Analysis and recommendation specification component (not shown in Figure 1) includes concepts and relationships dealing with analysis design and implementation, as well as with conclusions and recommendations. Analysis and recommendation use information coming from each M&E project, which includes requirements, context, measurement and evaluation data and metadata. By processing all this information and by using different kinds of statistical techniques and visualization tools, stakeholders can analyze the assessed entities' strengths and weaknesses with regard to the established information needs, and justify recommendations and, ultimately, decision making in a consistent way.

Considering the SDSPbMM strategy and its developed prototype, the streams coming from data sources (usually sensors) are structured by incorporating into the measures the metadata based on C-INCAMI, such as the entity being measured, the attribute and its corresponding metric, and the trace group, among others. For a given data stream, not only the measures associated with metrics of attributes are tagged, but also the measures associated with contextual properties.

Since each M&E project specification is based on C-INCAMI, the processing of tagged data streams is aligned with the project objective and information need, thus allowing traceability and consistency by supporting a clear separation of concerns. For instance, for a given project (more than one can be running at the same time), it is easy to identify whether a measure comes from an attribute or from a contextual property, and also its associated scale type and unit. Therefore, the statistical analysis benefits, because each measure can be verified for consistency against its formal (metric) definition.
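To make the role of this metadata concrete, the following minimal Java sketch (our illustration, not the actual prototype code; the class and field names are assumptions) shows how a direct metric definition retrieved from the C-INCAMI DB could be used to verify an incoming tagged measure against its formal definition:

```java
// Hypothetical, simplified view of a C-INCAMI direct metric definition.
public class DirectMetric {
    private final int id;            // e.g. 2 for "Value of Cardiac Frequency"
    private final String attribute;  // quantified attribute of the entity
    private final String unit;       // e.g. "beats/min"
    private final String scaleType;  // e.g. "INTERVAL"; governs admissible statistics
    private final double lowerBound; // numeric range admitted by the scale
    private final double upperBound;

    public DirectMetric(int id, String attribute, String unit,
                        String scaleType, double lowerBound, double upperBound) {
        this.id = id;
        this.attribute = attribute;
        this.unit = unit;
        this.scaleType = scaleType;
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    /** A measure is consistent if it falls within the scale defined for the metric. */
    public boolean isConsistent(double measuredValue) {
        return measuredValue >= lowerBound && measuredValue <= upperBound;
    }

    public int getId() {
        return id;
    }
}
```

Under such a scheme, a measure tagged with its metric ID can be matched against its definition before analysis, so that an out-of-scale value is flagged as a data problem rather than silently skewing the statistics.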
2.3. SDSPbMM Overview

The underlying idea of the SDSPbMM approach [4] is depicted in Figure 2. Briefly, the measurement stream is reported by each heterogeneous data source to the measurement adapter (MA). The MA incorporates the metadata (e.g. metric ID, context property ID, etc.) associated with each data source into the stream, in order to transmit the measurements to the gathering function (GF). Such measurements are organized in GF by their metadata and then sent to the Analysis & Smoothing Function (ASF). ASF performs a set of statistical analyses on the stream in order to detect deviations or problems with the data, considering its formal definition (as per the C-INCAMI DB). In turn, the incremental classifiers (i.e. the current and updated classifiers) analyze the arriving measurements and act accordingly, triggering alarms in case a risk situation arises.

SDSPbMM is made up of the following processes: 1) Data Collecting and Adapting Processes; 2) Correction and Analysis Processes; and 3) Decision-Making Processes, which are summarized below.

2.3.1. Data Collecting and Adapting Processes

The data collecting and adapting processes deal with how to adapt different measurement devices to collect measures and then communicate them to the correction and analysis processes. The main components (see Figure 2) are the data sources, the measurement adapters and the gathering function.

In short, measures are generated in the heterogeneous data sources and sent continuously to the MA. The MA can usually be embedded in mobile devices, but it can also be embedded in any computing device associated with data sources. It joins the measured values to the corresponding M&E project metadata, in turn sending both to the GF.

GF introduces the streams into a buffer (see Figure 3) organized by trace groups, a flexible way to group data sources established dynamically by the M&E project director. This organization allows consistent statistical analysis at trace group level, without representing an additional processing load. Within each trace group, as shown in Figure 3, the organization of measurements is tracked by metric. This fosters consistent analysis among different attributes (e.g. axillary temperature, cardiac frequency, etc.), which are monitored by a given trace group for a particular patient. Also, homogeneous comparisons of attributes can be made across different trace groups (or patients).

Figure 3. A view of the multilevel buffer.

Moreover, GF incorporates load shedding techniques [6], which allow managing the queue of services associated with measurements, thus mitigating overflow risks regardless of how the measurements are grouped.
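A minimal sketch of the multilevel buffer of Figure 3 is shown below. The two-level organization (trace group first, then metric) follows the description above, while the concrete types and method names are our assumptions:

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Buffer organized first by trace group (e.g. a patient) and then by metric. */
public class MultilevelBuffer {
    // traceGroupId -> (metricId -> queue of measured values)
    private final Map<String, Map<Integer, Queue<Double>>> buffer =
            new ConcurrentHashMap<>();

    /** Adds an incoming measure under its trace group and metric. */
    public void add(String traceGroupId, int metricId, double value) {
        buffer.computeIfAbsent(traceGroupId, tg -> new ConcurrentHashMap<>())
              .computeIfAbsent(metricId, m -> new ConcurrentLinkedQueue<>())
              .add(value);
    }

    /** Measures of one metric within one trace group, e.g. for per-patient analysis. */
    public Queue<Double> measuresOf(String traceGroupId, int metricId) {
        return buffer.getOrDefault(traceGroupId, Map.of())
                     .getOrDefault(metricId, new ConcurrentLinkedQueue<>());
    }
}
```

This organization is what enables both analyses within a trace group (e.g. correlating two attributes of one patient) and homogeneous comparisons of the same metric across trace groups.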
2.3.2. Correction and Analysis Processes

The correction process is based on statistical techniques where the data and their associated metadata allow a richer (semantic) analysis. The semantics lie in the formal definition of each M&E project with regard to the C-INCAMI conceptual framework (introduced in sub-Section 2.2). It is important to remark that the formal definition of each project is made by experts. In this way, such a definition becomes a reference pattern used to determine whether a particular measure (value) is coherent and consistent with regard to its associated metric specification.

Once the measures are organized in the buffer, the SDSPbMM prototype applies descriptive, correlation and principal component analyses. These techniques allow detecting inconsistent situations, trends and correlations, and/or identifying the system components that incorporate more variability. If such a situation is detected in ASF (see Figure 2), a statistical alarm is triggered to the decision maker (DM) component in order to evaluate whether it is necessary to send an external alarm (via e-mail, SMS, etc.) reporting the situation to the medical staff.

2.3.3. Decision-Making Processes

Once the statistical analysis has been performed, the unified streams are communicated to the current classifier (CC) component, which classifies the measurements to decide whether they correspond to a risk situation or not, and reports such decision to DM accordingly. Simultaneously, CC is regenerated by incorporating the unified streams into the training measure set, producing a new model named Updated Classifier (UC) in Figure 2.

Later, the UC classifies the unified streams and produces an updated decision, notifying DM. Ultimately, DM evaluates whether both decisions (from CC and UC) correspond to a risk situation, and its probability of occurrence. Finally, regardless of the decision selected by DM, the UC becomes the CC, replacing the previous one (see the adjust model arrow in Figure 2), only if an improvement in the classification capacity exists according to the adjustment model based on ROC (Receiver Operating Characteristic) curves [7].
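The replacement policy can be sketched as follows. The comparison by area under the ROC curve is our reading of the adjustment model based on [7]; the interfaces and names are illustrative, not the prototype's actual API:

```java
/** Illustrative sketch of the decision maker's model-adjustment step. */
public class DecisionMaker {
    interface Classifier {
        boolean isRiskSituation(double[] unifiedStream);
        double areaUnderRoc(); // classification capacity estimated on labeled data
    }

    private Classifier current; // CC in Figure 2

    public DecisionMaker(Classifier initial) {
        this.current = initial;
    }

    /** Evaluates both decisions and keeps the better model (the adjust model arrow). */
    public void evaluate(Classifier updated, double[] unifiedStream) {
        boolean riskByCurrent = current.isRiskSituation(unifiedStream);
        boolean riskByUpdated = updated.isRiskSituation(unifiedStream);
        if (riskByCurrent || riskByUpdated) {
            triggerExternalAlarm(); // e.g. e-mail or SMS to the medical staff
        }
        // UC replaces CC only if it improves the classification capacity.
        if (updated.areaUnderRoc() > current.areaUnderRoc()) {
            current = updated;
        }
    }

    private void triggerExternalAlarm() {
        // Notification channel omitted in this sketch.
    }
}
```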
2.3.4. Contribution of Metadata to the Measurement Process

In this subsection, the added value of metadata for data interoperability, consistency and processability is addressed. Recall that measures are sent from the heterogeneous data sources to the GF component through MA. When MA receives data streams from each data source, it incorporates the metadata into a common stream (independently of whether the measures come from several data sources) and transmits it to the GF component by means of C-INCAMI/MIS (Measurement Interchange Schema) [4]. Thus, prior to sending measures, each data source must configure just once each metric that quantifies each attribute (e.g. the cardiac frequency attribute) of the entity under assessment (e.g. an outpatient), and likewise the included contextual properties (e.g. environmental temperature) of the situation. This allows MA to be aware of how such metadata should be embedded into the stream. Hence, C-INCAMI/MIS is a schema, based on the C-INCAMI conceptual base discussed in sub-Section 2.2, which copes with interoperability issues in the provision of data from heterogeneous devices and with their further organization.

In Figure 4, an annotated schema of a C-INCAMI/MIS stream is presented. For each sent stream, MA adds to the raw data (e.g. the value 80) the structure of the C-INCAMI/MIS schema, indicating the correspondence of each measure with each attribute and contextual property. For instance, in the message of Figure 4, IDEntity = 1 represents the outpatient entity, IDMetric = 2 the metric value of cardiac frequency, and IDProperty = 5 the metric value of the environmental humidity percentage at the patient location (a contextual property). Thus, the metadata in the message includes a set of information which allows keeping a link between a measure value and the origin of the data: the data source, the metric and entity IDs, among others. This information increases the consistency of the processing model for each M&E project definition.

Figure 4. Annotated XML (Extensible Markup Language) schema of a C-INCAMI/MIS stream.
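To make the message structure concrete, a hypothetical C-INCAMI/MIS fragment corresponding to the example of Figure 4 could look as follows. The element and attribute names are illustrative only; the normative schema is defined in [4]:

```xml
<!-- Illustrative fragment; these element names are not the normative schema -->
<measurementStream IDEntity="1" traceGroup="outpatient-01">
  <!-- Measure of an attribute: cardiac frequency of the outpatient -->
  <measurement IDMetric="2" dataSource="hr-sensor-01"
               timestamp="2011-10-25T10:15:00">
    <measure>80</measure>
  </measurement>
  <!-- Measure of a contextual property: environmental humidity (%) at the
       patient location; the value shown is a placeholder -->
  <contextMeasurement IDProperty="5" dataSource="env-sensor-01"
                      timestamp="2011-10-25T10:15:00">
    <measure>47</measure>
  </contextMeasurement>
</measurementStream>
```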
Let us suppose, for example, that a value of 80 associated with the cardiac frequency of an outpatient arrived; then the following basic questions can be raised: What does it represent? Which unit of measure does it have? Which mathematical and statistical properties does the value have regarding its scale type? Is it good or bad? What is good and what is bad, i.e. what are the decision criteria? Could any software process the measure?

Therefore, if the stream metadata were not available, many questions such as these could not be answered in a consistent way. Moreover, the processability of the measures can be hampered and the analysis skewed.

3. Outpatient Monitoring Scenario

In this section, we illustrate the formal definition of an M&E project for outpatient monitoring, as well as some aspects of its implementation. In the M&E project definition, the knowledge of experts (e.g. doctors) is a valuable asset.

3.1. Definition

The present scenario aims at illustrating the SDSPbMM approach. The underlying hypothesis is that the doctors of a healthcare centre could avoid adverse reactions and major damage to the health of patients (particularly, outpatients) if they had continuous monitoring over them. That is to say, doctors should have a mechanism by which they can be informed about unexpected variations and/or inconsistencies in health indicators defined by them (as experts). So, the core idea is that there exists some proactive mechanism, based on health metrics and indicators, that produces an on-line report (alarm) for each risk situation associated with the outpatient under monitoring.

Considering C-INCAMI, the information need is "to monitor the principal vital signs of an outpatient when he/she is given medical discharge from the healthcare centre". The entity under analysis is the outpatient. According to medical experts, the corporal temperature, the systolic arterial pressure (maximum), the diastolic arterial pressure (minimum) and the cardiac frequency represent the relevant attributes of the outpatient vital signs to monitor. They also consider it necessary to monitor the environmental temperature, environmental pressure, humidity, and patient position (i.e. latitude and longitude) contextual properties. The definition of the information need, the entity, its associated attributes and the context are part of the "Nonfunctional requirements specification" and "Context specification" components, as discussed in sub-Section 2.2.

The quantification of attributes and contextual properties is performed by metrics, as shown in the Measurement design and implementation component in Figure 1. For monitoring purposes, the metrics that quantify the cited attributes were selected from the C-INCAMI DB repository, and likewise the metrics that quantify the cited contextual properties. Figure 5 shows the specification of the metric for the environmental temperature contextual property.

After the set of metrics and contextual properties for outpatient monitoring has been selected, the corresponding elementary indicators for interpretation purposes (as discussed in sub-Section 2.2) also have to be selected by the experts. In this way, they have included the following elementary indicators: level of corporal temperature, level of pressure, level of cardiac frequency, and level of difference between the corporal and the environmental temperature. The concepts related to indicators are part of the Evaluation design and implementation component (see Figure 1).

Figure 6 shows the specification of the level of corporal temperature elementary indicator. For example, the different acceptability levels with their interpretations are shown, among other metadata. Besides, considering that the ranges of the acceptability levels (shown in Figure 6) are on a categorical scale (i.e. an ordinal scale type), the target variable for the mining function (classification) is also categorical. So both classifiers, CC and UC, act relying on the values of the given indicators and their acceptability levels.
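A minimal sketch of how this project definition could be instantiated is given below. The attribute and property names follow the scenario, while the classes and the metric/property IDs are our illustration (only ID 2 for cardiac frequency and ID 5 for environmental humidity are taken from the example of sub-Section 2.3.4):

```java
import java.util.List;

/** Illustrative instantiation of the outpatient M&E project definition. */
public class OutpatientProject {
    record Attribute(int metricId, String name) {}
    record ContextProperty(int propertyId, String name) {}

    static final String INFORMATION_NEED =
        "Monitor the principal vital signs of an outpatient after medical discharge";
    static final String ENTITY = "outpatient";

    // Attributes of the entity, quantified by metrics from the C-INCAMI DB.
    static final List<Attribute> ATTRIBUTES = List.of(
        new Attribute(1, "corporal temperature"),
        new Attribute(2, "cardiac frequency"),
        new Attribute(3, "systolic arterial pressure"),
        new Attribute(4, "diastolic arterial pressure"));

    // Contextual properties of the situation, also quantified by metrics.
    static final List<ContextProperty> CONTEXT_PROPERTIES = List.of(
        new ContextProperty(5, "environmental humidity"),
        new ContextProperty(6, "environmental temperature"),
        new ContextProperty(7, "environmental pressure"),
        new ContextProperty(8, "patient position (latitude/longitude)"));
}
```

Note that the four attributes plus the four contextual properties give the 8 metrics per patient mentioned later in the simulation (sub-Section 4.2).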
3.2. Implementation Issues

Once all the above project information has been established, it is necessary for implementation purposes to choose a concrete architecture to deploy the system. Figure 7 depicts an abridged deployment view for the outpatient monitoring system considering the SDSPbMM approach.

Let us suppose we install and set up the MA in a mobile device (the outpatient device), which will work in conjunction with sensors as shown in Figure 7.

Figure 5. Metric definition for the environmental temperature contextual property.
Figure 6. Details of the level of corporal temperature elementary indicator specification.

Therefore, while the data collecting and adapting processes are implemented in the mobile device by the MA, the gathering function and the other processes can reside in the healthcare center computer. The MA component, using web services, reports the measures (streams) to the gathering function (GF) in an asynchronous and continuous way. MA takes the measures from the sensors (the data sources) and incorporates the associated metric metadata for attributes and contextual properties accordingly. For instance, it incorporates the contextual property ID for the environmental temperature (VTAPT, in Figure 5) together with the value to transmit, and so on for every attribute and contextual property. Note that data (values) and metadata are transmitted through the C-INCAMI/MIS schema to the gathering function (GF), as discussed in sub-Section 2.3.4.

Figure 7. A deployment view for the Outpatient Monitoring System.

When the gathering function receives measures from several outpatients under monitoring, it arranges them, for instance, by patient (i.e. the trace group) and transmits them to the analysis and correction processes. As discussed in sub-Section 2.3.2, ASF mainly solves typical problems of data such as missing values and noise, among others. For example, and thanks to the metadata, if ASF receives a zero value for the Value of Axillary Temperature metric, the processing model identifies an error by the metric definition, because the scale is numeric (interval scale type), continuous, and defined over the positive real numbers.
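Reusing the DirectMetric sketch from the illustration in sub-Section 2.2, the check that ASF would make on such a zero reading can be pictured as follows (the bounds are hypothetical values for an axillary temperature scale; the real ones come from the metric definition stored in the C-INCAMI DB):

```java
public class AsfCheckExample {
    public static void main(String[] args) {
        // Hypothetical scale bounds, in degrees Celsius, for illustration only.
        DirectMetric axillaryTemp = new DirectMetric(
                1, "axillary temperature", "Celsius", "INTERVAL", 30.0, 45.0);

        double received = 0.0; // zero value arriving in the stream
        if (!axillaryTemp.isConsistent(received)) {
            // Treated as a data problem (e.g. a sensor failure), not a clinical
            // reading, so ASF can raise a statistical alarm to the DM component.
            System.out.println("Inconsistent measure for metric "
                    + axillaryTemp.getId() + ": " + received);
        }
    }
}
```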
Although all the values of metrics and contextual properties from the monitored outpatients are simultaneously received and analyzed, let us consider for a while, for illustration purposes, that the system only receives data for the axillary temperature attribute and the environmental temperature contextual property from one outpatient, and that the system also visualizes them. As depicted in Figure 8, the lower and upper limits defined for the level of corporal (axillary) temperature indicator, together with the evolution of the environmental temperature and the axillary temperature, can be tracked. The measures and, ultimately, the acceptability level achieved by the level of corporal temperature elementary indicator (see Figure 8) indicate a normal situation for the patient.

Figure 8. Visualization of the evolution of axillary temperature versus environmental temperature measures.

Nevertheless, the on-line decision-making process, apart from analyzing whether the acceptability level is met for the attributes, also analyzes the interaction with contextual properties and their values. This analysis allows detecting a situation like the one exposed in Figures 9(a) and (b). At first glance, what seemed to be normal and evident was probably not, because the processing model proactively detected a correlation between the axillary temperature and the environmental temperature, as shown in Figure 9(b). This could trigger a preventive alarm from the healthcare centre to the doctors, because an increment in the environmental temperature can in turn drag an increment in the corporal temperature, and therefore this situation can be associated with a gradual raise in the risk probability for the outpatient.

Figure 9. (a) Correlation analysis for the axillary temperature versus the environmental temperature; (b) Correlation matrix.

4. Scenario Simulation

4.1. Goal

The developed prototype for the SDSPbMM approach implements functionalities (see Figure 2) ranging from the formal definition of the M&E project, including the C-INCAMI repository with metadata, the integration of heterogeneous data sources, trace groups and MA, to the classifiers for the on-line decision-making process. In addition, it implements the C-INCAMI/MIS schema for the interchange of measures in an interoperable way, and the multilevel buffer based on metadata (see Figure 3).

The prototype has been implemented in Java, using R [8] as the statistical calculus engine and the Rserve mechanism (from CRAN, the Comprehensive R Archive Network) to access the R engine over TCP/IP from the streaming application, without requiring persistence and prioritizing direct communication.
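As a sketch of this Java-to-R link, the fragment below delegates a Pearson correlation to R through the standard Rserve client API (the calls shown exist in the Rserve Java client; the variable names and sample values are ours, and the prototype's actual wiring is not published):

```java
import org.rosuda.REngine.REXP;
import org.rosuda.REngine.Rserve.RConnection;

/** Minimal example of delegating a correlation analysis to R through Rserve. */
public class RserveCorrelationExample {
    public static void main(String[] args) throws Exception {
        RConnection r = new RConnection(); // connects to a local Rserve daemon over TCP/IP
        try {
            double[] axillary = {36.4, 36.6, 36.9, 37.1, 37.4};      // illustrative values
            double[] environmental = {24.0, 25.5, 27.0, 28.5, 30.0}; // illustrative values
            r.assign("axillary", axillary);
            r.assign("environmental", environmental);
            REXP result = r.eval("cor(axillary, environmental, method = \"pearson\")");
            System.out.println("Pearson correlation: " + result.asDouble());
        } finally {
            r.close();
        }
    }
}
```

This is the kind of call by which ASF can detect, on-line, a drag situation such as the one between axillary and environmental temperature discussed in sub-Section 3.2.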
The simulation goal is to determine the processing times involved in the outpatient scenario, as well as the variable scalability. The simulation allows us to analyze the feasibility of applying the prototype to real situations. Furthermore, we statistically discuss the results of the simulation, in order to detect the components which incorporate more variability into the system.

4.2. Simulation Planning and Execution

The simulation has been performed on the scenario illustrated in Section 3. The measurement data have been generated in a pseudo-random way considering two parameters: the quantity of metrics (in a simulation, each metric corresponds to a variable), and the quantity of measurements by variable. Each patient has 8 associated metrics, pertaining to attributes and contextual properties, as commented in sub-Section 3.1.

The simulation discretely varied the quantity of variables (metrics) in the data stream from 3 to 99, and the quantity of measurements by variable from 100 to 1000.
The idea of discretely varying the quantity of metrics, instead of doing it as a multiple of 8 (i.e. based on the 8 metrics associated with each patient), lies in analyzing the prototype behavior in the presence of missing values and the progressive reincorporation of measures into the stream.

The prototype, R and Rserve were run on a PC equipped with an AMD Athlon X2 64-bit processor, 3 GB of RAM, and Windows Vista Home Premium as the operating system.

For the simulation, the following variables, which are the target of measurement considering the stream as the entity under analysis, have been defined, namely:
- Startup: the time (in ms) needed to start up the analysis functions;
- AnDesc: the time (ms) needed to perform the descriptive analysis on the complete data stream;
- Cor: the time (ms) needed to perform the correlation analysis by trace group inside the complete data stream;
- Pca: the time (ms) needed to perform the principal component analysis by trace group inside the complete data stream;
- Total: the time (ms) needed to perform all the analyses on the complete data stream.

The simulation parameters used for the statistical analysis of results are represented by qVar, indicating the quantity of variables of the data stream, and by meds, indicating the quantity of measures by variable of the data stream. From now onwards, in order to simplify the reading of the statistical analysis, the parameters qVar and meds will be directly referred to as variables, and Startup, AnDesc, Cor, Pca and Total will be called variables as well.

From the simulation process standpoint, we have obtained 1390 measurements of the overall processing time in relation to the evolution of the quantity of variables and measurements. This allows us to statistically arrive at verifiable results that consequently help us validate the prototype in a controlled environment.
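Although the paper does not show the prototype's instrumentation, one plausible way to capture these per-analysis times is to wrap each analysis step with a wall-clock timer, as in the illustrative sketch below (whether Total equals the plain sum of the steps is our simplifying assumption):

```java
/** Illustrative timing wrapper for the per-analysis variables of the simulation. */
public class AnalysisTimer {
    /** Runs an analysis step and returns its elapsed wall-clock time in ms. */
    public static long timeMs(Runnable analysisStep) {
        long start = System.nanoTime();
        analysisStep.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long startup = timeMs(() -> { /* start up the analysis functions */ });
        long anDesc  = timeMs(() -> { /* descriptive analysis on the stream */ });
        long cor     = timeMs(() -> { /* correlation analysis by trace group */ });
        long pca     = timeMs(() -> { /* PCA by trace group */ });
        long total   = startup + anDesc + cor + pca; // simplifying assumption
        System.out.printf("Startup=%d AnDesc=%d Cor=%d Pca=%d Total=%d (ms)%n",
                startup, anDesc, cor, pca, total);
    }
}
```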
4.3. Analysis of Results

The chart in Figure 10(b) clearly shows how the evolution of the quantity of variables significantly affects the overall processing time of data streams, increasing according to the values shown in chart (a). Here, we can observe that the increment in processing time produced by the increase of measurements is extremely low in comparison with the one produced by the increase of variables. This latter aspect indicates that the load shedding mechanism really achieved the goal of avoiding overflows without affecting the stream processing time under variations of the stream volume. The incorporation of new variables, on the other hand, does have an influence because, besides increasing the stream volume, a new variable also interacts with the preexistent variables; this interaction is the cause of the main difference, in terms of processing time, with respect to the increase produced by measurements.

Figure 10. Two views of the evolution of the overall processing time (ms) against the evolution of the quantity of variables and measurements.

In both dispersion charts, (a) and (b), each point is represented with a color that is associated with the quantity of variables. This allows us to identify regions in the graph in a graceful way and to compare them from both perspectives. In chart (b), we can observe that the overall processing time keeps a linear relation with the quantity of variables.

Considering such a situation, and on the basis that the statistical analyzer (ASF in Figure 2) performs a series of analyses on the data stream, we have studied the incidence of each analysis on the overall processing time, in order to detect which of them are more critical in temporal terms.

The Pearson's correlation matrix shown in Figure 11(a) would confirm, firstly, the linear relationship indicated between the quantity of variables (qVar) and the overall processing time of the data stream (Total), given the coefficient value of 0.95. Secondly, it can be concluded that the overall processing time would keep a strong linear relationship with the time of the descriptive analysis, with a coefficient of 0.99, followed by the time of Pca with 0.90 and of Cor with 0.89, respectively.

Figure 11. (a) Pearson's correlation matrix; (b) Matrix of autovalues; and (c) Matrix of autovectors associated with the principal component analysis (PCA).

The resulting matrices of the principal component analysis, shown in Figures 11(b) and (c), reveal which of the variables provide more variability to the system. Thus, the first autovalue (row 1, Figure 11(b)) explains 66% of the variability of the system. Also, if we look at its composition in the matrix of autovectors (col. e1, Figure 11(c)), the variables that contribute the most, in absolute terms, are AnDesc, Cor, Pca and qVar.
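For reference, the proportion of variability attributed to each principal component follows directly from the autovalues (eigenvalues) of Figure 11(b); a standard formulation is:

```latex
% Proportion of variance explained by the k-th principal component,
% where \lambda_i are the eigenvalues (autovalues) of the correlation matrix:
\[
  p_k = \frac{\lambda_k}{\sum_{i=1}^{n}\lambda_i},
  \qquad
  P_m = \sum_{k=1}^{m} p_k .
\]
% In Figure 11(b): p_1 \approx 0.66, and cumulatively P_3 \approx 0.96,
% i.e. the first three components explain about 96% of the system variability.
```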
Therefore, if we were to replace the seven cited variables with the three new variables (e1 to e3), we would be explaining 96% of the variability of the system, where the main variables in terms of contribution are those associated with AnDesc, Cor, Pca and qVar. The system is only affected by 16% by the evolution of measurements, and by 14% by the startup time. This is an important aspect to remark, because the only variable external to the prototype, i.e. the volume of measurement arrivals, which cannot be controlled by it, represents just 16% and by no means constituted an overflow situation in the queue of services.

Lastly, taking into account the four variables that contributed the most to the system variability, three of them are part of the overall processing time. In this way, and using the box plot of Figure 12, we can corroborate that the most influential variable, in terms of its magnitude in the overall processing time, is AnDesc.

In addition, note that the biggest resulting time to process 99 variables (metrics) with 1000 measures each (i.e. 99,000 measures per stream in total) was 1092 ms. This outcome represents a satisfactory applicability threshold for the prototype, especially taking into account the basic hardware used. So, in our humble opinion, this could easily meet the response time requirements of the outpatient monitoring scenario.

Figure 12. Boxplot of the AnDesc, Cor, Pca and Total variables.

5. Related Work and Discussion

There is much research oriented to data stream processing from the syntactical point of view, in which continuous queries over data streams are posed in terms of attributes and their associated values using CQL (Continuous Query Language) [9]. This approach has been implemented in several projects such as Aurora & Borealis [10], STREAM [11], and TelegraphCQ [12], among others. Our approach (and prototype) includes the capability of incorporating metadata based on an M&E framework, which allows guiding the organization of data streams in the buffer; making consistent and comparable analysis possible from the statistical standpoint; and triggering alarms in a proactive way, based on several statistical analyses or on decisions stemming from the classifiers.

MavStream [13] is a prototype for a data stream management system which has the capability of processing complex events as an intrinsic aspect of data stream processing. In this sense, our prototype supports on-line data stream analysis with the incorporation of metadata into the measures (data), handling not only measure values coming from attributes of the assessed entity, but also those coming from contextual properties related to the situation of the entity. In addition, the SDSPbMM prototype can process measures with nondeterministic results, and perform analysis by trace group (or an overall analysis), which, in practical scenarios such as the monitoring of outpatients [4], represent crucial features.
Nile [14] is a data stream management system based on a conceptual framework for the detection and tracking of phenomena or situations supported by deterministic measures.
Our prototype, unlike Nile, allows the incorporation of heterogeneous data sources, embracing not only deterministic but also nondeterministic measures. On the other hand, Singh et al. [15] introduce a system architecture for a formal framework of data mining oriented to the situation, presented in [16]. This system is used in medical wireless applications, and shows how the architecture can be applied to several medical areas such as diabetes treatment and the risk monitoring of heart disease. In our humble opinion, this system neglects central issues for assuring repeatability and interoperability, because it lacks a clear specification of metrics (both for entity attributes and contextual properties), indicators, scales and scale types, among other metadata.

Lastly, Huang et al. [17] present an approach based on self-managed reports for the tracking of patients. Such reports are made up of a set of questionnaire items with numeric (scale) responses, which are filled in by patients at home. The patients' responses feed a classification model based on neural networks, in order to progressively improve the selection of the questionnaire items incorporated in the reports. Hence, they argue, this decreases the patients' response time and allows identifying those aspects that will foster an improvement in their quality of life. As with the proposal of Singh et al. [15,16], this approach says nothing about how to define metrics, indicators, scales, and so on.

Our strategy and its prototype support data stream processing in alignment with a conceptual base, i.e. the metric and indicator ontology [3], which guarantees not only syntactic but also semantic processing, in addition to interoperability and consistency.

6. Conclusions and Future Work

In this work, we have discussed how the presence of metadata based on the C-INCAMI M&E framework, linked to the measures in data streams, allows an organization of measurements which fosters consistency in statistical analysis, since the metadata specify not only the formal components of the data but also the associated context. Hence, it is possible to perform particular analyses at trace group level or, at a more general level, to compare values of metrics among different trace groups in order to identify, for example, deviations of measures from their formal definitions, the main system variability factors, and relations among variables.

In the outpatient monitoring scenario introduced in Section 3, we have shown in Figures 8 and 9 the relationships between the data/metadata of metrics (i.e. metrics that quantify contextual properties and attributes) and the statistical analysis, considering our data stream processing strategy. In this sense, even when the measures seemed to have normal values, the data and metadata of the metrics, in conjunction with the correlation analysis, allowed identifying a drag situation and then triggering alarms to prevent it. Moreover, such metadata allowed identifying variability factors and detecting trends in a consistent way, considering the contextual situation as well.
Using the developed prototype, which implements the SDSPbMM strategy, we have demonstrated from the statistical analysis of the simulation outcomes that the prototype is more susceptible, in terms of processing time, to an increase in the quantity of variables than to an increase in the quantity of measurements by variable. Using the principal component analysis technique, we have shown that the aspects that contribute the most to the system variability are those associated with the AnDesc, Cor, Pca and qVar variables, AnDesc being the one that defines the biggest proportion of the final processing time of data streams.

Taking into account that the implemented prototype ran on a system which is totally accessible in the market, we could establish as a benchmark that, to process 99,000 measurements (99 variables and 1000 measures/variable), the biggest time spent was 1092 ms. This is an important starting point, since now we can consistently evaluate several application scenarios against this benchmark. On the other hand, the effectiveness of the load shedding mechanism in the multilevel buffer was also proved statistically, showing that the evolution of the quantity of measurements compromises neither the prototype operation nor the final processing time of the data stream.

Although the present work is just a simulation of the outpatient monitoring scenario by means of a prototypical software application, we have initially shown that it can scale up to a real scenario. As future work, we plan to experimentally test our data stream processing strategy, enriched with context, measurement and evaluation metadata, on several scenarios, in order to statistically validate the initial benchmark obtained for the outpatient monitoring scenario.

7. Acknowledgements

This research is supported by the PICT 2188 project from the Science and Technology Agency and by the 09/F052 project from the UNLPam, Argentina.

REFERENCES

[1] N. Jain, J. Gehrke and H. Balakrishnan, "Towards a Streaming SQL Standard," Proceedings of the VLDB Endowment, Vol. 1, No. 2, 2008, pp. 1379-1390.

[2] H. Molina and L. Olsina, "Towards the Support of Contextual Information to a Measurement and Evaluation Framework," International Conference on Quality of Information and Communications Technology, Lisbon, 12-14 September 2007, pp. 154-166.
[3] L. Olsina, F. Papa and H. Molina, "How to Measure and Evaluate Web Applications in a Consistent Way," In: G. Rossi, O. Pastor, D. Schwabe and L. Olsina, Eds., Web Engineering: Modelling and Implementing Web Applications, Springer, London, 2008, pp. 385-420.

[4] M. Diván and L. Olsina, "Integrated Strategy for the Data Stream Processing: A Scenario of Use," Proceedings of the Iberoamerican Conference on Software Engineering (CIbSE), Medellín, 2009, pp. 374-387.

[5] M. Diván, L. Olsina and S. Gordillo, "Data Stream Processing Enriched with Measurement Metadata: A Statistical Analysis," Proceedings of the Iberoamerican Conference on Software Engineering (CIbSE), Rio de Janeiro, 2011, p. 29.

[6] M. Wei, E. Rundensteiner and M. Mani, "Utility-Driven Load Shedding for XML Stream Processing," International Conference on World Wide Web, Beijing, 21-25 April 2008, pp. 855-864.

[7] C. Marrocco, R. Duin and F. Tortorella, "Maximizing the Area under the ROC Curve by Pairwise Feature Combination," Pattern Recognition, Vol. 41, No. 6, 2008, pp. 1961-1974. doi:10.1016/j.patcog.2007.11.017

[8] R Foundation for Statistical Computing, "The R Foundation for Statistical Computing," 2010. http://www.r-project.org/foundation/

[9] S. Babu and J. Widom, "Continuous Queries over Data Streams," ACM SIGMOD Record, Vol. 30, No. 3, 2001, pp. 109-120. doi:10.1145/603867.603884

[10] D. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing and S. Zdonik, "The Design of the Borealis Stream Processing Engine," Conference on Innovative Data Systems Research (CIDR), Asilomar, 2005, pp. 277-289.

[11] The STREAM Group, "STREAM: The Stanford Stream Data Manager," Stanford, 2003.

[12] S. Krishnamurthy, S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Madden, F. Reiss and M. Shah, "TelegraphCQ: An Architectural Status Report," IEEE Data Engineering Bulletin, Vol. 26, No. 2, 2003, pp. 11-18.

[13] S. Chakravarthy and Q. Jiang, "Stream Data Processing: A Quality of Service Perspective," Springer, New York, 2009.

[14] M. Ali, W. Aref, R. Bose, A. Elmagarmid, A. Helal, I. Kamel and M. Mokbel, "NILE-PDT: A Phenomenon Detection and Tracking Framework for Data Stream Management Systems," International Conference on Very Large Data Bases (VLDB), Trondheim, 2005, pp. 1295-1298.

[15] S. Singh, P. Vajirkar and Y. Lee, "Context-Aware Data Mining Framework for Wireless Medical Application," LNCS, Springer, Vol. 2736, 2003, pp. 381-391.

[16] S. Singh, P. Vajirkar and Y. Lee, "Context-Based Data Mining Using Ontologies," LNCS, Springer, Vol. 2813, 2003, pp. 405-418.

[17] Y. Huang, H. Zheng, C. Nugent, P. McCullagh, N. Black, K. Vowles and L. McCracken, "Feature Selection and Classification in Supporting Report-Based Self-Management for People with Chronic Pain," IEEE Transactions on Information Technology in Biomedicine, Vol. 15, No. 1, 2011, pp. 54-61.