Journal of Software Engineering and Applications, 2011, 4, 653-665
doi:10.4236/jsea.2011.412077 Published Online December 2011 (http://www.SciRP.org/journal/jsea)
Copyright © 2011 SciRes. JSEA
Strategy for Data Stream Processing Based on
Measurement Metadata: An Outpatient
Monitoring Scenario
Mario Diván1,2, Luis Olsina2, Silvia Gordillo3
1Law and Economy School, Universidad Nacional de La Pampa, Santa Rosa, Argentina; 2Engineering School, National University of
La Pampa, General Pico, Argentina; 3LIFIA, Informatics School, National University of La Plata, La Plata, Argentina.
Email: mjdivan@eco.unlpam.edu.ar, olsinal@ing.unlpam.edu.ar, gordillo@lifia.info.unlp.edu.ar
Received October 25th, 2011; revised December 1st, 2011; accepted December 12th, 2011.
ABSTRACT
In this work we discuss SDSPbMM, an integrated Strategy for Data Stream Processing based on Measurement Metadata, applied to an outpatient monitoring scenario. The measures associated to the attributes of the patient (entity) under monitoring come from heterogeneous data sources as data streams, together with metadata associated with the formal definition of a measurement and evaluation project. Such metadata supports the patient analysis and monitoring in a more consistent way, facilitating for instance: 1) the early detection of problems typical of data, such as missing values and outliers, among others; and 2) the anticipation of risk by means of on-line classification models adapted to the patient. We also performed a simulation using a prototype developed for outpatient monitoring, in order to empirically analyze processing times and variable scalability, which sheds light on the feasibility of applying the prototype to real situations. In addition, we statistically analyze the results of the simulation, in order to detect the components which add more variability to the system.
Keywords: Measurement, Data Stream Processing, C-INCAMI, Statistical Analysis
1. Introduction
Nowadays, there are applications which perform customized processing of data sets, generated in a continuous way, in response to queries and/or to adjust their behavior depending on the arrival of new data [1]. Examples of these applications are, namely, vital signs monitoring of patients, behavioral tracking of financial markets, and air traffic monitoring, among others. In such applications, the arrival of a new datum represents the arrival of a value (e.g. a cardiac frequency, a foreign currency rate, etc.) associated to a syntactical behavior. Frequently, they only analyze the number (value) itself without formal and semantic support, disregarding not only the measurement metadata, but also the context in which the phenomenon happens. Therefore, in order to understand the meaning of arriving data and then act accordingly, such applications must necessarily incorporate a logic layer, i.e. procedures and metadata, which transform and/or interpret the data streams. Given the lack of a clear separation of concerns between the syntactic and semantic aspects of those current applications, very often an expert (e.g., for the outpatient monitoring system, a doctor responsible for the monitoring) must intervene in order to interpret the situation. So we argue that, given the state of the art of IT in metadata and semantic processing, the intervention of experts should be minimized as long as the applications can perform the job.
Taking into account the semantic and formal basis for measurement and evaluation (M&E), the C-INCAMI (Context-Information Need, Concept model, Attribute, Metric and Indicator) conceptual framework establishes an ontology that includes the concepts and relationships necessary to specify data and metadata in any M&E project [2,3]. On the other hand, we have envisioned the need of integrating heterogeneous data streams with metadata based on the C-INCAMI framework, in order to allow a more consistent and richer analysis of data sets (measures). As a result, the Strategy for Data Stream Processing based on Measurement Metadata (SDSPbMM) [4,5] was developed.
The main SDSPbMM aim is to fill the gap among the integration of heterogeneous data sources; the incorporation and processing of metadata for attributes, contextual properties, metrics (for measurement) and indicators (for evaluation); and on-line classifiers that support decision-making processes in a more robust way.
Thus, by using the SDSPbMM approach for the above-mentioned applications, in this paper we particularly present the foundations for developing the outpatient monitoring scenario. We also performed a simulation using a prototype developed for this scenario, in order to empirically analyze processing times and variable scalability, which sheds light on the feasibility of applying the prototype to real situations. To this end, statistical techniques such as descriptive analysis, correlation analysis and principal component analysis are used.
The contributions of this work are manifold: 1) related to metrics: the detection of deviations of metrics’ values with respect to their formal definitions, and the identification of missing values and outliers; 2) related to the set of measures: the instant detection of correlations; the identification of variability factors of the system; and the detection of trends on data streams, considering also the contextual situation; and 3) related to the empirical study: the validation of the implemented prototype on a specific domain scenario, i.e. outpatient monitoring, which allows us to determine the feasibility of applying it in real situations.
The quoted contributions represent a step further with regard to our previous works [4,5], because now the enhanced prototype incorporates the on-line classifiers which support proactive decision making, and their interaction with statistical techniques.
Following this introduction, Section 2 points out the main
motivation and provides an overview of the C-INCAMI
framework and the SDSPbMM approach. Section 3 illus-
trates the outpatient monitoring scenario, and Section 4
discusses the planning of the study and the analysis of
results related to the simulation. Section 5 analyzes the
contributions of this research regarding related work and,
finally, Section 6 draws the main conclusions and out-
lines future work.
2. Fundamentals of SDSPbMM
2.1. Motivation
The SDSPbMM [4] approach proposes a flexible framework in which co-operative processes and components are specialized for data stream management, with the ultimate aim of enabling proactive decision making. In this sense, SDSPbMM allows the automation of the data collection and adaptation processes, supporting also the incorporation of heterogeneous data sources; the correction and analysis processes, supporting the early detection of problems typical of data such as missing values, outliers, etc.; and the on-line decision-making processes, based on formal definitions of M&E projects and current/updated classifiers (see Figure 2, which depicts a view of the SDSPbMM approach).
For instance, to deal with the detection, correction and analysis processes, our proposal uses, in on-line form, statistical techniques such as descriptive analysis, correlation analysis and principal component analysis. In addition to these, other statistical techniques are used to initially validate the work. In a nutshell, we performed a simulation using a prototype developed for the outpatient monitoring scenario, in order to empirically analyze processing times and variable scalability, which sheds light on the feasibility of applying the prototype in real situations. Before going through the simulation and statistical analysis issues, it is necessary to illustrate the main aspects of the C-INCAMI framework, which is a key part of the SDSPbMM approach.
2.2. C-INCAMI Overview
C-INCAMI is a conceptual framework [2,3], which de-
fines the concepts and their related components for the
M&E area in software organizations. It provides a domain
(ontological) model defining all the terms, properties, and
relationships needed to design and implement M&E pro-
cesses. It is an approach in which the requirements speci-
fication, M&E design, and analysis of results are designed
to satisfy a specific information need in a given context.
In C-INCAMI, concepts and relationships are meant to be used along all the M&E activities. This way, a common understanding of data and metadata is shared among the organization’s projects, leading to more consistent analysis and results across projects.
The SDSPbMM approach totally reuses the C-INCAMI conceptual base, in order to obtain repeatable and consistent data stream processing where raw data usually come from data sources such as sensors. While C-INCAMI was initially developed for software applications, the involved concepts such as metric, measurement method, scale, scale type, indicator, elementary function, decision criterion, etc., are semantically the same when applied to other domains, as is the case when applied to the outpatient monitoring system in the healthcare domain.
The C-INCAMI framework is structured in six components, namely: 1) M&E project definition; 2) Nonfunctional requirements specification; 3) Context specification; 4) Measurement design and implementation; 5) Evaluation design and implementation; and 6) Analysis and recommendation specification. The components are supported by the ontological terms defined in [3].
The M&E project definition component (not shown in Figure 1) defines and relates a set of project concepts needed to deal with M&E activities, roles and artifacts. It allows defining the terms for a requirements project and its associated measurement and evaluation sub-projects.
Figure 1. C-INCAMI main concepts and relationships for nonfunctional requirements specification, context specification,
measurement design and implementation, and evaluation design and implementation components.
Figure 2. Conceptual schema for the strategy for data stream processing based on measurement metadata.
The Nonfunctional requirements specification component (requirements package in Figure 1) allows specifying the Information Need of any M&E project. The information need identifies the purpose (e.g. “understand”, “predict”, “monitor”, etc.) and the user viewpoint (e.g. “patient”, “final user”, etc.); in turn, it focuses on a Calculable Concept—e.g. software quality, quality of vital signs, etc.—and specifies the Entity Category to evaluate—e.g. a resource, a product, etc. A calculable concept can be defined as an abstract relationship between attributes of an entity and a given information need. It can be represented by a Concept Model, where the leaves of an instantiated model are Attributes. Attributes can be measured by metrics.
For the Context specification component (context package in Figure 1), one key concept is Context, which represents the relevant state of the situation of the entity to be assessed with regard to the information need. We consider Context as a special kind of Entity in which related relevant entities are involved. To describe the context, attributes of the relevant entities are used—these are also Attributes, called Context Properties (see [2] for details).
The Measurement design and implementation component (measurement package in Figure 1) includes the concepts and relationships intended to specify the measurement design and implementation. Regarding measurement design, a Metric provides a Measurement specification of how to quantify a particular attribute of an entity, using a particular Method, and how to represent its values, using a particular Scale. The properties of the measured values in the scale, with regard to the allowed mathematical and statistical operations and analysis, are given by the Scale Type.
Two types of metrics are distinguished. A Direct Metric is one for which values are obtained directly from measuring the corresponding entity’s attribute, by using a Measurement Method. On the other hand, an Indirect Metric value is calculated from other direct metrics’ values following a function specification and a particular Calculation Method. For measurement implementation, a Measurement specifies the activity by using a particular metric description in order to produce a Measure value. Other associated metadata are the data collector name and the timestamp at which the measurement was performed.
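As an illustration, such a measure with its implementation metadata could be sketched in Java (the language of our prototype) as follows; this is a minimal sketch, and the class and field names are our own assumptions rather than part of the C-INCAMI specification:

    // Minimal sketch of a Measure value plus its measurement metadata.
    // Field names are illustrative assumptions based on the concepts above.
    public class Measure {
        private final double value;          // the produced Measure value
        private final String metricId;       // metric used to quantify the attribute
        private final String dataCollector;  // name of the data collector
        private final long timestamp;        // when the measurement was performed

        public Measure(double value, String metricId,
                       String dataCollector, long timestamp) {
            this.value = value;
            this.metricId = metricId;
            this.dataCollector = dataCollector;
            this.timestamp = timestamp;
        }
        // getters omitted for brevity
    }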
The Evaluation design and implementation component (evaluation package in Figure 1) includes the concepts and relationships intended to specify the evaluation design and implementation. It is worth mentioning that the selected metrics are useful for a measurement process as long as the selected indicators are useful for an evaluation process in order to interpret the stated information need. Indicator is the main term, and there are two types of indicators. First, an Elementary Indicator evaluates attributes combined in a concept model. Each elementary indicator has an Elementary Model that provides a mapping function from the metric’s measures (the domain) to the indicator’s scale (the range). The new scale is interpreted using agreed decision criteria, which help analyze the level of satisfaction reached by each elementary nonfunctional requirement, i.e. by each attribute. Second, a Partial/Global Indicator evaluates mid-level and higher-level requirements, i.e. sub-characteristics and characteristics in a concept model. Different aggregation models (Global Model) can be used to perform evaluations. The global indicator’s value ultimately represents the global degree of satisfaction in meeting the stated information need for a given purpose and user viewpoint. As for the implementation, an Evaluation represents the activity involving a single calculation, following a particular indicator specification—either elementary or global—producing an Indicator Value.
The Analysis and recommendation specification component (not shown in Figure 1) includes concepts and relationships dealing with analysis design and implementation, as well as with conclusions and recommendations. Analysis and recommendation use information coming from each M&E project, which includes requirements, context, measurement and evaluation data and metadata. By processing all this information and by using different kinds of statistical techniques and visualization tools, stakeholders can analyze the assessed entities’ strengths and weaknesses with regard to established information needs, and justify recommendations and, ultimately, decision making in a consistent way.
Considering the SDSPbMM strategy and its developed prototype, streams coming from data sources (usually sensors) are structured by incorporating into the measures the metadata based on C-INCAMI, such as the entity being measured, the attribute and its corresponding metric, and the trace group, among others. For a given data stream, not only measures associated to metrics of attributes are tagged, but also measures associated to contextual properties.
Since each M&E project specification is based on C-INCAMI, the processing of tagged data streams is then in alignment with the project objective and information need, thus allowing traceability and consistency by supporting a clear separation of concerns. For instance, for a given project—more than one can be running at the same time—it is easy to identify whether a measure is coming from an attribute or from a contextual property, and also its associated scale type and unit. Therefore, the statistical analysis benefits because each measure can be verified for consistency against its formal (metric) definition.
2.3. SDSPbMM Overview
The underlying idea of the SDSPbMM approach [4] is
depicted in Figure 2.
Briefly, the measurement stream is reported by each heterogeneous data source to the measurement adapter (MA). The MA incorporates the metadata (e.g. metric ID, context property ID, etc.) associated to each data source into the stream, in order to transmit measurements to the gathering function (GF). Such measurements are organized in the GF by their metadata and then sent to the Analysis & Smoothing Function (ASF). The ASF performs a set of statistical analyses on the stream in order to detect deviations or problems with data, considering its formal definition (as per the C-INCAMI DB). In turn, the incremental classifiers (i.e. the current and updated classifiers) analyze the arriving measurements and act accordingly, triggering alarms in case a risk situation arises.
SDSPbMM is made up of the following processes: 1)
Data Collecting and Adapting Processes; 2) Correction
and Analysis Processes; and 3) Decision-Making Proc-
esses, which are summarized below.
2.3.1. Data Collecting and Adapting Processes
The data collecting and adapting processes deal with how to adapt different measurement devices to collect measures and then communicate them to the correction and analysis processes. The main components (see Figure 2) are the data sources, the measurement adapters and the gathering function.
In short, measures are generated in the heterogeneous data sources, and sent continuously to the MA. The MA can usually be embedded in mobile devices, but it can also be embedded in any computing device associated to data sources. It incorporates the measured values together with the respective M&E project metadata, sending in turn both to the GF.
The GF introduces streams into a buffer (see Figure 3) organized by trace groups—a flexible way to group data sources, established dynamically by the M&E project director. This organization allows consistent statistical analysis at the trace group level, without representing an additional processing load. Within each trace group, as shown in Figure 3, the organization of measurements is tracked by metric. This fosters a consistent analysis among different attributes (e.g. axillary temperature, cardiac frequency, etc.), which are monitored by a given trace group for a particular patient. Also, homogeneous comparisons of attributes can be made for different trace groups (or patients).
Figure 3. A view of the multilevel buffer.
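A minimal Java sketch of such a multilevel buffer, assuming a Measure class like the one sketched earlier and plain String identifiers (the prototype’s actual structures are not published in this form):

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Multilevel buffer: trace group -> metric -> measures.
    // This mirrors the organization of Figure 3; names are illustrative.
    public class MultilevelBuffer {
        private final Map<String, Map<String, List<Measure>>> buffer =
                new ConcurrentHashMap<>();

        public void add(String traceGroup, String metricId, Measure m) {
            buffer.computeIfAbsent(traceGroup, g -> new ConcurrentHashMap<>())
                  .computeIfAbsent(metricId, k -> new CopyOnWriteArrayList<>())
                  .add(m);
        }

        // All measures of one metric within one trace group,
        // e.g. the axillary temperature of a given patient.
        public List<Measure> measuresOf(String traceGroup, String metricId) {
            return buffer.getOrDefault(traceGroup, Map.of())
                         .getOrDefault(metricId, List.of());
        }
    }

This nesting is what enables analysis per trace group (one patient) as well as homogeneous comparisons of the same metric across trace groups.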
Moreover, the GF incorporates load shedding techniques [6], which allow managing the queue of services associated to measurements, thus mitigating overflow risks regardless of how they are grouped.
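As a rough sketch of the idea (the prototype follows the utility-driven techniques of [6]; the random-drop policy below is only a simplified placeholder):

    import java.util.Queue;
    import java.util.Random;

    // Simplified load shedding: when the service queue exceeds a high-water
    // mark, drop a fraction of incoming measurements to avoid overflow.
    // Both constants are illustrative assumptions.
    public class LoadShedder {
        private static final int HIGH_WATER_MARK = 10_000;
        private static final double DROP_RATE = 0.10; // shed 10% under pressure
        private final Random random = new Random();

        public boolean accept(Queue<Measure> queue) {
            if (queue.size() < HIGH_WATER_MARK) {
                return true;                           // no pressure: keep all
            }
            return random.nextDouble() >= DROP_RATE;   // shed randomly under load
        }
    }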
2.3.2. Correction and Analysis Processes
The correction process is based on statistical techniques where data and their associated metadata allow a richer (semantic) analysis. The semantics lie in the formal definition of each M&E project with regard to the C-INCAMI conceptual framework (introduced in sub-Section 2.2).
It is important to remark that the formal definition of
each project is made by experts. In this way, such a defi-
nition becomes a reference pattern in order to determine
if a particular measure (value) is coherent and consistent
with regard to its associated metric specification.
Once the measures are organized in the buffer, the SDSPbMM prototype applies descriptive, correlation and principal component analysis. These techniques allow detecting inconsistent situations, trends and correlations, and/or identifying the system components that incorporate more variability. If some situation is detected in the ASF (see Figure 2), a statistical alarm is triggered to the decision maker (DM) component, in order to evaluate whether or not it is necessary to send an external alarm (via e-mail, SMS, etc.) reporting this situation to the medical staff.
2.3.3. Decision-Making Processes
Once the statistical analysis has been performed, the unified streams are communicated to the current classifier (CC) component, which classifies measurements to decide whether or not they correspond to a risk situation, and reports such decision accordingly to the DM. Simultaneously, the CC is regenerated by incorporating the unified streams into the training measure set, producing a new model named Updated Classifier (UC) in Figure 2. Later, the UC classifies the unified streams and produces an updated decision, notifying the DM. Ultimately, the DM evaluates whether both decisions (from CC and UC) correspond to a risk situation and its probability of occurrence.
Finally, regardless of the decision made by the DM, the UC becomes the CC, replacing the previous one (see the adjust model arrow in Figure 2), only if there exists an improvement in the classification capacity according to the adjustment model based on ROC (Receiver Operating Characteristic) curves [7].
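That replacement rule can be sketched as follows, assuming each classifier exposes the area under its ROC curve computed on a held-out set (the interface and method names are our assumptions, not the prototype’s actual API):

    // Replace the current classifier (CC) with the updated one (UC) only if
    // the updated model improves the area under the ROC curve.
    public final class ModelAdjuster {
        public static Classifier adjust(Classifier cc, Classifier uc) {
            return (uc.aucOnHeldOutSet() > cc.aucOnHeldOutSet()) ? uc : cc;
        }
    }

    interface Classifier {
        String classify(Measure m);   // e.g. an acceptability level
        double aucOnHeldOutSet();     // area under the ROC curve [7]
    }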
2.3.4. Contribution of Metadata to the Measurement Process
In this subsection, the added value of metadata for data interoperability, consistency and processability is addressed. C-INCAMI/MIS (Measurement Interchange Schema) [4] is a schema—based on the C-INCAMI conceptual base discussed in sub-Section 2.2—which copes with interoperability issues in the provision of data from heterogeneous devices, and with their further organization. Figure 4 presents an annotated schema of a C-INCAMI/MIS stream.
For each sent stream, the MA incorporates into the raw data—e.g. the value 80—the structure of the C-INCAMI/MIS schema, indicating the correspondence of each measure with each attribute and contextual property. For instance, in the message of Figure 4, IDEntity = 1 represents the outpatient entity, IDMetric = 2 the metric value of the cardiac frequency, and IDProperty = 5 the metric value of the environmental humidity percentage at the patient location—representing a contextual property. Thus, the metadata in the message clearly includes a set of information which allows keeping a link between a measure value and the origin of data, identifying the data source, the metric and entity IDs, among others. This information allows increasing the consistency of the processing model for each M&E project definition.
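For illustration, the tagging performed by the MA could be sketched as follows; the Java names mirror the IDs discussed above (IDEntity, IDMetric, IDProperty) but are our own assumptions, not the literal C-INCAMI/MIS elements:

    // Sketch: how the MA might tag a raw value with C-INCAMI metadata
    // before sending it to the GF. Names are illustrative assumptions.
    public class TaggedMeasurement {
        int idEntity;       // e.g. 1 = the outpatient entity
        Integer idMetric;   // e.g. 2 = cardiac frequency (null for context data)
        Integer idProperty; // e.g. 5 = environmental humidity (null for attributes)
        double value;       // the raw measure, e.g. 80

        static TaggedMeasurement forAttribute(int entity, int metric, double v) {
            TaggedMeasurement t = new TaggedMeasurement();
            t.idEntity = entity;
            t.idMetric = metric;
            t.value = v;
            return t;
        }
    }

Under these assumptions, TaggedMeasurement.forAttribute(1, 2, 80) would correspond to the cardiac frequency measure in the message of Figure 4.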
Recall that measures are sent from heterogeneous data sources to the GF component through the MA. When the MA receives data streams from each data source, it incorporates the corresponding metadata into a common stream—independently of whether the measures come from several data sources—and transmits it to the GF component by means of the C-INCAMI/MIS schema [4]. Thus, previous to sending measures, each data source must configure just once each metric that quantifies each attribute (e.g. the cardiac frequency attribute) of the entity under assessment (e.g. an outpatient), and the included contextual properties (e.g. environmental temperature) of the situation. This allows the MA to be aware of how such metadata should be embedded into the stream.
Figure 4. Annotated XML (Extensible Markup Language) schema of a C-INCAMI/MIS stream.
Let’s suppose, for example, that a value of 80 associated to the cardiac frequency of an outpatient arrived; then, the following basic questions can be raised: What does it represent? Which unit of measure does it have? Which mathematical and statistical properties does the value have regarding its scale type? Is it good or bad? What is good and what is bad, i.e. what are the decision criteria? Could any software process the measure?
Therefore, if the stream metadata were not available, many questions such as those could not be answered in a consistent way. Even more, the processability of measures can be hampered and the analysis skewed.
3. Outpatient Monitoring Scenario
In this section, we illustrate the formal definition of an M&E project for outpatient monitoring, as well as some aspects of its implementation. In the M&E project definition, the knowledge of experts (e.g. doctors) is a valuable asset.
3.1. Definition
The present scenario aims at illustrating the SDSPbMM approach. The underlying hypothesis is that doctors of a healthcare centre could avoid adverse reactions and major damage to the health of patients (particularly, outpatients) if they had continuous monitoring over them. That is to say, doctors should have a mechanism by which they can be informed about unexpected variations and/or inconsistencies in the health indicators defined by them (as experts). So, the core idea is that there exists some proactive mechanism based on health metrics and indicators that produces an on-line report (alarm) for each risk situation associated to the outpatient under monitoring.
Considering C-INCAMI, the information need is “to
monitor the principal vital signs of an outpatient when
he/she is given the medical discharge from the health-
care centre”. The entity under analysis is the outpatient.
According to medical experts, the corporal temperature, the systolic arterial pressure (maximum), the diastolic arterial pressure (minimum) and the cardiac frequency represent the relevant attributes of the outpatient vital signs to monitor. They also consider it necessary to monitor the environmental temperature, environmental pressure, humidity, and patient position (i.e. latitude and longitude) contextual properties. The definition of the information need, the entity, its associated attributes and the context are part of the “Nonfunctional requirements specification” and “Context specification” components, as discussed in sub-Section 2.2.
The quantification of attributes and contextual properties is performed by metrics, as shown in the Measurement design and implementation component in Figure 1. For monitoring purposes, the metrics that quantify the cited attributes were selected from the C-INCAMI DB repository, as were the metrics that quantify the cited contextual properties. Figure 5 shows the specification of the metric for the environmental temperature contextual property.
After the set of metrics and contextual properties for outpatient monitoring has been selected, the corresponding elementary indicators for interpretation purposes (as discussed in sub-Section 2.2) also have to be selected by experts. In this way, they have included the following elementary indicators: level of corporal temperature, level of pressure, level of cardiac frequency, and level of difference between the corporal and the environmental temperature. The concepts related to indicators are part of the Evaluation design and implementation component (see Figure 1).
Figure 6 shows the specification of the level of corporal temperature elementary indicator. For example, the different acceptability levels with their interpretations are shown, among other metadata. Besides, considering that the ranges of the acceptability levels (shown in Figure 6) are in a categorical scale (i.e. an ordinal scale type), the target variable for the mining function (classification) is also categorical. So, both the CC and UC classifiers act relying on the values of the given indicators and their acceptability levels.
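As an illustration of how such ordinal decision criteria can be operationalized, consider the following minimal Java sketch; the thresholds are placeholders for illustration, not the experts’ actual values from Figure 6:

    // Sketch of an elementary indicator with ordinal acceptability levels.
    // The thresholds are illustrative placeholders.
    public class CorporalTemperatureIndicator {
        public String acceptabilityLevel(double celsius) {
            if (celsius < 36.0) return "BELOW_NORMAL";
            if (celsius <= 37.5) return "NORMAL";   // satisfactory range
            if (celsius <= 38.5) return "WARNING";  // requires attention
            return "CRITICAL";                      // triggers an alarm
        }
    }

The returned categorical level is precisely the kind of target variable that the CC and UC classifiers rely on.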
3.2. Implementation Issues
Once all the above project information has been established, for implementation it is necessary to choose a concrete architecture to deploy the system. Figure 7 depicts an abridged deployment view of the outpatient monitoring system considering the SDSPbMM approach.
Let’s suppose we install and set up the MA in a mobile device—the outpatient device—which will work in conjunction with sensors, as shown in Figure 7.
Figure 5. Metric definition for the environmental temperature contextual property.
Figure 6. Details of the level of corporal temperature elementary indicator specification.
Therefore, while the data collecting and adapting processes are implemented in a mobile device by the MA, the gathering function and the other processes can reside in the healthcare center computer. The MA component, using web services, reports the measures (streams) to the gathering function (GF) in an asynchronous and continuous way. The MA takes the measures from the sensors—the data sources—and incorporates the associated metric metadata for attributes and contextual properties accordingly. For instance, it incorporates the contextual property ID for the environmental temperature (VTAPT, in Figure 5) together with the value to transmit, and so on for every attribute and contextual property. Note that data (values) and metadata are transmitted through the C-INCAMI/MIS schema to the gathering function (GF), as discussed in sub-Section 2.3.4.
When the gathering function receives measures from several outpatients under monitoring, it arranges them, for instance, by patient (i.e. the trace group) and transmits them to the analysis and correction processes. As discussed in sub-Section 2.3.2, the ASF mainly solves typical problems of data such as missing values and noise, among others. For example, and thanks to metadata, if the ASF receives a zero value for the value of axillary temperature metric, the processing model identifies an error by the metric definition, because the scale is numeric (interval scale type), continuous, and defined in the interval of positive real numbers.
Figure 7. A deployment view for the Outpatient Monitoring System.
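A minimal sketch of that consistency check, assuming the metric definition carries an admissible numeric interval (the class, its bounds and the 45.0 value are our own illustrative assumptions):

    // Sketch: validate a measure against its formal metric definition.
    // For axillary temperature the scale is numeric, continuous and defined
    // over positive real numbers, so a zero value is flagged as inconsistent.
    public class MetricValidator {
        private final double lowerBound; // exclusive; 0.0 for this metric
        private final double upperBound; // inclusive; illustrative value

        public MetricValidator(double lowerBound, double upperBound) {
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
        }

        public boolean isConsistent(double value) {
            return value > lowerBound && value <= upperBound;
        }
    }

With these assumptions, new MetricValidator(0.0, 45.0).isConsistent(0.0) returns false, raising the error described above.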
Although all the values of metrics and contextual properties from the monitored outpatients are simultaneously received and analyzed, let’s consider for a while, for illustration purposes, that the system only receives data for the axillary temperature attribute and the environmental temperature contextual property from one outpatient, and that the system also visualizes them. As depicted in Figure 8, the lower and upper limits defined for the level of corporal (axillary) temperature indicator, together with the evolution of the environmental temperature and the axillary temperature, can be tracked.
The measures and, ultimately, the acceptability level achieved by the level of corporal temperature elementary indicator (see Figure 8) indicate a normal situation for the patient. Nevertheless, the on-line decision-making process, apart from analyzing the level of acceptability met for attributes, also analyzes the interaction with contextual properties and their values. This analysis allows detecting a situation like the one exposed in Figures 9(a) and (b).
At first glance, what seemed to be normal and evident was probably not, because in a proactive form the processing model detected a correlation between the axillary temperature and the environmental temperature, as shown in Figure 9(b). This could cause the triggering of a preventive alarm from the healthcare centre to doctors, because an increment in the environmental temperature can drag in turn an increment in the corporal temperature, and therefore this situation can be associated to a gradual rise in the risk probability for the outpatient.
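The detection itself reduces to computing a Pearson coefficient between the attribute series and the context series and comparing it against a threshold; a minimal sketch follows (the 0.8 threshold is an illustrative assumption, and the series are assumed non-constant with more than one sample):

    // Sketch: Pearson correlation between an attribute series (axillary
    // temperature) and a context series (environmental temperature).
    public final class DragDetector {
        public static double pearson(double[] x, double[] y) {
            int n = x.length;
            double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += x[i]; sy += y[i];
                sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
            }
            double cov = sxy - sx * sy / n;   // n * covariance
            double vx = sxx - sx * sx / n;    // n * variance of x
            double vy = syy - sy * sy / n;    // n * variance of y
            return cov / Math.sqrt(vx * vy);
        }

        // Trigger a preventive alarm when the correlation is strong;
        // the 0.8 threshold is an assumption, not the prototype's value.
        public static boolean shouldAlert(double[] attr, double[] ctx) {
            return Math.abs(pearson(attr, ctx)) > 0.8;
        }
    }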
4. Scenario Simulation
4.1. Goal
The developed prototype for the SDSPbMM approach implements functionalities (see Figure 2) ranging from the formal definition of the M&E project, including the C-INCAMI repository with metadata, the integration of heterogeneous data sources, trace groups and the MA, to the classifiers for the on-line decision-making process. In addition, it implements the C-INCAMI/MIS schema for the interchange of measures in an interoperable way, and the multilevel buffer based on metadata (see Figure 3).
Figure 8. Visualization of the evolution of axillary temperature versus environmental temperature measures.
Figure 9. (a) Correlation analysis for the axillary temperature versus the environmental temperature; (b) Correlation matrix.
The prototype has been implemented in Java, using R [8] as the statistical calculus engine, and the Rserve mechanism from CRAN (the Comprehensive R Archive Network) to access the R engine over TCP/IP from the streaming application, without requiring persistence and prioritizing direct communication.
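A minimal sketch of that Java-to-R bridge using the Rserve client classes (org.rosuda); the variable names and the analyzed expression are our own, and the descriptive, correlation and PCA routines would presumably be invoked in the same manner:

    import org.rosuda.REngine.REngineException;
    import org.rosuda.REngine.REXPMismatchException;
    import org.rosuda.REngine.Rserve.RConnection;

    // Sketch: push a stream window to R over TCP/IP via Rserve and run a
    // simple descriptive computation, without intermediate persistence.
    public class RBridge {
        public static double meanOf(double[] window)
                throws REngineException, REXPMismatchException {
            RConnection r = new RConnection(); // localhost:6311 by default
            try {
                r.assign("w", window);         // ship the stream window to R
                return r.eval("mean(w)").asDouble();
            } finally {
                r.close();
            }
        }
    }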
The simulation goal is to determine the processing times involved in the outpatient scenario and the variable scalability. This simulation allows us to analyze the feasibility of applying the prototype to real situations. Furthermore, we statistically discuss the results of the simulation, in order to detect the components which add more variability to the system.
4.2. Simulation Planning and Execution
The simulation has been performed from the scenario illustrated in Section 3. The measurement data have been generated in a pseudo-random way considering two parameters: the quantity of metrics (in a simulation, each metric corresponds to a variable), and the quantity of measurements per variable. Each patient has 8 associated metrics pertaining to attributes and contextual properties, as commented in sub-Section 3.1.
The simulation discretely varied the quantity of variables (metrics) in the data stream from 3 to 99, and the quantity of measurements per variable from 100 to 1000.
The idea of discretely varying the quantity of metrics, instead of doing it as a multiple of 8—i.e. based on the 8 metrics associated with each patient—lies in analyzing the prototype behavior in the presence of missing values and the progressive reincorporation of measures to the stream.
The prototype, R and Rserve were running on a PC equipped with an AMD Athlon X2 64-bit processor, 3 GB of RAM, and Windows Vista Home Premium as the operating system.
For the simulation, the following variables, which are the target of measurement considering the stream as the entity under analysis, have been defined, namely:
Startup: the time (in ms) necessary to start up the analysis functions;
AnDesc: the time (ms) necessary to perform the descriptive analysis on the complete data stream;
Cor: the time (ms) necessary to perform the correlation analysis by trace group inside the complete data stream;
Pca: the time (ms) necessary to perform the principal component analysis by trace group inside the complete data stream;
Total: the time (ms) necessary to perform all the analyses on the complete data stream. A sketch of how these times can be captured follows below.
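Each of these times can be captured by bracketing the corresponding analysis call; a minimal Java sketch (runDescriptiveAnalysis is a hypothetical placeholder for the prototype’s internal function, not its actual name):

    // Sketch: measuring the analysis times (in ms) for one data stream.
    public class AnalysisTimer {
        public static long millis(Runnable analysis) {
            long t0 = System.nanoTime();
            analysis.run();
            return (System.nanoTime() - t0) / 1_000_000;
        }
    }

For instance, AnDesc would be obtained as AnalysisTimer.millis(() -> runDescriptiveAnalysis()), and Total by bracketing the whole sequence of analyses.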
The simulation parameters used for the statistical analysis of results are represented by qVar, which indicates the quantity of variables of the data stream, and by meds, which indicates the quantity of measures per variable of the data stream. From now onwards, in order to simplify the reading of the statistical analysis, the parameters qVar and meds will be directly referred to as variables, and Startup, AnDesc, Cor, Pca and Total will be called variables as well.
From the simulation process standpoint, we have obtained 1390 measurements of the overall processing time in relation to the evolution of the quantity of variables and measurements. This allows us to statistically arrive at verifiable results that consequently help us validate the prototype in a controlled environment.
4.3. Analysis of Results
The chart in Figure 10(b) clearly shows how the evolution of the quantity of variables significantly affects the overall processing time of data streams, increasing according to the values shown in chart (a). Here, we can observe that the increment in processing time produced by the increase of measurements is extremely low in comparison with the one produced by the increase of variables. This latter aspect indicates that the load shedding mechanism really achieved the goal of avoiding overflows without affecting the stream processing time against the variation of the stream volume. The incorporation of new variables, on the other hand, does have an influence: besides increasing the stream volume, each new variable also interacts with the preexistent variables, and this is the cause of the main difference, in terms of processing time, with respect to the increase produced by measurements.
Figure 10. Two views of the evolution of overall processing time (ms) against the evolution of the quantity of variables and measurements.
In both dispersion charts, (a) and (b), each point is rendered with a color that is associated with the quantity of variables. This allows us to identify regions in the graph in a graceful way and to compare them from both perspectives. In chart (b), we can observe that the overall processing time keeps a linear relation with the quantity of variables. Considering such a situation, and on the basis that the statistical analyzer (ASF in Figure 2) performs a series of analyses on the data stream, we have studied the incidence of each analysis on the overall processing time, in order to detect which of them are more critical in terms of time.
The Pearson’s correlation matrix shown in Figure 11(a) would confirm, firstly, the indicated linear relationship between the quantity of variables (qVar) and the overall processing time of the data stream (Total), given the coefficient value of 0.95. Secondly, it can be concluded that the overall processing time would keep a strong linear relationship with the time of the descriptive analysis, with a coefficient of 0.99, followed by the time of Pca with 0.9, and Cor with 0.89, respectively.
The resulting matrices of the principal component analysis—shown in Figures 11(b) and (c)—reveal which of the variables provide more variability to the system. Thus, the first eigenvalue (row 1, Figure 11(b)) explains 66% of the variability of the system. Also, if we look at its composition in the matrix of eigenvectors (col. e1, Figure 11(c)), the variables that contribute most—in absolute terms—are AnDesc, Cor, Pca and qVar.
Figure 11. (a) Pearson’s correlation matrix; (b) Matrix of eigenvalues; (c) Matrix of eigenvectors associated to the principal component analysis (PCA).
Therefore, if we replace the seven cited variables with three new variables (e1 to e3), we would be explaining 96% of the variability of the system, where the main variables in terms of contribution are associated to AnDesc, Cor, Pca and qVar. The system is affected only 16% by the evolution of measurements and 14% by the startup time. This is an important aspect to remark, because the only variable external to the prototype, i.e. the volume of measurement arrivals, which cannot be controlled by it, represents just 16% and in no way constituted an overflow situation in the queue of services.
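In PCA terms, these percentages are the cumulative proportions of variance given by the eigenvalues $\lambda_i$ of the correlation matrix of the seven variables; with the figures reported above:

    \frac{\lambda_1}{\sum_{i=1}^{7}\lambda_i} \approx 0.66,
    \qquad
    \frac{\lambda_1 + \lambda_2 + \lambda_3}{\sum_{i=1}^{7}\lambda_i} \approx 0.96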
Lastly, taking into account the four variables that contributed most to the system variability, three of them are part of the overall processing time. In this way, and using the box plot of Figure 12, we can corroborate that the most influential variable, in terms of magnitude on the overall processing time, is AnDesc. In addition, note that the biggest resulting time to process 99 variables (metrics) with 1000 measures each (i.e. 99,000 measures per stream in total) was 1092 ms.
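As a back-of-the-envelope reading of that figure on the hardware described in sub-Section 4.2, the worst-case throughput amounts to roughly:

    \frac{99{,}000\ \text{measures}}{1.092\ \text{s}} \approx 9.1 \times 10^{4}\ \text{measures/s}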
This outcome represents a satisfactory applicability
threshold for the prototype, especially taking into account
the basic hardware used. So, in our humble opinion, this
could easily meet the response time requirements for the
outpatient monitoring scenario.
Figure 12. Boxplot of the AnDesc, Cor, Pca and Total variables.
5. Related Work and Discussion
There are many research efforts oriented to data stream processing from the syntactical point of view, in which continuous queries over data streams are made in terms of attributes and their associated values using CQL (Continuous Query Language) [9]. This approach has been implemented in several projects such as Aurora & Borealis [10], STREAM [11], and TelegraphCQ [12], among others. Our approach (and prototype) includes the capability to incorporate metadata based on an M&E framework, which allows guiding the organization of data streams in the buffer; making possible consistent and comparable analysis from the statistical standpoint; and triggering alarms in a proactive way, using several statistical analyses or decisions stemming from the classifiers.
MavStream [13] is a prototype for a data stream management system which has the capability of processing complex events as an intrinsic aspect of data stream processing. In this sense, our prototype supports on-line data stream analysis with the incorporation of metadata into measures (data), handling not only measure values coming from attributes of the assessed entity but also those coming from contextual properties related to the situation of the entity. In addition, the SDSPbMM prototype can process measures with nondeterministic results, and perform analysis by trace group (or an overall analysis), which in practical scenarios, such as the monitoring of outpatients [4], represent crucial features.
Nile [14] is a data stream management system based on a conceptual framework for detection and tracking of phenomena or situations supported by deterministic measures. Our prototype, unlike Nile, allows the incorporation of heterogeneous data sources, embracing not only deterministic but also nondeterministic measures.
On the other hand, Singh et al. [15] introduce a system architecture for a formal framework of data mining oriented to the situation presented in [16]. This system is used in medical wireless applications and shows how the architecture can be applied to several medical areas such as diabetes treatment and risk monitoring of heart disease. In our humble opinion, this system neglects central issues for assuring repeatability and interoperability, because it lacks a clear specification of metrics, both for entity attributes and contextual properties, as well as of indicators, scales and scale types, among other metadata.
Lastly, Huang et al. [17] present an approach based on self-managed reports for the tracking of patients. Such reports are made up of a set of questionnaire items with numeric (scale) responses, which are filled in by patients at home. The patients’ responses feed a classification model based on neural networks, in order to progressively improve the selection of the questionnaire items incorporated in the reports. Hence, they argue that this decreases the patients’ response time and allows identifying those aspects that will foster an improvement in their quality of life. As happens with the Singh et al. proposal [15,16], this approach says nothing about how to define metrics, indicators, scales, and so on.
Our strategy and its prototype support data stream processing in alignment with a conceptual base, i.e. the metric and indicator ontology [3], which guarantees not only syntactic but also semantic processing, in addition to interoperability and consistency.
6. Conclusions and Future Work
In this work, we have discussed how the presence of metadata based on the C-INCAMI M&E framework, linked to measures in data streams, allows an organization of measurements which fosters the consistency of statistical analysis, since they specify not only the formal components of data but also the associated context. Hence, it is possible to perform particular analyses at the trace group level or at a more general level, comparing values of metrics among different trace groups in order to identify, for example, deviations of measures against their formal definition, the main system variability factors, as well as relations among variables.
In the outpatient monitoring scenario introduced in Section 3, we have shown in Figures 8 and 9 the relationships between the data/metadata of metrics—i.e. the metrics that quantify contextual properties and attributes—and the statistical analysis, considering our data stream processing strategy. In this sense, even when the measures seemed to have normal values, the data and metadata of metrics, in conjunction with the correlation analysis, allowed identifying a drag situation and then triggering alarms to prevent it. Moreover, such metadata allowed identifying variability factors and detecting trends in a consistent way, considering the contextual situation as well.
Using the developed prototype, which implements the SDSPbMM strategy, we have demonstrated from the statistical analysis of the simulation outcomes that the prototype is more susceptible, in terms of processing time, to an increase in the quantity of variables than to an increase in the quantity of measurements per variable. Using the principal component analysis technique, we have shown that the aspects that contribute most to the system variability are those associated to the AnDesc, Cor, Pca and qVar variables, AnDesc being the one that defines the biggest proportion of the final processing time of data streams.
Taking into account that the implemented prototype ran on hardware that is readily available on the market, we could establish as a benchmark that, to process 99,000 measurements (99 variables and 1000 measures/variable), the biggest time spent was 1092 ms. This is an important starting point, since now we can consistently evaluate several application scenarios against this benchmark. On the other hand, the effectiveness of the load shedding mechanism in the multilevel buffer was also proved statistically, showing that the evolution of the quantity of measurements does not compromise the prototype operation, and that the final processing time of the data stream is not affected.
Although the present work is just a simulation of the outpatient monitoring scenario by means of a prototypical software application, we have initially shown that it can scale up to a real scenario. As future work, we plan to experimentally test our data stream processing strategy, enriched with context, measurement and evaluation metadata, on several scenarios, in order to statistically validate the initial benchmark obtained for the outpatient monitoring scenario.
7. Acknowledgements
This research is supported by the PICT 2188 project from
the Science and Technology Agency and by the 09/F052
project from the UNLPam, Argentina.
REFERENCES
[1] N. Jain, J. Gehrke and H. Balakrishnan, “Towards a Streaming SQL Standard,” Proceedings of the VLDB Endowment, Vol. 1, No. 2, 2008, pp. 1379-1390.
[2] H. Molina and L. Olsina, “Towards the Support of Contextual Information to a Measurement and Evaluation Framework,” International Conference on Quality of Information and Communications Technology, Lisbon, 12-14 September 2007, pp. 154-166.
[3] L. Olsina, F. Papa and H. Molina, “How to Measure and Evaluate Web Applications in a Consistent Way,” In: G. Rossi, O. Pastor, D. Schwabe and L. Olsina, Eds., Web Engineering: Modelling and Implementing Web Applications, Springer, London, 2008, pp. 385-420.
[4] M. Diván and L. Olsina, “Integrated Strategy for the Data Stream Processing: A Scenario of Use,” Proceedings of the Iberoamerican Conference in Software Engineering (CIbSE), Medellín, 2009, pp. 374-387.
[5] M. Diván, L. Olsina and S. Gordillo, “Data Stream Processing Enriched with Measurement Metadata: A Statistical Analysis,” Proceedings of the Iberoamerican Conference in Software Engineering (CIbSE), Rio de Janeiro, 2011, p. 29.
[6] M. Wei, E. Rundensteiner and M. Mani, “Utility-Driven Load Shedding for XML Stream Processing,” International Conference on World Wide Web, Beijing, 21-25 April 2008, pp. 855-864.
[7] C. Marrocco, R. Duin and F. Tortorella, “Maximizing the Area under the ROC Curve by Pairwise Feature Combination,” Pattern Recognition, Vol. 41, No. 6, 2008, pp. 1961-1974. doi:10.1016/j.patcog.2007.11.017
[8] The R Foundation for Statistical Computing, 2010. http://www.r-project.org/foundation/
[9] S. Babu and J. Widom, “Continuous Queries over Data Streams,” ACM SIGMOD Record, Vol. 30, No. 3, 2001, pp. 109-120. doi:10.1145/603867.603884
[10] D. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M.
Cherniack, J. Hwang, W. Lindner, A. Maskey, A. Rasin,
E. Ryvkina, N. Tatbul, Y. Xing and S. Zdonik, “The De-
sign of the Borealis Stream Processing Engine,” Confer-
ence on Innovative Data Systems Research (CIDR), Asi-
lomar, 2005, pp. 277-289.
[11] The STREAM Group, “STREAM: The Stanford Stream Data Manager,” Stanford, 2003.
[12] S. Krishnamurthy, S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Madden, F. Reiss and M. Shah, “TelegraphCQ: An Architectural Status Report,” IEEE Data Engineering Bulletin, Vol. 26, No. 2, 2003, pp. 11-18.
[13] S. Chakravarthy and Q. Jiang, “Stream Data Processing: A Quality of Service Perspective,” Springer, New York, 2009.
[14] M. Ali, W. Aref, R. Bose, A. Elmagarmid, A. Helal, I. Kamel and M. Mokbel, “NILE-PDT: A Phenomenon Detection and Tracking Framework for Data Stream Management Systems,” Very Large Data Bases (VLDB), Trondheim, 2005, pp. 1295-1298.
[15] S. Singh, P. Vajirkar and Y. Lee, “Context-Aware Data Mining Framework for Wireless Medical Application,” LNCS, Springer, Vol. 2736, 2003, pp. 381-391.
[16] S. Singh, P. Vajirkar and Y. Lee, “Context-Based Data Mining Using Ontologies,” LNCS, Springer, Vol. 2813, 2003, pp. 405-418.
[17] Y. Huang, H. Zheng, C. Nugent, P. McCullagh, N. Black, K. Vowles and L. McCracken, “Feature Selection and Classification in Supporting Report-Based Self-Management for People with Chronic Pain,” IEEE Transactions on Information Technology in Biomedicine, Vol. 15, No. 1, 2011, pp. 54-61.