Wireless Sensor Network, 2010, 2, 115-122
doi:10.4236/wsn.2010.22016 y 2010 (http://www.SciRP.org/journal/wsn/).
Copyright © 2010 SciRes. WSN
Published Online Februar
K-Nearest Neighbor Based Missing Data Estimation
Algorithm in Wireless Sensor Networks
Liqiang Pan, Jianzhong Li
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Email: {panlq, lijzh}@hit.edu.cn
Received November 21, 2009; revised November 30, 2009; accepted December 4, 2009
Abstract
In wireless sensor networks, the missing of sensor data is inevitable due to the inherent characteristic of
wireless sensor networks, and it causes many difficulties in various applications. To solve the problem, the
missing data should be estimated as accurately as possible. In this paper, a k-nearest neighbor based missing
data estimation algorithm is proposed based on the temporal and spatial correlation of sensor data. It adopts
the linear regression model to describe the spatial correlation of sensor data among different sensor nodes,
and utilizes the data information of multiple neighbor nodes to estimate the missing data jointly rather than
independently, so that a stable and reliable estimation performance can be achieved. Experimental results on
two real-world datasets show that the proposed algorithm can estimate the missing data accurately.
Keywords: Missing Data, Estimation, Wireless Sensor Networks
1. Introduction
The rapid development of wireless communication tech-
niques, micro-electronics techniques and embedded com-
putation techniques makes Wireless Sensor Networks
(WSNs) being applied in many fields [1–4]. WSNs con-
sist of many sensor nodes deployed in a special region
where users are interested in, and each sensor node has
some computing ability, storage ability and communica-
tion ability. Users issue queries to obtain information
about the monitored region. Faced with the features of
WSNs, many query processing algorithms have been
proposed for various applications. However, all these
query processing techniques are frustrated by a common
problem, that is, the missing of sensor data.
Actually, the missing of sensor data is inevitable due
to the inherent characteristic of WSNs. For example, the
communication ability of sensor nodes is limited. Some
sensor nodes may be isolated from the WSNs for a short
or long time due to the influences of surrounding envi-
ronment such as mountains and obstacles, which results
that the sensor data of these nodes may be lost. In addi-
tion, the natural environment such as rain, thunder and
lightning will influence the sensor nodes’ communica-
tion quality either and make the communication links
between sensor nodes connected and disconnected fre-
quently. This will also result in the sensor data lost dur-
ing the data transmission. Secondly, the power of sensor
nodes is limited. When a sensor node’s power is low, it
usually works under an unstable state. This not only
causes the unstable communication which may make the
sensor data lost, but also makes the sensor data sampled
be often useless abnormal data (e.g. the temperature of a
room is 300). The abnormal data is looked as the
missing data since it can never be used. When the power
of a sensor node is exhausted, the sensor node cannot
collect the data any more and the data cached in the
storage which have not been sent back may also be lost.
In addition, the size of sensor node is small and it is easy
to be damaged, which may also result in the lost of sen-
sor data. Due to the reasons given above, no matter how
efficient and robust query processing algorithms are de-
veloped, the missing of the sensing data is inevitable.
The missing of sensor data will cause many difficulties
in various applications. For example, in the data collec-
tion applications, the missing data will not only decrease
the availability of sensing datasets, but also decrease the
efficiency of WSNs greatly. In the research of forest en-
vironment [5], a WSN is deployed in the forest to collect
the environment variables such as temperature, humidity,
atmosphere pressure and sunlight etc. Based on the sensor
data collected, biologists can study the forest microcli-
mate, the dynamic tree respiration and growth models etc.
However, the data collected by sensor nodes is raw data.
Biologists need use some analysis tools on the amounts of
raw data and then can get the analysis results and draw a
conclusion. Unfortunately, the existing analysis tools
which are adopted in these fields, such as support vector
L. Q. PAN ET AL.
Copyright © 2010 SciRes. WSN
116
machines, principal component analysis and singular
value decomposition etc., cannot process the datasets
with missing data, and it is infeasible to modify the ex-
isting analysis tools for the datasets with missing data.
Besides, it is also difficult to process the raw data artifi-
cially due to the amount of raw data being huge. So, how
to deal with the datasets with missing data frustrates the
biologists greatly.
If all the missing data is deleted, much original data
information will be lost, which not only decreases the
accuracy and the reliability of biologists’ research, but
also may lead to the wrong research results. In addition,
deleting the missing data will also cause the waste of
energy. This is because the non-missing data in the same
tuple is valuable and believable. Collecting these data
also cost much energy. Further more, from the perspec-
tive of temporal dimension, the state of the monitored
objects at a certain moment can only be observed once,
hence the missing data cannot be collected any more, it
can only be estimated as accurately as possible.
Datasets [5,6] are two real-world datasets whose data
is collected by the WSNs deployed in the Sonoma red-
wood trees and the Intel-Berkeley lab respectively. They
show that there do exist vast missing data in the actual
data collection. Since the missing of sensor data is inevi-
table and it causes many difficulties, developing the high
quality missing data estimation algorithms is necessary
and urgent. Unfortunately, there exist few works on in-
vestigating how to process the missing data efficiently in
WSNs so far.
In this paper, a k-nearest neighbor based missing data
estimation algorithm is proposed. It adopts linear regres-
sion model to describe the spatial correlation of sensor
data among different sensor nodes and uses the multiple
neighbor nodes’ data jointly rather than independently to
estimate the missing data. Hence, it can achieve a good
estimation effect for the missing data, even for the sensor
data of changing irregularly which appears often in
WSNs. The performance of the algorithm proposed in
this paper is evaluated through extensive experiments on
two real-world datasets and compared with the other
missing data estimation algorithm. The experiment re-
sults show that the proposed algorithm can estimate the
missing data more accurately.
The remainder of this paper is organized as follows. In
Section 2, an overview of related works is presented. In
Section 3, we first give a formal definition of the missing
data estimation problem, and then present the algorithm.
Section 4 shows the experimental results, and Section 5
concludes the paper.
2. Related Work
Research on missing data estimation has been studied in
some fields, such as artificial intelligence [7,8], bioin-
formatics [9,10–12], and data mining [13,14], but there
are few works in WSNs. The works in those fields are
not adapted for WSNs, since they do not take account of
the features of sensor data being temporal and spatial
correlated. The idea of k-nearest neighbor has been
adopted in the bioinformatics to estimate the missing
values of DNA microarrays [12]. However, the algorithm
in [12] is trivial, since it only directly uses the weighted
average of the other genes’ corresponding data as the
estimated values of the missing data. While in WSNs, the
sensor data of different nodes is more likely to have
some functional relationship rather than being similar in
values simply. Thus, the algorithm in [12] is not adapted
for estimating the missing sensor data.
Research on query processing in WSNs mainly fo-
cuses on processing continuous queries and approximate
queries. Processing continuous queries mainly focuses on
how to schedule the continuous queries optimally and
how to collect the sensing data satisfying these queries
energy-efficiently according to network topology and
other system characteristics [3,15–18]. Processing ap-
proximate queries mainly focuses on how to utilize the
temporal-spatial relationship of sensing data to construct
appropriate mathematical models and how to use these
models to answer the queries approximately, trying to
lower communication cost [19–23]. To the best of our
knowledge, there exist few works investigating how to
process the missing data.
Although [24] and [25] seem to be similar with this
paper, they focus on different problems from ours. We
focus on how to estimate the missing data as accurately as
possible, but [24] focuses on how to save the energy
mostly when processing continuous queries. The accuracy
of the estimated values is not mainly concerned in [24],
and on the contrary, [24] will sacrifice the accuracy of the
estimated values for saving energy in many cases. So the
methods in [24] are not suitable for our problem. In [25],
authors map the sensor network on a graph, and based on
the graph theory, they focus on how to estimate the
measurement values at arbitrary positions with the least
sensor nodes, which is also different from ours. Besides,
[25] assumes that the measurement values in the sensor
network satisfy some spatial physical laws, and these
physical laws can be modeled by the lumped-parameter
models. However, WSNs are usually deployed in some
unknown regions to execute monitoring task. The models
which describe the measurement values of these unknown
regions are difficult to be got in fact. So, the techniques in
[25] are difficult to be used actually.
Based on the data mining techniques, literature [26,27]
studied the estimation of the missing data in data streams.
However, the algorithms in [26,27] have great limitations
and cannot be used widely. For example, the algorithms in
[26,27] can only deal with the discrete data, but not the
continuous data. However, in many applications, the en-
vironment variables monitored by WSNs such as tem-
perature, humidity, and atmosphere pressure etc. change
L. Q. PAN ET AL.117
continuously. In addition, the accuracy and the perform-
ance of the algorithms in [26,27] depend on the associa-
tion rules support and confidence thresholds which need to
be pre-specified by users. Since users are not familiar with
the monitored environments usually and the vast raw data
are difficult to be understood, users may not give the
proper thresholds, which results that the accuracy and the
performance of the algorithms decrease greatly. Further
more, the algorithms in [26,27] estimate the missing data
according to the frequent patterns which are pre-computed
based on the existing data. If the patten containing the
missing data does not appear in the frequent patterns, the
missing data cannot be estimated by [26,27]. Compared
with the algorithms in [26,27], the algorithm proposed in
this paper can solve above problems well.
3. Algorithm Presentation
This paper investigates how to estimate the missing sen-
sor data as accurately as possible. Before introducing the
algorithm, we first give the problem definition.
Definition1: The sensor data collected by the sensor
node Ni can be looked as a time series
Si=(<yi1,T1>, … ,<yin,Tn>), where yik is the sensor data of
Ni at time Tk . For Tk , k {1, … , n}, if the sensor data
yik is missed, then computing its estimated value to
minimize the expression
ˆik
y
ˆik ik
y
y is called the missing
data estimation problem.
In many applications, the environment variables
monitored by the WSNs such as temperature and humid-
ity change continuously. When some data of a sensor
node is missed, a naive method for estimating the miss-
ing data is, based on the temporal correlation of sensor
data, using the non-missing data whose collection time is
near to the missing data to estimate them. However, this
method works well only when the sensor data changes
smoothly and the missing data appears in a short time
period. In the other cases, this method may cause large
estimation errors. This is because the sensor data in
WSNs changes sharply and irregularly often in fact, es-
pecially the data sensed in the natural environment since
too many uncertain factors, such as environment noise,
will affect the variety of the sensor data. So, only de-
pending on the temporal correlation of sensor data to
estimate the missing data is not enough in many cases.
Figure 1. Temperature collected by two sensor nodes.
To estimate the missing data as accurately as possible,
we should consider not only the temporal correlation of
sensor data, but also the spatial correlation of sensor data.
Motivated by this observation, we propose the Applying
K-nearest neighbor Estimation (AKE) algorithm which
estimates the missing data based on the spatial correla-
tion more than the temporal correlation of sensor data.
As known, there are many sensor nodes deployed in a
monitored region. The sensor data of these nodes has
spatial correlations. That is, at a moment, the data sensed
by the sensor nodes whose locations are nearby is similar
or has some relationships. For example, Figure1 shows
the temperature observed by two sensor nodes in two
days [20]. From Figure 1, we can see that the data sensed
by node 1 and node 25 has the similar variety curves. So,
when some data of a sensor node is missed, we can esti-
mate them by its neighbor nodes.
For convenience of the algorithm description, without
loss of generality, we assume that only sensor node Ni
has missing data, and Ni has m neighbor nodes totally,
they are N1, … , Nm respectively. We call the sensor node
set consists of Ni ‘s all neighbors as Ni ‘s neighbor node
set which noted as Nb(i)= { N1, … , Nm }. For the node
Ni , since it has multiple neighbor nodes and for Nj
Nb(i), Nj has the spatial correlation with Ni, for decreas-
ing the random error caused by a single node when esti-
mating the missing data, AKE looks Ni and its all
neighbor nodes as a whole, and utilizes Ni ‘s all neighbor
nodes jointly rather than independently to estimate the
missing data of Ni.
For Nj Nb(i), the functional relationship between
the sensor data of Ni and Nj is unknown. Since the loca-
tions of the node Ni and Nj are close and an excitation
will cause the similar responses on the sensor data of Ni
and Nj, the relationship of Ni and Nj can be looked as
linear approximately in a short time period. AKE adopts
linear regression model to describe the spatial correlation
of node Ni and Nj, i.e. for any time t, there has
itjt jt
yy
 
  (1)
where yit is the sensor data of Ni at time t, and yjt is the
sensor data of Nj at the same time;
and
are the model
coefficients, and
jt is the random error at time t. Ac-
cording to the theory of linear regression model, to esti-
mate the missing data by utilizing Formula (1), we
should first select h (h12) pairs of known sensor data<
yin , yjn >, n t, as the sample data, and then use the sam-
ple data to regress the coefficients in Formula (1). That is,
compute ˆ
and ˆ
, which are the estimated values of
and
, based on the sample data according to least
squares principle. When the sensor data of Ni at time t is
missed, its estimated value computed by the node Nj can
be expressed as:
() ˆ
ˆ
ˆj
it jt
y

y
 (2)
where represents the estimated value of yit , which
()
ˆj
it
y
Copyright © 2010 SciRes. WSN
L. Q. PAN ET AL.
Copyright © 2010 SciRes. WSN
118
computed by node Nj. The deviation between the esti-
mated value ()
ˆ
it
y
()j
and the real value yit is called the re-
sidual of and yit, which is noted as =
ˆit
y()j
t
e()
ˆj
it
y
yit .
From the least squares principle, it is easy to know that
()
t
ehas the minimal variance.
Based on the Formula (2), totally m estimated values
can be got for a missing data yit, since Ni has m neighbor
nodes and according to each neighbor node Nj, Nj
Nb(i), an estimated value ()
ˆ
it
ycan be computed. To de-
crease the random estimation error caused by a single
neighbor node and improve the estimation system’s reli-
ability and stability, AKE uses the weighted average of
the m estimated values computed by the m neighbor
nodes as the final estimated value, i.e.
ˆit
y ()
1
ˆ
m
j
j
it
j
wy

(3)
where wj is the weight coefficient correspondingly, 0 <
wj < 1 and .
1
m
jw
(1)
t
e
()j
t
ˆit
y
1
j
Theorem1: For the estimated values computed by the
m neighbor nodes of Ni, assume that their corresponding
residuals are , , … , respectively, and these
residuals are independent and identically-distributed,
then the variance of residual = yit is less than
that of any , j={1,2, … , m }.
(2)
t
e()m
t
e
t
eˆit
y
e
Proof: According to the definition of the residuals,
there have = yit + and
t
e()
ˆ
it
y= yit + ()
t
e
()
. Substitute
them into Formula (3), we can get the relationship of
and , that is, =
t
e()j
t
et
e1
m
j
jt
2
()
j
m
j
j
we
1
m
jw
. Since ,
, …, are independent and identically-distributed,
without loss of generality, we assume the variance of
, j={1,2, … , m }, is DX. Then, from the properties of
variance, we can deduce that the variance of is
. Obviously, since 0 <
wj < 1 and . Accordingly,
< DX.
(1)
t
e
t
e
1
2
()
j
wD
(2)
t
e
()j
t
e
m
j
(m
t
e
2
()w
)
DX
1
m
jw
1j
1
j1X
Next, we discuss the weight assignment in Formula (3).
Since many factors will affect the spatial correlations
among the sensor nodes, the accuracy of the estimated
value computed by different neighbor nodes may be dif-
ferent. Intuitively, a more accurate estimated value
should be assigned a larger weight. Considering that,
given a set of sample data, the sample determination co-
efficient R2(0 R2 1) can reflect the goodness of regres-
sion equation fitting the sample data. The more the value
of R2 is, the better the regression equation fits the sample
data, which indicates that the estimated values computed
by the regression equation will be more accurate. Thus,
we can assign the weight according to the sample deter-
mination coefficient R2. For the regression equation
() ˆ
ˆ
ˆj
it jt
y

y
 , assume the sample data consists of h
pairs of sensor data, then the sample determination coef-
ficient corresponding to this regression equation can be
expressed as
() 2
2
() 2
1
ˆ
(
()
j
h
in i
j
nin i
yy
Ryy
)
(4)
where i
y
is the sample mean of node Ni. Accordingly,
the weight coefficient corresponding to the estimated
value ()
ˆ
j
it
y
can be defined as
2
()
2
()
1
j
jm
k
k
R
w
R
(5)
Based on the Formula (5), AKE can assign the appro-
priate weights to the corresponding estimated values
which computed according to different neighbor nodes.
Obviously, a more accurate ()
ˆ
it
ywill contribute more to
the final estimated value.
The computational complexity of AKE consists of two
components mainly. One is that of computing the coeffi-
cients of the regression equation for each neighbor node.
Another is that of computing the sample determination
coefficient R2 for each regression equation and then com-
puting the estimated values according to Formula (3).
From the theory of linear regression model, it is easy to
know that the cost of computing the coefficients for each
regression equation is O(h), and h is usually an empirical
constant. So, the cost of computing the coefficients for all
m regression equation is O(m). From Formula (4), we can
know that the cost of computing 2
()
R is also O(h). Thus,
the cost of computing the sample determination coefficient
for m regression equations and then estimating the missing
data based on Formula (3) is also O(m). Due to computing
the coefficients of regression equation and computing the
sample determination coefficient of regression equation
are two individual steps and executed by AKE sequen-
tially, the computational complexity of AKE is O(m),
where m is the number of Ni ‘s neighbor nodes.
Since AKE is based on the sensor data spatial correla-
tion to estimate the missing data and linear model is
adopted by the algorithm, it will perform best when the
sensor data of different nodes is linear correlated abso-
lutely. Although Figure 1 shows that the correlation of
sensor data may not be linear sometimes, it does not
matter too much. This is because the linear model can
approximate the real data correlation well in a short time
period, and hence when the sample size is not too much,
AKE will perform well even when the sensor data is not
linear correlated rigidly.
4. Experiment Results
The algorithm proposed in this paper is implemented by
L. Q. PAN ET AL.119
Java, and evaluated over two real-world datasets whose
data is collected by the WSNs indoors and outdoors re-
spectively. One dataset is Intel-lab dataset [6], which is a
trace of readings from 54 sensor nodes deployed in the
Intel Research Berkeley lab. These sensor nodes col-
lected light, humidity, temperature and voltage readings
once every 30 seconds. Another dataset is Redwood
dataset [5], which is a trace of readings from 72
Mica2dot sensor nodes deployed throughout two 67 me-
ters high giant redwood trees in a grove. These sensor
nodes collected humidity, temperature and voltage read-
ings once every 5 minutes.
To evaluate the performance of the algorithm, we
make the algorithm estimate the non-missing data in
datasets, and compare the estimated values with their
corresponding real data. Before the algorithm is executed,
we repair the datasets first since there is many missing
data. First of all, we select some fragments of datasets as
candidate test dataset. These fragments contain as little
missing data as possible. Then, we replace the missing
data in the fragments with the average of the non-missing
data nearby and to get the test datasets without missing
data. Next, we label some data in test datasets as the
missing data randomly, and make the algorithm estimate
these dummy missing data. Due to the problem focused
by this paper is how to estimate the missing data as ac-
curately as possible, we use the accuracy of the estimated
values as the evaluation criteria of the algorithm. Spe-
cifically, we use Root Mean Square Error (RMSE):
2
ˆ
[() ]
it it
RMSEmean yy
where yit is the known value which is labeled as missing
data, is the estimated value of yit, and mean repre-
sents computing the average for all the data labeled as
missing value.
ˆit
y
we compare the effectiveness of the algorithm pro-
posed in this paper against three algorithms:
LIN method: This is a temporal correlation based
missing data estimation method which is based on the
linear interpolation model. For the missing data yit, the
estimated value given by method LIN can be ex-
pressed as
ˆit
y
ˆ()
iv iu
it iuu
vu
yy
yy tT
TT
 
where yiu and yiv are non-missing data whose collecting
moments are near to time t.
KNN method: This is a naive spatial correlation
based missing data estimation method. For the missing
data yit , KNN estimates it with the weighted average of
all neighbor nodes’ data. i.e. 1
ˆm
itk kt
k
y
wy
, where
ykt is the data of Nk Nb(i), wk is the normalized weight
coefficient which represents the similarity of the node Ni
and Nk . We use KNN as a baseline to show the effec-
tiveness of the algorithm proposed in this paper.
DESM method [24]: This method computes the
missing data based on the temporal-spatial correlation.
For the missing data yit, the estimated value given by
method DESM can be expressed as
ˆit
y
ˆit
y(1)
ˆ
(1) it
y
ˆ
()z
, where is the estimated value of yit computed
based on node Nj, Nj Nb(i), and
is the Pearson corre-
lation coefficient between Ni and Nj.
ˆ
z
Since the data sampling interval, the number of
neighbor nodes, and the number of missing data are the
main factors which affect the effectiveness of the miss-
ing data estimation algorithm, we use them as the ex-
periment parameters. In the experiments, the data sam-
pling interval varies from 1 to 30 minutes, and its default
value is 15 minutes. The number of neighbor nodes var-
ies from 4 to 12, and its default value is 8. The number of
the missing data varies from 1 to 30, and its default value
is 10. In all experiments, while changing a parameter, all
other parameters are set as their default values. Specifi-
cally, due to the data used in the experiments is collected
by the real WSN and the locations of sensor nodes in the
real WSN are changeless, the number of neighbor nodes
is in logical. In fact, varying the number of neighbor
nodes is equivalent to assuming the sensor node has dif-
ferent sensing radius, so that the number of a node’s
neighbor nodes is alterable.
4.1. Intel-Lab Dataset
Figure 2 and Figure 3 show the experimental results of
the algorithms on temperature and humidity data of the
Intel-lab dataset respectively. Figure 2(a) shows that the
estimation errors of the algorithms increase when pro-
longing the sensor node’s sampling time interval. This is
because all these algorithms estimate the missing data
based on the temporal correlation more or less. The in-
creasing of data sampling interval will decrease the tem-
poral correlation of sensor data, which results the algo-
rithms’ estimation errors increased, since the sensor data
may change greatly with a long time interval. Due to
algorithm LIN estimating the missing data according to
the temporal correlation absolutely, its estimation error
increases most when sampling time interval is enlarged.
While, DESM, KNN and AKE estimate the missing data
based on the spatial correlation more than temporal cor-
relation, so their estimation errors increase less. Specifi-
cally, due to AKE adopts the regression model and uses
the multiple neighbor nodes to estimate the missing data
jointly, its estimation error increases least.
Figure 2(b) shows that the estimation errors of the al-
gorithm KNN and AKE increase slightly with the num-
ber of neighbor nodes increasing. This is because KNN
and AKE estimating the missing data are based on the
multiple neighbor nodes. Due to the data used in the ex-
eriments is collected by the real WSN and the locations p
Copyright © 2010 SciRes. WSN
L. Q. PAN ET AL.
Copyright © 2010 SciRes. WSN
120
(a) (b) (c)
Figure 2. RMSE of the algorithms on temperature data of Intel-lab dataset. (a) RMSE vs. sampling interval; (b) RMSE vs.
of neighbor nodes; (c) RMSE vs. of missing data.
(a) (b) (c)
Figure 3. RMSE of the algorithms on humidity data of Intel-lab dataset. (a) RMSE vs. sampling interval; (b) RMSE vs. of
neighbor nodes; (c) RMSE vs. of missing data.
gorithms increase with the number of missing data in-
creasing. The reason is that much missing data will de-
crease the temporal correlation between the missing data
and the non-missing data, which results the algorithms’
estimation errors increased. Due to LIN estimates the
missing data according to the temporal correlation ab-
solutely, its estimation error increases most. While,
AKE is based on the spatial correlation more than the
temporal correlation, so its estimation error increases
less than that of LIN.
of sensor nodes in the real WSN are changeless, some
nodes farer in distance will be involved into missing data
estimation when increasing the number of neighbor
nodes in the experiments. Since the farer the distance
between the sensor nodes is, the lower the spatial corre-
lation of sensor nodes is, using the nodes farer in dis-
tance into the estimation equation will decrease the ac-
curacy of the estimated values. From Figure 2(b), we can
see that the estimation errors of AKE are always smaller
than those of KNN under different number of neighbor
nodes. This is because AKE describes the functional re-
lationship of different sensor nodes’ data by regression
model and estimates the missing data based on the func-
tional relationship of sensor data rather than using the
data of neighbor nodes simply which is adopted by KNN
method. So, AKE can estimate the missing data more
accurately than KNN.
Figure 3 shows the experimental results of the algo-
rithms on the humidity data, and the similar results can
be got. Being different from Figure 2, the estimation er-
rors of the algorithms on the humidity data are larger
than that on the temperature data. This is because the
correlation of humidity data is lower than that of tem-
perature data since it is more apt to be affected by some
environment factors.
From Figure 2(b), we can also see that the estimation
error of LIN and DESM is independent of the number of
neighbor nodes. This is because LIN estimates the miss-
ing data only using the data of itself and no neighbor
nodes data is involved. Similarly, since only one of the
neighbor nodes is used by DESM to estimate the missing
data, varying the neighbor nodes number has no impact
on the estimation error of DESM.
From Figure 2 and Figure 3, we can see that no matter
on the temperature data or the humidity data, the estima-
tion accuracy of AKE is always better than that of
DESM and KNN for all parameters. This is because
AKE estimates the missing data not only utilizing the
neighbor nodes jointly, but also exploiting the functional
relationship of sensor data. So, the estimation perform-
ance of AKE is the most stable. In addition, we can al-
Figure 2(c) shows that the estimation errors of the al-
L. Q. PAN ET AL.121
sosee that, with the increasing of the sampling time in-
terval and the number of missing data, the estimation
effect of AKE is much better than that of the other algo-
rithms. This is also because the same reasons.
4.2. Redwood Dataset
Figure 4 and Figure 5 show the experimental results of the
algorithms on temperature and humidity data of the Red-
wood dataset respectively. From these two figures, we can
see the similar experimental results with those of the In-
tel-lab dataset. The difference is that, on the Redwood
dataset, the performance of LIN decreases more greatly
when the sampling time interval or the number of missing
data increases. This is because the data of the Redwood
dataset is collected by the WSN deployed outdoors. The
data of outdoors changes more sharply and irregularly,
which makes the temporal correlation of the sensor data be
lower. Thus, the estimation performance of LIN is worse
on the Redwood dataset. Comparatively, due to AKE is
based on the spatial correlation more than the temporal
correlation, its performance remains relative stable.
From Figure 4 and Figure 5, we note that even KNN
which is a naive spatial correlation based missing data
estimation algorithm always outperforms LIN for all
parameters, especially on humidity data. Thus, we can
conclude that, for the data of changing non-smoothly, the
spatial correlation based missing data estimation algo-
rithms will perform better.
From Figure 4 and Figure 5, we can also see that the
performance gap between AKE and KNN is not too much
on the Redwood dataset, especially on humidity data.
This is mainly because the sensor data of outdoors
changes more sharply and irregularly, the sensor data is in
a low correlation. This decreases the advantage of the
regression equation, and hence shrinks the performance
gap between AKE and KNN. However, no matter in what
cases, we can see that AKE always performs the best.
5. Conclusions
Missing data causes many difficulties in various applica-
tions of WSNs. Whereas, it is inevitable due to the in-
herent characteristic of WSNs. To solve the problem, the
best way is to estimate the missing data as accurately as
possible. In this paper, a k-nearest neighbor based miss-
ing data estimation algorithm, called AKE, is proposed.
The algorithm is based on the spatial correlation more
than the temporal correlation of sensor data, and esti-
mates the missing data utilizing multiple neighbor nodes
(a) (b) (c)
Figure 4. RMSE of the algorithms on temperature data of redwood dataset. (a) RMSE vs. sampling interval; (b) RMSE vs.
of neighbor nodes; (c) RMSE vs. of missing data.
(a) (b) (c)
Figure 5. RMSE of the algorithms on humidity data of redwood dataset. (a) RMSE vs. sampling interval; (b) RMSE vs. of
neighbor nodes; (c) RMSE vs. of missing data.
Copyright © 2010 SciRes. WSN
L. Q. PAN ET AL.
122
jointly rather than independently. So, the estimation per-
formance of the algorithm is stable and reliable. In addi-
tion, the algorithm estimates the missing data by ex-
ploiting the functional relationship of sensor data rather
than using the sensor data directly, so, the estimated val-
ues computed by AKE are more accurate. Experimental
results on two real-world datasets show that the algo-
rithm proposed in this paper performs well both for the
data indoors and the data outdoors.
6. Acknowledgments
This work is partially supported by the Key Program of
the National Natural Science Foundation of China under
Grant No.60533110, the National Grand Fundamental
Research 973 Program of China under Grant No.2006
CB303005, the National Natural Science Foundation of
China under Grant No.60773063 and No.60703012, and
the NSFC-RGC of China under Grant No. 60831160525.
7
. References
[1] D. E. Cullar, D. Estrin, and M. Strvastava, “Overview of
sensor networks,” IEEE Computer, Vol. 37, No. 8, pp.
41–49, 2004.
[2] W. F. Fung, D. Sun, and J. Gehrke, “Cougar: the network
is the database,” In SIGMOD Conference, pp. 621, 2002.
[3] S. Madden, M. J. Franklin, J. M. Hellerstein, and W.
Hong, “The design of an acquisitional query processor for
sensor networks [C],” In SIGMOD. San Diego, Califor-
nia, 2003.
[4] Y. Yao and J. Gehrke, “The cougar approach to in-
network query processing in sensor networks,” In
SIGMOD Record, Vol. 31, No. 3, pp. 9–18, 2002.
[5] G. Tolle, “Sonoma redwoods data,” 2005. http://www.cs.
berkeley.edu/~get/sonoma/.
[6] S. Madden, “Intel Berkeley research lab data,” 2003.
http://berkeley.intel-research.net/labdata.
[7] X. Zhu, S. Zhang, J. Zhang, and C. Zhang, “Cost-
sensitive imputing missing values with ordering,” In
AAAI. Vancouver, Canada, pp. 1922–1923, 2007.
[8] N. A. Setiawan, P. A. Venkatachalam, and A. F. M. Hani,
“Missing attribute values prediction based on artificial
neural network and rough set theory,” In BMEI. Sanya,
Hainan, China, pp. 306–310, 2008.
[9] M. S. B. Sehgal, I. Gondal, L. Dooley, and R. L. Coppel,
“Ameliorative missing value imputation for robust
biological knowledge inference,” Journal of Biomedical
Informatics, Vol. 41, No. 4, pp. 499–514, 2008.
[10] M. S. B. Sehgal, I. Gondal, and L. Dooley, “Collateral
missing value imputation: A new robust missing value
estimation algorithm for microarray data,” Bioinformatics,
Vol. 21, No. 10, pp. 2417–2423, 2005.
[11] H. Kim, G. H. Golub, and H. Park., “Missing value
estimation for dna microarray gene expression data: local
least squares imputation[J],” Bioinformatics, Vol. 22, No.
11, pp. 1410–1411, 2006.
[12] O. G. Troyanskaya, M. Cantor, G. Sherlock, P. O. Brown,
T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman,
“Missing value estimationmethods for dna microarrays,”
Bioinformatics, Vol. 17, No. 6, pp. 520–525, 2001.
[13] C. Zhang, X. Zhu, J. Zhang, Y. Qin, and S. Zhang, “Gbkii:
An imputation method for missing values,” In PAKDD.
Nanjing, China, pp. 1080–1087, 2007.
[14] S. Zhang, J. Zhang, X. Zhu, Y. Qin, and C. Zhang,
“Missing value imputation based on data clustering,”
Transactions on Computational Science, Vol. 1, No. 1, pp.
128–138, 2008.
[15] A. Manjhi, S. Nath, and P. B. Gibbons, “Tributaries and
deltas: efficient and robust aggregation in sensor network
streams,” In SIGMOD Conference. Baltimore, Maryland,
pp. 287–298, 2005.
[16] A. Silberstein, K. Munagala, and J. Yang, “Energy-
efficient monitoring of extreme values in sensor
networks,” In SIGMOD Conference. Chicago, Illinois, pp.
169–180, 2006.
[17] D. J. Abadi, S. Madden, and W. Lindner, “Reed: robust,
efficient filtering and event detection in sensor networks,”
In VLDB, Trondheim, Norway, pp. 769–780, 2005.
[18] X. Yang, H. B. Lim, M. T. Ozsu, and K. L. Tan. “In-
network execution of monitoring queries in sensor
networks,” In SIGMOD Conference, Beijing, China, pp.
521–532, 2007.
[19] J. Considine, F. Li, G. Kollios, and J. Byers,
“Approximate aggregation techniques for sensor data-
bases,” In ICDE. Boston, MA, pp. 449–460, 2004.
[20] A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein,
and W. Hong, “Model-driven data acquisition in sensor
networks,” In VLDB, Toronto, Canada, pp. 588–599,
2004.
[21] A. Deshpande, C. Guestrin, W. Hong, and S. Madden,
“Exploiting correlated attributes in axquisitional query
processing,” In ICDE, Tokyo, Japan, pp. 143–154, 2005.
[22] D. Chu, A. Deshpand, J. M. Hellerstein, and W. Hong,
“Approximate data collection in sensor networks using
probabilistic models,” In ICDE. Atlanta, pp. 48, 2006.
[23] A. Silberstein, R. Braynard, C. S. Ellis, K. Munagala, and J.
Yang, “A sampling-based approach to optimizing top-k
queries in sensor networks,” In ICDE. Atlanta, pp. 68, 2006.
[24] Y. Li, C. Ai, W. P. Deshmukh, and Y. Wu, “Data
estimation in sensor networks using physical and
statistical methodologies,” In ICDCS, Beijing, China, pp.
538–545, 2008.
[25] H. Zhang, J. M. F. Moura, and B. H. Krogh. “Estimation
in sensor networks: A graph approach,” In IPSN, Los
Angeles, California, pp. 203–209, 2005.
[26] M. Halatchev and L. Gruenwald. “Estimating missing
values in related sensor data streams,” In COMAD,
Hyderabad, India, pp. 83–94, 2005.
[27] N. Jiang and L. Gruenwald, “Estimating missing data in
data streams,” In DASFAA, Bangkok, Thailand, pp.
981–987, 2007.
C
opyright © 2010 SciRes. WSN