Statistical Mining in Data Streams
A Dissertation submitted in partial satisfaction
of the requirements for the degree of
Doctor of Philosophy
in Computer Science
by Ankur Jain
Committee in Charge:
Prof. Edward Y. Chang, Chair
Prof. Divyakant Agrawal
Prof. Yuan-Fang Wang
December 2006
UMI Number: 3245961
Copyright 2006 by Jain, Ankur
All rights reserved.
UMI Microform 3245961
Copyright 2007 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346
Prof. Divyakant Agrawal
Prof. Yuan-Fang Wang
Prof. Edward Y. Chang, Committee Chairperson
October 2006
Copyright © 2006
by Ankur Jain
There are a lot of people whom I want to thank for their direct and indirect contributions in making this dissertation possible.

First of all, I would like to thank my advisor, Prof. Edward Y. Chang, for the guidance, support, and honest criticism he has provided throughout the course of this work. His energy and dedication have always been a great source of inspiration for me. I would also like to thank Prof. Yuan-Fang Wang, who helped me in my research work. I am fortunate to have learned so much from his overall attitude towards problem solving and his efforts to explain things clearly.

I would like to thank my friends Badri, Irfan, Kapil, Nagender, Amit, Nisheeth, Vibhore, Sumit, Sukhi, Gayatri, Shalini, Sirisha and Kavitha for supporting me over all these years, especially when it was most needed. I thank my colleagues Panda, Arun, Gang, Zhihua, Zoran, Yi, Raju and Kinsghy, who were always there for me whenever I needed some help or advice.

Lastly, and most importantly, I wish to thank my parents, R. K. Jain and Abha Jain, and sisters, Richa and Prachi, for their patience, love and support. To them I dedicate this thesis.
“Adaptive nonlinear clustering for data streams,” Ankur Jain, Z. Zhang & Edward Y. Chang. Proc. of ACM CIKM’06, Intl. Conf. on Information and Knowledge Management, Arlington, VA, USA.
Y. Chang & Yuan-Fang Wang. UCSB Technical Report, June 2006.
“Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention,” Ankur Jain, Dan Kopell, Kyle Kakligian & Yuan-Fang Wang. Proc. of IEEE CVPR’06, Conf. on Computer Vision and Pattern Recognition, New York.
“Adaptive stream resource management using Kalman Filters,” Ankur Jain, Edward Y. Chang & Yuan-Fang Wang. Proc. of the ACM SIGMOD’04, Intl. Conf. on Management of Data, Paris, France.
“Adaptive sampling for sensor networks,” Ankur Jain & Edward Y. Chang. Proc. of DMSN’04, Intl. Workshop on Data Management in Sensor Networks (in conjunction with VLDB), Toronto, Canada.
“Managing and Mining Video Sensor Data,” Edward Y. Chang, Ankur Jain, Navneet Panda, Yuan-Fang Wang, Gang Wu & Yi Wu. UCSB Technical Report, 2004.
Graduate Coursework
Computer Networks, Design & Analysis of Algorithms, Programming Languages and Implementation, Theory of Computation, Computational Geometry, Internet Computing and Web Technology, Bioinformatics, Advanced Computer Architecture, Advanced Algorithms for Multimedia Systems
Statistical Mining in Data Streams
Ankur Jain

Recent years have seen a steady rise of a new class of data management systems called Data Stream Management Systems (DSMS). These systems manage rapid, high-volume data streams with transient relations instead of static data with persistent relations. Data streams are common to applications such as network traffic and transaction monitoring systems, click-stream processors, industrial process control, and sensor networks. A DSMS operates on these continuous and time-varying data streams to facilitate on-the-fly query answering, and to support data acquisition, monitoring and analysis.

In this dissertation, we present statistical stream mining solutions for effective on-line processing of streaming data. We focus on research issues related to adaptive stream resource conservation and online mining in a DSMS. We have developed statistical linear and non-linear filtering techniques based on the Kalman Filter to capture temporal correlations in the streaming data. Such correlations help in stream resource conservation. We also propose techniques that capture spatial correlations between the streaming sources, which further improve resource conservation and facilitate answering group-queries in an efficient manner.

We also address issues related to online stream mining. Once the data stream arrives at a central server, effective mining techniques are necessary for stream analysis before the data can be discarded. Since a stream continuously evolves with time, stream mining techniques need to be adaptive and should operate under a given memory constraint. We propose adaptive clustering solutions that use the kernel trick to capture non-linear relations in the streaming data. We also present OCODDS, a change-detection approach that can track evolutionary changes in the stream in both linear and non-linear settings. Finally, we present our techniques for effective acquisition and processing of data streams common to video sensor networks.
List of Figures
1.1 Data Stream Mining
1.2 Contributions toward statistical stream mining
2 Kalman Filter in Stream Resource Management
2.1 Adaptive resource management for maximizing resource conservation
2.1.1 What is the Kalman Filter?
2.1.2 Contribution Summary
2.1.3 Related Work
2.1.4 The Kalman Filter
2.1.5 The Dual Kalman Filter Model
2.1.6 Why Kalman Filter?
2.1.7 Modeling Kalman Filter for Data Streaming Applications
2.1.8 Experimental Results
2.2 Adaptive resource management for maximizing query precision
2.2.1 Related Work
2.2.2 Our Framework
2.2.3 Results
2.3 Conclusion
3 Bayesian reasoning for sensor resource management and diagnosis
3.1 Related Work
3.2 Architecture and Model
3.2.3 Answering Diagnostic Queries
3.3 Experimental Validation
3.3.1 Experiment Setup
3.3.2 Resource Conservation
3.3.3 Query Answer Quality-Loss
3.3.4 Abnormality
3.3.5 Selectivity
3.4 Conclusion
4 Adaptive clustering in Data Streams
4.1 Related Work
4.2 Kernel Methods
4.3 Adaptive Non-linear Clustering Framework
4.4 Kernel Methods for Stream Clustering
4.4.1 Kernel Stream Segmentation (Tier-1)
4.4.2 Data Projection in LDS (Tier-2)
4.5 Experimental Evaluation
4.5.1 Datasets
4.5.2 Performance Results
4.6 Conclusions
5 OCODDS: Online change-over detection framework for tracking evolutionary changes in streaming data
5.1 Related Work
5.2 The OCODDS Framework
5.2.1 Finding the changeover location in a window
5.2.2 Conducting the hypothesis test
5.2.3 OCODDS using the kernel method
5.3 Experimental Evaluation
5.3.1 Comparative analysis with CUSUM
5.3.2 Effect of window and padding size
5.3.3 Effect of population variance ratio
5.3.4 Effect of location uncertainty (LU)
5.3.5 Effect of Noise
5.4 Conclusion
6.2 Technical Rationales
6.2.1 Off-line calibration
6.2.2 On-line selective focus-of-attention
6.3 Experimental Results
6.4 Conclusions
List of Figures
1.1 A Data Stream Management System (DSMS)
2.1 The DKF model
2.2 Architecture of DKF model
2.3 Moving-object dataset (Example 1)
2.4 Number of updates received at the central server (Example 1)
2.5 Average error produced by different KF models (Example 1)
2.6 Electric power load dataset (Example 2)
2.7 Number of updates received at the central server (Example 2)
2.8 Average error produced by different KF models (Example 2)
2.9 Network monitoring dataset (Example 3)
2.10 Comparative results for KF smoothing against moving average approach
2.11 Performance of DKF on smoothed data with $F = 10^{-7}$ (Example 3)
2.12 Performance of DKF for precision width δ = 10 (Example 3)
2.13 m on varying # of streaming sources
2.14 η on varying # of streaming sources
2.15 ERU on varying # of streaming sources
2.16 ERU on varying $\lambda_i$
2.17 ERU on varying $W_i$
3.1 Correlation model for NDBC data
3.2 Compact representation using BN
3.3 Sensor Network Architecture
3.4 Algorithm for computing candidate attributes set, Υ
3.5 Supplementary procedures for the algorithm shown in Figure 3.4
3.6 Resource conservation as a function of $\delta_{\min}$ and |Q|
3.7 Resource conservation as a function of $\delta_{\min}$
3.8 Resource conservation with $\delta_{\min} = 0.90$
3.11 Abnormality Detection
3.12 Selectivity with $\delta_{\min} = 0.90$
4.1 Linear separation using the kernel methods
4.2 Well-behaved data in the feature space (Network intrusion data for “ip-sweep” attack)
4.3 Overview of 2-tier clustering architecture
4.4 Adaptive non-linear clustering framework
4.5 Reuters stream pattern
4.6 Network-intrusion stream pattern
4.7 MNIST stream pattern
4.8 Cumulative Cluster Purity (u = 10)
4.9 Fraction of elements in significant clusters (u = 10)
4.10 Effect of dimensionality
4.11 EVD Computations (α = 75)
5.1 OCODDS vs CUSUM - Detection accuracy (LU = 1, m = 1, n = 30)
5.2 OCODDS vs CUSUM - Processing time (LU = 1, m = 1, n = 30)
5.3 Effect of window length (LU = 1, m = 1, $N_{rms}$ = 0)
5.4 Effect of R Accuracy with LU = 1, m = 1, $N_{rms}$ = 0, n = 50
5.5 Effect of padding length (LU = 1, n = 50, $N_{rms}$ = 0)
5.6 Effect of PVR on detection bias (LU = 1, m = 1, n = 50, varying PVR)
5.7 Effect of PVR on detection bias (LU = 1, m = 1, n = 50, varying R)
5.8 Effect of population variance on detection accuracy (LU = 1, m = 1, n = 50)
5.9 Effect of LU (m = 1, n = 50)
5.10 Effect of RMS noise (LU = 1, m = 1, n = 50, varying RMS noise)
5.11 Effect of RMS noise (LU = 1, m = 1, n = 50, varying R)
6.1 Computing the correct pan DOF (a) if optical and pan centers are collocated, and (b) if they are not
6.2 Error in centering assuming computational collocation of the optical center on the rotation axis
6.3 Selective focus-of-attention as a visual servo problem
6.4 Comparison of calibration accuracy as a function of experimental setup (using a CCD of 300 × 300 pixels)
6.5 Relation between requested and realized angle of rotation for Sony PTZ camera
models
6.8 Centering errors under various experimental conditions
6.9 Focus-of-attention experiments using real video
6.10 Focus-of-attention experiments using real video
List of Tables
1.1 Data stream applications
1.2 Popular Data Stream Management Systems (DSMS)
2.1 Summary of existing solutions and advantages of using the Kalman Filter
2.2 Symbols and their meanings
3.1 Relative Costs of Sensors for Real Datasets
6.1 Comparison of calibration accuracy. For [50], 51% of simulation runs failed to converge. If the simulation did converge, 85 iterations were needed on average.
A data stream is simply a continuous sequence of data elements $(x_1, x_2, x_3, \ldots, x_i, \ldots)$ that arrive in an on-line fashion, where each element $x_i$ could be a scalar or a vector entity. Babcock et al. presented the data stream model with the following salient features [18]:
• The data elements arrive on-line
• The system has no control over the order in which the data elements arrive
• Once an element has been seen or processed, it cannot be easily retrieved or seen
again unless it is explicitly stored in the memory
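The one-pass restriction in the last feature can be made concrete with a small sketch. The Python code below is an illustration rather than part of the original model: it maintains a running mean and variance using Welford's update, so each element is seen exactly once and only constant state is kept. The names `RunningStats` and `consume` are hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's update), O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until two elements have arrived.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def consume(stream):
    """Read each element once, in arrival order, without storing it."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

Because `consume` accepts any iterator, it works unchanged on a generator that yields elements as they arrive; nothing that has been consumed can be revisited.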
Data streams are common to a variety of data-intensive applications, some of which we have summarized in Table 1.1. In all these applications, it is not feasible to operate using a traditional database management system (DBMS), since it assumes persistent data relations and predominantly focuses on query optimization, data access (indexing) and privacy. A data stream management system (DSMS), on the other hand, must consider critical factors such as noise from the data sources, management of the system resources (both at the central and remote locations), and evolutionary changes in the data trends. Moreover, unlike a DBMS, a DSMS also handles issues related to data acquisition from the streaming sources, for example, reducing the computation and communication load on nodes in a sensor network while maximizing the total information gain. Since a data stream is potentially unbounded in size, and the DSMS engine operates under the constraints of finite memory and storage, the data stream query model is also different from that of a traditional DBMS. Some of these striking differences are:
• Approximate Query Results: Since access to the entire stream at any point of time is difficult (and often impossible), data stream query results are often reported in an approximate fashion. A sliding window paradigm is adopted, where results are computed using only as much historical data as a window can hold. For example, given a stock market feed, a user might ask “What is the average value of the YAHOO stock index over the last 30 seconds?” In this case, the sliding window will hold data received only in the last 30 seconds.
• Continuous Queries (query registration): A user registers her query with the DSMS only once; however, the query result can be a data stream in itself. Considering the stock market example again, a user might ask “Report the value of the YAHOO stock index, whenever it goes higher than its mean over a 30-second sliding window.” This will result in a stream of mean stock values.
• Online Adaptivity: Since data in a stream arrives online, a true stream processing algorithm should also operate online. Moreover, since the stream can exhibit evolutionary behavior, an effective stream-processing algorithm should be able to analyze such evolutionary changes. For example, to improve the overall accuracy of query results, data could be sampled at a higher rate for stocks that have been showing increased activity on a particular day (say IT stocks) than for other stocks (say Finance stocks).

Table 1.1: Data stream applications.
Application | Example queries
Financial stock tickers | “Find stocks with more than 5% gain in last 30 minutes.”
Transaction log analysis | “Determine number of distinct base stations used for the last 5 longest cellular phone calls.”
Network monitoring and traffic engineering | “Report the average number of HTTP packets seen on a network link in.”
Sensor networks | “Continuously monitor the standard deviation of the temperature in a nuclear reactor.”
Video surveillance | “Determine areas in a parking lot with unusual activity.”
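The sliding-window and continuous-query ideas described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the query engine of any particular DSMS: ticks are assumed to be (timestamp in seconds, value) pairs arriving in time order, and the names `SlidingWindowMean` and `above_window_mean` are hypothetical.

```python
from collections import deque


class SlidingWindowMean:
    """Mean of the values received during the last `horizon` seconds."""

    def __init__(self, horizon):
        self.horizon = horizon
        self._items = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def _evict(self, now):
        # Drop elements that have slid out of the window (now - horizon, now].
        while self._items and self._items[0][0] <= now - self.horizon:
            _, old = self._items.popleft()
            self._total -= old

    def add(self, timestamp, value):
        self._evict(timestamp)
        self._items.append((timestamp, value))
        self._total += value

    def mean(self, now):
        self._evict(now)
        return self._total / len(self._items) if self._items else None


def above_window_mean(ticks, horizon=30.0):
    """Continuous query: registered once, it yields a result stream.

    Emits (timestamp, value, window_mean) whenever a new value exceeds the
    mean of the window that precedes it.
    """
    window = SlidingWindowMean(horizon)
    for timestamp, value in ticks:
        mean = window.mean(timestamp)
        if mean is not None and value > mean:
            yield timestamp, value, mean
        window.add(timestamp, value)
```

The generator mirrors query registration: the caller sets it up once and then consumes results as they are produced. Memory is bounded by the number of elements that fall inside one window, not by the length of the stream.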
In Figure 1.1, we show the architecture of a typical DSMS. The data sources (on the right) forward data to the central processor (in the middle). Data could be continuously pushed to the central processor, or could be pulled from the sources as and when required [37]. The central processor operates on the received data under the constraints put forth by the user (on the left). These constraints may include the sliding window size, query answer precision values, etc. Once the data have been processed, the central processor may choose to discard the data, or to store data summaries (synopses).

Figure 1.1: A Data Stream Management System (DSMS)

There are many interesting research issues related to different aspects of a DSMS. For example, at the central site, the DSMS should address issues related to resource management, query processing, etc., and at the remote site, it should address issues related to sensor calibration (e.g., camera calibration in a video sensor network [26]), data acquisition, and efficient data sampling and forwarding techniques. Due to this wide spectrum of open problems related to data stream processing at different levels, stream research has been studied with increasing interest over the past few years. The task of building an efficient and effective DSMS has been undertaken at different universities with different performance goals. Table 1.2 summarizes the salient features of some of these popular projects. In the remainder of this section, we briefly survey related research in this area.

Table 1.2: Popular Data Stream Management Systems (DSMS).
DSMS | University affiliations | Summary
STREAM | Stanford University | A general-purpose DSMS with an SQL-like query interface. Supports continuous query registration and processing.
Aurora | MIT, Brandeis University and Brown University | Real-time stream processing engine for continuous sensor data processing. Primary focus is on operator scheduling and load shedding.
Cougar | Cornell University | A distributed sensor database system. Supports in-network query processing and effective query plan generation.
TelegraphCQ | UC Berkeley | A DSMS with continuous data-flow processing and adaptive query processors.
A significant body of research aims to develop effective query languages for a DSMS. These query languages have a declarative SQL-like interface, and they support constructs particular to data-stream queries (for example, specification of the sliding window size). ESL [88] is one such stream query language under development at UCLA. Stanford’s CQL is another example of a DSMS query language [15]. Given a stock-ticker stream stock(ticker, value), a query to select all tickers having a price of more than $10 in the last 5 seconds can be formulated in CQL as follows:
SELECT ticker
FROM stock [RANGE 5 SECONDS]
WHERE value > 10
In-network processing [32] and load-shedding techniques have been devised to reduce data forwarding from the sources in an effort to minimize the computational load on the query processor and the communication effort needed to transfer data to the central server. The work in [124] inserts context-sensitive drop operators in query plans for dropping data tuples. In [119], the authors propose an efficient filter operator placement technique, wherein the placement node is selected based on the computational capability of the node, as well as the selectivity and cost of the operator in question.
A significant area of data-stream research focuses on building efficient query processors for data streams [97, 22] that optimize memory and storage requirements while meeting QoS specifications. There have also been research efforts devoted to data sampling techniques from a sliding window [19] and to the creation of an effective data stream synopsis [45].
1.1 Data Stream Mining
In addition to stream query processing, the DSMS also undertakes the relatively more challenging task of stream mining, or data stream analysis. Although data mining is a well-established field in itself, direct application of mining algorithms to data streams is often unsuitable because it is impossible to maintain all the stream elements in memory. Moreover, as new data arrives online, the mining algorithms must adapt to the changing trends. Hence, approximation and adaptivity are the key ingredients of any data stream mining algorithm.
While data-stream querying research is relatively older, data stream mining has received more of the researchers' attention in the past few years. Stream querying nevertheless shares interesting inter-dependencies with stream mining. For one, stream-mining solutions can help improve the performance of a DSMS: for example, an effective stream trend prediction technique can help in the placement of query operators and also reduce the data-forwarding load in a sensor network. However, stream querying solutions are often general-purpose, whereas mining solutions are application-specific and are particular to the problem semantics. For example, a particular approach for stream trend prediction might work well for a “stock ticker” stream and at the same time show no performance enhancement for an “Internet traffic” stream. Moreover, mining queries are often represented in an abstract fashion (e.g., “Determine areas of unusual activity in a parking lot”). Some of the common stream-mining tasks include, but are not limited to [36]:
• Multi-dimensional on-line analysis [39]
• Mining spatial and temporal correlations in streaming data trends [51, 68]
• Mining novelty, outliers and anomalous behavior [7]
• On-line adaptive clustering and classification [10, 8]
• Frequent pattern matching [46, 16]
1.2 Contributions toward statistical stream mining
The primary focus of this dissertation is to develop efficient and effective statistical stream mining solutions that improve the performance of a DSMS. In addition to being on-line and adaptive, our proposed solutions handle issues related to non-linearity in data relations, an issue largely ignored by the data stream research community in the past.

Non-linearity in data stream applications manifests itself in different forms. Consider the problem of data stream trend analysis as an example. To save communication resources, it is often desirable to predict the future attribute value at the central query server based on the current trend, rather than acquiring fresh values from the remote streaming source. Thus, the performance of the system depends directly on the reliability of the prediction scheme. If the trend has non-linear characteristics, it becomes increasingly difficult to make reliable predictions. Adaptive on-line stream classification/clustering is another interesting example. Real-world data is often separated by non-linear class boundaries, which are hard to delineate by simple classification algorithms. The problem gets even more challenging when considered in the data stream setting, due to unavailability of the stream in its entirety and limited memory.

In this dissertation, we propose efficient adaptive solutions that address the problems arising due to the non-linearity in data relations while operating in limited memory. In particular, we focus on the following problems:
• Communication resource conservation while addressing approximate user queries
• On-line trend analysis
• Effective data acquisition from streaming sources
All of our proposed solutions operate in limited memory under the data stream model presented in Figure 1.1. We further simplify the data stream model by assuming that the elements in the stream arrive in order and that we access them sequentially. However, when we incorporate sliding-window techniques, we assume that random access to any element in the sliding window is possible. In the rest of this section, we provide an overview of our proposed approaches and a roadmap to the rest of the thesis.
In Chapter 2, we study how linear and non-linear filtering techniques help in reducing communication overhead while answering approximate user queries. A major research area in stream management is the allocation of resources (such as network bandwidth) to query plans. The objective is to:
1. minimize resource usage under a user-query precision requirement, or
2. maximize precision of query results under resource constraints.
Both of these problems can be perceived as fundamentally filtering problems, in which the objective is either to filter out as much data as possible at the source to conserve resources (provided that a user-specified query-precision constraint is met at the server), or to maximize the precision while filtering data adaptively at the remote site, such that the overall resource usage stays under a user-specified resource constraint. To date, many solutions have been proposed; however, most are ad hoc, with hard-coded heuristics to generate query plans. We select the Kalman Filter as a general and adaptive filtering solution for conserving resources. The Kalman Filter has the ability to adapt to various stream characteristics, sensor noise, and time variance. Furthermore, we realize a significant performance boost by switching from traditional methods of caching static data (which can soon become stale) to our method of caching dynamic procedures that can predict data reliably at the server without the clients’ involvement. Through examples and empirical studies, we demonstrate the flexibility and effectiveness of using the Kalman Filter (in both linear and non-linear flavors) as a solution for managing trade-offs between precision of results and resources in satisfying stream queries (case (1)). To address case (2), we also propose a novel adaptive sampling technique with which we maximize query precision given the available network bandwidth. This problem is prevalent in systems where a large network of sensors delivers continuous data to a central server, and the objective is to maximize the total information gain given the total communication overhead the network can bear. Our approach employs a Kalman Filter (KF)-based estimation technique wherein the sensor can use the KF estimation error to adaptively and autonomously adjust its sampling rate within a given range. When the desired sampling rate violates the range, a new sampling rate is requested from the server. The server allocates new sampling rates under the constraint of available resources such that the KF estimation error over all the active streaming sensors is minimized.

In Chapter 3, we present solutions for addressing user queries in a different setting, where data is not continuously pushed by the sources to the server; instead, it can be pulled from the sources as and when required. However, each pull of the data comes at some cost of resource usage. This setting is typical in sensor networks, where acquisition of data from a sensor comes at the cost of deterioration of its battery life. Using the correlations and dependencies that could be prevalent in the sensor attribute values, we propose a scheme for modeling sensor networks using Bayesian Networks (BN), where each node represents a sensor, and a directed edge represents the influence relationship between two sensors. We show that by taking advantage of the Markov-blanket property of BNs, we can generate resource-conserving group-query plans, and also address a new class of diagnostic queries. When multiple sensors are queried, the queries can be processed collectively as a single group-query that exploits inter-attribute dependencies for deriving cost-effective query plans.
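As a small illustration of the Markov-blanket property mentioned above, the Python sketch below computes the blanket of a node in a directed acyclic graph: its parents, its children, and the other parents of its children. Given the values of the blanket, a node is conditionally independent of every other node in the network. The function name, the graph encoding, and the sensor attribute names are hypothetical; how Chapter 3 turns blankets into cost-based query plans is not reproduced here.

```python
def markov_blanket(parents, node):
    """Return the Markov blanket of `node` in a Bayesian network.

    `parents` maps each node to the set of its parent nodes (a DAG).
    """
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    blanket = set(parents.get(node, ())) | children | co_parents
    blanket.discard(node)  # a node is never part of its own blanket
    return blanket


# Hypothetical influence graph over sensor attributes (illustration only).
sensor_parents = {
    "air_temp": set(),
    "air_pressure": set(),
    "wind_speed": {"air_temp"},
    "water_temp": {"air_temp"},
    "wave_height": {"wind_speed", "air_pressure"},
}
```

In this toy graph, the blanket of `wind_speed` contains `air_temp`, `wave_height` and `air_pressure`, but not `water_temp`, so a plan that already holds the blanket values gains nothing about `wind_speed` by also acquiring `water_temp`.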
In addition to answering user queries, a data stream management system should be able to analyze the stream automatically and mine interesting patterns. In Chapter 4, we shift our focus to stream clustering, a problem that has emerged as a challenging and interesting one over the past few years. Due to the evolving nature of streams and the one-pass restriction imposed by the data stream model, traditional clustering algorithms are inapplicable for stream clustering. This problem becomes even more challenging when the data are high-dimensional and the clusters are not linearly separable in the input space. We propose a non-linear stream clustering algorithm that adapts to the stream’s evolutionary changes. Using kernel methods to deal with the non-linearity of data separation, we propose a novel 2-tier stream clustering architecture. Tier-1 captures the temporal locality of the stream by partitioning it into segments, using a kernel-based novelty detection approach. Tier-2 exploits this segment structure to continuously project the streaming data non-linearly onto a low-dimensional space (LDS) before assigning them to a cluster.
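The kernel trick that underlies this architecture can be illustrated with a short sketch. The Python code below computes the squared distance between the feature-space image of a point and the feature-space mean of a segment using only kernel evaluations, without ever forming the non-linear mapping explicitly. It is a generic illustration, not the Tier-1 or Tier-2 algorithm of Chapter 4; the RBF kernel, the `gamma` value, and the function names are assumptions.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def feature_space_sq_distance(x, members, kernel=rbf_kernel):
    """Squared distance from phi(x) to the mean of phi(m) over `members`.

    Expands ||phi(x) - mu||^2 = k(x, x) - (2/n) * sum_i k(x, m_i)
                                + (1/n^2) * sum_i sum_j k(m_i, m_j),
    so only kernel values are needed.
    """
    n = len(members)
    k_xx = kernel(x, x)
    k_xm = sum(kernel(x, m) for m in members) / n
    k_mm = sum(kernel(a, b) for a in members for b in members) / (n * n)
    return k_xx - 2.0 * k_xm + k_mm
```

A point whose distance to the current segment's feature-space mean is large could be treated as a candidate novelty; where to set such a threshold is an application choice that this sketch leaves open. The double sum makes the cost quadratic in the segment size, which is one reason a bounded segment size matters in a streaming setting.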
Since the streaming data distribution exhibits evolutionary changes that could possibly be non-linear, one of the fundamental problems in stream mining is to track such changes effectively and efficiently. This necessitates a framework that can automatically monitor evolutionary changes in the streaming data distribution and raise an alert when a significant change is detected. In Chapter 5, we present OCODDS: On-line ChangeOver Detection framework for Data Streams. OCODDS offers real-time, high-accuracy changeover detection while being completely oblivious to the shape and parameters of the underlying distribution. We also present OCODDS’s tie-in with the kernel method, which allows OCODDS to operate in any high-dimensional space using implicit non-linear projections, without changes to its original framework. This tie-in allows for even higher changeover detection accuracy at a slightly higher computational effort. Our approach offers a theoretical guarantee of being an unbiased estimator of the changeover point under certain conditions, and it offers better change-detection accuracy than traditional algorithms such as CUSUM.
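For readers unfamiliar with the baseline named above, the following Python sketch shows a textbook one-sided CUSUM detector for an upward shift in the mean. It is not OCODDS, and it is not necessarily the CUSUM variant used in the comparison of Chapter 5; the function name and the parameters `target_mean`, `slack` and `threshold` are assumptions chosen for illustration.

```python
def cusum_detect(stream, target_mean, slack, threshold):
    """Return the index of the first alarm, or None if no alarm is raised.

    The statistic accumulates deviations above target_mean + slack and is
    reset to zero whenever it would become negative.
    """
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return i
    return None
```

Note that this detector needs the pre-change mean and a slack value to be supplied in advance, which is the kind of distributional knowledge the text says OCODDS does not require. The alarm is also raised some elements after the true changeover, so locating the changeover point itself is a separate estimation step.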
In Chapter 6, we attempt to address some of the issues associated with a different kind of sensor network: one that has cameras as the sensors. Such networks are also referred to as video sensor networks (VSN) [26]. Video sensor networks have certain characteristics that differentiate them from a general-purpose sensor network, some of which are as follows:
• A large volume of continuous data is produced by the sensors, most of which must be processed at the sensor site. For example, the video sensor may have to process a high-resolution video stream to produce a simple stream of 3-dimensional moving-object coordinates.
• VSNs operate through the collaborative effort of computer vision, computer graphics, networking, and data stream processing techniques.
In Chapter 6, we present a prototype video surveillance system that uses stationary-dynamic (or master-slave) camera assemblies to achieve wide-area surveillance and selective focus-of-attention. We address two critical issues in deploying such a VSN: (1) off-line camera calibration and (2) on-line selective focus-of-attention. We formulate the selective focus-of-attention problem as a feedback-loop visual-servo problem, which is non-linear. We show that our methods effectively process the stream of moving-object coordinates to continuously focus on an object by adaptively adjusting the PTZ camera parameters. Since the performance (focus-of-attention accuracy) of such a system is heavily dependent on the quality of information produced by the video sensors, we also propose video-sensor calibration methods that calibrate all the degrees of freedom of a video sensor in closed form. Calibration is important because VSNs employ numerous cheap video sensors that do not come with high-quality mechanical designs. Although these calibration procedures are off-line and are not directly related to the theme of this dissertation, we provide them for completeness and for a better understanding of the other methods proposed in Chapter 6.
Kalman Filter in Stream Resource Management

[...] to the server. If the server can answer queries within specified precision constraints, these methods do not enact data communication. Indeed, these methods have been shown to be effective for reducing network-bandwidth consumption, thereby also reducing the storage and processing loads at the server.

The problem of data stream resource management can be perceived from two different aspects:
1. Minimizing resource usage under query precision constraints.
2. Maximizing query precision under given resource constraints.
In our work, we treat stream resource management as fundamentally a filteringproblem We believe that an adaptive and general stream filtering solution can addressboth cases (1) and (2) effectively We advocate the use of the Kalman Filter (KF) [81]for stream-filtering, sinceKF has been well studied and widely applied to many datafiltering and smoothing problems We present our methods for addressing cases (1) and(2) in Sections 2.1 and 2.2 respectively We incorporate the Kalman Filter as the basicbuilding block of a stream management system for the following two reasons:
• Traditional methods cache static data that easily become stale over time. This necessitates frequent and expensive synchronization between clients and servers through retransmission. Our method, on the other hand, caches filter parameters that enable dynamic and accurate system prediction on the server without clients’ intervention.
• As will be fully explained in Section 2.1.6, the Kalman Filter can be easily customized to handle varying stream characteristics, sensor noise, and time variance to meet the requirements specified in [97]. The same filtering framework can be adapted to address a wide variety of stream resource management problems, providing a unified paradigm that is both powerful and versatile.
2.1 Adaptive resource management for maximizing resource conservation
An effective algorithm to address this problem is one that filters out a maximum amount of data as long as the precision constraints are met at the server. A major shortcoming of the existing solutions is that they are often ad hoc, as explained in [13], and highly application-dependent. No unified solution has yet been developed for managing stream resources. We introduce our Dual Kalman Filter (DKF) architecture as a general and adaptive solution to the stream-resource-conservation maximization problem.
To further emphasize the need to conserve network bandwidth, let us consider a typical wireless sensor monitoring system. In applications such as moving-object tracking, weather monitoring, and video surveillance, the power dissipation rate of a wireless sensor node is an issue of primary concern. It has been established [118, 135] that the majority of power dissipation occurs when transmitting bits over wireless networks, not when processing them. The ratio of the energy spent in sending one bit over a network to that spent in executing one instruction is between 220 and 2,900 on various architectures [109, 110]. Thus, filtering data at sensors is beneficial not only for conserving bandwidth, but also for conserving power. Since the computational cost incurred by the KF is insignificant in many practical sensing scenarios, the KF is an attractive option as the filtering solution for resource conservation.

[Figure 2.1: Flowchart of the DKF model, with a User, a Central Server (running KFs), a Remote Source (running KFm), and a Streaming Source. At the remote source: is the prediction of the central server outside the given query precision? YES: update the central server with the new value. NO: drop the data tuple. At the central server: is an update available from the remote source? YES: send the update received from the remote source. NO: send the prediction from the server KFs.]
Figure 2.1 depicts the role of our proposed DKF model in a typical DSMS architecture. A user (on the left-hand side of the figure) issues a query to the server with some precision constraints. The server activates a KF, denoted as KFs, and at the same time, the target sensor activates a mirror KF with the same parameters, denoted as KFm. The dual filters KFs and KFm predict future data values. Only when the filter at the remote source, KFm, fails to predict future data within the precision constraint (and thus KFs cannot provide an accurate prediction at the server) does the sensor send updates to KFs. Significant bandwidth conservation can be achieved if a reliable and accurate data prediction mechanism is employed. We propose the KF as such a mechanism for its simplicity, efficiency, and provable optimality under fairly general conditions.
2.1.1 What is the Kalman Filter?
The Kalman Filter is a stochastic, recursive data filtering algorithm. It has been widely used in predicting a system’s internal state based on the observation of its external behavior. For stream management applications, a stream is modeled as a generative process (a streaming model) controlled by the stream’s internal parameters (state), which evolve over time. The state may or may not be directly observable, and hence has to be inferred by observing the system’s external behavior. The observed data stream serves as the external observation that is used to estimate the internal state of the stream’s generative process. The state estimation process operates using recursive steps of prediction (propagating the internal state of the system) and correction (fine-tuning the prediction with external observation) [34, 81, 130].
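The predict/correct recursion can be illustrated with a minimal scalar sketch. It assumes a random-walk streaming model (the state stays roughly constant between steps) with process-noise variance `q` and measurement-noise variance `r`; the function name `kalman_1d` and all parameter values are hypothetical choices for illustration, not the formulation used later in this dissertation.

```python
def kalman_1d(measurements, q=1e-3, r=0.5, x0=0.0, p0=1.0):
    """Scalar Kalman filter for an assumed random-walk model.

    q, r: assumed process and measurement noise variances.
    x0, p0: initial state estimate and its variance.
    Returns the corrected estimate after each measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Prediction: the random-walk model leaves the state unchanged,
        # but the uncertainty grows by the process-noise variance.
        p = p + q
        # Correction: blend prediction and observation using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates
```

Feeding the function a stream of identical readings moves the estimate steadily from `x0` toward the observed value, while the gain shrinks as the filter becomes more confident in its internal state.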
A concrete example is the problem of estimating the state of a moving vehicle. The system state comprises the current location and velocity of the vehicle, and is represented by a vector $x_k$ at a discrete time step $k$. The system evolves over time due to the driver’s acceleration and braking actions, and road friction of a random nature. Hence, the time evolution of the system’s state is governed by the equation

$$x_{k+1} = A x_k + B a_k + w_k, \qquad x_k = \begin{bmatrix} p_k \\ c_k \end{bmatrix}, \quad A = \begin{bmatrix} 1 & T \\ 0 & 1 \end{bmatrix}, \quad B = \begin{bmatrix} T^2/2 \\ T \end{bmatrix} \tag{2.1}$$

where $w_k$ is process noise, matrices $A$ and $B$ relate the state at $k$ to that at $k+1$, $a_k$ is the time-varying acceleration, $c_k$ is the velocity, and $T$ is the time between step $k$ and step $k+1$. Now, suppose at discrete time intervals we can measure the position $p_k$. Then our measurement $z_k$ at time $k$ can be denoted as

$$z_k = p_k + \nu_k$$

where $\nu_k$ is the measurement noise inherent in all measurement processes.
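The vehicle model can be sketched in a few lines of Python. This is only an illustration for a single coordinate under standard constant-acceleration kinematics; the function names `propagate` and `measure`, and the way noise is passed in, are assumptions made for this sketch.

```python
import random


def propagate(p, c, a, T, process_noise=(0.0, 0.0)):
    """One step of the state equation for a single coordinate.

    p, c: position and velocity at step k.
    a: acceleration applied during the step.
    T: time between step k and step k + 1.
    process_noise: (hypothetical) additive noise on position and velocity.
    """
    p_next = p + T * c + 0.5 * T * T * a + process_noise[0]
    c_next = c + T * a + process_noise[1]
    return p_next, c_next


def measure(p, noise_std=0.0, rng=random):
    """Noisy position measurement: true position plus Gaussian noise."""
    return p + rng.gauss(0.0, noise_std)
```

Only the position is observed; the velocity has to be inferred by the filter from successive position measurements.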
Given hints on the system state through the state propagation and external observation mechanisms, the KF integrates all the information to arrive at the best estimate of the system state over time. The KF weighs all available information by taking into consideration the noise in external measurement and the uncertainty in state propagation. More specifically, let us assume that the state propagation uncertainty $w_k$ is white Gaussian with a covariance matrix $Q$. The measurement noise $\nu_k$ is white Gaussian with a covariance matrix $R$, and it is not correlated with the noise in state propagation. The formulation of the KF algorithm provides us with the following statistical properties:
1. The expected value of the KF estimate is equal to the expected value of the state. That is, on average, the estimate of the state will equal the true state. In other words, the KF is an unbiased estimator.
2. Of all linear estimation algorithms, the KF algorithm minimizes the variance of the estimation error (equivalently, its mean squared value, since the estimate is unbiased). That is, on average, it gives the smallest possible variance in estimation error. In other words, the KF is the linear estimator that can deliver the most consistent estimation results.
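The two properties can be stated compactly in symbols. The notation is an assumption introduced here for illustration: $\hat{x}_k$ denotes the KF estimate of the true state $x_k$, and $\mathcal{L}$ denotes the set of estimators that are linear in the measurements.

```latex
% Property 1: the KF estimate is unbiased.
\mathbb{E}[\hat{x}_k] = \mathbb{E}[x_k]
\quad\Longleftrightarrow\quad
\mathbb{E}[x_k - \hat{x}_k] = 0

% Property 2: among linear estimators, the KF minimizes the mean squared error.
\hat{x}_k = \arg\min_{\tilde{x}_k \in \mathcal{L}}
\mathbb{E}\left[ \lVert x_k - \tilde{x}_k \rVert^2 \right]
```

Because the error has zero mean by the first property, minimizing its mean squared value is the same as minimizing its variance.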
Continuing with our example of tracking a vehicle as it moves in a two-dimensional space, the vehicle might be able to provide rapid updates of its position to the central server as it moves (e.g., if it is equipped with a GPS positioning system). An inherent limitation is that the exact object location cannot be updated continuously due to the limited bandwidth and battery power of the remote devices. Thus, approximate answers to the queries of the vehicle’s position are acceptable. A promising solution is to maintain a precision bound of width δ at the remote source and update the server whenever the true value deviates by more than δ units from the server value, as proposed in [101, 104]. We use our dual Kalman Filter approach to accomplish this (system equations are discussed in detail in Section 2.1.7).
Again, let us use the schematic diagram in Figure 2.1 to explain. Suppose a user query comes with a precision constraint δ. Our system will activate a Kalman Filter KFs at the central server and its mirror Kalman Filter KFm at the remote site. The remote source keeps track of the server prediction at time step $k$ (note that this does not require any extra memory except for the usual matrices of the KF) and filters out the data (does not forward it to the central server) if the prediction at the central server deviates from the observed value by less than a margin of δ. Notice that KFs, after receiving the first few measurements from the remote source, would have established a good estimate of the state vector $(p_k, c_k)$. Based on that, the server can compute the rate of change of the X and Y locations using Eq. 2.1, and would require fewer updates from the remote source. Updates are needed only when sudden acceleration or braking actions induce a large error in the state estimate, or when noise gradually corrupts the state prediction to such a degree that a refresh is necessary. For tracking and recreating a vehicle’s locations, the server does not need to record any information other than the X and Y coordinates of the vehicle and the Kalman Filter matrices.
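The update-suppression protocol described above can be sketched as follows for a single coordinate. This is a hedged illustration, not the system equations of Section 2.1.7: the class `ConstantVelocityKF`, the function `run_dkf`, the noise variances `q` and `r`, and the initial covariance are all assumptions. The sketch also assumes that both filters apply a correction only when a reading is actually transmitted, which keeps the mirror filter identical to the server filter.

```python
class ConstantVelocityKF:
    """Kalman filter for one coordinate: state is (position p, velocity c)."""

    def __init__(self, T=1.0, q=0.01, r=1.0):
        # T: step length; q, r: assumed process and measurement noise variances.
        self.T, self.q, self.r = T, q, r
        self.p, self.c = 0.0, 0.0
        self.P = [[100.0, 0.0], [0.0, 100.0]]  # large initial uncertainty

    def predict(self):
        """Propagate state and covariance one step; return predicted position."""
        T, P = self.T, self.P
        self.p += T * self.c
        # P <- A P A^T + Q with A = [[1, T], [0, 1]] and Q = q * I.
        self.P = [
            [P[0][0] + T * (P[0][1] + P[1][0]) + T * T * P[1][1] + self.q,
             P[0][1] + T * P[1][1]],
            [P[1][0] + T * P[1][1], P[1][1] + self.q],
        ]
        return self.p

    def correct(self, z):
        """Fold a position measurement z into the estimate."""
        P = self.P
        s = P[0][0] + self.r               # innovation variance
        k0, k1 = P[0][0] / s, P[1][0] / s  # Kalman gain
        y = z - self.p                     # innovation
        self.p += k0 * y
        self.c += k1 * y
        self.P = [
            [(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
            [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]],
        ]


def run_dkf(readings, delta):
    """Return (answers available at the server, number of transmitted updates)."""
    kf_s, kf_m = ConstantVelocityKF(), ConstantVelocityKF()
    answers, sent = [], 0
    for z in readings:
        server_pred = kf_s.predict()
        mirror_pred = kf_m.predict()  # same value as server_pred by construction
        if abs(z - mirror_pred) > delta:
            # Prediction violates the precision bound: transmit the reading.
            sent += 1
            kf_m.correct(z)
            kf_s.correct(z)
            answers.append(z)
        else:
            # Reading is filtered out; the server answers with its prediction.
            answers.append(server_pred)
    return answers, sent
```

By construction, every answer at the server is within δ of the true reading, and for a stream that follows the assumed motion model closely, most readings should never need to leave the remote source.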
In addition to perceiving and formulating stream resource management as fundamentally a filtering problem, and proposing the Kalman Filter as a general and adaptive solution, the specific contributions of this thesis are as follows:
1. We present a comparative analysis of our model versus the existing techniques and discuss different applications where the Kalman Filter has been successfully incorporated (Section 2.1.3).
2. We present the mathematical formulation of the filter and discuss how the filter formulation can be easily customized for a large variety of problem formulations in stream management (Section 2.1.4).
3. We propose our dual Kalman Filter model (in Section 2.1.5) and discuss how constraints (query precision and smoothing factors) provided by the user are used to install and set initial parameters for the Kalman Filters.
4. Through examples and empirical studies (in Sections 2.1.7 and 2.1.8), we show that employing the Kalman Filter can facilitate and support different query scenarios. In terms of bandwidth conservation, the Kalman Filter is at least as good as traditional approaches, if not better. More important, its generality and adaptivity show promise as a building block for stream applications that concern fusion and integration.
5. In Section 2.3 we advance a set of promising extensions to build upon this work.