Doctoral dissertation: Statistical Mining in Data Streams


Statistical Mining in Data Streams

A Dissertation submitted in partial satisfaction

of the requirements for the degree of

Doctor of Philosophy

in Computer Science

by Ankur Jain

Committee in Charge:

Prof. Edward Y. Chang, Chair

Prof. Divyakant Agrawal

Prof. Yuan-Fang Wang

December 2006

UMI Microform 3245961

Copyright 2007 by ProQuest Information and Learning Company


Prof. Divyakant Agrawal

Prof. Yuan-Fang Wang

Prof. Edward Y. Chang, Committee Chairperson

October 2006


Copyright © 2006

by Ankur Jain


There are a lot of people whom I want to thank for their direct and indirect contributions in making this dissertation possible.

First of all, I would like to thank my advisor, Prof. Edward Y. Chang, for the guidance, support, and honest criticism he has provided throughout the course of this work. His energy and dedication have always been a great source of inspiration for me. I would also like to thank Prof. Yuan-Fang Wang, who helped me in my research work. I am fortunate to have learned so much from his overall attitude towards problem solving and his efforts to explain things clearly.

I would like to thank my friends Badri, Irfan, Kapil, Nagender, Amit, Nisheeth, Vibhore, Sumit, Sukhi, Gayatri, Shalini, Sirisha and Kavitha for supporting me over all these years, especially when it was most needed. I thank my colleagues Panda, Arun, Gang, Zhihua, Zoran, Yi, Raju and Kinsghy, who were always there for me whenever I needed some help or advice.

Lastly, and most importantly, I wish to thank my parents, R. K. Jain and Abha Jain, and sisters, Richa and Prachi, for their patience, love and support. To them I dedicate this thesis.


“Adaptive nonlinear clustering for data streams,” Ankur Jain, Z. Zhang & Edward Y. Chang. Proc. of ACM CIKM’06, Intl. Conf. on Information and Knowledge Management, Arlington, VA, USA.


Y Chang & Yuan-Fang Wang UCSB Technical Report, June 2006.

“Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention,” Ankur Jain, Dan Kopell, Kyle Kakligian & Yuan-Fang Wang. Proc. of IEEE CVPR’06, Conf. on Computer Vision and Pattern Recognition, New York.

“Adaptive stream resource management using Kalman Filters,” Ankur Jain, Edward Y. Chang & Yuan-Fang Wang. Proc. of the ACM SIGMOD’04, Intl. Conf. on Management of Data, Paris, France.

“Adaptive sampling for sensor networks,” Ankur Jain & Edward Y. Chang. Proc. of DMSN’04, Intl. Workshop on Data Management in Sensor Networks (in conjunction with VLDB), Toronto, Canada.

“Managing and Mining Video Sensor Data,” Edward Y. Chang, Ankur Jain, Navneet Panda, Yuan-Fang Wang, Gang Wu & Yi Wu. UCSB Technical Report, 2004.

Graduate Coursework

Computer Networks, Design & Analysis of Algorithms, Programming Languages and Implementation, Theory of Computation, Computational Geometry, Internet Computing and Web Technology, Bioinformatics, Advanced Computer Architecture, Advanced Algorithms for Multimedia Systems.


Statistical Mining in Data Streams

Ankur Jain

Recent years have seen a steady rise of a new class of data management systems called Data Stream Management Systems (DSMS). These systems manage rapid, high-volume data streams with transient relations instead of static data with persistent relations. Data streams are common to applications such as network traffic and transaction monitoring systems, click-stream processors, industrial process control, and sensor networks. A DSMS operates on these continuous and time-varying data streams to facilitate on-the-fly query answering, and to support data acquisition, monitoring and analysis.

In this dissertation, we present statistical stream mining solutions for effective on-line processing of streaming data. We focus on research issues related to adaptive stream resource conservation and online mining in a DSMS. We have developed statistical linear and non-linear filtering techniques based on the Kalman Filter to capture temporal correlations in the streaming data. Such correlations help in stream resource conservation. We also propose techniques that capture spatial correlations between the streaming sources, which further help improve resource conservation and facilitate answering group-queries in an efficient manner.

We also address issues related to online stream mining. Once the data stream arrives at a central server, effective mining techniques are necessary for stream analysis before the data can be discarded. Since a stream continuously evolves with time, stream mining techniques need to be adaptive and should operate under a given memory constraint. We propose adaptive clustering solutions that use the kernel trick to capture non-linear relations in the streaming data. We also present OCODDS, a change-detection approach that can track evolutionary changes in the stream in both linear and non-linear settings. Finally, we present our techniques for effective acquisition and processing of data streams common to video sensor networks.

List of Figures

1.1 Data Stream Mining
1.2 Contributions toward statistical stream mining

2 Kalman Filter in Stream Resource Management
2.1 Adaptive resource management for maximizing resource conservation
2.1.1 What is the Kalman Filter?
2.1.2 Contribution Summary
2.1.3 Related Work
2.1.4 The Kalman Filter
2.1.5 The Dual Kalman Filter Model
2.1.6 Why Kalman Filter?
2.1.7 Modeling Kalman Filter for Data Streaming Applications
2.1.8 Experimental Results
2.2 Adaptive resource management for maximizing query precision
2.2.1 Related Work
2.2.2 Our Framework
2.2.3 Results
2.3 Conclusion

3 Bayesian reasoning for sensor resource management and diagnosis
3.1 Related Work
3.2 Architecture and Model
3.2.3 Answering Diagnostic Queries
3.3 Experimental Validation
3.3.1 Experiment Setup
3.3.2 Resource Conservation
3.3.3 Query Answer Quality-Loss
3.3.4 Abnormality
3.3.5 Selectivity
3.4 Conclusion

4 Adaptive clustering in Data Streams
4.1 Related Work
4.2 Kernel Methods
4.3 Adaptive Non-linear Clustering Framework
4.4 Kernel Methods for Stream Clustering
4.4.1 Kernel Stream Segmentation (Tier-1)
4.4.2 Data Projection in LDS (Tier-2)
4.5 Experimental Evaluation
4.5.1 Datasets
4.5.2 Performance Results
4.6 Conclusions

5 OCODDS: Online change-over detection framework for tracking evolutionary changes in streaming data
5.1 Related Work
5.2 The OCODDS Framework
5.2.1 Finding the changeover location in a window
5.2.2 Conducting the hypothesis test
5.2.3 OCODDS using the kernel method
5.3 Experimental Evaluation
5.3.1 Comparative analysis with CUSUM
5.3.2 Effect of window and padding size
5.3.3 Effect of population variance ratio
5.3.4 Effect of location uncertainty (LU)
5.3.5 Effect of Noise
5.4 Conclusion

6.2 Technical Rationales
6.2.1 Off-line calibration
6.2.2 On-line selective focus-of-attention
6.3 Experimental Results
6.4 Conclusions

List of Figures

1.1 A Data Stream Management System (DSMS)
2.1 The DKF model
2.2 Architecture of DKF model
2.3 Moving-object dataset (Example 1)
2.4 Number of updates received at the central server (Example 1)
2.5 Average error produced by different KF models (Example 1)
2.6 Electric power load dataset (Example 2)
2.7 Number of updates received at the central server (Example 2)
2.8 Average error produced by different KF models (Example 2)
2.9 Network monitoring dataset (Example 3)
2.10 Comparative results for KF smoothing against moving average approach
2.11 Performance of DKF on smoothed data with F = 10^-7 (Example 3)
2.12 Performance of DKF for precision width δ = 10 (Example 3)
2.13 m on varying # of streaming sources
2.14 η on varying # of streaming sources
2.15 ERU on varying # of streaming sources
2.16 ERU on varying λ_i
2.17 ERU on varying W_i
3.1 Correlation model for NDBC data
3.2 Compact representation using BN
3.3 Sensor Network Architecture
3.4 Algorithm for computing candidate attributes set, Υ
3.5 Supplementary procedures for the algorithm shown in Figure 3.4
3.6 Resource conservation as a function of δ_min and |Q|
3.7 Resource conservation as a function of δ_min
3.8 Resource conservation with δ_min = 0.90
3.11 Abnormality Detection
3.12 Selectivity with δ_min = 0.90
4.1 Linear separation using the kernel methods
4.2 Well-behaved data in the feature space (Network intrusion data for “ipsweep” attack)
4.3 Overview of 2-tier clustering architecture
4.4 Adaptive non-linear clustering framework
4.5 Reuters stream pattern
4.6 Network-intrusion stream pattern
4.7 MNIST stream pattern
4.8 Cumulative Cluster Purity (u = 10)
4.9 Fraction of elements in significant clusters (u = 10)
4.10 Effect of dimensionality
4.11 EVD Computations (α = 75)
5.1 OCODDS vs CUSUM - Detection accuracy (LU = 1, m = 1, n = 30)
5.2 OCODDS vs CUSUM - Processing time (LU = 1, m = 1, n = 30)
5.3 Effect of window length (LU = 1, m = 1, N_rms = 0)
5.4 Effect of R Accuracy with LU = 1, m = 1, N_rms = 0, n = 50
5.5 Effect of padding length (LU = 1, n = 50, N_rms = 0)
5.6 Effect of PVR on detection bias (LU = 1, m = 1, n = 50, varying PVR)
5.7 Effect of PVR on detection bias (LU = 1, m = 1, n = 50, varying R)
5.8 Effect of population variance on detection accuracy (LU = 1, m = 1, n = 50)
5.9 Effect of LU (m = 1, n = 50)
5.10 Effect of RMS noise (LU = 1, m = 1, n = 50, varying RMS noise)
5.11 Effect of RMS noise (LU = 1, m = 1, n = 50, varying R)
6.1 Computing the correct pan DOF (a) if optical and pan centers are collocated, and (b) if they are not
6.2 Error in centering assuming computational collocation of the optical center on the rotation axis
6.3 Selective focus-of-attention as a visual servo problem
6.4 Comparison of calibration accuracy as a function of experimental setup (using a CCD of 300 × 300 pixels)
6.5 Relation between requested and realized angle of rotation for Sony PTZ camera
models
6.8 Centering errors under various experimental conditions
6.9 Focus-of-attention experiments using real video
6.10 Focus-of-attention experiments using real video

List of Tables

1.1 Data stream applications
1.2 Popular Data Stream Management Systems (DSMS)
2.1 Summary of existing solutions and advantages of using the Kalman Filter
2.2 Symbols and their meanings
3.1 Relative Costs of Sensors for Real Datasets
6.1 Comparison of calibration accuracy. For [50], 51% of simulation runs failed to converge. If the simulation did converge, 85 iterations were needed on average


A data stream is simply a continuous sequence of data elements (x_1, x_2, x_3, ..., x_i, ...) that arrive in on-line fashion, where each element x_i could be a scalar or a vector entity. Babcock et al. presented the data stream model with the following salient features [18]:

• The data elements arrive on-line.

• The system has no control over the order in which the data elements arrive.

• Once an element has been seen or processed, it cannot be easily retrieved or seen again unless it is explicitly stored in memory.
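As an illustration of what the one-pass restriction means in practice, the following minimal Python sketch maintains a running mean and standard deviation while looking at each element exactly once. It uses Welford's well-known update rule, which is a standard technique and not a method taken from this dissertation; the class name `RunningStats` and the sample readings are assumptions made for the example.

```python
import math


class RunningStats:
    """One-pass mean and standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Constant memory: only n, mean and m2 are kept, never the element.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Population standard deviation; 0.0 until two elements are seen.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)  # each element is seen once and then discarded
```

A summary of this kind could serve a query such as monitoring the standard deviation of a sensor reading without storing the stream.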

Data streams are common to a variety of data-intensive applications, some of which we have summarized in Table 1.1. In all these applications, it is not feasible to operate using a traditional database management system (DBMS), since it assumes persistent data relations and predominantly focuses on query optimization, data access (indexing) and privacy. On the other hand, a data stream management system (DSMS) must consider critical factors such as noise from the data sources, management of the system resources (both at the central and remote locations), and evolutionary changes in the data trends. Moreover, unlike a DBMS, a DSMS also undertakes the task of handling issues related to data acquisition from the streaming sources, for example, reducing the computation and communication load on nodes in a sensor network while maximizing the total information gain. Since a data stream is potentially unbounded in size, and the DSMS engine operates under the constraints of finite memory and storage, the data stream query model also differs from that of a traditional DBMS. Some of these striking differences are:

• Approximate Query Results: Since access to the entire stream at any point of time is difficult (and often impossible), data stream query results are often reported in an approximate fashion. A sliding window paradigm is adopted, where results are computed using only as much historical data as a window can hold. For example, given a stock market feed, a user might ask “What is the average value of the YAHOO stock index over the last 30 seconds?” In this case, the sliding window will hold only the data received in the last 30 seconds.

• Continuous Queries (query registration): A user registers her query with the DSMS only once; however, the query result can be a data stream in itself. Considering the stock market example again, a user might ask “Report the value of the YAHOO stock index, whenever it goes higher than its mean over a 30-second sliding window.” This will result in a stream of mean stock values.

• Online Adaptivity: Since data in a stream arrive online, a true stream processing algorithm should also operate online. Moreover, since the stream can exhibit evolutionary behavior, an effective stream-processing algorithm should be able to analyze such evolutionary changes. For example, data could be sampled at a higher rate for stocks that have been showing increased activity on a particular day (say IT stocks) as compared to other stocks (say Finance stocks) to improve the overall accuracy of query results.

Application | Example query
Financial stock tickers | “Find stocks with more than 5% gain in last 30 minutes.”
Transaction log analysis | “Determine number of distinct base stations used for the last 5 longest cellular phone calls.”
Network monitoring and traffic engineering | “Report the average number of HTTP packets seen on a network link in.”
Sensor networks | “Continuously monitor the standard deviation of the temperature in a nuclear reactor.”
Video surveillance | “Determine areas in a parking lot with unusual activity.”

Table 1.1: Data stream applications.
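The sliding-window and continuous-query ideas can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `SlidingWindowMean` and `above_window_mean`, the `(timestamp, value)` tuple format, and the sample ticks are hypothetical, and the generator reports the triggering values rather than reproducing any particular DSMS's semantics.

```python
from collections import deque


class SlidingWindowMean:
    """Mean of the values received in the last `span` seconds."""

    def __init__(self, span):
        self.span = span
        self.window = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        self.window.append((timestamp, value))
        self.total += value
        # Evict elements that are `span` seconds old or older.
        while self.window and self.window[0][0] <= timestamp - self.span:
            _, old = self.window.popleft()
            self.total -= old
        return self.total / len(self.window)


def above_window_mean(stream, span=30):
    """Continuous query: yield (timestamp, value) whenever the value
    exceeds the mean of the current window. The result is itself a stream."""
    win = SlidingWindowMean(span)
    for timestamp, value in stream:
        mean = win.add(timestamp, value)
        if value > mean:
            yield timestamp, value


ticks = [(0, 10.0), (10, 10.0), (20, 13.0), (35, 9.0), (50, 14.0)]
alerts = list(above_window_mean(ticks))
```

The query is registered once (the generator is created once) and keeps producing results as new elements arrive, while memory use is bounded by the number of elements that fit in the window.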

In Figure 1.1, we show the architecture of a typical DSMS. The data sources (on the right) forward data to the central processor (in the middle). Data could be continuously pushed to the central processor, or could be pulled from the sources as and when required [37]. The central processor operates on the received data under the constraints put forth by the user (on the left). These constraints may include the sliding window size, query answer precision values, etc. Once the data have been processed, the central processor may choose to discard the data, or to store data summaries (synopses).

Figure 1.1: A Data Stream Management System (DSMS)

There are many interesting research issues related to different aspects of a DSMS. For example, at the central site, the DSMS should address issues related to resource management, query processing, etc., and at the remote site, it should address issues related to sensor calibration (e.g., camera calibration in a video sensor network [26]), data acquisition, and efficient data sampling and forwarding techniques. Due to this wide spectrum of open problems related to data stream processing at different levels, stream research has been studied with increasing interest over the past few years. The task of building an efficient and effective DSMS has been undertaken at different universities with different performance goals. Table 1.2 summarizes the salient features of some of these popular projects. In the remainder of this section, we briefly survey related research in this area.

DSMS | University affiliations | Summary
STREAM | Stanford University | A general-purpose DSMS with SQL-like query interface. Supports continuous query registration and processing.
Aurora | MIT, Brandeis University and Brown University | Real-time stream processing engine for continuous sensor data processing. Primary focus is toward operator scheduling and load shedding.
Cougar | Cornell University | A distributed sensor database system. Supports in-network query processing and effective query plan generation.
TelegraphCQ | UC Berkeley | A DSMS with continuous data-flow processing and adaptive query processors.

Table 1.2: Popular Data Stream Management Systems (DSMS)

A significant body of research aims to develop effective query languages for a DSMS. These query languages have a declarative SQL-like interface, and they support constructs particular to data-stream queries (for example, specification of the sliding window size). ESL [88] is one such stream query language under development at UCLA. Stanford’s CQL is another example of a DSMS query language [15]. Given a stock-ticker stream stock(ticker, value), a query to select all tickers having a price of more than $10 in the last 5 seconds can be formulated in CQL as follows:

    SELECT ticker
    FROM stock [RANGE 5 SECONDS]
    WHERE value > 10

In-network processing [32] and load-shedding techniques have been devised to reduce data forwarding from the sources, in an effort to minimize both the computational load on the query processor and the communication effort needed to transfer data to the central server. The work proposed in [124] inserts context-sensitive drop operators in query plans for dropping data tuples. In [119], the authors propose an efficient filter operator placement technique, wherein the placement node is selected based on the computational capability of the node, as well as the selectivity and cost of the operator in question.

A significant area of data-stream research focuses on building efficient query processors for data streams [97, 22] that optimize memory and storage requirements while meeting QoS specifications. There have also been research efforts devoted to data sampling techniques over a sliding window [19] and to the creation of an effective data stream synopsis [45].


1.1 Data Stream Mining

In addition to stream query processing, the DSMS also undertakes the relatively more challenging task of stream mining, or data stream analysis. Although data mining is a well-established field in itself, direct application of mining algorithms to data streams is often unsuitable because it is impossible to maintain all the stream elements in memory. Moreover, as new data arrive online, the mining algorithms must adapt to the changing trends. Hence, approximation and adaptivity are the key ingredients of any data stream mining algorithm.

While data-stream querying research is relatively older, data stream mining has received more of researchers’ attention in the past few years. Stream querying shares interesting inter-dependencies with stream mining. For one, stream-mining solutions can help improve the performance of a DSMS: for example, an effective stream trend prediction technique can help in the placement of query operators and also reduce the data-forwarding load in a sensor network. However, stream querying solutions are often general-purpose, whereas mining solutions are application-specific and are particular to the problem semantics. For example, a particular approach for stream trend prediction might work well for a “stock ticker” stream and at the same time show no performance enhancement for an “Internet traffic” stream. Moreover, mining queries are often represented in an abstract fashion (e.g., “Determine areas of unusual activity in a parking lot”). Some of the common stream-mining tasks include, but are not limited to [36]:

• Multi-dimensional on-line analysis [39]

• Mining spatial and temporal correlations in streaming data trends [51, 68]

• Mining novelty, outliers and anomalous behavior [7]

• On-line adaptive clustering and classification [10, 8]

• Frequent pattern matching [46, 16]

1.2 Contributions toward statistical stream mining

The primary focus of this dissertation is to develop efficient and effective statistical stream mining solutions that improve the performance of a DSMS. In addition to being on-line and adaptive, our proposed solutions handle issues related to non-linearity in data relations, an issue largely ignored by the data stream research community in the past.

Non-linearity in data stream applications manifests itself in different forms. Consider the problem of data stream trend analysis as an example. To save communication resources, it is often desirable to predict the future attribute value at the central query server based on the current trend, rather than acquiring fresh values from the remote streaming source. Thus, the performance of the system depends directly on the reliability of the prediction scheme. If the trend has non-linear characteristics, it becomes increasingly difficult to make reliable predictions. Adaptive on-line stream classification/clustering is another interesting example. Real-world data is often separated by non-linear class boundaries, which are hard to delineate with simple classification algorithms. The problem becomes even more challenging in the data stream setting, due to the unavailability of the stream in its entirety and limited memory.

In this dissertation, we propose efficient adaptive solutions that address the problems arising due to the non-linearity in data relations while operating in limited memory. In particular, we focus on the following problems:

• Communication resource conservation while addressing approximate user queries

• On-line trend analysis

• Effective data acquisition from streaming sources

All of our proposed solutions operate in limited memory under the data stream model presented in Figure 1.1. We further simplify the data stream model by assuming that the elements in the stream arrive in order and that we access them sequentially. However, when we incorporate sliding-window techniques, we assume that random access to any element in the sliding window is possible. In the rest of this section, we provide an overview of our proposed approaches and a roadmap to the rest of the thesis.

In Chapter 2, we study how linear and non-linear filtering techniques help in reducing communication overhead while answering approximate user queries. A major research area in stream management is to allocate resources (such as network bandwidth) to query plans. The objective is to:

1. minimize resource usage under a user-query precision requirement, or

2. maximize precision of query results under resource constraints.

Both of these problems can be perceived as fundamentally filtering problems, in which the objective is either to filter out as much data as possible at the source to conserve resources (provided that a user-specified query-precision constraint is met at the server), or to maximize the precision while filtering data adaptively at the remote site, such that the overall resource usage stays under a user-specified resource constraint. To date, many solutions have been proposed; however, most are ad hoc, with hard-coded heuristics to generate query plans. We select the Kalman Filter as a general and adaptive filtering solution for conserving resources. The Kalman Filter has the ability to adapt to various stream characteristics, sensor noise, and time variance. Furthermore, we realize a significant performance boost by switching from traditional methods of caching static data (which can soon become stale) to our method of caching dynamic procedures that can predict data reliably at the server without the clients’ involvement. Through examples and empirical studies, we demonstrate the flexibility and effectiveness of using the Kalman Filter (in both linear and non-linear flavors) as a solution for managing trade-offs between precision of results and resources in satisfying stream queries (case (1)). We also propose a novel adaptive sampling technique with which we maximize query precision, given the available network bandwidth, to address case (2). This problem is prevalent in systems where a large network of sensors delivers continuous data to a central server, and the objective is to maximize the total information gain, given the total communication overhead the network can bear. Our approach employs a Kalman Filter (KF)-based estimation technique wherein the sensor can use the KF estimation error to adaptively adjust its sampling rate within a given range, autonomously. When the desired sampling rate violates the range, a new sampling rate is requested from the server. The server allocates new sampling rates under the constraint of available resources such that the KF estimation error over all the active streaming sensors is minimized.

In Chapter 3, we present solutions for addressing user queries in a different setting, where data is not continuously pushed by the sources to the server; instead, it can be pulled from the sources as and when required. However, each pull of the data comes at some cost of resource usage. This setting is typical in sensor networks, where acquisition of data from a sensor comes at the cost of deterioration of its battery life. Using the correlations and dependencies that could be prevalent in the sensor attribute values, we propose a scheme for modeling sensor networks using Bayesian Networks (BN), where each node represents a sensor, and a directed edge represents the influence relationship between two sensors. We show that by taking advantage of the Markov-blanket property of BNs, we can generate resource-conserving group-query plans, and also address a new class of diagnostic queries. When multiple sensors are queried, the queries can be processed collectively as a single group-query that exploits inter-attribute dependencies for deriving cost-effective query plans.
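To make the Markov-blanket property concrete: in a Bayesian network, a node is conditionally independent of all other nodes given its parents, its children, and the other parents of its children. The sketch below computes that set for a small directed acyclic graph. It illustrates the general definition only and does not reproduce the query-planning algorithm of Chapter 3; the function name `markov_blanket` and the sensor names are hypothetical.

```python
def markov_blanket(parents, node):
    """Return the Markov blanket of `node` in a DAG.

    `parents` maps each node to the set of its parents. The blanket is the
    node's parents, its children, and the other parents of its children.
    """
    children = {n for n, ps in parents.items() if node in ps}
    blanket = set(parents.get(node, set())) | children
    for child in children:
        blanket |= parents[child]
    blanket.discard(node)
    return blanket


# Hypothetical sensor attributes; the names are illustrative only.
parents = {
    "wind_speed": set(),
    "air_temp": set(),
    "wave_height": {"wind_speed"},
    "water_temp": {"air_temp"},
    "pressure": {"wave_height", "water_temp"},
}
blanket = markov_blanket(parents, "wave_height")
```

Here `air_temp` falls outside the blanket of `wave_height`, so once the blanket values are known, acquiring `air_temp` adds no further information about `wave_height`. That is the kind of redundancy a cost-aware query plan can exploit.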

In addition to answering user queries, a data stream management system should be able to analyze the stream automatically and mine interesting patterns. In Chapter 4, we shift our focus to stream clustering, a problem that has emerged as a challenging and interesting one over the past few years. Due to the evolving nature of streams and the one-pass restriction imposed by the data stream model, traditional clustering algorithms are inapplicable for stream clustering. This problem becomes even more challenging when the data are high-dimensional and the clusters are not linearly separable in the input space. We propose a non-linear stream clustering algorithm that adapts to the stream’s evolutionary changes. Using kernel methods to deal with the non-linearity of data separation, we propose a novel 2-tier stream clustering architecture. Tier-1 captures the temporal locality of the stream by partitioning it into segments, using a kernel-based novelty detection approach. Tier-2 exploits this segment structure to continuously project the streaming data non-linearly onto a low-dimensional space (LDS) before assigning them to a cluster.
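The kernel trick mentioned above lets distances in a high-dimensional feature space be computed from kernel evaluations alone, without ever forming the feature map. As a generic illustration (not the Tier-1 novelty detector of Chapter 4), the sketch below computes the squared feature-space distance between a new point and the mean of a segment under a Gaussian kernel. The function names, the `gamma` value, and the sample points are assumptions made for the example.

```python
import math


def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def feature_space_distance2(x, segment, kernel=rbf):
    """Squared distance in feature space between phi(x) and the mean of
    phi over `segment`, computed with kernel evaluations only:
    k(x, x) - (2/n) * sum_i k(x, s_i) + (1/n^2) * sum_ij k(s_i, s_j)."""
    n = len(segment)
    cross = sum(kernel(x, s) for s in segment) / n
    within = sum(kernel(s, t) for s in segment for t in segment) / (n * n)
    return kernel(x, x) - 2.0 * cross + within


segment = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
near = feature_space_distance2((0.05, 0.05), segment)
far = feature_space_distance2((3.0, 3.0), segment)
```

A point whose distance to the current segment is large (as with `far`) is a candidate for starting a new segment, whereas a small distance (as with `near`) suggests the point is consistent with the segment seen so far. In a streaming setting the `within` term would be maintained incrementally rather than recomputed.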

Since the streaming data distribution exhibits evolutionary changes that could possibly be non-linear, one of the fundamental problems in stream mining is to track such changes effectively and efficiently. This necessitates a framework that can automatically monitor evolutionary changes in the streaming data distribution and raise an alert when a significant change is detected. In Chapter 5, we present OCODDS: Online ChangeOver Detection framework for Data Streams. OCODDS offers real-time, high changeover-detection accuracy while being completely oblivious to the shape and parameters of the underlying distribution. We also present OCODDS’s tie-in with the kernel method, which allows OCODDS to operate in any high-dimensional space using implicit non-linear projections, without changes to its original framework. This tie-in allows for even higher changeover-detection accuracy at a slightly higher computational effort. Our approach offers a theoretical guarantee of being an unbiased estimator of the changeover point under certain conditions, and it offers better change-detection accuracy than traditional algorithms like CUSUM.
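For readers unfamiliar with the CUSUM baseline named above, a minimal one-sided version can be sketched as follows. This shows the textbook cumulative-sum test for an upward shift in the mean, not OCODDS itself; the parameter names (`target_mean`, `slack`, `threshold`) and the sample data are assumptions chosen for illustration.

```python
def cusum_first_alarm(stream, target_mean, slack, threshold):
    """One-sided CUSUM: return the index of the first alarm for an upward
    shift in the mean, or None if no alarm is raised."""
    s = 0.0
    for i, x in enumerate(stream):
        # Accumulate deviations above target_mean + slack; never go below 0.
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return i
    return None


# The mean shifts from about 0 to about 1 at index 5.
data = [0.1, -0.2, 0.0, 0.2, -0.1, 1.1, 0.9, 1.2, 1.0, 0.8]
alarm = cusum_first_alarm(data, target_mean=0.0, slack=0.5, threshold=1.5)
```

Note that this test requires the pre-change mean and a tuned slack and threshold in advance, and it reports an alarm some time after the change rather than estimating the changeover location.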

In Chapter 6, we attempt to address some of the issues associated with a different kind of sensor network, one that has cameras as the sensors. Such networks are also referred to as video sensor networks (VSN) [26]. Video sensor networks have certain characteristics that differentiate them from a general-purpose sensor network, some of which are as follows:

• A large volume of continuous data is produced by the sensors, most of which must be processed at the sensor site. For example, the video sensor may have to process a high-resolution video stream to produce a simple stream of 3-dimensional moving-object coordinates.

• VSNs operate under the collaborative effort of computer vision, computer graphics, networking, and data stream processing techniques.

In Chapter 6, we present a prototype video surveillance system that uses stationary-dynamic (or master-slave) camera assemblies to achieve wide-area surveillance and selective focus-of-attention. We address two critical issues in deploying such a VSN: (1) off-line camera calibration and (2) on-line selective focus-of-attention. We formulate the selective focus-of-attention problem as a feedback-loop visual-servo problem, which is non-linear. We show that our methods effectively process the stream of moving-object coordinates to continuously focus on an object by adaptively adjusting the PTZ camera parameters. Since the performance (focus-of-attention accuracy) of such a system is heavily dependent on the quality of information produced from the video sensors, we also propose video-sensor calibration methods that calibrate all the degrees of freedom of a video sensor in closed form. Calibration is important because VSNs employ numerous cheap video sensors which do not come with high-quality mechanical designs. Although these calibration procedures are off-line and are not directly related to the theme of this dissertation, we provide them for completeness and better understanding of other methods proposed in Chapter 6.

Kalman Filter in Stream Resource Management

[...] to the server. If the server can answer queries within specified precision constraints, these methods do not initiate data communication. Indeed, these methods have been shown to be effective for reducing network-bandwidth consumption, thereby also reducing the storage and processing loads at the server.

The problem of data stream resource management can be perceived from two different aspects:

1. Minimizing resource usage under query precision constraints.

2. Maximizing query precision under given resource constraints.

In our work, we treat stream resource management as fundamentally a filtering problem. We believe that an adaptive and general stream filtering solution can address both cases (1) and (2) effectively. We advocate the use of the Kalman Filter (KF) [81] for stream filtering, since the KF has been well studied and widely applied to many data filtering and smoothing problems. We present our methods for addressing cases (1) and (2) in Sections 2.1 and 2.2, respectively. We incorporate the Kalman Filter as the basic building block of a stream management system for the following two reasons:

• Traditional methods cache static data that easily become stale over time. This necessitates frequent and expensive synchronization between clients and servers through retransmission. Our method, on the other hand, caches filter parameters that enable dynamic and accurate system prediction on the server without clients’ intervention.

• As will be fully explained in Section 2.1.6, the Kalman Filter can be easily customized to handle varying stream characteristics, sensor noise, and time variance to meet the requirements specified in [97]. The same filtering framework can be adapted to address a wide variety of stream resource management problems, providing a unified paradigm that is both powerful and versatile.


2.1 Adaptive resource management for maximizing resource conservation

An effective algorithm to address this problem is one that filters out a maximum amount of data as long as the precision constraints are met at the server. A major shortcoming of the existing solutions is that they are often ad hoc, as explained in [13], and have been highly application-dependent. No unified solution has yet been developed for managing stream resources. We introduce our Dual Kalman Filter (DKF) architecture as a general and adaptive solution to the stream-resource-conservation maximization problem.

To further emphasize the need to conserve network bandwidth, let us consider a typical wireless sensor monitoring system. In applications such as moving-object tracking, weather monitoring, and video surveillance, the power dissipation rate of a wireless sensor node is an issue of primary concern. It has been established [118, 135] that the majority of power dissipation occurs when transmitting bits over wireless networks, not when processing them. The ratio of the energy spent in sending one bit over a network to that spent in executing one instruction is between 220 and 2,900 on various architectures [109, 110]. Thus, filtering data at sensors is beneficial not only for conserving bandwidth, but also for conserving power. Since the computational cost incurred by the KF is insignificant in many practical sensing scenarios, the KF is an attractive option as the filtering solution for resource conservation.

[Figure 2.1: Flowchart of the DKF architecture. Panels: User; Central Server (running KF_s); Remote Source (running KF_m); Streaming Source. Decision at the remote source: "Is prediction of central server outside given query precision?" (YES: update the central server with new value; NO: drop the data tuple). Decision at the central server: "Is update available from remote source?" (YES: send update received from remote source; NO: send prediction from the server KF_s).]

Figure 2.1 depicts the role of our proposed DKF model in a typical DSMS architecture. A user (on the left-hand side of the figure) issues a query to the server with some precision constraints. The server activates a KF, denoted as KF_s, and at the same time, the target sensor activates a mirror KF with the same parameters, denoted as KF_m. The dual filters KF_s and KF_m predict future data values. Only when the filter at the remote source, KF_m, fails to predict future data within the precision constraint (and thus KF_s cannot provide an accurate prediction at the server) does the sensor send updates to KF_s. Significant bandwidth conservation can be achieved if a reliable and accurate data prediction mechanism is employed. We propose the KF as such a mechanism for its simplicity, efficiency, and provable optimality under fairly general conditions.


2.1.1 What is the Kalman Filter?

The Kalman Filter is a stochastic, recursive data filtering algorithm. It has been widely used in predicting a system’s internal state based on the observation of its external behavior. For stream management applications, a stream is modeled as a generative process (a streaming model) controlled by the stream’s internal parameters (state), which evolve over time. The state may or may not be directly observable, and hence has to be inferred by observing the system’s external behavior. The observed data stream serves as the external observation that is used to estimate the internal state of the stream’s generative process. The state estimation process operates using recursive steps of prediction (propagating the internal state of the system) and correction (fine-tuning the prediction with external observation) [34, 81, 130].
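The predict/correct recursion described above can be sketched for the simplest possible case. The following is a minimal illustration only, not the formulation used later in this chapter: it assumes a one-dimensional random-walk state, and the class name `ScalarKalmanFilter`, the noise variances `q` and `r`, and the sample readings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalmanFilter:
    """One-dimensional Kalman filter for a random-walk state (illustrative)."""
    x: float  # current state estimate
    p: float  # variance of the state estimate
    q: float  # process-noise variance (assumed known)
    r: float  # measurement-noise variance (assumed known)

    def predict(self) -> float:
        # Prediction: under a random-walk model the state carries over,
        # while its uncertainty grows by the process-noise variance.
        self.p += self.q
        return self.x

    def correct(self, z: float) -> float:
        # Correction: blend prediction and observation using the Kalman gain.
        gain = self.p / (self.p + self.r)
        self.x += gain * (z - self.x)
        self.p *= (1.0 - gain)
        return self.x


# Hypothetical noisy readings of a quantity whose true value is close to 1.0.
kf = ScalarKalmanFilter(x=0.0, p=1.0, q=0.01, r=0.25)
for z in [1.1, 0.9, 1.05, 0.98, 1.02]:
    kf.predict()
    kf.correct(z)
```

After a handful of readings the estimate moves from the initial guess toward the observed level while the estimate variance shrinks, which is the behavior the recursion is meant to deliver.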

A concrete example is the problem of estimating the state of a moving vehicle. The system state comprises the current location and velocity of the vehicle, and is represented by a vector x_k at a discrete time step k. The system evolves over time due to the driver’s acceleration and braking actions, and random road friction. Hence, the time evolution of the system’s state is governed by a linear state equation (Eq. 2.1), in which w_k is process noise, matrices A and B relate the state at k to that at k + 1, a_k is the time-varying acceleration, c_k is the velocity, and T is the time between step k and step k + 1. Now, suppose that at discrete time intervals we can measure the position p. Then our measurement at time k is the position p_k corrupted by ν_k, the measurement noise inherent in all measurement processes.
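For readers who want the model in symbols, the following is a sketch of the standard constant-acceleration form that matches the definitions above. It assumes the state is ordered as position then velocity; the measurement symbol z_k and the explicit entries of A and B are assumptions derived from elementary kinematics rather than notation taken from this text.

```latex
\begin{align}
x_k &= \begin{bmatrix} p_k \\ c_k \end{bmatrix}, \qquad
x_{k+1} = A\,x_k + B\,a_k + w_k, \\
A &= \begin{bmatrix} 1 & T \\ 0 & 1 \end{bmatrix}, \qquad
B = \begin{bmatrix} T^2/2 \\ T \end{bmatrix}, \\
z_k &= \begin{bmatrix} 1 & 0 \end{bmatrix} x_k + \nu_k = p_k + \nu_k .
\end{align}
```

Row by row, the state equation reads p_{k+1} = p_k + T c_k + (T^2/2) a_k and c_{k+1} = c_k + T a_k, each perturbed by the corresponding component of w_k.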

Given hints on the system state through the state propagation and external observation mechanisms, the KF integrates all the information to arrive at the best estimate of the system state over time. The KF weighs all available information by taking into consideration the noise in external measurement and the uncertainty in state propagation. More specifically, let us assume that the state propagation uncertainty w_k is white Gaussian with a covariance matrix Q. The measurement noise ν_k is white Gaussian with a covariance matrix R, and it is not correlated with the noise in state propagation. The formulation of the KF algorithm provides us with the following statistical properties:


1. The expected value of the KF estimate is equal to the expected value of the state. That is, on average, the estimate of the state will equal the true state. In other words, the KF is an unbiased estimator.

2. Of all linear estimation algorithms, the KF algorithm minimizes the variance of the estimation error. That is, on average, it gives the smallest possible variance in estimation error. In other words, the KF is the linear estimator that can deliver the most consistent estimation results.
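The two properties above can be written compactly. This is a sketch in assumed notation: \hat{x}_k denotes the KF estimate of the state x_k, and \tilde{x}_k ranges over estimators that are linear in the observations; neither symbol is taken from this text.

```latex
\begin{align}
\mathrm{E}[\hat{x}_k] &= \mathrm{E}[x_k]
  && \text{(unbiasedness)} \\
\mathrm{E}\!\left[\lVert x_k - \hat{x}_k \rVert^2\right]
  &\le \mathrm{E}\!\left[\lVert x_k - \tilde{x}_k \rVert^2\right]
  && \text{(minimum error variance among linear estimators)}
\end{align}
```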

Continuing with our example of tracking a vehicle as it moves in a two-dimensional space, the vehicle might be able to provide rapid updates of its position to the central server as it moves (e.g., if it is equipped with a GPS positioning system). An inherent limitation is that the exact object location cannot be updated continuously due to the limited bandwidth and battery power of the remote devices. Thus, approximate answers to queries on the vehicle’s position are acceptable. A promising solution is to maintain a precision bound of width δ at the remote source and update the server whenever the true value deviates by more than δ units from the server value, as proposed in [101, 104]. We use our dual Kalman Filter approach to accomplish this (system equations are discussed in detail in Section 2.1.7).
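The precision-bound idea can be illustrated with a short sketch. It shows only the basic bound check with a static cached value, not the schemes of [101, 104] or the DKF itself; the function name `filter_with_precision_bound` and the sample readings are hypothetical.

```python
def filter_with_precision_bound(readings, delta):
    """Return the (index, value) updates a remote source would send.

    The server is assumed to hold the last value it received; the source
    sends a new value only when the current reading deviates from that
    cached value by more than delta.
    """
    updates = []
    cached = None  # value currently held by the server
    for i, value in enumerate(readings):
        if cached is None or abs(value - cached) > delta:
            cached = value
            updates.append((i, value))
    return updates


# Hypothetical stream of six readings with a precision bound of 1.0.
readings = [10.0, 10.2, 10.4, 11.5, 11.6, 13.0]
sent = filter_with_precision_bound(readings, delta=1.0)
```

In this run only three of the six readings are forwarded. A static cache of this kind goes stale whenever the stream drifts, which is the motivation for replacing the cached value with a prediction.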

Again, let us use the schematic diagram in Figure 2.1 to explain. Suppose a user query comes with a precision constraint δ. Our system will activate a Kalman Filter KF_s at the central server and its mirror Kalman Filter KF_m at the remote site. The remote source keeps track of the server prediction at time step k (note that this does not require any extra memory except for the usual matrices of the KF) and filters out the data (does not forward it to the central server) if the prediction at the central server deviates from the measured value by less than a margin of δ. Notice that KF_s, after receiving the first few measurements from the remote source, would have established a good estimate of the state vector (p_k, c_k). Based on that, the server can compute the rate of change of the X and Y locations using Eq. 2.1, and would require fewer updates from the remote source. Updates are needed only when sudden acceleration or braking actions induce a large error in the state estimate, or when noise gradually corrupts the state prediction to such a degree that a refresh is necessary. For tracking and recreating a vehicle’s locations, the server does not need to record any information other than the X and Y coordinates of the vehicle and the Kalman Filter matrices.
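The dual-filter protocol just described can be sketched along one coordinate. This is a simplified illustration under stated assumptions, not the system evaluated in this chapter: motion is one-dimensional, only position is measured, the process-noise covariance is taken as q times the identity, and the class `PositionVelocityKF`, the function `run_dual_filter`, and all numeric values are hypothetical.

```python
class PositionVelocityKF:
    """Kalman filter over the state (position, velocity); position is measured."""

    def __init__(self, dt, q, r):
        self.dt, self.q, self.r = dt, q, r
        self.pos, self.vel = 0.0, 0.0
        self.P = [[1e3, 0.0], [0.0, 1e3]]  # large initial uncertainty

    def predict(self):
        """Propagate the state one step and return the predicted position."""
        dt = self.dt
        (a, b), (c, d) = self.P
        self.pos += dt * self.vel
        # P <- A P A^T + q*I, with A = [[1, dt], [0, 1]]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.pos

    def correct(self, z):
        """Fold a position measurement z into the estimate (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r            # innovation variance
        k0, k1 = a / s, c / s     # Kalman gain
        y = z - self.pos          # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


def run_dual_filter(measurements, delta, dt=1.0, q=0.01, r=0.1):
    """Simulate the server filter and its mirror; return the sent indices."""
    server = PositionVelocityKF(dt, q, r)  # plays the role of KF_s
    mirror = PositionVelocityKF(dt, q, r)  # plays the role of KF_m
    sent = []
    for k, z in enumerate(measurements):
        predicted = mirror.predict()       # what the server will predict
        server.predict()
        if abs(z - predicted) > delta:     # prediction outside the bound
            mirror.correct(z)              # source updates its mirror ...
            server.correct(z)              # ... and transmits z to the server
            sent.append(k)
        # otherwise the tuple is dropped and both filters stay in step
    return sent, server, mirror


# Hypothetical track: steady speed of 2 units per step, then braking to 0.5.
positions = [2.0 * k for k in range(1, 11)] + [20.0 + 0.5 * k for k in range(1, 11)]
sent, server, mirror = run_dual_filter(positions, delta=0.5)
```

In this run the first two measurements are transmitted while the velocity estimate settles, the steady-speed segment that follows is suppressed, and transmission resumes when the vehicle brakes. Because both filters apply exactly the same operations, the mirror always knows what the server will predict without any extra communication.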

In addition to perceiving and formulating stream resource management as fundamentally a filtering problem, and proposing the Kalman Filter as a general and adaptive solution, the specific contributions of this thesis are as follows:

1. We present a comparative analysis of our model versus the existing techniques and discuss different applications where the Kalman Filter has been successfully incorporated (Section 2.1.3).

2. We present the mathematical formulation of the filter and discuss how the filter formulation can be easily customized for a large variety of problem formulations in stream management (Section 2.1.4).

3. We propose our dual Kalman Filter model (in Section 2.1.5) and discuss how constraints (query precision and smoothing factors) provided by the user are used to install and set initial parameters for the Kalman Filters.

4. Through examples and empirical studies (in Sections 2.1.7 and 2.1.8), we show that employing the Kalman Filter can facilitate and support different query scenarios. In terms of bandwidth conservation, the Kalman Filter is at least as good as traditional approaches, if not better. More importantly, its generality and adaptivity show promise as a building block for stream applications that concern fusion and integration.

5. In Section 2.3 we advance a set of promising extensions to build upon this work.