540 IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, VOL. 12, NO. 4, DECEMBER 2015

DisCaRia—Distributed Case-Based Reasoning System for Fault Management

Ha Manh Tran and Jürgen Schönwälder, Senior Member, IEEE

Abstract—Fault resolution in communication networks and distributed systems is a challenge that demands the expertise of system administrators and the support of multiple systems, such as monitoring and event correlation systems. Trouble ticket systems are frequently used to organize the workflow of the fault resolution process. In this context, we introduce DisCaRia, a distributed case-based reasoning system that assists system administrators and network operators in resolving faults. DisCaRia integrates various fault knowledge resources that are already available on the Internet, and it exploits them by applying a distributed case-based reasoning methodology based on scalable peer-to-peer technology. We present the architecture of DisCaRia and the key algorithms it uses, and we provide an evaluation of a prototype implementation of the system.

Index Terms—Fault resolution, fault management, case-based reasoning, peer-to-peer, bug tracking system, software bug search.

I. INTRODUCTION

THE RESOLUTION of faults in communication networks and distributed systems is to a large extent a human-driven process. Automated monitoring and event correlation systems [1]–[4] usually produce fault reports that are forwarded to operators for resolution. Support systems [5]–[7] such as trouble ticket systems are frequently used to organize the workflows. Case-based Reasoning (CBR) [8] was proposed in the early 1990s to assist operators in the resolution of faults by providing mechanisms to correlate an observed fault with previously solved similar cases (faults). CBR systems are typically linked to trouble ticket systems since the data maintained in trouble ticket systems can be used to populate a case database. Existing CBR systems for fault resolution usually
operate only on a local case database and cannot easily exploit knowledge about faults and their resolutions present at other sites. This restriction to local knowledge resources, however, becomes an issue in environments where software components and offered services change very dynamically and the case database is thus frequently outdated.

Manuscript received March 2, 2015; revised October 20, 2015; accepted October 20, 2015. Date of publication October 30, 2015; date of current version December 17, 2015. This work was supported in part by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant 102.02-2011.01, in part by Flamingo, a Network of Excellence project (ICT-318488) supported by the European Commission under its Seventh Framework Programme, and in part by the EC IST-EMANICS Network of Excellence under Grant 26854. The associate editor coordinating the review of this paper and approving it for publication was J. Lobo.
H. M. Tran is with the School of Computer Science and Engineering, International University-Vietnam National University, Ho Chi Minh City, Vietnam (e-mail: tmha@hcmiu.edu.vn).
J. Schönwälder is with the Department of Computer Science and Electrical Engineering, Jacobs University Bremen, Bremen, Germany (e-mail: j.schoenwaelder@jacobs-university.de).
Digital Object Identifier 10.1109/TNSM.2015.2496224

With the recent growth of virtual communities, social networks, and cloud systems, domain-specific search engines often become a suitable alternative to general-purpose search engines. The key advantage of these systems is that they focus the search on specific domains. Domain-specific search engines have the potential of connecting a large number of experts with similar interests. A virtual community of networking experts can provide the best solutions for networking problems. Cloud computing systems, fostering the centralization of various services, in particular require a large number of experts and tools to manage faults and
failures. This is especially true for inter-cloud environments [9] that support applications and services running on multiple cloud systems. It is thus necessary to develop support systems that can exploit the knowledge of several virtual communities and that can connect groups of experts for resolving problems.

Our distributed case-based reasoning system DisCaRia takes advantage of Peer-to-Peer (P2P) technology to extend the capability of conventional CBR systems by exploring problem-solving knowledge resources available in a distributed environment. The DisCaRia peers operate in parallel and comprise independent CBR components that work concurrently to scrutinize the knowledge resources. Our distributed CBR approach applies previous research activities to improve the performance of managing a large case database and the quality of the proposed solutions, including case retrieval and reasoning approaches [10]–[12] and case learning and retention [13], [14]. This article therefore uses the DisCaRia system to present the novel integration and adaptation of CBR and P2P technologies to address a network and service management problem. The contribution of the article is thus threefold:

1) We propose the distributed CBR approach based on multiple CBR engines organized in a P2P network for exploring and exploiting federated fault knowledge databases.
2) We present the DisCaRia system with the main methods and algorithms, and how they work together to achieve the integration and adaptation of CBR and P2P technologies.
3) We perform an evaluation of the DisCaRia system on the EmanicsLab distributed computing testbed [15] with federated bug databases obtained from several popular bug tracking systems.

The rest of the article is structured as follows: Section II discusses related work. Section III introduces the distributed CBR approach and research activities applied to the main components of the DisCaRia system. The creation and maintenance of the case database are detailed in Section IV.
Section V presents several experiments performed on the EmanicsLab distributed computing testbed in order to evaluate the performance of the DisCaRia system. The paper concludes with some remarks on future work in Section VI.

1932-4537 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. BACKGROUND AND RELATED WORK

This section reviews related work in the areas of trouble ticket and bug tracking systems for fault resolution, case-based reasoning, peer-to-peer technology, and network search. For simplicity, we use the terms super peer and DisCaRia peer, and the terms case and bug, interchangeably in this article.

A. Trouble Ticket Systems and Bug Tracking Systems

Trouble ticket systems have been widely used by network operators in order to assure the quality of communication services. The ITU-T recommendation X.790 [16] defines trouble as “any cause that may lead to or contribute to a manager perceiving a degradation in the quality of service of one or more network services or one or more network resources being managed.” X.790 introduces an interface for the interaction among parties, e.g., a trouble report data model, and a process for the resolution of troubles, e.g., determining the status of a trouble, escalating its severity, and notifying involved parties of its resolution. The informational IETF document RFC 1297 [17] defines trouble as “a single malfunctioning piece of hardware or software that breaks at some time, has various efforts to fix it, and eventually is fixed at some given time.” Trouble ticket systems can be used for communication among network operation centers (NOCs), and they can be associated with a network alert system for generating trouble tickets automatically and for monitoring the progress of the trouble tickets.
The TMF documents NMF501 [18] and NMF601 [19] model a trouble administration system that provides the interface between a service provider and users for managing trouble information. NMF501 focuses on the business and technical requirements and the trouble administration process, whereas NMF601 describes the functionality of exchanging management information to meet the requirements. Moreover, the system can create and track telecommunication troubles reported following the X.790 recommendation.

A Bug Tracking System (BTS) is a trouble ticket system used to keep track of software bugs. A bug tracking system uses an information model for a problem (also sometimes called a ticket, bug, defect, etc.) that is very similar to the information model used by trouble ticket systems. Pre-defined fields are commonly used to keep track of the status of the problem, while textual descriptions are used to describe the problem and to track the problem resolution process. A BTS in general aims at improving the quality of software products. BTSs do so by keeping track of reported problems and by maintaining historical records of previously experienced problems. They also establish the basis of a knowledge base for an expert system that allows searching for similar past problems and that provides reports and statistics for performance evaluation of the services [20]. While most expert systems proposed for fault diagnosis and resolution, such as ACE [21], COMPASS [22], NEMESYS [23], Troubleshooter [24], DAD [25] and MANDOLIN [26], explore a knowledge database supplied by a single BTS, DisCaRia is able to explore a federated knowledge database supplied by several BTSs.

Fig. 1. Cyclic case-based reasoning approach using four processes (retrieval, reuse, revision, and retention) and a common case database, adopted from [8].

B. Case-Based Reasoning

Case-based Reasoning (CBR) [8] seeks to find solutions for problems by exploiting experience. A case essentially consists of a description of a specific problem
and the corresponding solution. When a new problem appears, the reasoning process first uses a similarity function to retrieve cases matching the current problem, and then it adapts the retrieved cases to the circumstances of the current problem in order to obtain a possible solution. Depending on the characteristics of a certain problem domain, this reasoning process can either classify the problem into a group of previously resolved problems or propose an adapted solution for the problem. Problem classification can be suitable for a problem domain with a relatively large case database. Various reasoning techniques including rule-based reasoning, fuzzy logic, neural networks and belief networks can be used for this process.

Following the discussion in [8], a CBR system consists of four processes as shown in Figure 1: 1) case retrieval to obtain similar cases, 2) case reuse to propose adapted solutions, 3) case revision to verify the adapted solution, 4) case retention to learn the solution. The CBR system uses a case database to store and provide cases for the operation of the CBR processes.

Several CBR systems [5], [7], [27] have been proposed for fault diagnosis and resolution. These systems usually collaborate with trouble ticket systems in order to take advantage of trouble tickets as the case database. While these systems can learn from previous problems to propose solutions for novel problems, they usually only operate on a local case database and hence they cannot exploit problem-solving knowledge resources present at remote sites. Using shared knowledge resources not only provides better opportunities to find solutions but also improves the case databases that otherwise frequently become obsolete in environments where software components and offered services change very dynamically. DisCaRia uses several CBR engines that are interacting using a self-organizing P2P network in order to
exploit various problem-solving knowledge resources.

C. Peer-to-Peer Technology

Peer-to-Peer (P2P) technology [28] has been introduced to establish application-specific network overlays operating over the Internet [29]. A P2P network consists of peers that act both as client and server simultaneously. P2P systems exhibit a number of interesting properties such as self-organization, scalability, flexibility, and fault tolerance. Peers join and leave the networks with loose control, enabling fully distributed systems with a very large number of peers. Acting in both client and server roles, peers share resources, such as bandwidth, storage space, and computing power, and typically provide lookup functions. Since P2P networks do not have a hierarchical organization or centralized control, they are designed such that the failure of individual peers can affect the availability of certain resources but cannot cause the failure of the overall P2P network.

P2P networks can be classified into two categories: structured and unstructured networks. Structured P2P networks maintain a controlled and stable overlay network topology. A structured P2P network aims at distributing content at deterministic locations using Distributed Hash Tables (DHTs), thus facilitating efficient content search and lookup. Examples of structured P2P networks are CAN [30], Chord [31], Tapestry [32] and Kademlia [33]. Unstructured P2P networks spend less effort on controlling the overlay network topology and the location of content. Instead, they tend to grow randomly without maintaining a certain network topology according to some tight rules. The content is arbitrarily distributed on the peers, thus fostering different search methods. Examples of unstructured P2P networks include Gnutella [34], Freenet [35], and BitTorrent [36].

Hybrid P2P networks combine the characteristics of structured and unstructured P2P networks, and they often integrate well with the client-server paradigm [37]. The main idea is to distinguish so
called super peers from peers that possess varying storage, bandwidth and processing capabilities. Hybrid P2P networks organize peers into clusters using a clustering technique. A cluster contains at least one capable peer or super peer; the other peers connect to the super peer in the cluster. The connections between the super peers form the super peer P2P network. With sufficient storage, bandwidth and processing power, the super peers act both as a server to handle queries from other peers and as a client to route queries to the other super peers. The content advertisement and search mechanisms only take place on the super peer network. Hybrid P2P networks facilitate advanced search mechanisms due to the processing, storage and bandwidth resources available at the super peers. Examples of super peer networks include Piazza [38], Edutella [39] and Bibster [40]. DisCaRia uses a hybrid P2P network to take advantage of the inherent scaling properties that hybrid P2P networks offer.

D. Network Search

More recently, distributed network search algorithms have been suggested as a primitive for building future network management systems [41], [42]. Network search systems are organized as an overlay over a physical network topology. The overlay provides a distributed query processing facility that can be used to retrieve operational state and configuration data from network elements. The approach shares some similarity with the ideas behind the DisCaRia system. However, the work on network search primarily aims to provide a generic search mechanism for management and monitoring data, while the DisCaRia system assumes a certain CBR functionality to be implemented in the DisCaRia peers.

The idea of the DisCaRia system has also been applied to building a fault management system in the inter-cloud environment that supports applications and services running across multiple cloud systems [43]. This system recruits a P2P network of fault managers that allows system administrators to monitor faults and
search for similar faults with solutions on cloud systems.

III. DisCaRia SYSTEM

The distributed CBR approach focuses on CBR methodology and P2P technology. While CBR methodology has been widely known as a problem-solving method for several problem domains, P2P technology has been widely used for resource sharing in distributed environments. P2P systems provide remarkable features including self-organization in management, scalability in architecture, flexibility in content distribution and fault tolerance that allow peers to join the networks and exchange problems and solutions easily. CBR systems require expressive case representation methods, which allow retrieval and reasoning techniques to work efficiently. A distributed CBR system contains CBR engines communicating through a P2P network. It offers the capability of exploring and exploiting knowledge resources for retrieval and reasoning that can be applied to fault resolution in communication networks and distributed systems.

DisCaRia uses a P2P network to achieve a certain level of self-organization and to benefit from a scalable architecture. The system extends the underlying basic P2P network with a CBR approach for obtaining more relevant information for fault resolution [13]. There are two kinds of peers in the system: super peers and peers. Each super peer bears several components to perform CBR operations, e.g., retrieving, adapting, verifying and learning cases. These components require sufficient storage, bandwidth and processing power. Figure 2 shows the DisCaRia system architecture based on a P2P network of super peers. The super peers deal with complicated operations, thus alleviating the problem of peer heterogeneity, i.e., peers with limited capability do not undertake complicated operations. Each super peer is responsible for multiple functions including communication, computation, reasoning and maintenance, while each regular peer only communicates with super peers for finding relevant resources. A super peer contains
four main components: a P2P component, a computation component, a reasoning component, and a storage component. These components are realized as independent processes, so the failure of one component affects only this component. Each component consists of several functional modules that handle different tasks, e.g., the P2P component possesses modules that manage incoming and outgoing connections, and the computation component possesses modules that handle similarity evaluation and case indexing. Figure 3 presents the main components and how they interact within a super peer. The reasoning component communicates with the computation component to obtain cases from the case database, and it communicates with the P2P component to obtain cases from other super peers. The storage component communicates with external BTSs to maintain the case database, and the P2P component communicates with other super peers to exchange cases. The P2P component communicates with peers for requests and responses. In addition, DisCaRia supports a web interface for querying super peers directly without joining the P2P network.

Fig. 2. DisCaRia system architecture with peers and super peers.

Fig. 3. Components of a DisCaRia super peer and their interactions.

In the following sections, we first discuss the communication protocol used by DisCaRia and then we explain each DisCaRia component in more detail.

A. Communication Protocol

DisCaRia uses the Gnutella P2P protocol [34] to build a fully distributed and unstructured P2P network of super peers. This P2P network offers several advantages that facilitate data sharing and searching functions. First, the network uses super peers to solve the problem of peer heterogeneity. Peers with insufficient bandwidth and processing power cannot participate in complicated tasks, such as routing and processing queries. Second, the network allows super peers to maintain a local database
and to perform keyword and semantic search methods on the database. Third, the network also provides flexible data replication mechanisms for data sharing. The disadvantage of the network is the flooding-based routing mechanism, which can cause a large amount of traffic in the network. DisCaRia improves this routing mechanism by using the feedback scheme presented in the following section.

The Gnutella protocol supports five types of messages: ping and pong are used to probe the network, query and queryhit are used to exchange data, and push is used to deal with peers behind a firewall. Downloading data is handled separately from this protocol. A Gnutella message consists of a header and a payload. The fields of the header are shown below:

ID (16 octets) | Payload Descriptor (1 octet) | TTL (1 octet) | Hops (1 octet) | Payload Length (4 octets)

The ID field is a 16-octet string uniquely identifying the message on the network. The Payload Descriptor field is 1 octet in size and identifies the message type. The TTL field contains the number of times this message will be forwarded to peers before it is discarded. The Hops field contains the number of times the message has been forwarded. These two fields are both 1 octet long. The Payload Length field is 4 octets long and indicates the length of the payload. The payload immediately follows the header. The fields of the payload depend on the message type, as defined in the protocol specification [34]. The fields of the payload of a queryhit message are shown below:

Number of Results (1 octet) | Set of Results

The Number of Results field is 1 octet long and contains the number of results in the result set. The Set of Results field stores a set of super peers and peers with their corresponding results, i.e., a set of solutions for a certain problem.

The ping, pong, query and queryhit messages remain unchanged; DisCaRia extends the Gnutella protocol by adding a feedback message type. This message type allows peers receiving queryhits to evaluate the results contained in queryhits and to send feedback to the related peers. The fields of the payload of a feedback message are shown as
follows:

Number of Evaluations (1 octet) | Set of Evaluations

The Number of Evaluations field is 1 octet long and contains the number of evaluations in the evaluation set. The Set of Evaluations field stores a set of super peers and peers with their corresponding results, and the grade of the results.

DisCaRia not only allows peers to connect to super peers, it also provides a web interface to access super peers. This feature enables users to use the DisCaRia search function without downloading and installing peers.

B. Peer-to-Peer Component

The P2P component organizes the communication with other super peers and between a super peer and its directly attached regular peers. It also organizes the communication with the reasoning component. This is achieved by using the same P2P protocol to exchange information with the super peer’s reasoning component, i.e., the reasoning component also acts as a peer.

Algorithm 1. Super Peer Ranking and Selection.

Fig. 4. Exchange of query, queryhit, and feedback messages.

The P2P component uses a feedback mechanism [13] for evaluating the quality of queryhits, thus fostering peer learning. The mechanism extends the Gnutella protocol to include the feedback message described above. Figure 4 illustrates super peer communication and the usage of feedback messages. During bootstrapping, a peer discovers and connects to its super peer. Subsequently, the peer can send a query message to its super peer. The super peer forwards the message to its neighboring super peers, and they in turn forward the query to further super peers. Let us assume that two of these super peers are able to find an answer to the query. Each of them sends a queryhit message back to the peer via the super peers on the reverse path of the query. After receiving the queryhit messages, the peer evaluates the received answers and generates feedback messages to the two answering super peers. The feedback messages
are forwarded by the super peers along the query paths. The forwarding super peers will also learn about the peer’s evaluation of the queryhit results.

The feedback mechanism not only facilitates the learning process of super peers (corresponding to the retention process of the CBR approach), it also improves the flooding mechanism by sending the queries to a set of the super peers that have proven their competence in previous queryhits. In order to achieve this improvement, super peers need to keep track of queries and the super peers with correct queryhits by listening to feedback messages, i.e., QryLst is a list of elements (q, PrLst), where q is a query and PrLst is a list of super peers. Super peers forward queries to the selected sets of expert super peers and random super peers, i.e., ExpLst is a list of elements (p, e), where p is an expert super peer and e is its corresponding expert score, and NbrLst is a list of neighbor super peers. Note that the expert super peer and its score in ExpLst are regularly updated by learning from PrLst. The random super peers add some randomness to the scheme in order to allow new super peers to join the P2P network and to become experts on their own.

Algorithm 1 shows the super peer ranking and selection algorithm. It iterates over a set of recently similar queries in QryLst (line 3) to obtain a set of the expert super peers RnkLst (lines 4-5). Since each super peer in RnkLst has its expert score e stored in ExpLst, the sort function ranks RnkLst by the expert score (line 7). It then chooses up to γ − 1 super peers from the ranked set RnkLst (lines 8-10). If the number of the ranked super peers is insufficient, it considers the top elements of the query set QryLst (lines 11-15), which include the most recently used super peers. It finally fills the set of selected super peers SelLst with the super peers from the neighbor set NbrLst.

Input:  QryLst: a list of elements (q, PrLst)
        NbrLst: a list of neighbor super peers
        ExpLst: a list of elements (p, e)
        RnkLst: a list of ranked super peers
        similarity threshold, and γ: number of super peers
        q_c: an input query
Output: a list of selected super peers SelLst
 1: RnkLst ← ∅
 2: SelLst ← ∅
 3: for each (q_i, PrLst_i) ∈ QryLst
 4:     if sim(q_c, q_i) > similarity threshold then
 5:         RnkLst ← RnkLst ∪ PrLst_i
 6:     end if
 7: sort(RnkLst, ExpLst)
 8: for each p_j ∈ RnkLst
 9:     SelLst ← SelLst ∪ {p_j}
10:     if |SelLst| = γ − 1 then break end if
11: for each (q_i, PrLst_i) ∈ QryLst and p_j ∈ PrLst_i
12:     if |SelLst| = γ − 1 then break
13:     end if
14:     if p_j ∉ SelLst then SelLst ← SelLst ∪ {p_j}
15:     end if
16: for each p_j ∈ rand(NbrLst)
17:     if p_j ∉ SelLst then SelLst ← SelLst ∪ {p_j}
18:     end if
19:     if |SelLst| = γ then break
20:     end if

C. Reasoning Component

This component is the heart of the DisCaRia peer. It comprises case retrieval and reasoning operations corresponding to the retrieval and reuse operations of the CBR approach. To process a case efficiently, the case is represented by multiple vectors using the multi-vector representation method [10]:

• a field-value vector to express fault pre-defined features, e.g., fault categories, system components, and product releases. To represent n pairs, we employ the field-value vector $v_f = \langle f_1{:}v_1, \dots, f_k{:}v_k, \dots, f_n{:}v_n \rangle$, where k is the fixed number of pre-defined pairs.
• a field-value vector to express fault comparable features (user-defined features), e.g., error messages, symptoms, and debug snippets. These pairs are represented by $v_p = \langle p_1{:}v_1, \dots, p_m{:}v_m \rangle$, where m is the number of symptoms and parameters. They are either binary, numeric or symbolic values.
• a real-value vector to represent fault details in textual form, e.g., fault descriptions, discussions, and related fault features. This semantic vector $v_s$ is generated by the LSI technique [44].

The similarity of field-value vectors is measured by the sum of weight values of matched field-value
pairs, while the similarity of real-value vectors is measured by the cosine function. The following example is a fault case extracted from a networking forum:

Problem: Hub connectivity not functioning
Description: The WinXP network contains many machines obtaining an address from a DHCP server. The network is extended to more machines by unplugging the LAN cable from one of the machines and plugging it into a hub with the intention to add the new machines. From the hub, none of the machines successfully obtains an IP address from the DHCP server; an error message shows “low or no network connectivity”.

To make this fault case understandable and comparable to CBR engines, vector $v_f$ holds its pre-defined features and vector $v_p$ holds its comparable features. Using LSI, several terms are considered to build the vector $v_s$.

Upon receiving a query from the P2P component, the retrieval operation first checks whether the query already exists in the cache, and processes the query to generate multiple vectors. It then compares these vectors with the vectors of the cases to select relevant cases from the database. Evaluating vectors requires the computation component to classify cases based on field-value vectors and index cases based on real-value vectors. These computing operations normally process large matrices, thus consuming a large amount of time. Super peers index and select the cases based on their database, then send the selected cases to the querying peer using the P2P component. Aggregating cases from super peers results in a set of the retrieved cases.

Upon receiving the retrieved cases, the reasoning operation uses the two-process probability reasoning mechanism [11] to process the cases. The ranking process of the mechanism aims to narrow down the scope of the query by weighting the common symptoms between the retrieved cases and the query. This process applies the k-Nearest-Neighbor (kNN) algorithm [45] to rank the cases that share many of the same field-value pairs with the query. The computation component is
responsible for assigning weight values to field-value pairs. The selection process of the mechanism aims to predict some promising cases for the query by correlating the ranked cases with the query through the common symptoms. This process employs the Bayesian approach to compute probability values for the ranked cases, and selects the cases with high probability values. The probability value indicates the strength of belief in the case based on the previous knowledge of the ranked cases and the observed symptoms.

Given a set of cases C, where a case $C_r$ contains a set of symptoms $\{S_1, \dots, S_k\}$ and a solution, we treat the solutions in C as a set of exhaustive and mutually exclusive hypotheses $\{H_1, \dots, H_n\}$, and we assume that any symptom is the result of a diagnosing probe, e.g., a ping probe provides either the high probability of success or the low probability of success (i.e., failure). The problem contains a set of symptoms $\{S_1, \dots, S_h\}$ without a solution (note that cases and the problem can share the same symptoms). Thus, the puzzle is to find the highest conditional probability of the hypotheses $P(H_i \mid S_1, \dots, S_h)$ with $i = 1, \dots, n$.

Consider a set of exhaustive and mutually exclusive hypotheses $H_1, \dots, H_n$ and a set of evidence pieces (or symptoms) $S_1, \dots, S_h$ obtained from the problem, with the assumption that $S_1, \dots, S_h$ are independent of each other. Applying the conditional probability formula, we obtain:

\[
P(H_i \mid S_1, \dots, S_h) = \frac{P(S_1, \dots, S_h \mid H_i)\, P(H_i)}{P(S_1, \dots, S_h)} = \alpha\, P(H_i) \prod_{j=1}^{h} P(S_j \mid H_i)
\]

where $P(S_1, \dots, S_h \mid H_i) = \prod_{j=1}^{h} P(S_j \mid H_i)$ since the $S_j$ are independent of each other, $P(H_i)$ are the prior probabilities of the hypotheses, and $\alpha = [P(S_1, \dots, S_h)]^{-1}$ is determined via the requirement $\sum_{i=1}^{n} P(H_i \mid S_1, \dots, S_h) = 1$.

In case the evidence set S contains a new evidence piece $S_{new}$ (e.g., the problem has a new symptom), the update first computes $P(H_i \mid S)$ and then obtains $P(H_i \mid S, S_{new})$ as follows:

\[
P(H_i \mid S, S_{new}) = \frac{P(H_i \mid S)\, P(S_{new} \mid S, H_i)}{P(S_{new} \mid S)} = \beta\, P(H_i \mid S)\, P(S_{new} \mid H_i)
\]

where $\beta$ is determined by the same method as $\alpha$, $P(H_i \mid S)$ is computed as previously, and $P(S_{new} \mid H_i)$ is determined by the experts.

The following example computes the probability of solutions for the connection failure problem:

• H1 = Checking firewall software for blocking connections
  – S1 = Desktop keeps disconnecting from the Internet
  – S2 = Desktop and laptop keep connecting from the router
  – S3 = Connection usually goes really slow
  – S4 = Connection is fine before updating the firewall software
  – S5 = Router is WHR-HP-G54 and wireless adapter is Linksys WMP54G
• H2 = Reinstalling networking components (TCP/IP)
  – S1 = Desktop completely stops connecting to the Internet
  – S2 = Laptop can connect to desktop and the Internet
  – S3 = Desktop disconnects from laptop and D-Link router with a limited connectivity
  – S6 = Desktop uses an Etherlink 10/100 PCI card and laptop uses a wireless adapter
  – S7 = Registry was damaged on desktop a few days ago
• H3 = Checking router configuration for the IP address range
  – S1 = Desktop cannot connect to a router and the Internet
  – S2 = Laptop connects to the router and the Internet
  – S4 = The firewall software is often updated on those machines
  – S8 = Desktop gets error message of address already used when renewing

The following table presents the weight values of symptoms to the solutions (note that updated weight values are not bold). This table is for demonstration; for implementation, we only need the weight values related to the problem’s symptoms.

[Table: weight values of symptoms S1 to S8 (columns) for hypotheses H1 to H3 (rows). Recoverable values in reading order, S1 to S5: H1: 0.1, 0.1, 0.25, 0.448, 0.001; H2: 0.1, 0.1, 0.3, 0.001; H3: 0.25, 0.245, 0.005, 0, 0. S6 to S8: 0.1, 0.001, 0.048, 0.45, 0.001, 0.05, 0.45.]

In order to fulfill the requirement of exhaustive and mutually exclusive hypotheses, we examine a set of hypotheses H1, H2 and H3, and ignore other hypotheses; i.e., the conditional probability of other hypotheses is 0. This, however, can be a problem in practice.
if the examined set of hypotheses does not contain the desired hypothesis. We also consider the evidence pieces $S_1, \ldots, S_8$ obtained by distinct probes to be independent, because the effect of a probe on other evidence pieces is minor, and evidence pieces can only be correlated if probes are not distinct; e.g., if a connection failure occurs, symptoms collected by the ping and ftp probes can be correlated. The hypotheses possess approximately equal prior probabilities $P(H_i) = (0.33, 0.33, 0.34)$, and the problem contains the following symptoms:

• H = ?
– S1 = Desktop gets connection failure
– S2 = Other machines still connect to routers and to the Internet
– S4 = Desktop updated the firewall software two days ago

By applying the above equations, we obtain $P(H_i \mid S_1, S_2, S_4) = (0.8719, 0.0019, 0.1261)$ with $i = 1, 2, 3$. The result indicates that the chance of firewall software blocking connections is 87.19% given the symptoms of the problem. Intuitively, a solution is deduced from an incomplete set of symptoms; solutions likely share a subset of symptoms. Bayesian computation distinguishes those solutions by the significance of symptoms in a case and the significance of symptoms among cases. In addition, this component cooperates with the P2P component to learn the resulting cases and send the feedback messages to the peers, performing the retention operation of the CBR approach.

The Case Ranking and Selection algorithm reflects the two processes illustrated above. It iterates over a set of cases $C$, each of which contains a set of symptoms $C_r$ and a solution $H_r$ (line 3), finds a set of common symptoms $S$ (line 4), and accumulates weight values (lines 5-8) before creating a set of ranked cases $R$ (line 11). Note that $V$ is a list of probability values of solutions. It then iterates over the hypotheses in $R$ (line 12), where each hypothesis is a solution, initializes the prior probability $V_r$ with the value $1/n$, where $n$ is the number of solutions (line 13), and computes and normalizes the posterior probabilities (lines 14-17) before generating a set of final cases $F$ (line 19).

Algorithm: Case Ranking and Selection
Input:
    C: a set of cases (C_r, H_r), H_r: solution
    n: the number of solutions
    V: a list of probability values V_r
    C_r, w_r: sets of symptoms & weight values (case)
    C_p, w_p: sets of symptoms & weight values (problem)
    S: a set of symptoms
    R: a set of ranked cases
Output: a set of final cases F
 1: R ← ∅
 2: F ← ∅
 3: for each C_r ∈ C
 4:     S = C_r ∩ C_p
 5:     if S ≠ ∅ then
 6:         T_r = 0
 7:         for each S_i ∈ S
 8:             T_r = T_r + w_ri w_pi
 9:         R ← R ∪ {(C_r, H_r)}
10:     end if
11: sort(R, T_r)
12: for each H_r ∈ R
13:     V_r = 1/n
14:     for each S_i ∈ C_p
15:         V_r = V_r w_pi
16: for each V_r ∈ V & C_r ∈ R
17:     V_r = V_r (Σ V)^(-1)
18:     F ← F ∪ {(C_r, H_r)}
19: sort(F, V_r)

D. Computation Component

This component supports several assessment operations for retrieving cases. The first operation uses the weighted average function to measure the field-value vectors and to obtain cases. Field-value pairs represent both pre-defined and user-defined features. The component directly assigns weight values to the pre-defined features, while the users indirectly assign weight values to the user-defined features. The component can also assign average weight values to the user-defined features. This operation finally selects a few thousand cases from the hundred thousand cases in the database based on the measurement results.

The second operation uses the Latent Semantic Indexing (LSI) method [44] to index the selected cases based on their textual descriptions, i.e., using the term vectors to generate the real-value vectors. The indexing operation is time consuming because it computes the singular value decomposition (SVD) of large matrices, and it therefore runs in a separate process. Various SVD algorithms are implemented to achieve both precision and performance. This operation finally uses the cosine function to compare the real-value vectors of the selected cases with that of the query, and selects a set of a few hundred cases.

The third operation again uses the weighted average function to aggregate values from the above two
operations and provides the resulting set of a few dozen cases with high relevance. The component and algorithms are implemented in C with the support of the svdlibc [46] library for matrix computation to achieve high performance. This library contains an implementation of the single-vector Lanczos algorithm.

TABLE I. Some popular bug tracking sites (as of April 2013). A plus indicates that we were unable to get precise numbers and our numbers present a lower bound.

Fig. 5. Unified bug data model represented as a UML class diagram.

E. Storage Component

This component exploits various fault resources to update the databases of super peers. Fault resources can be available at bug tracking systems, online support forums and archives (also known as communities or knowledge bases). These systems support web-based crawling, and some systems also support RPC-based crawling. This component contains crawlers to obtain bug reports from fault resources. Since bug reports contain various kinds of information, the crawlers provide HTML parsers that extract selected information based on the unified bug data model shown in Figure 5 in order to generate cases. The crawlers regularly check the modified time attribute of the bug reports to update the cases. In addition, the cases can also be updated by the learning operation of super peers, i.e., super peers learn the solutions of the problem through the feedback messages. The component is implemented in Python and cases are stored in a MySQL database [47]. The database schema conforms to the unified data model. The Porter stemming algorithm [48] is used to stem terms in bug contents.

F. Web Interface

Peers only contain the P2P component that communicates with the P2P network for sharing and searching fault resources. They connect to one of the super peers to join DisCaRia. Note that super peers possess not only the P2P component but also the reasoning,
computation and storage components. Since peers have limited capability, they are used to send queries to and receive results from super peers. They are also used to advertise fault resources on super peers. Implementing these peers is simple, but users are usually reluctant to download and install them. DisCaRia thus provides a web support component on super peers to enable a web interface that allows users to connect to super peers and forward queries to them. This web interface can accept several kinds of symptoms, error messages and textual descriptions. The web interface is built on Django [49], an open source web application framework written in Python. Django provides several facilities for developing web applications and integrates well into web servers, such as the Apache HTTP Server [50].

IV. FAULT RESOURCES

Many fault resources are available on the Internet. DisCaRia aims at crawling bug reports from BTSs, archives and forums. These resources share the same purpose of reporting bugs for software and hardware components, but differ in data inclusion and presentation. Our investigation of the features of several BTSs focuses on the properties that are important for obtaining data from them. It is necessary to check which functionality a BTS supports in specific installations. In order to understand the structure of the information stored in a BTS, the underlying data model is inspected and documented. Some systems provide this information in a textual format while others provide graphical representations, usually in ad-hoc notations. Some systems do not provide a clear description of the data model underlying the BTS, and it is necessary to reverse engineer the data model by looking at concrete bug reports. Bug reports are dependent on other bugs if they cannot be resolved or acted upon until the dependency itself is resolved or acted upon. Tracking any dependency relations between bug reports is thus useful because it helps to correlate bugs. This activity is usually challenging without
expert and system support. Some systems allow full keyword search for their reports, while others only support searching via a set of pre-defined filters applied on the entire bug database. The former is more useful for an automated system that aims to provide keyword search capabilities itself.

Table I reports several popular BTSs, sites and numbers of bugs. With the exception of Debian BTS, which is only used for the Debian operating system, the other BTSs publish lists of known public sites that use them for bug tracking. While some BTSs provide a machine-readable web service interface to their bug data, most do not. In all systems where such an interface is supported, it is an optional feature, and because optional features require additional effort from an administrator to be set up, they are rarely available. In addition, a web service interface often provides much less data than the human-readable web interface that is most commonly used. Clearly, relying on the availability of a web service API is unrealistic. To solve this problem, crawlers [51] were created to directly use the presentational HTML-based web interface in order to get as much access to information as ordinary users. Other crawlers were built to exploit several methods provided by the Bugzilla web service interface, e.g., the XML-RPC API. The crawlers only submit a bug identifier to obtain the details of the bug from the Bugzilla server, i.e., a list of field-value pairs.

The database schemas of several BTSs and archives share several similar fields that can be classified in two main groups: (i) the administrative information associated with a bug is represented as field-value pairs, such as identity, severity, status, product, summary, among others; (ii) the description information detailing the bug and any follow-up discussion or actions is typically represented as textual attachments. Figure 5 shows the unified bug data model
in the form of a UML class diagram [52]. The central class is the Bug class. The id attribute of a bug is the URL where the bug can be retrieved, which serves as its identifier. Most of the attributes can be easily extracted from the retrieved data. The severity attribute is probably the most interesting to fill correctly, because BTSs have very different severity classifications for bugs. This unified model defines four severity values: critical, normal, minor and feature, which can be used to map the severity values of the BTSs. The status attribute only has two values: open represents what BTSs refer to as unconfirmed, new, assigned, or reopened bugs, while fixed represents what BTSs refer to as resolved, verified, and closed bugs. The textual description is modeled as the Attachment class. Every attachment belongs to exactly one bug.

Some BTSs provide information about the platforms affected by a bug. The Platform class represents platforms, e.g., Windows or MacOS. The Symptom class represents keywords used to describe and classify bugs. Simple classifications such as severity, status, platform, etc. can be derived from the information of bugs. Symptoms provide more complicated classifications such as problem scope and problem type, e.g., a bug related to network failure can be related to connectivity, authentication service, hardware and software configuration. These classifications help to narrow down the scope of a bug and to figure out the bugs related to it. Symptoms contain sets of distinct keywords, typical debugging messages or diagnosing patterns.

The left part of Figure 5 models what piece of software a bug is concerned with. While some BTSs are only concerned with bugs in a specific piece of software, software in larger projects is split into components and bugs can be related to specific components. The Software and Component classes model this structure. The Debian BTS is somewhat different from the other BTSs as it is primarily used to track issues related to software “packages”, that is software components
packaged for end user deployment. Since there is a large amount of meta information available for Debian software packages (dependency, maintainer and version information), we have introduced a separate Package class to represent packaged software.

Fig. 6. Average memory usage for a super peer with various datasets.

V. SYSTEM EVALUATION

The DisCaRia system contains several super peers with sufficient processing, storage, memory and bandwidth capabilities, and each super peer contains several components with various functionalities. Our previous studies have evaluated these components and algorithms individually, such as the feedback scheme [13] for the P2P component, the multiple vector representation method [10] for the computation component, the two-process probabilistic reasoning method [11] for the reasoning component, and the bug report crawlers [14] for the storage component. However, it is necessary to evaluate the performance and efficiency of the complete DisCaRia system. The responsiveness metric assesses how fast the system responds to queries. This metric depends on various factors including processing power, bandwidth and the retrieval algorithm. The quality metric measures how precisely the system answers problem queries. This metric depends on various factors including the size of the database and the reasoning algorithm.

The database has been populated by crawling various archives, forums and BTSs such as Mozilla Bugzilla, Red Hat Bugzilla, Launchpad Ubuntu, etc. These software bug datasets contain different numbers of bug reports. A super peer with normal hardware configuration (Intel Pentium 3.6 GHz, GB RAM) can generally accommodate more than 100,000 bug reports. Figure 6 shows that the average memory usage of a super peer increases slowly with the number of bug reports. The reasoning component handles the classifying and selecting processes that work with the features of cases, and the computation component maintains the indexing and filtering processes that work with
the textual descriptions of cases. These two components allocate almost all the memory used by the super peer. The super peer uses approximately 350 MB to manage 100,000 bug reports. Super peers can therefore accommodate a large number of bug reports without the need for powerful computers. DisCaRia with a large number of super peers can exploit idle and shared resources to form a large knowledge database for problem resolution. Note that the size of bug reports fluctuates considerably depending on the BTS.

Evaluating the responsiveness of DisCaRia requires a distributed environment exhibiting real network latency, peer heterogeneity and churn rate. EmanicsLab [15], supported by the European Network of Excellence for the Management of Internet Technologies and Complex Services (EMANICS), is a flexible and re-usable distributed computing and storage testbed that fosters joint research activities of EMANICS partners. EmanicsLab is based on a complete installation of PlanetLab Central, the backend management infrastructure of PlanetLab [53]. The testbed contains 20 nodes located at 10 universities and research institutes in Europe.

TABLE II. EmanicsLab super peers configuration.

TABLE III. Parameter configuration.

Fig. 7. DisCaRia deployment on EmanicsLab.

Fig. 8. Average time for retrieving the different numbers of queryhits.

DisCaRia has been installed on 11 EmanicsLab nodes, as shown in Table II, acting as super peers, with a web interface running on web servers at Jacobs University. Figure 7 plots the DisCaRia peers deployed on EmanicsLab. These peers randomly connect to each other and directly update their databases from BTSs. EmanicsLab contains some degree of node heterogeneity since nodes possess varying processing, memory and bandwidth capabilities. Nodes are dedicated to running distributed applications from the partners, thus the network latency is sufficient to evaluate this system. All queries
sent to DisCaRia are performed through the web interface. The queries extracted from bug reports comprise textual descriptions, symptoms or debugging messages. The queries are continuously fed to the web interface using a specific tool that connects directly to super peers. The tool also receives the queryhits from super peers for analysis. Table III describes the parameter configuration of super peers. The churn rate is set to 1% of the total number of super peers, and the other parameters are the maximum numbers of connections or queryhits.

Fig. 9. Average time consumption per query over various sizes of the datasets.

Figure 8 plots the average time for retrieving the different numbers of queryhits. The basic result contains queryhits returned from super peers running the retrieval algorithm only, and the advanced result contains queryhits returned from super peers running both the retrieval and reasoning algorithms. The limited or unlimited result contains queryhits returned from super peers connecting to a limited or unlimited number of super peers, respectively. Local super peers send queryhits within second, while remote super peers send queryhits within to seconds. Running the retrieval and reasoning algorithms consumes more time than running the retrieval algorithm only. Increasing the number of connections causes super peers to forward queries to further super peers, thus increasing communication time.

Figure 9 plots the average time consumption of the computation component, the reasoning component and the super peer over the various sizes of the datasets. The computation time increases logarithmically as the number of bugs increases. The computation component consumes time to compute matrices, and to index and select bugs. The reasoning component spends time to rank and classify the selected bugs that share the same symptoms with queries using Bayesian computation.

Fig. 10. Average amount of transferred data over various sizes of the query sets.

This component
operates on a set of the selected bugs provided by the computation component. Processing the set of the selected bugs is thus less time consuming than processing the whole dataset. An observation shows that when performing a query on a dataset of 100,000 bugs, the computation component takes 3.5 seconds on average to select bugs, while the reasoning component takes 1.3 seconds on average to obtain bugs. Among the multiple components, the super peer mainly depends on these two components, which achieve logarithmic time complexity over the number of bugs.

Figure 10 reports that data transfer between the remote super peers is trivial compared to the local super peer and the querying peer. The amount of transferred data per query is calculated from the number of bugs replied by a super peer. The local super peer directly associated with the querying peer usually replies the number of bugs times more than the remote super peers on average. However, while the local super peer responds to the querying peer, the remote super peers respond to the local super peer, which usually causes substantial traffic on the P2P network. The overall data transfer of DisCaRia peers heavily depends on the local super peer and the number of the remote super peers involved, and can reach approximately 350 MB for 100 queries using the current setting of DisCaRia.

Evaluating the reliability of DisCaRia focuses on measuring the super peer failure and reconnection rates over various sizes of the query sets. An event is counted as a super peer failure if the super peer fails to recover from a problem during its operation; if the super peer recovers and reconnects to the P2P network, the event is counted as a peer reconnection. The former makes resources unavailable and the latter changes the topology of the P2P network, which affects the quality of the queryhits. Since DisCaRia relies on super peers with high availability, the peer failure rate is low (less than 10% over approximately 1000 queries), as shown in Figure 11.
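As an illustration of the two rates discussed above, the following minimal sketch tallies them from a per-peer event log. It assumes, and this is not stated in the experiments, that a rate is the fraction of super peers for which at least one event of the given kind was observed during a query run. The function name, the event labels and the sample data are hypothetical.

```python
def peer_event_rates(events_by_peer):
    """Return (failure_rate, reconnection_rate) as fractions of super peers.

    events_by_peer maps a super peer name to the list of events observed
    for it during one query run. The labels "failure" and "reconnection"
    are hypothetical and are not taken from DisCaRia.
    """
    total = len(events_by_peer)
    if total == 0:
        return 0.0, 0.0
    # A peer counts once per kind of event, however often the event occurred.
    failed = sum(1 for events in events_by_peer.values() if "failure" in events)
    reconnected = sum(
        1 for events in events_by_peer.values() if "reconnection" in events
    )
    return failed / total, reconnected / total


# Made-up log for 11 super peers: one failure, four peers that reconnected.
sample_log = {f"sp{i}": [] for i in range(11)}
sample_log["sp0"] = ["failure"]
for name in ("sp1", "sp2", "sp3", "sp4"):
    sample_log[name] = ["reconnection", "reconnection"]

failure_rate, reconnection_rate = peer_event_rates(sample_log)
print(f"failure rate: {failure_rate:.2%}, reconnection rate: {reconnection_rate:.2%}")
```

Counting each peer at most once per kind of event keeps the two rates comparable across query sets of different sizes; a per-query count would be an equally plausible reading of the metric.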
However, the peer reconnection rate is relatively high (greater than 35% over approximately 1000 queries). To cope with the dynamic Internet environment, super peers regularly have to disconnect from and reconnect to the P2P network.

Fig. 11. Super peer failure and reconnection rates over various sizes of the query sets.

Fig. 12. Average probability of the top queryhit over the query sets.

Evaluating the efficiency of DisCaRia requires three sets of 100 queries. An identical set contains queries selected from the bug reports of the database. A familiar set contains queries extracted from the identical set by removing some keywords and symptoms. A random set contains queries chosen from the bug reports of the forums and archives. The following experiments focus on the capability of DisCaRia to find bug reports that are identical or similar to the queries. Probability values can be used to measure the result of these experiments, i.e., the queryhit with a high probability value can be considered as a duplicate of the query. Super peers normally return a set of queryhits and probability values for the query. The probability value of a queryhit is the assessment of the individual super peer. The number of queryhits increases as the number of super peers associated with the query increases.

Figure 12 depicts the average probability value of the top queryhit over the three query sets. The probability values of the identical and familiar sets increase continuously to 0.88 and 0.81 respectively, while the probability value of the random set stays stable at 0.62. Almost all the queries in the first two sets can obtain the duplicated bug reports with high probability. Some queries in the third set can obtain similar bug reports with average probability but cannot find the expected bug reports. Because super peers possess bug reports from only a few BTSs, they focus on a certain query domain. Therefore, the high probability value guarantees the high similarity between the query and the bug
report, while the average probability value indicates the close relationship between the query and the bug report.

Fig. 13. Average probability of the top three queryhits over the query sets.

Fig. 14. Average probability of the queryhits of arbitrary queries.

Similar to the above experiment, Figure 13 shows the average probability value of the top three queryhits over the three query sets. The identical query set again outperforms the familiar and random query sets. An observation shows that the probability values of the three query sets tend to move closer together, in the range from 0.58 to 0.68. The advantage of the identical query set is only due to the influence of the high probability of the identical bug report. Several super peers are unfamiliar with the queries, thus the queryhits from these peers obtain either high probability or low probability due to the distinct relationship with the query. Among the super peers associated with the queries, several super peers provide unrelated queryhits.

Figure 14 presents the average probability value of some arbitrary queries from the three query sets. The querying peer needs to set a timeout value for receiving the queryhits because super peers possess different processing power. A short timeout value reduces the number of queryhits, while a long timeout value increases the response time of DisCaRia. To deal with this issue, super peers only select and return the configured number of bug reports, as shown in Table III. The querying peer can therefore receive many queryhits for each query. The result only shows the top queryhits because the remaining queryhits possess low probability values. With the three query sets, the queryhits are divided into two parts: the top five queryhits with probability values higher than 0.4 and the remaining queryhits with probability values lower than 0.4. Queries from the identical set also obtain higher probability values than queries from
the other sets for the top five queryhits. The result reaffirms the capability of DisCaRia to clearly distinguish the bug reports related to the queries from the bug reports unrelated to the queries.

Each query in the query sets comes with a result set of identical or related bug reports that can be used to evaluate the recall and precision rates. The recall rate is the ratio of the number of related bug reports obtained to the pre-defined number of related bug reports. The precision rate is the ratio of the number of related bug reports obtained to the total number of bug reports obtained.

Fig. 15. Precision by various recall rates using the query sets.

Figure 15 presents the precision rate of the three query sets over the recall rate. The method is to obtain bug reports until the recall rate reaches 0.2, ..., 0.7, and then to compute the corresponding precision rates based on the total number of bug reports obtained. The precision rates of the three sets decrease as the recall rates increase; in particular, the precision of the random set drops quickly to 0.2. The queries in the random set are unrelated to bug tracking systems and mainly rely on textual descriptions, thus DisCaRia cannot easily exploit symptoms and error messages in the queries. The identical set outperforms the other sets because the queries in the identical set share several of the same symptoms and error messages with bug reports. For the identical set, the high precision rate (0.75) at the low recall rate (0.2) matches the previous result that the queryhits of this query set possess high probability values.

VI. CONCLUSION

We presented DisCaRia, a distributed case-based reasoning system combining the CBR approach with P2P technology to assist system administrators and network operators in the fault resolution process. DisCaRia can explore fault knowledge resources in distributed environments by using P2P technology and it can exploit problem solving knowledge resources using a CBR methodology. The
system’s architecture is built around four main components: a peer-to-peer component, a reasoning component, a computation component, and a storage component. These components communicate through an extended Gnutella protocol. The system prototype has been implemented and deployed on EmanicsLab, a distributed computing and storage testbed. Using responsiveness and quality metrics, several experiments have been performed using software bug reports from BTSs and problem reports from forums and archives as the knowledge resources. The experiments show that DisCaRia peers do not require powerful computers for computation and reasoning operations. The peers running on EmanicsLab nodes can achieve a plausible response time with a dataset of 100,000 bug reports. DisCaRia obtains identical and closely related bug reports with high probability. We note that there are no public annotated data sets that can be used to compare the performance of DisCaRia with similar systems. Our usage of data from BTSs and problem reports from public forums and archives can only be a start for defining a “standard” public data set for comparative evaluations.

Our future work focuses on further improving DisCaRia by using an ontology-based representation method in the reasoning component. This method employs semantic vectors to represent structured fault data. Building these semantic vectors requires an ontology that defines the concepts and relationships of fault properties in the fault knowledge domain. The advantage of this method is the capability of describing both textual descriptions and symptoms as concepts and constructing the relationships between these concepts. A preliminary study [52] has investigated this method. The relationships between concepts in the ontology can be exploited to improve the retrieval and reasoning processes.

REFERENCES

[1] S. Klinger, S. Yemini, Y. Yemini, D. Ohsie, and S. Stolfo, “A
coding approach to event correlation,” in Proc 4th Int Symp Integr Netw Manage IV, 1995, pp 266–277 [2] S Kätker and M Paterok, “Fault isolation and event correlation for integrated fault management,” in Proc 5th Int Symp Integr Netw Manage V, 1997, pp 583–596 [3] G Jakobson and M Weissman, “Real-time telecommunication network management: Extending event correlation with temporal constraints,” in Proc 4th Int Symp Integr Netw Manage IV, 1995, pp 290–301 [4] Y A Nygate, “Event correlation using rule and object based techniques,” in Proc 4th Int Symp Integr Netw Manage IV, 1995, pp 278–289 [5] L Lewis, “A case-based reasoning approach to the resolution of faults in communication networks,” in Proc 3rd Int Symp Integr Netw Manage (IM’93), 1993, pp 671–682 [6] G Dreo and R Valta, “Using master tickets as a storage for problemsolving expertise,” in Proc 4th Int Symp Integr Netw Manage IV, 1995, pp 328–340 [7] C Melchiors and L Tarouco, “Fault management in computer networks using case-based reasoning: DUMBO system,” in Proc 3rd Int Conf Case-Based Reason Develop (ICCBR’99), 1999, pp 510–524 [8] A Aamodt and E Plaza, “Case-based reasoning: Foundational issues, methodological variations, and system approaches,” AI Commun., vol 7, no 1, pp 39–59, 1994 [9] R Buyya, R Ranjan, and R N Calheiros, “Intercloud: Utility-oriented federation of cloud computing environments for scaling of application services,” in Proc 10th Int Conf Algorithms Archit Parallel Process (ICA3PP’10), 2010, pp 13–31 [10] H M Tran and J Schönwälder, “Fault representation in case-based reasoning,” in Proc 18th IFIP/IEEE Int Workshop Distrib Syst.: Oper Manage (DSOM’07), 2007, pp 50–61 [11] H M Tran and J Schönwälder, “Fault resolution in case-based reasoning,” in Proc 10th Pac Rim Int Conf Artif Intell (PRICAI’08), 2008, pp 417–429 [12] H M Tran and J Schönwälder, “Evaluation of the distributed case-based reasoning system on a distributed computing platform,” in Proc 7th Int Symp Front Inf Syst Netw Appl 
(FINA’11), 2011, pp 53–58 [13] H M Tran and J Schönwälder, “Heuristic search using a feedback scheme in unstructured peer-to-peer networks,” in Proc 5th Int Workshop Databases Inf Syst Peer-to-Peer Comput (DBISP2P’07), 2007, pp 1–8 [14] H M Tran, G Chulkov, and J Schönwälder, “Crawling bug tracker for semantic bug search,” in Proc 19th IFIP/IEEE Int Workshop Distrib Syst.: Oper Manage (DSOM’08), 2008, pp 55–66 [15] D Hausheer and C Morariu, “Distributed test-lab: EMANICSLab,” in Proc 2nd Int Summer School Netw Serv Manage (ISSNSM’08), University of Zurich, Switzerland, Jun 2008 [16] ITU-T, “Trouble management function for ITU-T applications,” Recommendation X.790, 1995 [17] D Johnson, “NOC internal integrated trouble ticket system functional specification wishlist,” RFC 1297, 1992 [18] TMF, “Customer to service provider trouble administration business agreement,” NMF 501, Issue 1.0, 1996 [19] TMF, “Customer to service provider trouble administration information agreement,” NMF 601, Issue 1.0, 1997 [20] D Bloom, “Selection criterion and implementation of a trouble tracking system: What’s in a paradigm?” in Proc 22nd Annu ACM SIGUCCS Conf User Serv (SIGUCCS’94), 1994, pp 201–203 [21] G T Vesonder, S J Stolfo, J E Zielinski, F D Miller, and D H Copp, “ACE: An expert system for telephone cable maintenance,” in Proc Int Joint Conf Artif Intell (IJCAI’83), 1983, pp 116–121 [22] S K Goyal, D S Prerau, A V Lemmon, A S Gunderson, and R E Reinke, “Compass: An expert system for telephone switch maintenance,” Expert Syst., vol 2, no 3, pp 112–126, Apr 1985 [23] K Macleish, S Thiedke, and D Vennergrund, “Expert systems in central office switch maintenance,” IEEE Commun Mag., vol 24, no 9, pp 26– 33, Sep 1986 [24] T E Marques, “A symptom-driven expert system for isolating and correcting network faults,” IEEE Commun Mag., vol 26, no 3, pp 6–13, Mar 1988 [25] S Rabie, A Rau-Chaplin, and T Shibahara, “DAD: A real-time expert system for monitoring of data packet networks,” IEEE 
Netw., vol. 2, no. 5, pp. 29–34, Sep. 1988.
[26] B. L. Gingrich and G. J. Minden, “MANDOLIN—A communications management expert system using a reduced form of the Dempster-Shafer uncertainty theory,” in Proc. 3rd Int. Conf. Ind. Eng. Appl. Artif. Intell. Expert Syst. (IEA/AIE’90), 1990, pp. 76–85.
[27] S. Berkovsky, T. Kuflik, and F. Ricci, “P2P case retrieval with an unspecified ontology,” in Proc. 6th Int. Conf. Case-Based Reason., 2005, pp. 91–105.
[28] S. Androutsellis-Theotokis and D. Spinellis, “A survey of peer-to-peer content distribution technologies,” ACM Comput. Surv., vol. 36, no. 4, pp. 335–371, Dec. 2004.
[29] E. K. Lua, J. Crowcroft, M. Pias, R. Sharma, and S. Lim, “A survey and comparison of peer-to-peer overlay network schemes,” IEEE Commun. Surv. Tuts., vol. 7, no. 2, pp. 72–93, Apr. 2005.
[30] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker, “A scalable content addressable network,” in Proc. Conf. Appl. Technol. Archit. Protoc. Comput. Commun. (SIGCOMM’01), 2001, pp. 161–172.
[31] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” in Proc. Conf. Appl. Technol. Archit. Protoc. Comput. Commun. (SIGCOMM’01), 2001, pp. 149–160.
[32] B. Zhao, L. Huang, J. Stribling, S. Rhea, A. Joseph, and J. Kubiatowicz, “Tapestry: A resilient global-scale overlay for service deployment,” IEEE J. Sel. Areas Commun., vol. 22, no. 1, pp. 41–53, Jan. 2004.
[33] P. Maymounkov and D. Mazières, “Kademlia: A peer-to-peer information system based on the XOR metric,” in Proc. 1st Int. Workshop Peer-to-Peer Syst. (IPTPS’01), 2002, pp. 53–65.
[34] The Gnutella Developer Forum. (2001). Gnutella Protocol Specification v0.4 [Online]. Available: http://rfc-gnutella.sourceforge.net/developer/stable/index.html, accessed on Aug. 2015.
[35] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong, “Freenet: A distributed anonymous information storage and retrieval system,” in Proc. Int. Workshop Des. Issues Anonym. Unobserv., 2000, pp. 46–66.
[36] B. Cohen, “Incentives build robustness in BitTorrent,”
in Proc. 1st Workshop Econ. Peer-to-Peer Syst., 2003, pp. 1–5.
[37] G. P. Jesi, A. Montresor, and O. Babaoglu, “Proximity-aware superpeer overlay topologies,” IEEE Trans. Netw. Serv. Manage., vol. 4, no. 2, pp. 74–83, Sep. 2007.
[38] I. Tatarinov et al., “The piazza peer data management project,” SIGMOD Rec., vol. 32, no. 3, pp. 47–52, 2003.
[39] W. Nejdl et al., “EDUTELLA: A P2P networking infrastructure based on RDF,” in Proc. 11th Int. Conf. World Wide Web (WWW’02), 2002, pp. 604–615.
[40] P. Haase et al., “Bibster—A semantics-based bibliographic peer-to-peer system,” in Proc. 3rd Int. Semant. Web Conf. (ISWC’04), 2004, pp. 349–363.
[41] M. Uddin, R. Stadler, and A. Clemm, “A query language for network search,” in Proc. 13th IFIP/IEEE Int. Symp. Integr. Netw. Manage. (IM’13), 2013, pp. 109–117.
[42] M. Uddin, R. Stadler, and A. Clemm, “Scalable matching and ranking for network search,” in Proc. 9th Int. Conf. Netw. Serv. Manage. (CNSM’13), 2013, pp. 251–259.
[43] H. M. Tran, S. V. U. Ha, L. N. Hoang, and A. V. T. Tran, “Fault resolution system for inter-cloud environment,” J. Sci. Technol. Vietnamese Acad. Sci. Technol. (Spec. Issue ACOMP’13), pp. 272–281, 2013.
[44] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman, “Indexing by latent semantic analysis,” J. Amer. Soc. Inf. Sci. Technol., vol. 41, no. 6, pp. 391–407, 1990.
[45] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[46] D. Rohde. (2002). The SVDLIBC Library [Online]. Available: http://tedlab.mit.edu/~dr/SVDLIBC/, accessed on Aug. 2015.
[47] Oracle Corporation. (1995). MySQL Database Software [Online]. Available: http://www.mysql.com/, accessed on Aug. 2015.
[48] M. F. Porter, “An algorithm for suffix stripping,” in Readings in Information Retrieval. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997, pp. 313–316.
[49] The Django Software Foundation. (2005). Django—The Web Framework for Perfectionists With Deadlines
[Online]. Available: http://www.djangoproject.com/, accessed on Aug. 2015.
[50] The Apache Software Foundation. (1997). Apache—HTTP Server Project [Online]. Available: http://httpd.apache.org/, accessed on Aug. 2015.
[51] G. Chulkov, “Buglook: A search engine for bug reports,” Master’s thesis, Jacobs Univ. Bremen, Bremen, Germany, Seminar Rep., May 2007.
[52] H. M. Tran, C. Lange, G. Chulkov, J. Schönwälder, and M. Kohlhase, “Applying semantic techniques to search and analyze bug tracking data,” J. Netw. Syst. Manage., vol. 17, no. 3, pp. 285–308, 2009.
[53] B. Chun et al., “PlanetLab: An overlay testbed for broad-coverage services,” SIGCOMM Comput. Commun. Rev., vol. 33, no. 3, pp. 3–12, 2003.

Ha Manh Tran received the Ph.D. degree from Jacobs University Bremen, Bremen, Germany. He is a Lecturer of Computer Science with the International University—Vietnam National University, where he is leading the Networks and Distributed Systems Research Group. His research interests include network management, fault management, parallel and distributed computing, P2P systems, and information retrieval. He serves as Technical Program Member and Reviewer of several international conferences and journals in the area of network and system management.

Jürgen Schönwälder (SM’04) received the Ph.D. degree from the Technische Universität Braunschweig, Braunschweig, Germany. He is a Professor of computer science at Jacobs University Bremen, Germany, where he is leading the Computer Networks and Distributed Systems (CNDS) Research Group. His research interests include network management, distributed systems, network measurements, embedded networked systems, and network security. He is an active member of the Internet Engineering Task Force (IETF), where he has edited more than 30 network management related specifications and standards. He has been co-chairing the ISMS Working Group of the IETF and currently serves as Co-Chair of the NETMOD Working Group. Previously, he chaired the Network Management Research Group (NMRG) of
the Internet Research Task Force (IRTF). He has been the Principal Investigator in several European research projects (Emanics, Flamingo, Leone). He contributed in various roles to IEEE and IFIP sponsored conferences. He currently serves on the Editorial Boards of the Journal of Network and Systems Management (Springer) and the International Journal of Network Management (Wiley). He is the Co-Editor of the Network and Service Management series in the IEEE Communications Magazine. Previously, he served on the Editorial Board of the IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT.