A trace clustering solution based on using the distance graph model

Quang-Thuy Ha¹(✉), Hong-Nhung Bui¹,², and Tri-Thanh Nguyen¹

¹ Vietnam National University (VNU), VNU-University of Engineering and Technology (UET), No. 144, Xuan Thuy, Cau Giay, Hanoi, Vietnam
{ntthanh,thuyhq}@vnu.edu.vn, nhungbh79@gmail.com
² Banking Academy of Vietnam, No. 12, Chua Boc, Dong Da, Hanoi, Vietnam

Abstract. Process discovery is the most important task in process mining. Because of the complexity of event logs (i.e., activities of several different processes are written into the same log), the discovered process models may be diffuse and unintelligible. That is why the input event logs should be clustered into simpler event sub-logs. This work provides a trace clustering solution based on the idea of using the distance graph model for trace representation. Experimental results demonstrated the effect of the proposed solution on the two measures of Fitness and Precision, especially on the Precision measure.

Keywords: Event log · Process mining · Fitness measure · Precision measure · Process discovering · Trace clustering · Distance graph model

1 Introduction

Process discovery is the most important task in process mining. There exist several algorithms for discovering process models from event logs, such as α (Wil M.P. van der Aalst and Boudewijn F. van Dongen [1]), α+ (A.K.A. de Medeiros et al. [9]), α++ (Lijie Wen et al. [17]), and other algorithms [2]. Due to the complexity of event logs, the discovered process models may be diffuse and unintelligible. That is why a two-phase approach has been proposed for process model discovery. In the first phase, the input event log is refined, commonly by means of clustering algorithms. In the second phase, process discovery algorithms are run on the refined event log to find the model. Several works follow this approach [4, 5, 8, 10, 13, 15, 16].

The distance graph model for text processing was proposed by Charu C. Aggarwal and Peixiang
Zhao in 2013 [3]. Distance graphs of order k (k = 0, 1, 2, …) for a document D (a string of words) drawn from a corpus C are a useful representation of D for text mining tasks [3, 7]. Because of the similarity between the graph structure of a process model and the distance graph model, this work focuses on a trace clustering solution based on the idea of using the distance graph model for trace representation. This study aims to contribute a new solution to trace clustering.

The rest of this article is organized as follows. In the next section, a trace clustering solution using the distance graph model is presented. This framework includes three phases: “Trace representation and Clustering”, “Process discovery”, and “Model Evaluation”. Experiments and remarks are described in the third section. In the fourth section, related work is introduced. Conclusions are given in the last section.

© Springer International Publishing Switzerland 2016. N.T. Nguyen et al. (Eds.): ICCCI 2016, Part I, LNAI 9875, pp. 313–322, 2016. DOI: 10.1007/978-3-319-45243-2_29

2 A Trace Clustering Solution Based on the Distance Graph Model

2.1 The Problem

The paper proposes a solution to trace clustering in event logs based on the distance graph model [3]. The problem is described as follows.

Let 𝒜 be the activity-name universe in an organization and A ⊆ 𝒜 be the set of all activity names for a business process in the organization. A trace σ is a sequence of activities, i.e., σ ∈ A⁺ (where A⁺ is the set of non-empty sequences of activities in A). Let L be a simple event log of a business process containing a set of traces constructed from A.

Process discovery algorithms transform event logs into process models represented in a process modeling language, e.g., Petri nets (WorkFlow nets: WF-nets), BPMN (Business Process Modeling Notation), or YAWL (Yet Another Workflow Language). There exist several algorithms for discovering process models from event logs, such as α [1], α+ [9], α++ [17], and others
[2].

For example, let L = [abdeh, adceg, acdefbdeg, adbeh, acdefdcefcdeh, acdeg] (where a = “register request”, b = “examine thoroughly”, c = “examine casually”, d = “check ticket”, e = “decide”, f = “reinitiate request”, g = “pay compensation”, h = “reject request”) be an event log for the request-for-compensation business process within an airline. Figure 1 shows the WorkFlow net discovered from the event log L by applying the α-algorithm [2].

Due to the complexity of event logs, the discovered process models may be diffuse and unintelligible. That is why a two-phase approach has been proposed for process model discovery. In the first phase, the input event log is refined, commonly by means of clustering algorithms. In the second phase, process discovery algorithms are run on the refined event log to find the process model [6].

Fig. 1. WorkFlow net discovered by the α-algorithm based on L [2]

2.2 The Distance Graph Model

As mentioned in the introduction, the distance graph model (“A distance graph of order k for a document D drawn from a corpus C”) for text processing was proposed by Charu C. Aggarwal and Peixiang Zhao in 2013. Figure 2 illustrates the distance graphs of orders 0, 1, and 2 for the well-known nursery rhyme “Mary had a little lamb” [3].

As stated in [3], the most common method of representing a document D is a vector of distinct terms generated from the corpus C, where each component of the vector is the frequency of a certain term appearing in D. Aggarwal and Zhao proposed to convert a distance graph into a vector-space representation, i.e., each directed edge in the distance graph is used to create a new “token” or “pseudo-word”. For example, the edge from MARY to LITTLE (in the distance graph of order 2) is used to create a new pseudo-word MARY-LITTLE; the pseudo-word created from the edge from LAMB to itself (in the distance graph of order 2) is LAMB-LAMB. The frequency of the edge is used to denote the
frequency of the pseudo-word. These new pseudo-words preserve the order of words in the document; thus, when combined with the distinct terms in the corpus C, they enhance the semantics of the vector representation of the document.

Fig. 2. Illustration of distance graph representation [3]

Charu C. Aggarwal and Peixiang Zhao showed some interesting features of the distance graph model, as well as the effectiveness of the model when applied to text classification. Since the order of activities within a trace plays an important role, one characteristic of the distance graph that makes it suitable for trace representation is its ability to preserve the order of words in a document in the form of directed edges.

Fig. 3. A three-phase framework of process discovery

2.3 A Three-Phase Process Discovery Framework

Figure 3 describes a process discovery framework using the trace clustering solution based on the distance graph model. The framework includes the “Trace representation and Clustering”, “Process discovery”, and “Model evaluation” phases.

The Trace Representation and Clustering Phase includes two steps. In the Trace Representation step, a dataset for clustering is created, in which a data point is a vector of distance graphs (with different orders) of a trace in the event log. The set A of activities in the event log is considered as the set of “distinct words” in the corpus C, and a trace in the event log is considered as a document D; thus, distance graphs for a trace can be constructed. For the given trace ⟨acdefdbeh⟩:

• Order 0 distance graph is: a(1), c(1), d(2), e(2), f(1), b(1), h(1), where the number denotes the frequency of the directed edge from the node to itself. This graph contains unconnected components.
• Order 1 distance graph is constructed from the order 0 graph a(1), c(1), d(2), e(2), f(1), b(1), h(1) by adding more edges: ac(1), cd(1), de(1), ef(1), fd(1), db(1), be(1), eh(1), where the number denotes the frequency.
• Order 2 distance graph
is constructed from the order 1 graph a(1), c(1), d(2), e(2), f(1), b(1), h(1), ac(1), cd(1), de(1), ef(1), fd(1), db(1), be(1), eh(1) by adding more edges: ad(1), ce(1), df(1), ed(1), fb(1), de(1), bh(1), where the number denotes the frequency.
• etc.

We followed the method of [3] to decompose a distance graph into a set of features for vector representation, with a small modification. A feature is either a vertex or a directed edge of the graph. Our modification is to ignore the edge from a vertex v to itself (i.e., edge vv) in the distance graph of order 0, since every vertex in the order 0 graph always has an edge from itself to itself (self-loop). In addition, an edge from a vertex to itself in a trace should indicate that an activity is repeated. For the above order 1 distance graph of the trace ⟨acdefdbeh⟩, the set of features is {a, c, d, e, f, b, h, ac, cd, de, ef, fd, db, be, eh}. The frequency of each feature in each trace is preserved in the vector representation.

Since, under this representation, a higher-order distance graph of a trace includes all lower-order distance graphs, the highest-order distance graph alone is enough to represent the trace, provided that the self-loops of the order 0 distance graph are distinguished from the self-loops of higher orders. With this representation, if two graphs share common sub-graphs, this is preserved in the representation. For another trace ⟨acdefbh⟩, its set of order 1 features {a, c, d, e, f, b, h, ac, cd, de, ef, fb, bh} is a subset of the order 2 feature set of the above trace. Consequently, the two vectors will be close to each other in the vector space.

Because event logs reflect the executions of business processes, all distance graphs of traces in an event log include some relation patterns of the discovered process model. That is why the number of features generated from all the traces in an event log L is significantly less than (|A| + |A|*(|A|-1)/2), where |A| denotes the cardinality of the set A of activities.

In the Clustering step, a clustering algorithm is applied to the dataset (e.g.,
K-Modes and K-means algorithms). The output of the Trace Representation and Clustering Phase is a set of clusters (sub-logs) of traces (cases) of the event log.

In the Process Discovery Phase, a process discovery algorithm (i.e., the α-algorithm) is applied to the clusters (event sub-logs) to get process models.

The Model Evaluation Phase measures the quality of the resulting process models. Though there are four common measures for evaluation, i.e., Fitness, Precision, Generalization, and Simplicity [2, 11, 12], this work considers two measures, Fitness and Precision, which were described by A. Rozinat and Wil M.P. van der Aalst [11]. The Fitness measure indicates that the discovered model should accept the behaviors seen in the event log, and the Precision measure means that the discovered model should not accept behaviors completely unrelated to what was seen in the event log. Since these measures are calculated on each cluster, an aggregated value for the whole event log should be calculated. This work selects a weighted average value as follows:

$$w_{avg} = \sum_{i=1}^{k} \frac{n_i}{n}\, w_i \qquad (1)$$

where $w_{avg}$ is the aggregated value of the fitness or precision measure, $k$ is the number of clusters, $n$ is the number of traces in the event log, $n_i$ is the number of traces in the i-th cluster, and $w_i$ is the value of the measure for the i-th cluster.

3 Experiments and Results

This work used the prBm6 event log from “Conformance Checking in the Large”¹ for the experiments. The event log includes 1200 cases with 37961 events. In the Clustering step, two clustering algorithms, K-Modes and K-means, were used. In the Process discovery and Model evaluation phases, ProM [19] was used. After several tests, we selected a maximum distance graph order of 2 for all the experiments.

3.1 The Experiment with the K-Modes Algorithm

Since a trace is a sequence of activities and an event log yields a set of activities, a common trace representation has been proposed: a binary vector over activities, i.e., a vector component is 1 if the trace contains a certain
activity, otherwise 0 [2, 8]. To evaluate the model, the binary trace vector based on activity representation was implemented as a baseline.

The experimental results are described in Table 1. We consider the values of Average-Fitness and Average-Precision (1) for the vector-based and the distance graph order 2-based trace representations in the columns titled “Avg” in the table. After several runs, we found that the suitable number of clusters for the data set is 3. Experiments on the distance graph order 1-based representation were also carried out. All experimental results for the vector-based, the distance graph order 1-based, and the distance graph order 2-based trace representations are also shown in Fig. 4.

3.2 The Experiment with the K-Means Algorithm

In this experiment, the K-means clustering algorithm was run on the vector-based and distance graph-based trace representations. The experimental results are described in Table 2. We also calculated the values of Average-Fitness and Average-Precision (1) for the activity-based (Vector) and the distance graph-based (Distance graph) trace representations in the columns titled “Avg” in the table. Experiments on the distance graph order 1-based representation were also carried out. All experimental results for the vector-based, the distance graph order 1-based, and the distance graph order 2-based trace representations are also shown in Fig. 5.

¹ http://data.3tu.nl/repository/uuid:44c32783-15d0-4dbd-af8a-78b97be3de49

Table 1. Using the K-modes clustering algorithm: the fitness and precision for all event sub-logs (clusters) in the activity-based (Vector) and the distance graph order 2-based (Distance Graph) trace representations

Method                  Cluster  #Traces  Fitness  Precision
Vector                  Clus1    326      0.9636   1
                        Clus2    621      0.9450   0.6926
                        Clus3    253      0.9629   0.5868
                        Avg      1200     0.9539   0.7538
Distance Graph order 2  Clus1    326      0.9637   1.0
                        Clus2    559      0.9876   0.9914
                        Clus3    315      0.9500   0.6974
                        Avg      1200     0.9713   0.9165

Fig. 4. Comparison of
the discovered process models on the measures of Fitness and Precision among the activity-based (Vector), distance graph order 1-based (Distance Graph1), and distance graph order 2-based (Distance Graph2) representations with the K-Modes clustering algorithm

Table 2. Using the K-means clustering algorithm: the fitness and precision for all event sub-logs (clusters) in the activity-based (Vector) and the distance graph-based (Distance Graph) trace representations

Method                  Cluster  #Traces  Fitness  Precision
Vector                  Clus1    326      0.9637   1
                        Clus2    621      0.9787   0.6408
                        Clus3    253      0.9450   0.9763
                        Avg      1200     0.9675   0.8091
Distance Graph order 2  Clus1    326      0.9637   1
                        Clus2    475      0.9680   0.7106
                        Clus3    399      0.9787   0.9908
                        Avg      1200     0.9704   0.8824

Fig. 5. Comparison of the discovered process models on the measures of fitness and precision among the activity-based (Vector), distance graph order 1-based (Distance Graph1), and distance graph order 2-based (Distance Graph2) representations with the K-means clustering algorithm

3.2.1 Discussions

There are some findings from the results shown in Table 1 (Fig. 4) and Table 2 (Fig. 5), as follows:

– In all cases, the performance of the distance graph-based trace representation is better than that of the vector-based trace representation on the fitness and precision measures.
– The effect of the distance graph-based trace representation on the precision measure is higher than that on the fitness measure.
– Distance graph order 2 has a better effect on precision in comparison with distance graph order 1.

4 Related Work

G. Greco et al. [8] proposed a clustering solution for traces in event logs. They used a vector representation for traces and the K-means algorithm. Their work was the first study on trace clustering within the process mining domain.

R.P. Jagadeesh Chandra Bose [6] and R.P. Jagadeesh Chandra Bose et al. [4, 5] proposed trace clustering solutions based on control-flow context information, i.e., “context-aware” solutions. The Levenshtein distance technique was used.

De Weerdt et al. [15] proposed a
two-phase solution combining trace clustering and text mining for process discovery. In the first phase, an MRA-based semi-supervised clustering technique (the SemSup-MRA algorithm) was applied, yielding two kinds of clusters: clusters of standard behaviors and clusters of atypical behaviors. In the second phase, process mining and text mining techniques were applied. After [15], De Weerdt et al. [16] proposed the ActiTraC algorithm, a three-phase algorithm for clustering an event log into a collection of event logs (clusters). The three phases are Selection, Look ahead, and Residual trace resolution. They also developed the ActiTraCMRA algorithm, a further version of the ActiTraC algorithm.

T. Thaler et al. [14] provided a survey of trace clustering techniques. They also analyzed and compared the investigated trace clustering techniques.

The present work is the first study on using the distance graph model [3] for trace clustering.

5 Conclusions

This work provided a trace representation solution based on the distance graph model [3] for clustering traces in event logs. Experiments showed that the distance graph-based trace representation is more effective than the activity-based one.

The experiments in this work are limited, and several tasks remain for future work. Firstly, other distance measures between graphs, e.g., distances in graph theory [18], should be studied to directly cluster traces in the form of graphs. Secondly, more clustering algorithms, especially graph-based clustering algorithms, should be considered. Thirdly, experiments should be run on more event log datasets to confirm the reliability of the method.

Acknowledgments. This work was supported in part by VNU Grant QG-15-22.

References

1. van der Aalst, W.M., van Dongen, B.F.: Discovering workflow performance models from timed logs. In: Han, Y., Tai, S., Wikarski, D. (eds.)
EDCIS 2002. LNCS, vol. 2480, pp. 45–63. Springer, Heidelberg (2002)
2. van der Aalst, W.M.P.: Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, Heidelberg (2011)
3. Aggarwal, C.C., Zhao, P.: Towards graphical models for text processing. Knowl. Inf. Syst. 36(1), 1–21 (2013)
4. Bose, R.C., van der Aalst, W.M.: Trace clustering based on conserved patterns: towards achieving better process models. In: Rinderle-Ma, S., Sadiq, S., Leymann, F. (eds.) BPM 2009. LNBIP, vol. 43, pp. 170–181. Springer, Heidelberg (2010)
5. Bose, R.P.J.C., van der Aalst, W.M.P.: Context aware trace clustering: towards improving process mining results. In: SDM 2009, pp. 401–412 (2009)
6. Bose, R.P.J.C.: Process Mining in the Large: Preprocessing, Discovery, and Diagnostics. Ph.D. thesis, Eindhoven University of Technology (2012)
7. Dai, X.-Y., Cheng, C., Huang, S., Chen, J.: Sentiment classification with graph sparsity regularization. In: Gelbukh, A. (ed.) LNCS, vol. 9042, pp. 140–151. Springer, Heidelberg (2015)
8. Greco, G., Guzzo, A., Pontieri, L., Saccà, D.: Discovering expressive process models by clustering log traces. IEEE Trans. Knowl. Data Eng. 18(8), 1010–1027 (2006)
9. de Medeiros, A.K.A., van Dongen, B.F., van der Aalst, W.M.P., Weijters, A.J.M.M.: Process mining: extending the alpha-algorithm to mine short loops. BETA Working Paper Series (2004)
10. de Medeiros, A.K.A., Guzzo, A., Greco, G., van der Aalst, W.M., Weijters, A., van Dongen, B.F., Saccà, D.: Process mining based on clustering: a quest for precision. In: Hofstede, A.H., Benatallah, B., Paik, H.-Y. (eds.)
BPM Workshops 2007. LNCS, vol. 4928, pp. 17–29. Springer, Heidelberg (2008)
11. Rozinat, A., van der Aalst, W.M.P.: Conformance checking of processes based on monitoring real behavior. Inf. Syst. 33(1), 64–95 (2008)
12. Buijs, J.C., van Dongen, B.F., van der Aalst, W.M.: On the role of fitness, precision, generalization and simplicity in process discovery. In: Meersman, R., Panetto, H., Dillon, T., Rinderle-Ma, S., Dadam, P., Zhou, X., Pearson, S., Ferscha, A., Bergamaschi, S., Cruz, I.F. (eds.) OTM 2012, Part I. LNCS, vol. 7565, pp. 305–322. Springer, Heidelberg (2012)
13. Song, M., Günther, C.W., van der Aalst, W.M.: Trace clustering in process mining. In: Ardagna, D., Mecella, M., Yang, J. (eds.) Business Process Management Workshops. LNBIP, vol. 17, pp. 109–120. Springer, Heidelberg (2009)
14. Thaler, T., Ternis, S.F., Fettke, P., Loos, P.: A comparative analysis of process instance cluster techniques. In: Wirtschaftsinformatik 2015, pp. 423–437 (2015)
15. De Weerdt, J., vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Leveraging process discovery with trace clustering and text mining for intelligent analysis of incident management processes. In: IEEE Congress on Evolutionary Computation, pp. 1–8 (2012)
16. De Weerdt, J., vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Active trace clustering for improved process discovery. IEEE Trans. Knowl. Data Eng. 25(12), 2708–2720 (2013)
17. Wen, L., van der Aalst, W.M.P., Wang, J., Sun, J.: Mining process models with non-free-choice constructs. Data Min. Knowl. Discov. 15(2), 145–180 (2007)
18. Deza, M.M., Deza, E.: Distances in Graph Theory. Springer, Heidelberg (2014)
19. http://www.processmining.org/prom/start
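The trace representation step of Sect. 2.3 (vertices plus directed edges up to a given order, with frequencies) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `distance_graph_features` is hypothetical, every activity is assumed to be a single character, and the example trace acdefdbeh is the activity sequence implied by the order 1 edges listed in Sect. 2.3.

```python
from collections import Counter


def distance_graph_features(trace, order):
    """Count vertex and edge features of a trace's distance graph.

    Hypothetical helper, not taken from the paper. `trace` is a string
    in which every character is one activity name.
    """
    # Vertex features: how often each activity occurs in the trace.
    # These stand in for the order 0 self-loops, which are not kept
    # as separate "vv" edge features.
    features = Counter(trace)
    # Edge features: for each distance d = 1..order, count the ordered
    # pair (activity at position i, activity at position i + d).
    for d in range(1, order + 1):
        for i in range(len(trace) - d):
            features[trace[i] + trace[i + d]] += 1
    return features


order1 = distance_graph_features("acdefdbeh", 1)
order2 = distance_graph_features("acdefdbeh", 2)
other = distance_graph_features("acdefbh", 1)

print(sorted(order1.items()))
# The second trace's order 1 features all occur in the first trace's
# order 2 feature set.
print(set(other) <= set(order2))  # prints True
```

Because vertex features are one character long and edge features two, a repeated activity at distance d (for example "dd") stays distinct from the vertex feature "d", which is one simple way to keep order 0 self-loops apart from higher-order ones.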

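The weighted average of Eq. (1) can be checked against the reported figures with a short Python sketch. The function name `weighted_average` is an assumption made for illustration; the cluster sizes and per-cluster values are the distance graph order 2 rows of Table 1 (K-Modes).

```python
def weighted_average(sizes, values):
    """Aggregate per-cluster values as in Eq. (1): sum of (n_i / n) * w_i."""
    n = sum(sizes)  # total number of traces in the event log
    return sum(n_i * w_i for n_i, w_i in zip(sizes, values)) / n


# Table 1, distance graph order 2 representation, K-Modes clustering.
sizes = [326, 559, 315]
fitness = [0.9637, 0.9876, 0.9500]
precision = [1.0, 0.9914, 0.6974]

avg_fitness = weighted_average(sizes, fitness)
avg_precision = weighted_average(sizes, precision)
print(round(avg_fitness, 4), round(avg_precision, 4))
```

The results agree with the "Avg" row of Table 1 (0.9713 and 0.9165) to about three decimal places; the small remaining differences are presumably due to rounding of the per-cluster values.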