Significance of the experimental results

Working on the sales database of the Yên Bái supermarket, in which the items are encoded as natural numbers, the program finds the groups of items that yield high profit. This gives the supermarket's managers useful support in organizing the business.

For example, the threshold of 500 used above means that the program returns the item groups whose profit is greater than or equal to 500,000 VND.
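To make the role of the threshold concrete, the following minimal sketch computes the utility (profit) of an itemset over a handful of transactions and compares it with the threshold of 500, in thousands of VND. The transactions, item codes and profit values are invented for illustration and are not taken from the supermarket data; the class name `ItemsetUtilityDemo` and the method `utility` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class ItemsetUtilityDemo {

    /**
     * Utility of an itemset: the sum, over every transaction that contains
     * all of its items, of the profit those items contribute there.
     * Each transaction maps an item code to the profit (thousand VND)
     * earned from that item in the transaction.
     */
    static int utility(Set<Integer> itemset, List<Map<Integer, Integer>> transactions) {
        int total = 0;
        for (Map<Integer, Integer> t : transactions) {
            // only transactions containing the whole itemset contribute
            if (t.keySet().containsAll(itemset)) {
                for (int item : itemset) {
                    total += t.get(item);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical data: three transactions over item codes 1, 2 and 3.
        List<Map<Integer, Integer>> db = List.of(
                Map.of(1, 200, 2, 150, 3, 40),
                Map.of(1, 100, 2, 90),
                Map.of(2, 60, 3, 300));
        int minUtility = 500; // threshold, in thousand VND

        int u = utility(Set.of(1, 2), db);
        System.out.println("u({1,2}) = " + u + ", high utility: " + (u >= minUtility));
    }
}
```

With these made-up numbers the group {1, 2} earns 200 + 150 + 100 + 90 = 540, so it passes the threshold, whereas item 1 alone earns only 300 and does not. This also shows why utility is not anti-monotone: a superset can qualify while its subset does not.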

After mapping the item codes back to product names, the group of items that yields high profit at the threshold of 500 is {Vinamilk sterilized milk, ADM+ formula, carton of 48 boxes x 110 ml; Hảo Hảo sour-and-spicy shrimp noodles 75 g; Hải Châu iodized soup powder 190 g; Omo Comfort washing powder 3 kg; Simply soybean cooking oil 1 L}.

Based on this result, business managers gain a deeper and more general view of the items that customers choose to buy together. They can then draw up plans and development strategies that concentrate on the item groups currently yielding high profit, and introduce promotional policies to expand these promising item groups.

CONCLUSION

1. Main results of the thesis

The thesis focuses on the problem of finding high-utility itemsets and studies several efficient algorithms for mining them. Its main results are:

- An overview of data mining, with an in-depth study of the frequent itemset mining problem: the basic concepts and two typical algorithms, Apriori and FP-Growth.

- A study of high-utility itemset mining, an extension of the frequent itemset mining problem: the concepts and the problem statement. The thesis examines in depth two typical algorithms that follow two different approaches, the Two-Phase algorithm and the HUI-Miner algorithm.

The Two-Phase algorithm mines in the same way as Apriori: it generates candidate itemsets and then scans the database to check their transaction-weighted utility. Once the set of high transaction-weighted utility itemsets has been found, the algorithm scans the database one more time to determine which itemsets in this set are actually high-utility itemsets.

The HUI-Miner algorithm has several advantages and overcomes the limitations of the Two-Phase algorithm. It uses a new data structure called the utility-list, builds the utility-lists from the database in two scans, and then mines the high-utility itemsets from these utility-lists.

- An experimental program that uses the HUI-Miner algorithm to find the groups of items that yield high profit from the supermarket's sales data.
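The two database scans that HUI-Miner uses to build the initial utility-lists can be sketched as follows. This is a simplified illustration, not the thesis program: transactions are held in memory as arrays instead of being read from a file, and the names `UtilityListSketch`, `Entry` and `build` are hypothetical. The first scan computes the transaction-weighted utility (TWU) of each item, the same upper bound that the Two-Phase algorithm uses for pruning. The second scan drops items whose TWU is below the threshold, orders the rest by ascending TWU, and records for each item its utility and the remaining utility of the items that follow it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class UtilityListSketch {

    /** One utility-list entry: transaction id, item utility, remaining utility. */
    static final class Entry {
        final int tid, iutil, rutil;
        Entry(int tid, int iutil, int rutil) {
            this.tid = tid;
            this.iutil = iutil;
            this.rutil = rutil;
        }
    }

    /**
     * items[t][k] is the k-th item of transaction t and utils[t][k] its utility.
     * Returns the utility-list of every item whose TWU >= minUtility.
     */
    static Map<Integer, List<Entry>> build(int[][] items, int[][] utils, int minUtility) {
        // Scan 1: TWU(item) = sum of the utilities of the transactions containing it.
        Map<Integer, Integer> twu = new HashMap<>();
        for (int t = 0; t < items.length; t++) {
            int transactionUtility = 0;
            for (int u : utils[t]) transactionUtility += u;
            for (int item : items[t]) twu.merge(item, transactionUtility, Integer::sum);
        }

        // Scan 2: build the utility-lists of the promising items.
        Map<Integer, List<Entry>> lists = new TreeMap<>();
        for (int t = 0; t < items.length; t++) {
            final int[] row = items[t];
            List<Integer> kept = new ArrayList<>(); // positions of promising items
            int remaining = 0;
            for (int k = 0; k < row.length; k++) {
                if (twu.get(row[k]) >= minUtility) {
                    kept.add(k);
                    remaining += utils[t][k];
                }
            }
            // ascending TWU, ties broken by item code
            kept.sort((a, b) -> {
                int c = Integer.compare(twu.get(row[a]), twu.get(row[b]));
                return c != 0 ? c : Integer.compare(row[a], row[b]);
            });
            for (int k : kept) {
                remaining -= utils[t][k]; // utility of the items after this one
                lists.computeIfAbsent(row[k], x -> new ArrayList<>())
                     .add(new Entry(t, utils[t][k], remaining));
            }
        }
        return lists;
    }

    public static void main(String[] args) {
        // Invented data: three transactions, threshold 20.
        int[][] items = {{1, 2, 3}, {1, 3}, {2, 4}};
        int[][] utils = {{5, 10, 2}, {10, 6}, {4, 1}};
        build(items, utils, 20).forEach((item, list) -> {
            StringBuilder sb = new StringBuilder("item " + item + ":");
            for (Entry e : list) sb.append(" (" + e.tid + "," + e.iutil + "," + e.rutil + ")");
            System.out.println(sb);
        });
    }
}
```

In this invented example item 4 has a TWU of 5 and is discarded after the first scan, so it never receives a utility-list. The sum of `iutil` in a list is the exact utility of the itemset, and the sum of `iutil + rutil` bounds the utility of all its extensions, which is what allows the mining step to prune without generating candidates.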

2. Directions for future research

Some possible directions for extending the thesis are:

- Further study of high-utility itemset mining algorithms such as FHM and FHM+, which improve and extend the HUI-Miner algorithm.

- A study of methods for mining high-utility itemsets on incremental databases.

REFERENCES

In Vietnamese

[1] Vũ Đức Thi, Nguyễn Huy Đức (2008), “Thuật toán hiệu quả khai phá tập mục lợi ích cao trên cấu trúc dữ liệu cây”, Tạp chí Tin học và Điều khiển học, 24(3), pp. 204-216.

[2] Nguyễn Huy Đức (2009), “Khai phá tập mục cổ phần cao và lợi ích cao trong cơ sở dữ liệu”, PhD thesis in mathematics, Institute of Information Technology, Hanoi.

In English

[3] Agrawal R., Imielinski T., Swami A. N. (1993), “Mining association rules between sets of items in large databases”, In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C.

[4] Han J., Kamber M., Pei J. (2012), “Data Mining: Concepts and Techniques”, Third Edition, Morgan Kaufmann Publishers (an imprint of Elsevier), USA.

[5] Han J. (2004), “Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach”, Data Mining and Knowledge Discovery, Vol. 8, pp. 53–87.

[6] Philippe Fournier-Viger, Cheng-Wei Wu, Souleymane Zida, Vincent S. Tseng (2014), “FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning”, Proc. 21st International Symposium on Methodologies for Intelligent Systems (ISMIS 2014), Springer, LNAI, pp. 83-92.

[7] Ying Liu, Wei-keng Liao, Alok Choudhary (2005), “A two-phase algorithm for fast discovery of high utility itemsets”, In: Proc. PAKDD 2005, pp. 689-695.

[8] Mengchi Liu, Junfeng Qu (2012), “Mining High Utility Itemsets without Candidate Generation”, In Proceedings of CIKM 2012, pp. 55-64.

[9] Yao H., Hamilton H. J., Butz C. J. (2004), “A Foundational Approach to Mining Itemset Utilities from Databases”, Proceedings of the 4th SIAM International Conference on Data Mining, Florida, USA.

[10] Yao H., Hamilton H. J. (2006), “Mining Itemset Utilities from Transaction Databases”, Data & Knowledge Engineering, Vol. 59, Issue 3.

APPENDIX

Source code of the program:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import ca.pfv.spmf.tools.MemoryLogger;

/**
 * This is an implementation of the "HUI-MINER Algorithm" for High-Utility Itemsets Mining
 * as described in the conference paper : <br/><br/>
 *
 * Liu, M., Qu, J. (2012). Mining High Utility Itemsets without Candidate Generation.
 * Proc. of CIKM 2012. pp.55-64.
 *
 * @see Element
 * @author Philippe Fournier-Viger
 */
public class AlgoHUIMiner {

    /** the time at which the algorithm started */
    public long startTimestamp = 0;

    /** the time at which the algorithm ended */
    public long endTimestamp = 0;

    /** the number of high-utility itemsets generated */
    public int huiCount = 0;

    /** Map to remember the TWU of each item */
    Map<Integer, Integer> mapItemToTWU;

    /** writer to write the output file */
    BufferedWriter writer = null;

    /** the number of utility-list that was constructed */
    private int joinCount;

    /** buffer for storing the current itemset that is mined when performing mining
     * the idea is to always reuse the same buffer to reduce memory usage. */
    final int BUFFERS_SIZE = 200;
    private int[] itemsetBuffer = null;

    /** this class represent an item and its utility in a transaction */
    class Pair {
        int item = 0;
        int utility = 0;
    }

    /**
     * Default constructor
     */
    public AlgoHUIMiner() {
    }

    /**

     * Run the algorithm
     * @param input the input file path
     * @param output the output file path
     * @param minUtility the minimum utility threshold
     * @throws IOException exception if error while writing the file
     */
    public void runAlgorithm(String input, String output, int minUtility) throws IOException {
        // reset maximum
        MemoryLogger.getInstance().reset();

        // initialize the buffer for storing the current itemset
        itemsetBuffer = new int[BUFFERS_SIZE];

        startTimestamp = System.currentTimeMillis();

        writer = new BufferedWriter(new FileWriter(output));

        mapItemToTWU = new HashMap<Integer, Integer>();

        // We scan the database a first time to calculate the TWU of each item.
        BufferedReader myInput = null;
        String thisLine;
        try {
            // prepare the object for reading the file
            myInput = new BufferedReader(new InputStreamReader(new FileInputStream(new File(input))));
            // for each line (transaction) until the end of file
            while ((thisLine = myInput.readLine()) != null) {
                // if the line is a comment, is empty or is a
                // kind of metadata
                if (thisLine.isEmpty() == true || thisLine.charAt(0) == '#'
                        || thisLine.charAt(0) == '%' || thisLine.charAt(0) == '@') {
                    continue;
                }

                // split the transaction according to the : separator
                String split[] = thisLine.split(":");
                // the first part is the list of items
                String items[] = split[0].split(" ");
                // the second part is the transaction utility
                int transactionUtility = Integer.parseInt(split[1]);
                // for each item, we add the transaction utility to its TWU
                for (int i = 0; i < items.length; i++) {
                    Integer item = Integer.parseInt(items[i]);
                    // get the current TWU of that item
                    Integer twu = mapItemToTWU.get(item);
                    // add the utility of the current transaction to its TWU
                    twu = (twu == null) ? transactionUtility : twu + transactionUtility;
                    mapItemToTWU.put(item, twu);
                }
            }
        } catch (Exception e) {
            // catches exception if error while reading the input file
            e.printStackTrace();
        } finally {
            if (myInput != null) {
                myInput.close();
            }
        }

        // CREATE A LIST TO STORE THE UTILITY LIST OF ITEMS WITH TWU >= MIN_UTILITY.
        List<UtilityList> listOfUtilityLists = new ArrayList<UtilityList>();
        // CREATE A MAP TO STORE THE UTILITY LIST FOR EACH ITEM.
        // Key : item    Value : utility list associated to that item
        Map<Integer, UtilityList> mapItemToUtilityList = new HashMap<Integer, UtilityList>();

        for (Integer item : mapItemToTWU.keySet()) {
            // if the item is promising (TWU >= minutility)
            if (mapItemToTWU.get(item) >= minUtility) {
                // create an empty Utility List that we will fill later.
                UtilityList uList = new UtilityList(item);
                mapItemToUtilityList.put(item, uList);
                // add the item to the list of high TWU items
                listOfUtilityLists.add(uList);
            }
        }

        // SORT THE LIST OF HIGH TWU ITEMS IN ASCENDING ORDER
        Collections.sort(listOfUtilityLists, new Comparator<UtilityList>() {
            public int compare(UtilityList o1, UtilityList o2) {
                // compare the TWU of the items
                return compareItems(o1.item, o2.item);
            }
        });

        // SECOND DATABASE PASS TO CONSTRUCT THE UTILITY LISTS
        // OF 1-ITEMSETS HAVING TWU >= minutil (promising items)
        try {
            // prepare object for reading the file
            myInput = new BufferedReader(new InputStreamReader(new FileInputStream(new File(input))));
            // variable to count the number of transaction
            int tid = 0;
            while ((thisLine = myInput.readLine()) != null) {
                // if the line is a comment, is empty or is a
                // kind of metadata
                if (thisLine.isEmpty() == true || thisLine.charAt(0) == '#'
                        || thisLine.charAt(0) == '%' || thisLine.charAt(0) == '@') {
                    continue;
                }

                // split the line according to the separator
                String split[] = thisLine.split(":");
                // get the list of items
                String items[] = split[0].split(" ");
                // get the list of utility values corresponding to each item
                // for that transaction
                String utilityValues[] = split[2].split(" ");

                // Copy the transaction into lists but
                // without items with TWU < minutility
                int remainingUtility = 0;

                // Create a list to store items
                List<Pair> revisedTransaction = new ArrayList<Pair>();
                // for each item
                for (int i = 0; i < items.length; i++) {
                    // convert values to integers
                    Pair pair = new Pair();
                    pair.item = Integer.parseInt(items[i]);
                    pair.utility = Integer.parseInt(utilityValues[i]);
                    // if the item has enough utility
                    if (mapItemToTWU.get(pair.item) >= minUtility) {
                        // add it
                        revisedTransaction.add(pair);
                        remainingUtility += pair.utility;
                    }
                }

                Collections.sort(revisedTransaction, new Comparator<Pair>() {
                    public int compare(Pair o1, Pair o2) {
                        return compareItems(o1.item, o2.item);
                    }
                });

                // for each item left in the transaction
                for (Pair pair : revisedTransaction) {
                    // subtract the utility of this item from the remaining utility
                    remainingUtility = remainingUtility - pair.utility;

                    // get the utility list of this item
                    UtilityList utilityListOfItem = mapItemToUtilityList.get(pair.item);

                    // Add a new Element to the utility list of this item corresponding to this transaction
                    Element element = new Element(tid, pair.utility, remainingUtility);

                    utilityListOfItem.addElement(element);
                }
                tid++; // increase tid number for next transaction
            }
        } catch (Exception e) {
            // to catch error while reading the input file
            e.printStackTrace();
        } finally {
            if (myInput != null) {
                myInput.close();
            }
        }

        // check the memory usage
        MemoryLogger.getInstance().checkMemory();

        // Mine the database recursively
        huiMiner(itemsetBuffer, 0, null, listOfUtilityLists, minUtility);

        // check the memory usage again and close the file.
        MemoryLogger.getInstance().checkMemory();
        // close output file
        writer.close();
        // record end time
        endTimestamp = System.currentTimeMillis();
    }

    private int compareItems(int item1, int item2) {
        int compare = mapItemToTWU.get(item1) - mapItemToTWU.get(item2);
        // if the same, use the lexical order otherwise use the TWU
        return (compare == 0) ? item1 - item2 : compare;
    }

    /**

     * This is the recursive method to find all high utility itemsets. It writes
     * the itemsets to the output file.
     * @param prefix This is the current prefix. Initially, it is empty.
     * @param pUL This is the Utility List of the prefix. Initially, it is null.
     * @param ULs The utility lists corresponding to each extension of the prefix.
     * @param minUtility The minUtility threshold.
     * @param prefixLength The current prefix length
     * @throws IOException
     */
    private void huiMiner(int[] prefix, int prefixLength, UtilityList pUL,
            List<UtilityList> ULs, int minUtility) throws IOException {

        // For each extension X of prefix P
        for (int i = 0; i < ULs.size(); i++) {
            UtilityList X = ULs.get(i);

            // If pX is a high utility itemset,
            // we save the itemset: pX
            if (X.sumIutils >= minUtility) {
                // save to file
                writeOut(prefix, prefixLength, X.item, X.sumIutils);
            }

            // If the sum of the utilities and remaining utilities of pX
            // is at least minUtility, we explore extensions of pX.
            // (this is the pruning condition)
            if (X.sumIutils + X.sumRutils >= minUtility) {
                // This list will contain the utility lists of pX extensions.
                List<UtilityList> exULs = new ArrayList<UtilityList>();
                // For each extension of p appearing
                // after X according to the ascending order
                for (int j = i + 1; j < ULs.size(); j++) {
                    UtilityList Y = ULs.get(j);
                    // we construct the extension pXY
                    // and add it to the list of extensions of pX
                    exULs.add(construct(pUL, X, Y));
                    joinCount++;
                }
                // We create new prefix pX
                itemsetBuffer[prefixLength] = X.item;

                // We make a recursive call to discover all itemsets with the prefix pX
                huiMiner(itemsetBuffer, prefixLength + 1, X, exULs, minUtility);
            }
        }
    }

    /**
     * This method constructs the utility list of pXY
     * @param P : the utility list of prefix P.
     * @param px : the utility list of pX
     * @param py : the utility list of pY
     * @return the utility list of pXY
     */
    private UtilityList construct(UtilityList P, UtilityList px, UtilityList py) {
        // create an empty utility list for pXY
        UtilityList pxyUL = new UtilityList(py.item);
        // for each element in the utility list of pX
        for (Element ex : px.elements) {
            // do a binary search to find element ey in py with tid = ex.tid
            Element ey = findElementWithTID(py, ex.tid);
            if (ey == null) {
                continue;
            }
            // if the prefix p is null
            if (P == null) {
                // Create the new element
                Element eXY = new Element(ex.tid, ex.iutils + ey.iutils, ey.rutils);
                // add the new element to the utility list of pXY
                pxyUL.addElement(eXY);
            } else {
                // find the element in the utility list of p with the same tid
                Element e = findElementWithTID(P, ex.tid);
                if (e != null) {
                    // the utility of p is counted in both pX and pY, so subtract it once
                    Element eXY = new Element(ex.tid, ex.iutils + ey.iutils - e.iutils,
                            ey.rutils);
                    // add the new element to the utility list of pXY
                    pxyUL.addElement(eXY);
                }
            }
        }
        // return the utility list of pXY.
        return pxyUL;
    }

    /**

     * Do a binary search to find the element with a given tid in a utility list
     * @param ulist the utility list
     * @param tid the tid
     * @return the element or null if none has the tid.
     */
    private Element findElementWithTID(UtilityList ulist, int tid) {
        List<Element> list = ulist.elements;

        // perform a binary search on the elements, which are sorted by ascending tid
        int first = 0;
        int last = list.size() - 1;

        // the binary search
        while (first <= last) {
            int middle = (first + last) >>> 1; // divide by 2

            if (list.get(middle).tid < tid) {
                first = middle + 1; // the tid searched for is in the upper half
            } else if (list.get(middle).tid > tid) {
                last = middle - 1; // the tid searched for is in the lower half
            } else {
                return list.get(middle);
            }
        }
        return null;
    }

    /**

     * Method to write a high utility itemset to the output file.
     * @param prefix the prefix to be written to the output file
     * @param item an item to be appended to the prefix
     * @param utility the utility of the prefix concatenated with the item
     * @param prefixLength the prefix length
     * @throws IOException if an error occurs while writing the file
     */
    private void writeOut(int[] prefix, int prefixLength, int item, long utility) throws IOException {
        huiCount++; // increase the number of high utility itemsets found

        // Create a string buffer
        StringBuilder buffer = new StringBuilder();
        // append the prefix
        for (int i = 0; i < prefixLength; i++) {
            buffer.append(prefix[i]);
            buffer.append(' ');
        }
        // append the last item
        buffer.append(item);
        // append the utility value
        buffer.append(" #UTIL: ");
        buffer.append(utility);
        // write to file
        writer.write(buffer.toString());
        writer.newLine();
    }

    /**

     * Print statistics about the latest execution to System.out.
     */
    public void printStats() {
        System.out.println("============= HUI-MINER ALGORITHM =============");
        System.out.println(" Total time ~ " + (endTimestamp - startTimestamp) + " ms");
        System.out.println(" Memory ~ " + MemoryLogger.getInstance().getMaxMemory() + " MB");
        System.out.println(" Number of high-utility item groups : " + huiCount);
        System.out.println(" Join count : " + joinCount);
        System.out.println("===================================================");
    }
}
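The listing above relies on two helper classes, `Element` and `UtilityList`, that are not reproduced in this appendix. A minimal sketch that is consistent with how the listing uses them (the constructor `Element(tid, iutils, rutils)`, the fields `item`, `sumIutils`, `sumRutils` and `elements`, and the method `addElement`) might look as follows. The field types and method bodies are assumptions inferred from that usage, not the original source, and `UtilityListDemo` is a hypothetical class added only to show the calls.

```java
import java.util.ArrayList;
import java.util.List;

/** One entry of a utility list (assumed layout, inferred from the listing). */
class Element {
    final int tid;    // id of the transaction
    final int iutils; // utility of the itemset in that transaction
    final int rutils; // remaining utility of the items that follow it

    Element(int tid, int iutils, int rutils) {
        this.tid = tid;
        this.iutils = iutils;
        this.rutils = rutils;
    }
}

/** Utility list of an itemset, identified by its last item (assumed layout). */
class UtilityList {
    int item;          // last item of the itemset
    int sumIutils = 0; // sum of iutils over all elements
    int sumRutils = 0; // sum of rutils over all elements
    List<Element> elements = new ArrayList<Element>();

    UtilityList(int item) {
        this.item = item;
    }

    /** Append an element and keep the two running sums up to date. */
    void addElement(Element element) {
        sumIutils += element.iutils;
        sumRutils += element.rutils;
        elements.add(element);
    }
}

class UtilityListDemo {
    public static void main(String[] args) {
        // Invented values, added in ascending tid order as the binary search requires.
        UtilityList ul = new UtilityList(7);
        ul.addElement(new Element(0, 5, 2));
        ul.addElement(new Element(1, 10, 6));
        System.out.println("sumIutils=" + ul.sumIutils + " sumRutils=" + ul.sumRutils);
    }
}
```

Keeping the two sums up to date inside `addElement` means that `huiMiner` can test both the output condition (`sumIutils >= minUtility`) and the pruning condition (`sumIutils + sumRutils >= minUtility`) without traversing the list again.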

The high-utility itemset problem

Steps of the Two-Phase algorithm