Building the program

- The data is loaded into SQL Server 2008 or later.

- The program runs in Visual Studio 2012 or later.

3.2.1. Program functions

For this thesis, a program was written in C# that uses a genetic algorithm to cluster student data from Yên Bái Medical College (Trường Cao đẳng Y tế Yên Bái). It provides the following functions:

- Read the data to be clustered

- Build the data structures

- Choose the number of clusters

- Choose the subjects used to assess the conduct (rèn luyện) score

- Display the results

- Analyze the results to draw observations and assessments

3.2.2. Program interface

Starting from a survey and statistical summary of the students' conduct-score data, the thesis built a reasonably complete program that handles the survey, assessment, and statistics tasks and meets the requirements set out at the start. The screens of the installed, running application are shown below:

Start-up screen

Figure 3.3. Start-up screen

Data clustering: performs the clustering of the data

Figure 3.4. Data clustering screen

Its main functions are:

• Choose the number of clusters

• Cluster by the scores of the conduct assessments

• Cluster on two assessments, or two groups of assessments, at once
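Clustering on two assessments at once comes down to choosing which columns of the score table are passed to the clustering routine. The following is a minimal sketch, not the thesis's own implementation: the class name `FeatureSelection`, the method `SelectColumns`, and the table layout (one row per student, one column per assessment) are assumptions made for illustration.

```csharp
using System;

public static class FeatureSelection
{
    // Builds the clustering input by keeping only the chosen columns.
    // Assumed layout: scores[i][j] is the score of student i on assessment j.
    public static double[][] SelectColumns(double[][] scores, params int[] columns)
    {
        double[][] result = new double[scores.Length][];
        for (int i = 0; i < scores.Length; ++i)
        {
            result[i] = new double[columns.Length];
            for (int c = 0; c < columns.Length; ++c)
                result[i][c] = scores[i][columns[c]]; // copy one selected assessment
        }
        return result;
    }
}
```

Under these assumptions, `FeatureSelection.SelectColumns(scores, 0, 2)` keeps the first and third assessments, and the resulting array could be passed as `rawData` to the `Cluster(rawData, numClusters)` method listed in the appendix.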

3.2.3. Experimental results

The input data covers 24 students of Yên Bái Medical College. Clustering them on the study-attitude assessment with 3 clusters gives the following result for "Assessment of study attitude":

Cluster 1: 6 students, with a cluster center (study score) of 7.95, accounting for 25% of the group.

Cluster 2: 7 students, with a cluster center of 7.97, accounting for 29.2% of the group.

Cluster 3: 11 students, with a cluster center of 8.14, the highest of the three, accounting for 45.8% of the group.

From this analysis, students can be assessed by their study-attitude score within the conduct assessment, and weaker students can be identified for extra support and further study so that they achieve better results.
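The share of each cluster follows directly from the cluster sizes. The sketch below shows that arithmetic, assuming the clustering is available as one cluster index per student, which is what the `Cluster` method in the appendix returns. The class name `ClusterShares` and its methods are hypothetical helpers, not part of the thesis program.

```csharp
using System;
using System.Linq;

public static class ClusterShares
{
    // Counts the members of each cluster, given one cluster index per student.
    public static int[] Counts(int[] clustering, int numClusters)
    {
        int[] counts = new int[numClusters];
        foreach (int c in clustering)
            ++counts[c];
        return counts;
    }

    // Converts cluster sizes to percentages of the whole group, rounded to one decimal.
    public static double[] Percentages(int[] counts)
    {
        int total = counts.Sum();
        return counts.Select(n => Math.Round(100.0 * n / total, 1)).ToArray();
    }
}
```

Calling `ClusterShares.Percentages(new[] { 6, 7, 11 })` applies this to the three cluster sizes reported above, out of 24 students.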

CONCLUSION

1. Main results of the thesis

- Presented the basic concepts and theoretical foundations of data mining and data clustering.

- Introduced the general scheme for clustering algorithms that use a genetic algorithm.

- Implemented and tested the K-means clustering algorithm using a genetic algorithm.

2. Future directions

Building on the results obtained, the following topics can be studied further:

- Build further test programs for clustering algorithms, including clustering algorithms that use genetic algorithms.

- Find more practical applications for these algorithms.

Although I have made every effort, my time and my knowledge of data mining are limited, so the thesis certainly has shortcomings. I will work to address these limitations and continue studying the topics listed above. I would be grateful for comments from the faculty and from readers so that the thesis can be improved.


APPENDIX

Source code

using System;

namespace GAs_KMean
{
    class GAs
    {
        // Clustering algorithm: takes a 2-D array of data and the number of clusters,
        // and returns an array holding the cluster index assigned to each row.
        public static int[] Cluster(double[][] rawData, int numClusters)
        {
            double[][] data = Normalized(rawData); // so large values don't dominate
            bool changed = true; // was there a change in at least one cluster assignment?
            bool success = true; // were all means able to be computed? (no zero-count clusters)
            int[] clustering = InitClustering(data.Length, numClusters, 0); // semi-random initialization
            double[][] means = Allocate(numClusters, data[0].Length); // small convenience
            int maxCount = 100; // sanity check
            int ct = 0;
            while (changed == true && success == true && ct < maxCount)
            {
                ++ct; // k-means typically converges very quickly
                success = UpdateMeans(data, clustering, means);
                changed = UpdateClustering(data, clustering, means);
            }
            return clustering;
        }

        public static double[][] Normalized(double[][] rawData)
        {
            // Normalize raw data with min-max scaling, column by column:
            //   v' = (v - min) / (max - min)
            // The primary alternative is z-score scaling: (x - mean) / stddev.

            // make a copy of input data
            double[][] result = new double[rawData.Length][];
            for (int i = 0; i < rawData.Length; ++i)
            {
                result[i] = new double[rawData[i].Length];
                Array.Copy(rawData[i], result[i], rawData[i].Length);
            }

            for (int j = 0; j < result[0].Length; ++j) // each col
            {
                // start from the extremes so any input range is handled
                double min = double.MaxValue, max = double.MinValue;
                for (int i = 0; i < result.Length; ++i)
                {
                    if (result[i][j] > max) max = result[i][j];
                    if (result[i][j] < min) min = result[i][j];
                }

                if (min < max) // a constant column is left unchanged (avoids division by zero)
                {
                    for (int i = 0; i < result.Length; ++i)
                        result[i][j] = (result[i][j] - min) / (max - min);
                }
            }
            return result;
        }

        private static int[] InitClustering(int numTuples, int numClusters, int randomSeed)
        {
            Random random = new Random(randomSeed);
            int[] clustering = new int[numTuples];
            for (int i = 0; i < numClusters; ++i) // make sure each cluster has at least one tuple
                clustering[i] = i;
            for (int i = numClusters; i < clustering.Length; ++i)
                clustering[i] = random.Next(0, numClusters); // other assignments random
            return clustering;
        }

        private static double[][] Allocate(int numClusters, int numColumns)
        {
            // convenience matrix allocator for Cluster()
            double[][] result = new double[numClusters][];
            for (int k = 0; k < numClusters; ++k)
                result[k] = new double[numColumns];
            return result;
        }

        private static bool UpdateMeans(double[][] data, int[] clustering, double[][] means)
        {
            int numClusters = means.Length;
            int[] clusterCounts = new int[numClusters];
            for (int i = 0; i < data.Length; ++i)
            {
                int cluster = clustering[i];
                ++clusterCounts[cluster];
            }

            for (int k = 0; k < numClusters; ++k)
                if (clusterCounts[k] == 0)
                    return false; // bad clustering. no change to means[][]

            // update, zero-out means so it can be used as scratch matrix
            for (int k = 0; k < means.Length; ++k)
                for (int j = 0; j < means[k].Length; ++j)
                    means[k][j] = 0.0;

            for (int i = 0; i < data.Length; ++i)
            {
                int cluster = clustering[i];
                for (int j = 0; j < data[i].Length; ++j)
                    means[cluster][j] += data[i][j]; // accumulate sum
            }

            for (int k = 0; k < means.Length; ++k)
                for (int j = 0; j < means[k].Length; ++j)
                    means[k][j] /= clusterCounts[k]; // no div by 0: empty clusters were rejected above
            return true;
        }

        private static bool UpdateClustering(double[][] data, int[] clustering, double[][] means)
        {
            int numClusters = means.Length;
            bool changed = false;

            int[] newClustering = new int[clustering.Length]; // proposed result
            Array.Copy(clustering, newClustering, clustering.Length);

            double[] distances = new double[numClusters]; // distances from curr tuple to each mean

            for (int i = 0; i < data.Length; ++i) // walk thru each tuple
            {
                for (int k = 0; k < numClusters; ++k)
                    distances[k] = Distance(data[i], means[k]); // compute distances from curr tuple to all k means

                int newClusterID = MinIndex(distances); // find closest mean ID
                if (newClusterID != newClustering[i])
                {
                    changed = true;
                    newClustering[i] = newClusterID; // update
                }
            }

            if (changed == false)
                return false; // no change so bail and don't update clustering[]

            // check proposed clustering[] cluster counts
            int[] clusterCounts = new int[numClusters];
            for (int i = 0; i < data.Length; ++i)
            {
                int cluster = newClustering[i];
                ++clusterCounts[cluster];
            }

            for (int k = 0; k < numClusters; ++k)
                if (clusterCounts[k] == 0)
                    return false; // bad clustering. no change to clustering[]

            Array.Copy(newClustering, clustering, newClustering.Length); // update
            return true; // good clustering and at least one change
        }

        private static double Distance(double[] tuple, double[] mean)
        {
            // Euclidean distance between two vectors for UpdateClustering()
            // consider alternatives such as Manhattan distance
            double sumSquaredDiffs = 0.0;
            for (int j = 0; j < tuple.Length; ++j)
                sumSquaredDiffs += Math.Pow((tuple[j] - mean[j]), 2);
            return Math.Sqrt(sumSquaredDiffs);
        }

        private static int MinIndex(double[] distances)
        {
            // index of smallest value in array
            // helper for UpdateClustering()
            int indexOfMin = 0;
            double smallDist = distances[0];
            for (int k = 0; k < distances.Length; ++k)
            {
                if (distances[k] < smallDist)
                {
                    smallDist = distances[k];
                    indexOfMin = k;
                }
            }
            return indexOfMin;
        }

        // ============================================================================
        // misc display helpers for demo

        static void ShowData(double[][] data, int decimals, bool indices, bool newLine)
        {
            for (int i = 0; i < data.Length; ++i)
            {
                if (indices) Console.Write(i.ToString().PadLeft(3) + " ");
                for (int j = 0; j < data[i].Length; ++j)
                {
                    if (data[i][j] >= 0.0) Console.Write(" ");
                    Console.Write(data[i][j].ToString("F" + decimals) + " ");
                }
                Console.WriteLine("");
            }
            if (newLine) Console.WriteLine("");
        } // ShowData

        static void ShowVector(int[] vector, bool newLine)
        {
            for (int i = 0; i < vector.Length; ++i)
                Console.Write(vector[i] + " ");
            if (newLine) Console.WriteLine("\n");
        }

        static void ShowClustered(double[][] data, int[] clustering, int numClusters, int decimals)
        {
            for (int k = 0; k < numClusters; ++k) // print the members of each cluster in turn
            {
                Console.WriteLine("===================");
                for (int i = 0; i < data.Length; ++i)
                {
                    int clusterID = clustering[i];
                    if (clusterID != k) continue;
                    Console.Write(i.ToString().PadLeft(3) + " ");
                    for (int j = 0; j < data[i].Length; ++j)
                    {
                        if (data[i][j] >= 0.0) Console.Write(" ");
                        Console.Write(data[i][j].ToString("F" + decimals) + " ");
                    }
                    Console.WriteLine("");
                }
                Console.WriteLine("===================");
            } // k
        }
    }
}

Summary of the genetic algorithm

Roulette wheel selection
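In roulette wheel selection, each individual is chosen with a probability proportional to its fitness, as if it owned a slice of a wheel whose size matches its fitness. The sketch below illustrates the general technique only; the thesis's own implementation is not shown in this section, and the class name `RouletteWheel` and its methods are hypothetical. It assumes non-negative fitness values where larger is better, so for clustering the fitness would first have to be derived from the clustering error (for example, by inverting it).

```csharp
using System;

public static class RouletteWheel
{
    // Returns the index whose slice of the wheel contains the point `spin`,
    // where spin lies in [0, sum of fitness). Fitness values must be non-negative.
    public static int SelectAt(double[] fitness, double spin)
    {
        double cumulative = 0.0;
        for (int i = 0; i < fitness.Length; ++i)
        {
            cumulative += fitness[i];
            if (spin < cumulative)
                return i;
        }
        return fitness.Length - 1; // guard against floating-point round-off
    }

    // Spins the wheel once: individual i is chosen with probability fitness[i] / total.
    public static int Select(double[] fitness, Random random)
    {
        double total = 0.0;
        foreach (double f in fitness)
            total += f;
        if (total <= 0.0)
            throw new ArgumentException("Total fitness must be positive.", nameof(fitness));
        return SelectAt(fitness, random.NextDouble() * total);
    }
}
```

Separating `SelectAt` from `Select` keeps the slice lookup deterministic, so it can be checked without depending on the random number generator. With fitness values 1, 3, and 6, the three individuals own the intervals [0, 1), [1, 4), and [4, 10) of the wheel.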