Summary of Chapter 2

This chapter presented several common data clustering approaches: partitioning, hierarchical, density-based, and grid-based clustering.

Partitioning methods start from an initial set of k partitions and then repeatedly relocate data objects between clusters to improve the clustering quality. Typical algorithms include k-means, PAM, CLARA, and CLARANS.
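The relocation idea can be illustrated with a minimal sketch. It assumes one-dimensional data and absolute difference as the distance; the names `PartitionSketch`, `Assign`, and `Cost` are hypothetical and do not come from the thesis.

```csharp
using System;

// Hypothetical helper (not from the thesis): one assignment step of a
// medoid-based partitioning method on 1-D data.
public static class PartitionSketch
{
    // For each value, returns the index (into medoids) of the closest medoid.
    public static int[] Assign(double[] values, double[] medoids)
    {
        var labels = new int[values.Length];
        for (int i = 0; i < values.Length; i++)
        {
            int best = 0;
            for (int c = 1; c < medoids.Length; c++)
            {
                if (Math.Abs(values[i] - medoids[c]) < Math.Abs(values[i] - medoids[best]))
                    best = c;
            }
            labels[i] = best;
        }
        return labels;
    }

    // Clustering cost: sum of distances from each value to its assigned medoid.
    public static double Cost(double[] values, double[] medoids, int[] labels)
    {
        double cost = 0;
        for (int i = 0; i < values.Length; i++)
            cost += Math.Abs(values[i] - medoids[labels[i]]);
        return cost;
    }
}
```

With values {1, 2, 10, 11} and medoids {2, 10}, `Assign` yields the labels {0, 0, 1, 1} and `Cost` returns 2. An iterative method would try alternative representatives and keep a change only when this cost decreases.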

Hierarchical methods cluster data by building a tree of clusters. There are two approaches: bottom-up and top-down clustering. Typical algorithms include BIRCH and CURE.

Density-based methods use a density function over the data objects to determine the cluster each object belongs to. Typical algorithms include DBSCAN, DENCLUE, and OPTICS.

Grid-based methods first quantize the object space into a finite number of cells arranged in a grid structure, and then perform clustering on that grid. Representative algorithms include STING and CLIQUE.
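The quantization step can be sketched as follows. This is only an illustration, assuming two-dimensional points and square cells of a fixed side length; `GridSketch` and `CountPerCell` are hypothetical names, and the sketch does not reproduce STING or CLIQUE.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the thesis): maps 2-D points to grid cells
// and counts how many points fall in each cell.
public static class GridSketch
{
    public static Dictionary<(int, int), int> CountPerCell(
        IEnumerable<(double X, double Y)> points, double cellSize)
    {
        var counts = new Dictionary<(int, int), int>();
        foreach (var p in points)
        {
            // Cell coordinates are the floor of each coordinate divided by the cell size.
            var cell = ((int)Math.Floor(p.X / cellSize), (int)Math.Floor(p.Y / cellSize));
            counts.TryGetValue(cell, out int n); // n is 0 when the cell is not yet present
            counts[cell] = n + 1;
        }
        return counts;
    }
}
```

A grid-based algorithm would then work on the per-cell counts (for example, keeping only dense cells) rather than on the individual points.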

Model-based methods hypothesize a model for each cluster and look for the best fit of the data objects to that model; the models follow statistical or neural-network approaches. Typical algorithms include EM and COBWEB. Another approach in data clustering is fuzzy clustering, whose notable algorithms include FCM.

In the next chapter, the thesis presents the application of the PAM algorithm to clustering the data of customers who use telecommunications services at VNPT Hải Phòng.

CHAPTER 3: APPLYING DATA CLUSTERING TO CLASSIFY TELECOMMUNICATIONS SERVICE CUSTOMERS

3.1 Problem statement

For a mobile telecommunications company, acquiring new subscribers is currently no longer an effective way to increase profit. Business plans instead focus on improving service quality and offering more value-added services. Nevertheless, traditional services such as voice calls and text messaging can still yield higher profit if customer demand is stimulated. To achieve this, companies must not only retain their existing customers but also devise long-term business development strategies and classify their current customers into groups, so that they can adopt a suitable market segmentation policy.

In this chapter, I use the PAM algorithm to classify telecommunications customers based on their total call traffic over a specific period.

Within the scope of this thesis, the application is only experimental and is intended to examine how effective the algorithm is on real data. I therefore propose using the data of 50 customers over one year.

3.2 Database setup

To store the data above, I use SQL Server as the database management system, because the algorithm is implemented in Visual Studio. Implementing the algorithm as a Windows Forms application has several benefits: the implementation is easy to understand and analyze, and the application can later be extended to a larger overall model.

3.3 Algorithm implementation

The PAM algorithm introduced in Chapter 2 is implemented in C# using Visual Studio. With the data table above, the number of clusters is chosen by the user; here I choose 3 clusters, and PAM then assigns each customer to the corresponding cluster.
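As a rough illustration of the input PAM works on, the sketch below builds a pairwise distance matrix from total call minutes and picks the first medoid the way a BUILD phase typically does: the element with the smallest total distance to all others. It assumes absolute difference as the distance; `BuildSketch`, `DistanceMatrix`, and `FirstMedoid` are hypothetical names, and the sample values in the note are invented, not thesis data.

```csharp
using System;

// Hypothetical helper (not from the thesis): distance matrix and first medoid
// for customers described by a single number (total call minutes).
public static class BuildSketch
{
    // Symmetric matrix of absolute differences between customers' totals.
    public static double[,] DistanceMatrix(double[] totalMinutes)
    {
        int n = totalMinutes.Length;
        var d = new double[n, n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                d[i, j] = d[j, i] = Math.Abs(totalMinutes[i] - totalMinutes[j]);
        return d;
    }

    // Index of the element whose summed distance to all elements is smallest.
    public static int FirstMedoid(double[,] d)
    {
        int n = d.GetLength(0);
        int best = 0;
        double bestTotal = double.PositiveInfinity;
        for (int i = 0; i < n; i++)
        {
            double total = 0;
            for (int j = 0; j < n; j++) total += d[i, j];
            if (total < bestTotal) { bestTotal = total; best = i; }
        }
        return best;
    }
}
```

For the invented totals {100, 120, 900, 950, 5000}, the row sums are 6570, 6510, 5730, 5780, and 17930, so `FirstMedoid` returns index 2. The remaining k - 1 medoids would then be added one at a time, each chosen to reduce the total distance the most.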

Screenshots illustrating the implementation are shown below:

Figure 3.4 Main interface of the data entry program

After entering the input data to be clustered, I select the parameters for the algorithm:

+ Number of clusters: the number of clusters to produce
+ Year and month: cluster the data for the selected year and month

Figure 3.5 Interface for selecting the algorithm parameters

Figure 3.6 Clustering interface by call duration

In Figure 3.6, the clustering result is presented in three main forms:

+ Cluster information form: shows the number of clusters
+ Cluster chart form: shows the number of elements in each cluster
+ Cluster detail form: lists the customers of a selected cluster

Figure 3.7 List of customers in cluster 1 by call duration

Figure 3.8 List of customers in cluster 2 by call duration

Figure 3.10 Clustering interface by service charges

Figure 3.12 List of customers in cluster 2 by service charges

Figure 3.13 List of customers in cluster 3 by service charges

3.4 Evaluation of the PAM clustering results

With input data of 50 customers, the algorithm produced 3 clusters as required. The three clusters can be characterized as follows:

Cluster 1: 17 customers, with between 992 and 7663 call minutes. This group uses the service little but accounts for the majority of customers, 56.67% of the current customer base. Based on this, telecommunications companies can adopt suitable policies to encourage this group to use the service more, for example reducing off-peak call rates, reducing on-net rates, or adding promotional programs.

Cluster 2: 6 customers, with between 8418 and 13276 call minutes. This group does not use the service much and is also small, accounting for only 20% of the operator's current customers. The company therefore needs incentive policies to increase the number of customers in this group and stimulate demand.

Cluster 3: 7 customers, with between 22248 and 60831 call minutes, accounting for 23.33% of the customers currently using the service. These are high-potential customers who bring high profit to the company. The company should therefore offer incentives and better customer care to retain them. Because this group generates the most call minutes, promotions such as reduced service rates or top-up card bonuses are not appropriate. These customers instead need better care, for example phone calls, messages, birthday gifts, and consistently reliable, high-quality calls.
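The reported shares can be checked with a short calculation. Note that the three cluster sizes sum to 17 + 6 + 7 = 30, and the percentages above are consistent with that total. `ClusterShareSketch` and `Shares` are hypothetical names used only for this check.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not from the thesis): percentage share of each cluster.
public static class ClusterShareSketch
{
    // Returns each cluster's share of the total, in percent, rounded to two decimals.
    public static double[] Shares(int[] clusterSizes)
    {
        int total = clusterSizes.Sum();
        return clusterSizes
            .Select(size => Math.Round(100.0 * size / total, 2))
            .ToArray();
    }
}
```

Calling `ClusterShareSketch.Shares(new[] { 17, 6, 7 })` gives 56.67, 20, and 23.33, matching the figures quoted for clusters 1, 2, and 3.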

The clustering results, analyzed on the basis of customers' call traffic, give business managers an initial view of the customers using the company's services that is deeper, broader, and more multi-faceted. From this they can set plans and development strategies that focus on the customer groups currently bringing high profit, and offer incentive policies to develop and expand the pool of potential customers.

3.5 Conclusion of Chapter 3

In Chapter 3, the thesis analyzed in detail the problem of clustering and classifying telecommunications service customers at VNPT Hải Phòng:

- Surveying the call history data source.

- Preprocessing the history data to create input data suitable for the PAM algorithm.

- Implementing the PAM clustering algorithm.

- Evaluating the results obtained after clustering.

CONCLUSION

Issues studied in the thesis

The thesis surveys data mining in general and data clustering in particular, and applies the PAM algorithm to classify telecommunications service customers. This is a first step in understanding the issues that arise when solving real-world data mining problems. The results of the thesis are:

+ In theory, the thesis studies several clustering techniques.
+ In practice, the thesis presents the results of an experimental implementation that clusters telecommunications service customers with the PAM algorithm. It is a clustering application based on call traffic and value-added services. Depending on the practical problem, it can be developed into a complete product for wider use.

From the experiments and the theoretical study, the following conclusions can be drawn:

• Each clustering algorithm suits certain objectives and data types.

• Each algorithm has its own level of accuracy, and its ability to handle different data sizes varies. This also depends on how the algorithm organizes data in main memory, external storage, and so on.

• Data mining is more effective when preprocessing, attribute selection, and model selection are handled well.

Directions for future research

The next direction for this work is to study in depth clustering techniques for more complex databases, extending the problem to larger and more comprehensive data with more options, such as clustering by different types of businesses and with different kinds of data.


APPENDIX

Source code of the PAM clustering program

using System;
using System.Collections.Generic;

namespace AlgorithmPAM
{
    public class PAM {
        protected BaseMatrix data;
        protected DistanceMatrix distances;
        protected int nClusters;
        protected Clusters clusters;

        // ordered index of element subset
        // all intermediate data and output should have the same size as idx
        // (only the input BaseMatrix data has the full original set of data elements)
        // (indexing of DistanceMatrix is handled by the class itself)
        int[] idx;

        // clustering cost, to be minimized
        double cost;

        // distance between element and closest medoid
        protected double[] nearestDistances;

        // distance between element and second closest medoid
        double[] nextNearestDistances;

        // nearest medoid of each element
        int[] nearestMedoids;

        // next-nearest medoid of each element
        int[] nextNearestMedoids;

        // set of medoids
        HashSet<int> medoids;

        // set of non-medoids (maintained for finding swap candidates)
        HashSet<int> nonmedoids;

        // set of all indexed elements
        // (a holdover from Java, whose HashSet cannot use native types)
        int[] elements;

        int maxSwaps = 1000;

        public PAM(BaseMatrix data) : this(data, null, null) { }

        public PAM(BaseMatrix data, DistanceMatrix distances, int[] idx) {
            this.data = data;
            if (data == null || data.NRows() == 0) {
                throw new Exception("Data matrix is empty.");
            }
            if (idx == null) {
                // initially, index all data elements in original order
                int m = data.NRows();
                idx = new int[m];
                for (int i = 0; i < m; ++i) {
                    idx[i] = i;
                }
            }
            this.idx = idx;
            if (distances == null) {
                this.distances = new DistanceMatrix(data, idx);
            } else {
                this.distances = distances.subset(idx);
            }
            this.clusters = null;
        }

        public Clusters cluster(int k) {
            int n = size();
            if (n == 0) {
                throw new Exception("No data elements are indexed.");
            }
            if (k > n) {
                throw new Exception("Number of clusters must not exceed the number of data elements.");
            } else if (k == n) {
                // build trivial single clusters
                return new Clusters(k);
            }

            this.nClusters = k;
            initialize();
            buildPhase();
            swapPhase();

            clusters = new Clusters(nearestMedoids, getCost());
            clusters.center = medoids;
            return clusters;
        }

        /**
         * Size. Number of data elements.
         */
        public int size() {
            return idx.Length;
        }

        /**
         * Calculate the clustering cost: sum of distances to cluster medoids.
         * @return cost
         */
        private double getCost() {
            double c = 0;
            for (int i = 0; i < nearestDistances.Length; ++i) {
                c += nearestDistances[i];
            }
            return c;
        }

        private void initialize() {
            int m = size();
            nearestDistances = new double[m];
            nextNearestDistances = new double[m];
            nearestMedoids = new int[m];
            nextNearestMedoids = new int[m];
            elements = new int[m];
            medoids = new HashSet<int>();
            nonmedoids = new HashSet<int>();
            for (int ii = 0; ii < m; ++ii) {
                // initialize distances to infinity
                nearestDistances[ii] = nextNearestDistances[ii] = Double.PositiveInfinity;
                // initialize medoids to non-valid indices, s.t. unexpected bugs trigger indexing error
                nearestMedoids[ii] = nextNearestMedoids[ii] = -1;
                elements[ii] = ii;
                nonmedoids.Add(elements[ii]);
            }
        }

        /**
         * BUILD phase. Select an initial set of k medoids.
         */

        private void buildPhase() {
            int m = size();

            // select first medoid
            // find element with minimum total distance to all other elements
            double[] totalDistances = new double[m];
            for (int ii = 0; ii < m; ++ii) {
                // sum distances to all other elements
                // assume distance to itself is 0
                double d = 0;
                for (int jj = 0; jj < m; ++jj) {
                    d += distances.getValue(ii, jj);
                }
                totalDistances[ii] = d;
            }
            double minDistance = totalDistances[0];
            int minIndex = 0;
            for (int ii = 0; ii < m; ++ii) {
                if (totalDistances[ii] < minDistance) {
                    minDistance = totalDistances[ii];
                    minIndex = ii;
                }
            }
            // add element to medoid set
            addMedoid(minIndex);

            // select remaining k - 1 medoids
            double[] gains = new double[m];
            for (int kk = 1; kk < nClusters; ++kk) {
                // consider each i as medoid candidate
                for (int ii = 0; ii < m; ++ii) {
                    // if ii is already a medoid, it has negative gain to prevent it from being selected again
                    if (medoids.Contains(elements[ii])) {
                        gains[ii] = -1.0;
                    } else {
                        double gain = 0;
                        // for each non-medoid j != i, calculate the gain
                        for (int jj = 0; jj < m; ++jj) {
                            if (jj == ii || medoids.Contains(elements[jj])) continue;
                            if (nearestDistances[jj] > distances.getValue(ii, jj)) {
                                // adding i will improve j's nearest distance
                                // (if selected, i will be the new nearest neighbour of j)
                                gain += nearestDistances[jj] - distances.getValue(ii, jj);
                            }
                        }
                        gains[ii] = gain;
                    }
                }
                // select candidate with maximum gain
                double maxGain = Double.NegativeInfinity;
                int maxIndex = -1;
                for (int ii = 0; ii < m; ++ii) {
                    if (gains[ii] > maxGain) {
                        maxGain = gains[ii];
                        maxIndex = ii;
                    }
                }
                // add element to medoid set
                addMedoid(maxIndex);
            }

            // check that the number of medoids matches the expected number
            if (nClusters != medoids.Count) {
                throw new Exception("Unexpected error in BUILD phase: Number of medoids does not match parameter k.");
            }
        }

        /**
         * SWAP phase. Attempt to improve clustering quality by exchanging medoids with non-medoids.
         */

        private void swapPhase() {
            bool notConverged = true;
            bool continueLoop = true;
            int nSwaps = 0;

            while (notConverged && continueLoop) {
                notConverged = false;
                // continueLoop stays true here; it is cleared only once maxSwaps is reached
                IEnumerator<int> medIt = medoids.GetEnumerator();
                while (medIt.MoveNext() && continueLoop) {
                    int ii = medIt.Current;
                    IEnumerator<int> nonmedIt = nonmedoids.GetEnumerator();
                    while (nonmedIt.MoveNext()) {
                        int hh = nonmedIt.Current;

                        // consider swapping medoid i and non-medoid h
                        // by calculating gains by all other elements

                        // Calculate cumulative change to distance to nearest medoid for all nonmedoids j != h
                        double change = 0;
                        IEnumerator<int> nonmedIt2 = nonmedoids.GetEnumerator();
                        while (nonmedIt2.MoveNext()) {
                            int jj = nonmedIt2.Current;
                            if (jj == hh) continue;
                            double d = nearestDistances[jj];
                            if (distances.getValue(ii, jj) > d) {
                                // if removed, i will have no impact
                                if (distances.getValue(jj, hh) < d) {
                                    // if selected, h will improve nearest distance for j
                                    change += distances.getValue(jj, hh) - d;
                                }
                            } else {
                                // i cannot be closer than the nearest neighbour for j;
                                // therefore, distances[i][j] == d
                                // and i is currently the nearest neighbour for j
                                double e = nextNearestDistances[jj];
                                if (distances.getValue(jj, hh) < e) {
                                    // if i and h are swapped, h will become the nearest neighbour
                                    // nearest distance for j may improve or worsen
                                    change += distances.getValue(jj, hh) - d;
                                } else {
                                    // if i is removed, the current next-nearest of j will be promoted to nearest
                                    change += e - d;
                                }
                            }
                        }

                        if (change < 0) {
                            // distance to nearest medoid summed over all nonmedoids is improved: swap
                            swap(hh, ii);
                            // Console.WriteLine("Swap " + hh + " and " + ii + " for change = " + change);

                            // non-convergence if any swap occurs, up to a maximum number of swaps (to guard against swap cycles)
                            if (nSwaps++ < maxSwaps) {
                                notConverged = true;
                            } else {
                                continueLoop = false;
                            }
                            // reset iterator, since the medoid set was modified by the swap
                            medIt = medoids.GetEnumerator();
                            // break out of inner loop to consider next medoid
                            break;
                        }
                    }
                }
            }
        }

        private void addMedoid(int add) {
            medoids.Add(elements[add]);
            nonmedoids.Remove(elements[add]);
            updateNearest(add, -1);
        }

        private void swap(int add, int remove) {
            medoids.Add(elements[add]);
            nonmedoids.Remove(elements[add]);
            medoids.Remove(elements[remove]);
            nonmedoids.Add(elements[remove]);
            updateNearest(add, remove);
        }

        /**
         * Update nearest and next-nearest distances.
         * Does not check whether {@code added} or {@code removed} have been added to or removed from the medoid set.
         * FIXME optimize
         * @param added Index of element added to medoid set (-1 for none)
         * @param removed Index of element removed from medoid set (-1 for none)
         */
        private void updateNearest(int added, int removed) {
            int m = size();
            if (added >= 0) {
                // added index is valid
                // check if any nearest distance improves
                for (int ii = 0; ii < m; ++ii) {
                    double d = distances.getValue(ii, added);
                    if (d < nearestDistances[ii]) {
                        // element i is nearer to added medoid than previous nearest: update
                        int oldMedoid = nearestMedoids[ii];
                        double oldDistance = nearestDistances[ii];
                        nearestMedoids[ii] = added;
                        nearestDistances[ii] = d;
                        // bump previous nearest down to next-nearest
                        nextNearestMedoids[ii] = oldMedoid;
                        nextNearestDistances[ii] = oldDistance;
                    } else if (d < nextNearestDistances[ii]) {
                        // element i is nearer to added medoid than previous next-nearest: update
                        nextNearestMedoids[ii] = added;
                        nextNearestDistances[ii] = d;
                    }
                }
            }
            if (removed >= 0) {
                // removed index is valid
                // check if the removed medoid is the nearest or next-nearest of any element
                for (int ii = 0; ii < m; ++ii) {
                    if (nearestMedoids[ii] == removed) {
                        // promote next-nearest to nearest
                        nearestMedoids[ii] = nextNearestMedoids[ii];
                        nearestDistances[ii] = nextNearestDistances[ii];
                        // find new next-nearest
                        updateNextNearest(ii);
                    } else if (nextNearestMedoids[ii] == removed) {
                        // find new next-nearest
                        updateNextNearest(ii);
                    }
                }
            }
        }

        /**
         * Update next nearest for element i.
         * Assume nearest medoid is already set.
         * @param ii element index to be updated
         */

        private void updateNextNearest(int ii) {
            int nearestMedoid = nearestMedoids[ii];
            // find the next-nearest
            IEnumerator<int> it = medoids.GetEnumerator();
            double minDistance = Double.PositiveInfinity;
            int nextNearestMedoid = -1;
            while (it.MoveNext()) {
                int jj = it.Current;

// ignore if j is the nearestMedoid, since we are interested in the next-