INTRODUCTION
Motivation
Warehousing plays a vital role in logistics and supply chain management, especially in Viet Nam: 53.7% of logistics enterprises in Viet Nam provide warehousing services [1]. Additionally, the e-commerce industry in Viet Nam has recently experienced significant growth [2], leading to a dramatically increasing demand for a wide variety of items ordered in small volumes. In such a trend, since warehouses act as buffering spaces in supply chains and keep the uncertainty of consumer demand from adversely affecting the stability of production systems [3-6], improving warehouse operations remains a valuable and practical area of study.
There are six fundamental warehouse processes: receiving, put-away, storage, picking, packing, and shipping [7]. Among them, storage is vital to warehouse operation. While the other steps ensure smooth material flows inside and outside a warehouse, storage is where the buffering role of a warehouse in a supply chain is most apparent: items are placed there after the put-away process and remain until specific quantities are picked in the order-picking process to cover demand [8]. As a result, a mechanism set in the storage phase, such as the Storage Location Assignment (SLA) problem, affects the efficiency of both inbound and outbound flows of a warehouse with respect to several key performance indicators such as picking time and cost, productivity, and delivery and inventory accuracy [5, 8, 9].
The order-picking process accounts for a large proportion, about 55% [7], of the operational cost of a warehouse. Therefore, this study focuses on SLA problems in relation to order-picking efficiency and inherits, from the study of Ene and Öztürk [5], the idea of employing order-picking processing time (OPT) as the measure of that efficiency. According to Tompkins et al. [10], OPT is constructed mainly (about 95%) from four components: traveling time, searching time, picking time, and set-up time. Traveling time accounts for the most significant proportion (approximately 50%), so it is chosen as the indicator of order-picking efficiency in this study.
To enhance the efficiency of the order-picking process in terms of the OPT, synchronized order picking (SOP), in which multiple items of the same picking order (PO) are collected simultaneously by pickers from their assigned zones, is utilized [11]. Because of its operational advantages, SOP warehouses are selected as the scope of this study.
An SLA solution that targets improving the order-picking process can be broken down into two sub-processes: family grouping and storage allocation [12, 13]. The family grouping sub-process includes an analysis of the correlation (or similarity) between items and a clustering step that forms clusters of items that are highly similar in terms of selected characteristics, while the storage allocation sub-process drafts a priority list of items and assigns positions based on this list. In reality, the storage allocation sub-process is heavily affected by various factors, because real-time assignment happens right after receiving cargo from inbound docks and this sub-process must adapt quickly to changes in the inbound flow; it is therefore more appropriate to focus on the family grouping sub-process. Because of its structure, the family grouping sub-process can be operated by a suitable clustering algorithm. When it comes to clustering algorithms applied to SLA in SOP warehouses with order-picking efficiency as the objective, the topic has drawn little attention in research: the author has found only two studies related to it [11, 14].
Problem statement
In warehouse operations, traveling time is the largest contributor to order-picking processing time. In synchronized order-picking warehouses, traveling time depends on the degree to which items that are often demanded in the same order are located in different zones. In this study, a clustering algorithm is developed to provide separate lists of items that should be located in different zones, aiming to decrease the traveling time and thereby the order-picking processing time in synchronized order-picking warehouses.
Goal and scope
Following the motivation above, this research aims to develop a clustering algorithm that tackles the SLA problem, within the scope of SOP warehouses, to improve order-picking traveling time.
Thesis layout
The thesis comprises five chapters:
• Chapter 1 gives some introduction and determines the goal and scope of the study;
• Chapter 2 reports the literature review and shows the methodology of the study;
• Chapter 3 presents the problem statement and the modified design of the clustering algorithm;
• Chapter 4 shows the experimental validation and analysis results to prove the ability and extent of the proposed algorithm to the problem;
• Chapter 5 gives conclusion statements and some suggestions for future studies.
LITERATURE REVIEW AND METHODOLOGY
Literature review and contributions
The SLA problem focuses on allocating product items to storage locations so as to optimize material handling cost and storage space utilization. Two common approaches aim to improve warehouse space utilization and the cycle time of the order fulfillment process. Typical constraints include the capacity of available storage and resources and dispatching policies [15]. SLA plays an important role in warehouse operations because it provides a basis for various improvements in a warehouse's inventory management system: reducing the complexity of inventory management by standardizing how a new item unit is located in the storage system, giving optimal solutions in terms of the utilization rate of resources such as space or handling devices, and considering demand patterns, separately or simultaneously [9, 16-18]. More details can be seen in Table 2.1. The "Targeted function" column in Table 2.1 is based on the two-component structure, family grouping and storage location, proposed by Bindi et al. [12] and mentioned in Chapter 1.
Table 2.1 Literature review on Storage Location Assignment

Study | Objective and approach | Targeted function
Fontana and Nepomuceno [16] | Multi-criteria order-picking process efficiency; classification of products and their storage location assignment | Storage location (determining location in shelf)
Guerriero et al., 2013 [9] | The total inbound cost | Family grouping (based on characteristics); Storage location (based on physical compatibility)
Muppani and Adil [17] | Total cost of order picking and storage space; establishing classes of products and allocating them to storage locations | Storage location ("class" plays the role of an intermediate factor between the two functions)
Zhang et al. [18] | Demand correlation pattern (the DCP of item i is the set of i and the items that are frequently ordered with it) | Family grouping and Storage location

Notes: ILP - integer linear programming model; MCDM - multi-criteria decision-making; LP - linear programming; SA - simulated annealing; H&SA - heuristic and simulated annealing.
As can be seen in Table 2.1, when it comes to SLA problems:
• Targets are related to improving order-picking efficiency (cost or time);
• The most common approach is characteristics-based classification or grouping;
• An algorithm that targets optimal or near-optimal solutions is applied.
In this study, the author proposes an approach to solve a subset of SLA problems in warehouses where a synchronized order-picking system is applied. The common objective of improving order-picking efficiency via grouping is still pursued, here in the form of a clustering problem. Additionally, a k-means algorithm is applied, widening the range of algorithm classes that can be considered for the problem under discussion.
Synchronized order picking is a type of order-picking process in which pickers simultaneously fulfill the demands of the same PO from different zones (see Figure 2.1).
Figure 2.1 Illustration of Synchronized Order Picking (source: [14])
To the best of the author's knowledge, only a limited number of studies focus on SLA problems in SOP warehouses. For example, Jane and Laih (2005) [14] formulated a model to minimize the total similarity between items in the same zone, measuring similarity by the co-appearance of items in the same order. The main idea of that study is that the higher the similarity between two items, the lower the likelihood that these items should be placed in the same zone. Kuo et al. [11] proposed two metaheuristics, one based on the particle swarm optimization algorithm and the other on the genetic algorithm, with a similarity measurement similar to that of Jane and Laih [14]. The objective function properly reflects the purpose of reducing the waiting time between pickers in SOP warehouses. However, the similarity measurement in these two studies still has a limitation: when two items appear together in the same PO but their picking quantities differ greatly, their co-appearance count does not meaningfully improve the probability that the processing time of the PO can be reduced.
In this study, therefore, the level of difference between items' quantities in the same PO is used instead, aiming to group items that would make two pickers spend similar picking times into the same cluster, which then serves as a list of items that should be placed in different zones. In more detail, given a set of O POs, a set of I items, and N expected zones (and hence N pickers), let the quantity of item i in PO o, the picking time of picker m for PO o, the picking time for item i in PO o, and the binary signal of whether item i is assigned to picker m be denoted by $Q_{io}$, $T_{mo}$, $T_{io}$, and $A_{mi}$ respectively ($i \in I$, $o \in O$, $m \in N$). The idea can be mathematically described as follows:
• Because there is a covariate relationship between the quantity and the predicted picking time corresponding to an item in a PO, $\Delta T_{io} \times \Delta Q_{io} \ge 0$ (a);
• The picking time of a picker for a PO is the cumulative picking time of the items this picker is assigned, or $T_{mo} = \sum_{i} A_{mi} \times T_{io}$, so $\Delta T_{mo} \times \Delta T_{io} \ge 0$ (b);
• Eventually, from (a) and (b) above, $\Delta T_{mo} \times \Delta Q_{io} \ge 0$; in other words, the less the quantities of two items in a PO differ, the less the picking times of the assigned pickers for this PO differ, assuming that each item can be located in only one zone.
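To make the relationship concrete, here is a minimal Python sketch (the data values, the unit picking rate, and the assignment matrix are assumptions for illustration only, not data from the thesis):

```python
import numpy as np

# Hypothetical data for one PO: 4 items, 2 pickers. time_per_unit converts
# quantity to picking time, reflecting the covariate relationship T_io ~ Q_io.
Q_o = np.array([5, 6, 20, 2])          # Q_io: quantity of each item in PO o
time_per_unit = 1.0
T_o = time_per_unit * Q_o              # T_io: per-item picking time for PO o

# A_mi: binary assignment of items to pickers (each item in exactly one zone)
A = np.array([[1, 1, 0, 0],            # picker 1 handles items 1 and 2
              [0, 0, 1, 1]])           # picker 2 handles items 3 and 4

T_mo = A @ T_o                         # T_mo = sum_i A_mi * T_io
print(T_mo)                            # [11. 22.]: the large quantity gap
                                       # between items 3 and 4 leaves picker 1
                                       # idle while picker 2 finishes the PO
```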
Clustering algorithms operate on the principle that entities whose similarity on some characteristics is relatively high should be grouped into a cluster, so that specific operations can treat them together to achieve some benefit [14, 19]. A typical clustering algorithm often includes the following steps ("pattern" is also referred to as an "object" or "data point" in recent literature):
1. Pattern representation (optionally including feature extraction and/or selection);
2. Definition of a pattern proximity (or distance, similarity) measure appropriate to the data domain;
3. Clustering (grouping the patterns);
4. Data abstraction (optional);
5. Assessment of the output (optional).
Regarding the categorization of clustering algorithms, the most commonly used framework divides them into two groups: hierarchical and partitional [20]. The main difference between the two groups is whether the resulting clusters are nested (hierarchical) or separated (partitional).
The central concept of hierarchical clustering is that a group of data patterns should be divided into a structure that is as meaningful as possible in terms of the number of clusters and the interrelationships between clusters, without forcing the clusters to be disjoint; the clusters appear in the form of a multi-level dendrogram [21, 22]. Although hierarchical clustering algorithms have advantages in illustrating the nature and meaningfulness of clustering results, they have drawbacks regarding the control of the clustering process via parameters and their high computational complexity [23]. Therefore, they may not be applicable in situations that demand relatively strict control of how the clustering process happens, such as SLA problems.
On the other hand, partitional clustering algorithms develop distinct clusters, so they need some initial input, such as the number of clusters or a threshold for point density in clusters [21, 23]. Partitional clustering is beneficial when applied to large data sets; however, it is sensitive to outliers, and defining an appropriate set of parameters is complicated [24]. In this research, the clustering-driving parameters are known in advance, as is typical of SLA problems in SOP warehouses, so partitional clustering algorithms are used and their low computational complexity can be exploited.
2.1.5 k-means clustering algorithms
k-means clustering has been the most common partitional clustering algorithm for 50 years since it was introduced [25]. According to Jain et al. [21], a basic k-means clustering algorithm includes four steps:
1. Choose k initial cluster centers from the data set.
2. Assign each pattern (a data point or a group of data points) to the closest cluster center.
3. Recompute the cluster centers based on the current assignments.
4. Repeat from step 2 if the convergence criterion is not met.
MacKay [26] defined a standard version of k-means clustering as an algorithm that updates an initial set of k means $m_1^{(1)}, \ldots, m_k^{(1)}$ by repeating two steps until a convergence criterion is reached:
• Assignment step: assign each observation $x_p$ to the cluster $S_i^{(t)}$ whose mean gives the least squared Euclidean distance, as in equation (2.1):
$$S_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\|^2 \le \left\| x_p - m_j^{(t)} \right\|^2 \ \forall j,\ 1 \le j \le k \right\} \quad (2.1)$$
• Update step: recalculate the means (centroids) $m_i^{(t+1)}$ of the observations assigned to each cluster, as in equation (2.2):
$$m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j \quad (2.2)$$
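As a concrete illustration of these two steps, here is a minimal NumPy sketch of the standard assignment/update loop (an illustrative sketch only; the variable names and the simple stopping rule are assumptions, and this is not the modified algorithm proposed later in this thesis):

```python
import numpy as np

def basic_kmeans(X, k, iters=100, seed=0):
    """Standard k-means following equations (2.1) and (2.2)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(iters):
        # Assignment step (2.1): nearest center by squared Euclidean distance
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step (2.2): recompute each non-empty cluster's mean
        new_centers = np.array([X[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):  # simple convergence check
            break
        centers = new_centers
    return labels, centers
```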
Within the scope of the literature review results, this study contributes to addressing some common issues of k-means, including:
• Proposing a proper measurement of clustering performance: the clustering evaluation function in this study is based on the distance measure, as in previous related works;
• Designing an algorithm for the initial cluster center selection phase: the author tackles the issues of the Elbow method and random initialization, which are commonly applied in this phase, in terms of execution time and clustering performance; the concept of a nearest-neighborhood radius is utilized to calculate a candidate measure for every object in the data set, ensuring the requirements on the size and density of a cluster;
• Determining an appropriate convergence condition: the author inherits the idea that the clustering evaluation function must be integrated into the convergence checking function so that the stopping criterion is aligned with solution quality; additionally, the idea of requiring the indicator to exceed a threshold over a consecutive interval is applied to decrease the probability of being stuck in a local optimum.
The core orientation of these three contributions is that the author considers whether solutions from previous studies can be applied to the case of SLA in SOP warehouses and modifies them where necessary.
Methodology
The methodology is summarized in Figure 2.2.
First of all, the problem statement is given to clarify the targeted problem and the expected outcomes of the study. Based on this problem statement, the concept of an object to be clustered is defined, together with the selection of features and the distance measure. After that, the traditional general processes of a typical clustering algorithm (sub-section 2.1.3) and a k-means algorithm (sub-section 2.1.5) are utilized with some modifications drawn from related studies. Experimental analysis is then conducted to check the practicability and the possible scope of applying the modified algorithm to the studied problem. Finally, the author summarizes the critical points of the study and gives some suggestions for future studies.
PROBLEM AND ALGORITHM
Definitions
• Object or data point: the smallest unit of data providing material for the clustering algorithm; in this study, an object (or data point) is the expression of an item's problem-related information;
• Data set: the whole set of objects or data points;
• Item: a group of similar products that are managed together in a warehouse (equivalent to a Stock Keeping Unit, or SKU, code shown in a Warehouse Management System, or WMS);
• Storage location: the smallest unit of warehouse space used to keep item units;
• Zone: a group of storage locations that will be assigned to a picker.
Solution orientation and concept of flow
As presented in Chapter 1, traveling time is selected to indicate order-picking efficiency. To optimize the traveling time in SOP warehouses from the viewpoint of SLA, as argued in Chapter 2, a solution must be developed such that if two items are more frequently scheduled in the same PO, they are less likely to be located in the same zone. This way, a PO's makespan can be minimized [14], leading to a more efficient order-picking process.
As mentioned in Chapter 2, the flow of the algorithm in this study is based on the traditional flows of a typical clustering algorithm and a k-means algorithm, integrated with some modifications drawn from the literature review and operational-context-based information. Some essential points of the concept of flow are:
• An "object" (or "data point" in this study) is defined by an item;
• The features (or "dimensions" in this thesis) to be selected and the measurement of distance (or similarity) between objects must reflect the solution orientation of "more frequently scheduled" mentioned previously, which is clarified in sub-section 3.3;
• Feature extraction (or dimension reduction) is considered, based on an analysis of the operational context of SLA in SOP warehouses;
• The methods for selecting the number of clusters, determining and re-determining the cluster centers, clustering, assessing clustering results, and the convergence condition must be incorporated with each other so that they support the idea that "if two items are more frequently scheduled in the same PO, they must be less likely located in the same zone" mentioned above. These contents are presented in sub-sections 3.5, 3.6, and 3.7 respectively.
Feature selection and distance measure
Assume that all pickers travel at the same speed. The traveling time then depends on the total distance the pickers must cover to complete their assigned routes. For convenience of measurement, the unit of distance is the "location length", equal to the size of a storage location along the aisle dimension of a rack. The traveling time corresponding to a PO is therefore produced by multiplying the "location length" by the item quantity in the PO.
Since the "location length" is a layout-based constant, it is proper to choose the quantity as the only feature type. Note that the number of features equals the number of POs appearing in the data set. As a result, the absolute difference between the quantities of two items in the same PO reasonably reflects the degree to which they are frequently scheduled together; a sketch of building this feature matrix is given below.
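For illustration, here is a minimal pandas sketch of building the item-by-PO quantity matrix from raw order lines (the column names and the input layout are assumptions for this example):

```python
import pandas as pd

# Hypothetical raw order lines exported from a WMS (names are assumptions).
lines = pd.DataFrame({
    "PO":   ["PO1", "PO1", "PO2", "PO2", "PO2"],
    "Item": ["A",   "B",   "A",   "B",   "C"],
    "Qty":  [5,     6,     3,     3,     10],
})

# Pivot to the object representation: one row per item (object),
# one column per PO (feature), zero where an item is absent from a PO.
Q = lines.pivot_table(index="Item", columns="PO", values="Qty",
                      aggfunc="sum", fill_value=0)
print(Q)
```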
Euclidean distance has been applied in various clustering applications in logistics and supply chain management, such as area segmentation [27], decision-making support [28], and facilities planning [29]. Therefore, in this study, the second-order Euclidean distance is selected as the method of aggregating quantity differences between item pairs over multiple POs, as in equation (3.1):
$$D_{ij} = \sqrt{\sum_{o=1}^{O} (q_{io} - q_{jo})^2} \quad (3.1)$$
where:
• $D_{ij}$ is the Euclidean distance between items i and j;
• $o$ is the index of a PO;
• $O$ is the number of POs to be considered;
• $q_{io}$ is the picking quantity of item i in PO o.
The pseudo-code of the function responsible for calculating the distance is given in Table 3.1.
Table 3.1 Pseudo-code of the distance calculation function
1  FUNCTION Distance(ListOfFeatureValues_1, ListOfFeatureValues_2):
2      # ListOfFeatureValues_1 and ListOfFeatureValues_2 have the same length
3      squaredSum = 0
4      FOR each i = 1 to length(ListOfFeatureValues_1):
5          squaredSum = squaredSum + (ListOfFeatureValues_1[i] - ListOfFeatureValues_2[i])^2
6      RETURN SQRT(squaredSum)
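A runnable counterpart of this pseudo-code, consistent with equation (3.1) (a sketch; the function name is an assumption):

```python
import numpy as np

def euclidean_distance(q_i, q_j):
    """D_ij per equation (3.1): root of summed squared quantity differences
    of two items across all POs (features)."""
    q_i, q_j = np.asarray(q_i, dtype=float), np.asarray(q_j, dtype=float)
    return float(np.sqrt(np.sum((q_i - q_j) ** 2)))

# Example: two items with quantities over three POs
print(euclidean_distance([5, 3, 0], [6, 3, 10]))  # ~10.05
```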
Clustering evaluation function
Because the clustering result decides whether some items must be stored in different zones, it is vital to develop a suitable cluster evaluation function that can assess the quality of the clustering process and of the location assignment based on it.
As mentioned earlier, in the context of SLA problems in SOP warehouses, the most critical quality the clustering result must meet is that the quantities of intra-cluster items be as identical as possible, i.e., that the distances between pairs of cluster members be as small as possible.
Therefore, in this study, the evaluation function value for a cluster is constructed from the square root of the sum of squared distances between pairs of items in the cluster, without any computation involving the relationships between a cluster center and the cluster's members, or between cluster centers. This is also the difference between the proposed method and previous methods, for example those of Hatamlou et al. [30] and Tzortzis and Likas [31], in which intra-cluster quality is measured by the total squared error between a cluster center and each of the cluster's members. The square root operator is used to bring the evaluation function back to the same unit as the feature values, supporting an intuitive comparison between the function value and the feature values of objects.
The clustering evaluation function for a cluster is expressed in equation (3.2):
$$E_C = \sqrt{\sum_{i,j \subset C} D_{ij}^2} \quad (3.2)$$
where:
• C is the index of a cluster;
• $E_C$ is the clustering evaluation function value corresponding to cluster C;
• i, j are the indices of items (objects) that belong to cluster C;
• $D_{ij}$ is the distance between items i and j.
At the level of the whole data set, i.e., over all clusters, the clustering evaluation function must promote a high possibility that every cluster has good problem-driven clustering quality. Therefore, the overall value of the clustering evaluation function aggregates the per-cluster values, as in equation (3.3):
$$E_{overall} = \sum_{C} E_C \quad (3.3)$$
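A compact sketch of equations (3.2) and (3.3) (assuming, as in equation (3.3) above, that the overall value sums the per-cluster values; function names are assumptions):

```python
import numpy as np
from itertools import combinations

def cluster_eval(D, members):
    """E_C per equation (3.2): sqrt of summed squared pairwise distances
    among the members of one cluster. D is a full distance matrix."""
    return float(np.sqrt(sum(D[i, j] ** 2 for i, j in combinations(members, 2))))

def overall_eval(D, clusters):
    """E_overall per equation (3.3): aggregate over all clusters."""
    return sum(cluster_eval(D, members) for members in clusters)
```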
Initial cluster centers selection method
Along with the many applications of k-means clustering algorithms, the topic of initial cluster center selection has drawn considerable attention because of its important role in the overall process of a k-means clustering algorithm, for example in the studies of Khan and Ahmad [32], Frey and Dueck [33], and Cao et al. [34]. Random selection is the most basic and prevalent method of determining initial cluster centers. However, this method has limitations: being a trial-and-error, uncertain mechanism by nature, it can lead to a time-consuming iterative clustering process before an optimal or acceptable solution is reached.
In previous related studies, some modifications have been proposed to cover the problems of randomness mentioned above, for example by Su and Dy [35], Erisoglu et al. [36], and Xiong et al. [37]. Three common issues arise with random initialization for a k-means clustering algorithm: (1) the large search space of initialization results, (2) the risk of being stuck in a local optimum, and (3) the risk of establishing empty clusters due to the effects of isolated objects, i.e., "outliers" (see Table 3.2). To cover these issues, several solutions have been proposed. They try to develop a method that picks out a set of k initial centers without any further need to rerun, tackling the issue of accidentally reaching a local optimum. In this way, both the problem of many possible initialization results and its consequences, being stuck in a local optimum or establishing empty clusters, are prevented.
Table 3.2 Literature review on issue groups of random initialization (columns: Large search space; Risk of being stuck in a local optimum; Risk of empty clusters / effects of isolated objects)
The main idea common to the proposed methods is, to the best of the author's knowledge, ensuring the widely accepted intuitive concept of a cluster center: an object surrounded by a high density of other objects within a small radius threshold and located relatively far from other cluster centers [37, 38]. From this viewpoint, previous related studies try to quantify three components: "high density", "small radius threshold", and "relatively far". From the literature review, four orientations have been followed to quantify these components (see Table 3.3):
• Dimension-reduction-based: this orientation results from the widely accepted notion that in a high-dimensional vector space, a small number of principal dimensions contain most of the information of the data set. Erisoglu et al. [36] choose the two dimensions that best describe the spread of the data set and project the objects onto them, via the variation coefficient and the correlation coefficient. Su and Dy [35] choose the dimension that contributes most to the largest value of the clustering evaluation function of a cluster and partition the data set along this dimension.
• Previous-centers-based: this orientation determines a new cluster center based on information related to the already chosen cluster centers, considering the interaction between clusters. Erisoglu et al. [36] use the cumulative distance between an object and the chosen center candidates (between an object and the mean of the data set in the special case of the first candidate) to ensure that the nearest neighbors of an object cannot become a center candidate. Xiong et al. [37] select the data object that is furthest from the set of previously chosen cluster centers (the distance between an object and a set being defined, in that study, as the minimum distance between the object and any element of the set).
• Cluster-radius-threshold-based: this orientation focuses on assuring an intuitively acceptable radius threshold for all established clusters. The method proposed by Xiong et al. [37] uses the mean distance of the data set as a cluster radius threshold, and then as a variable to categorize whether an object is an isolated object (to be removed from the candidate set) or not.
• Density-based: this orientation considers density (within a determined radius of an object) as one of the control parameters of the initial selection method, in addition to distances between objects. As a result, it strongly supports selecting cluster centers that fit the concept of a "cluster center" mentioned above. Xiong et al. [37] define density with respect to the radius threshold determined by the mean distance of the data set, and then use density as a criterion to select cluster centers sequentially.
Table 3.3 Summary of quantification orientations for the initial selection of cluster centers (columns: Dimension-reduction-based; Previous-centers-based; Cluster-radius-threshold-based; Density-based)
In this study, the author inherits the contributions in quantifying these three components and tackles some remaining issues of the previous studies, while ensuring suitability to the operational context of SLA problems in SOP warehouses. Two components of the initialization phase need to be developed: dimension reduction (or feature extraction) and initialization.
For the dimension reduction component, Principal Component Analysis (PCA), invented in 1901 by Karl Pearson [39], is a prevalent solution. To the best of the author's knowledge, this method has three main characteristics:
• Spread representation: the dimensions (or features) chosen to implement dimension reduction, called "principal components" from now on, must best reflect the spread (or variance) of the data set;
• Orthogonality: the principal components must be orthogonal to each other;
• Information loss minimization: the lower-dimensional space onto which the data objects are projected must minimize information loss.
Additionally, the three characteristics above also appear in the two related studies of Su and Dy [35] and Erisoglu et al. [36]. It is therefore natural to apply the PCA method to develop the dimension reduction component. As a first application, the author plans to apply the PCA process presented in a previous study [40] via the dedicated PCA function in the Scikit-learn library for the Python programming language (sklearn.decomposition.PCA); a usage sketch is given below. However, due to the time limitation of the thesis, the dimension reduction component is only partially conceptualized here; its details will be considered a future direction in following works.
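A minimal usage sketch of the Scikit-learn function mentioned above (the quantity matrix and the retained-variance ratio are assumptions for illustration; this is not yet integrated into the proposed algorithm):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical item-by-PO quantity matrix: rows = items (objects),
# columns = POs (features).
Q = np.array([[5., 0., 3.],
              [6., 1., 3.],
              [0., 9., 10.],
              [1., 8., 9.]])

# Keep enough principal components to retain 95% of the variance
# (the 0.95 threshold is an assumption).
pca = PCA(n_components=0.95)
Q_reduced = pca.fit_transform(Q)
print(Q_reduced.shape, pca.explained_variance_ratio_)
```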
Because SLA results are requested right after items are received from inbound docks, the decision on where an item will be placed must be given within a relatively short time, so that warehouse workers can be released as soon as possible for other receiving requests or other warehouse activities such as order picking, packing, or checking. The practical requirement is therefore a fast cluster center selection algorithm to initialize the applied k-means clustering algorithm. To operate fast, the selection algorithm should not contain any highly complex calculation or analysis; still, the intuitive concept of a cluster center must be ensured at the same time by an appropriate measurement.
According to Li et al. [38], a cluster center is an object surrounded by a high density of other objects within a small radius threshold and located relatively far from other cluster centers. Li et al. [38] proposed an algorithm that targets the radius threshold and the density. Considering the same idea in the context of the problem statement in sub-section 3.1, the radius threshold models the acceptable maximum difference between the quantities of two items, and the density models the number of items whose quantities are highly close to that of the cluster center. Given the fit between the algorithm and the problem, the author applies this algorithm for initial cluster center selection.
Additionally, the algorithm of Li et al. [38] also reduces, to some extent, the uncertainty of random initialization while preserving the concept of a cluster center. It relies only on simple comparison and calculation operators, thus covering the need for a fast algorithm. The detailed steps of the algorithm, with notes for the problem at hand, include:
1. Calculate distance values (based on the method chosen in sub-section 3.3) between all pairs of objects.
2. Calculate each object's distinct neighborhood radius based on an m-nearest-neighbor (m-NN) algorithm, called the "m-NN radius". In this study, the m-NN radius of an object is the largest of the distances to its m nearest objects. The value of the parameter m is set to the number of pickers participating in the shift under consideration, because this raises the possibility that the pickers finish a PO simultaneously in an SOP process, cutting down the tardiness of the PO.
3. Calculate the average of all m-NN radius values from step 2, called the average m-NN radius, which serves as the neighborhood radius threshold; each object's density is then the number of objects within this threshold of it, and the candidate measure of each object is computed from its m-NN radius and density (see Table 3.7 for an example of these three values). A sketch of steps 2 and 3 is given below.
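A compact sketch of steps 2 and 3, consistent with the mNNDf function in Appendix B (the exact candidate-measure formula is not reproduced here, so only the radius and density computations are shown; the function name is an assumption):

```python
import numpy as np

def mnn_radius_and_density(D, m):
    """For each object: m-NN radius = largest of the m smallest distances
    to other objects; density = number of objects within the average radius."""
    n = len(D)
    radii = np.empty(n)
    for i in range(n):
        d = np.delete(D[i], i)            # distances to the other objects
        radii[i] = np.sort(d)[:m].max()   # m-NN radius of object i
    avg_radius = radii.mean()             # threshold shared by all objects
    density = np.array([(np.delete(D[i], i) <= avg_radius).sum()
                        for i in range(n)])
    return radii, density
```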
Determination of the number of clusters
In the class of k-means algorithms, to the best of the author's knowledge, the Elbow method is widely known and used for determining or predicting the number of clusters, for example in the studies of González-Cabrera et al. [41], Amin et al. [42], and Hassan et al. [43]. The theory behind the technique was first put forth by Robert L. Thorndike in 1953 [44]. The main idea of this method is to choose a number of clusters such that adding another cluster does not much improve the total overall within-cluster variance [45]; this point is called the "elbow point". In spite of its widely acknowledged advantages, this method has two problems:
(1) Intense iteration: it requires iterative, sequential runs over different values of the number of clusters, so it consumes a large amount of execution time if the range of values to be tested is large;
(2) Need for human intervention: to apply its results, a user must be able, with some basic statistical knowledge, to understand the visualization output of the method, extract the necessary insights, and decide whether more runs should be executed; the method therefore requires specialists and introduces a certain delay for human intervention into the overall process of the algorithm.
Therefore, when determining the number of clusters in this study, the author builds a method that satisfies both of the following requirements:
• Release an acceptable value: the method must be able to release an acceptable value for the number of clusters, in terms of the clustering evaluation function, that plays the role of the unique value of k in the following steps of the k-means algorithm, without trials over other values;
• No human intervention: the acceptable value above must be introduced without any need for human intervention.
One method that could cover the two requirements above is the "rate of cumulative frequency distribution change", or RCFDC, from the study of Othata and Pantaragphong [46]. Their idea is that the cumulative frequency distribution contains information on both the current value and all smaller values of a feature, so any change in this distribution reflects the appearance of new values. As a result, the border between two clusters can be placed in any interval where there is no consecutive change in the value of the cumulative frequency distribution, without any trials over other cluster numbers and without any human intervention during execution. However, because the algorithm loops over features, its running time depends heavily on the number of features. In this thesis, each PO is a feature, and the number of POs is increasing dramatically in the current e-commerce era, so the author believes the algorithm must be inherited with some modifications to cut down the effect of the number of POs.
Consequently, the author develops a method that gives an acceptable value of k in an easier and more stable way by modifying the method of Othata and Pantaragphong above. The modification points include:
1. Replacing the RCFDC with the cumulative rate of change of the candidate measure, or CRCCM (see sub-section 3.5 for the calculation and the meaning of the candidate measure);
2. Instead of determining breaking points from intervals whose RCFDCs are relatively small and deriving the number of clusters from them, the author picks the number of clusters based on a threshold of CRCCM. In more detail, all the objects before the CRCCM exceeds $\varepsilon_{Initial}$ for the first time are considered cluster centers, and the number of clusters is determined from that (the value of $\varepsilon_{Initial}$ is designated in advance). If necessary, one additional cluster is formed to cover the remaining objects (the number of remaining objects is determined by comparing the corresponding cumulative density with the chosen value of k).
The pseudo-code of the function that determines the number of clusters, k, is presented in Table 3.6.
Table 3.6 Pseudo-code of the function to determine the number of clusters
1   FUNCTION DetermineK(initialClusterCenterData, epsilonInitial):
2       SORT_DESCENDING initialClusterCenterData[Candidate Measure]
3       k = 0
4       FOR each object i in initialClusterCenterData:
5           IF i is located in or after the second order THEN
6               calculate the rate of change of Candidate Measure between i and the previous object
7               CRCCM for i = CRCCM for the previous object + this rate
8           END IF
9           IF CRCCM for i ≤ epsilonInitial THEN
10              k = k + 1
11          ELSE
12              BREAK
13      RETURN k (adjusted by one additional cluster if the cumulative density of the k centers does not cover all objects)
To illustrate the method, an example is given in Table 3.7 and Figure 3.1. Table 3.7 shows the output of calculating the m-NN radius, density, and candidate measure (as described in sub-section 3.5) for a data set of 16 items, with the candidate measure sorted in descending order. There are four items before the CRCCM exceeds $\varepsilon_{Initial} = 10\%$, so the value of k is 4. However, because the cumulative density of these four centers is 12, one more cluster is needed to contain the remaining four items, leading to an adjusted k of 5.
Table 3.7 An example of the method of determining the number of clusters (columns: Item; m-NN Radius; Density; Candidate Measure)
Figure 3.1 visualizes the idea of the CRCCM-based method using the data from Table 3.7. As can be seen in the figure, from item 12 onward the value of CRCCM increases dramatically, meaning that the candidate measure decreases significantly (the concept of significance is quantified and controlled by $\varepsilon_{Initial}$). The first four items (10, 16, 8, 12) are chosen because they have the top candidate measures and do not reach the significance threshold of change.
Figure 3.1 An example of the method of determining the number of clusters
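A minimal sketch of the CRCCM-based selection described above (the rate-of-change definition is an assumption inferred from the example, not taken verbatim from the thesis's pseudo-code; the function name is also an assumption):

```python
import numpy as np

def choose_k(candidate_measure, density, n_objects, eps_initial=0.10):
    """Scan objects by descending candidate measure, accumulating its rate
    of change (CRCCM), and stop once the CRCCM first exceeds eps_initial;
    add one extra cluster if the chosen centers' cumulative density does
    not cover all objects (as in the Table 3.7 example)."""
    order = np.argsort(candidate_measure)[::-1]
    cm = np.asarray(candidate_measure, dtype=float)[order]
    dens = np.asarray(density)[order]
    k, crccm = 1, 0.0                      # the top candidate is always a center
    for prev, cur in zip(cm, cm[1:]):
        crccm += (prev - cur) / prev       # assumed rate-of-change definition
        if crccm > eps_initial:
            break
        k += 1
    return k + 1 if dens[:k].sum() < n_objects else k
```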
Convergence condition
Convergence conditions decide how the algorithm stops at a proper time, contributing to the quality of the clustering evaluation function. Pérez et al. [47] proposed a convergence condition that aims to reduce the number of iterations and improve the quality of the clustering process. They observed that most previous convergence conditions, such as stopping after a given number of iterations, stopping when there is no exchange of objects among groups, or stopping when the difference between centroids at two consecutive iterations is smaller than a given threshold, did not include the sum of intra-cluster squared errors as a component, even though this sum is the general objective function of k-means algorithms; their paper addresses this issue.
The solution orientation of Pérez et al. above is used to develop a reasonable convergence condition in this thesis. The clustering evaluation function, as expressed in sub-section 3.4, is incorporated into the convergence condition by choosing an appropriate number of iterations based on the experimental relationship between the iteration count and the clustering evaluation function. Because the equation of the clustering evaluation function is context-driven, the convergence condition is expected to fit the concept of flow in sub-section 3.2.
The idea of a threshold on indicator change from sub-section 3.6 is also reused for the convergence condition. In particular, the algorithm converges when the change in the clustering evaluation function value exceeds $\varepsilon_{Convergence}$ for a number of consecutive iterations equal to $\delta$ ($\varepsilon_{Convergence}$ and $\delta$ are pre-determined parameters).
Additionally, because of isolated data objects, the number of non-empty clusters (clusters that have one center and at least one member) can decrease under the effect of the clustering evaluation function; therefore, one more stopping condition is applied when the number of non-empty clusters becomes smaller than the selected number of clusters.
To summarize, the algorithm stops when at least one of the following two conditions is met; a sketch of this stopping logic is given below:
• The convergence condition is met (the change in the clustering evaluation function value exceeds the threshold for the required number of consecutive iterations);
• The number of non-empty clusters is smaller than the selected number of clusters.
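A small sketch of the combined stopping logic (the class and parameter names are assumptions):

```python
class StopChecker:
    """Stops when the evaluation-function change exceeds eps_convergence
    for delta consecutive iterations, or when the number of non-empty
    clusters falls below the selected k (per the two conditions above)."""
    def __init__(self, eps_convergence, delta, k):
        self.eps, self.delta, self.k = eps_convergence, delta, k
        self.prev_e, self.streak = None, 0

    def should_stop(self, e_overall, n_nonempty_clusters):
        if n_nonempty_clusters < self.k:
            return True
        if self.prev_e is not None:
            # count consecutive iterations whose change exceeds the threshold
            self.streak = self.streak + 1 if abs(e_overall - self.prev_e) > self.eps else 0
        self.prev_e = e_overall
        return self.streak >= self.delta
```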
Flow of the algorithm
The master flow of the k-means algorithm applied in this study inherits the general four-step process in Chapter 2 and is depicted in Figure 3.2. The detailed content of each component of the master flow is enumerated in Table 3.8.
Figure 3.2 The master flow of the k-means algorithm
Table 3.8 The detailed content of each component of the master flow

Component | Input | Process | Output
Distance calculation | Quantities (of items in POs) | Formulating the matrix of distance values between objects based on equation (3.1) | Distance value matrix (matrix of distance values between items)
Initializing the set of cluster centers | Distance value matrix | The structure of this component follows the idea presented in sub-section 3.5 | The set of initial cluster centers
Formulating clusters based on the defined distance measure | The set of cluster centers and the distance value matrix | Grouping each object into the cluster whose center has the smallest distance value to this object | A clustered object database (object database with clustering result)
Clustering evaluation | The clustered object database | Producing the current clustering result's overall evaluation value based on the calculation method presented in sub-section 3.4 | The overall clustering evaluation value
Convergence check | The overall clustering evaluation value | Based on the idea in sub-section 3.7 | The signal of whether the convergence condition is satisfied or not
Updating cluster centers | The signal that the convergence condition is not satisfied | New cluster centers are determined from the means of objects within the current clusters, supplying the material for a new clustering pass | The new set of cluster centers
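Putting the components together, here is a condensed sketch of the master flow (the function names refer to the sketches in the previous sub-sections; the wiring and the candidate-measure formula are assumptions for illustration, not the thesis's exact implementation):

```python
import numpy as np

def master_flow(Q, eps_initial, eps_convergence, delta, m, max_iters=100):
    """Condensed master flow: distance matrix -> k and initial centers ->
    iterate (assign, evaluate, check convergence, update centers)."""
    n = len(Q)
    # Distance calculation (equation (3.1)) for all item pairs
    D = np.sqrt(((Q[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2))
    # Initialization (sub-sections 3.5-3.6); helpers sketched earlier
    radii, density = mnn_radius_and_density(D, m)
    candidate = density / radii            # assumed candidate-measure form
    k = choose_k(candidate, density, n, eps_initial)
    centers = Q[np.argsort(candidate)[::-1][:k]].astype(float)
    checker = StopChecker(eps_convergence, delta, k)
    for _ in range(max_iters):
        # Formulating clusters: nearest center per object
        labels = ((Q[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        clusters = [np.flatnonzero(labels == i) for i in range(k)]
        e = overall_eval(D, clusters)      # equation (3.3)
        # Convergence check: non-empty = center plus at least one member
        if checker.should_stop(e, sum(len(c) > 1 for c in clusters)):
            break
        # Updating cluster centers from the means of the current clusters
        centers = np.array([Q[c].mean(axis=0) if len(c) else centers[i]
                            for i, c in enumerate(clusters)])
    return labels, centers, e
```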
EXPERIMENTAL RESULTS AND ANALYSIS
Operational context and notations
Let us consider a warehouse that operates an SOP system. At the beginning of each shift, the supervisor must assign a zone (which combines a number of lines) to each of $N_p$ pickers. The supervisor exports data from the WMS (Warehouse Management System) and knows that there are $N_o$ POs belonging to the shift, involving $N_i$ items. To achieve high efficiency in terms of the OPT of each PO, the supervisor needs to raise the degree to which items that are frequently requested together are located in different zones (as in the problem statement). To achieve this, he must know which items are considered highly frequently requested together. To align with the problem and algorithm proposed in Chapter 3, the frequency of being requested together is measured through the quantities of items in POs, $Q_{io}$ ($i \in \{1, 2, \ldots, N_i\}$, $o \in \{1, 2, \ldots, N_o\}$), and the indicator for making decisions is the clustering evaluation function value $E_{overall}$ as in equation (3.3).
Input data
From the operational context depicted above and the flowchart of the algorithm in sub-section 3.8, the required input data are listed below:
1. The number of pickers $N_p$;
2. The number of POs (features) $N_o$;
3. The number of items (objects) $N_i$;
4. The matrix Q of the quantity of each item in each PO, $Q = [Q_{io}]$, with $N_i$ rows and $N_o$ columns.
Implementation and results
The efficiency and effectiveness of an algorithm can differ across sizes of input data. Therefore, in this study, the author tests the proposed algorithm on three versions of input data, corresponding to six sets of parameters in Table 4.1 (see Appendix A for the full data), using the Python programming language in the Google Colaboratory environment (see the Python program in Appendix B). Five parameter sets are encoded from an actual data set for corporate information security reasons, and the remaining one comes from the study of Chuang et al. (2012) [48]. Table 4.1 records the adjusted value of k (based on the process presented in sub-section 3.6), the number of iterations the algorithm spent before terminating, and the initial and final values of the overall clustering evaluation function $E_{overall}$ for each version of the input data. Parameter settings are:
Table 4.1 Summary of experimental results
As can be seen in Table 4.1, for the low-dimensional data sets (versions 1, 2, 3, and 4), the algorithm achieves a high level of improvement in $E_{overall}$, from 13.08% to 87.75%. The high-dimensional data sets (versions 5 and 6) show no percentage improvement in the clustering evaluation values, but their figures do not worsen either. As a result, the algorithm can be considered to drive the solution strongly toward the optimum for low-dimensional data sets, while indicating a potential to obtain acceptable solutions for high-dimensional data sets.
Managerial insights
This section details the key managerial insights indicated by the experimental results presented in section 4.3.
Considering the expression of $E_{overall}$ in equation (3.3) and that of its component $E_C$ in equation (3.2), any improvement in the value of $E_{overall}$ indicates a specific decrease in the intra-cluster differences. In mathematical language, this statement can be argued as follows:
• Since $E_{overall} = \sum_C E_C = \sum_C \sqrt{\sum_{i,j \subset C} D_{ij}^2}$, if $\Delta E_{overall} < 0$, then at least one of the $\Delta \sqrt{\sum_{i,j \subset C} D_{ij}^2}$ components is negative (a);
• According to the relationship between a square root and the quantity under it, $\Delta \sqrt{\sum_{i,j \subset C} D_{ij}^2} < 0$ implies $\Delta \sum_{i,j \subset C} D_{ij}^2 < 0$ (b);
• Eventually, from (a) and (b), if $\Delta E_{overall} < 0$, there is at least one cluster C with $\Delta D_{ij}^2 < 0$ for some $i, j \subset C$; that is, at least one cluster experiences a decrease in intra-cluster differences.
On the one hand, as presented in sub-section 2.1.1, if items that are demanded in similar PO quantities are located in different zones, the idle time between pickers when they simultaneously fulfill a PO is cut down, improving the makespan of the PO. On the other hand, the managerial insight above indicates that the clusters produced by the proposed algorithm group items whose values are highly similar across the dimensions of the POs' demand quantities. Consequently, the proposed algorithm can be considered to orient the clustering result toward convergence in the quantity differences between items, increasing the likelihood that the completion time of a PO is closer to its due time, and offering suggestions for managers and supervisors in SOP warehouses when dealing with the service-level agreement problems of the order-picking process (in situations such as those mentioned in sub-section 4.1).
Discussions on computational complexity
In this thesis, the author applies a method integrating two concepts, the m-NN radius and the CRCCM, to obtain simultaneously the value of k (the number of clusters) and the initial set of cluster centers. This integrated method is believed to tackle the computational-complexity issues of two previous methods: random initialization and the Elbow method.
As a brief reminder from sub-section 3.5, the proposed initialization method addresses three common problems of random initialization for k-means clustering: various initialization results without any control parameters, the risk of being stuck in a local optimum, and the risk of establishing empty clusters because of isolated data objects.
Additionally, as discussed in sub-section 3.6, both the Elbow method and the proposed CRCCM-based method aim to provide a proper value for the number of cluster centers, but they sit at different positions within a k-means clustering algorithm and thus require different patterns of resource usage. Although the Elbow method gives a more direct relationship between the number of cluster centers and the clustering performance function, it requires more operations, a reflection of the common trade-off between the efficiency of computational resource usage and the quality of the obtained solution: to plot one data point on the Elbow-method graph, a complete run of the clustering algorithm must be executed. Conversely, the proposed CRCCM-based method does not require a full run (as can be seen in the flowchart in Figure 3.2), but it may lead to a less efficient value of the clustering evaluation function.
CONCLUSION AND FUTURE DIRECTIONS
Conclusion
In the era of e-commerce, as the number of short-cycle-time orders increases while order quantities become smaller, warehousing becomes more important in global logistics and supply chain management. Therefore, warehouse management is a practical and attractive field for researchers. Zoning is a useful policy in warehouse management because it lets managers separate operations into different areas of the warehouse layout, contributing to the quality of performance indicators. The synchronized zone order picking system targets increasing resource utilization and cutting down the idle time between pickers when completing a PO. To enhance this advantage, this thesis proposes a k-means clustering algorithm that provides warehouse staff with lists of items that should be located in different zones. In this way, the SLA result contributes to improving the efficiency of the order-picking process, supporting the service level of the warehouse.
The main idea of the proposed algorithm is that when two items are demanded in similar picking quantities, they should not be located in the same zone. The design of the proposed k-means clustering algorithm inherits several ideas from related previous studies, with modifications added to suit the operational context of SOP warehouses. Each data object represents an item, and the selected features are therefore the item quantities in POs. Euclidean distance is applied to calculate the distance between two data objects. The number of clusters and the set of initial cluster centers are determined by a metric called the "candidate measure", which ensures that cluster center candidates are data objects surrounded by a high density of other objects and far enough from other candidates. The clustering evaluation function is based on intra-cluster distances and is shown to be able to improve order-picking efficiency. The experimental results show that the algorithm can decrease the value of the clustering evaluation function, or at least does not worsen it.
Future directions
Several enhancements should be applied to the design of the proposed k-means clustering algorithm, such as PCA to reduce the dimensionality of the data set before the initialization process, and machine-learning-based methods to automatically determine the values of the controlling parameters of the algorithm's steps from the data set, instead of providing them arbitrarily.
1. P T Huong, L Q Anh, L T T Thao, N T Phuong, H H Duc, and L K Kien, “Redesigning the warehouse's material storage location: a case study of the warehouse of an electric motorcycle assembly company,” in Proceedings of International Conference on Logistics and Industrial Engineering 2021, Ho Chi Minh City, 2021, pp 167-173
2. D H Phuoc, N P Long, T Q Cuong, N D Huy, H H Duc, and L K Kien, “Enhancing the garment company’s manufacturing facility layout,” in Proceedings of International Conference on Logistics and Industrial Engineering 2021, Ho Chi Minh City, 2021, pp 174-179
[1] Viet Nam Ministry of Industry and Trade, Logistics Report 2021. Publishing House for Industry and Trade, 2021, pp 71
[2] N T Minh, “Factors affecting logistics performance in e-commerce: Evidence from Vietnam,” Uncertain Supply Chain Management, vol 9, pp 957-962, 2021
[3] R J Morton, “An approach to warehouse design: matching the concept to the throughput,” Retail and Distribution Management, vol 2, pp 42-45, 1974
[4] R L Harper, “Warehouse technology in the supply chain management systems,” in Proceedings-Annual Reliability and Maintainability Symposium (RAMS), San Jose, California, USA: IEEE, 2010, pp 1-5
[5] S Ene and N Öztürk, “Storage location assignment and order picking optimization in the automotive industry,” The International Journal of Advanced Manufacturing Technology, vol 60, pp 787-797, 2012
[6] N Faber, M B M De Koster, and A Smidts, “Organizing warehouse management,” International Journal of Operations & Production Management, vol 33, pp 1230-1256, 2013
[7] J J Bartholdi and S T Hackman, Warehouse & distribution science: release 0.98 The Supply Chain and Logistics Institute, 2017
[8] E H Frazelle, World-class warehousing and material handling McGraw-Hill
[9] F Guerriero, R Musmanno, O Pisacane, and F Rende, “A mathematical model for the Multi-Levels Product Allocation Problem in a warehouse with compatibility constraints,” Applied Mathematical Modelling, vol 37, pp 4385-
[10] J A Tompkins, J A White, Y A Bozer, and J M A Tanchoco, Facilities planning John Wiley & Sons, 2010
[11] R J Kuo, P H Kuo, Y R Chen, and F E Zulvia, “Application of metaheuristics-based clustering algorithm to item assignment in a synchronized zone order picking system,” Applied Soft Computing, vol 46, pp 143-150, 2016 [12] F Bindi, R Manzini, A Pareschi, and A Regattieri, “Similarity-based storage allocation rules in an order picking system: an application to the food service industry,” International Journal of Logistics: Research and Applications, vol 12, pp 233-247, 2009
[13] Y Zhang, “Correlated storage assignment strategy to reduce travel distance in order picking,” IFAC-PapersOnLine, vol 49, pp 30-35, 2016
[14] C C Jane and Y W Laih, “A clustering algorithm for item assignment in a synchronized zone order picking system,” European Journal of Operational
[15] J Reyes, E Solano-Charris, and J Montoya-Torres, “The storage location assignment problem: A literature review,” International Journal of Industrial Engineering Computations, vol 10, pp 199-224, 2019
[16] M E Fontana and V S Nepomuceno, “Multi-criteria approach for products classification and their storage location assignment,” The International Journal of Advanced Manufacturing Technology, vol 88, pp 3205-3216, 2017
[17] V R Muppani and G K Adil, “Efficient formation of storage classes for warehouse storage location assignment: a simulated annealing approach,”
[18] R Q Zhang, M Wang, and X Pan, “New model of the storage location assignment problem considering demand correlation pattern,” Computers & Industrial Engineering, vol 129, pp 210-219, 2019
[19] E Zhu and R Ma, “An effective partitional clustering algorithm based on new clustering validity index,” Applied soft computing, vol 71, pp 608-621, 2018 [20] A Saxena, M Prasad, A Gupta, , and C T Lin, “A review of clustering techniques and developments,” Neurocomputing, vol 267, pp 664-681, 2017
[21] A K Jain, M N Murty, and P J Flynn, “Data clustering: a review,” ACM computing surveys (CSUR), vol 31, pp 264-323, 1999
[22] F Murtagh and P Contreras, “Algorithms for hierarchical clustering: an overview,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol 2, pp 86-97, 2012
[23] J Sander, X Qin, Z Lu, N Niu, and A Kovarsky, “Automatic extraction of clusters from hierarchical clustering representations,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, Berlin, Heidelberg: Springer, 2003, pp 75-87
[24] S B Salem, S Naouali, and Z Chtourou, “A fast and effective partitional clustering algorithm for large categorical datasets using a k-means based approach,” Computers & Electrical Engineering, vol 68, pp 463-483, 2018 [25] S J Nanda and G Panda, “A survey on nature inspired metaheuristic algorithms for partitional clustering,” Swarm and Evolutionary computation, vol 16, pp 1-
[26] D J C MacKay, Information theory, inference and learning algorithms. Cambridge University Press, 2003
[27] B Zheng, F Z Tang, and H L Yang “Tobacco distribution based on improved
K-means algorithm,” in 2009 IEEE/INFORMS International Conference on Service Operations, Logistics and Informatics, Chicago, Illinois: IEEE, 2009, pp
[28] G Zhang, J Shang, and W Li, “An information granulation entropy-based model for third-party logistics providers evaluation,” International Journal of Production Research, vol 50, pp 177-190, 2012
[29] P T Mativenga, “Sustainable Location Identification Decision Protocol
(SuLIDeP) for determining the location of recycling centres in a circular economy,” Journal of cleaner production, vol 223, pp 508-521, 2019
[30] A Hatamlou, S Abdullah, and H Nezamabadi-Pour, “A combined approach for clustering based on K-means and gravitational search algorithms,” Swarm and Evolutionary Computation, vol 6, pp 47-52, 2012
[31] G Tzortzis and A Likas, “The MinMax k-Means clustering algorithm,” Pattern recognition, vol 47, pp 2505-2516, 2014
[32] S S Khan and A Ahmad, “Cluster center initialization algorithm for K-means clustering,” Pattern recognition letters, vol 25, pp 1293-1302, 2004
[33] B J Frey and D Dueck, “Clustering by passing messages between data points,”
[34] F Cao, J Liang, G Jiang, “An initialization method for the K-Means algorithm using neighborhood model,” Computers & Mathematics with Applications, vol
[35] T Su and J Dy, “A deterministic method for initializing k-means clustering,” in
16 th IEEE international conference on tools with artificial intelligence, Boca
Raton, FL, USA: IEEE, 2004, pp 784-786
[36] M Erisoglu, N Calis, and S Sakallioglu, “A new algorithm for initial cluster centers in k-means algorithm,” Pattern Recognition Letters, vol 32, pp 1701-
[37] C Xiong, Z Hua, K Lv, and X Li, “An Improved K-means text clustering algorithm By Optimizing initial cluster centers,” in 2016 7th International Conference on Cloud Computing and Big Data (CCBD), Macau, China: IEEE,
[38] Y Li, J Cai, H Yang, J Zhang, and X Zhao, “A novel algorithm for initial cluster center selection,” IEEE Access, vol 7, pp 74683-74693, 2019
[39] Wikipedia “Principal component analysis.” Internet: https://en.wikipedia.org/wiki/Principal_component_analysis, Dec 23, 2022
[40] H Zhao, J Zheng, J Xu, and W Deng, “Fault diagnosis method based on principal component analysis and broad learning system,” IEEE Access, vol 7, pp 99263-99272, 2019
[41] N González-Cabrera, J Ortiz-Bejar, A Zamora-Mendez, and M R A Paternina,
“On the Improvement of representative demand curves via a hierarchical agglomerative clustering for power transmission network investment,” Energy, vol 222, pp 119989, 2021
[42] W Amin, F Hussain, and S Anjum, “iHPSA: An improved bio-inspired hybrid optimization algorithm for task mapping in Network on Chip,” Microprocessors and Microsystems, vol 90, pp 104493, 2022
[43] M M Hassan, S Mollick, and F Yasmin, “An unsupervised cluster-based feature grouping model for early diabetes detection,” Healthcare Analytics, vol 2, pp
[44] Wikipedia “Elbow method (clustering).” Internet: https://en.wikipedia.org/wiki/Elbow_method_(clustering), Dec 23, 2022
[45] M Pacella, “Unsupervised classification of multichannel profile data using PCA:
An application to an emission control system,” Computers & Industrial Engineering, vol 122, pp 161-169, 2018
[46] P Othata and P Pantaragphong, “Number of cluster for k-means clustering by
RCFDC method,” in the 22nd Annual Meeting in Mathematics (AMM),
Department of Mathematics, Faculty of Science Chiang Mai University, Chiang Mai, Thailand, 2017
[47] O J Pérez, R R Pazos, R L Cruz, S G Reyes, T R Basave, and H H Fraire,
“Improving the efficiency and efficacy of the k-means clustering algorithm through a new convergence condition,” in International Conference on Computational Science and Its Applications, 2007, Berlin, Heidelberg: Springer, pp 674-682
[48] Y F Chuang, H T Lee, and Y C Lai, “Item-associated cluster assignment model on storage allocation problems,” Computers & industrial engineering, vol
Version 1
Item Order1 Order2 Order3 Order4 Order5 Order6 Order7 Order8 Order9 Order10
Version 2
Item Order1 Order2 Order3 Order4 Order5 Order6 Order7 Order8 Order9 Order10
Versions 3-4-5-6
Access this link: https://drive.google.com/drive/folders/18BMBwceUwJzgiBwXDIEo3sKjzvuD0Kt5?us p=sharing or scan the QR code below:
Appendix B PYTHON CODE

```python
import pandas as pd
import numpy as np

# Data about the similarity dimensions of two data points is stored as two
# tuples, which are then passed to this function.
def EuclideanDis(order, similarityDimTuple1, similarityDimTuple2):
    tuple1 = similarityDimTuple1
    tuple2 = similarityDimTuple2
    kq = 0
    for i in range(0, len(tuple1)):
        kq = kq + pow((tuple1[i] - tuple2[i]), order)
    kq = pow(kq, 1 / order)
    return kq

# Builds the item-by-item distance matrix (equation (3.1), with order = 2).
def DistanceDfOps(ObjDf, simDim, ItemList, orderQty):
    outDf = pd.DataFrame(float(0), index=ItemList, columns=ItemList)
    for i1 in ItemList:
        for i2 in ItemList:
            df = ObjDf.copy()
            indexList = df.index[df['Item'] == i1].tolist()
            df1 = df.loc[indexList[0], list(simDim[0:orderQty])]
            tp1 = tuple(df1)
            indexList = df.index[df['Item'] == i2].tolist()
            df2 = df.loc[indexList[0], list(simDim[0:orderQty])]
            tp2 = tuple(df2)
            outDf.loc[i1, i2] = EuclideanDis(2, tp1, tp2)
            # print("Distance between", i1, "and", i2, "is", outDf.loc[i1, i2])
    return outDf

## m-NN-based initialization
# NofItems and ListOfItems are globals defined by the data-loading part of
# the program (not included in this listing).
def mNNDf(objDf, disDf, m):
    outDf = pd.DataFrame(float(0), index=range(0, NofItems),
                         columns=["Item", "mNN Radius"])
    outDf["Item"] = objDf["Item"].copy()
    for i in outDf["Item"]:
        # m+1 smallest distances in column i include the object itself (0),
        # so the maximum of them is the m-NN radius.
        tempSr = disDf.nsmallest(m + 1, i, keep='first')
        indexList = outDf.index[outDf['Item'] == i].tolist()
        mNNr = tempSr.loc[:, i].max()
        if mNNr == 0:
            mNNr = 1 / 1000000
        outDf.loc[indexList[0], "mNN Radius"] = mNNr
    outTempSr = outDf["mNN Radius"]
    outArray = outTempSr.array
    outAVR = np.mean(outArray)
    outDf["Density"] = 'NaN'
    df = disDf
    for i in ListOfItems:
        indexList = outDf.index[outDf['Item'] == i].tolist()
        # The source listing is truncated here; "<= outAVR" is an assumed
        # completion (density = objects within the average m-NN radius).
        density = len(df[df[i] <= outAVR])
        outDf.loc[indexList[0], "Density"] = density
    # The rest of the listing (candidate measure, etc.) is truncated in the
    # source and is therefore not reproduced.
    return outDf
```