Neurocomputing 48 (2002) 199–216
www.elsevier.com/locate/neucom

Uncovering hierarchical structure in data using the growing hierarchical self-organizing map

Michael Dittenbach, Andreas Rauber*, Dieter Merkl
Institute of Software Technology, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040 Vienna, Austria

Received 31 October 2000; accepted June 2001

Abstract

Discovering the inherent structure in data has become one of the major challenges in data mining applications. It requires stable and adaptive models that are capable of handling the typically very high-dimensional feature spaces. In particular, the representation of hierarchical relations and intuitively visible cluster boundaries are essential for a wide range of data mining applications. Current approaches based on neural networks hardly fulfill these requirements within a single model. In this paper we present the growing hierarchical self-organizing map (GHSOM), a neural network model based on the self-organizing map. The main feature of this novel architecture is its capability of growing both in terms of map size and in a three-dimensional tree structure in order to represent the hierarchical structure present in a data collection during an unsupervised training process. This capability, combined with the stability of the self-organizing map for high-dimensional feature space representation, makes it an ideal tool for data analysis and exploration. We demonstrate the potential of the GHSOM with an application from the information retrieval domain, which is prototypical both of the high-dimensional feature spaces frequently encountered in today's applications and of the hierarchical nature of data. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Self-organizing map (SOM); Unsupervised hierarchical clustering; Document classification; Data mining; Exploratory data analysis

* Corresponding author.
E-mail addresses: mbach@ifs.tuwien.ac.at (M. Dittenbach), andi@ifs.tuwien.ac.at (A. Rauber), dieter@ifs.tuwien.ac.at (D. Merkl).

0925-2312/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.

1 Introduction

Today's information age may be characterized by the increasingly massive production and dissemination of written information. More powerful tools for organizing, searching, and exploring the available mass of information are needed to cope with this situation. An attractive way to assist the user in document archive exploration is based on unsupervised artificial neural networks, especially the self-organizing map (SOM) [8], for document space representation. It has been shown to be well suited for mapping high-dimensional data into a two-dimensional representation space. A number of research publications show that this idea has found appreciation [9,10,13–15,17,20,22], with self-organizing maps being used to visualize the similarity between documents in terms of distances within the two-dimensional map display. Hence, similar documents may be found in neighboring regions of the map.

However, with the increasing amount of information available, some limitations of the SOM have to be addressed. One of these disadvantages is its fixed size in terms of the number of units and their particular arrangement. Without a priori knowledge about the type and the organization of the data, it is difficult to predefine the network's size in order to reach satisfying results. Thus, it might be helpful if the neural network would be able to determine this
number of units as well as their arrangement during its learning process. Second, hierarchical relations between the input data are not mirrored in a straightforward manner. Obviously, we should expect such hierarchical relations in many data collections, e.g. document archives where different subject matters are covered. The identification of these hierarchical relations remains a highly important data mining task that cannot be addressed conveniently within the framework of SOM usage.

In order to overcome these two limitations of self-organizing maps we propose a novel neural network architecture, i.e. the growing hierarchical self-organizing map (GHSOM). This neural network architecture is capable of identifying the required number of units during its unsupervised learning process. Additionally, the data set is clustered hierarchically by relying on a layered architecture comprising a number of independent self-organizing maps within each layer. The actual structure of the hierarchy is determined dynamically to resemble the structure of the input data as accurately as possible. Starting from a rather small high-level SOM, which provides a coarse overview of, e.g., the various topics present in a document collection, subsequent layers are added where necessary to display a finer subdivision of topics. Each map in turn grows in size until it represents its topic to a sufficient degree of granularity. This allows the user to approach and intuitively browse a document collection in a way similar to conventional libraries.

The remainder of this paper is organized as follows. Section 2 provides a brief introduction into the principles of the SOM. A review of related architectures is provided in Section 3. This is followed by a detailed presentation of the architecture and training procedure of the new GHSOM model, analyzing the growth process of each SOM as well as the dynamic creation of its hierarchical architecture, in Section 4. Section 5 demonstrates the automatic hierarchical organization of a document archive comprising articles from Der Standard, a daily Austrian newspaper, followed by some conclusions in Section 6.

2 The self-organizing map

The SOM, as proposed in [6] and described thoroughly in [7,8], is one of the most distinguished unsupervised artificial neural network models. It basically provides cluster analysis by producing a mapping of high-dimensional input data x, x ∈ R^n, onto a usually two-dimensional output space while preserving the topological relationships between the input data items as faithfully as possible. Being a decidedly stable and flexible model, the SOM has been employed in a wide range of applications, ranging from financial data analysis, via medical data analysis, to time series prediction, industrial control, and many more [2,8,25]. A fairly recent bibliography [5] lists more than 3000 papers published on SOM-related research and applications.

Basically, the SOM consists of a set of units i, which are arranged according to some topology, where the most common choice is a two-dimensional grid. Each of the units i is assigned a weight vector m_i of the same dimension as the input data, m_i ∈ R^n. In the initial setup of the model prior to training, the weight vectors might either be filled with random values, or more sophisticated strategies, such as, for example, principal component analysis, may be applied to initialize the weight vectors. In the following expressions we make use of a discrete time notation, with t denoting the current
training iteration.

During each learning step t, an input pattern x(t) is randomly selected from the set of input vectors and presented to the map. The unit c with the highest activity level, i.e. the winner c(t) with respect to the randomly selected input pattern x(t), is adapted in a way that it will exhibit an even higher activity level at future presentations of that specific input pattern. Commonly, the activity level of a unit is based on the Euclidean distance between the input pattern and that unit's weight vector, i.e. the unit showing the smallest Euclidean distance between its weight vector and the presented input vector is selected as the winner. Hence, the selection of the winner c may be written as given in Expression (1):

$c(t): \; \|x(t) - m_c(t)\| = \min_i \{ \|x(t) - m_i(t)\| \}.$    (1)

Adaptation takes place at each learning iteration and is performed as a gradual reduction of the difference between the respective components of the input vector and the weight vector. The amount of adaptation is guided by a learning rate α that is gradually decreasing in the course of time. This decreasing nature of adaptation strength ensures large adaptation steps in the beginning of the learning process, where the weight vectors have to be tuned from their random initialization towards the actual requirements of the input space. The ever smaller adaptation steps towards the end of the learning process enable a fine-tuned input space representation.

As an extension to standard competitive learning, units in a time-varying and gradually decreasing neighborhood around the winner are adapted, too. Pragmatically speaking, during the learning steps of the SOM a set of units around the winner is tuned towards the currently presented input pattern. This enables a spatial arrangement of the input patterns such that alike inputs are mapped onto regions close to each other in the grid of output units. Thus, the training process of the SOM results in a topological ordering of the input patterns. According to [21] the SOM can be viewed as a neural network model performing a spatially smooth version of k-means clustering.

The neighborhood of units around the winner may be described implicitly by means of a neighborhood kernel h_ci taking into account the distance (in terms of the output space) between unit i under consideration and unit c, the winner of the current learning iteration. This neighborhood kernel assigns scalars in the range [0, 1] that are used to determine the amount of adaptation, ensuring that nearby units are adapted more strongly than units further away from the winner. A Gaussian may be used to define the neighborhood kernel as given in Expression (2), where ||r_c − r_i|| denotes the distance between units c and i within the output space, with r_i representing the two-dimensional vector pointing to the location of unit i within the grid:

$h_{ci}(t) = \exp\left( -\frac{\|r_c - r_i\|^2}{2 \cdot \sigma(t)^2} \right).$    (2)

It is common practice that in the beginning of the learning process the neighborhood kernel is selected large enough to cover a wide area of the output space. The spatial width of the neighborhood kernel is reduced gradually during the learning process such that towards the end of the learning process just the winner itself is adapted. Such a reduction is done by means of the time-varying parameter σ in Expression (2). This strategy enables the formation of large clusters in the beginning and fine-grained input discrimination towards the end of the learning process.

In combining these principles of SOM training, we may write the learning rule as given in Expression (3). Please note that we make use of a discrete time notation with t denoting the current learning iteration; α represents the time-varying learning rate, h_ci the time-varying neighborhood kernel, x the currently presented input pattern, and m_i the weight vector assigned to unit i:

$m_i(t+1) = m_i(t) + \alpha(t) \cdot h_{ci}(t) \cdot [x(t) - m_i(t)].$    (3)
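To make Expressions (1)–(3) concrete, the following sketch shows one SOM learning iteration in NumPy. It is an illustrative reconstruction, not the authors' implementation; the grid size, the linear learning-rate schedule and the kernel-width schedule are assumptions chosen only for the example.

```python
import numpy as np

def som_training_step(weights, grid_pos, x, t, n_iter,
                      alpha0=0.5, sigma0=3.0):
    """One SOM learning iteration (illustrative sketch).

    weights  : (n_units, dim) weight vectors m_i
    grid_pos : (n_units, 2) output-space coordinates r_i
    x        : (dim,) input pattern x(t)
    """
    # Expression (1): winner = unit with the smallest Euclidean distance
    c = np.argmin(np.linalg.norm(weights - x, axis=1))

    # Time-varying learning rate and kernel width (assumed linear decay)
    alpha = alpha0 * (1.0 - t / n_iter)
    sigma = sigma0 * (1.0 - t / n_iter) + 1e-3

    # Expression (2): Gaussian neighborhood kernel h_ci(t)
    d2 = np.sum((grid_pos - grid_pos[c]) ** 2, axis=1)
    h = np.exp(-d2 / (2.0 * sigma ** 2))

    # Expression (3): move weight vectors towards the input pattern
    weights += alpha * h[:, None] * (x - weights)
    return weights

# Usage: a 6 x 6 map trained on random 10-dimensional data
rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(6) for j in range(6)], dtype=float)
W = rng.random((36, 10))
data = rng.random((200, 10))
n_iter = 1000
for t in range(n_iter):
    W = som_training_step(W, grid, data[rng.integers(len(data))], t, n_iter)
```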
A simple graphical representation of a self-organizing map's architecture and its learning process is provided in Fig. 1. In this figure the output space consists of a square of 36 units, depicted as circles, forming a grid of 6 × 6 units. One input vector x(t) is randomly chosen and mapped onto the grid of output units. In the second step of the learning process, the winner c showing the highest activation is selected. Consider the winner being the unit depicted as the black unit in the figure. The weight vector of the winner, m_c(t), is now moved towards the current input vector. This movement is symbolized in the input space in Fig. 1. As a consequence of the adaptation, unit c will produce an even higher activation with respect to the input pattern x at the next learning iteration, t + 1, because the unit's weight vector, m_c(t + 1), is now nearer to the input pattern x in terms of the input space. Apart from the winner, adaptation is performed with neighboring units, too. Units that are subject to adaptation are depicted as shaded units in the figure. The shading of the various units corresponds to the amount of adaptation and, thus, to the spatial width of the neighborhood kernel. Generally, units in close vicinity of the winner are adapted more strongly, and consequently they are depicted with a darker shade in the figure.

Fig. 1. SOM training process: adaptation of weight vectors.

3 Related architectures

A number of extensions and modifications have been proposed over the years in order to enhance the applicability of SOMs to data mining, specifically cluster identification. Some of the approaches, such as the U-matrix [27] or the adaptive coordinates and cluster connection techniques [16], focus on the detection and visualization of clusters in conventional SOMs. Similar cluster information can also be obtained using our LabelSOM method [19], which automatically describes the characteristics of the various units. Grouping units that have the same descriptive keywords assigned to them allows topical clusters within the self-organizing map to be identified. However, none of the methods identified above facilitates the detection of the hierarchical structure inherent in the data.

The hierarchical feature map [18] addresses this problem by modifying the SOM network architecture. Instead of training a flat SOM, a balanced hierarchical structure of SOMs is trained. Similar to our GHSOM model, the data mapped onto one single unit is represented at some further level of detail in the lower-level map assigned to this unit. However, this model somehow pretends to represent the data in a hierarchical way rather than really reflecting the structure of the data. This is due to the fact that the architecture of the network has to be defined in advance, i.e. the number of layers and the size of the maps at each layer are fixed prior to network training. This leads to the definition of a balanced tree which is used to represent the data. What we want, however,
is a network architecture definition based on the actual data presented to the network. This requires the SOM to actually use the data available to define its architecture, the appropriate depth of the hierarchy, and the sizes of the maps at each level, none of which is present in the hierarchical feature map model.

Another hierarchical modification of the SOM is constituted by the TS-SOM [11,12]. Yet, the focus of this model lies primarily with speeding up the training process by providing faster winner selection using a tree-based organization of units. However, it does not focus on providing a hierarchical organization of data, as all data are organized on one single flat map.

The necessity of having to define the size of the SOM in advance has been addressed in several models, such as the incremental grid growing [1] or growing grid [3] models. The latter, similar to our GHSOM model, adds rows and columns during the training process, starting with an initial 2 × 2 SOM. However, the main focus of this model lies with an equal distribution of input signals across the map, adding units in the neighborhood of units that represent a disproportionately high number of data points. It thus does not primarily reflect the concept of representation at a certain level of detail, which is rather expressed in the overall quantization error than in the number of data points mapped onto certain areas. The incremental grid growing model, on the other hand, can add new units only on the borders of the map. Neither of these models, however, takes the inherently hierarchical structure of data into account.

4 The growing hierarchical self-organizing map

4.1 The principles

While the SOM has proven to be a very suitable tool for detecting structure in high-dimensional data and organizing it accordingly on a two-dimensional output space, some shortcomings have to be mentioned. These include its inability to capture the inherent hierarchical structure of data. Furthermore, the size of the map has to be determined in advance, ignoring the characteristics of an (unknown) data distribution. These drawbacks have been addressed separately in several modified architectures of the SOM, as outlined in Section 3. However, none of these approaches provides an architecture which fully adapts itself to the characteristics of the input data.

To overcome the limitations of both fixed-sized and non-hierarchically adaptive architectures we developed the GHSOM, which dynamically fits its multi-layered architecture according to the structure of the data. The GHSOM has a hierarchical structure of multiple layers, where each layer consists of several independent growing self-organizing maps. Starting from a top-level map, each map, similar to the growing grid model [3], grows in size in order to represent a collection of data at a particular level of detail. As soon as a certain improvement of the granularity of data representation is reached, the units are analyzed to see whether they represent the data at a specific minimum level of granularity. Those units that have too diverse input data mapped onto them are expanded to form a new small SOM at a subsequent layer, where the respective data shall be represented in more detail. The growth process of these new maps continues again in a growing-grid-like fashion. Units representing an already rather homogeneous set of data, on the other hand, will not require any further expansion at subsequent layers. The resulting GHSOM thus is fully adaptive and reflects, by its very architecture, the hierarchical structure inherent in the data, allocating more space for the representation of inhomogeneous areas in the input space.
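As a rough illustration of the layered architecture just described, the sketch below models a GHSOM as a tree of growing maps, each map keeping a reference to the upper-layer unit it refines. The class and field names are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
import numpy as np

@dataclass
class GrowingMap:
    """One SOM in the GHSOM hierarchy (illustrative sketch, not the authors' code)."""
    rows: int = 2                             # new maps start small, e.g. 2 x 2
    cols: int = 2
    weights: Optional[np.ndarray] = None      # (rows * cols, dim) weight vectors m_i
    parent_unit: Optional[int] = None         # upper-layer unit this map refines
    children: Dict[int, "GrowingMap"] = field(default_factory=dict)

    def expand_unit(self, unit_idx: int) -> "GrowingMap":
        """Attach a new small map below a unit whose data is still too diverse."""
        child = GrowingMap(parent_unit=unit_idx)
        self.children[unit_idx] = child
        return child

# Layer 0 is a single "unit" holding the mean of all input vectors; layer 1
# starts as a small map beneath it and grows and expands from there.
layer1 = GrowingMap()
sub_map = layer1.expand_unit(3)   # e.g. unit 3 of the layer 1 map gets its own child map
```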
Fig. 2. GHSOM architecture: the GHSOM evolves to a structure of SOMs reflecting the hierarchical structure of the input data.

A graphical representation of a GHSOM is given in Fig. 2. The map in layer 1 consists of 2 × 2 units and provides a rather rough organization of the main clusters in the input data. The three independent maps in the second layer offer a more detailed view on the data. The input data for one map is the subset which has been mapped onto the corresponding unit in the upper layer. One unit in the first-layer map has not been expanded into a map in the second layer because the data representation quality was already accurate enough. It has to be noted that the maps have different sizes according to the structure of the data, which relieves us from the burden of predefining the structure of the architecture. The layer 0 map is necessary for the control of the growth process and will be explained later in Section 4.2.

4.2 Training algorithm

4.2.1 Initial setup

Prior to the training process a "map" in layer 0 consisting of only one unit is created. This unit's weight vector m_0 is initialized as the mean of all input vectors and its mean quantization error mqe_0 is computed.

Basically, the mean quantization error mqe_i of a unit i is the deviation between its weight vector and the input vectors mapped onto this very unit. It is calculated as the mean Euclidean distance between its weight vector m_i and the input vectors x_j which are elements of the set of input vectors C_i that are mapped onto this unit i, as given in Eq. (4), with |·| denoting the cardinality of a set:

$mqe_i = \frac{1}{n_{C_i}} \cdot \sum_{x_j \in C_i} \|m_i - x_j\|, \quad n_{C_i} = |C_i|.$    (4)

Specifically, the mean quantization error of the single unit at layer 0 is computed as detailed in Expression (5), where n_I is the number of all input vectors x of the input data set I:

$mqe_0 = \frac{1}{n_I} \cdot \sum_{x_j \in I} \|m_0 - x_j\|, \quad n_I = |I|.$    (5)

The value of mqe_0 can be regarded as a measurement of the overall dissimilarity of the input data. This measure will play a critical role during the growth process of the neural network, as will be described later.

4.2.2 Training and growth process of a map

Beneath the layer 0 map a new SOM is created with a size of initially 2 × 2 units. This first-layer map is trained according to the standard SOM training procedure described in Section 2. After a fixed number of training iterations the mqe's of all units are analyzed. A high mqe shows that for this particular unit the input space is not represented accurately enough. Therefore, new units are needed to increase the quality of input space representation. The unit with the highest mqe is thus selected as the error unit e. A new row or column of units is inserted in between this error unit and its most dissimilar neighbor. The weight vectors of the new units are initialized as the average of their corresponding neighbors.

More formally, the growth process of a map can be described as follows. Let C_i be the subset of vectors x_j of the input data that is mapped onto unit i, i.e. C_i ⊆ I, and m_i the weight vector of unit i. Then, the error unit e is determined as the unit with the maximum mean quantization error:

$e = \arg\max_i \left\{ \frac{1}{n_{C_i}} \cdot \sum_{x_j \in C_i} \|m_i - x_j\| \right\}, \quad n_{C_i} = |C_i|.$    (6)

The selection of its most dissimilar neighbor d is determined by the maximum distance between the weight vector of unit e and the weight vectors of the neighboring units. A complete row or column of units is inserted in between d and e.
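The sketch below illustrates this growth step: the unit with the highest mean quantization error is chosen as error unit e, its most dissimilar direct neighbor d is found, and a full row or column of units is inserted between them, initialized as the average of the corresponding neighboring weight vectors. The 2-D array layout, the dictionary of mapped vectors, and the helper names are assumptions for illustration only.

```python
import numpy as np

def mqe(weights, mapped):
    """Per-unit mean quantization errors mqe_i (Eq. (4)); NaN for units with no data."""
    rows, cols, _ = weights.shape
    out = np.full((rows, cols), np.nan)
    for (r, c), vectors in mapped.items():
        if len(vectors):
            out[r, c] = np.mean(np.linalg.norm(np.asarray(vectors) - weights[r, c], axis=1))
    return out

def grow_map(weights, mapped):
    """Insert a row or column between error unit e and its most dissimilar neighbor d.

    weights : (rows, cols, dim) weight vectors of one map
    mapped  : {(row, col): list of input vectors mapped onto that unit}
    """
    errors = mqe(weights, mapped)
    e = np.unravel_index(np.nanargmax(errors), errors.shape)      # Eq. (6)

    # Most dissimilar direct neighbor d of e (largest weight-vector distance)
    rows, cols, _ = weights.shape
    neighbors = [(e[0] + dr, e[1] + dc)
                 for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= e[0] + dr < rows and 0 <= e[1] + dc < cols]
    d = max(neighbors, key=lambda n: np.linalg.norm(weights[e] - weights[n]))

    axis = 0 if d[0] != e[0] else 1           # differing row index -> insert a row
    pos = min(e[axis], d[axis]) + 1           # position between e and d
    before = np.take(weights, pos - 1, axis=axis)
    after = np.take(weights, pos, axis=axis)
    new_line = (before + after) / 2.0         # new units = average of their neighbors
    return np.insert(weights, pos, new_line, axis=axis)
```

In a full implementation the enlarged map would then be retrained for a fixed number of iterations before the quantization errors are evaluated again.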
Fig. 3 shows a graphical representation of the insertion process of our realization of a growing SOM, with the newly inserted units being depicted as shaded circles. The arrows point to the respective neighboring units used for weight vector initialization.

Fig. 3. Insertion of units: a row (a) or a column (b) of units (shaded gray) is inserted in between error unit e and that neighboring unit d with the largest distance between its weight vector and the weight vector of e in the Euclidean space.

The growth process continues until the map's mean quantization error, referred to as MQE in capital letters, reaches a certain fraction of the mqe_u of the corresponding unit in the upper layer, i.e. the unit constituting the layer 0 map in the case of the first-layer map. The MQE of a map is computed as the mean of all units' mean quantization errors mqe_i (cf. Expression (4)) over the subset U of the map's units onto which data is mapped:

$MQE_m = \frac{1}{n_U} \cdot \sum_{i \in U} mqe_i, \quad n_U = |U|.$    (7)

In general terms, the stopping criterion for the training of a single map m is defined as

$MQE_m < \tau_1 \cdot mqe_u,$    (8)

where mqe_u is the mean quantization error of the corresponding unit u in the upper layer. Obviously, the smaller the parameter τ_1 is chosen, the longer the training will last and the larger the resulting map will be. In the case of the first-layer map the stopping criterion for the training process is MQE_1 < τ_1 · mqe_0. The parameter τ_1 thus serves as the control parameter for the final size of each map by defining the degree to which each map has to represent, in greater detail, the information mapped onto the unit it is based upon.

4.2.3 Reflection of the hierarchical structure

When the training of the map is finished, every unit has to be checked for expansion, i.e. whether it is to be further refined in a map on the next layer. This means that for units representing a set of too diverse input vectors a new map in the next layer will be created. The threshold for this expansion decision is determined by a second parameter τ_2, which defines the data representation granularity requirement that has to be met by every unit. It thus constitutes the global stopping criterion by defining a minimum quality of data representation required for all units as a fraction of the dissimilarity of all input data described by mqe_0. If Expression (9) is false for unit i, i.e. mqe_i is greater than or equal to τ_2 · mqe_0, then a new small map in the next layer will be created, whereas if the stopping criterion given in Expression (9) holds true for a given unit, no further expansion is required. Please note that, unlike the stopping criterion for horizontal map growth determined by τ_1 and the mqe of the corresponding upper-layer unit, this second criterion is based solely on mqe_0, i.e. the mqe of layer 0, for every unit on all maps:

$mqe_i < \tau_2 \cdot mqe_0.$    (9)

The input vectors used to train the newly added map are the ones mapped onto the unit which has just been expanded. This map will again continue to grow following the procedures detailed in the previous subsection. The whole process is repeated for the subsequent layers until the criterion given in Expression (9) is met by all units in the lowest layers. The parameter τ_2 thus defines the minimum quality of representation for all units in the lowest layer of each branch. This guarantees that the quality of data representation fulfills a minimum criterion for all parts of the input space, with the GHSOM automatically providing the required number of units in the respective areas.
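To summarize how the two criteria interact, the sketch below evaluates Expressions (7)–(9) for one trained map: it decides whether the map should keep growing horizontally and which of its units should be expanded into a new map on the next layer. The function name, the example values and the default settings of tau_1 and tau_2 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def growth_and_expansion(errors, mqe_u, mqe_0, tau_1=0.6, tau_2=0.03):
    """Apply the GHSOM stopping criteria to one trained map (illustrative sketch).

    errors : 2-D array of per-unit mean quantization errors mqe_i (Eq. (4)),
             with NaN for units onto which no data is mapped
    mqe_u  : mean quantization error of the upper-layer unit this map refines
    mqe_0  : overall dissimilarity of the data, i.e. the layer 0 unit (Eq. (5))
    """
    # Eq. (7): map quality MQE_m, averaged over the subset U of occupied units
    MQE_m = float(np.nanmean(errors))

    # Eq. (8): keep inserting rows/columns while MQE_m >= tau_1 * mqe_u
    keep_growing = not (MQE_m < tau_1 * mqe_u)

    # Eq. (9): occupied units with mqe_i >= tau_2 * mqe_0 get a child map
    above = np.nan_to_num(errors, nan=-np.inf) >= tau_2 * mqe_0
    expand = [tuple(int(v) for v in idx) for idx in np.argwhere(above)]
    return keep_growing, expand

# Usage on a hypothetical 2 x 3 map
errors = np.array([[0.9, 0.2, np.nan],
                   [0.5, 1.4, 0.3]])
print(growth_and_expansion(errors, mqe_u=1.0, mqe_0=2.0))
```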
4.2.4 Some thoughts on map size, hierarchy depth, and granularity of data representation

It should be noted that the training process does not necessarily lead to a balanced hierarchy in terms of all branches having the same depth. This is one of the main advantages of the GHSOM, because the structure of the hierarchy adapts itself to the requirements of the input space. Therefore, areas in the input space that require more neural computation units for appropriate data representation create deeper branches than others. The growth process of the GHSOM is mainly guided by the two parameters τ_1 and τ_2, which merit further consideration.

• τ_2: This parameter controls the minimum granularity of data representation, i.e. no unit may represent data at a coarser granularity. If the data mapped onto one single unit still has a larger variation, either (a) a new row or column will be added to the same map during the horizontal growth process to make more units available for representing the data, or (b) a new map will be added originating from this unit, representing this unit's data in more detail at a subsequent layer. This absolute granularity of data representation will usually be specified as a fraction of the inherent dissimilarity of the data collection as such, which is expressed in the mean quantization error of the single unit in layer 0 representing all data points. In principle, we could also choose an absolute value as a minimal quality criterion. However, we feel that since with most datasets it is difficult to estimate information on their distribution and value ranges without thorough analysis, a percentage-based, data-driven threshold is more convenient. Furthermore, if we decided after the termination of the training process that a yet more detailed representation would be desirable, it is basically possible to resume the training process from the respective lower-level maps, continuing both to grow them horizontally and to add new lower-level maps until a stricter quality criterion is satisfied. This parameter thus represents a global termination and quality criterion for the GHSOM.

• τ_1: This parameter controls the actual growth process of the GHSOM. Basically, hierarchical data can be represented in different ways, favoring either (a) lower hierarchies with rather detailed refinements presented at each subsequent layer, or (b) deeper hierarchies, which provide a stricter separation of the various sub-clusters by assigning separate maps. In the first case we will prefer larger maps in each layer, which explain larger portions of the data in their flat representation, allowing less hierarchical structuring. As an extreme example we might consider a single SOM trying to explain the complete structure of the data in one single flat map, ignoring all hierarchical information or rather trying to preserve it in the mapping of the various clusters on the flat structure. In the second case, however, we will usually prefer rather small maps, each of which describes only a small portion of the characteristics of the data, and rather emphasize the detection and representation of hierarchical structure. Basically, the total number of units at the lowest-level maps may be expected to be similar in both cases, as this is the number of neural processing units necessary for representing the data at the required level of granularity. Thus, the smaller the parameter τ_1, the larger the degree to which the data has to be explained
at one single map will be. This results in larger maps, as the map's mean quantization error (MQE) will be lower the more units are available for representing the data. If τ_1 is set to a rather high value, the MQE does not need to fall too far below the mqe of the upper layer's unit it is based upon. Thus a smaller map will satisfy the stopping criterion for the horizontal growth process, requiring the more detailed representation of the data to be performed in subsequent layers.

In a nutshell we can say that the smaller the parameter value τ_1, the flatter the hierarchy, and that the lower the setting of parameter τ_2, the larger the number of units in the resulting GHSOM network will be.

5 A hierarchical newspaper archive

5.1 Document representation

In the experiments presented hereafter we use an article collection of the daily Austrian newspaper Der Standard covering the second quarter of 1999. In a first step, the documents have to be mapped onto some representation language in order to enable further analysis. This process is termed indexing in the information retrieval literature. A number of different strategies have been suggested over the years of information retrieval research. Still one of the most common representation techniques is single-term full-text indexing, where the text of the documents is accessed and the various words forming the document are extracted. These words may be reduced to their (often just approximate) word stem, yielding the so-called terms used to represent the documents. The resulting set of terms is usually cleared of so-called stop-words, i.e. words that appear either too often or too rarely within the document collection and thus have only little influence on discriminating between different documents. These would just unnecessarily increase the computational load during classification.

In the vector-space model of information retrieval the documents contained in a collection are represented by means of feature vectors x of the form $x = [\xi_1, \xi_2, \ldots, \xi_n]^T$. In such a representation, the $\xi_l$, $1 \le l \le n$, correspond to the index terms extracted from the documents as described above. The specific value of $\xi_l$ corresponds to the importance of index term l in describing the particular document at hand. One may find a number of strategies for prescribing the importance of an index term for a particular document [24]. Without loss of generality, we may assume that this importance is represented as a scalar in the range [0, 1], where zero means that this particular index term is absolutely unimportant for describing the document. Any deviation from zero towards one is proportional to the increased importance of the index term at hand. In such a vector-space model, the similarity between two documents corresponds to the distance between their vector representations [26].

The indexing process for the articles from the Der Standard collection identified 1104 content terms, i.e. terms used for document representation, by omitting words that appear in more than 60% or less than 2% of the documents. The terms are roughly stemmed and weighted according to a tf × idf weighting scheme [23], i.e. term frequency times inverse document frequency. Such a weighting scheme favors terms that appear frequently within a document yet rarely within the document collection. The 11,627 vectors representing the documents are further used for neural network training.
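A minimal sketch of the indexing scheme described above (vocabulary pruning by document frequency, followed by tf × idf weighting) is given below. The exact stemming procedure and weighting variant used for the Der Standard collection are not detailed in the paper, so the formula tf * log(N/df) and the helper names are assumptions for illustration.

```python
import math
from collections import Counter

def build_tfidf_vectors(docs, max_df=0.60, min_df=0.02):
    """Index a list of token lists into tf x idf vectors (illustrative sketch)."""
    n_docs = len(docs)
    # Document frequency of each term
    df = Counter(term for doc in docs for term in set(doc))
    # Keep terms appearing in at most max_df and at least min_df of the documents
    vocab = sorted(t for t, d in df.items() if min_df <= d / n_docs <= max_df)
    index = {t: i for i, t in enumerate(vocab)}

    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in index)
        vec = [0.0] * len(vocab)
        for term, freq in tf.items():
            # tf x idf: frequent within the document, rare within the collection
            vec[index[term]] = freq * math.log(n_docs / df[term])
        vectors.append(vec)
    return vectors, vocab

# Usage: docs is a list of (stemmed) token lists, one per article; the thresholds
# are relaxed here only because the toy collection is tiny.
docs = [["kosovo", "nato", "refugee"], ["eu", "commission", "austria"], ["kosovo", "eu"]]
vectors, vocab = build_tfidf_vectors(docs, max_df=0.9, min_df=0.1)
```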
5.2 A guided tour through the archive

Although no general objective evaluation of clustering results is possible, numerous approaches to cluster validity analysis exist [4]. Basically, we can distinguish between external assessment, internal examination, and relative tests comparing two clustering results. For the experiments presented below, we start with a general overview of the hierarchical cluster structure. Since it is impossible to present the complete topic hierarchy of three months of news articles, we will further concentrate on some sample topical sections, providing an internal examination of cluster validity.

The top-level map of the GHSOM, which evolved to a size of 3 × 4 units during training by adding two rows and one column, is depicted in Fig. 4. This map represents the organization of the main subject areas, such as the war in the Balkans on unit (1/1), Austrian politics (2/1), other national affairs (2/2), the economy (3/1) and the European Union on unit (3/2). Short headline articles have been mapped onto unit (1/3). Further, we can find computers and Internet on unit (2/3), and articles on crime and police on unit (3/3). Cultural topics are located on the three units in the lower left corner, representing sports, theater and fashion, respectively. Personal comments are located on unit (3/4) in the bottom right corner of the map. We refer to a unit located in column x and row y as unit (x/y), starting with (1/1) in the upper left corner.

Fig. 4. Top-layer map: 3 × 4 units; organization of main subject areas.

Fig. 5. Second and third-layer maps: the map on the right-hand side (Fig. 5(b)) corresponds to unit (2/2) of the second-layer map (Fig. 5(a)). The articles are mapped onto this third-layer map.

If we want to take a closer look at the Balkan War subject, we discover a more detailed map of size 3 × 3 in the next layer (see Fig. 5(a)), where every unit represents a specific sub-topic of this subject. The units in the top row deal with the general political and social situation in Kosovo, while the main focus of the documents mapped onto the three units (3/2), (2/3), and (3/3) in the lower right corner is Slobodan Milosevic. On unit (1/3) articles about the situation of the refugees can be found, whereas unit (2/2) concentrates on the Russian involvement in the Kosovo War.

Fig. 6. Second and third-layer maps: the map on the right-hand side (Fig. 6(b)) corresponds to unit (2/2) of the second-layer map (Fig. 6(a)).

If this Russian involvement is our major interest, again, a more detailed map (see Fig. 5(b)) in the third layer exhibits the most granular view on this sub-topic. The dominant topic of this map is the Russian participation in the Kosovo War. The three units in the upper left corner cover G-8 foreign minister meetings, where the documents on unit (1/1) are especially about the establishment of a safe corridor in Kosovo. Articles located on units (1/2) and (2/1) discuss other aspects of these meetings. On units (2/2) and (1/3) we can find articles about reactions of the German foreign minister Fischer and the Austrian chancellor Klima, respectively. European comments have been mapped onto unit (2/3) and letters to the editor onto unit (3/3).

The European Union topic mapped onto unit (3/2) in the first-layer map has been expanded to a second-level map depicted in Fig. 6(a). It has been organized into sub-topics like EU–USA connections on unit (1/1), the European Commission on units (2/1) and (2/2), or a large cluster of articles about the relationship between Austria and the EU in the middle of the map. Unit (2/4) concentrates
on economic aspects of the Austrian membership in the European Union, and unit (2/5) represents documents covering Italian affairs.

The most granular view with regard to the setting of τ_2 as described in Section 4.2, depicted in Fig. 6(b), exhibits the ordering of documents mapped onto unit (2/2) in the corresponding second-layer map described before. In the upper half we find articles about a scandal in Belgium, where chicken meat was contaminated by large amounts of dioxin, and the subsequent discussions about a ban on meat exports. In the bottom row of the map, documents concentrating on European Union subsidies and discussions about the distribution of this funding between different Austrian regions are located. In between, on unit (2/3), reports on genetically modified food and arguments about sales stops of Belgian goods in Austrian supermarkets can be found.

We thus find the layered architecture of the GHSOM to faithfully reflect the hierarchy of topics found in the document collection. Articles on one common subject are to be found within the same branch of maps. Each map in turn provides a more detailed representation of the topics subsumed by the unit it is based upon. At the lowest layer of each branch of the GHSOM we find the actual articles, again organized according to their similarity. The granularity by which topics are identified and articles thus clustered at the lowest layer is determined by parameter τ_2 as a fraction of the overall dissimilarity of the complete data set. Should we find the representation provided by a GHSOM to be too coarse, we could resume training of the GHSOM architecture by setting a lower value for τ_2. Starting with the units residing in the lowest-layer maps and identifying those units that require expansion, we can add additional layers to provide further, more detailed representations of the articles.

With the resulting maps at all layers of the hierarchy being rather small, orientation for the user is highly improved as compared to rather huge maps, which cannot be easily comprehended as a whole. By topically structuring a collection, a user is guided through the various topical sections, allowing him or her to build a mental model of a document archive and facilitating convenient exploration of unknown collections. Yet the user is relieved from having to define the actual architecture of maps in advance. Rather, only an indication of the desired final granularity of data representation is required, with another parameter, τ_1, controlling the preference towards more structured, deeper hierarchies consisting of smaller maps as opposed to flatter hierarchies with more details to be conveyed by rather larger maps.

6 Conclusions

We have presented the GHSOM, a neural network based on the SOM, a model that has proven to be effective for cluster analysis of very high-dimensional feature spaces. Its main benefits are due to the model's capabilities (a) to determine the number of neural processing units required in order to represent the data at a desired level of detail and (b) to create a network architecture reflecting the hierarchical structure of the data. The resulting benefits are numerous: first, the processing time is largely reduced by training only the necessary number of units for a certain degree of detail representation. Second, the GHSOM by its very architecture resembles the hierarchical structure of the data, allowing the user to understand and analyze large amounts
of data in an explorative way. Third, with the various emergent maps at each level in the hierarchy being rather small, it is easier for the user to keep an overview of the various clusters identified in the data and to build a cognitive model of it in a very high-dimensional feature space. We have demonstrated the capabilities of this approach with an application from the information retrieval domain, where text documents, which are located in a high-dimensional feature space spanned by the words in the documents, are clustered by their mutual similarity and where the hierarchical structure of these documents is reflected in the resulting network architecture.

References

[1] J. Blackmore, R. Miikkulainen, Incremental grid growing: encoding high-dimensional structure into a two-dimensional feature map, in: Proceedings of the IEEE International Conference on Neural Networks (ICNN'93), Vol. 1, San Francisco, CA, USA, 1993, pp. 450–455.
[2] G. DeBoeck, T. Kohonen (Eds.), Visual Explorations in Finance, Springer, Berlin, Germany, 1998.
[3] B. Fritzke, Growing Grid - a self-organizing network with constant neighborhood range and adaptation strength, Neural Process. Lett. (5) (1995) 1–5.
[4] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (3) (September 1999) 264–323.
[5] S. Kaski, J. Kangas, T. Kohonen, Bibliography of self-organizing map (SOM) papers 1981–1997, Neural Comput. Surv. (3&4) (1998) 1–176.
[6] T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybernet. 43 (1982) 59–69.
[7] T. Kohonen, Self-Organization and Associative Memory, 3rd Edition, Springer, Berlin, Germany, 1989.
[8] T. Kohonen, Self-Organizing Maps, Springer, Berlin, 1995.
[9] T. Kohonen, Self-organization of very large document collections: state of the art, in: Proceedings of the International Conference on Artificial Neural Networks, Skövde, Sweden, 1998, pp. 65–74.
[10] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, A. Saarela, Self-organization of a massive document collection, IEEE Trans. Neural Networks 11 (3) (2000) 574–585.
[11] P. Koikkalainen, Fast deterministic self-organizing maps, in: Proceedings of the International Conference on Neural Networks, Vol. 2, Paris, France, 1995, pp. 63–68.
[12] P. Koikkalainen, E. Oja, Self-organizing hierarchical feature maps, in: Proceedings of the International Joint Conference on Neural Networks, Vol. 2, San Diego, CA, 1990, pp. 279–284.
[13] X. Lin, A self-organizing semantic map for information retrieval, in: Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR91), ACM, Chicago, IL, October 13–16, 1991, pp. 262–269.
[14] D. Merkl, Text classification with self-organizing maps: some lessons learned, Neurocomputing 21 (1–3) (1998) 61–77.
[15] D. Merkl, Text data mining, in: Handbook of Natural Language Processing: Techniques and Applications for the Processing of Language as Text, Marcel Dekker, New York, 2000, pp. 889–903.
[16] D. Merkl, A. Rauber, Alternative ways for cluster visualization in self-organizing maps, in: T. Kohonen (Ed.), Proceedings of the Workshop on Self-Organizing Maps (WSOM97), Helsinki University of Technology, HUT, Espoo, Finland, June 1997, pp. 106–111.
[17] D. Merkl, A. Rauber, Uncovering associations between documents, in: R. Feldman (Ed.), Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI99) Workshop on Text Mining, Stockholm, Sweden, July 31–August 6, 1999, pp. 89–98.
[18] R. Miikkulainen, Script recognition with hierarchical feature maps, Connection Sci. (1990) 83–101.
[19] A. Rauber, D. Merkl, Automatic labeling of self-organizing maps: making a treasure map reveal its secrets, in: N. Zhong, L. Zhou (Eds.), Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD99), number LNCS/LNAI 1574 in Lecture Notes in Artificial Intelligence, Springer, Beijing, China, April 26–29, 1999, pp. 228–237.
[20] A. Rauber, D. Merkl, The SOMLib digital library system, in: S. Abiteboul, A.M. Vercoustre (Eds.), Proceedings of the Third European Conference on Research and Advanced Technology for Digital Libraries (ECDL99), number LNCS 1696 in Lecture Notes in Computer Science, Springer, Paris, France, September 22–24, 1999, pp. 323–342.
[21] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1996.
[22] D. Roussinov, M. Ramsey, Information forage through adaptive visualization, in: Proceedings of the ACM Conference on Digital Libraries 98 (DL98), Pittsburgh, PA, USA, 1998, pp. 303–304.
[23] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, MA, 1989.
[24] G. Salton, C. Buckley, Term weighting approaches in automatic text retrieval, Inform. Process. Manage. 24 (5) (1988) 513–523.
[25] O. Simula, P. Vasara, J. Vesanto, R.R. Helminen, The self-organizing map in industry analysis, in: L.C. Jain, V.R. Vemuri (Eds.), Industrial Applications of Neural Networks, CRC Press, Washington, DC, 1999.
[26] H.R. Turtle, W.B. Croft, A comparison of text retrieval models, Comput. J. 35 (3) (1992) 279–290.
[27] A. Ultsch, Self-organizing neural networks for visualization and classification, in: O. Opitz, B. Lausen, R. Klar (Eds.), Information and Classification: Concepts, Methods and Applications, Studies in Classification, Data Analysis, and Knowledge Organization, Springer, Dortmund, Germany, April 1–3, 1992, pp. 307–313.

Michael Dittenbach received his diploma in computer science from the Vienna University of Technology. He is a Junior Researcher at the Department of Software Technology at the Vienna University of Technology. He is also a Research Assistant in the Adaptive Multilingual Interfaces group at the Electronic Commerce Competence Center in Vienna. He has published several papers in international conferences. His current research interests include text mining, neural computation and natural language interfaces.

Andreas Rauber received his MSc and PhD in Computer Science at the Vienna University of Technology in 1997 and 2000, respectively. From 1997 to 2001 he was a member of the academic faculty at the Department of Software Technology at the Vienna University of Technology. He is currently an ERCIM Research Fellow at the Italian National Research Council (CNR) in Pisa. He received the OeGAI award of the Austrian Society for Artificial Intelligence in 1998. He has published over 30 papers in refereed journals and international conferences. His current research interests include digital libraries, neural computation, and information visualization.

Dieter Merkl received his diploma and doctoral degree in social and economic sciences from the University of Vienna, Austria, in 1989 and 1995, respectively. During 1997 he was a post-graduate research fellow at the Department of Computer Science, Royal Melbourne Institute of Technology, Australia. He holds a position as an Associate Professor at the Institute of Software Technology, Vienna
University of Technology, where he serves as vice-chairman. Apart from that, he is involved in the Softworld project, a joint programme of the European Community and Canada for co-operation in higher education and training. Dieter Merkl currently considers digital libraries, e-learning, and neurocomputing as his major research interests, with software engineering being in queue.