INTRODUCTION
Information Retrieval (IR) is a dynamic field with a rich history, evolving alongside the growth of vast corpora, including extensive collections of web pages and scientific papers. This evolution raises challenging questions that captivate researchers, notably the ongoing problem of automating document indexing within various databases. Another critical question is how to effectively retrieve the documents most relevant, in a semantic sense, to user queries issued over the Internet or over specific corpora.
In Information Retrieval, finding and ranking documents are crucial tasks, typically supported by tools like Google and Yahoo. However, most of these tools rely on word matching rather than semantic matching, which is complicated by the intricacies of semantics. Despite these challenges, the potential applications of semantic finding and ranking are significant, paving the way for future web service technologies such as semantic searching, semantic advertising, academic recommendations, and intelligent control systems.
Semantics plays a vital role in both the Information Retrieval (IR) and Artificial Intelligence (AI) communities, particularly in knowledge representation. Effectively capturing and representing natural knowledge from our environment is essential for the seamless reuse and integration of new information. A robust knowledge database relies heavily on semantics, as each word carries its own meanings and has specific semantic relationships with other words. Given that words can have multiple senses and serve different functions in various contexts, accurately representing knowledge becomes a complex and ongoing challenge.
Knowledge representation also plays a crucial role in artificial intelligence, particularly in robotics. For instance, consider the development of an intelligent robot designed to classify various types of rubbish. To achieve this, it is essential to efficiently represent the information that describes different kinds of waste, enabling the robot to quickly identify and categorize items as reusable or not. Additionally, the robot must distinguish between rubbish and other objects in close proximity. Given the vast amount of information required for accurate classification, proper organization is vital; otherwise, the robot's performance may suffer, hindering its ability to learn from its environment. This example underscores the significance of knowledge representation in enhancing the capabilities of intelligent systems.
Various methods for knowledge representation have been explored, including high-dimensional semantic spaces in which words are represented as vectors. Probabilistic topic models also play a role by describing the latent structure of words through topics. Additionally, semantic networks represent knowledge by organizing words into nodes and connecting related pairs with edges. For further insights into these and other techniques, refer to the surveys in the literature.
Figure 1.1 Some approaches to representing knowledge
(a) Semantic network, (b) Semantic space, (c) Topic models
Automatically discovering and interpreting information from conversations or documents poses significant challenges in AI. These tasks are essential for effectively finding and ranking gathered information. Consequently, extensive research has been conducted to develop efficient methods or to adapt existing ones for specific applications. For instance, a quick search using Google reveals how much attention these research efforts have attracted: the work of Deerwester et al. has received more than 4200 citations, the work of Blei et al. in [11] more than 1200 citations, and the work of Landauer and Dumais in [39] more than 1800 citations.
Topic modeling is a powerful approach for uncovering latent structures within documents or collections of documents, enabling clear interpretation. It contributes significantly to information retrieval (IR) by providing various methods to extract the essence of a document, a conversation, or a document collection. Notable topic models include Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA) [11], Hierarchical Latent Dirichlet Allocation (hLDA) [8], and CorrLDA2 [53]. Owing to their ability to uncover latent structures (e.g. topics), topic models have been successfully applied to automatically index the documents of a given corpus [20], [11], [72], to find topical communities from collections of scientific papers [45], to support the spam filtering task [7], and to reveal the development of Science over the years.
Other applications in the research community use text corpora to uncover trending topics and to distinguish function words from content words, identifying distinct groups of words and their roles, and even providing statistical insights into the inference processes of human memory. Additional compelling applications can be found in the references.
Motivated by these many promising applications of Topic Modeling, this thesis surveys the modern advancements in the area, acknowledging the rapid growth of research on the subject. Rather than attempting to cover every aspect, it focuses on the most compelling features and the key directions for the development of new topic models. The thesis evaluates the advantages and disadvantages of each model discussed, and explores potential extensions for some of them. Additionally, it highlights significant applications in artificial intelligence and reports the author's experiments on a collection of papers from the NIPS conferences up to volume 12, as well as on reports from VnExpress, a Vietnamese electronic newspaper.
Chapter 2 provides a comprehensive overview of recent advancements in Topic Modeling, offering both a broad perspective and specific insights into the field. The following chapters delve deeper into representative topic models, highlighting their advantages, disadvantages, and potential extensions. Chapter 5 discusses intriguing applications of topic modeling and reports the author's experiments on various corpora.
2 Advances in Neural Information Processing Systems (NIPS): http://books.nips.cc/
MODERN PROGRESS IN TOPIC MODELING
Linear algebra based models
Latent Semantic Analysis (LSA) is a pioneering technique for extracting topics from a text corpus. The method operates on the premise that each word can be represented as a point in a high-dimensional semantic space. To capture the essence of a document, LSA employs the bag-of-words assumption, which posits that the order of words can be disregarded.
Under the bag-of-words assumption, LSA projects words into a low-dimensional semantic space by means of Singular Value Decomposition (SVD). To identify the essence of a document, it first maps the document to a point in this space and then gathers nearby points using a similarity measure, such as the cosine or the inner product. The words linked to these nearby points define the document's topic. Further mathematical details are discussed in Chapter 3.
LSA is a dimensionality reduction technique that manages the complexity of large document corpora by using a semantic space whose dimension is much smaller than the total number of words or documents. It is a linear method: words and documents are projected into the semantic space through linear transformations. This characteristic distinguishes LSA in the realm of text analysis and processing.
Despite its simplicity, LSA has proven effective in various applications. Research by Landauer and Dumais shows that LSA's ability to judge similarity is comparable to that of foreign students, making it a valuable tool for knowledge representation and for developing new induction methods. León et al. demonstrated LSA's capability to accurately grade concise text summaries, even those as short as 50 words. Additionally, LSA has been used to derive predication methods, to assess word coherence, and to create reliable document indexing techniques. For further insights, refer to the surveys in [5], [40], and [46].
Besides LSA with Singular Value Decomposition (SVD), various other techniques exist for extracting topics from a corpus. Notably, the method introduced by Michael et al. employs QR factorization to derive representations of words and documents in semantic spaces. Additional variants of LSA are also discussed in the literature.
Statistical topic models
A promising approach to the central challenge of Topic Modeling is to leverage statistical tools to build and evaluate topic models. While linear algebra based methods have been successful, they often lack a robust theoretical framework that justifies their effectiveness. This gap has driven researchers to seek topic models that not only perform well but also rest on a solid theoretical foundation.
Hofmann's groundbreaking study introduced Probabilistic Latent Semantic Analysis (pLSA), also known as the Aspect Model, marking a significant advancement in Topic Modeling. pLSA is a generative model that regards the words of a corpus as being generated from a probability distribution, which enables it to surpass traditional methods such as LSA in various real-world applications.
Since Hofmann's work, numerous topic models have been introduced, as shown in Table 2.1. Most of these models are fully generative and probabilistic, working at the level of documents. Some models keep the number of topics in a document or corpus fixed, while others allow it to vary. Additionally, certain models adopt the bag-of-words assumption, while others do not. Each model has its own assumptions and intriguing characteristics. The current advances in Topic Modeling are illustrated in Figures 2.2, 2.3, and 2.4.
2.2.1 Bag-of-words versus non-Bag-of-words assumption
In various practical applications such as document indexing and clustering, the order of words within a document, or the arrangement of documents in a collection, is often insignificant. Consequently, this order can be disregarded when analyzing documents or corpora. Numerous topic models operate under this principle, including Latent Semantic Analysis (LSA), probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA), the Correlated Topic Model (CTM) [10], Dirichlet Enhanced LSA (DELSA) [74], and Discriminative LDA (DiscLDA).
In these models, the input documents are characterized by a vocabulary of unique words, together with a frequency matrix that records the number of occurrences of each word in each document, and by the total number of documents analyzed.
In Natural Language Processing, by contrast, the order of words is crucial for accurately predicting meanings within context. Previous words in a sentence significantly influence the interpretation of subsequent words, particularly in structured formats like scientific papers, where earlier sentences shape the logic of later ones. Therefore, it is essential to consider grammatical structures when developing topic models for such NLP applications. Notable topic models that address these considerations include the Syntactic Topic Model (STM), the Bigram Topic Model (BTM), and the Hidden Topic Markov Model (HTMM).
Table 2.1 Some selected Probabilistic topic models.
(The first column lists the abbreviation of each model, the second gives its full name, and the last shows where the model can be found.)
Abbreviation Name of Model Citation
ART           Author-Recipient-Topic Model                           [44]
BTM           Bigram Topic Model                                     [66]
cDTM          Continuous Dynamic Topic Model                         [67]
CTM           Correlated Topic Model                                 [10]
dDTM          Discrete Dynamic Topic Model                           [13]
HDP-RE        Hierarchical Dirichlet Processes with random effects   [34]
hLDA          Hierarchical Latent Dirichlet Allocation               [8]
HMM-LDA       Hidden Markov Model LDA                                [15]
HTMM          Hidden Topic Markov Model                              [28]
IG-LDA        Incremental Gibbs LDA                                  [16]
MBTM          Memory Bounded Topic Model                             [24]
NetSTM        Network Regularized Statistical Topic Model            [45]
PF-LDA        Particle Filter LDA                                    [16]
PLSV          Probabilistic Latent Semantic Visualization            [32]
sLDA          Supervised Latent Dirichlet Allocation                 [14]
Spatial LDA   Spatial Latent Dirichlet Allocation                    [69]
Figure 2.2 illustrates the two pathways for developing probabilistic topic models with respect to the bag-of-words assumption. It is worth noting that, to our knowledge, models using this assumption greatly outnumber the others; Figure 2.2 shows only a selection of representative examples.
Figure 2.2 Probabilistic topic models in view of the bag-of-words assumption.
2.2.2 Static topics versus Dynamic topics
Science has a rich history of evolution, with research topics continuously emerging within the academic community. New subjects often develop to address current events, demands, or challenges in the real world. Consequently, modeling the progression of topics over time within a collection of documents is both useful and intriguing for various applications.
Many topic models have been introduced for this task, for instance the Discrete Dynamic Topic Model (dDTM) [13] and the Continuous Dynamic Topic Model (cDTM) [67].
Hierarchical Latent Dirichlet Allocation (hLDA) and similar models treat the number of topics as an unknown variable, enabling new topics to be learned online as the data are processed. These models rely on more complex random processes, such as the Dirichlet process and the nested Chinese restaurant process.
Viewing the generative models according to whether or not the number of topics is known a priori, their evolution can be depicted as in Figure 2.3.
Figure 2.3 Viewing generative models in terms of Topics.
(The models grouped in “Static Topics” assume the number of topics is known a priori, the others in “Dynamic Topics” group learn new topics online from data.)
2.2.3 Parametric versus non-Parametric models
Parametric techniques are favored by researchers for addressing statistical challenges in Machine Learning. These models typically assume that a dataset is governed by a fixed set of parameters, which are estimated by specific inference algorithms. Conversely, some models allow the number of parameters to grow as the dataset expands, enabling new parameters to be discovered on the fly.
In Topic Modeling, various models, including pLSA, LDA, sLDA, Spatial LDA, and the Author-Topic Model, assume that the set of parameters is predetermined. These models view a document as a sample from a probability distribution that is a mixture of topics, where each topic is a distribution over words. Consequently, the number of topics and their generating distributions are fixed in advance, and the main task is to estimate the coefficients of the mixture, aside from a few hyperparameters.
Non-parametric topic models treat both the number of parameters and the parameters themselves as unknown, enabling them to learn from the corpus and adapt as it grows. However, this flexibility often makes them more complex than parametric models.
Figure 2.4 illustrates a parametric view of the generative models. Detailed descriptions of some of these models appear in Chapter 4.
Figure 2.4 A parametric view on generative models.
Discussion and notes
The perspectives on Topic Modeling presented above appear rigid, as they categorize models into distinct types without adequately illustrating the relationships among them. Latent Semantic Analysis (LSA), for example, is rooted in linear algebra while also adhering to the bag-of-words assumption and assuming a fixed number of topics. Consequently, these viewpoints on topic models are far from complete.
Topic models can be classified by various other characteristics, each offering a different perspective. Examining these attributes gives a deeper understanding of the different types of topic models available.
In the research community, scientific papers often explore either established topics or new concepts derived from existing ones, leading to correlations among various subjects. This interconnectedness is evident in diverse data sets, including time-lapse collections of images from cameras and compilations of political news, which may share thematic relationships.
As a result, modeling documents with an eye on the correlations among latent topics is important. Models that take this fact into account include CTM and HTMM.
Short-range dependency, also known as local dependency, refers to the way certain words in a sentence can significantly influence the meaning of others, as illustrated by the word "away" altering "run" in the phrase "He is running away" compared with "He is running." Models that consider this relationship include STM, BTM, HMM-LDA, CBTM, and HTMM. In contrast, long-range dependency models treat a document as a bag of words, disregarding the influence of word order and proximity on meaning.
When managing very large collections of documents, such as years of camera images or personal web pages, the limited amount of available memory becomes a significant challenge: it is often impossible to load all documents into memory at once. Therefore, the ability to process the data in manageable blocks is essential. Certain topic models, including MBTM, PF-LDA, and IG-LDA, are designed to handle such scenarios efficiently, making them advantageous for large datasets.
Another distinction is between parallel (distributed) and sequential learning. Traditional models overlook the advantages of parallel computation, making them less suitable for processing data across multiple workstations. However, approaches have emerged that are specifically designed for parallel and distributed environments, enhancing their efficiency and applicability in data-intensive tasks; see [3], [16], and [51].
Regarding supervised versus unsupervised learning, most existing topic models are unsupervised, extracting topics from documents without prior knowledge. While these models have proven effective in many applications, a topic model that incorporates prior knowledge could be more efficient, and exploring such a model could significantly improve the intelligence of text processing services.
To the best of our knowledge, few topic models work as a supervised mechanism; see for example [14], [65], [17], and [76].
Topic hierarchy models, such as hLDA, are able to uncover the structured relationships between the topics of a corpus, which makes them valuable for document classification and clustering tasks. These models view the documents as distributed over a hierarchy of topics, so that documents can be classified into various categories at different levels. A key limitation, however, is that the whole hierarchy of topics must be discovered at once. It is therefore worth investigating models that can identify all relevant nodes only up to a specified level, rather than uncovering the entire hierarchy. For further insights, refer to existing topic models in the literature.
LINEAR ALGEBRA BASED TOPIC MODELS
An overview
In this chapter, we explore how terms, documents, and queries can be represented as points in a semantic space, and how the topics of a document or corpus can be extracted from this space. Before doing so, we need some terminology.
A word (or term) is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., V}.
A document is a sequence of N words, denoted by d = (w_1, w_2, ..., w_N), where w_i is the i-th word in the sequence.
A corpus is a collection of M documents, denoted by {d_1, d_2, ..., d_M}. The j-th word of the document d_i is denoted by d_{i,j}.
A term-by-document matrix A of the corpus is a V x M matrix whose (i, j) entry is the number of times word i occurs in the j-th document.
In the term-by-document matrix, the j-th column is the document vector of the j-th document, and the i-th row records the occurrences of word i across the corpus; the (i, j) entry is the number of times word i appears in the j-th document.
The bag-of-words assumption states that the order of words within a document, and the order of documents within a corpus, is disregarded. In other words, word order does not influence the representation of the document that contains the words.
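To make these definitions concrete, here is a minimal sketch (not part of the original text; the example strings and variable names are illustrative assumptions) that builds a term-by-document count matrix with numpy.

```python
# Illustrative sketch: build a V x M term-by-document count matrix A
# under the bag-of-words assumption (word order is ignored).
import numpy as np

corpus = [
    "online inference of topics with latent dirichlet allocation",
    "variational inference for dirichlet process mixtures",
    "hierarchical topic models and the nested chinese restaurant process",
]

# Vocabulary: the set of unique words of the corpus.
vocab = sorted({w for doc in corpus for w in doc.split()})
word_index = {w: i for i, w in enumerate(vocab)}

V, M = len(vocab), len(corpus)
A = np.zeros((V, M), dtype=int)
for j, doc in enumerate(corpus):
    for w in doc.split():
        A[word_index[w], j] += 1   # (i, j): number of times word i occurs in document j
```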
Under the bag-of-words assumption, we aim to represent documents through the latent concepts associated with the terms rather than through the terms themselves. This underlying structure is not fixed; it varies with the corpus and with the relationships between the terms.
Latent Semantic Analysis (LSA) is a pioneering technique that represents documents and terms in a semantic space with little loss of information. Using Singular Value Decomposition (SVD), LSA projects documents and terms into a lower-dimensional space, allowing the latent concepts of the documents to be identified. The proximity of a term to a document in this space can be measured by the cosine of the angle between their vectors or by their inner product.
The similar idea is used in some other methods We shall see this in the subsequent sections.
Latent Semantic Analysis
LSA represents a corpus by a V x M term-by-document matrix A, where V is the number of unique words in the vocabulary. The j-th document is denoted d_j, and the i-th word of this document is denoted d_{j,i}. LSA finds new representations for both documents and words as follows.
3.2.1 New representations of documents and words:
Find the representation A = T_0 S_0 D_0^t by a singular value decomposition, where T_0 and D_0 are orthonormal matrices, S_0 is the diagonal matrix of the singular values of A, and D_0^t is the transpose of D_0.
Choose an integer k and construct S from S_0 by keeping only the rows and columns corresponding to the k largest singular values.
Remove the corresponding columns of T_0 and D_0 to obtain T and D such that the product A_k = T S D^t exists. So we have the approximation A ≈ A_k = T S D^t.
The new representations of words and documents are as follows: the columns of A_k represent the documents, and the rows of A_k represent the words.
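The steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions (numpy and random example data), not the computation used for the figures in this thesis.

```python
# Illustrative sketch: the truncated SVD step of LSA.
import numpy as np

def lsa_truncate(A, k):
    # Thin SVD: A = T0 @ S0 @ D0^t, singular values in decreasing order.
    T0, s0, D0t = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values and the matching columns/rows.
    T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
    A_k = T @ S @ D.T          # rank-k approximation of A
    return T, S, D, A_k

# Example with a random 10 x 8 count matrix and k = 4.
A = np.random.default_rng(0).integers(0, 3, size=(10, 8))
T, S, D, A4 = lsa_truncate(A, 4)
term_reps = T @ S   # rows: term representations used for term-term comparisons
doc_reps = D @ S    # rows: document representations
```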
Figure 3.1 A corpus consisting of 8 documents.
(A is the term-by-document matrix of the corpus.)
(The eight documents d1, ..., d8 are the paper titles: tight approximate inference of the logistic-normal topic admixture model; asynchronous distributed learning of topic models; pattern recognition and machine learning; hierarchical topic models with the nested Chinese restaurant process; variational inference for Dirichlet process mixtures; online inference of topics with latent Dirichlet allocation; a Bayesian hierarchical model for learning natural scene categories; and unsupervised learning by probabilistic latent semantic analysis. The rows of A correspond to the ten vocabulary terms: Dirichlet, inference, hierarchical, latent, learning, model, models, process, topic, topics.)
Finding the new representations of items is relatively straightforward; however, two points deserve attention. First, when the i-th singular value is removed, the corresponding row and column of S_0, as well as the i-th columns of T_0 and D_0, must also be removed.
Second, documents and words are represented by vectors of different kinds, which makes direct comparisons between them less straightforward. We will address this problem shortly, after an illustrative example.
Example 3.1: We consider a corpus of eight documents, each of which is the title of a paper referenced in this thesis. The corpus is preprocessed by keeping only the words that appear in at least two documents; the word "topics" is kept separate from "topic" so that the two can be compared, and common stop words, such as "the", are removed. The resulting vocabulary contains ten words, so the term-by-document matrix A has size 10 by 8, as shown in Figure 3.1.
Using SVD, we find the representation A = T_0 S_0 D_0^t.
4 http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
Choose k = 4, and take the 4 rows and 4 columns of S_0 associated with the 4 largest singular values of A to form the matrix S.
Remove the corresponding columns of T_0 and D_0 to obtain T and D.
Finally, the new representation of the corpus is A_4 = T S D^t.
Column 1 of A_4 represents document d_1, column 2 represents d_2, and so on.
In the matrix A_k, the rows and columns represent terms and documents, respectively, but these vectors differ in kind and in dimension. To assess the similarity between items, whether between two terms, between two documents, or between a term and a document, some additional observations are needed.
a) Comparing two terms:
In A_k, each row represents a term of the vocabulary, so comparing two terms amounts to comparing the two corresponding rows. This is inefficient, however, since each row is an M-dimensional vector. A better approach is the following.
A careful examination shows that A_k A_k^t is a square symmetric matrix containing all the term-term inner products: its (i, j) entry is the inner product of the i-th and j-th rows of A_k. Moreover, since D_0 is orthonormal, D is also orthonormal, and therefore

A_k A_k^t = (T S D^t)(T S D^t)^t = T S (D^t D) S^t T^t = (T S)(T S)^t.

Hence the (i, j) cell of A_k A_k^t is also the inner product of the i-th and j-th rows of T S. This means that the rows of T S can be regarded as term representations: each term is a vector in a k-dimensional space. This representation allows terms to be compared far more efficiently.
In short, to measure the similarity of two terms, we work with the two corresponding rows of T S.
Example 3.2: Consider the corpus in Example 3.1. We would like to measure the closeness of the two terms "inference" and "learning".
5 A matrix is said to be orthonormal if all columns of the matrix are mutually orthogonal, and of unit length.
In the 4-dimensional space, "inference" is represented by T_2 S, where T_i denotes the i-th row of T, while "learning" is represented by T_5 S. The similarity of the two terms is assessed by comparing these two vectors.
b) Comparing two documents:
Comparing two documents by means of the columns of A_k is inefficient, because these columns live in the high-dimensional space of terms. For better efficiency we need another representation of the documents.
Note that A_k^t A_k is the matrix of all document-document inner products, and

A_k^t A_k = (T S D^t)^t (T S D^t) = D S^t (T^t T) S D^t = (D S)(D S)^t.

Hence the inner product of two columns of A_k equals the inner product of the corresponding rows of D S. The rows of D S can therefore be viewed as document representations; using them, the similarity of two documents can be measured effectively, since all the vectors lie in a k-dimensional space.
Example 3.3: Return to Example 3.1. We would like to measure the similarity of the two documents d_1 and d_2. Their similarity can be measured by comparing the vectors D_1 S and D_2 S, where D_j is the j-th row of D.
c) Comparing a term and a document:
This comparison could be done by inspecting the corresponding row and column of A_k, but that is inefficient.
Note that A_k = T S D^t = (T S^{1/2})(D S^{1/2})^t, where S^{1/2} is the diagonal matrix satisfying S^{1/2} S^{1/2} = S. Thus each cell of A_k is the inner product of a row of T S^{1/2} and a row of D S^{1/2}. Consequently, the rows of T S^{1/2} and of D S^{1/2} can be taken as representations of terms and of documents, respectively, and a term and a document can be compared through these rows.
Example 3.4: We want to measure the similarity of the term "inference" with the document d_1. This can be done by comparing the two vectors T_2 S^{1/2} and D_1 S^{1/2}.
So far we have seen how LSA creates new representations for terms and documents, enabling comparisons between pairs of items. We now turn to identifying the main topic of a document, using a similarity measure such as the cosine.
The previous section described how to compare a given term with a given document. To extract the topic of the document d_i, we do the following steps.
- For each term j in the vocabulary, compute cos θ_{j,i}, the cosine of the angle between the two vectors T_j S^{1/2} and D_i S^{1/2}.
- Select every term j satisfying cos θ_{j,i} ≥ α to compose the topic of the document.
This procedure acts as a sieve that filters out all but the most relevant terms of the document, with the cosine as the measure of relevance. Other measures, such as the inner product or the Euclidean distance, can also be used. The following example illustrates the procedure.
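The sieve can be written directly from the T, S, D matrices of the truncated SVD. The following sketch is illustrative only; the function and variable names are my own, and it assumes the matrices produced by the SVD sketch above.

```python
# Illustrative sketch: extract the topic of document i by thresholding the
# cosine between T_j S^(1/2) (terms) and D_i S^(1/2) (document).
import numpy as np

def topic_of_document(T, S, D, vocab, i, alpha=0.6):
    S_half = np.sqrt(S)                      # diagonal, with S_half @ S_half = S
    term_vecs = T @ S_half                   # one row per term
    doc_vec = (D @ S_half)[i]                # row i: representation of document d_i
    cos = term_vecs @ doc_vec / (
        np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(doc_vec) + 1e-12)
    # Keep every term j with cos(theta_{j,i}) >= alpha.
    return [vocab[j] for j in range(len(vocab)) if cos[j] >= alpha]
```

For the corpus of Example 3.1 and α = 0.6, a call such as topic_of_document(T, S, D, vocab, 4) would play the role of the computation in Example 3.5 (with document indices starting at 0).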
Figure 3.2 An illustration of finding topics by LSA using cosine.
Example 3.5: We determine the topic of the document d_5 of the corpus in Example 3.1. We choose α = 0.6 and compute the cosines cos θ_{j,5} for j = 1, ..., 10; the values are shown in Figure 3.2. After computing these numbers, we select those that are not less than α. Four numbers (marked bold in Figure 3.2) satisfy this condition. Looking up the vocabulary to find the terms associated with these numbers (see Figure 3.1), we obtain the following:

Dirichlet, inference, process, topics.

These terms comprise the topic of the document d_5. It is interesting that although "topics" does not appear in d_5, the term still plays an important role in the document. This fact can be explained as follows: both "Dirichlet" and "inference" appear in d_5 and d_6, and "topics" appears in d_6. Thus LSA suggests that these three terms may reasonably correlate.
Figure 3.3 A geometric illustration of representing items in 2-dimensional space
(The items covered by the two dashed lines may be similar to the document d_5 in the cosine measure.)
QR factorization
In Section 3.2 we saw how LSA processes the term-by-document matrix A of a corpus, removing redundant information in order to reduce uncertainty. LSA approximates A by a new matrix A_k obtained through singular value decomposition. Despite this approximation, A_k and A have the same dimensions. Furthermore, the comparisons between items, such as document-document and document-query comparisons, are somewhat unnatural.
LSA relies primarily on SVD to analyze a corpus. However, as noted by Michael et al. [46], other linear algebra techniques can mimic the behavior of SVD. This section explores the application of QR factorization to Information Retrieval.
The QR-based method simplifies the representation of a corpus by exploiting the fact that, in the term-by-document matrix, certain columns may be linearly dependent on others. By identifying and removing these dependent columns, we obtain a more compact representation of the corpus. The crucial task is therefore to pinpoint the key columns of the matrix.
Assume we are given the V x M term-by-document matrix A of a corpus. Then, as is well known, it can be decomposed into two matrices Q and R such that

A = Q R,    (3.2)

where Q is a V x V orthonormal matrix and R is a V x M upper triangular matrix.
The columns of A can be expressed as linear combinations of the columns of Q, so we can approximate A using Q. To keep the loss of information small, we refine the approximation by eliminating the columns of Q that correspond to the sparsest rows of R.
Let Q_k be the V x k matrix derived from Q by removing the columns associated with the (V - k) sparsest rows of R. Then the representation of the corpus is approximated by

A ≈ Q_k R_k,    (3.3)

where R_k is the matrix obtained from R by removing the (V - k) sparsest rows.
To retrieve all the documents relevant to a given query in this new representation, we can follow the same procedure as in the LSA method, except that the computation of cos θ_j is slightly different:

cos θ_j = (r_j^t Q_k^t d_q) / (||r_j|| ||d_q||),    (3.4)

where r_j is the j-th column of R_k and d_q is the query vector.
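As an illustration only (the way of selecting "dense" rows and the names below are my own assumptions, and (3.4) is used in the reconstructed form given above), the QR-based retrieval can be sketched as follows.

```python
# Illustrative sketch: QR-based retrieval of the documents relevant to a query.
import numpy as np

def qr_retrieval(A, d_q, k, alpha=0.6):
    Q, R = np.linalg.qr(A, mode='complete')       # A = Q R, Q: V x V, R: V x M
    # Keep the k columns of Q associated with the k densest rows of R.
    density = np.count_nonzero(np.abs(R) > 1e-12, axis=1)
    keep = np.sort(np.argsort(density)[-k:])
    Q_k, R_k = Q[:, keep], R[keep, :]
    # cos(theta_j) as in (3.4): r_j is the j-th column of R_k.
    scores = (R_k.T @ (Q_k.T @ d_q)) / (
        np.linalg.norm(R_k, axis=0) * np.linalg.norm(d_q) + 1e-12)
    return [j for j in range(A.shape[1]) if scores[j] >= alpha]
```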
Figure 3.4 Finding relevant documents using QR-based method.
(First, A is decomposed into two matrices Q and R by QR factorization. We then keep the 3 columns of Q associated with the 3 densest rows of R to form the new matrices Q_3 and R_3. Next, we compute the similarity between the query vector d_q = (0, 0, 0, 0, 1, 0, 0, 0, 1, 0)^t and the documents using (3.4), obtaining the values cos θ_1, ..., cos θ_8 shown in the figure. If we choose the threshold α = 0.6, the returned documents are d_2, d_3, d_7, and d_8. This result coincides with what is shown in Figure 3.3.)
Example 3.7: Consider the corpus in Example 3.1. We want to find the documents most relevant to the query "topic model". The result appears in Figure 3.4.
Discussion
Linear algebra techniques provide efficient representations of document corpora and are valuable for identifying the topics of documents and for retrieving the documents most relevant to a given query. Some of their main advantages are:
- The computation is simple since SVD and QR factorization are elementary techniques in linear algebra.
- These methods are able to provide explicit representations of terms and documents in the semantic space.
- By positioning correlated terms close to each other, these models preserve the relationships between terms, making them valuable tools for difficult problems in Natural Language Processing such as polysemy and synonymy.
- These methods are applicable to any discrete data, not only text corpora.
Despite these appealing properties, linear algebra based methods have their own serious drawbacks, some of which are:
- The dimension of the semantic space is chosen empirically. For LSA, it is suggested that a good choice of the dimension lies between 100 and 300.
Selecting an appropriate dimension is crucial for uncovering latent structures effectively; if the dimension is too small or too large, the results may be poor. A key open problem is therefore how to choose the best dimension automatically for a specific application.
- LSA and its counterparts lack a solid theoretical basis that explains their notable successes in various applications. A plausible explanation is that LSA approximates the term-by-document matrix A by the best matrix A_k of the same rank. For further details, see [54].
- How the corpus should be handled when new documents are added or removed remains unclear and unnatural, although this situation occurs frequently with real corpora and deserves further investigation. Some proposals have been made in previous studies, but they tend to be complex and need to be simplified for practical use.
- A drawback that has received little attention is the performance of LSA when the corpus contains only one or very few documents, a situation that can occur in applications with limited observable data. LSA identifies the topics of a document by analyzing every word of the vocabulary, and its effectiveness generally improves with larger datasets; consequently, LSA may struggle to deliver good results on a corpus with very few documents.
To the best of my knowledge, this problem has not received any attention so far. Hence, in my opinion, it is worth studying further.
PROBABILISTIC TOPIC MODELS
An overview
Probabilistic topic models have attracted significant attention from researchers thanks to their effectiveness in various applications and their strong statistical foundations. These models use statistical principles to explain how they work and why their experiments succeed, which appears to be the main factor driving research interest in this area.
Probabilistic topic models aim to approximate the stochastic processes that generate the data. Typically, such a model consists of several key components.
- Assumptions: probabilistic topic models rely on specific assumptions about the data they analyze. For instance, models like pLSA and LDA assume that each word of the vocabulary is drawn from a particular probability distribution. Models such as LDA, CTM, hLDA, and HTMM further assume that the documents of a corpus are generated by a defined probability distribution or stochastic process. These foundational assumptions are crucial for the development of probabilistic topic models.
- Generative process: the model makes assumptions about how a document or a collection of documents is created. For example, LDA posits that each document is a mixture of topics, with each topic being a probability distribution over words. Understanding this generative process is crucial for analyzing and interpreting the data.
- Inference: inference is the task of determining the unknown parameters of the assumed probability distributions or processes over the data. Modeling the data accurately requires finding or approximating these parameters, a process known as parameter inference.
Many models lack explicit algebraic forms for their probability distributions or processes, making exact parameter inference intractable. In such cases, approximate inference is a practical solution. For a foundational treatment of inference, refer to [33], [71], [2], and [6]; additional information on stochastic processes can be found in [4], [9], [28], [36], [37], and [60].
Comparing probabilistic topic models can be difficult, since they make different assumptions and use different inference procedures; without a common measure it is hard to see the advantages of one model over another for a specific problem. Fortunately, the literature provides a useful metric known as perplexity, commonly applied to text corpora. Perplexity quantifies the uncertainty of a model by evaluating the likelihood of the unseen part of a test document given its first N observed words:

perplexity = exp( - log Pr(w_{N+1}, ..., w_{N_test} | w_1, ..., w_N) / N_test ),    (4.1)

where N_test is the length of the test document.
Perplexity indicates how difficult it is to predict new, unseen documents after learning from a training set; lower values mean better predictive performance. Hofmann [31] regards perplexity as a reliable measure of model performance. Other measures, such as the log probability of test data [10], precision-recall, and average precision [31], can also be used.
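As a small illustration (my own sketch, assuming the model exposes the predictive probability of each unseen word; normalizing by the number of predicted words is an assumption consistent with "lower is better"):

```python
# Illustrative sketch: perplexity from per-word predictive probabilities.
import numpy as np

def perplexity(word_probs):
    """word_probs: predictive probabilities of the unseen words of a test document."""
    word_probs = np.asarray(word_probs, dtype=float)
    log_prob = np.sum(np.log(word_probs))            # log-probability of the unseen words
    return float(np.exp(-log_prob / len(word_probs)))

# A model that is uniform over a 1000-word vocabulary has perplexity 1000.
print(perplexity([1.0 / 1000] * 50))   # -> 1000.0
```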
In the remainder of this chapter we examine topic models that exemplify different characteristics of Topic Modeling. Section 4.2 considers the pLSA model, a key representative of probabilistic and parametric topic models. Section 4.3 focuses on the LDA model, notable for its foundational role in many subsequent models and for being a static-topic, bag-of-words, parametric model. Section 4.4 examines Hierarchical LDA, an important representative of dynamic-topic and non-parametric models. Finally, Section 4.5 discusses the Bigram Topic Model, which illustrates non-bag-of-words topic models.
Probabilistic Latent Semantic Analysis
Chapter 3 highlighted several limitations of LSA, the most significant being the lack of a solid theoretical framework that explains how it works and why it succeeds empirically in practical applications. This shortcoming motivated the probabilistic Latent Semantic Analysis (pLSA) model.
Consider a corpus of M documents containing V distinct words, with term-by-document matrix A recording the frequency of each word in each document. The pLSA model treats each occurrence of a word as a sample from a probability distribution and associates it with an unobserved class variable z_k, which can be interpreted as a topic. Consequently, each word occurrence is associated with a particular topic, which allows the underlying themes of the documents to be captured.
Given the number of topics K, pLSA assumes that the occurrences of words in documents are generated by the following process:
- Select a document d_i with probability Pr(d_i).
- Pick a latent class z_k with probability Pr(z_k | d_i).
- Generate a word w_j with probability Pr(w_j | z_k).
Here Pr(d_i) is the probability that a word occurrence is observed in document d_i, Pr(z_k | d_i) is the probability of topic z_k given document d_i, and Pr(w_j | z_k) is the probability of word w_j given topic z_k.
The graphical model of pLSA, depicted in Figure 4.1, illustrates this generative process. The boxes are "plates" representing replicates; the number in the bottom-right corner of each plate is the number of replications of the variables inside it. Blank circles denote latent variables and shaded circles denote observed variables. For more about graphical models, additional resources can be consulted.
Figure 4.1 Graphical model representation of pLSA.
The latent variables z_1, ..., z_K represent the underlying topics, while w_j and d_i are the observed variables. To model the corpus, we need to determine the probabilities Pr(w_j | z_k) and Pr(z_k | d_i). Once these are known, the joint probability of w_j and d_i is easily computed:

Pr(w_j, d_i) = Pr(d_i) Pr(w_j | d_i),    (4.2)

where Pr(w_j | d_i) = Σ_{k=1}^{K} Pr(w_j | z_k) Pr(z_k | d_i).
There is a nice geometric interpretation of the model derived from (4.2). See Figure 4.2 for a 3-dimensional representation of the probabilistic latent semantic space, which is the convex hull of Pr(· | z_1), ..., Pr(· | z_K).
Figure 4.2 A geometric interpretation of pLSA.
Each probability Pr(w_j | d_i) is thus a convex combination of the K probability mass functions Pr(w_j | z_k). The mixing weights Pr(z_k | d_i) serve as the coordinates of the document d_i in a K-dimensional probabilistic latent semantic space.
To determine Pr(w_j | z_k) and Pr(z_k | d_i), we must estimate the parameters of the assumed probability distributions, a task commonly referred to as inference or learning. Various methods have been developed for this task, including Maximum Likelihood (ML) estimation, Expectation Propagation (EP), Expectation Maximization (EM), variational methods, and Gibbs sampling; see [6], [2], [33], and [71] for details. Hofmann [31] introduced an EM algorithm for inferring Pr(w_j | z_k) and Pr(z_k | d_i), and we follow his approach here.
The EM algorithm estimates quantities involving latent variables by maximizing the likelihood function. In our case, it approximates Pr(w_j | z_k) and Pr(z_k | d_i) as follows.
- Initial step: randomly choose values for Pr(w_j | z_k) and Pr(z_k | d_i), for all possible k, j, i.
- E-step: compute the posterior probabilities of the latent variables. For all possible k, j, i, compute

Pr(z_k | d_i, w_j) = Pr(w_j | z_k) Pr(z_k | d_i) / Σ_{l=1}^{K} Pr(w_j | z_l) Pr(z_l | d_i).

- M-step: for all possible k, j, i, re-estimate the parameters from the expected counts,

Pr(w_j | z_k) ∝ Σ_{i=1}^{M} n(d_i, w_j) Pr(z_k | d_i, w_j),
Pr(z_k | d_i) ∝ Σ_{j=1}^{V} n(d_i, w_j) Pr(z_k | d_i, w_j),

where n(d_i, w_j) is the number of occurrences of word w_j in document d_i and each distribution is normalized to sum to one.
- If the termination condition is not met, return to the E-step.
The termination condition can be based on convergence, or on early stopping, which halts the iterations once the improvement in the parameters becomes negligible.
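A minimal sketch of these EM updates on a V x M count matrix follows. It is illustrative only: it implements the basic EM above (not Hofmann's Tempered EM), and the variable names are my own.

```python
# Illustrative sketch: basic EM for pLSA, estimating Pr(w|z) and Pr(z|d).
import numpy as np

def plsa_em(A, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    V, M = A.shape
    p_w_z = rng.random((V, K)); p_w_z /= p_w_z.sum(axis=0)   # Pr(w_j | z_k)
    p_z_d = rng.random((K, M)); p_z_d /= p_z_d.sum(axis=0)   # Pr(z_k | d_i)
    for _ in range(n_iter):
        # E-step: Pr(z_k | d_i, w_j), stored with shape (K, V, M).
        joint = p_w_z.T[:, :, None] * p_z_d[:, None, :]
        post = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: re-estimate the parameters from the expected counts.
        weighted = post * A[None, :, :]                       # n(d_i, w_j) Pr(z | d_i, w_j)
        p_w_z = weighted.sum(axis=2).T
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```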
Hofmann's experiments revealed that the basic EM algorithm frequently overfits, which significantly reduces the model's predictive accuracy. To mitigate this issue, he introduced the Tempered EM algorithm (TEM), which optimizes an energy function parameterized by variational parameters instead of the log likelihood function. For details of this approach, refer to the original paper [31].
Identifying the topic of a document is a fundamental task in text analysis. With LSA this task is somewhat unintuitive, whereas pLSA makes it much simpler: each latent class variable z_k represents a topic associated with a document d_i, so the main topic of d_i can be found by selecting the z_k that maximizes Pr(z_k | d_i) over all topics.
When a document (query) q is not present in the corpus, we can determine its topics using the folding-in technique developed by Hofmann. This involves computing the probabilities Pr(z_k | q) with the Tempered EM algorithm while keeping the parameters Pr(w_j | z_k) fixed, and then selecting the topic z_k whose probability Pr(z_k | q) is largest among Pr(z_1 | q), ..., Pr(z_K | q).
The folding-in technique is also useful in Information Retrieval for determining which documents are relevant to a specific query q. The probabilities Pr(z_1 | d_i), ..., Pr(z_K | d_i) are the coordinates of the document d_i in the probabilistic latent semantic space, and Pr(z_1 | q), ..., Pr(z_K | q) can likewise be interpreted as the coordinates of the query in this space.
By this interpretation, one can easily use a similarity measure to find all relevant documents to the query.
We have seen how pLSA deals with a given corpus or query. Compared with LSA, the method has several advantages.
- Statistical techniques for model fitting, model selection, and complexity control give pLSA a solid foundation, which contributes to its better performance and robustness compared with LSA.
- pLSA allows different models to be combined systematically. For example, one can easily integrate the cosine or the inner product into the model to assess the degree of relevance.
- The documents and queries have explicit and homogeneous representations in the semantic space, opposed to the heterogeneous ones in LSA.
Despite these benefits, pLSA suffers from several problems when dealing with real data, some of which are:
- TEM is used to find the maximum of the likelihood function, but there is no guarantee that the maximum found is global rather than merely local. Hence the accuracy of TEM remains uncertain.
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is one of the most influential models in Topic Modeling. It is the basis for many subsequent interesting models such as CTM, DELSA, PLSV, and BTM. The model stems from the work of Blei et al. [11], who observed a latent principle shared by various genres of data sets and developed a fully generative model at the level of documents.
Exchangeability is a key concept here: it is the probabilistic counterpart of the bag-of-words assumption in text processing, under which words and documents are treated as interchangeable. By de Finetti's theorem, any collection of exchangeable random variables has a representation as a mixture distribution. In Topic Modeling, treating words and documents as random variables therefore suggests the existence of a mixture distribution that generates them, and Blei and his collaborators developed the LDA model by placing specific distributions on these variables.
Consider a corpus of M documents over a vocabulary of V distinct words, represented under the bag-of-words assumption by the term-by-document matrix A. The LDA model assumes the following generative process for each document d_m of the corpus:
1. Choose topic proportions θ_m | α ~ Dir(α).
2. For each word d_{m,n} in the document:
   (a) Choose a topic z_{m,n} | θ_m ~ Mult(θ_m).
   (b) Choose a word d_{m,n} | z_{m,n}, β ~ Mult(β_{z_{m,n}}).
Here Dir(·) and Mult(·) denote the Dirichlet and the Multinomial distributions over the variables, and the notation θ_m | α ~ Dir(α) should be read as drawing θ_m from the Dirichlet distribution with parameter α.
In the LDA model, the topics of the corpus are β_1, β_2, ..., β_K, written β_{1:K} for short. Each topic β_k is itself drawn from a distribution, for example β_k ~ Dir(η). The vector θ_m gives the mixture proportions of the topics in the document d_m, and each word of the document is generated from the topic distribution Mult(β_{z_{m,n}}) selected by its topic assignment z_{m,n}. Figure 4.3 depicts this generative process.
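To make the generative process concrete, the following sketch samples a small synthetic corpus from it. It is illustrative only; the hyperparameter values and the function name are my own choices, not taken from [11].

```python
# Illustrative sketch: sampling documents from the LDA generative process.
import numpy as np

def generate_lda_corpus(M=5, K=3, V=20, N=30, alpha=0.1, eta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet([eta] * V, size=K)       # topics beta_1..beta_K ~ Dir(eta)
    docs = []
    for _ in range(M):
        theta = rng.dirichlet([alpha] * K)        # theta_m ~ Dir(alpha)
        doc = []
        for _ in range(N):
            z = rng.choice(K, p=theta)            # z_{m,n} ~ Mult(theta_m)
            doc.append(rng.choice(V, p=beta[z]))  # d_{m,n} ~ Mult(beta_{z_{m,n}})
        docs.append(doc)
    return docs, beta
```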
Figure 4.3 Graphical model representation of LDA.
Figure 4.4 A geometric interpretation of LDA.
(pLSA induces only an empirical distribution on the topic simplex, whereas LDA places a smooth distribution on it, depicted by the contour lines. The original figure appears in [11].)
The LDA model has a geometric interpretation: the V words of the vocabulary span a (V-1)-dimensional simplex, called the word simplex, while the topics span a topic simplex inside it. In LDA, each word of an observed or unseen document is generated by a randomly chosen topic, drawn from a distribution whose parameter is itself randomly chosen, once per document, from a smooth distribution on the topic simplex. This contrasts with pLSA, which assumes that each document corresponds to a fixed point on the topic simplex from which its topics are drawn. Figure 4.4 illustrates the geometric difference between LDA and pLSA.
As with pLSA, we must compute distributions over the latent variables, but the posterior distributions in LDA are more complex, so its inference procedures are more intricate. Specifically, for each document d_m, the posterior distribution of the hidden variables θ_m and z_m is

Pr(θ_m, z_m | d_m, α, β) = Pr(θ_m, z_m, d_m | α, β) / Pr(d_m | α, β),    (4.6)

where the joint distribution is

Pr(θ_m, z_m, d_m | α, β) = Pr(θ_m | α) ∏_{n=1}^{N_m} Pr(z_{m,n} | θ_m) Pr(d_{m,n} | z_{m,n}, β),    (4.7)

and, by integrating over θ_m and summing over z in (4.7), the marginal likelihood of the document is

Pr(d_m | α, β) = ∫ Pr(θ_m | α) ( ∏_{n=1}^{N_m} Σ_{z_{m,n}} Pr(z_{m,n} | θ_m) Pr(d_{m,n} | z_{m,n}, β) ) dθ_m.

The posterior distribution (4.6) is intractable to compute exactly, so effective approximation strategies are needed. Several approaches have been proposed, including variational methods, expectation propagation, and Gibbs sampling
[25], [55]. Here we shall follow the original proposal of Blei et al. in [11].
Variational inference uses Jensen's inequality to obtain an adjustable lower bound on the log likelihood, from which the parameters can be derived. The lower bound has a factorized form over the variables, which yields an effective approximation of Pr(θ_m, z_m | d_m, α, β):

q(θ_m, z_m | γ_m, φ_m) = q(θ_m | γ_m) ∏_{n=1}^{N_m} q(z_{m,n} | φ_{m,n}),

where the Dirichlet parameter γ_m and the multinomial parameters φ_{m,n} are free parameters, called variational parameters.
The variational parameters are found by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior Pr(θ_m, z_m | d_m, α, β), i.e.,

(γ_m*, φ_m*) = arg min_{(γ, φ)} KL( q(θ_m, z_m | γ, φ) || Pr(θ_m, z_m | d_m, α, β) ).
For more about KL divergence, we refer to [6], [33].
Since KL divergence is a non-linear function, we have to find its minima by an iterative procedure, such as the one depicted in Figure 4.5.
Figure 4.5 A variational inference algorithm for LDA.
(Note that this algorithm is applied to each document d m of the corpus.)
(In the algorithm of Figure 4.5, the variational parameters are initialized as γ_{m,i} := α_i + N_m/K for all i, and the updates are iterated until convergence.)
After computing γ_m* and φ_m*, we turn to approximating the parameters of the model, α and β. To model the observed data accurately, we need to find the α and β that maximize the marginal log likelihood of the data,

ℓ(α, β) = Σ_{m=1}^{M} log Pr(d_m | α, β).
Note that Pr(d_m | α, β) is intractable to compute, so we use a variational EM algorithm based on the variational parameters, as follows:
- E-step: for each document d_m, find the optimal values of the variational parameters (γ_m*, φ_m*) using the inference algorithm in Figure 4.5.
- M-step: maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β. The parameter β is computed by

β_{i,j} ∝ Σ_{m=1}^{M} Σ_{n=1}^{N_m} φ*_{m,n,i} w_{m,n}^j,

where w_{m,n}^j = 1 if d_{m,n} is the j-th word of the vocabulary, and w_{m,n}^j = 0 otherwise.
The Dirichlet parameter α is computed by the Newton-Raphson method; more precisely, α is updated iteratively until the sequence of updates converges.
The E-step and M-step are performed alternately until convergence or until an early stopping criterion is met. The resulting parameter values can then be used for various tasks, such as topic extraction.
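For reference, the per-document E-step can be sketched as follows. This is my own illustrative code for the standard per-document updates of Blei et al. [11], which Figure 4.5 depicts; the update rule for φ is assumed from that paper rather than copied from the figure. It takes α of length K and β of size K x V.

```python
# Illustrative sketch: per-document variational inference for LDA (the E-step).
import numpy as np
from scipy.special import digamma

def lda_e_step(doc, alpha, beta, n_iter=50):
    """doc: list of word indices d_{m,1..N_m}; alpha: length K; beta: K x V array."""
    alpha = np.asarray(alpha, dtype=float)
    K, N = len(alpha), len(doc)
    phi = np.full((N, K), 1.0 / K)        # phi_{m,n,i} := 1/K
    gamma = alpha + N / K                 # gamma_{m,i} := alpha_i + N_m/K
    for _ in range(n_iter):
        # phi_{m,n,i} is proportional to beta_{i, d_{m,n}} * exp(digamma(gamma_i))
        phi = beta[:, doc].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)   # gamma_m := alpha + sum_n phi_{m,n}
    return gamma, phi
```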
4.3.2 Topics of the new documents
The multinomial parameter β_k tells us which words are important for topic k, while the parameter θ_j tells us which topics appear in document d_j. Using these parameters, estimated by the variational EM algorithm, we can extract topics from an individual document and from the whole corpus. The same machinery also lets us find the documents in the corpus most relevant to a given document or query.
To uncover the hidden topics of a document d_m that already belongs to the corpus, we first run the variational EM algorithm on the corpus to obtain φ, γ, and β. We then compute Pr(z_k | θ_m) = Mult(θ_m) for each topic k = 1, ..., K and select the topics z_k with the highest posterior probabilities. To identify the words composing each selected topic z_k, we compute Pr(d_{m,j} | z_k) = Mult(β_{z_k}) for j = 1, ..., N_m and keep the words whose probabilities exceed a specified threshold.
To find the topics of a new document d that has not appeared in the corpus, one can follow the approach of Blei et al.: first infer φ, γ, and β from the existing corpus; then, using the approximation Pr(z_n | d) ≈ φ_n, select all z_n whose probabilities exceed a specified threshold. Each selected z_n is a topic present in the document d.
LDA also offers a straightforward way to find the documents most relevant to a given query: first uncover the topics of the query, and then return all documents of the corpus that contain those topics.
In our exploration of influential topic models, we focus on Latent Dirichlet Allocation (LDA), a generative model specifically designed for document analysis LDA offers numerous advantages, making it a powerful tool in the field of topic modeling.
- LDA is a completely unsupervised model in learning topics If was shown to be very robust and promising for many tasks such as document modeling, collaborative filtering, document classification.
- Compared with pLSA, LDA has far fewer parameters, and their number does not depend on the size of the training set (\( K + KV \) for LDA versus \( KM + KV \) for pLSA, where \( M \) is the number of training documents). This characteristic helps the model avoid overfitting.
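To make the contrast concrete (the figures below are illustrative, not taken from the thesis experiments): with \( K = 100 \) topics, \( V = 20{,}000 \) vocabulary words, and \( M = 10{,}000 \) documents, LDA has about \( 100 + 100 \times 20{,}000 \approx 2 \times 10^6 \) parameters, whereas pLSA has \( 100 \times 10{,}000 + 100 \times 20{,}000 = 3 \times 10^6 \), and the pLSA count keeps growing as documents are added.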
4.4 Hierarchical Latent Dirichlet Allocation
Many existing flat topic models identify topics without recognizing the hierarchical relationships between them, such as the connection between the broad topic "animal" and the more specific topic "dog." Ignoring such parent-child relationships limits a model's ability to reflect the nuances of topic organization. It is therefore desirable for topic models to incorporate these hierarchical structures in order to be more effective for knowledge discovery.
The hLDA model, developed by Blei et al., is a sophisticated and robust topic model that captures both abstract and specific topics within a corpus. It uncovers a hierarchical structure of topics in an unsupervised manner, in which each topic can be a specialization (child) or a generalization (parent) of another. Notably, hLDA does not require prior assumptions about the hierarchy or the number of topics: new topics can be discovered by its inference algorithm without a predefined structure. As a nonparametric model, hLDA is dynamic in the sense that the total number of topics is not known in advance.
The hLDA model conceptualizes a document as the product of a finite random walk on an infinite random tree, where each node represents a topic. By placing a prior distribution on this tree, one can use an inference algorithm to uncover its structure. The prior is designed so that broader topics sit closer to the root, while more specific topics appear nearer to the leaves. Blei et al. have shown that the nested Chinese restaurant process (nCRP) can effectively serve as this prior distribution.
The infinite tree established by the nCRP defines a path, denoted \( c_m \), for the \( m \)th document of our corpus. The hLDA model then posits that documents are generated by the following process.
1. For each node \( k \) in the infinite tree, draw a topic \( \beta_k \mid \eta \sim \mathrm{Dir}(\eta) \).
2. For each document \( d_m \) of the corpus,
(a) Draw a path \( c_m \mid \gamma \sim \mathrm{nCRP}(\gamma) \).
(b) Draw a distribution over levels in the tree: \( \theta_m \mid \{\lambda, \pi\} \sim \mathrm{GEM}(\lambda, \pi) \).
(c) For each word \( d_{m,n} \),
(i) Choose a level \( z_{m,n} \mid \theta_m \sim \mathrm{Mult}(\theta_m) \).
(ii) Choose the word \( d_{m,n} \mid \{z_{m,n}, c_m, \beta\} \sim \mathrm{Mult}(\beta_{c_m[z_{m,n}]}) \), which is parameterized by the topic in position \( z_{m,n} \) on the path \( c_m \).
Figure 4.6 gives a geometric illustration of how the documents of the corpus are generated. Each document, indexed by {1, 2, 3, ...}, selects its words from the topic distributions along a randomly chosen path.
Figure 4.6 A geometric illustration of document generation process.
To create a document, we first select a path for it. We then draw a distribution over the levels of that path, which dictates the level assigned to each word; finally, each word is generated from the topic at its assigned level on the chosen path. A simplified sketch of this sampling procedure is given below.
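The following is a simplified sketch of steps 2(a)-2(c)(i), not the thesis's implementation: paths are drawn with a nested-CRP scheme and the level distribution with a truncated stick-breaking construction standing in for the full \( \mathrm{GEM}(\lambda, \pi) \); generating actual words would additionally require a topic \( \beta_k \) at every node.

    import numpy as np

    rng = np.random.default_rng(0)

    def ncrp_path(tree, depth, gamma):
        """Sample one root-to-leaf path of the given depth with the nested CRP.
        `tree` maps a path prefix (a tuple) to the visit counts of its children."""
        path = ()
        for _ in range(depth):
            counts = tree.setdefault(path, [])
            total = sum(counts) + gamma
            probs = [c / total for c in counts] + [gamma / total]  # old children + new child
            choice = rng.choice(len(probs), p=probs)
            if choice == len(counts):      # open a brand-new child node (a new topic)
                counts.append(0)
            counts[choice] += 1
            path = path + (choice,)
        return path

    def level_distribution(depth, pi):
        """A truncated stick-breaking draw of a distribution over levels."""
        sticks = rng.beta(1.0, pi, size=depth)
        remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
        theta = sticks * remaining
        return theta / theta.sum()          # renormalize because of the truncation

    tree, depth, gamma, pi = {}, 3, 1.0, 1.0
    for m in range(5):                                    # five documents
        c_m = ncrp_path(tree, depth, gamma)               # step 2(a)
        theta_m = level_distribution(depth, pi)           # step 2(b)
        z_m = rng.choice(depth, size=8, p=theta_m)        # step 2(c)(i): one level per word
        print(f"doc {m}: path={c_m}, levels={z_m.tolist()}")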
In hLDA, the challenge lies in computing a posterior distribution over countably infinite collections of objects: hierarchies, path assignments, and level allocations of words. This task is infeasible exactly, so approximation methods are required.
Markov chain Monte Carlo (MCMC) is a widely used technique for approximating probability distributions over hidden variables. The method constructs a Markov chain whose stationary distribution is the target distribution; after sampling from this chain for long enough, the collected states are approximately distributed according to the target and can be used to estimate it.
In hLDA, we need to approximate the posterior \( \Pr(c_{1:M}, z_{1:M} \mid w_{1:M}, \gamma, \eta, \lambda, \pi) \). We employ collapsed Gibbs sampling, a special case of MCMC. Our sampling algorithm iterates the following two steps until convergence:
- Sampling level allocations for the words.
- Sampling paths for the documents.
Both steps are carried out for every document of the corpus; in each iteration, the posterior of a variable is computed conditioned on the current values of the other variables from the previous step. a) Sampling level allocations
For document \( d_m \), given the current path assignments, we sample the level allocation variable \( z_{m,n} \) of the \( n \)th word from
\[ \Pr(z_{m,n} \mid z_{-(m,n)}, c, w, \lambda, \pi, \eta) \propto \Pr(z_{m,n} \mid z_{m,-n}, \lambda, \pi)\, \Pr(d_{m,n} \mid z, c, w_{-(m,n)}, \eta). \tag{4.9} \]
In this formula, \( z_{-(m,n)} \) and \( w_{-(m,n)} \) denote the vectors of level allocations and of observed words with \( z_{m,n} \) and \( d_{m,n} \) left out, respectively; \( z_{m,-n} \) is interpreted similarly, i.e., the level allocations in document \( d_m \) excluding \( z_{m,n} \). The notation \( \propto \) means that the posterior is obtained by computing the right-hand side and then normalizing it to sum to one.
The two factors on the right-hand side of (4.9) are computed from counts over the current assignments, where \( \#[\cdot] \) denotes the number of elements of an array satisfying a given condition. b) Sampling paths
For each document \( d_m \), given the level allocation variables, we sample the associated path conditioned on all other paths and the observed words:
\[ \Pr(c_m \mid c_{-m}, w, z, \gamma, \eta) \propto \Pr(c_m \mid c_{-m}, \gamma)\, \Pr(w_{d_m} \mid c, w_{-d_m}, z, \eta), \tag{4.10} \]
where \( \Pr(c_m \mid c_{-m}, \gamma) = \mathrm{nCRP}(\gamma) \) is the prior on paths implied by the nested Chinese restaurant process, and the second factor is the likelihood of the words of \( d_m \) under a candidate path given the current level allocations.
To compute the posteriors in equations (4.9) and (4.10), the hyperparameters \( \lambda \), \( \pi \), \( \eta \), and \( \gamma \) must be known. However, these values are typically unknown a priori in a given application, which complicates the modeling process; an inference procedure is therefore needed to determine them empirically. For a comprehensive discussion of this issue, see the original paper [8].
Here we assume that parameter inference has already been carried out on a preprocessed corpus and focus on how the hierarchical topic structure is constructed.
The inference procedure identifies the path \( c_m \) of each document \( d_m \) in the corpus. It also determines the assignments of words to levels and of documents to paths. This enables us to compute the probability of a particular word \( w \) under the topic at level \( h \) of path \( p \).
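The formula itself is not reproduced above; a standard smoothed estimate consistent with this description, and plausibly the intended one (an assumption on our part, following the usual collapsed Gibbs estimator), is
\[ \Pr(w \mid z_{p,h}) \approx \frac{\#[\text{words assigned to the topic at level } h \text{ of path } p \text{ equal to } w] + \eta}{\#[\text{words assigned to the topic at level } h \text{ of path } p] + V\eta}, \]
where \( V \) is the vocabulary size and \( \eta \) is the topic smoothing hyperparameter.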
The tree can then be constructed from the paths \( \{c_1, \ldots, c_M\} \). Using the above estimates, we can also gather the words assigned to each node, which helps in visualizing the hierarchy. Figure 4.7 shows the hierarchy of topics discovered by Blei et al. in their analysis of abstracts from the journal Psychological Review.
Figure 4.7 An example of hierarchy of topics [8].
Abstraction and specificity are slippery notions that can differ from person to person, and the varying degrees of abstraction make them even harder to pin down. Nevertheless, the hLDA model is remarkably good at identifying and organizing topics across multiple levels of abstraction. In addition, hLDA offers several other significant advantages.
- It is able to uncover the underlying hierarchical structure appropriate to a collection, and to visualize that hierarchy for a better understanding of the structure.
- It is easy and natural to deal with the new documents thanks to the definition of the generative process.
- The topics obtained by hLDA are often more interpretable than those of other models, and its predictive performance has been demonstrated to be better as well.
- Since hLDA is a nonparametric approach, it is well suited to collections of data that grow over time.
In spite of possessing many appealing characteristics, hLDA also has some limitations.
- The inference process is quite complicated and slow since MCMC is used.
- hLDA cannot fully uncover the correlations among topics, since it captures only parent-child relationships; other significant connections, such as topics that share specific information, are not highlighted.
4.5 Bigram Topic Model
Word order is crucial for comprehending natural language, yet the topic models discussed so far ignore it. As a result, these models may perform poorly on certain natural language tasks. In this section, we explore a model that moves beyond the bag-of-words assumption.
The Bigram Topic Model (BTM), introduced by Wallach, extends LDA to better retain word correlations, particularly between adjacent words. BTM posits that the generation of a word is influenced both by a topic-specific distribution and by the preceding word; this complicates parameter inference but significantly improves the model's predictive capability. Wallach demonstrated this advantage in comparison with other models, including LDA and the hierarchical Dirichlet language model.
Assume we are given a corpus of \( M \) documents and \( V \) unique words, which we would like to model as well as possible. BTM assumes the corpus was generated as follows:
1. For each topic \( k \in \{1, \ldots, K\} \) and each word \( j \in \{1, \ldots, V\} \) of the vocabulary, draw \( \phi_{k,j} \mid \beta g \sim \mathrm{Dir}(\beta g) \).
2. For each document \( d_m \) in the corpus,
a) Draw the topic mixture \( \theta_m \mid \alpha h \sim \mathrm{Dir}(\alpha h) \).
b) For each position \( n \) in the document,
(i) Draw a topic \( z_{m,n} \mid \theta_m \sim \mathrm{Mult}(\theta_m) \).
(ii) Draw the word \( d_{m,n} \mid z_{m,n}, d_{m,n-1}, \phi \sim \mathrm{Mult}(\phi_{z_{m,n}, d_{m,n-1}}) \), i.e., from the distribution indexed by the current topic and the preceding word.
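To make the contrast with LDA explicit in this notation: the per-word emission probability in BTM is indexed jointly by the active topic and the previous word,
\[ \Pr(d_{m,n} = i \mid d_{m,n-1} = j,\ z_{m,n} = k) = \phi_{k,j,i}, \]
whereas in LDA the corresponding probability \( \beta_{k,i} \) does not depend on \( j \) at all.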
The graphical model in Figure 4.8 represents this word generation process. In the model, \( \alpha h \) and \( \beta g \) are hyperparameters, while \( \theta \) and \( \phi \) are the model's parameters.
Figure 4.8 A graphical model representation of BTM.
4.5.1 Inference by a Gibbs EM algorithm
Having specified the generative process, we turn to inferring the model's parameters. The parameters \( \theta \) and \( \phi \) are endowed with prior distributions governed by the hyperparameters \( \alpha h \) and \( \beta g \), so it suffices to infer these hyperparameters to complete the model.
The overall inference procedure consists of the following two phases:
- Phase 1: find the optimal hyperparameters, \( [\alpha h]_{\mathrm{MP}} \) of \( \alpha h \) and \( [\beta g]_{\mathrm{MP}} \) of \( \beta g \), by using a Gibbs EM algorithm.
- Phase 2: approximate the posterior distributions of words and of topics by the following formulas.
In these formulas, \( w_i \) denotes the \( i \)th word of the vocabulary. The variable \( N_{ij} \) indicates how many times word \( i \) follows word \( j \) in the corpus, while \( N_i \) counts the total occurrences of word \( i \). Additionally, \( N_{ijk} \) is the number of times word \( i \) is generated by topic \( k \) when preceded by word \( j \).
The optimal hyperparameters in phase 1 are computed by the following algorithm.
- E-step: draw \( S \) samples \( \{z^{(s)}\}_{s=1}^{S} \) from \( \Pr(z \mid w, [\alpha h]_{\mathrm{old}}, [\beta g]_{\mathrm{old}}) \) using a Gibbs sampler. The Gibbs sampler repeatedly computes the corresponding conditional for every topic \( k \).
- M-step: use a fixed-point iteration to compute \( [\alpha h]_{\mathrm{new}} \); \( [\beta g]_{\mathrm{new}} \) can be found by the same technique (one standard form of this update is sketched after the algorithm).
- Alternate these two steps until the convergence condition is met.
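The fixed-point update is not reproduced above. A standard choice for Dirichlet hyperparameters, and plausibly the form intended here (this is an assumption, following Minka's fixed-point iteration), is, for a generic Dirichlet parameter vector \( \alpha \) fitted to count vectors \( n_m \):
\[ \alpha_k^{\mathrm{new}} = \alpha_k \cdot \frac{\sum_m \big[ \Psi(n_{m,k} + \alpha_k) - \Psi(\alpha_k) \big]}{\sum_m \big[ \Psi(n_{m,\cdot} + \sum_{k'} \alpha_{k'}) - \Psi(\sum_{k'} \alpha_{k'}) \big]}, \]
where \( \Psi \) is the digamma function and \( n_{m,\cdot} = \sum_k n_{m,k} \); applying it to the topic counts and to the bigram-emission counts yields the updates for \( \alpha h \) and \( \beta g \), respectively.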
Being an extension of LDA, BTM supports the same tasks, such as identifying the topics of a corpus or of individual documents once the parameters have been inferred. By accounting for relationships between adjacent words, BTM produces topics that are more interpretable and less noisy than those of LDA, and its predictive performance, usually measured by perplexity, is also better. However, BTM does have limitations.
- The number of topics is assumed to be fixed, so the model does not fit corpora in which topics evolve over time.
- Correlations among more than two words are not modeled appropriately; BTM only places high probability on word pairs that co-occur frequently.
- This preference for frequently co-occurring word pairs can introduce bias, for instance when one word ends a sentence and the other begins the next sentence; such spurious pairs may reduce the model's predictive accuracy.
- BTM may not be applicable to kinds of data other than text, since it partially takes syntax into consideration.
SOME APPLICATIONS OF TOPIC MODELS
Classification
Probabilistic topic models identify multiple topics within a collection of documents, and each topic may appear in several documents. Topics can therefore be interpreted as classes or clusters of documents, so that uncovering topics amounts to classifying the documents into distinct categories. Consequently, many researchers have successfully used topic models for document classification, with remarkable results.
Figure 5.1 shows the findings of Blei et al. [11], which indicate that applying LDA enhances the performance of classifiers across various data sets. This suggests that LDA can significantly improve classification tasks in certain scenarios.
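As a rough illustration of this kind of pipeline (not the exact setup of [11] or [38]): a library implementation of LDA can produce low-dimensional topic-proportion features that are then fed to a linear SVM. The sketch below uses scikit-learn and a tiny synthetic corpus; all names and data are illustrative.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.svm import LinearSVC

    # Tiny synthetic corpus with two crude classes (sports vs. finance).
    docs = [
        "match goal team win league", "team player score match",
        "stock market price share fall", "bank share price profit market",
    ]
    labels = [0, 0, 1, 1]

    # Bag-of-words counts, then LDA topic proportions as low-dimensional features.
    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    features = lda.fit_transform(counts)      # shape: (n_documents, n_topics)

    # A linear SVM trained on the topic proportions (the "LDA+SVM" idea).
    clf = LinearSVC().fit(features, labels)
    print(clf.predict(features))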
More recent research by Lacoste-Julien et al. also highlights the effectiveness of probabilistic topic models in classification tasks. Their findings, summarized in Table 5.1, show that the DiscLDA model performs comparably to a traditional support vector machine classifier.
Table 5.1 Comparison of LDA+SVM, DiscLDA+SVM, and DiscLDA alone.
Analyzing research trends over time
An interesting application of Topic Modeling is analyzing collections of scientific papers and discovering the evolution of topics over time, as pointed out in [25], [10], [13], and [67].
In [25], the authors used LDA to analyze a collection of PNAS papers from 1991 to 2001, identifying various scientific topics and their classifications. They also demonstrated a method for revealing trending (hot) and declining (cold) topics over the years. Figure 5.2 shows several of the hot and cold topics identified in their analysis of the PNAS publications.
Figure 5.2 The dynamics of the three hottest and three coldest topics.
Probabilistic topic models are effective tools for uncovering research trends over time, as demonstrated in Figure 5.3, which depicts the evolution of various topics within a collection of papers published in the journal Science. For a deeper exploration of research trend analysis, refer to [13].
Figure 5.3 Evolution of topics through decades.
Semantic representation
Semantic knowledge is knowledge about the relationships among various elements such as words, concepts, and percepts. To represent it effectively, one must consider the different types of relations: between words and concepts, between concepts, between concepts and percepts, and among words themselves.
Probabilistic topic models offer better solutions for representing word correlations than traditional methods such as semantic networks and semantic spaces, particularly for tasks like prediction, disambiguation, and gist extraction. These models have demonstrated significant predictive power on various data sets and are effective in uncovering the hidden topics of documents, making them well suited to modeling semantic association. For further insights, see [27].
Landauer and Dumais proposed that Latent Semantic Analysis (LSA) serves as a theory of learning, memory, and knowledge acquisition. Their research showed that LSA can solve synonym problems, exhibiting a learning capability comparable to that of school-aged children on similarity tests. Moreover, after suitable training, the knowledge acquired by LSA matched that of foreign students on the similarity questions of TOEFL tests.
Information retrieval
Delivering the most relevant documents in response to user queries is a fundamental requirement of any information retrieval (IR) system, and significant advances have been made in this area over the years. Plain word matching remains a basic approach, but it often yields unsatisfactory results. As the demand for effective search continues to rise, there is a pressing need for searching techniques that go beyond simple word matching to ensure high-quality results.
Topic models offer effective ways to improve document representation, which in turn makes searching and ranking easier. By placing documents and queries in a common semantic space, one can identify the most relevant topics for each query and, from them, the pertinent documents. Research by Wei and Croft indicates that LDA-based retrieval consistently outperforms the alternative methods across various datasets, with average precision serving as the comparison metric, as illustrated in their findings.
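A minimal sketch of the underlying idea follows (this is not Wei and Croft's exact LBDM, which additionally interpolates with a query-likelihood language model): score each document by the probability its LDA representation assigns to the query terms. The numpy arrays theta and beta below are illustrative stand-ins for the estimated parameters.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative model outputs: theta[d, k] = P(topic k | doc d),
    # beta[k, w] = P(word w | topic k); every row sums to one.
    n_docs, n_topics, vocab_size = 5, 3, 10
    theta = rng.dirichlet(np.ones(n_topics), size=n_docs)
    beta = rng.dirichlet(np.ones(vocab_size), size=n_topics)

    def lda_score(query_word_ids, theta_d, beta):
        """Log-probability of the query under one document's LDA model:
        P(w | d) = sum_k theta[d, k] * beta[k, w]."""
        word_probs = theta_d @ beta              # P(w | d) for every vocabulary word
        return np.sum(np.log(word_probs[query_word_ids]))

    query = [2, 7]                               # word ids of the query terms
    scores = [lda_score(query, theta[d], beta) for d in range(n_docs)]
    ranking = np.argsort(scores)[::-1]           # best-matching documents first
    print(ranking)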
Table 5.2 Comparison of query likelihood retrieval (QL), cluster-based retrieval (CBDM) and retrieval with the LDA-based document models (LBDM).
More applications
The number of important applications of Topic Modeling is growing day by day. Owing to the limited time and space of this thesis, we cannot describe them all, so the previous sections discussed only some typical applications.
Topic modeling also has applications beyond text analysis, including the identification of objects in image collections. Researchers such as Biro et al. have successfully used topic models to distinguish spam from non-spam websites, improving spam filtering methods.
[21] proposed a topic model that learns the correlations among scientific papers in order to predict citation influences. Related research can be found in [19] and [49].
Many authors have also used topic models for classifying texts and visualizing documents, and these models can facilitate the learning of concepts. Further intriguing applications of topic modeling can be found in the literature.
Experimenting with some topic models
This section reports the results of the author's experiments with some probabilistic topic models.6 These experiments used the following data sets:
- A collection of 1740 papers from the NIPS conferences, volume 0 to volume 12. This collection yields a vocabulary of 13649 unique English words and 2301375 word tokens in total.
- A collection of 12525 articles from VnExpress, an electronic Vietnamese newspaper. It comprises a vocabulary of 20171 unique Vietnamese words and 1427482 word tokens in total.
The first data set was preprocessed to remove stop words and special characters and is available at http://psiexp.ss.uci.edu/research/programs_data; the original papers can be found at http://books.nips.cc/. The second collection was kindly provided by Nguyen Cam Tu.
In our experiments, we used the generative models LDA and HMM-LDA. LDA was described in Chapter 4, but our implementation employs a Gibbs sampler for parameter inference rather than the variational EM algorithm. HMM-LDA, introduced by Griffiths et al., is a non-bag-of-words model that extends the capabilities of traditional LDA.
The HMM-LDA model captures both long-range and short-range dependencies by combining a Hidden Markov Model (HMM) for the syntactic component with a topic model for the semantic component. A document is viewed as a sample from a mixture of these two models, which allows function words and content words to be discovered separately.
To demonstrate the power of topic models in uncovering topics of a given corpus or document, we applied LDA to the two data sets with the following settings.
6 Note that some experiments with linear algebra based models were presented in Chapter 3. Nonetheless, they only dealt with some toy data sets.
- The number of iterations of the Gibbs sampler: \( N = 1000 \).
We remark that the scalar \( \beta \) here should be interpreted as the hyperparameter \( \eta \) of LDA described in Section 4.3.
Table 5.3 The most probable topics from NIPS and VnExpress collections.
4 Topics from the NIPS Collection
TOPIC_10 0.0406 order 0.0284 approach 0.0199 case 0.0174 results 0.0170 number 0.0118 general 0.0106 method 0.0102 work 0.0095 terms 0.0090 problem 0.0088
TOPIC_1 0.0329 units 0.0898 hidden 0.0624 unit 0.0527 layer 0.0518 network 0.0363 input 0.0360 weights 0.0312 output 0.0298 net 0.0190 training 0.0187
TOPIC_21 0.0313 network 0.2152 neural 0.1134 networks 0.0988 input 0.0743 output 0.0564 inputs 0.0203 architecture 0.0185 outputs 0.0136 net 0.0113 layer 0.0103
TOPIC_47 0.0279 training 0.1233 set 0.0697 error 0.0631 test 0.0326 data 0.0319 generalization 0.0271 sets 0.0207 performance 0.0201 examples 0.0183 trained 0.0146
4 Topics from the VnExpress Collection
TOPIC_46 0.0340 trận_đấu 0.0328 trận 0.0323 chiến_thắng 0.0285 tỷ_số 0.0255 thắng 0.0229 vòng 0.0228 giành 0.0219 mở_rộng 0.0217 đối_thủ 0.0197 set 0.0194
TOPIC_49 0.0308 cổ_phiếu 0.0903 phiên 0.0662 giảm 0.0539 đạt 0.0532 điểm 0.0478 lệnh 0.0471 giao_dịch 0.0466 khớp 0.0404 giá_trị 0.0376 khối_lượng 0.0350
TOPIC_34 0.0301 trường 0.0829 sinh_viên 0.0565 đại_học 0.0519 học_sinh 0.0425 lớp 0.0216 đào_tạo 0.0213 giáo_dục 0.0207 tiếng 0.0183 tốt_nghiệp 0.0171 học_bổng 0.0160
TOPIC_24 0.0296 cầu_thủ 0.0453 đội 0.0441 bóng_đá 0.0287 cup 0.0262 trận 0.0247 sân 0.0209 mùa 0.0208 bóng 0.0201 đội_bóng 0.0194 đội_tuyển 0.0191
The experiments yielded easily interpretable topics. Table 5.3 lists the four most probable topics of the NIPS and VnExpress collections, along with their probabilities; each topic is illustrated by its ten most probable words and their probabilities. Notably, in the NIPS corpus the tenth topic is the most significant, having the highest probability among the 50 topics, and its top words revolve around "methods."
The analysis reveals that the NIPS corpus predominantly addresses "problems," while a lesser focus is placed on "Neural Networks." In contrast, the VnExpress corpus highlights a clear trend, with articles primarily concentrating on specific main topics.
To identify the topics of a document \( d_i \) that already appears in the corpus, we use the probabilities \( \Pr(z_k \mid \theta_i) \). Specifically, we examine all \( \Pr(z_{1:T} \mid \theta_i) \) to determine the most likely topics of that document. For instance, Table 5.4 shows the most probable topics of the paper by Abu-Mostafa published in the inaugural NIPS volume (1988).
Table 5.4 Finding the topics of a document.
4 Topics of the paper “Connectivity versus Entropy”
TOPIC_41 0.2155 probability 0.0634 distribution 0.0490 information 0.0308 density 0.0160 random 0.0153 stochastic 0.0149 entropy 0.0139 log 0.0130 distributions 0.0128 statistical 0.0101
TOPIC_36 0.2111 theorem 0.0263 bound 0.0188 threshold 0.0179 number 0.0162 proof 0.0149 size 0.0137 bounds 0.0134 dimension 0.0130 neural 0.0129 networks 0.0118
TOPIC_8 0.0967 neurons 0.0746 neuron 0.0540 activity 0.0247 connections 0.0238 phase 0.0205 network 0.0180 inhibitory 0.0120 excitatory 0.0103 fig 0.0100 inhibition 0.0091
TOPIC_26 0.0571 learning 0.3049 learn 0.0350 learned 0.0294 task 0.0264 rule 0.0243 tasks 0.0180 based 0.0113 examples 0.0106 space 0.0096 learns 0.0092
Table 5.5 Finding topics of a report.
4 Topics of the report “Phái đoàn Triều Tiên viếng cố tổng thống Hàn Quốc”
TOPIC_18 0.1022 đàm_phán 0.2484 thành_viên 0.2314 tổ_chức 0.0914 tự_do 0.0805 hiệp_định 0.0780 kinh_tế 0.0647 thỏa_thuận 0.0563 đoàn 0.0247 quốc_hội 0.0242 mở_cửa 0.0115
TOPIC_47 0.0916 tổng_thống 0.3516 hạt_nhân 0.1072 mối 0.0925 tuyên_bố 0.0677 quan_chức 0.0630 chiến_tranh 0.0625 chuyến 0.0381 khẳng_định 0.0328 nỗ_lực 0.0293 căng_thẳng 0.0250
TOPIC_42 0.0554 kinh_tế 0.8690 tình_trạng 0.0419 gia_tăng 0.0270 thống_đốc 0.0068 dừng 0.0055 tín_hiệu 0.0025 lý_thuyết 0.0006 quốc_hội 0.0002 đoàn 0.0000 quan_chức 0.0000
TOPIC_1 0.0553 xây_dựng 0.5419 dự_án 0.3218 tòa_nhà 0.0848 liên 0.0050 đoàn 0.0000 quan_chức 0.0000 trang_phục 0.0000 màu 0.0000 người 0.0000 tổng_thống 0.0000
A newly introduced document can likewise be analyzed to identify its main topics, as detailed in Section 4.3.2. For instance, Table 5.5 presents the four most likely topics of a report dated August 21, 2009, concerning North Korea ("Triều Tiên").
HMM-LDA excels at classifying words into distinct classes using only a text corpus, without requiring prior knowledge. This capability stems from its integration of Hidden Markov Models (HMM) with topic modeling. As a consequence, HMM-LDA is considerably more complex than other topic models such as LDA, BTM, and CTM. Interested readers can consult [26] for further details.
We conducted experiments on the NIPS corpus to showcase the model's capabilities, using 400 iterations, 16 states, and a hyperparameter of 0.1. The results, shown in Tables 5.6 and 5.7, give the four most probable topics and eight classes of function words obtained from training.
Table 5.6 Selected topics found by HMM-LDA.
4 Topics of the NIPS corpus
* 0.9435 behavior 0.0039 detail 0.0015 work 0.0011 air 0.0010 complex 0.0010 others 0.0010 ways 0.0009 mass 0.0009 pressure 0.0009
TOPIC_15 0.0319 units 0.2029 unit 0.1110 hidden 0.1107 layer 0.0748 weights 0.0580 network 0.0330 activation 0.0301 connections 0.0209 layers 0.0207 training 0.0206
TOPIC_21 0.0309 cells 0.0776 cell 0.0472 stimulus 0.0394 response 0.0361 visual 0.0306 stimuli 0.0253 responses 0.0205 spatial 0.0194 receptive 0.0170 input 0.0158
TOPIC_10 0.0294 state 0.0508 policy 0.0391 action 0.0350 value 0.0317 reinforcement 0.0302 actions 0.0268 control 0.0216 function 0.0175 reward 0.0162 time 0.0157
HMM-LDA finds topics that are as interpretable as those found by traditional LDA, while additionally uncovering classes of function words. Table 5.7 shows several such classes, which behave roughly like "determiners", "prepositions", "verbs", and "adjectives".
Table 5.7 Classes of function words found by HMM-LDA.
CLASS_14 0.1856 the 0.5731 a 0.1720 an 0.0330 this 0.0263 each 0.0190 these 0.0138 our 0.0110 its 0.0100 two 0.0078 all 0.0077
CLASS_1 0.0944 in 0.2797 for 0.1631 with 0.1038 on 0.0755 from 0.0621 as 0.0598 at 0.0444 by 0.0246 using 0.0241 over 0.0127
* 0.0393 same 0.0181 two 0.0154 different 0.0139 first 0.0139 single 0.0136 new 0.0123 neural 0.0116 large 0.0110 local 0.0110
CLASS_7 0.0647 used 0.0397 shown 0.0242 based 0.0150 obtained 0.0116 described 0.0115 trained 0.0110
CLASS_11 0.0594 is 0.3466 are 0.1621 can 0.0986 was 0.0544 will 0.0386 have 0.0368 were 0.0367 has 0.0270 may 0.0245 would 0.0168
CLASS_2 0.0577 of 0.8971 between 0.0312 in 0.0160 for 0.0110 to 0.0059 on 0.0057 over 0.0054 that 0.0030 and 0.0014 where 0.0013
CLASS_8 0.0551 be 0.1718 not 0.0627 been 0.0299 also 0.0254 more 0.0186 only 0.0137 then 0.0126 very 0.0117 have 0.0116
This thesis provides a comprehensive survey of Topic Modeling, highlighting its attractive features and potential applications. The author introduces a novel classification of topic models, aimed at facilitating a better understanding of the field. To the best of our knowledge, this is the first classification of topic models, offering insights not previously provided by other authors in the domain.
In addition to these primary contributions, the author presented several topic models representative of specific model categories, analyzed their advantages and disadvantages, and discussed potential extensions for some of them.
The thesis also details the author's experiments with probabilistic topic models on collections of NIPS papers and VnExpress articles. These experiments aim to establish a foundation for clearer interpretation and future research in the field.
While the thesis aims to provide a comprehensive survey of Topic Modeling, its results are incomplete. Chapter 2 offers only partial insights and does not cover all aspects of the topic. Additionally, some probabilistic topic models are discussed in depth without practical testing, which may limit the thesis's ability to fully convince readers of the author's arguments.
1 Aldous, D (1985), “Exchangeability and Related Topics”, in École d'Été de
Probabilités de Saint-Flour XIII–1983, Springer, Berlin, pp 1–198.
2 Andrieu C., Freitas N D., Doucet A, Jordan M I (2003), “An Introduction to MCMC for Machine Learning”, Machine Learning , 50, pp 5–43.
3 Asuncion A., Smyth P., Welling M (2008), “Asynchronous Distributed Learning of
Topic Models”, Advances in Neural Information Processing Systems, 20, pp 81-88.
4 Beal M J., Ghahramani Z., Rasmussen C E (2002), “The infinite hidden Markov model”, Advances in Neural Information Processing Systems, 14.
5 Berry M W., Dumais S T., O’Brien G W (1994), “Using Linear Algebra for
Intelligent Information Retrieval”, SIAM Review , 37, pp 573–595.
6 Bishop C (2006), Pattern Recognition and Machine Learning , Springer.
7 Biro I., Szabo J., Benczur A (2008), “Latent Dirichlet Allocation in Web Spam
Filtering”, In Proceedings of the Fourth International Workshop on Adversarial Information Retrieval on the Web -WWW, pp 29-32
8 Blei D M., Griffiths T L., Jordan M I (2007), “The nested Chinese restaurant process and Bayesian inference of topic hierarchies”, http://arxiv.org/abs/0710.0845 Shorter version appears in NIPS , 16, pp 17–24.
9 Blei D M., Jordan M I (2006), “Variational inference for Dirichlet process mixtures”,
10 Blei D M., Lafferty J (2007), “A correlated topic model of Science”, The Annals of Applied Statistics, 1(1), pp 17–35.
11 Blei D M., Ng A Y., Jordan M I (2003), “Latent Dirichlet allocation”, Journal of Machine Learning Research, 3, pp 993–1022.
12 Blei D., Jordan M (2003), “Modeling annotated data”, In Proceedings of the 26th annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 127-134.
13 Blei D., Lafferty J (2006), “Dynamic Topic Models”, In Proceedings of the 23rd
International Conference on Machine Learning -ICML, pp 113 - 120.
14 Blei D., McAuliffe J (2007), “Supervised topic models”, Advances in Neural
15 Boyd-Graber J., Blei D (2008), “Syntactic topic models”, Advances in Neural
16 Canini K R., Shi L., Griffths T (2009), “Online Inference of Topics with Latent
Dirichlet Allocation”, In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics –AISTATS , 5, pp 65-72.
17 Chemudugunta C., Holloway A., Smyth P., Steyvers M (2008), “Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning”, In Proceedings of the International Semantic Web Conference.
18 Chemudugunta C., Smyth P., Steyvers M (2006), “Modeling General and Specific
Aspects of Documents with a Probabilistic Topic Model”, Advances in Neural Information Processing Systems, 18.
19 Cohn D., Hofmann T (2000), “The missing link - a probabilistic model of document content and hypertext connectivity”, Advances in Neural Information Processing Systems, 12.
20 Deerwester S., Dumais S T., Fumas G W., Landauer T K., Harshman R (1990),
“Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science, 41, pp 391-407.
21 Dietz L., Bickel S., Scheffer T (2007), “Unsupervised Prediction of Citation
Influences”, In Proceedings of the 24th International Conference on Machine Learning –ICML, pp 233 - 240.
22 Fei-Fei L., Perona P (2005), “A Bayesian hierarchical model for learning natural scene categories”, In Proceedings of the International Conference on Computer Vision and Pattern Recognition -CVPR, 2, pp 524-531.
23 Foltz P W., Kintsch W., Landauer T K (1998), “The measurement of textual coherence with Latent Semantic Analysis”, Discourse Processes , 25, pp 285-307.
24 Gomes R., Welling M., Perona P (2008), “Memory Bounded Inference in Topic
Models”, In Proceedings of the 25th International Conference on Machine Learning -ICML, pp 344-351
25 Griffiths T L., Steyvers M (2004), “Finding scientific topics”, Proceedings of the
National Academy of Sciences, USA, 101, pp 5228–5235.
26 Griffiths T L., Steyvers M., Blei D M., Tenenbaum J B (2005), “Integrating topics and syntax”, Advances in Neural Information Processing Systems, 17 , pp 537–544.
27 Griffiths T L., Steyvers M., Tenenbaum J (2007), “Topics in Semantic Representation”, Psychological Review, 114(2), pp 211–244.
28 Gruber A., Rosen-Zvi M., Weiss Y (2007), “Hidden topic Markov models”, In
Proceedings of Artificial Intelligence and Statistics -AISTATS, 2, pp 163-170.
29 Hieu P X., Minh N L., Horiguchi S (2008), “Learning to Classify Short and Sparse
Text & Web with Hidden Topics from Large-scale Data Collections”, In Proceed- ings of the 17th International World Wide Web Conference -WWW, pp 91-100.
30 Hofmann T (1999), “Probabilistic latent semantic indexing”, In Proceedings of the
22sd Annual International SIGIR Conference, pp 50-57.
31 Hofmann T (2001), “Unsupervised Learning by Probabilistic Latent Semantic Analysis”, Machine Learning, 42, pp 177–196.
32 Iwata T., Yamada T., Ueda N (2008), “Probabilistic Latent Semantic Visualization:
Topic Model for Visualizing Documents”, In Proceedings of The 14 th ACM SIGKDD Inter Conference on Knowledge Discovery and Data Mining -KDD.
33 Jordan M., Ghahramani Z., Jaakkola T., Saul L (1999), “Introduction to variational methods for graphical models”, Machine Learning, 37, pp 183–233.
34 Kim S., Smyth P (2007), “Hierarchical Dirichlet Processes with random effects”,
Advances in Neural Information Processing Systems, 19.
35 Kintsch W (2001), “Predication”, Cognitive Science , 25, pp 173–202.
36 Kurihara K., Welling M., Teh Y W (2007), “Collapsed Variational Dirichlet Process
Mixture Models”, In Proceedings of the 21st Joint Conference on Artificial Intelligence -IJCAI, pp 2796-2801.
37 Kurihara K., Welling M., Vlassis N (2007), “Accelerated variational DP mixture models”, Advances in Neural Information Processing Systems, 19
38 Lacoste-Julien S., Sha F., Jordan M I (2008), “DiscLDA: Discriminative Learning for
Dimensionality Reduction and Classification”, Advances in Neural Information Processing Systems, 20.
39 Landauer T K., Dumais S T (1997), “A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge”, Psychological Review , 104, pp 211–240.
40 Landauer T K., Foltz P W., Laham D (1998), “Introduction to latent semantic analysis”, Discourse Processes , 25, pp 259–284
41 Lavrenko V., Manmatha R., Jeon J (2003), “A Model for Learning the Semantics of
Pictures”, Advances in Neural Information Processing Systems, 15.
42 Lee D D., Seung H S (1999), “Learning the parts of objects by non-negative matrix factorization”, Nature , 401, pp 788-791.
43 León J A., Olmos R., Escudero I., Cañas J., Salmerón L (2005), “Assessing Short
Summaries With Human Judgments Procedure and LSA in narrative and expository texts”, Behavioral Research Methods , 38(4), pp 616-627.
44 McCallum A., Corrada-Emmanuel A., Wang X (2005), “Topic and Role Discovery in
Social Networks”, In Proceedings of the 19th Joint Conference on Artificial Intelligence -IJCAI, pp 786-791.
45 Mei Q., Cai D., Zhang D., Zhai C (2008), “Topic Modeling with Network
Regularization”, In Proceedings of the 17th International World Wide Web Conference -WWW, pp 101-110.
46 Michael W., Berry M W., Drmac D., Jessup E R (1999), “Matrices, Vector Spaces, and Information Retrieval”, Siam Review, 41(2), pp 335–362.
47 Minka T., Lafferty J (2002), “Expectation–propagation for the generative aspect model”, In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence -UAI, pp 352–359
48 Mukherjee I., Blei D M (2008), “Relative Performance Guarantees for
Approximate Inference in Latent Dirichlet Allocation”, Advances in Neural Information Processing Systems, 20.
49 Nallapati R., Ahmed A., Xing E P., Cohen W W (2008), “Joint Latent Topic Models for Text and Citations”, In Proceedings of The 14 th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining -KDD.
50 Navarro D J., Griffiths T (2008), “Latent Features in Similarity Judgments: A
Nonparametric Bayesian Approach”, Neural Computation , 20, pp 2597–2628.
51 Newman D., Asuncion A., Smyth P., Welling M (2007), “Distributed inference for latent Dirichlet allocation”, Advances in Neural Information Processing Systems, 19.
52 Newman D., Chemudugunta C., Smyth P., Steyvers M (2006), “Analyzing Entities and Topics in News Articles Using Statistical Topic Models”, In Proceedings of Intelligence and Security Informatics, LNCS 3975, Springer-Verlag, pp 93–104.
53 Newman D., Chemudugunta C., Smyth P., Steyvers M (2006), “Statistical Entity-
Topic Models”, In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining -KDD, pp 680 - 686.
54 Papadimitriou C., Tamaki H., Raghavan P., Vempala S (1998), “Latent semantic indexing: A probabilistic analysis”, In Proceedings of the 17th ACM SIGACT- SIGMOD-SIGART symposium on Principles of database systems, pp 159-168.
55 Porteous I., Newman D., Ihler A., Asuncion A., Smyth P., Welling M (2008), “Fast
Collapsed Gibbs Sampling For Latent Dirichlet Allocation”, In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining –KDD , pp 569-577.
56 Rosen-Zvi M., Griffiths T., Steyvers M., Smyth P (2006), “Learning Author Topic
Models from Text Corpora”, ACM Transactions on Information Systems.
57 Roweis S T., Saul L K (2000), “Nonlinear Dimensionality Reduction by Locally Linear Embedding”, Science, 290, pp 2323–2326.
58 Steyvers M., Griffiths T L., Dennis S (2006), “Probabilistic inference in human semantic memory”, TRENDS in Cognitive Sciences , 10(7), pp 327-334.
59 Tam Y C., Schultz T (2008), “Correlated Bigram LSA for Unsupervised Language
Model Adaptation”, Advances in Neural Information Processing Systems, 20
60 Teh Y W., Kurihara K., Welling M (2008), “Collapsed variational inference for
HDP”, Advances in Neural Information Processing Systems, 20.
61 Teh Y W., Newman D., Welling M (2007), “A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation”, Advances in Neural Information Processing Systems, 19.
62 Teh Y., Jordan M., Beal M., Blei D (2006), “Hierarchical Dirichlet Processes”,
Journal of the American Statistical Association, 101(476), pp 1566-1581.
63 Tenenbaum J B., Griffiths T L (2001), “Generalization, similarity, and Bayesian inference”, Behavioral and Brain Sciences , 24, pp 629–640.
64 Tenenbaum J B., Silva V., Langford J (2000), “A Global Geometric Framework for
Nonlinear Dimensionality Reduction”, Science , 290, pp 2319-2322.
65 Toutanova K., Johnson M (2008), “A Bayesian LDA-based model for semi-supervised part-of-speech tagging”, Advances in Neural Information Processing Systems, 20.
66 Wallach H (2006), “Topic Modeling: Beyond Bag-of-Words”, In Proceedings of the
23rd International Conference on Machine Learning –ICML , pp 977 - 984.
67 Wang C., Blei D., Heckerman D (2008), “Continuous Time Dynamic Topic Models”,
Advances in Neural Information Processing Systems, 20.
68 Wang L., Dunson D B (2007), “Fast Bayesian Inference in Dirichlet Process Mixture
69 Wang X., Grimson E (2007), “Spatial Latent Dirichlet Allocation”, Advances in
70 Wang X., Mohanty N., McCallum A (2005), “Group and Topic Discovery from
Relations and Text”, In Proceedings of the 3rd international workshop on Link discovery –LinkKDD, pp 28-35
71 Wainwright M J., Jordan M I (2008), “Graphical Models, Exponential Families, and
Variational Inference”, Foundations and Trends in Machine Learning, 1(1–2), pp 1-305.
72 Wei X., Croft B (2006), “LDA-based document models for ad-hoc retrieval”, In
Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp 178-185.
73 Wolfe M B., Schreiner M E., Rehder B., Laham D., Foltz P W., Kintsch W.,
Landauer T (1998), “Learning from text: Matching readers and text by latent semantic analysis”, Discourse Processes , 25, pp 309–336.
74 Yu K., Yu S., Tresp V (2005), “Dirichlet Enhanced Latent Semantic Analysis”,
Advances in Neural Information Processing Systems, 17.
75 Zheng B., McLean D C, Lu X (2006), “Identifying biological concepts from a protein-related corpus with a probabilistic topic model”, BMC Bioinformatics , 7(58), pp 1-10.
76 Zhu J., Ahmed A., Xing E (2009), “MedLDA: Maximum Margin Supervised Topic
Models for Regression and Classification”, In Proceedings of the 26th International Conference on Machine Learning.
Topic Modeling is a significant area of research in Artificial Intelligence, with numerous applications such as automatic document indexing, identifying topical communities in scientific literature, enhancing spam filtering, and tracking the evolution of scientific topics over time. This thesis surveys recent advances in Topic Modeling, highlighting key characteristics and future directions for model development while assessing the strengths and weaknesses of various models. It also discusses potential extensions for certain models and presents experimental results from applying these models to data sets from the NIPS conferences and from VnExpress, a Vietnamese electronic newspaper.
Keywords: topic modeling, topic models, semantic representation, graphical models, knowledge discovery.