Balancing Clarity and Efficiency in Typed Feature Logic through Delaying
Gerald Penn
University of Toronto
10 King’s College Rd.
Toronto M5S 3G4
Canada
gpenn@cs.toronto.edu
Abstract
The purpose of this paper is to re-examine the bal-
ance between clarity and efficiency in HPSG design,
with particular reference to the design decisions
made in the English Resource Grammar (LinGO,
1999, ERG). It is argued that a simple generaliza-
tion of the conventional delay statements used in
logic programming is sufficient to restore much of
the functionality and concomitant benefit that the
ERG elected to forego, with an acceptable although
still perceptible computational cost.
1 Motivation
By convention, current HPSGs consist, at the very
least, of a deductive backbone of extended phrase
structure rules, in which each category is a descrip-
tion of a typed feature structure (TFS), augmented
with constraints that enforce the principles of gram-
mar. These principles typically take the form of
statements, “for all TFSs, ψ holds,” where ψ is
usually an implication. Historically, HPSG used
a much richer set of formal descriptive devices,
however, mostly on analogy to developments in
the use of types and description logics in program-
ming language theory (Aït-Kaci, 1984), which had
served as the impetus for HPSG’s invention (Pol-
lard, 1998). This included logic-programming-style
relations (Höhfeld and Smolka, 1988), a powerful
description language in which expressions could de-
note sets of TFSs through the use of an explicit
disjunction operator, and the full expressive power
of implications, in which antecedents of the above-
mentioned ψ principles could be arbitrarily com-
plex.
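To make these devices concrete, the contrast can be sketched schematically as follows; the types and feature values are illustrative placeholders only, not drawn from the ERG or from any published grammar. Writing φ for an arbitrary consequent description, a principle whose antecedent is a single type has the first shape below, while the richer description language also admits the second, whose antecedent is a complex description containing an explicit disjunction:

  \forall x \,[\, \mathit{sign}(x) \rightarrow \varphi(x) \,]   (type antecedent)

  \forall x \,[\, (\mathit{sign}(x) \wedge (\textsc{head}{:}\mathit{verb} \vee \textsc{head}{:}\mathit{adj})(x)) \rightarrow \varphi(x) \,]   (complex antecedent with disjunction)

Only implications of the first, type-antecedent kind survive in the restricted selection of devices discussed below.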
Early HPSG-based natural language processing
systems faithfully supported large chunks of this
richer functionality, in spite of their inability to han-
dle it efficiently — so much so that when the de-
signers of the ERG set out to select formal descrip-
tive devices for their implementation with the aim
of “balancing clarity and efficiency” (Flickinger,
2000), they chose to include none of these ameni-
ties. The ERG uses only phrase-structure rules and
type-antecedent constraints, pushing all would-be
description-level disjunctions into its type system or
rules. In one respect, this choice was successful, be-
cause it did at least achieve a respectable level of
efficiency. But the ERG’s selection of functionality
has acquired an almost liturgical status within the
HPSG community in the intervening seven years.
Keeping this particular faith, moreover, comes at a
considerable cost in clarity, as will be argued below.
This paper identifies what it is precisely about
this extra functionality that we miss (modularity,
Section 2), determines what it would take at a mini-
mum computationally to get it back (delaying, Sec-
tion 3), and attempts to measure exactly how much
that minimal computational overhead would cost
(about 4 µs per delay, Section 4). This study has
not been undertaken before; the ERG designers’
decision was based on largely anecdotal accounts
of performance relative to then-current implemen-
tations that had not been designed with the inten-
tion of minimizing this extra cost (indeed, the ERG
baseline had not yet been devised).
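Readers unfamiliar with delay statements may find a procedural analogy helpful. In logic programming, a delayed (or suspended) goal is one that is not executed until some designated variable becomes sufficiently instantiated. The following minimal Python sketch imitates that behaviour; the names (Var, delay, bind) and the representation are invented for illustration and do not correspond to any actual grammar-development system or to the generalization proposed in Section 3.

class Var:
    """A logic variable that is either unbound or bound to a value,
    and that records the goals suspended on it."""
    def __init__(self):
        self.value = None        # None means "still unbound"
        self.suspended = []      # goals waiting for this variable

    def bind(self, value):
        """Bind the variable, then resume every goal delayed on it."""
        self.value = value
        goals, self.suspended = self.suspended, []
        for goal in goals:
            goal(value)

def delay(var, goal):
    """Run goal(value) immediately if var is bound; otherwise suspend it."""
    if var.value is not None:
        goal(var.value)
    else:
        var.suspended.append(goal)

# The constraint below is checked only once x receives a value.
x = Var()
delay(x, lambda v: print("constraint checked on", v))
x.bind("finite")                 # the delayed goal runs here

Roughly speaking, the per-delay overhead measured in Section 4 is the cost of creating and later resuming suspensions of this kind.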
2 Modularity: the cost in clarity
Semantic types and inheritance serve to organize
the constraints and overall structure of an HPSG
grammar. This is certainly a familiar, albeit vague
justification from programming languages research,
but the comparison between HPSG and modern
programming languages essentially ends with this
statement.
Programming languages with inclusional poly-
morphism (subtyping) invariably provide functions
or relations and allow these to be reified as meth-
ods.
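As an illustration of what such reification looks like in an ordinary object-oriented language (the class names below are invented and carry no HPSG significance), behaviour attached to a supertype is inherited and specialized by its subtypes, and calls are dispatched through the type hierarchy:

class Sign:
    """Supertype: declares the relation as a method."""
    def combine(self, other):
        raise NotImplementedError

class Phrase(Sign):
    """Subtype: supplies a generic implementation."""
    def combine(self, other):
        return "phrase(" + other + ")"

class HeadCompPhrase(Phrase):
    """More specific subtype: overrides the inherited behaviour."""
    def combine(self, other):
        return "head-comp(" + other + ")"

# Dispatch follows the run-time type of each object.
for s in (Phrase(), HeadCompPhrase()):
    print(s.combine("np"))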