... 4a and 4b, evaluation metrics always correlate better on the initial task than on the update task. This suggests that there is much room for improvement for readability metrics, and metrics need ... DICOMER – a DIscourse COherence Model for Evaluating Readability.
LIN outperforms all metrics on all correlations on
both tasks. On the initial task, it outperforms the
best scores by 3.62%, 16.20%, ... Explicit/Non-Explicit
information, and demonstrate that they improve the
original model.
There are parallels between evaluations of machine translation (MT) and summarization with respect to textual content. For...
... offering a rich set of metrics and meta-metrics for assessing MT quality (Giménez and Màrquez, 2010a). Although automatic MT evaluation is still far from manual evaluation, it is indeed ... Association for Computational Linguistics, pages 139–144, Jeju, Republic of Korea, 8–14 July 2012.
© 2012 Association for Computational Linguistics
A Graphical Interface for MT Evaluation and ... existing evaluation measures and to support the development of further improvements or even totally new evaluation metrics. This information can be gathered both from the experiments ...
Figure 1: MT...
... word alignment information.
3 Experiments
3.1 PORT as an Evaluation Metric
We studied PORT as an evaluation metric on
WMT data; test sets include WMT 2008, WMT
2009, and WMT 2010 all-to-English, ... Birch and M. Osborne. 2011. Reordering Metrics for MT. In Proceedings of ACL.
C. Callison-Burch, C. Fordyce, P. Koehn, C. Monz and J. Schroeder. 2008. Further Meta-Evaluation of Machine Translation. ... and 22.0% ties).
1 Introduction
Automatic evaluation metrics for machine translation (MT) quality are a key part of building statistical MT (SMT) systems. They play two
PORT: Precision-Order-Recall...
... human assessment are higher than standard automatic evaluation metrics.
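The correlation with human assessment referred to here is usually computed at the system or segment level. As a minimal sketch, here is a plain Pearson correlation between metric scores and human judgments; the score lists below are invented for illustration and are not taken from the WMT data discussed in this paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical system-level scores: one metric value and one human
# adequacy score per MT system (illustrative numbers only).
metric_scores = [0.31, 0.28, 0.35, 0.22]
human_scores = [3.1, 2.9, 3.4, 2.5]
print(round(pearson(metric_scores, human_scores), 3))
```

In practice one would compute this per test set and compare metrics by which correlates more strongly with the human scores.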
2 MT Evaluation
Recent automatic evaluation metrics typically frame the evaluation problem as a comparison task: how similar ... invaluable resource for measuring the reliability of automatic evaluation metrics. In this paper, we show that they are also informative in developing better metrics.
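The comparison-task framing can be made concrete with a toy similarity computation. This is a minimal sketch assuming a single reference and whitespace tokenization; it illustrates the generic precision/recall comparison only, not the PORT metric itself.

```python
from collections import Counter

def precision_recall_f(hypothesis, reference):
    """Unigram precision, recall and F1 of an MT hypothesis against a
    single reference; hypothesis counts are clipped by the reference."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    overlap = sum(min(c, ref[w]) for w, c in hyp.items())
    p = overlap / sum(hyp.values())
    r = overlap / sum(ref.values())
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# 5 of the 6 hypothesis tokens are matched by the reference
print(precision_recall_f("the cat sat on the mat",
                         "the cat is on the mat"))
```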
3 MT Evaluation with Machine ... Meeting of the Association for Computational Linguistics, July.
Chin-Yew Lin and Franz Josef Och. 2004b. ORANGE: a method for evaluating automatic evaluation metrics for machine translation....
... these metrics cor-
relate highly with human judgments.
1 Introduction
Machine paraphrasing has many applications for natural language processing tasks, including machine translation (MT), MT evaluation, ... Paraphrase Evaluation Metrics
One of the limitations to the development of ma-
chine paraphrasing is the lack of standard metrics
like BLEU, which has played a crucial role in driv-
ing progress in MT. ... for what constitutes a high-quality para-
phrase. In addition to the lack of standard datasets
for training and testing, there are also no standard
metrics like BLEU (Papineni et al., 2002) for...
... Similarity Metrics
We begin by defining a set of 22 similarity metrics
taken from the list of standard evaluation metrics
in Subsection 2.1. Evaluation metrics can be tuned into similarity metrics ... families
of similarity metrics form a set of 104 metrics. Our
goal is to obtain the subset of metrics with highest
descriptive power; for this, we rely on the KING
probability. A brute force exploration ... references:
ORANGE was introduced by Lin and Och (2004b) for the meta-evaluation of MT evaluation metrics. The measure provides information about the average behavior of automatic and manual...
... R² for the family of metrics AEv(α, N), for correctness scores, second QA evaluation
A Unified Framework for Automatic Evaluation using
N-gram Co-Occurrence Statistics
Radu SORICUT
Information ...
penalized). Another evaluation we consider in this paper, the DUC 2001 evaluation for Automatic Summarization (also performed by NIST), had specific guidelines for coverage evaluation, which ... Unified Framework for Automatic
Evaluation
In this section we propose a family of evaluation
metrics based on N-gram co-occurrence statistics.
Such a family of evaluation metrics provides...
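As a hedged sketch of what one member of such an N-gram co-occurrence family might look like, the following computes the clipped fraction of candidate n-grams that also occur in the reference, indexed by n. The function name and tokenization are illustrative; this is not the paper's AEv(α, N) definition.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_cooccurrence(candidate, reference, n):
    """Fraction of candidate n-grams also present in the reference
    (clipped by reference counts): one member, indexed by n, of a
    family of N-gram co-occurrence scores."""
    cand = ngrams(candidate.split(), n)
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(cand)
    hits = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return hits / max(len(cand), 1)

for n in (1, 2, 3):
    print(n, ngram_cooccurrence("a b c d", "a b c e", n))
```

Varying n trades lexical matching (n = 1) against fluency/order sensitivity (higher n), which is what makes a parameterized family useful.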
... used in the vector-space model for Information Retrieval (Salton and Lesk, 1968) and the S-score proposed for
evaluating MT output corpora for the purposes of
Information Extraction (Babych ... scores for both runs were
compared using a standard deviation measure.
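The vector-space comparison referred to above can be sketched as plain cosine similarity over term-frequency vectors. This is a simplification for illustration; the S-score itself is defined differently.

```python
from collections import Counter
from math import sqrt

def cosine(text_a, text_b):
    """Cosine similarity between term-frequency vectors, in the spirit
    of the vector-space model for Information Retrieval."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(c * c for c in a.values()))
    nb = sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb)

print(round(cosine("the cat", "the dog"), 3))  # 0.5: one shared token
```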
3. The results of the MT evaluation with frequency weights
With respect to evaluating MT systems, the correlation for ... for translation: MT systems that have no
means for prioritising this information often in-
troduce excessive information noise into the tar-
get text by literally translating structural
information,...
... 9000 factors for evaluation and strategic university planning. For the implementation, a Web-based DSS is built on ISO 9000 factors for the evaluation and strategic planning of a case study ... alternatives for an evaluation model / strategic university planning.
3. DSS model application for evaluation and strategy planning
3.1. Application model using ISO 9000 factors for a strategic ... The fourth step is to analyze the hierarchy model using ISO 9000 factors for evaluation and strategic planning. The final step is to build a Web-based DSS application based on the AHP model for...
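As a hedged sketch of the AHP analysis step, the following uses the common normalized-column-average approximation of the priority (eigen)vector. The pairwise judgments among three ISO 9000 factors are purely illustrative and not taken from the case study.

```python
def ahp_priorities(matrix):
    """Approximate AHP priority weights from a pairwise-comparison
    matrix by averaging the normalized columns (a standard shortcut
    for the principal eigenvector)."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [sum(matrix[i][j] / col_sums[j] for j in range(n)) / n
            for i in range(n)]

# Hypothetical judgments: factor 1 is 3x as important as factor 2,
# 5x as important as factor 3, etc. (illustrative only).
m = [[1, 3, 5],
     [1 / 3, 1, 3],
     [1 / 5, 1 / 3, 1]]
print(ahp_priorities(m))
```

The resulting weights sum to one and rank the factors; in a full AHP application one would also check the consistency ratio of the judgment matrix.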
... on
overall driving forces for education reforms be consid-
ered (Figure 5).
Indicators
Finally, we deduce ten core indicators from the above framework for the purpose of monitoring and evaluation
via ... higher policy
and decision-making fora, but equally - and potentially more important - they can be bottom-up, that is, promoted
and enforced by the health workforce, for instance by
means of addressing ... the evaluation
of educational interventions or the monitoring of curri-
culum development during education reforms. It further
suggests comprehensive consideration of the driving
forces for education...
... tabular form CN, and E_i(k) to denote the cell at the k-th row and the i-th column. W(k) is the weight for E(k), and W_i(k) = W(k) is the weight for E_i(k). p_i(k) is the normalized weight for ... newsgroup sections of MT06,
whereas the test set is the entire MT08. The 10-best translations for every source sentence in the dev and test sets are collected from eight MT systems. Case-insensitive ... Open MT evaluation.
1 Introduction
Word-level combination using confusion networks (Matusov et al., 2006; Rosti et al., 2007) is a widely adopted approach for combining Machine Translation (MT) ...
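One plausible reading of the normalized cell weight p_i(k) mentioned earlier is the cell weight divided by the total weight of its confusion-network row. The sketch below makes that assumption explicit; it is an illustration, not necessarily the paper's exact definition.

```python
def normalized_weights(row_weights):
    """Normalize confusion-network cell weights so a row sums to one:
    p_i(k) = W_i(k) / sum_j W_j(k).
    The normalization rule is an assumption made for illustration,
    not taken verbatim from the paper."""
    total = sum(row_weights)
    return [w / total for w in row_weights]

# e.g. three candidate words competing in one confusion-network slot
print(normalized_weights([2.0, 1.0, 1.0]))
```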
... 2006.
© 2006 Association for Computational Linguistics
An Automatic Method for Summary Evaluation
Using Multiple Evaluation Results by a Manual Method
Hidetsugu Nanba
Faculty of Information Sciences, ... section, are
necessary for a more accurate summary
evaluation.
3 Investigation of an Automatic Method
using Multiple Manual Evaluation
Results
3.1 Overview of Our Evaluation Method
and ... Consortium.
http://www.nist.gov/speech/tests/mt/mt2001/resource/
tested ROUGE and cosine distance, both of
which have been used for summary evaluation.
If a score by Yasuda’s method exceeds...
... is, therefore, how to find informative
metrics, and then how to combine them into an op-
timal single quality estimation for automatic sum-
maries. The most immediate way of combining
metrics is ... and (iii) test whether
evaluating with that test-bed is reliable (JACK
measure).
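The "most immediate way of combining metrics" mentioned above can be sketched as a weighted linear combination of individual metric scores. The component scores and weights below are purely illustrative, not values from the paper.

```python
def combine_metrics(scores, weights):
    """Weighted linear combination of metric scores, normalized by the
    weight mass: the simplest single quality estimate for a summary."""
    assert len(scores) == len(weights)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# e.g. hypothetical ROUGE-like, cosine and length-ratio scores
print(combine_metrics([0.42, 0.55, 0.90], [0.5, 0.3, 0.2]))
```

In practice the weights themselves would be fit against human judgments rather than chosen by hand.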
2 Formal constraints on any evaluation
framework based on similarity metrics
We are looking for a framework to evaluate ... Lin. 2004. ORANGE: a Method for Evaluating Automatic Metrics for Machine Translation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING)...
... whole corpus (BNC). C is the total number of
categories. W stands for Written, S for Spoken. C1, C2,
DE, UN are demographic classes for the spontaneous
conversations, no-cat is the BNC undefined category.
ples ... to
investigate how the choice of the biased sampling
method affects the performance of our procedure
and its relations to uniform sampling.
3.1 Corpora as unigram distributions
A compact way of representing ... collections of documents is closely related to the similarity of the
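The "corpora as unigram distributions" representation from Subsection 3.1 can be sketched as relative word-type frequencies over a corpus; the toy corpus below is illustrative only.

```python
from collections import Counter

def unigram_distribution(tokens):
    """Represent a corpus as a unigram distribution: the relative
    frequency of each word type."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(unigram_distribution("the cat the dog".split()))
```

Two corpora represented this way can then be compared with any distributional similarity measure (cosine, KL divergence, and so on).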
A Figure of Merit for the Evaluation of Web-Corpus Randomness
Massimiliano Ciaramita
Institute of Cognitive Science and...