... 70% of the time, the judges chose between only 2 of them. For the items with 5 possible alternatives, in 10% of those items the human judges chose only 1 of those alternatives; in 30% of cases, ... can only give a rough impression of the quality of the system output. It is unclear, however, what kind of metric would be most suitable for the evaluation of string realisations, so that, ... basis of a given input f-structure. In these experiments, we use f-structures from their held-out and test sets, of which 96% can be associated with surface realisations by the grammar. F-structures...