... runs using an optimal threshold (box 3) for the experiment as determined by using the test set. In all remaining experiments, we learn the threshold from the training set as in the BASELINE ... including the number of documents, annotated CEs, coreference chains, annotated CEs per chain (average), and number of documents in the train/test split. We use st to indicate a standard train/test ... coreference chain to which ce is assigned in the response (i.e. the system-generated output) and K_ce is the coreference chain that contains ce in the key (i.e. the gold standard). Precision and recall...
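
The chain-based scoring described above can be illustrated with a minimal sketch. Because the passage is truncated before the formulas, the code assumes the commonly used B^3 definition: for each CE, precision is the overlap between its response chain and its key chain divided by the size of the response chain, recall is the same overlap divided by the size of the key chain, and both are averaged over all CEs. The function name `b_cubed`, the name `resp_of` for the response chain of a CE, and the requirement that key and response contain exactly the same CEs are assumptions made for illustration, not details taken from the text.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Tuple


def b_cubed(
    key_chains: Iterable[Iterable[Hashable]],
    response_chains: Iterable[Iterable[Hashable]],
) -> Tuple[float, float]:
    """Return (precision, recall) averaged over all CEs.

    Assumption: every CE appears in exactly one key chain and exactly
    one response chain (singletons are chains of size one).
    """
    # Map each CE to the chain that contains it in the key (K_ce).
    key_of: Dict[Hashable, FrozenSet[Hashable]] = {}
    for chain in key_chains:
        members = frozenset(chain)
        for ce in members:
            key_of[ce] = members

    # Map each CE to the chain it is assigned to in the response.
    resp_of: Dict[Hashable, FrozenSet[Hashable]] = {}
    for chain in response_chains:
        members = frozenset(chain)
        for ce in members:
            resp_of[ce] = members

    if not key_of:
        raise ValueError("no CEs to score")
    if set(key_of) != set(resp_of):
        raise ValueError("key and response must contain the same CEs")

    precision_sum = 0.0
    recall_sum = 0.0
    for ce, key_chain in key_of.items():
        overlap = len(resp_of[ce] & key_chain)
        precision_sum += overlap / len(resp_of[ce])
        recall_sum += overlap / len(key_chain)

    n = len(key_of)
    return precision_sum / n, recall_sum / n


# Hypothetical toy data: four CEs, two key chains, two response chains.
key = [{"a", "b", "c"}, {"d"}]
response = [{"a", "b"}, {"c", "d"}]
precision, recall = b_cubed(key, response)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In the toy example, the CEs a and b sit in a response chain that is a subset of their key chain, so each has precision 1 and recall 2/3, while c and d share a response chain that mixes two key chains and each has precision 1/2. The sketch does not handle CEs that appear in only one of the key or the response; that case needs an explicit policy and is rejected here with an error.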