A survey on evaluation of summarization methods

Authors:

Highlights:

• Manual assessment is not reusable.

• Re-use of the gold standard by non-participants is often problematic.

• Overlap-based metrics are not suitable for full-text comparison-based evaluation (a minimal overlap-metric sketch follows this list).

• GRAD outperforms word-based metrics at distinguishing generated summaries from human-written ones.

• Overlap metrics and GRAD can identify a text's native abstract among abstracts of different texts.

• Existing metrics, except GEM, yield relative values and are therefore not directly interpretable.

• The majority of the metrics are normalized, but in practice their values tend toward 0.
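
For readers unfamiliar with the overlap-based metrics mentioned in the highlights, below is a minimal, generic sketch of an n-gram overlap recall in the style of ROUGE-N, assuming simple whitespace tokenization. It is an illustration only, not the exact scoring procedure analyzed in the survey.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N style recall: fraction of reference n-grams covered by the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# Toy usage: comparing a candidate summary against a reference summary.
print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=1))
```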

Keywords: Automatic summarization, Text compression, Evaluation campaigns, Assessment metrics, Extraction, Extractive summarization, ROUGE

Article history: Received 25 July 2018, Revised 1 April 2019, Accepted 3 April 2019, Available online 20 April 2019, Version of Record 20 June 2019.

DOI: https://doi.org/10.1016/j.ipm.2019.04.001