Evaluating diagnostic content of AI-generated radiology reports of chest X-rays

Highlights:

• BLEU scores of generated reports can be high while they lack diagnostically important information.

• We investigate this problem and propose a new measure that quantifies the diagnostic content of AI-generated radiology reports.

• In addition, we exploit the standardization of reports by generating a sequence of sentences, because in practice most radiologists use a well-focused vocabulary of ‘standard’ sentences.

Abstract

• Conventional quality metrics for natural language processing methods, such as the popular BLEU score, provide little information about the quality of the diagnostic content of AI-generated radiology reports.
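To make the gap between BLEU and diagnostic content concrete, here is a minimal Python sketch. It rests on several assumptions: the two reports are invented examples, `bleu_like` is a simplified BLEU-style score (clipped unigram and bigram precision with a brevity penalty, not a reference implementation), and `finding_agreement` is a naive, hypothetical keyword check. It is not the measure proposed in the paper, which is not described in this section.

```python
from collections import Counter
from math import exp, log


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0


def bleu_like(candidate, reference, max_n=2):
    """Simplified BLEU-style score (assumption: uniform weights, single reference)."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = exp(1 - len(reference) / len(candidate))
    return penalty * exp(sum(log(p) for p in precisions) / max_n)


# Hypothetical finding vocabulary, chosen only for this illustration.
FINDINGS = ("pleural effusion", "pneumothorax")


def finding_status(report):
    """Naive check: a finding is 'absent' if its sentence contains the word 'no'."""
    status = {}
    for sentence in report.split(" . "):
        words = sentence.split()
        for finding in FINDINGS:
            if finding in sentence:
                status[finding] = "absent" if "no" in words else "present"
    return status


def finding_agreement(candidate, reference):
    """Fraction of findings on which the two reports agree."""
    cand, ref = finding_status(candidate), finding_status(reference)
    return sum(cand.get(f) == ref.get(f) for f in FINDINGS) / len(FINDINGS)


# Invented reports that differ in a single, diagnostically decisive word.
reference = ("the heart size is normal . there is no pleural effusion . "
             "there is no pneumothorax .")
generated = ("the heart size is normal . there is a pleural effusion . "
             "there is no pneumothorax .")

score = bleu_like(generated.split(), reference.split())
agreement = finding_agreement(generated, reference)
print(f"BLEU-like score: {score:.2f}, finding agreement: {agreement:.2f}")
```

In this toy case the generated report differs from the reference by one word, so the n-gram score stays above 0.9, yet the two reports agree on only one of the two findings.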

Article history: Received 25 February 2020, Revised 5 February 2021, Accepted 7 April 2021, Available online 15 April 2021, Version of Record 26 April 2021.