How can large language models produce more information-dense summaries? Chain of Density Prompting, a new method from Salesforce AI

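When summarizing text, the information density of the summary matters a great deal. Ideally a summary covers as much of the important information as possible while staying concise and coherent. Researchers from Salesforce AI, MIT, and other institutions have jointly published a new prompting technique, Chain of Density Prompting, that produces concise yet informative summaries.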

The problem with using large language models for summarization

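We want a summary to cover as much important information as possible while remaining coherent and easy to read. These two goals trade off against each other: packing in more information tends to reduce readability. So what density gives the best balance between informativeness and readability? That question is worth exploring.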

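To answer it, we first need a way to control and evaluate summaries at different density levels. In existing research it is hard to separate the effect of density from the effect of length, because a longer summary can carry more information simply by being longer.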

To study the relationship between summary density and quality, the Salesforce AI researchers proposed a new method: Chain of Density (CoD) prompting.

Chain of Density (CoD) prompting

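Chain of Density (CoD) prompting iteratively identifies entities missing from the summary and fuses them in, producing summaries of steadily increasing information density without increasing their length. This makes it possible to explore the tradeoff between a summary's informativeness and its readability.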

It works as follows:

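1. Generate an entity-sparse initial summary, Summary A, of about 80 words.
2. Identify 1-3 entities from the source article that do not appear in A; these are the Missing Entities.
3. Without increasing the length, rewrite the summary to fuse in the Missing Entities, producing Summary B.
4. Repeat steps 2-3 to raise the entity density step by step, for 5 rounds in total.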
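
The steps above can be sketched as a small loop. This is only an illustration under stated assumptions: the paper's prompt asks the model to perform all rounds inside a single response, whereas this sketch unrolls them into separate calls. The names `chain_of_density`, `find_missing`, and `rewrite` are hypothetical; the two callables stand in for whatever LLM calls you would use, and "5 rounds" is read here as 5 summaries in total, counting the initial one.

```python
from typing import Callable, List, Tuple


def chain_of_density(
    article: str,
    initial_summary: str,
    find_missing: Callable[[str, str], List[str]],  # hypothetical LLM-backed helper
    rewrite: Callable[[str, List[str]], str],       # hypothetical LLM-backed helper
    rounds: int = 5,
) -> List[Tuple[List[str], str]]:
    """Return one (missing_entities, summary) pair per round."""
    summary = initial_summary
    steps = [([], summary)]  # round 1: the sparse initial summary
    for _ in range(rounds - 1):
        # Step 2: pick at most 3 entities that the current summary lacks.
        missing = find_missing(article, summary)[:3]
        # Step 3: fuse them in; keeping the length fixed is the rewriter's job.
        if missing:
            summary = rewrite(summary, missing)
        steps.append((missing, summary))
    return steps


if __name__ == "__main__":
    # Toy stand-ins so the loop can run without a model.
    toy_article = "alpha beta gamma delta epsilon zeta"
    toy_find = lambda art, summ: [w for w in art.split() if w not in summ.split()]
    toy_rewrite = lambda summ, ents: summ + " " + " ".join(ents)
    for entities, text in chain_of_density(toy_article, "alpha", toy_find, toy_rewrite, rounds=3):
        print(entities, "->", text)
```

The toy rewriter simply appends words, so it does not hold the length constant; a real rewriter would have to compress the existing text to make room.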

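As you can see, the core idea of the prompt is to gauge the information density of the current summary by identifying entities that appear in the source article but are missing from the summary. This lets the LLM work out which parts the summary lacks and then revise the summary itself.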
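
One simple way to put a number on "density" is entities per token. The function below is a rough proxy, not the paper's measurement pipeline: it assumes whitespace tokenization and case-insensitive substring matching, and the entity list is supplied by the caller. The name `entity_density` is made up for this example.

```python
from typing import List


def entity_density(summary: str, entities: List[str]) -> float:
    """Entities found in the summary divided by its whitespace token count."""
    tokens = summary.split()
    if not tokens:
        return 0.0
    lowered = summary.lower()
    # Substring matching is crude: "mit" would also match inside "limit".
    found = [e for e in entities if e.lower() in lowered]
    return len(found) / len(tokens)


if __name__ == "__main__":
    text = "Salesforce and MIT released a prompt"
    print(entity_density(text, ["Salesforce", "MIT", "GPT-4"]))
```

Because CoD holds the length fixed across rounds, a rising value of this ratio comes from added entities and not from a longer summary.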

The figure above shows the full CoD prompt and a comparison of results.

The Chain of Density (CoD) prompt is as follows:

prompt_input = f"""

Article: {{{input_article}}}
You will generate increasingly concise, entity-dense summaries of the above article. 

Repeat the following 2 steps 5 times. 

Step 1. Identify 1-3 informative entities (";" delimited) from the article which are missing from the previously generated summary. 
Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the missing entities. 

A missing entity is:
- relevant to the main story, 
- specific yet concise (5 words or fewer), 
- novel (not in the previous summary), 
- faithful (present in the article), 
- anywhere (can be located anywhere in the article).

Guidelines:

- The first summary should be long (4-5 sentences, ~80 words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., "this article discusses") to reach ~80 words.
- Make every word count: rewrite the previous summary to improve flow and make space for additional entities.
- Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
- The summaries should become highly dense and concise yet self-contained, i.e., easily understood without the article. 
- Missing entities can appear anywhere in the new summary.
- Never drop entities from the previous summary. If space cannot be made, add fewer new entities. 

Remember, use the exact same number of words for each summary.
Answer in JSON. The JSON should be a list (length 5) of dictionaries whose keys are "Missing_Entities" and "Denser_Summary".
"""
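
Since the prompt asks for a JSON list of dictionaries with the keys "Missing_Entities" and "Denser_Summary", the reply can be checked mechanically. The sketch below assumes the model returns bare JSON with no surrounding text, and that "Missing_Entities" is a single ";"-delimited string as the prompt requests. The helper names `parse_cod_response`, `split_entities`, and `word_counts` are hypothetical.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = {"Missing_Entities", "Denser_Summary"}


def parse_cod_response(raw: str) -> List[Dict[str, str]]:
    """Parse the model's reply and check it has the shape the prompt asks for."""
    steps = json.loads(raw)
    if not isinstance(steps, list):
        raise ValueError("expected a JSON list of steps")
    for i, step in enumerate(steps):
        if not isinstance(step, dict) or not REQUIRED_KEYS <= step.keys():
            raise ValueError(f"step {i} is missing one of {sorted(REQUIRED_KEYS)}")
    return steps


def split_entities(field: str) -> List[str]:
    """Split the ';'-delimited entity string into a clean list."""
    return [e.strip() for e in field.split(";") if e.strip()]


def word_counts(steps: List[Dict[str, str]]) -> List[int]:
    """Word count per summary, to see whether the length really stayed constant."""
    return [len(step["Denser_Summary"].split()) for step in steps]


if __name__ == "__main__":
    sample = (
        '[{"Missing_Entities": "Salesforce AI; MIT", "Denser_Summary": "one two three"},'
        ' {"Missing_Entities": "GPT-4", "Denser_Summary": "four five six"}]'
    )
    parsed = parse_cod_response(sample)
    print([split_entities(s["Missing_Entities"]) for s in parsed], word_counts(parsed))
```

Comparing the word counts is a quick way to see whether the model followed the "exact same number of words" instruction.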

The value of Chain of Density (CoD) prompting

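Seen this way, Chain of Density (CoD) prompting is not limited to text summarization. At its core it is a way of having the large model refine its own output, here by using the identified entities to gauge the summary's information density. Following the same idea, we can use other signals to drive similar improvements. A simple example: when generating code with a large model, we can have the model compare the features of the generated code against the input requirements, identify the differences, and keep optimizing. That is also a promising approach.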
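
The code-generation variant can be sketched the same way: find what is still missing, revise, and stop once nothing is missing. This is a hypothetical illustration of the idea and not something taken from the paper; `refine_until_covered`, `is_covered`, and `revise` are made-up names, and in practice both callables would be backed by an LLM or a test suite.

```python
from typing import Callable, List


def refine_until_covered(
    requirements: List[str],
    draft: str,
    is_covered: Callable[[str, str], bool],  # hypothetical check: does draft meet this requirement?
    revise: Callable[[str, List[str]], str],  # hypothetical reviser, e.g. an LLM call
    max_rounds: int = 5,
) -> str:
    """Revise the draft until every requirement is covered or rounds run out."""
    for _ in range(max_rounds):
        missing = [r for r in requirements if not is_covered(draft, r)]
        if not missing:
            break  # nothing left to add
        draft = revise(draft, missing)
    return draft


if __name__ == "__main__":
    # Toy stand-ins: a requirement counts as covered if its name appears in the draft.
    needs = ["parse", "validate", "log"]
    covered = lambda d, r: r in d
    add_one = lambda d, missing: d + " " + missing[0]
    print(refine_until_covered(needs, "parse", covered, add_one))
```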

Paper: From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

I also tested this prompt by having it summarize the paper. The result:

This is a methodology that the researchers have employed to generate increasingly concise and entity-dense summaries of articles using GPT-4. The process involves identifying 1-3 informative entities missing from the previously generated summary and then writing a new summary of identical length that includes these entities. The process is repeated five times, with each iteration making the summary denser by incorporating more entities. The researchers conducted a human preference study on 100 CNN DailyMail articles and found that humans prefer GPT-4 summaries that are denser than those generated by a vanilla prompt. The findings suggest a tradeoff between informativeness and readability in summary generation.
