This blog post surveys the summarization papers published at ACL 2023. This year, a total of 49 papers related to summarization appeared at the conference. Many of the authors are affiliated with top research institutes (Google Research, DeepMind, Meta FAIR) and universities (Stanford, Berkeley, MIT, CMU, and others).
Navigation
- 1.Binary and Ternary Natural Language Generation
- 2.DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization
- 3.Compositional Data Augmentation for Abstractive Conversation Summarization
- 4.Cross-lingual Science Journalism: Select, Simplify and Rewrite Summaries for Non-expert Readers
- 5.Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering
- 6.Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization
- 7.Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion
- 8.CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs
- 9.Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization
- 10.Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
- 11.Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization
- 12.Incorporating Distributions of Discourse Structure for Long Document Abstractive Summarization
- 13.Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback
- 14.Improving the Robustness of Summarization Systems with Dual Augmentation
- 15.RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs
- 16.Attributable and Scalable Opinion Summarization
- 17.CFSum Coarse-to-Fine Contribution Network for Multimodal Summarization
- 18.Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method
- 19.SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks
- 20.Revisiting Cross-Lingual Summarization: A Corpus-based Study and A New Benchmark with Improved Annotation
- 21.Unsupervised Extractive Summarization of Emotion Triggers
- 22.DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering
- 23.Concise Answers to Complex Questions: Summarization of Long-form Answers
- 24.Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations
- 25.SIMSUM: Document-level Text Simplification via Simultaneous Summarization
- 26.What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization
- 27.AlignScore: Evaluating Factual Consistency with A Unified Alignment Function
- 28.Contrastive Error Attribution for Finetuned Language Models
- 29.Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors
- 30.On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
- 31.Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization
- 32.BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics
- 33.UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot Summarization
- 34.ExplainMeetSum: A Dataset for Explainable Meeting Summarization Aligned with Human Intent
- 35.Dialogue Summarization with Static-Dynamic Structure Fusion Graph
- 36.Reference Matters: Benchmarking Factual Error Correction for Dialogue Summarization with Fine-grained Evaluation Framework
- 37.Towards Understanding Omission in Dialogue Summarization
- 38.A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization
- 39.MeetingQA: Extractive Question-Answering on Meeting Transcripts
- 40.Towards Unifying Multi-Lingual and Cross-Lingual Summarization
- 41.On Improving Summarization Factual Consistency from Natural Language Feedback
- 42.MeetingBank: A Benchmark Dataset for Meeting Summarization
- 43.Abstractive Summarizers are Excellent Extractive Summarizers
- 44.Toward Expanding the Scope of Radiology Report Summarization to Multiple Anatomies and Modalities
- 45.Balancing Lexical and Semantic Quality in Abstractive Summarization
- 46.Exploring Continual Learning for Code Generation Models
- 47.Token-Level Self-Evolution Training for Sequence-to-Sequence Learning
- 48.Improving Factuality of Abstractive Summarization without Sacrificing Summary Quality
- 49.With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness
Paper List
1.Binary and Ternary Natural Language Generation
Zechun Liu,Barlas Oguz,Aasish Pappu,Yangyang Shi,Raghuraman Krishnamoorthi
Download URL
https://aclanthology.org/2023.acl-long.5/
abstract
Ternary and binary neural networks enable multiplication-free computation and promise multiple orders of magnitude efficiency gains over full-precision networks if implemented on specialized hardware. However, since both the parameter and the output space are highly discretized, such networks have proven very difficult to optimize. The difficulties are compounded for the class of transformer text generation models due to the sensitivity of the attention operation to quantization and the noise-compounding effects of autoregressive decoding in the high-cardinality output space. We approach the problem with a mix of statistics-based quantization for the weights and elastic quantization of the activations and demonstrate the first ternary and binary transformer models on the downstream tasks of summarization and machine translation. Our ternary BART base achieves an R1 score of 41 on the CNN/DailyMail benchmark, which is merely 3.9 points behind the full model while being 16x more efficient. Our binary model, while less accurate, achieves a highly non-trivial score of 35.6. For machine translation, we achieved BLEU scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full precision mBART model score of 26.8. We also compare our approach in the 8-bit activation setting, where our ternary and even binary weight models can match or outperform the best existing 8-bit weight models in the literature. Our code and models are available at: https://github.com/facebookresearch/Ternary_Binary_Transformer.
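To make the weight side of this concrete, here is a minimal sketch of statistics-based ternary weight quantization. The 0.7 x mean(|W|) threshold is borrowed from the Ternary Weight Networks heuristic and is only a stand-in for the paper's exact statistics; the elastic activation quantization is not shown.

```python
import numpy as np

def ternarize(weights: np.ndarray, delta_factor: float = 0.7):
    """Ternarize a weight tensor to {-alpha, 0, +alpha}.

    The threshold heuristic (delta = delta_factor * mean|W|) follows the
    Ternary Weight Networks recipe and only approximates the paper's
    statistics-based quantization.
    """
    delta = delta_factor * np.mean(np.abs(weights))               # ternarization threshold
    mask = np.abs(weights) > delta                                # positions kept non-zero
    alpha = np.abs(weights[mask]).mean() if mask.any() else 0.0   # per-tensor scale
    ternary = np.sign(weights) * mask                             # values in {-1, 0, +1}
    return alpha * ternary, alpha

# Example: quantize a random linear layer's weight matrix
W = np.random.randn(768, 768).astype(np.float32)
W_q, scale = ternarize(W)
print("unique levels:", np.unique(W_q).size, "scale:", round(float(scale), 4))
```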
2.DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization
Yu Li,Baolin Peng,Pengcheng He,Michel Galley,Zhou Yu,Jianfeng Gao
Download URL
https://aclanthology.org/2023.acl-long.76/
abstract
Dialogue summarization has recently garnered significant attention due to its wide range of applications. However, existing methods for summarizing dialogues have limitations because they do not take into account the inherent structure of dialogue and rely heavily on labeled data, which can lead to poor performance in new domains. In this work, we propose DIONYSUS (dynamic input optimization in pre-training for dialogue summarization), a pre-trained encoder-decoder model for summarizing dialogues in any new domain. To pre-train DIONYSUS, we create two pseudo summaries for each dialogue example: one from a fine-tuned summarization model and the other from important dialogue turns. We then choose one of these pseudo summaries based on information distribution differences in different types of dialogues. This selected pseudo summary serves as the objective for pre-training DIONYSUS using a self-supervised approach on a large dialogue corpus. Our experiments show that DIONYSUS outperforms existing methods on six datasets, as demonstrated by its ROUGE scores in zero-shot and few-shot settings.
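The selection between the two pseudo summaries is the interesting design choice. As a purely illustrative sketch (the token-coverage heuristic below is hypothetical, not the paper's actual information-distribution criterion), one could pick whichever candidate covers more of the dialogue:

```python
def coverage(candidate: str, dialogue_turns: list) -> float:
    """Fraction of dialogue tokens covered by the candidate summary
    (a crude, hypothetical proxy for 'information distribution')."""
    cand_tokens = set(candidate.lower().split())
    turn_tokens = [t for turn in dialogue_turns for t in turn.lower().split()]
    if not turn_tokens:
        return 0.0
    return sum(t in cand_tokens for t in turn_tokens) / len(turn_tokens)

def pick_pseudo_summary(model_summary, salient_turns, dialogue_turns):
    """Select the pretraining target: generated summary vs. concatenated salient turns."""
    candidates = {"generated": model_summary, "salient_turns": " ".join(salient_turns)}
    return max(candidates.items(), key=lambda kv: coverage(kv[1], dialogue_turns))

turns = ["Alice: can we move the demo to Friday?", "Bob: yes, Friday at 10 works."]
choice, text = pick_pseudo_summary("The demo is moved to Friday at 10.",
                                   ["Bob: yes, Friday at 10 works."], turns)
print(choice, "->", text)
```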
3.Compositional Data Augmentation for Abstractive Conversation Summarization
Siru Ouyang,Jiaao Chen,Jiawei Han,Diyi Yang
Download URL
https://aclanthology.org/2023.acl-long.82/
abstract
Recent abstractive conversation summarization systems generally rely on large-scale datasets with annotated summaries. However, collecting and annotating these conversations can be a time-consuming and labor-intensive task. To address this issue, in this work, we present a sub-structure level compositional data augmentation method, Compo, for generating diverse and high-quality pairs of conversations and summaries. Specifically, Compo first extracts conversation structures like topic splits and action triples as basic units. Then we organize these semantically meaningful conversation snippets compositionally to create new training instances. Additionally, we explore noise-tolerant settings in both self-training and joint-training paradigms to make the most of these augmented samples. Our experiments on benchmark datasets, SAMSum and DialogSum, show that Compo substantially outperforms prior baseline methods by achieving a nearly 10% increase of ROUGE scores with limited data. Code is available at https://github.com/ozyyshr/Compo.
4.Cross-lingual Science Journalism: Select, Simplify and Rewrite Summaries for Non-expert Readers
Mehwish Fatima,Michael Strube
Download URL
https://aclanthology.org/2023.acl-long.103/
abstract
Automating Cross-lingual Science Journalism (CSJ) aims to generate popular science summaries from English scientific texts for non-expert readers in their local language. We introduce CSJ as a downstream task of text simplification and cross-lingual scientific summarization to facilitate science journalists' work. We analyze the performance of possible existing solutions as baselines for the CSJ task. Based on these findings, we propose to combine the three components - SELECT, SIMPLIFY and REWRITE (SSR) to produce cross-lingual simplified science summaries for non-expert readers. Our empirical evaluation on the Wikipedia dataset shows that SSR significantly outperforms the baselines for the CSJ task and can serve as a strong baseline for future work. We also perform an ablation study investigating the impact of individual components of SSR. Further, we analyze the performance of SSR on a high-quality, real-world CSJ dataset with human evaluation and in-depth analysis, demonstrating the superior performance of SSR for CSJ.
5.Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering
Avi Caciularu,Matthew Peters,Jacob Goldberger,Ido Dagan,Arman Cohan
Download URL
https://aclanthology.org/2023.acl-long.110/
abstract
The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model from a novel cross-document question answering pre-training objective. To that end, given a set (or cluster) of topically-related documents, we systematically generate semantically-oriented questions from a salient sentence in one document and challenge the model, during pre-training, to answer these questions while "peeking" into other topically-related documents. In a similar manner, the model is also challenged to recover the sentence from which the question was generated, again while leveraging cross-document information. This novel multi-document QA formulation directs the model to better recover cross-text informational relations, and introduces a natural augmentation that artificially increases the pre-training data. Further, unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation (e.g., QA) and long text generation (e.g., summarization). Following this scheme, we pre-train our model - termed QAmden - and evaluate its performance across several multi-document tasks, including multi-document QA, summarization, and query-focused summarization, yielding improvements of up to 7%, and significantly outperforms zero-shot GPT-3.5 and GPT-4.
6.Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization
Shiyue Zhang,David Wan,Mohit Bansal
Download URL
https://aclanthology.org/2023.acl-long.120/
abstract
The problems of unfaithful summaries have been widely discussed under the context of abstractive summarization. Though extractive summarization is less prone to the common unfaithfulness issues of abstractive summaries, does that mean extractive is equal to faithful? Turns out that the answer is no. In this work, we define a typology with five types of broad unfaithfulness problems (including and beyond not-entailment) that can appear in extractive summaries, including incorrect coreference, incomplete coreference, incorrect discourse, incomplete discourse, as well as other misleading information. We ask humans to label these problems out of 1600 English summaries produced by 16 diverse extractive systems. We find that 30% of the summaries have at least one of the five issues. To automatically detect these problems, we find that 5 existing faithfulness evaluation metrics for summarization have poor correlations with human judgment. To remedy this, we propose a new metric, ExtEval, that is designed for detecting unfaithful extractive summaries and is shown to have the best performance. We hope our work can increase the awareness of unfaithfulness problems in extractive summarization and help future work to evaluate and resolve these issues.
7.Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion
Shaoxiang Wu,Damai Dai,Ziwei Qin,Tianyu Liu,Binghuai Lin,Yunbo Cao,Zhifang Sui
Download URL
https://aclanthology.org/2023.acl-long.124/
abstract
Video multimodal fusion aims to integrate multimodal signals in videos, such as visual, audio and text, to make a complementary prediction with multiple modalities contents. However, unlike other image-text multimodal tasks, video has longer multimodal sequences with more redundancy and noise in both visual and audio modalities. Prior denoising methods like forget gate are coarse in the granularity of noise filtering. They often suppress the redundant and noisy information at the risk of losing critical information. Therefore, we propose a denoising bottleneck fusion (DBF) model for fine-grained video multimodal fusion. On the one hand, we employ a bottleneck mechanism to filter out noise and redundancy with a restrained receptive field. On the other hand, we use a mutual information maximization module to regulate the filter-out module to preserve key information within different modalities. Our DBF model achieves significant improvement over current state-of-the-art baselines on multiple benchmarks covering multimodal sentiment analysis and multimodal summarization tasks. It proves that our model can effectively capture salient features from noisy and redundant video, audio, and text inputs. The code for this paper will be publicly available at https://github.com/WSXRHFG/DBF
8.CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs
Abhik Bhattacharjee,Tahmid Hasan,Wasi Uddin Ahmad,Yuan-Fang Li,Yong-Bin Kang,Rifat Shahriyar
Download URL
https://aclanthology.org/2023.acl-long.143/
abstract
We present CrossSum, a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs. We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset and perform a controlled human evaluation to validate its quality. We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language. We also introduce LaSE, an embedding-based metric for automatically evaluating model-generated summaries. LaSE is strongly correlated with ROUGE and, unlike ROUGE, can be reliably measured even in the absence of references in the target language. Performance on ROUGE and LaSE indicate that our proposed model consistently outperforms baseline models. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first ever that is not centered around English. We are releasing the dataset, training and evaluation scripts, and models to spur future research on cross-lingual summarization. The resources can be found at https://github.com/csebuetnlp/CrossSum
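The embedding-similarity core of a LaSE-style score is easy to sketch with a multilingual sentence encoder via sentence-transformers. Note that LaSE also includes length and target-language confidence terms that this sketch omits, and the LaBSE checkpoint name is an assumption about an off-the-shelf model, not necessarily the one used in the paper.

```python
from sentence_transformers import SentenceTransformer, util

# LaBSE produces language-agnostic sentence embeddings, so generated and
# reference summaries in different languages can still be compared.
model = SentenceTransformer("sentence-transformers/LaBSE")

def embedding_similarity(generated: str, reference: str) -> float:
    emb = model.encode([generated, reference],
                       convert_to_tensor=True, normalize_embeddings=True)
    return float(util.cos_sim(emb[0], emb[1]))

print(embedding_similarity("A cyclone hit the coast on Monday.",
                           "Un cyclone a frappé la côte lundi."))
```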
9.Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization
Yunlong Liang,Fandong Meng,Jinan Xu,Jiaan Wang,Yufeng Chen,Jie Zhou
Download URL
https://aclanthology.org/2023.acl-long.165/
abstract
The goal of multimodal abstractive summarization (MAS) is to produce a concise summary given the multimodal data (text and vision). Existing studies on MAS mainly focus on how to effectively use the extracted visual features, having achieved impressive success on the high-resource English dataset. However, less attention has been paid to the quality of the visual features to the summary, which may limit the model performance, especially in the low- and zero-resource scenarios. In this paper, we propose to improve the summary quality through summary-oriented visual features. To this end, we devise two auxiliary tasks including vision to summary task and masked image modeling task. Together with the main summarization task, we optimize the MAS model via the training objectives of all these tasks. By these means, the MAS model can be enhanced by capturing the summary-oriented visual features, thereby yielding more accurate summaries. Experiments on 44 languages, covering mid-high-, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach, which achieves state-of-the-art performance under all scenarios. Additionally, we will contribute a large-scale multilingual multimodal abstractive summarization (MM-Sum) dataset to the research community.
10.Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
Yixin Liu,Alex Fabbri,Pengfei Liu,Yilun Zhao,Linyong Nan,Ruilin Han,Simeng Han,Shafiq Joty,Chien-Sheng Wu,Caiming Xiong,Dragomir Radev
Download URL
https://aclanthology.org/2023.acl-long.228/
abstract
Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale, and an in-depth analysis of human evaluation is lacking. Therefore, we address the shortcomings of existing summarization evaluation along the following axes: (1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units and allows for a high inter-annotator agreement. (2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems on three datasets. (3) We conduct a comparative study of four human evaluation protocols, underscoring potential confounding factors in evaluation setups. (4) We evaluate 50 automatic metrics and their variants using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. The metrics we benchmarked include recent methods based on large language models (LLMs), GPTScore and G-Eval. Furthermore, our findings have important implications for evaluating LLMs, as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
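At the summary level, the ACU protocol turns per-unit human judgments into a score. A minimal sketch of the unnormalized form (the fraction of reference ACUs judged present in the system summary); the paper also defines a length-normalized variant that is not shown here.

```python
def acu_score(acu_judgments: list) -> float:
    """Summary-level ACU score: fraction of the reference's atomic content
    units that annotators judged as present in the system summary."""
    if not acu_judgments:
        return 0.0
    return sum(bool(j) for j in acu_judgments) / len(acu_judgments)

# Example: annotators found 3 of the reference's 5 ACUs in the system summary
print(acu_score([True, True, False, True, False]))  # -> 0.6
```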
11.Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization
Pengcheng He,Baolin Peng,Song Wang,Yang Liu,Ruochen Xu,Hany Hassan,Yu Shi,Chenguang Zhu,Wayne Xiong,Michael Zeng,Jianfeng Gao,Xuedong Huang
Download URL
https://aclanthology.org/2023.acl-long.279/
abstract
This paper presents Z-Code++, a new pre-trained language model optimized for abstractive text summarization. The model extends the state-of-the-art encoder-decoder model using three techniques. First, we use a two-phase pre-training to improve the model's performance on low-resource summarization tasks. The model is first pre-trained using text corpora for language understanding, then is continually pre-trained on summarization corpora for grounded text generation. Second, we replace self-attention layers in the encoder with disentangled attention layers, where each word is represented using two vectors that encode its content and position, respectively. Third, we use fusion-in-encoder, a simple yet effective method of encoding long sequences in a hierarchical manner. Z-Code++ creates a new state-of-the-art on 9 of 13 text summarization tasks across 5 languages. Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the finetuned 200x larger GPT3-175B on SAMSum. In zero-shot and few-shot settings, our model substantially outperforms the competing models.
12.Incorporating Distributions of Discourse Structure for Long Document Abstractive Summarization
Dongqi Pu,Yifan Wang,Vera Demberg
Download URL
https://aclanthology.org/2023.acl-long.306/
abstract
For text summarization, the role of discourse structure is pivotal in discerning the core content of a text. Regrettably, prior studies on incorporating Rhetorical Structure Theory (RST) into transformer-based summarization models only consider the nuclearity annotation, thereby overlooking the variety of discourse relation types. This paper introduces the "RSTformer", a novel summarization model that comprehensively incorporates both the types and uncertainty of rhetorical relations. Our RST-attention mechanism, rooted in document-level rhetorical structure, is an extension of the recently devised Longformer framework. Through rigorous evaluation, the model proposed herein exhibits significant superiority over state-of-the-art models, as evidenced by its notable performance on several automatic metrics and human evaluation.
13.Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback
Paul Roit,Johan Ferret,Lior Shani,Roee Aharoni,Geoffrey Cideron,Robert Dadashi,Matthieu Geist,Sertan Girgin,Leonard Hussenot,Orgad Keller,Nikola Momchev,Sabela Ramos Garea,Piotr Stanczyk,Nino Vieillard,Olivier Bachem,Gal Elidan,Avinatan Hassidim,Olivier Pietquin,Idan Szpektor
Download URL
https://aclanthology.org/2023.acl-long.344/
abstract
Despite the seeming success of contemporary grounded text generation systems, they often tend to generate factually inconsistent text with respect to their input. This phenomenon is emphasized in tasks like summarization, in which the generated summaries should be corroborated by their source article. In this work we leverage recent progress on textual entailment models to directly address this problem for abstractive summarization systems. We use reinforcement learning with reference-free, textual-entailment rewards to optimize for factual consistency and explore the ensuing trade-offs, as improved consistency may come at the cost of less informative or more extractive summaries. Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience and conciseness of the generated summaries.
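The reward signal itself is straightforward to prototype. Below is a minimal sketch of a reference-free entailment reward using an off-the-shelf NLI model through the Hugging Face transformers pipeline; the checkpoint name is an assumption, and the paper's actual entailment scorer and RL training loop are not reproduced here.

```python
from transformers import pipeline

# An off-the-shelf NLI model stands in for the paper's entailment scorer.
nli = pipeline("text-classification", model="roberta-large-mnli", top_k=None)

def entailment_reward(source: str, summary: str) -> float:
    """Probability that the source entails the summary, used as a scalar reward."""
    out = nli({"text": source, "text_pair": summary})
    if out and isinstance(out[0], list):   # some versions nest the per-label scores
        out = out[0]
    return next(s["score"] for s in out if s["label"].upper() == "ENTAILMENT")

doc = "The city council approved the new budget on Tuesday after a long debate."
print(entailment_reward(doc, "The council approved the budget."))   # high reward
print(entailment_reward(doc, "The council rejected the budget."))   # low reward
```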
14.Improving the Robustness of Summarization Systems with Dual Augmentation
Xiuying Chen,Guodong Long,Chongyang Tao,Mingzhe Li,Xin Gao,Chengqi Zhang,Xiangliang Zhang
Download URL
https://aclanthology.org/2023.acl-long.378/
abstract
A robust summarization system should be able to capture the gist of the document, regardless of the specific word choices or noise in the input. In this work, we first explore the summarization models' robustness against perturbations including word-level synonym substitution and noise. To create semantic-consistent substitutes, we propose a SummAttacker, which is an efficient approach to generating adversarial samples based on pre-trained language models. Experimental results show that state-of-the-art summarization models have a significant decrease in performance on adversarial and noisy test sets. Next, we analyze the vulnerability of the summarization systems and explore improving the robustness by data augmentation. Specifically, the first vulnerability factor we found is the low diversity of the training inputs. Correspondingly, we expose the encoder to more diverse cases created by SummAttacker in the input space. The second factor is the vulnerability of the decoder, and we propose an augmentation in the latent space of the decoder to improve its robustness. Concretely, we create virtual cases by manifold softmixing two decoder hidden states of similar semantic meanings. Experimental results on Gigaword and CNN/DM datasets demonstrate that our approach achieves significant improvements over strong baselines and exhibits higher robustness on noisy, attacked, and clean datasets.
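SummAttacker generates its substitutes with a pretrained language model; as a simpler illustration of what word-level synonym perturbation for robustness testing looks like, here is a WordNet-based stand-in (it is not the paper's attack, and it requires nltk's wordnet data).

```python
import random
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def synonym_substitute(sentence: str, p: float = 0.15, seed: int = 0) -> str:
    """Replace a fraction of words with a WordNet synonym (a simplified
    stand-in for SummAttacker's LM-based substitutions)."""
    rng = random.Random(seed)
    out = []
    for w in sentence.split():
        syns = {l.name().replace("_", " ")
                for s in wordnet.synsets(w) for l in s.lemmas()} - {w}
        out.append(rng.choice(sorted(syns)) if syns and rng.random() < p else w)
    return " ".join(out)

print(synonym_substitute("The company reported a sharp increase in quarterly profit."))
```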
15.RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs
Afra Feyza Akyurek,Ekin Akyurek,Ashwin Kalyan,Peter Clark,Derry Tanti Wijaya,Niket Tandon
Download URL
https://aclanthology.org/2023.acl-long.427/
abstract
Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization and alphabetization and show relative improvements up to 10% in multiple text similarity metrics over other learned, retrieval-augmented or prompting-based critique generators.
16.Attributable and Scalable Opinion Summarization
Tom Hosking,Hao Tang,Mirella Lapata
Download URL
https://aclanthology.org/2023.acl-long.473/
abstract
We propose a method for unsupervised opinion summarization that encodes sentences from customer reviews into a hierarchical discrete latent space, then identifies common opinions based on the frequency of their encodings. We are able to generate both abstractive summaries by decoding these frequent encodings, and extractive summaries by selecting the sentences assigned to the same frequent encodings. Our method is attributable, because the model identifies sentences used to generate the summary as part of the summarization process. It scales easily to many hundreds of input reviews, because aggregation is performed in the latent space rather than over long sequences of tokens. We also demonstrate that our approach enables a degree of control, generating aspect-specific summaries by restricting the model to parts of the encoding space that correspond to desired aspects (e.g., location or food). Automatic and human evaluation on two datasets from different domains demonstrates that our method generates summaries that are more informative than prior work and better grounded in the input reviews.
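The "frequent encodings" idea can be mimicked with standard tools. In the sketch below, k-means cluster assignments over sentence embeddings stand in for the paper's hierarchical discrete latent codes, and the most frequent clusters each contribute one representative (and attributable) sentence; the encoder name and hyperparameters are assumptions, not the paper's setup.

```python
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def extractive_opinion_summary(review_sentences, n_codes=8, n_select=3, seed=0):
    """Cluster sentence embeddings (stand-in for hierarchical discrete codes),
    then return one representative sentence from each of the most frequent
    clusters. Requires len(review_sentences) >= n_codes."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embs = encoder.encode(review_sentences)
    codes = KMeans(n_clusters=n_codes, n_init=10, random_state=seed).fit_predict(embs)
    frequent = [c for c, _ in Counter(codes).most_common(n_select)]
    summary = []
    for c in frequent:
        idx = [i for i, code in enumerate(codes) if code == c]
        centroid = embs[idx].mean(axis=0)
        # the cluster member closest to the centroid is its representative
        best = min(idx, key=lambda i: np.linalg.norm(embs[i] - centroid))
        summary.append(review_sentences[best])
    return summary
```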
17.CFSum Coarse-to-Fine Contribution Network for Multimodal Summarization
Min Xiao,Junnan Zhu,Haitao Lin,Yu Zhou,Chengqing Zong
Download URL
https://aclanthology.org/2023.acl-long.476/
abstract
Multimodal summarization usually suffers from the problem that the contribution of the visual modality is unclear. Existing multimodal summarization approaches focus on designing the fusion methods of different modalities, while ignoring the adaptive conditions under which visual modalities are useful. Therefore, we propose a novel Coarse-to-Fine contribution network for multimodal Summarization (CFSum) to consider different contributions of images for summarization. First, to eliminate the interference of useless images, we propose a pre-filter module to abandon useless images. Second, to make accurate use of useful images, we propose two levels of visual complement modules, word level and phrase level. Specifically, image contributions are calculated and are adopted to guide the attention of both textual and visual modalities. Experimental results have shown that CFSum significantly outperforms multiple strong baselines on the standard benchmark. Furthermore, the analysis verifies that useful images can even help generate non-visual words which are implicitly represented in the image.
18.Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method
Yiming Wang,Zhuosheng Zhang,Rui Wang
Download URL
https://aclanthology.org/2023.acl-long.482/
abstract
Automatic summarization generates concise summaries that contain key ideas of source documents. As the most mainstream datasets for the news sub-domain, CNN/DailyMail and BBC XSum have been widely used for performance benchmarking. However, the reference summaries of those datasets turn out to be noisy, mainly in terms of factual hallucination and information redundancy. To address this challenge, we first annotate new expert-writing Element-aware test sets following the "Lasswell Communication Model" proposed by Lasswell, allowing reference summaries to focus on more fine-grained news elements objectively and comprehensively. Utilizing the new test sets, we observe the surprising zero-shot summary ability of LLMs, which addresses the issue of the inconsistent results between human preference and automatic evaluation metrics of LLMs' zero-shot summaries in prior work. Further, we propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step, which helps them integrate more fine-grained details of source documents into the final summaries that correlate with the human writing mindset. Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L on the two datasets, respectively. Dataset and code are publicly available at https://github.com/Alsace08/SumCoT.
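A rough sketch of what two-step, element-aware prompting can look like is below; the exact wording and the element set (entity, date, event, result) are assumptions rather than the paper's released prompts, and the LLM call itself is left abstract.

```python
ELEMENTS = ["Entity (who)", "Date (when)", "Event (what)", "Result (outcome)"]

def element_extraction_prompt(article: str) -> str:
    """Step 1: ask the model to pull out the core news elements."""
    bullets = "\n".join(f"- {e}:" for e in ELEMENTS)
    return (f"Article:\n{article}\n\n"
            f"Step 1: Extract the following core elements from the article.\n{bullets}")

def summary_prompt(article: str, extracted_elements: str) -> str:
    """Step 2: ask the model to write the summary conditioned on the elements."""
    return (f"Article:\n{article}\n\n"
            f"Extracted elements:\n{extracted_elements}\n\n"
            "Step 2: Using the elements above, write a one-sentence summary "
            "that is faithful to the article.")

# llm(...) stands for any chat/completion API and is intentionally not shown:
# elements = llm(element_extraction_prompt(article))
# summary  = llm(summary_prompt(article, elements))
```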
19.SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks
Suwon Shon,Siddhant Arora,Chyi-Jiunn Lin,Ankita Pasad,Felix Wu,Roshan S Sharma,Wei-Lun Wu,Hung-yi Lee,Karen Livescu,Shinji Watanabe
Download URL
https://aclanthology.org/2023.acl-long.496/
abstract
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will release a new benchmark suite, including for each task (i) curated annotations for a relatively small fine-tuning set, (ii) reproducible pipeline (speech recognizer + text model) and end-to-end baseline models and evaluation metrics, (iii) baseline model performance in various types of systems for easy comparisons. We present the details of data collection and annotation and the performance of the baseline models. We also analyze the sensitivity of pipeline models' performance to the speech recognition accuracy, using more than 20 publicly available speech recognition models.
20.Revisiting Cross-Lingual Summarization: A Corpus-based Study and A New Benchmark with Improved Annotation
Yulong Chen,Huajian Zhang,Yijie Zhou,Xuefeng Bai,Yueguan Wang,Ming Zhong,Jianhao Yan,Yafu Li,Judy Li,Xianchao Zhu,Yue Zhang
Download URL
https://aclanthology.org/2023.acl-long.519/
abstract
Most existing cross-lingual summarization (CLS) work constructs CLS corpora by simply and directly translating pre-annotated summaries from one language to another, which can contain errors from both summarization and translation processes. To address this issue, we propose ConvSumX, a cross-lingual conversation summarization benchmark, through a new annotation schema that explicitly considers source input context. ConvSumX consists of 2 sub-tasks under different real-world scenarios, with each covering 3 language directions. We conduct thorough analysis on ConvSumX and 3 widely-used manually annotated CLS corpora and empirically find that ConvSumX is more faithful towards input text. Additionally, based on the same intuition, we propose a 2-Step method, which takes both conversation and summary as input to simulate human annotation process. Experimental results show that 2-Step method surpasses strong baselines on ConvSumX under both automatic and human evaluation. Analysis shows that both source input text and summary are crucial for modeling cross-lingual summaries.
21.Unsupervised Extractive Summarization of Emotion Triggers
Tiberiu Sosea,Hongli Zhan,Junyi Jessy Li,Cornelia Caragea
Download URL
https://aclanthology.org/2023.acl-long.531/
abstract
Understanding what leads to emotions during large-scale crises is important as it can provide groundings for expressed emotions and subsequently improve the understanding of ongoing disasters. Recent approaches trained supervised models to both detect emotions and explain emotion triggers (events and appraisals) via abstractive summarization. However, obtaining timely and qualitative abstractive summaries is expensive and extremely time-consuming, requiring highly-trained expert annotators. In time-sensitive, high-stake contexts, this can block necessary responses. We instead pursue unsupervised systems that extract triggers from text. First, we introduce CovidET-EXT, augmenting (Zhan et al., 2022)'s abstractive dataset (in the context of the COVID-19 crisis) with extractive triggers. Second, we develop new unsupervised learning models that can jointly detect emotions and summarize their triggers. Our best approach, entitled Emotion-Aware Pagerank, incorporates emotion information from external sources combined with a language understanding module, and outperforms strong baselines. We release our data and code at https://github.com/tsosea2/CovidET-EXT.
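A minimal sketch of the general idea of biasing PageRank toward emotion-bearing sentences, using networkx personalized PageRank over a token-overlap sentence graph. The toy lexicon and the overlap similarity are simplistic stand-ins for the external emotion sources and language-understanding module the paper describes.

```python
import networkx as nx

EMOTION_LEXICON = {"afraid", "scared", "anxious", "worried", "angry", "sad"}  # toy lexicon

def overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def emotion_aware_pagerank(sentences, top_k=2):
    """Rank sentences with PageRank whose teleport distribution favors
    sentences containing emotion words; return the top_k as extracted triggers."""
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = overlap(sentences[i], sentences[j])
            if w > 0:
                g.add_edge(i, j, weight=w)
    emo = {i: 1.0 + sum(t in EMOTION_LEXICON for t in s.lower().split())
           for i, s in enumerate(sentences)}
    scores = nx.pagerank(g, weight="weight", personalization=emo)
    return [sentences[i] for i in sorted(scores, key=scores.get, reverse=True)[:top_k]]
```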
22.DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering
Pei Ke,Fei Huang,Fei Mi,Yasheng Wang,Qun Liu,Xiaoyan Zhu,Minlie Huang
Download URL
https://aclanthology.org/2023.acl-long.539/
abstract
Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability. Specifically, most of the well-performed metrics are required to train on evaluation datasets of specific NLG tasks and evaluation dimensions, which may cause over-fitting to task-specific datasets. Furthermore, existing metrics only provide an evaluation score for each dimension without revealing the evidence to interpret how this score is obtained. To deal with these challenges, we propose a simple yet effective metric called DecompEval. This metric formulates NLG evaluation as an instruction-style question answering task and utilizes instruction-tuned pre-trained language models (PLMs) without training on evaluation datasets, aiming to enhance the generalization ability. To make the evaluation process more interpretable, we decompose our devised instruction-style question about the quality of generated texts into the subquestions that measure the quality of each sentence. The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result. Experimental results show that DecompEval achieves state-of-the-art performance in untrained metrics for evaluating text summarization and dialogue generation, which also exhibits strong dimension-level / task-level generalization ability and interpretability.
23.Concise Answers to Complex Questions: Summarization of Long-form Answers
Abhilash Potluri,Fangyuan Xu,Eunsol Choi
Download URL
https://aclanthology.org/2023.acl-long.541/
abstract
Long-form question answering systems provide rich information by presenting paragraph-level answers, often containing optional background or auxiliary information. While such comprehensive answers are helpful, not all information is required to answer the question (e.g. users with domain knowledge do not need an explanation of background). Can we provide a concise version of the answer by summarizing it, while still addressing the question? We conduct a user study on summarized answers generated from state-of-the-art models and our newly proposed extract-and-decontextualize approach. We find a large proportion of long-form answers (over 90%) in the ELI5 domain can be adequately summarized by at least one system, while complex and implicit answers are challenging to compress. We observe that decontextualization improves the quality of the extractive summary, exemplifying its potential in the summarization task. To promote future work, we provide an extractive summarization dataset covering 1K long-form answers and our user study annotations. Together, we present the first study on summarizing long-form answers, taking a step forward for QA agents that can provide answers at multiple granularities.
24.Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations
Lucy Lu Wang,Yulia Otmakhova,Jay DeYoung,Thinh Hung Truong,Bailey Kuehl,Erin Bransom,Byron Wallace
Download URL
https://aclanthology.org/2023.acl-long.549/
abstract
Evaluating multi-document summarization (MDS) quality is difficult. This is especially true in the case of MDS for biomedical literature reviews, where models must synthesize contradicting evidence reported across different documents. Prior work has shown that rather than performing the task, models may exploit shortcuts that are difficult to detect using standard n-gram similarity metrics such as ROUGE. Better automated evaluation metrics are needed, but few resources exist to assess metrics when they are proposed. Therefore, we introduce a dataset of human-assessed summary quality facets and pairwise preferences to encourage and support the development of better automated evaluation methods for literature review MDS. We take advantage of community submissions to the Multi-document Summarization for Literature Review (MSLR) shared task to compile a diverse and representative sample of generated summaries. We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries, to other automated metrics including several we propose in this work, and to aspects of human-assessed summary quality. We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.
25.SIMSUM: Document-level Text Simplification via Simultaneous Summarization
Sofia Blinova,Xinyu Zhou,Martin Jaggi,Carsten Eickhoff,Seyed Ali Bahrainian
Download URL
https://aclanthology.org/2023.acl-long.552/
abstract
Document-level text simplification is a specific type of simplification which involves simplifying documents consisting of several sentences by rewriting them into fewer or more sentences. In this paper, we propose a new two-stage framework SIMSUM for automated document-level text simplification. Our model is designed with explicit summarization and simplification models and guides the generation using the main keywords of a source text. In order to evaluate our new model, we use two existing benchmark datasets for simplification, namely D-Wikipedia and Wiki-Doc. We compare our model's performance with state of the art and show that SIMSUM achieves top results on the D-Wikipedia dataset SARI (+1.20), D-SARI (+1.64), and FKGL (-0.35) scores, improving over the best baseline models. In order to evaluate the quality of the generated text, we analyze the outputs from different models qualitatively and demonstrate the merit of our new model. Our code and datasets are available.
26.What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization
Griffin Adams,Bichlien Nguyen,Jake Smith,Yingce Xia,Shufang Xie,Anna Ostropolets,Budhaditya Deb,Yuan-Jyue Chen,Tristan Naumann,Noémie Elhadad
Download URL
https://aclanthology.org/2023.acl-long.587/
abstract
Summarization models often generate text that is poorly calibrated to quality metrics because they are trained to maximize the likelihood of a single reference (MLE). To address this, recent work has added a calibration step, which exposes a model to its own ranked outputs to improve relevance or, in a separate line of work, contrasts positive and negative sets to improve faithfulness. While effective, much of this work has focused on how to generate and optimize these sets. Less is known about why one setup is more effective than another. In this work, we uncover the underlying characteristics of effective sets. For each training instance, we form a large, diverse pool of candidates and systematically vary the subsets used for calibration fine-tuning. Each selection strategy targets distinct aspects of the sets, such as lexical diversity or the size of the gap between positive and negatives. On three diverse scientific long-form summarization datasets (spanning biomedical, clinical, and chemical domains), we find, among others, that faithfulness calibration is optimal when the negative sets are extractive and more likely to be generated, whereas for relevance calibration, the metric margin between candidates should be maximized and surprise (the disagreement between model and metric defined candidate rankings) minimized.
27.AlignScore: Evaluating Factual Consistency with A Unified Alignment Function
Yuheng Zha,Yichi Yang,Ruichen Li,Zhiting Hu
Download URL
https://aclanthology.org/2023.acl-long.634/
abstract
Many text generation applications require the generated text to be factually consistent with input information. Automatic evaluation of factual consistency is challenging. Previous work has developed various metrics that often depend on specific functions, such as natural language inference (NLI) or question answering (QA), trained on limited data. Those metrics thus can hardly assess diverse factual inconsistencies (e.g., contradictions, hallucinations) that occur in varying inputs/outputs (e.g., sentences, documents) from different tasks. In this paper, we propose AlignScore, a new holistic metric that applies to a variety of factual inconsistency scenarios as above. AlignScore is based on a general function of information alignment between two arbitrary text pieces. Crucially, we develop a unified training framework of the alignment function by integrating a large diversity of data sources, resulting in 4.7M training examples from 7 well-established tasks (NLI, QA, paraphrasing, fact verification, information retrieval, semantic similarity, and summarization). We conduct extensive experiments on large-scale benchmarks including 22 evaluation datasets, where 19 of the datasets were never seen in the alignment training. AlignScore achieves substantial improvement over a wide range of previous metrics. Moreover, AlignScore (355M parameters) matches or even outperforms metrics based on ChatGPT and GPT-4 that are orders of magnitude larger.
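A chunk-and-aggregate evaluation scheme of this kind can be sketched with an off-the-shelf NLI model standing in for the learned alignment function: split the context into chunks, score each claim sentence against every chunk, take the max per sentence, and average. This is only an approximation of the metric's shape, not AlignScore itself; the checkpoint name and chunk size are assumptions.

```python
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli", top_k=None)

def chunk(text: str, max_words: int = 300):
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def entail_prob(premise: str, hypothesis: str) -> float:
    out = nli({"text": premise, "text_pair": hypothesis})
    if out and isinstance(out[0], list):   # some versions nest the per-label scores
        out = out[0]
    return next(s["score"] for s in out if s["label"].upper() == "ENTAILMENT")

def consistency_score(context: str, claim: str) -> float:
    """Max over context chunks, mean over claim sentences (NLI stands in
    for the unified alignment model)."""
    sentences = [s.strip() for s in claim.split(".") if s.strip()]
    per_sentence = [max(entail_prob(c, s) for c in chunk(context)) for s in sentences]
    return sum(per_sentence) / len(per_sentence)
```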
28.Contrastive Error Attribution for Finetuned Language Models
Faisal Ladhak,Esin Durmus,Tatsunori Hashimoto
Download URL
https://aclanthology.org/2023.acl-long.643/
abstract
Recent work has identified noisy and misannotated data as a core cause of hallucinations and unfaithful outputs in Natural Language Generation (NLG) tasks. Consequently, identifying and removing these examples is a key open challenge in creating reliable NLG systems. In this work, we introduce a framework to identify and remove low-quality training instances that lead to undesirable outputs, such as faithfulness errors in text summarization. We show that existing approaches for error tracing, such as gradient-based influence measures, do not perform reliably for detecting faithfulness errors in NLG datasets. We overcome the drawbacks of existing error tracing methods through a new, contrast-based estimate that compares undesired generations to human-corrected outputs. Our proposed method can achieve a mean average precision of 0.93 at detecting known data errors across synthetic tasks with known ground truth, substantially outperforming existing approaches. Using this approach and re-training models on cleaned data leads to a 70% reduction in entity hallucinations on the NYT dataset and a 55% reduction in semantic errors on the E2E dataset.
29.Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors
Liyan Tang,Tanya Goyal,Alex Fabbri,Philippe Laban,Jiacheng Xu,Semih Yavuz,Wojciech Kryscinski,Justin Rousseau,Greg Durrett
Download URL
https://aclanthology.org/2023.acl-long.650/
abstract
The propensity of abstractive summarization models to make factual errors has been studied extensively, including design of metrics to detect factual errors and annotation of errors in current systems' outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has become increasingly difficult. In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model. We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models. Critically, our analysis shows that much of the recent improvement in the factuality detection space has been on summaries from older (pre-Transformer) models instead of more relevant recent summarization models. We further perform a finer-grained analysis per error-type and find similar performance variance across error types for different factuality metrics. Our results show that no one metric is superior in all settings or for all error types, and we provide recommendations for best practices given these insights.
30.On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
Tianxing He,Jingyu Zhang,Tianle Wang,Sachin Kumar,Kyunghyun Cho,James Glass,Yulia Tsvetkov
Download URL
https://aclanthology.org/2023.acl-long.674/
abstract
In this work, we explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics: stress tests with synthetic data. Basically, we design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. We examine a range of recently proposed evaluation metrics based on pretrained language models, for the tasks of open-ended generation, translation, and summarization. Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics. For example, we find that BERTScore is confused by truncation errors in summarization, and MAUVE (built on top of GPT-2) is insensitive to errors at the beginning or middle of generations. Further, we investigate the reasons behind these blind spots and suggest practical workarounds for a more reliable evaluation of text generation. We have released our code and data at https://github.com/cloudygoose/blindspot_nlg.
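A stress test in this spirit is simple to write: synthesize an error (here, truncation) and check whether the metric's score drops. The sketch below uses ROUGE via the rouge_score package purely as an example metric; the keep ratio and the choice of metric are assumptions.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def metric(reference: str, candidate: str) -> float:
    return scorer.score(reference, candidate)["rougeL"].fmeasure

def truncation_stress_test(reference: str, summary: str, keep_ratio: float = 0.6) -> bool:
    """Return True if the metric score drops when the summary is truncated,
    i.e. the metric is sensitive to this synthetic error."""
    words = summary.split()
    truncated = " ".join(words[: int(len(words) * keep_ratio)])
    return metric(reference, truncated) < metric(reference, summary)

ref = "The storm forced the airport to close for two days and cancelled 300 flights."
sys = "The storm closed the airport for two days, cancelling 300 flights."
print(truncation_stress_test(ref, sys))
```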
31.Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization
Artidoro Pagnoni,Alex Fabbri,Wojciech Kryscinski,Chien-Sheng Wu
Download URL
https://aclanthology.org/2023.acl-long.713/
abstract
In long document controllable summarization, where labeled data is scarce, pretrained models struggle to adapt to the task and effectively respond to user queries. In this paper, we introduce Socratic pretraining, a question-driven, unsupervised pretraining objective specifically designed to improve controllability in summarization tasks. By training a model to generate and answer relevant questions in a given context, Socratic pretraining enables the model to more effectively adhere to user-provided queries and identify relevant content to be summarized. We demonstrate the effectiveness of this approach through extensive experimentation on two summarization domains, short stories and dialogue, and multiple control strategies: keywords, questions, and factoid QA pairs. Our pretraining method relies only on unlabeled documents and a question generation system and outperforms pre-finetuning approaches that use additional supervised data. Furthermore, our results show that Socratic pretraining cuts task-specific labeled data requirements in half, is more faithful to user-provided queries, and achieves state-of-the-art performance on QMSum and SQuALITY.
32.BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics
Liang Ma,Shuyang Cao,Robert L Logan IV,Di Lu,Shihao Ran,Ke Zhang,Joel Tetreault,Alejandro Jaimes
Download URL
https://aclanthology.org/2023.acl-long.716/
abstract
The proliferation of automatic faithfulness metrics for summarization has produced a need for benchmarks to evaluate them. While existing benchmarks measure the correlation with human judgements of faithfulness on model-generated summaries, they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., indicate lower faithfulness as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find BUMP complements existing benchmarks in a number of ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics, and reveals that the most discriminative metrics tend not to be the most consistent, and 3) unlike datasets containing generated summaries with multiple errors, BUMP enables the measurement of metrics' performance on individual error types.
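The pairwise consistency check that a minimal-pair benchmark enables is easy to express: a metric is consistent on a pair if it scores the faithful summary strictly higher than its minimally edited unfaithful counterpart. The token-overlap "metric" below is only a placeholder for illustration.

```python
from typing import Callable, List, Tuple

def consistency(metric: Callable[[str, str], float],
                pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of (document, faithful_summary, unfaithful_summary) triples on
    which the metric scores the faithful summary strictly higher."""
    hits = sum(metric(doc, good) > metric(doc, bad) for doc, good, bad in pairs)
    return hits / len(pairs) if pairs else 0.0

def overlap_metric(doc: str, summary: str) -> float:
    """Toy faithfulness 'metric' (for illustration only)."""
    d, s = set(doc.lower().split()), set(summary.lower().split())
    return len(d & s) / max(1, len(s))

pairs = [("The plant opened in 1990 in Ohio.",
          "The plant opened in 1990.",
          "The plant opened in 1995.")]
print(consistency(overlap_metric, pairs))
```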
33.UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot Summarization
Yulong Chen,Yang Liu,Ruochen Xu,Ziyi Yang,Chenguang Zhu,Michael Zeng,Yue Zhang
Download URL
https://aclanthology.org/2023.acl-long.718/
abstract
The high annotation costs and diverse demands of various summarization tasks motivate the development of few-shot summarization. However, despite the emergence of many summarization tasks and datasets, the current training paradigm for few-shot summarization systems ignores potentially shareable knowledge in heterogeneous datasets. To this end, we propose UniSumm, a unified few-shot summarization model pre-trained with multiple summarization tasks and can be prefix-tuned to excel at any few-shot summarization task. Meanwhile, to better evaluate few-shot summarizers, under the principles of diversity and robustness, we assemble and release a new benchmark SummZoo. It consists of 8 summarization tasks with multiple sets of few-shot samples for each task, covering diverse domains. Experimental results and analysis show that UniSumm outperforms strong baselines by a large margin across all sub-tasks in SummZoo under both automatic and human evaluations and achieves comparable results in human evaluation compared with a GPT-3.5 model.
34.ExplainMeetSum: A Dataset for Explainable Meeting Summarization Aligned with Human Intent
Hyun Kim,Minsoo Cho,Seung-Hoon Na
Download URL
https://aclanthology.org/2023.acl-long.731/
abstract
To enhance the explainability of meeting summarization, we construct a new dataset called "ExplainMeetSum", an augmented version of QMSum, by newly annotating evidence sentences that faithfully "explain" a summary. Using ExplainMeetSum, we propose a novel multiple extractor guided summarization, namely Multi-DYLE, which extensively generalizes DYLE to enable using a supervised extractor based on human-aligned extractive oracles. We further present an explainability-aware task, named "Explainable Evidence Extraction" (E3), which aims to automatically detect all evidence sentences that support a given summary. Experimental results on the QMSum dataset show that the proposed Multi-DYLE outperforms DYLE with gains of up to 3.13 in the ROUGE-1 score. We further present the initial results on the E3 task, under the settings using separate and joint evaluation metrics.
35.Dialogue Summarization with Static-Dynamic Structure Fusion Graph
Shen Gao,Xin Cheng,Mingzhe Li,Xiuying Chen,Jinpeng Li,Dongyan Zhao,Rui Yan
Download URL
https://aclanthology.org/2023.acl-long.775/
abstract
Dialogue, the most fundamental and specially privileged arena of language, gains increasing ubiquity across the Web in recent years. Quickly going through the long dialogue context and capturing salient information scattered over the whole dialogue session benefit users in many real-world Web applications such as email thread summarization and meeting minutes draft. Dialogue summarization is a challenging task in that dialogue has dynamic interaction nature and presumably inconsistent information flow among various speakers. Many researchers address this task by modeling dialogue with pre-computed static graph structure using external linguistic toolkits. However, such methods heavily depend on the reliability of external tools, and the static graph construction is disjoint from the graph representation learning phase, which means the graph cannot be dynamically adapted for the downstream summarization task. In this paper, we propose a Static-Dynamic graph-based Dialogue Summarization model (SDDS), which fuses prior knowledge from human expertise and adaptively learns the graph structure in an end-to-end learning fashion. To verify the effectiveness of SDDS, we conduct experiments on three benchmark datasets (SAMSum, MediaSum, and DialogSum) and the results verify the superiority of SDDS.
36.Reference Matters: Benchmarking Factual Error Correction for Dialogue Summarization with Fine-grained Evaluation Framework
Mingqi Gao,Xiaojun Wan,Jia Su,Zhefeng Wang,Baoxing Huai
Download URL
https://aclanthology.org/2023.acl-long.779/
abstract
Factuality is important to dialogue summarization. Factual error correction (FEC) of model-generated summaries is one way to improve factuality. Current FEC evaluation that relies on factuality metrics is not reliable and detailed enough. To address this problem, we are the first to manually annotate a FEC dataset for dialogue summarization containing 4000 items and propose FERRANTI, a fine-grained evaluation framework based on reference correction that automatically evaluates the performance of FEC models on different error categories. Using this evaluation framework, we conduct sufficient experiments with FEC approaches under a variety of settings and find the best training modes and significant differences in the performance of the existing approaches on different factual error categories.
37.Towards Understanding Omission in Dialogue Summarization
Yicheng Zou,Kaitao Song,Xu Tan,Zhongkai Fu,Qi Zhang,Dongsheng Li,Tao Gui
Download URL
https://aclanthology.org/2023.acl-long.798/
abstract
Dialogue summarization aims to condense a lengthy dialogue into a concise summary and has recently achieved significant progress. However, the results of existing methods are still far from satisfactory. Previous works indicated that omission is a major factor affecting summarization quality, but few of them have further explored the omission problem, such as how omission affects summarization results and how to detect omission, which is critical for reducing omission and improving summarization quality. Moreover, analyzing and detecting omission relies on summarization datasets with omission labels (i.e., which dialogue utterances are omitted in the summary), which are not available in the current literature. In this paper, we propose the OLDS dataset, which provides high-quality omission labels for dialogue summarization. By analyzing this dataset, we find that a large improvement in summarization quality can be achieved by providing ground-truth omission labels for the summarization model to recover omitted information, which demonstrates the importance of omission detection for omission mitigation in dialogue summarization. Therefore, we formulate an omission detection task and demonstrate that our proposed dataset can support the training and evaluation of this task well. We also call for research on omission detection based on our proposed dataset. Our dataset and code are publicly available.
38.A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization
Lining Zhang,Simon Mille,Yufang Hou,Daniel Deutsch,Elizabeth Clark,Yixin Liu,Saad Mahamood,Sebastian Gehrmann,Miruna Clinciu,Khyathi Raghavi Chandu,João Sedoc
Download URL
https://aclanthology.org/2023.acl-long.835/
abstract
To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization. Thus, we investigate the recruitment of high-quality Amazon Mechanical Turk workers via a two-step pipeline. We show that we can successfully filter out subpar workers before they carry out the evaluations and obtain high-agreement annotations with similar constraints on resources. Although our workers demonstrate a strong consensus among themselves and with CloudResearch workers, their alignment with expert judgments on a subset of the data is lower than expected, indicating that further training on correctness is needed. The paper nonetheless serves as a best-practice guide for recruiting qualified annotators in other challenging annotation tasks.
39.MeetingQA: Extractive Question-Answering on Meeting Transcripts
Archiki Prasad,Trung Bui,Seunghyun Yoon,Hanieh Deilamsalehy,Franck Dernoncourt,Mohit Bansal
Download URL
https://aclanthology.org/2023.acl-long.837/
abstract
With the ubiquitous use of online meeting platforms and robust automatic speech recognition systems, meeting transcripts have emerged as a promising domain for natural language tasks. Most recent works on meeting transcripts primarily focus on summarization and extraction of action items. However, meeting discussions also have a useful question-answering (QA) component, crucial to understanding the discourse or meeting content, which can be used to build interactive interfaces on top of long transcripts. Hence, in this work, we leverage this inherent QA component of meeting discussions and introduce MeetingQA, an extractive QA dataset comprising questions asked by meeting participants and their corresponding responses. As a result, questions can be open-ended and actively seek discussion, while answers can be multi-span and distributed across multiple speakers. Our comprehensive empirical study of several robust baselines, including long-context language models and recent instruction-tuned models, reveals that models perform poorly on this task (F1 = 57.3) and severely lag behind human performance (F1 = 84.6), presenting a challenging new task for the community to improve upon.
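The reported F1 numbers suggest a SQuAD-style span evaluation. For reference, the snippet below shows how a token-level F1 over multi-span answers is commonly computed; the paper's official evaluation script may differ in details such as normalization.

```python
# Illustrative token-level F1 for multi-span answers (SQuAD-style; the paper's
# exact evaluation may differ).
from collections import Counter


def multi_span_f1(predicted_spans: list[str], gold_spans: list[str]) -> float:
    pred_tokens = " ".join(predicted_spans).lower().split()
    gold_tokens = " ".join(gold_spans).lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(multi_span_f1(["we should ship friday"], ["ship on friday", "after QA signs off"]))
```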
40.Towards Unifying Multi-Lingual and Cross-Lingual Summarization
Jiaan Wang,Fandong Meng,Duo Zheng,Yunlong Liang,Zhixu Li,Jianfeng Qu,Jie Zhou
Download URL
https://aclanthology.org/2023.acl-long.843/
abstract
To adapt text summarization to the multilingual world, previous work has proposed multi-lingual summarization (MLS) and cross-lingual summarization (CLS). However, these two tasks have been studied separately due to their different definitions, which limits compatible and systematic research on both. In this paper, we aim to unify MLS and CLS into a more general setting, i.e., many-to-many summarization (M2MS), where a single model can process documents in any language and generate summaries in any language. As a first step towards M2MS, we conduct preliminary studies showing that M2MS can better transfer task knowledge across different languages than MLS and CLS. Furthermore, we propose Pisces, a pre-trained M2MS model that learns language modeling, cross-lingual ability, and summarization ability via three-stage pre-training. Experimental results indicate that Pisces significantly outperforms state-of-the-art baselines, especially in zero-shot directions, where there is no training data from the source-language documents to the target-language summaries.
41.On Improving Summarization Factual Consistency from Natural Language Feedback
Yixin Liu,Budhaditya Deb,Milagro Teruel,Aaron Halfaker,Dragomir Radev,Ahmed Hassan Awadallah
Download URL
https://aclanthology.org/2023.acl-long.844/
abstract
Despite the recent progress in language generation models, their outputs may not always meet user expectations. In this work, we study whether informational feedback in natural language can be leveraged to improve generation quality and user preference alignment. To this end, we consider factual consistency in summarization, i.e., that the summary should contain only information supported by the input documents, as the user-expected preference. We collect a high-quality dataset, DeFacto, containing human demonstrations and informational natural language feedback consisting of corrective instructions, edited summaries, and explanations with respect to the factual consistency of the summary. Using our dataset, we study three natural language generation tasks: (1) editing a summary by following the human feedback, (2) generating human feedback for editing the original summary, and (3) revising the initial summary to correct factual errors by generating both the human feedback and the edited summary. We show that DeFacto can provide factually consistent human-edited summaries and further insights into summarization factual consistency thanks to its informational natural language feedback. We further demonstrate that fine-tuned language models can leverage our dataset to improve summary factual consistency, while large language models lack zero-shot learning ability on our proposed tasks, which require controllable text generation.
42.MeetingBank: A Benchmark Dataset for Meeting Summarization
Yebowen Hu,Timothy Ganter,Hanieh Deilamsalehy,Franck Dernoncourt,Hassan Foroosh,Fei Liu
Download URL
https://aclanthology.org/2023.acl-long.906/
abstract
As the number of recorded meetings increases, it becomes increasingly important to utilize summarization technology to create useful summaries of these recordings. However, there is a crucial lack of annotated meeting corpora for developing this technology, as meetings can be hard to collect, especially when the topics discussed are confidential. Furthermore, meeting summaries written by experienced writers are scarce, making it hard for abstractive summarizers to produce sensible output without a reliable reference. This lack of annotated corpora has hindered the development of meeting summarization technology. In this paper, we present MeetingBank, a new benchmark dataset of city council meetings over the past decade. MeetingBank is unique among meeting corpora due to its divide-and-conquer approach, which involves dividing professionally written meeting minutes into shorter passages and aligning them with specific segments of the meeting. This breaks the process of summarizing a lengthy meeting down into smaller, more manageable tasks. The dataset provides a new testbed for various meeting summarization systems and also allows the public to gain insight into how council decisions are made. We make the collection, including meeting video links, transcripts, reference summaries, agendas, and other metadata, publicly available to facilitate the development of better meeting summarization techniques.
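To make the divide-and-conquer idea more concrete, here is a purely illustrative greedy pairing of minutes passages to transcript segments by lexical overlap. The released MeetingBank data ships the authors' own alignments, so this is an assumption about the pairing idea, not their pipeline.

```python
# Hypothetical sketch: align each minutes passage to the transcript segment
# that covers it best, measured by ROUGE-1 recall of the passage.
from rouge_score import rouge_scorer


def align_passages(passages: list[str], segments: list[str]) -> list[int]:
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    alignment = []
    for passage in passages:
        recalls = [scorer.score(passage, seg)["rouge1"].recall for seg in segments]
        alignment.append(max(range(len(segments)), key=recalls.__getitem__))
    return alignment
```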
43.Abstractive Summarizers are Excellent Extractive Summarizers
Daniel Varab,Yumo Xu
Download URL
https://aclanthology.org/2023.acl-short.29/
abstract
Extractive and abstractive summarization designs have historically been fragmented, limiting the benefits that often arise from compatible model architectures. In this paper, we explore the potential synergies of modeling extractive summarization with an abstractive summarization system and propose three novel inference algorithms using the sequence-to-sequence architecture. We evaluate them on the CNN & DailyMail dataset and show that recent advancements in abstractive system design enable abstractive systems not only to compete with, but even to surpass, the performance of extractive systems with custom architectures. To our surprise, abstractive systems achieve this without being exposed to extractive oracle summaries and therefore, for the first time, allow a single model to produce both abstractive and extractive summaries. This evidence questions our fundamental understanding of extractive system design and the necessity of extractive labels, while paving the way for promising research directions in hybrid models.
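The abstract does not spell out the three inference algorithms, but one plausible way to repurpose an abstractive seq2seq model as an extractive scorer is to rank source sentences by the model's conditional likelihood of generating them. The sketch below is a hedged illustration of that idea; the checkpoint and the top-k selection are assumptions, not the paper's method.

```python
# Hedged sketch: score each source sentence with a summarization seq2seq
# model's conditional log-likelihood and extract the top-k sentences.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/bart-large-cnn"  # any seq2seq summarization checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()


@torch.no_grad()
def sentence_log_likelihood(document: str, sentence: str) -> float:
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    labels = tokenizer(sentence, return_tensors="pt", truncation=True).input_ids
    # The returned loss is the mean negative log-likelihood per label token.
    loss = model(**inputs, labels=labels).loss
    return -loss.item()


def extract(document: str, sentences: list[str], k: int = 3) -> list[str]:
    ranked = sorted(sentences, key=lambda s: sentence_log_likelihood(document, s), reverse=True)
    return ranked[:k]
```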
44.Toward Expanding the Scope of Radiology Report Summarization to Multiple Anatomies and Modalities
Zhihong Chen,Maya Varma,Xiang Wan,Curtis Langlotz,Jean-Benoit Delbrouck
Download URL
https://aclanthology.org/2023.acl-short.41/
abstract
Radiology report summarization (RRS) is a growing area of research. Given the Findings section of a radiology report, the goal is to generate a summary (called an Impression section) that highlights the key observations and conclusions of the radiology study. However, RRS currently faces essential limitations. First, many prior studies conduct experiments on private datasets, preventing reproduction of results and fair comparisons across different systems and solutions. Second, most prior approaches are evaluated solely on chest X-rays. To address these limitations, we propose a dataset (MIMIC-RRS) involving three new modalities and seven new anatomies based on the MIMIC-III and MIMIC-CXR datasets. We then conduct extensive experiments to evaluate the performance of models both within and across modality-anatomy pairs in MIMIC-RRS. In addition, we evaluate their clinical efficacy via RadGraph, a factual correctness metric.
45.Balancing Lexical and Semantic Quality in Abstractive Summarization
Jeewoo Sul,Yong Suk Choi
Download URL
https://aclanthology.org/2023.acl-short.56/
abstract
An important problem of the sequence-to-sequence neural models widely used in abstractive summarization is exposure bias. To alleviate this problem, re-ranking systems have been applied in recent years. Despite some performance improvements, this approach remains underexplored. Previous works have mostly specified the rank through the ROUGE score and aligned candidate summaries, but there can be quite a large gap between the lexical overlap metric and semantic similarity. In this paper, we propose a novel training method in which a re-ranker balances the lexical and semantic quality. We further newly define false positives in ranking and present a strategy to reduce their influence. Experiments on the CNN/DailyMail and XSum datasets show that our method can estimate the meaning of summaries without seriously degrading the lexical aspect. More specifically, it achieves an 89.67 BERTScore on the CNN/DailyMail dataset, reaching new state-of-the-art performance. Our code is publicly available at https://github.com/jeewoo1025/BalSum.
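As a rough illustration of balancing lexical and semantic quality, the snippet below interpolates ROUGE-1 with BERTScore to rank candidate summaries. The interpolation weight and metric choices are assumptions, and the paper trains a re-ranker rather than scoring against the reference at test time.

```python
# Illustrative candidate ranking that mixes a lexical metric (ROUGE-1) with a
# semantic one (BERTScore); not the paper's training objective.
from bert_score import score as bert_score
from rouge_score import rouge_scorer


def rank_candidates(candidates: list[str], reference: str, lam: float = 0.5) -> list[str]:
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    lexical = [scorer.score(reference, c)["rouge1"].fmeasure for c in candidates]
    _, _, semantic = bert_score(candidates, [reference] * len(candidates), lang="en")
    combined = [lam * lex + (1 - lam) * sem.item() for lex, sem in zip(lexical, semantic)]
    # Highest combined score first.
    return [c for _, c in sorted(zip(combined, candidates), reverse=True)]
```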
46.Exploring Continual Learning for Code Generation Models
Prateek Yadav,Qing Sun,Hantian Ding,Xiaopeng Li,Dejiao Zhang,Ming Tan,Parminder Bhatia,Xiaofei Ma,Ramesh Nallapati,Murali Krishna Ramanathan,Mohit Bansal,Bing Xiang
Download URL
https://aclanthology.org/2023.acl-short.68/
abstract
Large-scale code generation models such as Copilot and CodeT5 have achieved impressive performance. However, libraries are upgraded or deprecated very frequently and re-training large-scale language models is computationally expensive. Therefore, Continual Learning (CL) is an important aspect that remains under-explored in the code domain. In this paper, we introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement, with different input and output programming languages. Next, on our CodeTask-CL benchmark, we compare popular CL techniques from the NLP and Vision domains. We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism caused by stark distribution shifts in coding tasks. We address this issue with our proposed method, Prompt Pooling with Teacher Forcing (PP-TF), which stabilizes training by enforcing constraints on the prompt selection mechanism and leads to a 21.54% improvement over Prompt Pooling. Along with the benchmark, we establish a training pipeline that can be used for CL on code models, which we believe can motivate further development of CL methods for code models.
47.Token-Level Self-Evolution Training for Sequence-to-Sequence Learning
Keqin Peng,Liang Ding,Qihuang Zhong,Yuanxin Ouyang,Wenge Rong,Zhang Xiong,Dacheng Tao
Download URL
https://aclanthology.org/2023.acl-short.73/
abstract
Adaptive training approaches, widely used in sequence-to-sequence models, commonly reweigh the losses of different target tokens based on priors, e.g., word frequency. However, most of them do not consider the variation of learning difficulty across training steps, and they overly emphasize the learning of difficult one-hot labels, making learning deterministic and sub-optimal. In response, we present Token-Level Self-Evolution Training (SE), a simple and effective dynamic training method to fully and wisely exploit the knowledge in the data. SE focuses on dynamically learning the under-explored tokens at each forward pass and adaptively regularizes training by introducing a novel token-specific label smoothing approach. Empirically, SE yields consistent and significant improvements on three tasks, i.e., machine translation, summarization, and grammatical error correction. Encouragingly, we achieve an average +0.93 BLEU improvement on three machine translation tasks. Analyses confirm that, besides improving lexical accuracy, SE enhances generation diversity and model generalization.
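A minimal sketch of what token-specific label smoothing can look like, assuming the smoothing weight grows with per-token difficulty; the exact SE formulation in the paper differs.

```python
# Hypothetical token-specific label smoothing: confidently predicted tokens get
# little smoothing, hard tokens get more (not the paper's exact SE loss).
import torch
import torch.nn.functional as F


def token_specific_smoothing_loss(logits: torch.Tensor, targets: torch.Tensor,
                                  max_eps: float = 0.2) -> torch.Tensor:
    # logits: (seq_len, vocab_size), targets: (seq_len,)
    log_probs = F.log_softmax(logits, dim=-1)
    gold_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Per-token smoothing weight grows as the gold-token probability shrinks.
    eps = max_eps * (1.0 - gold_logp.exp()).detach()
    nll = -gold_logp
    uniform = -log_probs.mean(dim=-1)  # cross-entropy to the uniform distribution
    return ((1.0 - eps) * nll + eps * uniform).mean()


loss = token_specific_smoothing_loss(torch.randn(5, 100), torch.randint(0, 100, (5,)))
print(loss.item())
```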
48.Improving Factuality of Abstractive Summarization without Sacrificing Summary Quality
Tanay Dixit,Fei Wang,Muhao Chen
Download URL
https://aclanthology.org/2023.acl-short.78/
abstract
Improving the factual consistency of abstractive summarization has been a widely studied topic. However, most prior works on training factuality-aware models have ignored the negative effect this has on summary quality. We propose Effective Factual Summarization, a candidate summary generation and ranking technique to improve summary factuality without sacrificing quality. We show that using a contrastive learning framework with our refined candidate summaries leads to significant gains on both factuality and similarity-based metrics. Specifically, we propose a ranking strategy in which we effectively combine two metrics, thereby preventing any conflict during training. Models trained using our approach show up to 6 points of absolute improvement over the base model with respect to FactCC on XSum and 11 points on CNN/DM, without negatively affecting either similarity-based metrics or abstractiveness.
49.With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness
Julius Steen,Juri Opitz,Anette Frank,Katja Markert
Download URL
https://aclanthology.org/2023.acl-short.79/
abstract
Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metrics, NLI models seem attractive, since they solve a strongly related task that comes with a wealth of prior research and data. But recent research suggests that NLI models require costly additional machinery to perform reliably across datasets, e.g., by running inference on a Cartesian product of input and generated sentences, or by supporting them with a question-generation/answering step. In this work we show that pure NLI models can outperform more complex metrics when combining task-adaptive data augmentation with robust inference procedures. We propose: (1) augmenting NLI training data to adapt NL inference to the specificities of faithfulness prediction in dialogue; (2) making use of both entailment and contradiction probabilities in NLI; and (3) using Monte Carlo dropout during inference. Applied to the TRUE benchmark, which combines faithfulness datasets across diverse domains and tasks, our approach strongly improves a vanilla NLI model and significantly outperforms previous work, while showing favourable computational cost.
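Ideas (2) and (3) can be sketched with an off-the-shelf NLI model: keep dropout active at inference and average an entailment-minus-contradiction score over several stochastic passes. The checkpoint, its label order, and the score definition below are assumptions, and the paper's task-adaptive data augmentation step is omitted.

```python
# Hedged sketch of an NLI-based faithfulness score with Monte Carlo dropout.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)


@torch.no_grad()
def faithfulness_score(source: str, generated: str, num_samples: int = 8) -> float:
    inputs = tokenizer(source, generated, return_tensors="pt", truncation=True)
    model.train()  # keep dropout active for Monte Carlo sampling
    scores = []
    for _ in range(num_samples):
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
        # Assumed label order for this checkpoint: 0=contradiction, 1=neutral, 2=entailment.
        scores.append((probs[2] - probs[0]).item())
    model.eval()
    return sum(scores) / len(scores)
```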