This blog post summarizes recent research on dialogue and large language models (LLMs) published at the ACL 2023 conference. This year, a total of 79 dialogue-related papers appeared at ACL 2023. Most of the authors are affiliated with top research institutes (Google Research, DeepMind, Meta FAIR) and universities (Stanford, Berkeley, MIT, CMU, and others).
Navigation
- 1.One Cannot Stand for Everyone! Leveraging Multiple User Simulators to train Task-oriented Dialogue Systems
- 2.SafeConv: Explaining and Correcting Conversational Unsafe Behavior
- 3.Span-Selective Linear Attention Transformers for Effective and Robust Schema-Guided Dialogue State Tracking
- 4.EM Pre-training for Multi-party Dialogue Response Generation
- 5.Evaluating Open-Domain Dialogues in Latent Space with Next Sentence Prediction and Mutual Information
- 6.DialoGPS: Dialogue Path Sampling in Continuous Semantic Space for Data Augmentation in Multi-Turn Conversations
- 7.DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization
- 8.Facilitating Multi-turn Emotional Support Conversation with Positive Emotion Elicitation: A Reinforcement Learning Approach
- 9.Query Enhanced Knowledge-Intensive Conversation via Unsupervised Joint Modeling
- 10.White-Box Multi-Objective Adversarial Attack on Dialogue Generation
- 11.Divide, Conquer, and Combine: Mixture of Semantic-Independent Experts for Zero-Shot Dialogue State Tracking
- 12.BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
- 13.Schema-Guided User Satisfaction Modeling for Task-Oriented Dialogues
- 14.MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions
- 15.Injecting knowledge into language generation: a case study in auto-charting after-visit care instructions from medical dialogue
- 16.DiffusEmp: A Diffusion Model-Based Framework with Multi-Grained Control for Empathetic Response Generation
- 17.BREAK: Breaking the Dialogue State Tracking Barrier with Beam Search and Re-ranking
- 18.Learning to Generate Equitable Text in Dialogue from Biased Training Data
- 19.TREA: Tree-Structure Reasoning Schema for Conversational Recommendation
- 20.CORE: Cooperative Training of Retriever-Reranker for Effective Dialogue Response Selection
- 21.PVGRU: Generating Diverse and Relevant Dialogue Responses via Pseudo-Variational Mechanism
- 22.MPCHAT: Towards Multimodal Persona-Grounded Conversation
- 23.Towards Boosting the Open-Domain Chatbot with Human Feedback
- 24.Knowledge-enhanced Mixed-initiative Dialogue System for Emotional Support Conversations
- 25.ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems
- 26.Towards Faithful Dialogues via Focus Learning
- 27.Prompter: Zero-shot Adaptive Prefixes for Dialogue State Tracking Domain Adaptation
- 28.Enhancing Dialogue Generation via Dynamic Graph Knowledge Aggregation
- 29.Privacy-Preserving Domain Adaptation of Semantic Parsers
- 30.VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
- 31.Enhancing Personalized Dialogue Generation with Contrastive Latent Variables: Combining Sparse and Dense Persona
- 32.FutureTOD: Teaching Future Knowledge to Pre-trained Language Model for Task-Oriented Dialogue
- 33.PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives
- 34.Retrieval-free Knowledge Injection through Multi-Document Traversal for Dialogue Models
- 35.Annotating and Detecting Fine-grained Factual Errors for Dialogue Summarization
- 36.MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation
- 37.Envisioning Future from the Past: Hierarchical Duality Learning for Multi-Turn Dialogue Generation
- 38.Can Language Models Make Fun? A Case Study in Chinese Comical Crosstalk
- 39.A Dataset of Argumentative Dialogues on Scientific Papers
- 40.Contextual Knowledge Learning for Dialogue Generation
- 41.Speech-Text Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment
- 42.MidMed: Towards Mixed-Type Dialogues for Medical Consultation
- 43.CASE: Aligning Coarse-to-Fine Cognition and Affection for Empathetic Response Generation
- 44.RECAP: Retrieval-Enhanced Context-Aware Prefix Encoder for Personalized Dialogue Response Generation
- 45.Dual Class Knowledge Propagation Network for Multi-label Few-shot Intent Detection
- 46.The CRINGE Loss: Learning what language not to model
- 47.Modeling User Satisfaction Dynamics in Dialogue via Hawkes Process
- 48.Pre-training Multi-party Dialogue Models with Latent Discourse Inference
- 49.DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering
- 50.SimOAP: Improve Coherence and Consistency in Persona-based Dialogue Generation via Over-sampling and Post-evaluation
- 51.Improved Instruction Ordering in Recipe-Grounded Conversation
- 52.Dialog-Post: Multi-Level Self-Supervised Objectives and Hierarchical Model for Dialogue Post-Training
- 53.Language Detoxification with Attribute-Discriminative Latent Space
- 54.A Cognitive Stimulation Dialogue System with Multi-source Knowledge Fusion for Elders with Cognitive Impairment
- 55.A Synthetic Data Generation Framework for Grounded Dialogues
- 56.Your spouse needs professional help: Determining the Contextual Appropriateness of Messages through Modeling Social Relationships
- 57.XDailyDialog: A Multilingual Parallel Dialogue Corpus
- 58.HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation
- 59.Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization
- 60.RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue
- 61.Extrinsic Evaluation of Machine Translation Metrics
- 62.A Cross-Modality Context Fusion and Semantic Refinement Network for Emotion Recognition in Conversation
- 63.PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
- 64.Learning New Skills after Deployment: Improving open-domain internet-driven dialogue with human feedback
- 65.On the Compositional Generalization in Versatile Open-domain Dialogue
- 66.Dialogue Summarization with Static-Dynamic Structure Fusion Graph
- 67.Reference Matters: Benchmarking Factual Error Correction for Dialogue Summarization with Fine-grained Evaluation Framework
- 68.Multimodal Persona Based Generation of Comic Dialogs
- 69.Seen to Unseen: Exploring Compositional Generalization of Multi-Attribute Controllable Dialogue Generation
- 70.Towards Understanding Omission in Dialogue Summarization
- 71.Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems
- 72.Bridging The Gap: Entailment Fused-T5 for Open-retrieval Conversational Machine Reading Comprehension
- 73.LiveChat: A Large-Scale Personalized Dialogue Dataset Automatically Constructed from Live Streaming
- 74.FactKG: Fact Verification via Reasoning on Knowledge Graphs
- 75.Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment
- 76.With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness
- 77.Controllable Mixed-Initiative Dialogue Generation through Prompting
- 78.Listener Model for the PhotoBook Referential Game with CLIPScores as Implicit Reference Chain
- 79.Towards Fewer Hallucinations in Knowledge-Grounded Dialogue Generation via Augmentative and Contrastive Knowledge-Dialogue
Paper List
1.One Cannot Stand for Everyone! Leveraging Multiple User Simulators to train Task-oriented Dialogue Systems
Yajiao Liu,Xin Jiang,Yichun Yin,Yasheng Wang,Fei Mi,Qun Liu,Xiang Wan,Benyou Wang
Download URL
https://aclanthology.org/2023.acl-long.1/
abstract
User simulators are agents designed to imitate human users; recent advances have found that Task-oriented Dialogue (ToD) systems optimized toward a user simulator could better satisfy the need of human users. However, this might result in a sub-optimal ToD system if it is tailored to only one ad hoc user simulator, since human users can behave differently. In this paper, we propose a framework called MUST to optimize ToD systems via leveraging Multiple User SimulaTors. The main challenges of implementing MUST fall in 1) how to adaptively determine which user simulator to interact with the ToD system at each optimization step, since the ToD system might be over-fitted to some specific user simulators, and simultaneously under-fitted to some others; 2) how to avoid catastrophic forgetting of the adaption for a simulator that is not selected for several consecutive optimization steps. To tackle these challenges, we formulate MUST as a Multi-armed bandits (MAB) problem and provide a method called MUSTadaptive that balances i) the boosting adaption for adaptive interactions between different user simulators and the ToD system and ii) the uniform adaption to avoid the catastrophic forgetting issue. With both automatic evaluations and human evaluations, our experimental results on MultiWOZ show that the dialogue system trained by MUST achieves a better performance than those trained by a single user simulator. It also has a better generalization ability when testing with unseen user simulators.
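The trade-off described in this abstract (focus on the simulators the system handles poorly, but keep revisiting all of them) can be illustrated with a small epsilon-greedy bandit. This is only a sketch under assumptions: it is not the paper's MUSTadaptive method, and the names `pick_simulator`, `update_rate`, `uniform_prob`, and `step_size` are hypothetical.

```python
import random


def pick_simulator(success_rates, uniform_prob, rng):
    """Return the index of the user simulator to train against next.

    success_rates: running dialogue success rate per simulator (assumed stat).
    uniform_prob:  chance of ignoring the stats and picking uniformly.
    """
    if rng.random() < uniform_prob:
        # Uniform branch: every simulator keeps being revisited, which is a
        # simple guard against forgetting rarely selected simulators.
        return rng.randrange(len(success_rates))
    # Adaptive branch: focus on the simulator the system currently handles worst.
    return min(range(len(success_rates)), key=lambda i: success_rates[i])


def update_rate(old_rate, outcome, step_size=0.1):
    """Exponential moving average of dialogue success (outcome is 0 or 1)."""
    return (1 - step_size) * old_rate + step_size * outcome
```

In a training loop one would call `pick_simulator`, run a dialogue with the chosen simulator, and feed the success signal back through `update_rate`.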
2.SafeConv: Explaining and Correcting Conversational Unsafe Behavior
Mian Zhang,Lifeng Jin,Linfeng Song,Haitao Mi,Wenliang Chen,Dong Yu
Download URL
https://aclanthology.org/2023.acl-long.2/
abstract
One of the main challenges open-domain end-to-end dialogue systems, or chatbots, face is the prevalence of unsafe behavior, such as toxic languages and harmful suggestions. However, existing dialogue datasets do not provide enough annotation to explain and correct such unsafe behavior. In this work, we construct a new dataset called SafeConv for the research of conversational safety: (1) Besides the utterance-level safety labels, SafeConv also provides unsafe spans in an utterance, information able to indicate which words contribute to the detected unsafe behavior; (2) SafeConv provides safe alternative responses to continue the conversation when unsafe behavior detected, guiding the conversation to a gentle trajectory. By virtue of the comprehensive annotation of SafeConv, we benchmark three powerful models for the mitigation of conversational unsafe behavior, including a checker to detect unsafe utterances, a tagger to extract unsafe spans, and a rewriter to convert an unsafe response to a safe version. Moreover, we explore the huge benefits brought by combining the models for explaining the emergence of unsafe behavior and detoxifying chatbots. Experiments show that the detected unsafe behavior could be well explained with unsafe spans and popular chatbots could be detoxified by a huge extent. The dataset is available at https://github.com/mianzhang/SafeConv.
3.Span-Selective Linear Attention Transformers for Effective and Robust Schema-Guided Dialogue State Tracking
Björn Bebensee,Haejun Lee
Download URL
https://aclanthology.org/2023.acl-long.6/
abstract
In schema-guided dialogue state tracking models estimate the current state of a conversation using natural language descriptions of the service schema for generalization to unseen services. Prior generative approaches which decode slot values sequentially do not generalize well to variations in schema, while discriminative approaches separately encode history and schema and fail to account for inter-slot and intent-slot dependencies. We introduce SPLAT, a novel architecture which achieves better generalization and efficiency than prior approaches by constraining outputs to a limited prediction space. At the same time, our model allows for rich attention among descriptions and history while keeping computation costs constrained by incorporating linear-time attention. We demonstrate the effectiveness of our model on the Schema-Guided Dialogue (SGD) and MultiWOZ datasets. Our approach significantly improves upon existing models achieving 85.3 JGA on the SGD dataset. Further, we show increased robustness on the SGD-X benchmark: our model outperforms the more than 30x larger D3ST-XXL model by 5.0 points.
4.EM Pre-training for Multi-party Dialogue Response Generation
Yiyang Li,Hai Zhao
Download URL
https://aclanthology.org/2023.acl-long.7/
abstract
Dialogue response generation requires an agent to generate a response according to the current dialogue history, in terms of which two-party dialogues have been well studied, but leaving a great gap for multi-party dialogues at the same time. Different from two-party dialogues where each response is a direct reply to its previous utterance, the addressee of a response utterance should be specified before it is generated in the multi-party scenario. Thanks to the huge amount of two-party conversational data, various pre-trained language models for two-party dialogue response generation have been proposed. However, due to the lack of annotated addressee labels in multi-party dialogue datasets, it is hard to use them to pre-train a response generation model for multi-party dialogues. To tackle this obstacle, we propose an Expectation-Maximization (EM) approach that iteratively performs the expectation steps to generate addressee labels, and the maximization steps to optimize a response generation model. Theoretical analyses and extensive experiments have justified the feasibility and effectiveness of our proposed method. The official implementation of this paper is available at https://github.com/EricLee8/MPDRG.
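The expectation step mentioned in this abstract treats the addressee as a latent variable. A generic version of that step can be sketched as Bayes' rule over the candidate addressees. This is a minimal illustration, not the official MPDRG implementation: the function names are hypothetical, and the prior and likelihood values are assumed to come from whatever response generation model is being trained.

```python
def addressee_posterior(prior, likelihood):
    """E-step for one response: posterior over candidate addressee utterances.

    prior[i]:      assumed prior probability that utterance i is the addressee.
    likelihood[i]: assumed model probability of the response if i is the addressee.
    """
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


def hard_label(posterior):
    """Turn the posterior into a single addressee label (the most probable one)."""
    return max(range(len(posterior)), key=lambda i: posterior[i])
```

The maximization step would then update the generator on responses paired with these inferred labels, and the two steps would alternate.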
5.Evaluating Open-Domain Dialogues in Latent Space with Next Sentence Prediction and Mutual Information
Kun Zhao,Bohao Yang,Chenghua Lin,Wenge Rong,Aline Villavicencio,Xiaohui Cui
Download URL
https://aclanthology.org/2023.acl-long.33/
abstract
The long-standing one-to-many issue of the open-domain dialogues poses significant challenges for automatic evaluation methods, i.e., there may be multiple suitable responses which differ in semantics for a given conversational context. To tackle this challenge, we propose a novel learning-based automatic evaluation metric (CMN), which can robustly evaluate open-domain dialogues by augmenting Conditional Variational Autoencoders (CVAEs) with a Next Sentence Prediction (NSP) objective and employing Mutual Information (MI) to model the semantic similarity of text in the latent space. Experimental results on two open-domain dialogue datasets demonstrate the superiority of our method compared with a wide range of baselines, especially in handling responses which are distant to the "golden" reference responses in semantics.
6.DialoGPS: Dialogue Path Sampling in Continuous Semantic Space for Data Augmentation in Multi-Turn Conversations
Ang Lv,Jinpeng Li,Yuhan Chen,Gao Xing,Ji Zhang,Rui Yan
Download URL
https://aclanthology.org/2023.acl-long.70/
abstract
In open-domain dialogue generation tasks, contexts and responses in most datasets are one-to-one mapped, violating an important many-to-many characteristic: a context leads to various responses, and a response answers multiple contexts. Without such patterns, models poorly generalize and prefer responding safely. Many attempts have been made in either multi-turn settings from a one-to-many perspective or in a many-to-many perspective but limited to single-turn settings. The major challenge to many-to-many augment multi-turn dialogues is that discretely replacing each turn with semantic similarity breaks fragile context coherence. In this paper, we propose DialoGue Path Sampling (DialoGPS) method in continuous semantic space, the first many-to-many augmentation method for multi-turn dialogues. Specifically, we map a dialogue to our extended Brownian Bridge, a special Gaussian process. We sample latent variables to form coherent dialogue paths in the continuous space. A dialogue path corresponds to a new multi-turn dialogue and is used as augmented training data. We show the effect of DialoGPS with both automatic and human evaluation.
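To give a feel for the Brownian bridge mentioned in this abstract, the sketch below samples a standard one-dimensional bridge pinned at both ends. It is an illustration under assumptions: the paper uses its own extended bridge over latent dialogue representations, whereas this code shows only the textbook process, and `sample_bridge` and its `scale` parameter are hypothetical names.

```python
import random


def sample_bridge(start, end, num_steps, rng, scale=1.0):
    """Sample a 1-D Brownian bridge path with path[0] == start and path[-1] == end.

    Each intermediate point is drawn conditionally on the previous point and the
    fixed endpoint, so the path wanders but is always pulled back to `end`.
    """
    path = [start]
    for t in range(1, num_steps):
        prev = path[-1]
        remaining = num_steps - (t - 1)  # time left when standing at step t - 1
        mean = prev + (end - prev) / remaining
        variance = scale * (num_steps - t) / remaining
        path.append(rng.gauss(mean, variance ** 0.5))
    path.append(end)
    return path
```

Setting `scale=0.0` removes the noise and leaves a straight line between the endpoints, while larger values give more varied intermediate points. In a dialogue setting each point would stand for one turn in the latent space.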
7.DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization
Yu Li,Baolin Peng,Pengcheng He,Michel Galley,Zhou Yu,Jianfeng Gao
Download URL
https://aclanthology.org/2023.acl-long.76/
abstract
Dialogue summarization has recently garnered significant attention due to its wide range of applications. However, existing methods for summarizing dialogues have limitations because they do not take into account the inherent structure of dialogue and rely heavily on labeled data, which can lead to poor performance in new domains. In this work, we propose DIONYSUS (dynamic input optimization in pre-training for dialogue summarization), a pre-trained encoder-decoder model for summarizing dialogues in any new domain. To pre-train DIONYSUS, we create two pseudo summaries for each dialogue example: one from a fine-tuned summarization model and the other from important dialogue turns. We then choose one of these pseudo summaries based on information distribution differences in different types of dialogues. This selected pseudo summary serves as the objective for pre-training DIONYSUS using a self-supervised approach on a large dialogue corpus. Our experiments show that DIONYSUS outperforms existing methods on six datasets, as demonstrated by its ROUGE scores in zero-shot and few-shot settings.
8.Facilitating Multi-turn Emotional Support Conversation with Positive Emotion Elicitation: A Reinforcement Learning Approach
Jinfeng Zhou,Zhuang Chen,Bo Wang,Minlie Huang
Download URL
https://aclanthology.org/2023.acl-long.96/
abstract
Emotional support conversation (ESC) aims to provide emotional support (ES) to improve one's mental state. Existing works stay at fitting grounded responses and responding strategies (e.g., question), which ignore the effect on ES and lack explicit goals to guide emotional positive transition. To this end, we introduce a new paradigm to formalize multi-turn ESC as a process of positive emotion elicitation. Addressing this task requires finely adjusting the elicitation intensity in ES as the conversation progresses while maintaining conversational goals like coherence. In this paper, we propose Supporter, a mixture-of-expert-based reinforcement learning model, and well design ES and dialogue coherence rewards to guide policy's learning for responding. Experiments verify the superiority of Supporter in achieving positive emotion elicitation during responding while maintaining conversational goals including coherence.
9.Query Enhanced Knowledge-Intensive Conversation via Unsupervised Joint Modeling
Mingzhu Cai,Siqi Bao,Xin Tian,Huang He,Fan Wang,Hua Wu
Download URL
https://aclanthology.org/2023.acl-long.97/
abstract
In this paper, we propose an unsupervised query enhanced approach for knowledge-intensive conversations, namely QKConv. There are three modules in QKConv: a query generator, an off-the-shelf knowledge selector, and a response generator. QKConv is optimized through joint training, which produces the response by exploring multiple candidate queries and leveraging corresponding selected knowledge. The joint training solely relies on the dialogue context and target response, getting exempt from extra query annotations or knowledge provenances. To evaluate the effectiveness of the proposed QKConv, we conduct experiments on three representative knowledge-intensive conversation datasets: conversational question-answering, task-oriented dialogue, and knowledge-grounded conversation. Experimental results reveal that QKConv performs better than all unsupervised methods across three datasets and achieves competitive performance compared to supervised methods.
10.White-Box Multi-Objective Adversarial Attack on Dialogue Generation
Yufei Li,Zexin Li,Yingfan Gao,Cong Liu
Download URL
https://aclanthology.org/2023.acl-long.100/
abstract
Pre-trained transformers are popular in state-of-the-art dialogue generation (DG) systems. Such language models are, however, vulnerable to various adversarial samples as studied in traditional tasks such as text classification, which inspires our curiosity about their robustness in DG systems. One main challenge of attacking DG models is that perturbations on the current sentence can hardly degrade the response accuracy because the unchanged chat histories are also considered for decision-making. Instead of merely pursuing pitfalls of performance metrics such as BLEU, ROUGE, we observe that crafting adversarial samples to force longer generation outputs benefits attack effectiveness: the generated responses are typically irrelevant, lengthy, and repetitive. To this end, we propose a white-box multi-objective attack method called DGSlow. Specifically, DGSlow balances two objectives, generation accuracy and length, via a gradient-based multi-objective optimizer and applies an adaptive searching mechanism to iteratively craft adversarial samples with only a few modifications. Comprehensive experiments on four benchmark datasets demonstrate that DGSlow could significantly degrade state-of-the-art DG models with a higher success rate than traditional accuracy-based methods. Besides, our crafted sentences also exhibit strong transferability in attacking other models.
11.Divide, Conquer, and Combine: Mixture of Semantic-Independent Experts for Zero-Shot Dialogue State Tracking
Qingyue Wang,Liang Ding,Yanan Cao,Yibing Zhan,Zheng Lin,Shi Wang,Dacheng Tao,Li Guo
Download URL
https://aclanthology.org/2023.acl-long.114/
abstract
Zero-shot transfer learning for Dialogue State Tracking (DST) helps to handle a variety of task-oriented dialogue domains without the cost of collecting in-domain data. Existing works mainly study common data- or model-level augmentation methods to enhance the generalization but fail to effectively decouple semantics of samples, limiting the zero-shot performance of DST. In this paper, we present a simple and effective "divide, conquer and combine" solution, which explicitly disentangles the semantics of seen data, and leverages the performance and robustness with the mixture-of-experts mechanism. Specifically, we divide the seen data into semantically independent subsets and train corresponding experts, the newly unseen samples are mapped and inferred with mixture-of-experts with our designed ensemble inference. Extensive experiments on MultiWOZ2.1 upon T5-Adapter show our schema significantly and consistently improves the zero-shot performance, achieving the SOTA on settings without external knowledge, with only 10M trainable parameters.
12.BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
Claytone Sikasote,Eunice Mukonde,Md Mahfuz Ibn Alam,Antonios Anastasopoulos
Download URL
https://aclanthology.org/2023.acl-long.115/
abstract
We present BIG-C (Bemba Image Grounded Conversations), a large multimodal dataset for Bemba. While Bemba is the most populous language of Zambia, it exhibits a dearth of resources which render the development of language technologies or language processing research almost impossible. The dataset is comprised of multi-turn dialogues between Bemba speakers based on images, transcribed and translated into English. There are more than 92,000 utterances/sentences, amounting to more than 180 hours of audio data with corresponding transcriptions and English translations. We also provide baselines on speech recognition (ASR), machine translation (MT) and speech translation (ST) tasks, and sketch out other potential future multimodal uses of our dataset. We hope that by making the dataset available to the research community, this work will foster research and encourage collaboration across the language, speech, and vision communities especially for languages outside the "traditionally" used high-resourced ones. All data and code are publicly available: https://github.com/csikasote/bigc.
13.Schema-Guided User Satisfaction Modeling for Task-Oriented Dialogues
Yue Feng,Yunlong Jiao,Animesh Prasad,Nikolaos Aletras,Emine Yilmaz,Gabriella Kazai
Download URL
https://aclanthology.org/2023.acl-long.116/
abstract
User Satisfaction Modeling (USM) is one of the popular choices for task-oriented dialogue systems evaluation, where user satisfaction typically depends on whether the user's task goals were fulfilled by the system. Task-oriented dialogue systems use task schema, which is a set of task attributes, to encode the user's task goals. Existing studies on USM neglect explicitly modeling the user's task goals fulfillment using the task schema. In this paper, we propose SG-USM, a novel schema-guided user satisfaction modeling framework. It explicitly models the degree to which the user's preferences regarding the task attributes are fulfilled by the system for predicting the user's satisfaction level. SG-USM employs a pre-trained language model for encoding dialogue context and task attributes. Further, it employs a fulfillment representation layer for learning how many task attributes have been fulfilled in the dialogue, an importance predictor component for calculating the importance of task attributes. Finally, it predicts the user satisfaction based on task attribute fulfillment and task attribute importance. Experimental results on benchmark datasets (i.e. MWOZ, SGD, ReDial, and JDDC) show that SG-USM consistently outperforms competitive existing methods. Our extensive analysis demonstrates that SG-USM can improve the interpretability of user satisfaction modeling, has good scalability as it can effectively deal with unseen tasks and can also effectively work in low-resource settings by leveraging unlabeled data. Code is available at https://github.com/amzn/user-satisfaction-modeling.
14.MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions
Hao Sun,Zhexin Zhang,Fei Mi,Yasheng Wang,Wei Liu,Jianwei Cui,Bin Wang,Qun Liu,Minlie Huang
Download URL
https://aclanthology.org/2023.acl-long.123/
abstract
Morality in dialogue systems has raised great attention in research recently. A moral dialogue system aligned with users' values could enhance conversation engagement and user connections. In this paper, we propose a framework, MoralDial to train and evaluate moral dialogue systems. In our framework, we first explore the communication mechanisms of morality and resolve expressed morality into three parts, which indicate the roadmap for building a moral dialogue system. Based on that, we design a simple yet effective method: constructing moral discussions between simulated specific users and the dialogue system. The constructed discussions consist of expressing, explaining, revising, and inferring moral views in dialogue exchanges, which makes conversational models learn morality well in a natural manner. Furthermore, we propose a novel evaluation method under the framework. We evaluate the multiple aspects of morality by judging the relation between dialogue responses and human values in discussions, where the multifaceted nature of morality is particularly considered. Automatic and manual experiments demonstrate that our framework is promising to train and evaluate moral dialogue systems.
15.Injecting knowledge into language generation: a case study in auto-charting after-visit care instructions from medical dialogue
Maksim Eremeev,Ilya Valmianski,Xavier Amatriain,Anitha Kannan
Download URL
https://aclanthology.org/2023.acl-long.133/
abstract
Factual correctness is often the limiting factor in practical applications of natural language generation in high-stakes domains such as healthcare. An essential requirement for maintaining factuality is the ability to deal with rare tokens. This paper focuses on rare tokens that appear in both the source and the reference sequences, and which, when missed during generation, decrease the factual correctness of the output text. For high-stake domains that are also knowledge-rich, we show how to use knowledge to (a) identify which rare tokens that appear in both source and reference are important and (b) uplift their conditional probability. We introduce the "utilization rate" that encodes knowledge and serves as a regularizer by maximizing the marginal probability of selected tokens. We present a study in a knowledge-rich domain of healthcare, where we tackle the problem of generating after-visit care instructions based on patient-doctor dialogues. We verify that, in our dataset, specific medical concepts with high utilization rates are underestimated by conventionally trained sequence-to-sequence models. We observe that correcting this with our approach to knowledge injection reduces the uncertainty of the model as well as improves factuality and coherence without negatively impacting fluency.
16.DiffusEmp: A Diffusion Model-Based Framework with Multi-Grained Control for Empathetic Response Generation
Guanqun Bi,Lei Shen,Yanan Cao,Meng Chen,Yuqiang Xie,Zheng Lin,Xiaodong He
Download URL
https://aclanthology.org/2023.acl-long.158/
abstract
Empathy is a crucial factor in open-domain conversations, which naturally shows one's caring and understanding to others. Though several methods have been proposed to generate empathetic responses, existing works often lead to monotonous empathy that refers to generic and safe expressions. In this paper, we propose to use explicit control to guide the empathy expression and design a framework DiffusEmp based on conditional diffusion language model to unify the utilization of dialogue context and attribute-oriented control signals. Specifically, communication mechanism, intent, and semantic frame are imported as multi-grained signals that control the empathy realization from coarse to fine levels. We then design a specific masking strategy to reflect the relationship between multi-grained signals and response tokens, and integrate it into the diffusion model to influence the generative process. Experimental results on a benchmark dataset EmpatheticDialogue show that our framework outperforms competitive baselines in terms of controllability, informativeness, and diversity without the loss of context-relatedness.
17.BREAK: Breaking the Dialogue State Tracking Barrier with Beam Search and Re-ranking
Seungpil Won,Heeyoung Kwak,Joongbo Shin,Janghoon Han,Kyomin Jung
Download URL
https://aclanthology.org/2023.acl-long.159/
abstract
Despite the recent advances in dialogue state tracking (DST), the joint goal accuracy (JGA) of the existing methods on MultiWOZ 2.1 still remains merely 60%. In our preliminary error analysis, we find that beam search produces a pool of candidates that is likely to include the correct dialogue state. Motivated by this observation, we introduce a novel framework, called BREAK (Beam search and RE-rAnKing), that achieves outstanding performance on DST. BREAK performs DST in two stages: (i) generating k-best dialogue state candidates with beam search and (ii) re-ranking the candidates to select the correct dialogue state. This simple yet powerful framework shows state-of-the-art performance on all versions of MultiWOZ and M2M datasets. Most notably, we push the joint goal accuracy to 80-90% on MultiWOZ 2.1-2.4, which is an improvement of 23.6%, 26.3%, 21.7%, and 10.8% over the previous best-performing models, respectively. The data and code will be available at https://github.com/tony-won/DST-BREAK
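The motivation in this abstract rests on two quantities: joint goal accuracy, where a turn counts only if the whole predicted state matches the gold state, and the share of turns whose candidate pool contains the gold state at all. The sketch below computes both and shows where a re-ranker plugs in. It is an illustration, not the BREAK code: all function names are hypothetical, dialogue states are assumed to be plain slot-to-value dictionaries, and the scoring function is left to the caller.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted state equals the gold state."""
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return correct / len(gold_states)


def oracle_accuracy(candidate_lists, gold_states):
    """Fraction of turns where the gold state appears anywhere in the k-best pool.

    This is the ceiling a perfect re-ranker could reach with those candidates.
    """
    hits = sum(1 for cands, gold in zip(candidate_lists, gold_states) if gold in cands)
    return hits / len(gold_states)


def rerank(candidates, score_fn):
    """Second stage: return the candidate that the scoring function rates highest."""
    return max(candidates, key=score_fn)
```

A large gap between `joint_goal_accuracy` on the top-1 candidates and `oracle_accuracy` on the full pools is the situation in which re-ranking has room to help.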
18.Learning to Generate Equitable Text in Dialogue from Biased Training Data
Anthony Sicilia,Malihe Alikhani
Download URL
https://aclanthology.org/2023.acl-long.163/
abstract
The ingrained principles of fairness in a dialogue system's decision-making process and generated responses are crucial for user engagement, satisfaction, and task achievement. Absence of equitable and inclusive principles can hinder the formation of common ground, which in turn negatively impacts the overall performance of the system. For example, misusing pronouns in a user interaction may cause ambiguity about the intended subject. Yet, there is no comprehensive study of equitable text generation in dialogue. Aptly, in this work, we use theories of computational learning to study this problem. We provide formal definitions of equity in text generation, and further, prove formal connections between learning human-likeness and learning equity: algorithms for improving equity ultimately reduce to algorithms for improving human-likeness (on augmented data). With this insight, we also formulate reasonable conditions under which text generation algorithms can learn to generate equitable text without any modifications to the biased training data on which they learn. To exemplify our theory in practice, we look at a group of algorithms for the GuessWhat?! visual dialogue game and, using this example, test our theory empirically. Our theory accurately predicts relative-performance of multiple algorithms in generating equitable text as measured by both human and automated evaluation.
19.TREA: Tree-Structure Reasoning Schema for Conversational Recommendation
Wendi Li,Wei Wei,Xiaoye Qu,Xian-Ling Mao,Ye Yuan,Wenfeng Xie,Dangyang Chen
Download URL
https://aclanthology.org/2023.acl-long.167/
abstract
Conversational recommender systems (CRS) aim to timely trace the dynamic interests of users through dialogues and generate relevant responses for item recommendations. Recently, various external knowledge bases (especially knowledge graphs) are incorporated into CRS to enhance the understanding of conversation contexts. However, recent reasoning-based models heavily rely on simplified structures such as linear structures or fixed-hierarchical structures for causality reasoning, hence they cannot fully figure out sophisticated relationships among utterances with external knowledge. To address this, we propose a novel Tree structure Reasoning schEmA named TREA. TREA constructs a multi-hierarchical scalable tree as the reasoning structure to clarify the causal relationships between mentioned entities, and fully utilizes historical conversations to generate more reasonable and suitable responses for recommended results. Extensive experiments on two public CRS datasets have demonstrated the effectiveness of our approach.
20.CORE: Cooperative Training of Retriever-Reranker for Effective Dialogue Response Selection
Chongyang Tao,Jiazhan Feng,Tao Shen,Chang Liu,Juntao Li,Xiubo Geng,Daxin Jiang
Download URL
https://aclanthology.org/2023.acl-long.174/
abstract
Establishing retrieval-based dialogue systems that can select appropriate responses from the pre-built index has gained increasing attention. Recent common practice is to construct a two-stage pipeline with a fast retriever (e.g., bi-encoder) for first-stage recall followed by a smart response reranker (e.g., cross-encoder) for precise ranking. However, existing studies either optimize the retriever and reranker in independent ways, or distill the knowledge from a pre-trained reranker into the retriever in an asynchronous way, leading to sub-optimal performance of both modules. Thus, an open question remains about how to train them for a better combination of the best of both worlds. To this end, we present a cooperative training of the response retriever and the reranker whose parameters are dynamically optimized by the ground-truth labels as well as list-wise supervision signals from each other. As a result, the two modules can learn from each other and evolve together throughout the training. Experimental results on two benchmarks demonstrate the superiority of our method.
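The abstract mentions list-wise supervision signals exchanged between the two modules but does not state the loss. One common way to express such a signal is a KL divergence between the softmax-normalized score lists of the two models over the same candidates; the sketch below assumes that choice. The function names and all scores are hypothetical, and the ground-truth label term is omitted.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw candidate scores into a probability distribution over the list."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_kl(teacher_scores: List[float], student_scores: List[float]) -> float:
    """KL(teacher || student) over one list of candidate responses."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Made-up scores from each module for the same four candidate responses.
retriever_scores = [2.0, 1.0, 0.5, -1.0]
reranker_scores = [0.5, 3.0, 0.0, -2.0]

# Each module is pulled towards the other's ranking distribution.
loss_for_retriever = listwise_kl(reranker_scores, retriever_scores)
loss_for_reranker = listwise_kl(retriever_scores, reranker_scores)
print(loss_for_retriever, loss_for_reranker)
```

Both terms vanish only when the two modules rank the list identically, so in this formulation each module keeps receiving a signal while they disagree.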
21.PVGRU: Generating Diverse and Relevant Dialogue Responses via Pseudo-Variational Mechanism
Yongkang Liu,Shi Feng,Daling Wang,Yifei Zhang,Hinrich Schütze
Download URL
https://aclanthology.org/2023.acl-long.185/
abstract
We investigate response generation for multi-turn dialogue in generative chatbots. Existing generative models based on RNNs (Recurrent Neural Networks) usually employ the last hidden state to summarize the history, which makes models unable to capture the subtle variability observed in different dialogues and cannot distinguish the differences between dialogues that are similar in composition. In this paper, we propose Pseudo-Variational Gated Recurrent Unit (PVGRU). The key novelty of PVGRU is a recurrent summarizing variable that aggregates the accumulated distribution variations of subsequences. We train PVGRU without relying on posterior knowledge, thus avoiding the training-inference inconsistency problem. PVGRU can perceive subtle semantic variability through summarizing variables that are optimized by two objectives we employ for training: distribution consistency and reconstruction. In addition, we build a Pseudo-Variational Hierarchical Dialogue (PVHD) model based on PVGRU. Experimental results demonstrate that PVGRU can broadly improve the diversity and relevance of responses on two benchmark datasets.
22.MPCHAT: Towards Multimodal Persona-Grounded Conversation
Jaewoo Ahn,Yeda Song,Sangdoo Yun,Gunhee Kim
Download URL
https://aclanthology.org/2023.acl-long.189/
abstract
In order to build self-consistent personalized dialogue agents, previous research has mostly focused on textual persona that delivers personal facts or personalities. However, to fully describe the multi-faceted nature of persona, image modality can help better reveal the speaker's personal characteristics and experiences in episodic memory (Rubin et al., 2003; Conway, 2009). In this work, we extend persona-based dialogue to the multimodal domain and make two main contributions. First, we present the first multimodal persona-based dialogue dataset named MPCHAT, which extends persona with both text and images to contain episodic memories. Second, we empirically show that incorporating multimodal persona, as measured by three proposed multimodal persona-grounded dialogue tasks (i.e., next response prediction, grounding persona prediction, and speaker identification), leads to statistically significant performance improvements across all tasks. Thus, our work highlights that multimodal persona is crucial for improving multimodal dialogue comprehension, and our MPCHAT serves as a high-quality resource for this research.
23.Towards Boosting the Open-Domain Chatbot with Human Feedback
Hua Lu,Siqi Bao,Huang He,Fan Wang,Hua Wu,Haifeng Wang
Download URL
https://aclanthology.org/2023.acl-long.224/
abstract
Many open-domain dialogue models pre-trained with social media comments can generate coherent replies but have difficulties producing engaging responses. This phenomenon might mainly result from the deficiency of annotated human-human conversations and the misalignment with human preference. In this paper, we propose a novel and efficient framework Diamante to boost the open-domain chatbot, where two kinds of human feedback (including explicit demonstration and implicit preference) are collected and leveraged. By asking annotators to select or amend the model-generated candidate responses, Diamante efficiently collects the human demonstrated responses and constructs a Chinese chit-chat dataset. To enhance the alignment with human preference, Diamante leverages the implicit preference in the data collection process and introduces the generation-evaluation joint training. Comprehensive experiments indicate that the Diamante dataset and joint training paradigm can significantly boost the performance of pre-trained dialogue models. The overall engagingness of the previous state-of-the-art model has been improved remarkably by 50% in Chinese open-domain conversations.
24.Knowledge-enhanced Mixed-initiative Dialogue System for Emotional Support Conversations
Yang Deng,Wenxuan Zhang,Yifei Yuan,Wai Lam
Download URL
https://aclanthology.org/2023.acl-long.225/
abstract
Unlike empathetic dialogues, the system in emotional support conversations (ESC) is expected to not only convey empathy for comforting the help-seeker, but also proactively assist in exploring and addressing their problems during the conversation. In this work, we study the problem of mixed-initiative ESC where the user and system can both take the initiative in leading the conversation. Specifically, we conduct a novel analysis on mixed-initiative ESC systems with a tailor-designed schema that divides utterances into different types with speaker roles and initiative types. Four emotional support metrics are proposed to evaluate the mixed-initiative interactions. The analysis reveals the necessity and challenges of building mixed-initiative ESC systems. In the light of this, we propose a knowledge-enhanced mixed-initiative framework (KEMI) for ESC, which retrieves actual case knowledge from a large-scale mental health knowledge graph for generating mixed-initiative responses. Experimental results on two ESC datasets show the superiority of KEMI in both content-preserving evaluation and mixed initiative related analyses.
25.ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems
Sarik Ghazarian,Yijia Shao,Rujun Han,Aram Galstyan,Nanyun Peng
Download URL
https://aclanthology.org/2023.acl-long.241/
abstract
Commonsense reasoning is omnipresent in human communications and thus is an important feature for open-domain dialogue systems. However, evaluating commonsense in dialogue systems is still an open challenge. We take the first step by focusing on event commonsense that considers events and their relations, and is crucial in both dialogues and general commonsense reasoning. We propose ACCENT, an event commonsense evaluation metric empowered by commonsense knowledge bases (CSKBs). ACCENT first extracts event-relation tuples from a dialogue, and then evaluates the response by scoring the tuples in terms of their compatibility with the CSKB. To evaluate ACCENT, we construct the first public event commonsense evaluation dataset for open-domain dialogues. Our experiments show that ACCENT is an efficient metric for event commonsense evaluation, which achieves higher correlations with human judgments than existing baselines.
26.Towards Faithful Dialogues via Focus Learning
Yifan Deng,Xingsheng Zhang,Heyan Huang,Yue Hu
Download URL
https://aclanthology.org/2023.acl-long.250/
abstract
Maintaining faithfulness between responses and knowledge is an important research topic for building reliable knowledge-grounded dialogue systems. Existing models heavily rely on elaborate data engineering or increasing the model's parameters ignoring to track the tokens that significantly influence losses, which is decisive for the optimization direction of the model in each iteration. To address this issue, we propose Focus Learning (FocusL), a novel learning approach that adjusts the contribution of each token to the optimization direction by directly scaling the corresponding objective loss. Specifically, we first introduce a positioning method by utilizing similarity distributions between knowledge and each response token to locate knowledge-aware tokens. Then, we further design a similarity-to-weight transformation to provide dynamic token-level weights for the cross-entropy loss. Finally, we use the weighted loss to encourage the model to pay special attention to the knowledge utilization. Experimental results demonstrate that our method achieves the new state-of-the-art results and generates more reliable responses while maintaining training stability.
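A token-weighted cross-entropy loss of the kind this abstract describes can be sketched as follows. The abstract does not give the similarity-to-weight transformation, so `similarity_to_weight` below is a hypothetical linear mapping chosen only for illustration, and the probabilities and similarity values are made up.

```python
import math
from typing import List


def similarity_to_weight(similarities: List[float], alpha: float = 1.0) -> List[float]:
    """Hypothetical transform: baseline weight 1.0 plus alpha times the
    token-knowledge similarity, with the similarity clipped to [0, 1]."""
    return [1.0 + alpha * min(max(s, 0.0), 1.0) for s in similarities]


def weighted_cross_entropy(gold_probs: List[float], weights: List[float]) -> float:
    """Weighted mean of per-token negative log-likelihoods.

    gold_probs[i] is the probability the model assigns to the gold token i.
    """
    weighted = sum(w * -math.log(p) for p, w in zip(gold_probs, weights))
    return weighted / sum(weights)


# Made-up example: the third response token is knowledge-aware (high
# similarity to the knowledge) and the model currently predicts it poorly.
gold_probs = [0.9, 0.8, 0.2, 0.85]
similarities = [0.05, 0.10, 0.95, 0.00]

uniform_loss = weighted_cross_entropy(gold_probs, [1.0] * len(gold_probs))
focused_loss = weighted_cross_entropy(gold_probs, similarity_to_weight(similarities))
print(uniform_loss, focused_loss)
```

With these numbers the poorly predicted knowledge-aware token contributes more to the weighted loss than to the uniform one, which is the intended effect of scaling per-token losses.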
27.Prompter: Zero-shot Adaptive Prefixes for Dialogue State Tracking Domain Adaptation
Ibrahim Taha Aksu,Min-Yen Kan,Nancy Chen
Download URL
https://aclanthology.org/2023.acl-long.252/
abstract
A challenge in the Dialogue State Tracking (DST) field is adapting models to new domains without using any supervised data: zero-shot domain adaptation. Parameter-Efficient Transfer Learning (PETL) has the potential to address this problem due to its robustness. However, it has yet to be applied to the zero-shot scenarios, as it is not clear how to apply it unsupervisedly. Our method, Prompter, uses descriptions of target domain slots to generate dynamic prefixes that are concatenated to the key and values at each layer's self-attention mechanism. This allows for the use of prefix-tuning in zero-shot. Prompter outperforms previous methods on both the MultiWOZ and SGD benchmarks. In generating prefixes, our analyses find that Prompter not only utilizes the semantics of slot descriptions but also how often the slots appear together in conversation. Moreover, Prompter's gains are due to its improved ability to distinguish "none"-valued dialogue slots, compared against baselines.
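The mechanism of concatenating prefixes to the keys and values of self-attention can be sketched in a few lines. This is a simplified illustration, assuming a single attention head, a single query vector, and no learned projections; in Prompter the prefix vectors would be generated from slot descriptions, whereas here they are hard-coded, made-up numbers.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def attend_with_prefix(query: Vector, keys: List[Vector], values: List[Vector],
                       prefix_keys: List[Vector], prefix_values: List[Vector]) -> Vector:
    """Prefix-tuning style attention: prefix vectors are prepended to the
    keys and values, so the query can also attend to them."""
    return attend(query, prefix_keys + keys, prefix_values + values)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
prefix_keys = [[1.0, 0.0]]      # hypothetical prefix derived from a slot description
prefix_values = [[5.0, 5.0]]

without_prefix = attend(query, keys, values)
with_prefix = attend_with_prefix(query, keys, values, prefix_keys, prefix_values)
print(without_prefix, with_prefix)
```

The token keys and values are untouched; only the set of positions the query can attend to grows, which is why the prefixes can steer the layer's output without changing the base model's weights.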
28.Enhancing Dialogue Generation via Dynamic Graph Knowledge Aggregation
Chen Tang,Hongbo Zhang,Tyler Loakman,Chenghua Lin,Frank Guerin
Download URL
https://aclanthology.org/2023.acl-long.253/
abstract
Incorporating external graph knowledge into neural chatbot models has been proven effective for enhancing dialogue generation. However, in conventional graph neural networks (GNNs), message passing on a graph is independent from text, resulting in the graph representation hidden space differing from that of the text. This training regime of existing models therefore leads to a semantic gap between graph knowledge and text. In this study, we propose a novel framework for knowledge graph enhanced dialogue generation. We dynamically construct a multi-hop knowledge graph with pseudo nodes to involve the language model in feature aggregation within the graph at all steps. To avoid the semantic biases caused by learning on vanilla subgraphs, the proposed framework applies hierarchical graph attention to aggregate graph features on pseudo nodes and then attains a global feature. Therefore, the framework can better utilise the heterogeneous features from both the post and external graph knowledge. Extensive experiments demonstrate that our framework outperforms state-of-the-art (SOTA) baselines on dialogue generation. Further analysis also shows that our representation learning framework can fill the semantic gap by coagulating representations of both text and graph knowledge. Moreover, the language model also learns how to better select knowledge triples for a more informative response via exploiting subgraph patterns within our feature aggregation process. Our code and resources are available at https://github.com/tangg555/SaBART.
29.Privacy-Preserving Domain Adaptation of Semantic Parsers
Fatemehsadat Mireshghallah,Yu Su,Tatsunori Hashimoto,Jason Eisner,Richard Shin
Download URL
https://aclanthology.org/2023.acl-long.271/
abstract
Task-oriented dialogue systems often assist users with personal or confidential matters. For this reason, the developers of such a system are generally prohibited from observing actual usage. So how can they know where the system is failing and needs more training data or new functionality? In this work, we study ways in which realistic user utterances can be generated synthetically, to help increase the linguistic and functional coverage of the system, without compromising the privacy of actual users. To this end, we propose a two-stage Differentially Private (DP) generation method which first generates latent semantic parses, and then generates utterances based on the parses. Our proposed approach improves MAUVE by 2.5X and parse tree function-type overlap by 1.3X relative to current approaches for private synthetic data generation, improving both on fluency and semantic coverage. We further validate our approach on a realistic domain adaptation task of adding new functionality from private user data to a semantic parser, and show overall gains of 8.5% points on its accuracy with the new feature.
30.VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
Yuxuan Wang,Zilong Zheng,Xueliang Zhao,Jinpeng Li,Yueqian Wang,Dongyan Zhao
Download URL
https://aclanthology.org/2023.acl-long.276/
abstract
Video-grounded dialogue understanding is a challenging problem that requires machine to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities the same as a frame-independent visual understanding task, while neglecting the intrinsic attributes in multimodal dialogues, such as scene and topic transitions. In this paper, we present Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in video-grounded dialogue understanding and generation.
31.Enhancing Personalized Dialogue Generation with Contrastive Latent Variables: Combining Sparse and Dense Persona
Yihong Tang,Bo Wang,Miao Fang,Dongming Zhao,Kun Huang,Ruifang He,Yuexian Hou
Download URL
https://aclanthology.org/2023.acl-long.299/
abstract
The personalized dialogue explores the consistent relationship between dialogue generation and personality. Existing personalized dialogue agents model persona profiles from three resources: sparse or dense persona descriptions and dialogue histories. However, sparse structured persona attributes are explicit but uninformative, dense persona texts contain rich persona descriptions with much noise, and dialogue history query is both noisy and uninformative for persona modeling. In this work, we combine the advantages of the three resources to obtain a richer and more accurate persona. We design a Contrastive Latent Variable-based model (CLV) that clusters the dense persona descriptions into sparse categories, which are combined with the history query to generate personalized responses. Experimental results on Chinese and English datasets demonstrate our model's superiority in personalization.
32.FutureTOD: Teaching Future Knowledge to Pre-trained Language Model for Task-Oriented Dialogue
Weihao Zeng,Keqing He,Yejie Wang,Chen Zeng,Jingang Wang,Yunsen Xian,Weiran Xu
Download URL
https://aclanthology.org/2023.acl-long.360/
abstract
Pre-trained language models based on general text enable huge success in the NLP scenario. But the intrinsical difference of linguistic patterns between general text and task-oriented dialogues makes existing pre-trained language models less useful in practice. Current dialogue pre-training methods rely on a contrastive framework and face the challenges of both selecting true positives and hard negatives. In this paper, we propose a novel dialogue pre-training model, FutureTOD, which distills future knowledge to the representation of the previous dialogue context using a self-training framework. Our intuition is that a good dialogue representation both learns local context information and predicts future information. Extensive experiments on diverse downstream dialogue tasks demonstrate the effectiveness of our model, especially the generalization, robustness, and learning discriminative dialogue representations capabilities.
33.PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives
Silin Gao,Beatriz Borges,Soyoung Oh,Deniz Bayazit,Saya Kanno,Hiromi Wakaki,Yuki Mitsufuji,Antoine Bosselut
Download URL
https://aclanthology.org/2023.acl-long.362/
abstract
Sustaining coherent and engaging narratives requires dialogue or storytelling agents to understand how the personas of speakers or listeners ground the narrative. Specifically, these agents must infer personas of their listeners to produce statements that cater to their interests. They must also learn to maintain consistent speaker personas for themselves throughout the narrative, so that their counterparts feel involved in a realistic conversation or story. However, personas are diverse and complex: they entail large quantities of rich interconnected world knowledge that is challenging to robustly represent in general narrative systems (e.g., a singer is good at singing, and may have attended conservatoire). In this work, we construct a new large-scale persona commonsense knowledge graph, PeaCoK, containing ~100K human-validated persona facts. Our knowledge graph schematizes five dimensions of persona knowledge identified in previous studies of human interactive behaviours, and distils facts in this schema from both existing commonsense knowledge graphs and large-scale pretrained language models. Our analysis indicates that PeaCoK contains rich and precise world persona inferences that help downstream systems generate more consistent and engaging narratives.
34.Retrieval-free Knowledge Injection through Multi-Document Traversal for Dialogue Models
Rui Wang,Jianzhu Bao,Fei Mi,Yi Chen,Hongru Wang,Yasheng Wang,Yitong Li,Lifeng Shang,Kam-Fai Wong,Ruifeng Xu
Download URL
https://aclanthology.org/2023.acl-long.364/
abstract
Dialogue models are often enriched with extensive external knowledge to provide informative responses through a retrieval-augmented pipeline. Nevertheless, retrieval-augmented approaches rely on finely annotated retrieval training data and knowledge-grounded response generation data, making it costly to transfer. To tackle this challenge, this paper proposed a retrieval-free approach, KiDG, by automatically turning knowledge documents into simulated multi-turn dialogues through a Multi-Document Traversal algorithm. The simulated knowledge-intensive dialogues constructed by KiDG in one domain can be easily used to train and enhance pre-trained dialogue models' knowledge w.r.t. this domain without costly annotation. We conduct extensive experiments comparing retrieval-augmented models and a variety of retrieval-free models. We found that dialogue models enhanced with data simulated with KiDG largely outperform state-of-the-art retrieval-free methods, and it achieves comparable performance compared to retrieval-augmented methods while being better, and cheaper at domain transfer.
35.Annotating and Detecting Fine-grained Factual Errors for Dialogue Summarization
Rongxin Zhu,Jianzhong Qi,Jey Han Lau
Download URL
https://aclanthology.org/2023.acl-long.377/
abstract
A series of datasets and models have been proposed for summaries generated for well-formatted documents such as news articles. Dialogue summaries, however, have been under explored. In this paper, we present the first dataset with fine-grained factual error annotations named DIASUMFACT. We define fine-grained factual error detection as a sentence-level multi-label classification problem, and we evaluate two state-of-the-art (SOTA) models on our dataset. Both models yield sub-optimal results, with a macro-averaged F1 score of around 0.25 over 6 error classes. We further propose an unsupervised model ENDERANKER via candidate ranking using pretrained encoder-decoder models. Our model performs on par with the SOTA models while requiring fewer resources. These observations confirm the challenges in detecting factual errors from dialogue summaries, which call for further studies, for which our dataset and results offer a solid foundation.
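For readers unfamiliar with the metric quoted in this abstract, the sketch below shows how a macro-averaged F1 score is computed for sentence-level multi-label classification. The class names `A`, `B`, `C` and the toy annotations are placeholders, not the dataset's actual error classes, and scoring a class with no gold and no predicted labels as 0 is an assumed convention.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], classes: List[str]) -> float:
    """Average the per-class F1 scores, giving every class equal weight.

    gold[i] and pred[i] are the label sets for sentence i.
    """
    scores = []
    for cls in classes:
        tp = sum(1 for g, p in zip(gold, pred) if cls in g and cls in p)
        fp = sum(1 for g, p in zip(gold, pred) if cls not in g and cls in p)
        fn = sum(1 for g, p in zip(gold, pred) if cls in g and cls not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class that never occurs and is never
        # predicted gets F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(classes)


# Toy annotations for four summary sentences (placeholder labels).
classes = ["A", "B", "C"]
gold = [{"A"}, {"A", "B"}, set(), {"C"}]
pred = [{"A"}, {"B"}, {"B"}, set()]

score = macro_f1(gold, pred, classes)
print(round(score, 4))
```

Because every class counts equally, a class the model never detects (here `C`) pulls the average down sharply, which is one reason macro-averaged scores over several rare error classes tend to be low.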
36.MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation
Jiazhan Feng,Qingfeng Sun,Can Xu,Pu Zhao,Yaming Yang,Chongyang Tao,Dongyan Zhao,Qingwei Lin
Download URL
https://aclanthology.org/2023.acl-long.405/
abstract
Responding with multi-modal content has been recognized as an essential capability for an intelligent conversational agent. In this paper, we introduce the MMDialog dataset to facilitate multi-modal conversation better. MMDialog is composed of a curated set of 1.08 million real-world dialogues with 1.53 million unique images across 4,184 topics. MMDialog has two main and unique advantages. First, it is the largest multi-modal conversation dataset by the number of dialogues by 88x. Second, it contains massive topics to generalize the open domain. To build an engaging dialogue system with this dataset, we propose and normalize two response prediction tasks based on retrieval and generative scenarios. In addition, we build two baselines for the above tasks with state-of-the-art techniques and report their experimental performance. We also propose a novel evaluation metric MM-Relevance to measure the multi-modal responses. Our dataset is available in https://github.com/victorsungo/MMDialog.
37.Envisioning Future from the Past: Hierarchical Duality Learning for Multi-Turn Dialogue Generation
Ang Lv,Jinpeng Li,Shufang Xie,Rui Yan
Download URL
https://aclanthology.org/2023.acl-long.407/
abstract
In this paper, we define a widely neglected property in dialogue text, duality, which is a hierarchical property that is reflected in human behaviours in daily conversations: Based on the logic in a conversation (or a sentence), people can infer follow-up utterances (or tokens) based on the previous text, and vice versa. We propose a hierarchical duality learning for dialogue (HDLD) to simulate this human cognitive ability, for generating high quality responses that connect both previous and follow-up dialogues. HDLD utilizes hierarchical dualities at token hierarchy and utterance hierarchy. HDLD maximizes the mutual information between past and future utterances. Thus, even if future text is invisible during inference, HDLD is capable of estimating future information implicitly based on dialogue history and generates both coherent and informative responses. In contrast to previous approaches that solely utilize future text as auxiliary information to encode during training, HDLD leverages duality to enable interaction between dialogue history and the future. This enhances the utilization of dialogue data, leading to the improvement in both automatic and human evaluation.
38.Can Language Models Make Fun? A Case Study in Chinese Comical Crosstalk
Jianquan Li,XiangBo Wu,Xiaokang Liu,Qianqian Xie,Prayag Tiwari,Benyou Wang
Download URL
https://aclanthology.org/2023.acl-long.419/
abstract
Language is the principal tool for human communication, in which humor is one of the most attractive parts. Producing natural language like humans using computers, a.k.a, Natural Language Generation (NLG), has been widely used for dialogue systems, chatbots, machine translation, as well as computer-aid creation e.g., idea generations, scriptwriting. However, the humor aspect of natural language is relatively under-investigated, especially in the age of pre-trained language models. In this work, we aim to preliminarily test whether NLG can generate humor as humans do. We build a largest dataset consisting of numerous Chinese Comical Crosstalk scripts (called C3 in short), which is for a popular Chinese performing art called "Xiangsheng" or "相声" since 1800s. We benchmark various generation approaches including training-from-scratch Seq2seq, fine-tuned middle-scale PLMs, and large-scale PLMs (with and without fine-tuning). Moreover, we also conduct a human assessment, showing that 1) large-scale pretraining largely improves crosstalk generation quality; and 2) even the scripts generated from the best PLM is far from what we expect. We conclude humor generation could be largely improved using large-scaled PLMs, but it is still in its infancy. The data and benchmarking code are publicly available in https://github.com/anonNo2/crosstalk-generation.
39.A Dataset of Argumentative Dialogues on Scientific Papers
Federico Ruggeri,Mohsen Mesgar,Iryna Gurevych
Download URL
https://aclanthology.org/2023.acl-long.425/
abstract
With recent advances in question-answering models, various datasets have been collected to improve and study the effectiveness of these models on scientific texts. Questions and answers in these datasets explore a scientific paper by seeking factual information from the paper's content. However, these datasets do not tackle the argumentative content of scientific papers, which is of huge importance in persuasiveness of a scientific discussion. We introduce ArgSciChat, a dataset of 41 argumentative dialogues between scientists on 20 NLP papers. The unique property of our dataset is that it includes both exploratory and argumentative questions and answers in a dialogue discourse on a scientific paper. Moreover, the size of ArgSciChat demonstrates the difficulties in collecting dialogues for specialized domains. Thus, our dataset is a challenging resource to evaluate dialogue agents in low-resource domains, in which collecting training data is costly. We annotate all sentences of dialogues in ArgSciChat and analyze them extensively. The results confirm that dialogues in ArgSciChat include exploratory and argumentative interactions. Furthermore, we use our dataset to fine-tune and evaluate a pre-trained document-grounded dialogue agent. The agent achieves a low performance on our dataset, motivating a need for dialogue agents with a capability to reason and argue about their answers. We publicly release ArgSciChat.
40.Contextual Knowledge Learning for Dialogue Generation
Wen Zheng,Natasa Milic-Frayling,Ke Zhou
Download URL
https://aclanthology.org/2023.acl-long.433/
abstract
Incorporating conversational context and knowledge into dialogue generation models has been essential for improving the quality of the generated responses. The context, comprising utterances from previous dialogue exchanges, is used as a source of content for response generation and as a means of selecting external knowledge. However, to avoid introducing irrelevant content, it is key to enable fine-grained scoring of context and knowledge. In this paper, we present a novel approach to context and knowledge weighting as an integral part of model training. We guide the model training through a Contextual Knowledge Learning (CKL) process which involves Latent Vectors for context and knowledge, respectively. CKL Latent Vectors capture the relationship between context, knowledge, and responses through weak supervision and enable differential weighting of context utterances and knowledge sentences during the training process. Experiments with two standard datasets and human evaluation demonstrate that CKL leads to a significant improvement compared with the performance of six strong baseline models and shows robustness with regard to reduced sizes of training sets.
41.Speech-Text Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment
Tianshu Yu,Haoyu Gao,Ting-En Lin,Min Yang,Yuchuan Wu,Wentao Ma,Chao Wang,Fei Huang,Yongbin Li
Download URL
https://aclanthology.org/2023.acl-long.438/
abstract
Recently, speech-text pre-training methods have shown remarkable success in many speech and natural language processing tasks. However, most previous pre-trained models are usually tailored for one or two specific tasks, but fail to conquer a wide range of speech-text tasks. In addition, existing speech-text pre-training methods fail to explore the contextual information within a dialogue to enrich utterance representations. In this paper, we propose Speech-text Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first-ever speech-text dialog pre-training model. Concretely, to consider the temporality of speech modality, we design a novel temporal position prediction task to capture the speech-text alignment. This pre-training task aims to predict the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs, we generalize a response selection task from textual dialog pre-training to speech-text dialog pre-training scenarios. Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.
42.MidMed: Towards Mixed-Type Dialogues for Medical Consultation
Xiaoming Shi,Zeming Liu,Chuan Wang,Haitao Leng,Kui Xue,Xiaofan Zhang,Shaoting Zhang
Download URL
https://aclanthology.org/2023.acl-long.453/
abstract
Most medical dialogue systems assume that patients have clear goals (seeking a diagnosis, medicine querying, etc.) before medical consultation. However, in many real situations, due to the lack of medical knowledge, it is usually difficult for patients to determine clear goals with all necessary slots. In this paper, we identify this challenge as how to construct medical consultation dialogue systems to help patients clarify their goals. For further study, we create a novel human-to-human mixed-type medical consultation dialogue corpus, termed MidMed, covering four dialogue types: task-oriented dialogue for diagnosis, recommendation, QA, and chitchat. MidMed covers four departments (otorhinolaryngology, ophthalmology, skin, and digestive system), with 8,309 dialogues. Furthermore, we build benchmarking baselines on MidMed and propose an instruction-guiding medical dialogue generation framework, termed InsMed, to handle mixed-type dialogues. Experimental results show the effectiveness of InsMed.
43.CASE: Aligning Coarse-to-Fine Cognition and Affection for Empathetic Response Generation
Jinfeng Zhou,Chujie Zheng,Bo Wang,Zheng Zhang,Minlie Huang
Download URL
https://aclanthology.org/2023.acl-long.457/
abstract
AbstractEmpathetic conversation is psychologically supposed to be the result of conscious alignment and interaction between the cognition and affection of empathy. However, existing empathetic dialogue models usually consider only the affective aspect or treat cognition and affection in isolation, which limits the capability of empathetic response generation. In this work, we propose the CASE model for empathetic dialogue generation. It first builds upon a commonsense cognition graph and an emotional concept graph and then aligns the user's cognition and affection at both the coarse-grained and fine-grained levels. Through automatic and manual evaluation, we demonstrate that CASE outperforms state-of-the-art baselines of empathetic dialogues and can generate more empathetic and informative responses.
44.RECAP: Retrieval-Enhanced Context-Aware Prefix Encoder for Personalized Dialogue Response Generation
Shuai Liu,Hyundong Cho,Marjorie Freedman,Xuezhe Ma,Jonathan May
Download URL
https://aclanthology.org/2023.acl-long.468/
abstract
AbstractEndowing chatbots with a consistent persona is essential to an engaging conversation, yet it remains an unresolved challenge. In this work, we propose a new retrieval-enhanced approach for personalized response generation. Specifically, we design a hierarchical transformer retriever trained on dialogue domain data to perform personalized retrieval and a context-aware prefix encoder that fuses the retrieved information to the decoder more effectively. Extensive experiments on a real-world dataset demonstrate the effectiveness of our model at generating more fluent and personalized responses. We quantitatively evaluate our model's performance under a suite of human and automatic metrics and find it to be superior compared to state-of-the-art baselines on English Reddit conversations.
45.Dual Class Knowledge Propagation Network for Multi-label Few-shot Intent Detection
Feng Zhang,Wei Chen,Fei Ding,Tengjiao Wang
Download URL
https://aclanthology.org/2023.acl-long.480/
abstract
AbstractMulti-label intent detection aims to assign multiple labels to utterances and attracts increasing attention as a practical task in task-oriented dialogue systems. As dialogue domains change rapidly and new intents emerge fast, the lack of annotated data motivates multi-label few-shot intent detection. However, previous studies are confused by the identical representation of the utterance with multiple labels and overlook the intrinsic intra-class and inter-class interactions. To address these two limitations, we propose a novel dual class knowledge propagation network in this paper. In order to learn well-separated representations for utterances with multiple intents, we first introduce a label-semantic augmentation module incorporating class name information. For better consideration of the inherent intra-class and inter-class relations, an instance-level and a class-level graph neural network are constructed, which not only propagate label information but also propagate feature structure. And we use a simple yet effective method to predict the intent count of each utterance. Extensive experimental results on two multi-label intent datasets have demonstrated that our proposed method outperforms strong baselines by a large margin.
46.The CRINGE Loss: Learning what language not to model
Leonard Adolphs,Tianyu Gao,Jing Xu,Kurt Shuster,Sainbayar Sukhbaatar,Jason Weston
Download URL
https://aclanthology.org/2023.acl-long.493/
abstract
AbstractStandard language model training employs gold human documents or human-human interaction data, and treats all training data as positive examples. Growing evidence shows that even with very large amounts of positive training data, issues remain that can be alleviated with relatively small amounts of negative data (examples of what the model should not do). In this work, we propose a novel procedure to train with such data called the "CRINGE" loss (ContRastive Iterative Negative GEneration). We show the effectiveness of this approach across three different experiments on the tasks of safe generation, contradiction avoidance, and open-domain dialogue. Our models outperform multiple strong baselines and are conceptually simple, easy to train and implement.
47.Modeling User Satisfaction Dynamics in Dialogue via Hawkes Process
Fanghua Ye,Zhiyuan Hu,Emine Yilmaz
Download URL
https://aclanthology.org/2023.acl-long.494/
abstract
AbstractDialogue systems have received increasing attention while automatically evaluating their performance remains challenging. User satisfaction estimation (USE) has been proposed as an alternative. It assumes that the performance of a dialogue system can be measured by user satisfaction and uses an estimator to simulate users. The effectiveness of USE depends heavily on the estimator. Existing estimators independently predict user satisfaction at each turn and ignore satisfaction dynamics across turns within a dialogue. In order to fully simulate users, it is crucial to take satisfaction dynamics into account. To fill this gap, we propose a new estimator ASAP (sAtisfaction eStimation via HAwkes Process) that treats user satisfaction across turns as an event sequence and employs a Hawkes process to effectively model the dynamics in this sequence. Experimental results on four benchmark dialogue datasets demonstrate that ASAP can substantially outperform state-of-the-art baseline estimators.
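To make the Hawkes-process idea in the abstract above more concrete, here is a minimal sketch of the conditional intensity of a univariate Hawkes process with an exponential kernel. This is the textbook form, not ASAP's actual model: the function name `hawkes_intensity`, the parameters `mu`, `alpha`, `beta`, and the example turn indices are all assumptions chosen for illustration.

```python
import math

def hawkes_intensity(t, event_times, mu=0.2, alpha=0.5, beta=1.0):
    """Conditional intensity of a univariate Hawkes process (exponential kernel).

    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))
    mu is the base rate; each past event adds excitation that decays over time.
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i)) for t_i in event_times if t_i < t
    )
    return mu + excitation

# Hypothetical turn indices at which a change in satisfaction was observed.
change_turns = [1.0, 2.0]

baseline = hawkes_intensity(0.5, change_turns)     # before any event: base rate only
right_after = hawkes_intensity(2.1, change_turns)  # shortly after two events
much_later = hawkes_intensity(10.0, change_turns)  # excitation has mostly decayed
```

The point of the sketch is only that earlier events raise the intensity of later ones and that this influence fades, which is the kind of cross-turn dynamics the abstract says independent per-turn estimators ignore.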
48.Pre-training Multi-party Dialogue Models with Latent Discourse Inference
Yiyang Li,Xinting Huang,Wei Bi,Hai Zhao
Download URL
https://aclanthology.org/2023.acl-long.533/
abstract
AbstractMulti-party dialogues are more difficult for models to understand than one-to-one two-party dialogues, since they involve multiple interlocutors, resulting in interweaving reply-to relations and information flows. To step over these obstacles, an effective way is to pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying. However, due to the lack of explicitly annotated discourse labels in multi-party dialogue corpora, previous works fail to scale up the pre-training process by putting aside the unlabeled multi-party conversational data for nothing. To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model by unsupervised latent variable inference methods. Experiments on multiple downstream tasks show that our pre-trained model outperforms strong baselines by large margins and achieves state-of-the-art (SOTA) results, justifying the effectiveness of our method. The official implementation of this paper is available at https://github.com/EricLee8/MPD_EMVI.
49.DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering
Pei Ke,Fei Huang,Fei Mi,Yasheng Wang,Qun Liu,Xiaoyan Zhu,Minlie Huang
Download URL
https://aclanthology.org/2023.acl-long.539/
abstract
AbstractExisting evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability. Specifically, most of the well-performed metrics are required to train on evaluation datasets of specific NLG tasks and evaluation dimensions, which may cause over-fitting to task-specific datasets. Furthermore, existing metrics only provide an evaluation score for each dimension without revealing the evidence to interpret how this score is obtained. To deal with these challenges, we propose a simple yet effective metric called DecompEval. This metric formulates NLG evaluation as an instruction-style question answering task and utilizes instruction-tuned pre-trained language models (PLMs) without training on evaluation datasets, aiming to enhance the generalization ability. To make the evaluation process more interpretable, we decompose our devised instruction-style question about the quality of generated texts into the subquestions that measure the quality of each sentence. The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result. Experimental results show that DecompEval achieves state-of-the-art performance in untrained metrics for evaluating text summarization and dialogue generation, which also exhibits strong dimension-level / task-level generalization ability and interpretability.
50.SimOAP: Improve Coherence and Consistency in Persona-based Dialogue Generation via Over-sampling and Post-evaluation
Junkai Zhou,Liang Pang,Huawei Shen,Xueqi Cheng
Download URL
https://aclanthology.org/2023.acl-long.553/
abstract
AbstractLanguage models trained on large-scale corpora can generate remarkably fluent results in open-domain dialogue. However, for the persona-based dialogue generation task, consistency and coherence are also key factors, which are great challenges for language models. Existing works mainly focus on valuable data filtering, model structure modifying, or objective function designing, while their improvements are limited and hard to generalize to all types of pre-trained language models. However, we find that language models can produce consistent and coherent responses if we consider enough generations. Thus, the problems lay in large-scale response generation and target response selection. In this work, a simple but effective two-stage SimOAP strategy is proposed, i.e., over-sampling and post-evaluation. The over-sampling stage takes large-scale responses from existing trained models efficiently via off-the-shelf distilling and compressing methods, and the post-evaluation stage selects a good response based on multiple well-designed evaluation metrics from large-scale candidates. Experimental results show that the proposed plug-in SimOAP strategy improves the backbone models and outperforms the baseline strategies in both automatic and human evaluations.
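The two-stage strategy described above (over-sample many candidates, then pick one by post-evaluation) can be sketched roughly as follows. Everything here is a toy stand-in: `toy_generate`, the response pool, and the word-overlap scorers `coherence` and `consistency` are hypothetical and are not the models or metrics used in the paper.

```python
import random

PERSONA = {"i", "love", "hiking"}  # hypothetical persona keywords
POOL = [
    "i hate the outdoors",
    "i love hiking on weekends",
    "nice weather today",
]

def toy_generate(context, rng):
    # Stand-in for sampling one response from a trained dialogue model.
    return rng.choice(POOL)

def coherence(context, response):
    # Toy proxy: number of distinct words shared by context and response.
    return len(set(context.split()) & set(response.split()))

def consistency(response):
    # Toy proxy: number of distinct words shared with the persona.
    return len(PERSONA & set(response.split()))

def select_best(context, candidates):
    # Post-evaluation: keep the candidate with the highest combined score.
    return max(candidates, key=lambda r: coherence(context, r) + consistency(r))

def oversample_and_select(context, n_samples=50, seed=0):
    rng = random.Random(seed)
    # Over-sampling: draw many candidate responses for the same context.
    candidates = [toy_generate(context, rng) for _ in range(n_samples)]
    return select_best(context, candidates)

context = "what do you do on weekends"
best_of_pool = select_best(context, POOL)
best_sampled = oversample_and_select(context)
```

The design choice worth noticing is that the generator itself is untouched; all of the improvement comes from sampling more and selecting better, which is why the abstract can describe the strategy as a plug-in.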
51.Improved Instruction Ordering in Recipe-Grounded Conversation
Duong Le,Ruohao Guo,Wei Xu,Alan Ritter
Download URL
https://aclanthology.org/2023.acl-long.561/
abstract
AbstractIn this paper, we study the task of instructional dialogue and focus on the cooking domain. Analyzing the generated output of the GPT-J model, we reveal that the primary challenge for a recipe-grounded dialog system is how to provide the instructions in the correct order. We hypothesize that this is due to the model's lack of understanding of user intent and inability to track the instruction state (i.e., which step was last instructed). Therefore, we propose to explore two auxiliary subtasks, namely User Intent Detection and Instruction State Tracking, to support Response Generation with improved instruction grounding. Experimenting with our newly collected dataset, ChattyChef, shows that incorporating user intent and instruction state information helps the response generation model mitigate the incorrect order issue. Furthermore, to investigate whether ChatGPT has completely solved this task, we analyze its outputs and find that it also makes mistakes (10.7% of the responses), about half of which are out-of-order instructions. We will release ChattyChef to facilitate further research in this area at: https://github.com/octaviaguo/ChattyChef.
52.Dialog-Post: Multi-Level Self-Supervised Objectives and Hierarchical Model for Dialogue Post-Training
Zhenyu Zhang,Lei Shen,Yuming Zhao,Meng Chen,Xiaodong He
Download URL
https://aclanthology.org/2023.acl-long.564/
abstract
AbstractDialogue representation and understanding aim to convert conversational inputs into embeddings and fulfill discriminative tasks. Compared with free-form text, dialogue has two important characteristics, hierarchical semantic structure and multi-facet attributes. Therefore, directly applying the pretrained language models (PLMs) might result in unsatisfactory performance. Recently, several works have focused on dialogue-adaptive post-training (DialPost), which further trains PLMs to fit dialogues. To model dialogues more comprehensively, we propose a DialPost method, Dialog-Post, with multi-level self-supervised objectives and a hierarchical model. These objectives leverage dialogue-specific attributes and use self-supervised signals to fully facilitate the representation and understanding of dialogues. The novel model is a hierarchical segment-wise self-attention network, which contains inner-segment and inter-segment self-attention sub-layers followed by an aggregation and updating module. To evaluate the effectiveness of our methods, we first apply two public datasets for the verification of representation ability. Then we conduct experiments on a newly-labelled dataset that is annotated with 4 dialogue understanding tasks. Experimental results show that our method outperforms existing SOTA models and achieves a 3.3% improvement on average.
53.Language Detoxification with Attribute-Discriminative Latent Space
Jin Myung Kwak,Minseon Kim,Sung Ju Hwang
Download URL
https://aclanthology.org/2023.acl-long.565/
abstract
AbstractTransformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks, but they can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications. To overcome this issue, a few text generation approaches aim to detoxify toxic texts using additional LMs or perturbations. However, previous methods require excessive memory, computations, and time which are serious bottlenecks in their real-world application. To address such limitations, we propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space. Specifically, we project the latent space of an original Transformer LM onto a discriminative latent space that well-separates texts by their attributes using a projection block and an attribute discriminator. This allows the LM to control the text generation to be non-toxic with minimal memory and computation overhead. We validate our model, Attribute-Discriminative Language Model (ADLM) on detoxified language and dialogue generation tasks, on which our method significantly outperforms baselines both in performance and efficiency.
54.A Cognitive Stimulation Dialogue System with Multi-source Knowledge Fusion for Elders with Cognitive Impairment
Jiyue Jiang,Sheng Wang,Qintong Li,Lingpeng Kong,Chuan Wu
Download URL
https://aclanthology.org/2023.acl-long.593/
abstract
AbstractWhen communicating with elders with cognitive impairment, cognitive stimulation (CS) helps to maintain the cognitive health of elders. Data sparsity is the main challenge in building CS-based dialogue systems, particularly in the Chinese language. To fill this gap, we construct a Chinese CS conversation (CSConv) dataset, which contains about 2.6K groups of dialogues with therapy principles and emotional support strategy labels. Making chit chat while providing emotional support is overlooked by the majority of existing cognitive dialogue systems. In this paper, we propose a multi-source knowledge fusion method for CS dialogue (CSD), to generate open-ended responses guided by the therapy principle and emotional support strategy. We first use a progressive mask method based on external knowledge to learn encoders as effective classifiers, which is the prerequisite to predict the therapy principle and emotional support strategy of the target response. Then a decoder interacts with the perceived therapy principle and emotional support strategy to generate responses. Extensive experiments conducted on the CSConv dataset demonstrate the effectiveness of the proposed method, while there is still a large space for improvement compared to human performance.
55.A Synthetic Data Generation Framework for Grounded Dialogues
Jianzhu Bao,Rui Wang,Yasheng Wang,Aixin Sun,Yitong Li,Fei Mi,Ruifeng Xu
Download URL
https://aclanthology.org/2023.acl-long.608/
abstract
AbstractTraining grounded response generation models often requires a large collection of grounded dialogues. However, it is costly to build such dialogues. In this paper, we present a synthetic data generation framework (SynDG) for grounded dialogues. The generation process utilizes large pre-trained language models and freely available knowledge data (e.g., Wikipedia pages, persona profiles, etc.). The key idea of designing SynDG is to consider dialogue flow and coherence in the generation process. Specifically, given knowledge data, we first heuristically determine a dialogue flow, which is a series of knowledge pieces. Then, we employ T5 to incrementally turn the dialogue flow into a dialogue. To ensure coherence of both the dialogue flow and the synthetic dialogue, we design a two-level filtering strategy, at the flow-level and the utterance-level respectively. Experiments on two public benchmarks show that the synthetic grounded dialogue data produced by our framework is able to significantly boost model performance in both full training data and low-resource scenarios.
56.Your spouse needs professional help: Determining the Contextual Appropriateness of Messages through Modeling Social Relationships
David Jurgens,Agrima Seth,Jackson Sargent,Athena Aghighi,Michael Geraci
Download URL
https://aclanthology.org/2023.acl-long.616/
abstract
AbstractUnderstanding interpersonal communication requires, in part, understanding the social context and norms in which a message is said. However, current methods for identifying offensive content in such communication largely operate independent of context, with only a few approaches considering community norms or prior conversation as context. Here, we introduce a new approach to identifying inappropriate communication by explicitly modeling the social relationship between the individuals. We introduce a new dataset of contextually-situated judgments of appropriateness and show that large language models can readily incorporate relationship information to accurately identify appropriateness in a given context. Using data from online conversations and movie dialogues, we provide insight into how the relationships themselves function as implicit norms and quantify the degree to which context-sensitivity is needed in different conversation settings. Further, we also demonstrate that contextual-appropriateness judgments are predictive of other social factors expressed in language such as condescension and politeness.
57.XDailyDialog: A Multilingual Parallel Dialogue Corpus
Zeming Liu,Ping Nie,Jie Cai,Haifeng Wang,Zheng-Yu Niu,Peng Zhang,Mrinmaya Sachan,Kaiping Peng
Download URL
https://aclanthology.org/2023.acl-long.684/
abstract
AbstractHigh-quality datasets are significant to the development of dialogue models. However, most existing datasets for open-domain dialogue modeling are limited to a single language. The absence of multilingual open-domain dialog datasets not only limits the research on multilingual or cross-lingual transfer learning, but also hinders the development of robust open-domain dialog systems that can be deployed in other parts of the world. In this paper, we provide a multilingual parallel open-domain dialog dataset, XDailyDialog, to enable researchers to explore the challenging task of multilingual and cross-lingual open-domain dialog. XDailyDialog includes 13K dialogues aligned across 4 languages (52K dialogues and 410K utterances in total). We then propose a dialog generation model, kNN-Chat, which has a novel kNN-search mechanism to support unified response retrieval for monolingual, multilingual, and cross-lingual dialogue. Experiment results show the effectiveness of this framework. We will make XDailyDialog and kNN-Chat publicly available soon.
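As a rough illustration of what kNN-based response retrieval over a shared embedding space can look like, the sketch below ranks stored responses by cosine similarity to a query vector. It is a generic nearest-neighbour search, not the kNN-Chat mechanism itself: the 2-dimensional vectors, the tiny `datastore`, and the function names are made up for the example, and a real system would obtain embeddings from a multilingual encoder.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_retrieve(query_vec, datastore, k=2):
    # Rank every stored (embedding, response) pair by similarity to the query
    # and return the responses of the k nearest neighbours.
    ranked = sorted(datastore, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [response for _, response in ranked[:k]]

# Hypothetical datastore: parallel greetings sit close together in the
# (made-up) embedding space, regardless of language.
datastore = [
    ([1.0, 0.0], "Hello!"),
    ([0.9, 0.1], "Hallo!"),
    ([0.0, 1.0], "See you tomorrow."),
]

top = knn_retrieve([1.0, 0.05], datastore, k=2)
```

Because retrieval only depends on distances in the embedding space, the same lookup can in principle serve monolingual, multilingual, and cross-lingual queries, which is the "unified" property the abstract highlights.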
58.HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation
Qianyu He,Yikai Zhang,Jiaqing Liang,Yuncheng Huang,Yanghua Xiao,Yunwen Chen
Download URL
https://aclanthology.org/2023.acl-long.702/
abstract
AbstractSimiles play an imperative role in creative writing such as story and dialogue generation. Proper evaluation metrics are like a beacon guiding the research of simile generation (SG). However, it remains under-explored as to what criteria should be considered, how to quantify each criterion into metrics, and whether the metrics are effective for comprehensive, efficient, and reliable SG evaluation. To address the issues, we establish HAUSER, a holistic and automatic evaluation system for the SG task, which consists of five criteria from three perspectives and automatic metrics for each criterion. Through extensive experiments, we verify that our metrics are significantly more correlated with human ratings from each perspective compared with prior automatic metrics. Resources of HAUSER are publicly available at https://github.com/Abbey4799/HAUSER.
59.Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization
Artidoro Pagnoni,Alex Fabbri,Wojciech Kryscinski,Chien-Sheng Wu
Download URL
https://aclanthology.org/2023.acl-long.713/
abstract
AbstractIn long document controllable summarization, where labeled data is scarce, pretrained models struggle to adapt to the task and effectively respond to user queries. In this paper, we introduce Socratic pretraining, a question-driven, unsupervised pretraining objective specifically designed to improve controllability in summarization tasks. By training a model to generate and answer relevant questions in a given context, Socratic pretraining enables the model to more effectively adhere to user-provided queries and identify relevant content to be summarized. We demonstrate the effectiveness of this approach through extensive experimentation on two summarization domains, short stories and dialogue, and multiple control strategies: keywords, questions, and factoid QA pairs. Our pretraining method relies only on unlabeled documents and a question generation system and outperforms pre-finetuning approaches that use additional supervised data. Furthermore, our results show that Socratic pretraining cuts task-specific labeled data requirements in half, is more faithful to user-provided queries, and achieves state-of-the-art performance on QMSum and SQuALITY.
60.RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue
Zhengliang Shi,Weiwei Sun,Shuo Zhang,Zhen Zhang,Pengjie Ren,Zhaochun Ren
Download URL
https://aclanthology.org/2023.acl-long.719/
abstract
AbstractEvaluating open-domain dialogue systems is challenging for reasons such as the one-to-many problem, i.e., many appropriate responses other than just the golden response. As of now, automatic evaluation methods need better consistency with humans, while reliable human evaluation can be time- and cost-intensive. To this end, we propose the Reference-Assisted Dialogue Evaluation (RADE) approach under the multi-task learning framework, which leverages the pre-created utterance as reference other than the gold response to relieve the one-to-many problem. Specifically, RADE explicitly compares reference and the candidate response to predict their overall scores. Moreover, an auxiliary response generation task enhances prediction via a shared encoder. To support RADE, we extend three datasets with additional rated responses other than just a golden response by human annotation. Experiments on our three datasets and two existing benchmarks demonstrate the effectiveness of our method, where Pearson, Spearman, and Kendall correlations with human evaluation outperform state-of-the-art baselines.
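For readers unfamiliar with the three correlation measures named at the end of the abstract above, here is a small pure-Python sketch of how each compares metric scores against human ratings. The `human` and `metric` lists are invented numbers for illustration only, and the Kendall variant shown is the simple tau-a (no tie correction); in practice a statistics library would normally be used instead.

```python
import math

def pearson(x, y):
    # Linear correlation between two equal-length score lists.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    # Spearman correlation is Pearson correlation computed on ranks.
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    # Fraction of concordant minus discordant pairs (no tie correction).
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Made-up example: human ratings and an automatic metric's scores
# for the same five responses.
human = [1, 2, 3, 4, 5]
metric = [0.1, 0.4, 0.35, 0.8, 0.9]

r = pearson(human, metric)
rho = spearman(human, metric)
tau = kendall_tau_a(human, metric)
```

Pearson rewards scores that are linearly related to the human ratings, while Spearman and Kendall only care about whether the metric orders the responses the same way humans do.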
61.Extrinsic Evaluation of Machine Translation Metrics
Nikita Moghe,Tom Sherborne,Mark Steedman,Alexandra Birch
Download URL
https://aclanthology.org/2023.acl-long.730/
abstract
AbstractAutomatic machine translation (MT) metrics are widely used to distinguish the quality of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the segment-level quality by correlating metrics with how useful the translations are for downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model and a translation model. We calculate the correlation between the metric's ability to predict a good/bad translation with the success/failure on the final task for the machine translated test sentences. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores provided by neural metrics are not interpretable, in large part due to having undefined ranges. We synthesise our analysis into recommendations for future MT metrics to produce labels rather than scores for more informative interaction between machine translation and multilingual language understanding.
62.A Cross-Modality Context Fusion and Semantic Refinement Network for Emotion Recognition in Conversation
Xiaoheng Zhang,Yang Li
Download URL
https://aclanthology.org/2023.acl-long.732/
abstract
AbstractEmotion recognition in conversation (ERC) has attracted enormous attention for its applications in empathetic dialogue systems. However, most previous researches simply concatenate multimodal representations, leading to an accumulation of redundant information and a limited context interaction between modalities. Furthermore, they only consider simple contextual features ignoring semantic clues, resulting in an insufficient capture of the semantic coherence and consistency in conversations. To address these limitations, we propose a cross-modality context fusion and semantic refinement network (CMCF-SRNet). Specifically, we first design a cross-modal locality-constrained transformer to explore the multimodal interaction. Second, we investigate a graph-based semantic refinement transformer, which solves the limitation of insufficient semantic relationship information between utterances. Extensive experiments on two public benchmark datasets show the effectiveness of our proposed method compared with other state-of-the-art methods, indicating its potential application in emotion recognition. Our model will be available at https://github.com/zxiaohen/CMCF-SRNet.
63.PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
Yunshui Li,Binyuan Hui,ZhiChao Yin,Min Yang,Fei Huang,Yongbin Li
Download URL
https://aclanthology.org/2023.acl-long.749/
abstract
AbstractPerceiving multi-modal information and fulfilling dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, due to the limited availability of multi-modal dialogue data, there is still scarce research on multi-modal dialogue pre-training. Yet another intriguing challenge emerges from the encompassing nature of multi-modal dialogue, which involves various modalities and tasks. Moreover, new forms of tasks may arise at unpredictable points in the future. Hence, it is essential for designed multi-modal dialogue models to possess sufficient flexibility to adapt to such scenarios. This paper proposes PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It utilizes a combination of several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue and extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method where old experts from the past can assist new experts, facilitating the expansion of their capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialog benchmarks.
64.Learning New Skills after Deployment: Improving open-domain internet-driven dialogue with human feedback
Jing Xu,Megan Ung,Mojtaba Komeili,Kushal Arora,Y-Lan Boureau,Jason Weston
Download URL
https://aclanthology.org/2023.acl-long.758/
abstract
AbstractFrozen models trained to mimic static datasets can never improve their performance. Models that can employ internet-retrieval for up-to-date information and obtain feedback from humans during deployment provide the promise of both adapting to new information, and improving their performance. In this work we study how to improve internet-driven conversational skills in such a learning framework. We collect deployment data, which we make publicly available, of human interactions, and collect various types of human feedback, including binary quality measurements, free-form text feedback, and fine-grained reasons for failure. We then study various algorithms for improving from such feedback, including standard supervised learning, rejection sampling, model-guiding and reward-based learning, in order to make recommendations on which type of feedback and algorithms work best. We find the recently introduced DIRECTOR model (Arora et al., 2022) shows significant improvements over other existing approaches.
65.On the Compositional Generalization in Versatile Open-domain Dialogue
Tingchen Fu,Xueliang Zhao,Lemao Liu,Rui Yan
Download URL
https://aclanthology.org/2023.acl-long.760/
abstract
AbstractPrevious research has demonstrated the potential of multi-task learning to foster a conversational agent's ability to acquire a variety of skills. However, these approaches either suffer from interference among different datasets (also known as negative transfer), or fail to effectively reuse knowledge and skills learned from other datasets. In contrast to previous works, we develop a sparsely activated modular network: (1) We propose a well-rounded set of operators and instantiate each operator with an independent module; (2) We formulate dialogue generation as the execution of a generated programme which recursively composes and assembles modules. Extensive experiments on 9 datasets verify the efficacy of our methods through automatic evaluation and human evaluation. Notably, our model outperforms state-of-the-art supervised approaches on 4 datasets with only 10% training data thanks to the modular architecture and multi-task learning.
66.Dialogue Summarization with Static-Dynamic Structure Fusion Graph
Shen Gao,Xin Cheng,Mingzhe Li,Xiuying Chen,Jinpeng Li,Dongyan Zhao,Rui Yan
Download URL
https://aclanthology.org/2023.acl-long.775/
abstract
AbstractDialogue, the most fundamental and specially privileged arena of language, gains increasing ubiquity across the Web in recent years. Quickly going through the long dialogue context and capturing salient information scattered over the whole dialogue session benefit users in many real-world Web applications such as email thread summarization and meeting minutes draft. Dialogue summarization is a challenging task in that dialogue has dynamic interaction nature and presumably inconsistent information flow among various speakers. Many researchers address this task by modeling dialogue with pre-computed static graph structure using external linguistic toolkits. However, such methods heavily depend on the reliability of external tools and the static graph construction is disjoint with the graph representation learning phase, which means the graph cannot be dynamically adapted for the downstream summarization task. In this paper, we propose a Static-Dynamic graph-based Dialogue Summarization model (SDDS), which fuses prior knowledge from human expertise and adaptively learns the graph structure in an end-to-end learning fashion. To verify the effectiveness of SDDS, we conduct experiments on three benchmark datasets (SAMSum, MediaSum, and DialogSum) and the results verify the superiority of SDDS.
67.Reference Matters: Benchmarking Factual Error Correction for Dialogue Summarization with Fine-grained Evaluation Framework
Mingqi Gao,Xiaojun Wan,Jia Su,Zhefeng Wang,Baoxing Huai
Download URL
https://aclanthology.org/2023.acl-long.779/
abstract
Factuality is important to dialogue summarization. Factual error correction (FEC) of model-generated summaries is one way to improve factuality. Current FEC evaluation that relies on factuality metrics is not reliable and detailed enough. To address this problem, we are the first to manually annotate a FEC dataset for dialogue summarization containing 4000 items and propose FERRANTI, a fine-grained evaluation framework based on reference correction that automatically evaluates the performance of FEC models on different error categories. Using this evaluation framework, we conduct sufficient experiments with FEC approaches under a variety of settings and find the best training modes and significant differences in the performance of the existing approaches on different factual error categories.
68.Multimodal Persona Based Generation of Comic Dialogs
Harsh Agrawal,Aditya Mishra,Manish Gupta,Mausam
Download URL
https://aclanthology.org/2023.acl-long.791/
abstract
We focus on the novel problem of persona based dialogue generation for comic strips. Dialogs in comic strips are a unique and unexplored area where every strip contains utterances from various characters with each one building upon the previous utterances and the associated visual scene. Previous works like DialoGPT, PersonaGPT and other dialog generation models encode two-party dialogues and do not account for the visual information. To the best of our knowledge we are the first to propose the paradigm of multimodal persona based dialogue generation. We contribute a novel dataset, ComSet, consisting of 54K strips, harvested from 13 popular comics available online. Further, we propose a multimodal persona-based architecture, MPDialog, to generate dialogues for the next panel in the strip which decreases the perplexity score by ~10 points over strong dialogue generation baseline models. We demonstrate that there is still ample opportunity for improvement, highlighting the importance of building stronger dialogue systems that are able to generate persona-consistent dialogues and understand the context through various modalities.
69.Seen to Unseen: Exploring Compositional Generalization of Multi-Attribute Controllable Dialogue Generation
Weihao Zeng,Lulu Zhao,Keqing He,Ruotong Geng,Jingang Wang,Wei Wu,Weiran Xu
Download URL
https://aclanthology.org/2023.acl-long.793/
abstract
Existing controllable dialogue generation work focuses on the single-attribute control and lacks generalization capability to out-of-distribution multiple attribute combinations. In this paper, we explore the compositional generalization for multi-attribute controllable dialogue generation where a model can learn from seen attribute values and generalize to unseen combinations. We propose a prompt-based disentangled controllable dialogue generation model, DCG. It learns attribute concept composition by generating attribute-oriented prompt vectors and uses a disentanglement loss to disentangle different attributes for better generalization. Besides, we design a unified reference-free evaluation framework for multiple attributes with different levels of granularities. Experiment results on two benchmarks prove the effectiveness of our method and the evaluation metric.
70.Towards Understanding Omission in Dialogue Summarization
Yicheng Zou,Kaitao Song,Xu Tan,Zhongkai Fu,Qi Zhang,Dongsheng Li,Tao Gui
Download URL
https://aclanthology.org/2023.acl-long.798/
abstract
Dialogue summarization aims to condense the lengthy dialogue into a concise summary, and has recently achieved significant progress. However, the result of existing methods is still far from satisfactory. Previous works indicated that omission is a major factor in affecting the quality of summarization, but few of them have further explored the omission problem, such as how omission affects summarization results and how to detect omission, which is critical for reducing omission and improving summarization quality. Moreover, analyzing and detecting omission relies on summarization datasets with omission labels (i.e., which dialogue utterances are omitted in the summarization), which are not available in the current literature. In this paper, we propose the OLDS dataset, which provides high-quality omission labels for dialogue summarization. By analyzing this dataset, we find that a large improvement in summarization quality can be achieved by providing ground-truth omission labels for the summarization model to recover omission information, which demonstrates the importance of omission detection for omission mitigation in dialogue summarization. Therefore, we formulate an omission detection task and demonstrate our proposed dataset can support the training and evaluation of this task well. We also call for research action on omission detection based on our proposed datasets. Our dataset and codes are publicly available.
71.Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems
Sarah E. Finch,James D. Finch,Jinho D. Choi
Download URL
https://aclanthology.org/2023.acl-long.839/
abstract
Despite tremendous advancements in dialogue systems, stable evaluation still requires human judgments producing notoriously high-variance metrics due to their inherent subjectivity. Moreover, methods and labels in dialogue evaluation are not fully standardized, especially for open-domain chats, with a lack of work to compare and assess the validity of those approaches. The use of inconsistent evaluation can misinform the performance of a dialogue system, which becomes a major hurdle to enhance it. Thus, a dimensional evaluation of chat-oriented open-domain dialogue systems that reliably measures several aspects of dialogue capabilities is desired. This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors. Our method is used to evaluate four state-of-the-art open-domain dialogue systems and compared with existing approaches. The analysis demonstrates that our behavior method is more suitable than alternative Likert-style or comparative approaches for dimensional evaluation of these systems.
72.Bridging The Gap: Entailment Fused-T5 for Open-retrieval Conversational Machine Reading Comprehension
Xiao Zhang,Heyan Huang,Zewen Chi,Xian-Ling Mao
Download URL
https://aclanthology.org/2023.acl-long.857/
abstract
Open-retrieval conversational machine reading comprehension (OCMRC) simulates real-life conversational interaction scenes. Machines are required to make a decision of "Yes/No/Inquire" or generate a follow-up question when the decision is "Inquire" based on retrieved rule texts, user scenario, user question and dialogue history. Recent studies try to reduce the information gap between decision-making and question generation, in order to improve the performance of generation. However, the information gap still persists because these methods are still limited in pipeline framework, where decision-making and question generation are performed separately, making it hard to share the entailment reasoning used in decision-making across all stages. To tackle the above problem, we propose a novel one-stage end-to-end framework, called Entailment Fused-T5 (EFT), to bridge the information gap between decision-making and question generation in a global understanding manner. The extensive experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on the OR-ShARC benchmark. Our model and code are publicly available at an anonymous link.
73.LiveChat: A Large-Scale Personalized Dialogue Dataset Automatically Constructed from Live Streaming
Jingsheng Gao,Yixin Lian,Ziyi Zhou,Yuzhuo Fu,Baoyuan Wang
Download URL
https://aclanthology.org/2023.acl-long.858/
abstract
Open-domain dialogue systems have made promising progress in recent years. While the state-of-the-art dialogue agents are built upon large-scale social media data and large pre-trained models, there is no guarantee these agents could also perform well in fast-growing scenarios, such as live streaming, due to the bounded transferability of pre-trained models and biased distributions of public datasets from Reddit and Weibo, etc. To improve the essential capability of responding and establish a benchmark in the live open-domain scenario, we introduce the LiveChat dataset, composed of 1.33 million real-life Chinese dialogues with almost 3800 average sessions across 351 personas and fine-grained profiles for each persona. LiveChat is automatically constructed by processing numerous live videos on the Internet and naturally falls within the scope of multi-party conversations, where the issues of Who says What to Whom should be considered. Therefore, we target two critical tasks of response modeling and addressee recognition and propose retrieval-based baselines grounded on advanced techniques. Experimental results have validated the positive effects of leveraging persona profiles and larger average sessions per persona. In addition, we also benchmark the transferability of advanced generation-based models on LiveChat and pose some future directions for current challenges.
74.FactKG: Fact Verification via Reasoning on Knowledge Graphs
Jiho Kim,Sungjin Park,Yeonsu Kwon,Yohan Jo,James Thorne,Edward Choi
Download URL
https://aclanthology.org/2023.acl-long.895/
abstract
In real world applications, knowledge graphs (KG) are widely used in various domains (e.g. medical applications and dialogue agents). However, for fact verification, KGs have not been adequately utilized as a knowledge source. KGs can be a valuable knowledge source in fact verification due to their reliability and broad applicability. A KG consists of nodes and edges which makes it clear how concepts are linked together, allowing machines to reason over chains of topics. However, there are many challenges in understanding how these machine-readable concepts map to information in text. To enable the community to better use KGs, we introduce a new dataset, FactKG: Fact Verification via Reasoning on Knowledge Graphs. It consists of 108k natural language claims with five types of reasoning: One-hop, Conjunction, Existence, Multi-hop, and Negation. Furthermore, FactKG contains various linguistic patterns, including colloquial style claims as well as written style claims to increase practicality. Lastly, we develop a baseline approach and analyze FactKG over these reasoning types. We believe FactKG can advance both reliability and practicality in KG-based fact verification.
75.Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment
Roni Rabin,Alexandre Djerbetian,Roee Engelberg,Lidan Hackmon,Gal Elidan,Reut Tsarfaty,Amir Globerson
Download URL
https://aclanthology.org/2023.acl-short.20/
abstract
Human communication often involves information gaps between the interlocutors. For example, in an educational dialogue a student often provides an answer that is incomplete, and there is a gap between this answer and the perfect one expected by the teacher. Successful dialogue then hinges on the teacher asking about this gap in an effective manner, thus creating a rich and interactive educational experience. We focus on the problem of generating such gap-focused questions (GFQs) automatically. We define the task, highlight key desired aspects of a good GFQ, and propose a model that satisfies these. Finally, we provide an evaluation by human annotators of our generated questions compared against human generated ones, demonstrating competitive performance.
76.With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness
Julius Steen,Juri Opitz,Anette Frank,Katja Markert
Download URL
https://aclanthology.org/2023.acl-short.79/
abstract
Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metrics, NLI models seem attractive, since they solve a strongly related task that comes with a wealth of prior research and data. But recent research suggests that NLI models require costly additional machinery to perform reliably across datasets, e.g., by running inference on a cartesian product of input and generated sentences, or supporting them with a question-generation/answering step. In this work we show that pure NLI models can outperform more complex metrics when combining task-adaptive data augmentation with robust inference procedures. We propose: (1) Augmenting NLI training data to adapt NL inferences to the specificities of faithfulness prediction in dialogue; (2) Making use of both entailment and contradiction probabilities in NLI, and (3) Using Monte-Carlo dropout during inference. Applied to the TRUE benchmark, which combines faithfulness datasets across diverse domains and tasks, our approach strongly improves a vanilla NLI model and significantly outperforms previous work, while showing favourable computational cost.
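As a rough illustration of points (2) and (3) in the abstract above, the sketch below averages the difference between entailment and contradiction probabilities over several stochastic forward passes with dropout left on. The toy linear classifier, its weights, the feature vector, and all function names are assumptions made for illustration; they stand in for a real NLI model and are not the authors' implementation. Combining the two probabilities as a simple difference is likewise an assumption here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_nli_forward(features, weights, drop_p, rng):
    """One stochastic forward pass of a hypothetical linear NLI classifier.

    Dropout stays ON at inference time (the core idea of Monte-Carlo dropout).
    Returns [p_entail, p_neutral, p_contradict].
    """
    kept = [0.0 if rng.random() < drop_p else f / (1.0 - drop_p) for f in features]
    logits = [sum(w * x for w, x in zip(row, kept)) for row in weights]
    return softmax(logits)


def mc_dropout_faithfulness(features, weights, passes=20, drop_p=0.1, seed=0):
    """Average (p_entail - p_contradict) over several dropout passes."""
    rng = random.Random(seed)
    scores = []
    for _ in range(passes):
        p_entail, _p_neutral, p_contra = toy_nli_forward(features, weights, drop_p, rng)
        scores.append(p_entail - p_contra)
    return sum(scores) / len(scores)


# Toy inputs: rows of `weights` correspond to entail / neutral / contradict.
features = [1.0, 0.5, 0.2]
weights = [[2.0, 1.0, 0.5], [0.1, 0.1, 0.1], [-1.0, -0.5, 0.0]]
score = mc_dropout_faithfulness(features, weights)
print(f"faithfulness score: {score:.3f}")  # lies in [-1, 1]; higher means more faithful
```

With a real model, the toy forward pass would be replaced by a call to the NLI network with its dropout layers kept in training mode, and the score would be thresholded or ranked as the downstream metric requires.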
77.Controllable Mixed-Initiative Dialogue Generation through Prompting
Maximillian Chen,Xiao Yu,Weiyan Shi,Urvi Awasthi,Zhou Yu
Download URL
https://aclanthology.org/2023.acl-short.82/
abstract
Mixed-initiative dialogue tasks involve repeated exchanges of information and conversational control. Conversational agents gain control by generating responses that follow particular dialogue intents or strategies, prescribed by a policy planner. The standard approach has been fine-tuning pre-trained language models to perform generation conditioned on these intents. However, these supervised generation models are limited by the cost and quality of data annotation. We instead prompt large language models as a drop-in replacement to fine-tuning on conditional generation. We formalize prompt construction for controllable mixed-initiative dialogue. Our findings show improvements over fine-tuning and ground truth responses according to human evaluation and automatic metrics for two tasks: PersuasionForGood and Emotional Support Conversations.
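To make the idea of strategy-conditioned prompting more concrete, here is a minimal sketch of how a prompt might be assembled from a task description, the dialogue history, and a verbalized strategy chosen by a policy planner. The template layout, the role names, the bracketed instruction format, and the function name are all hypothetical; the paper's actual prompts are not reproduced here.

```python
def build_strategy_prompt(task_description, history, strategy_instruction, speaker):
    """Assemble a plain-text prompt asking an LLM for one strategy-conditioned turn.

    `history` is a list of (role, utterance) pairs in chronological order.
    """
    lines = [task_description, ""]
    for role, utterance in history:
        lines.append(f"{role}: {utterance}")
    # The strategy picked by the policy planner is verbalized as an instruction.
    lines.append(f"[{strategy_instruction}]")
    # Ending on the speaker tag lets the model complete the next utterance.
    lines.append(f"{speaker}:")
    return "\n".join(lines)


prompt = build_strategy_prompt(
    task_description="The following is a conversation between a Helper and a Seeker.",
    history=[
        ("Seeker", "I failed my exam and feel awful."),
        ("Helper", "I'm sorry to hear that. What happened?"),
        ("Seeker", "I studied for weeks but froze during the test."),
    ],
    strategy_instruction="The Helper reflects the Seeker's feelings back to them",
    speaker="Helper",
)
print(prompt)
```

The resulting string would then be sent to whichever language model is in use; swapping the strategy instruction changes the intent of the generated turn without any fine-tuning.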
78.Listener Model for the PhotoBook Referential Game with CLIPScores as Implicit Reference Chain
Shih-Lun Wu,Yi-Hui Chou,Liangze Li
Download URL
https://aclanthology.org/2023.acl-short.121/
abstract
PhotoBook is a collaborative dialogue game where two players receive private, partially-overlapping sets of images and resolve which images they have in common. It presents machines with a great challenge to learn how people build common ground around multimodal context to communicate effectively. Methods developed in the literature, however, cannot be deployed to real gameplay since they only tackle some subtasks of the game, and they require additional reference chains inputs, whose extraction process is imperfect. Therefore, we propose a reference chain-free listener model that directly addresses the game's predictive task, i.e., deciding whether an image is shared with partner. Our DeBERTa-based listener model reads the full dialogue, and utilizes CLIPScore features to assess utterance-image relevance. We achieve >77% accuracy on unseen sets of images/game themes, outperforming baseline by >17 points.
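For readers unfamiliar with CLIPScore, the sketch below computes it as defined by Hessel et al. (2021): a rescaled, clipped cosine similarity, w * max(cos(c, v), 0) with w = 2.5, between a text embedding c and an image embedding v. The short vectors here are toy stand-ins for real CLIP embeddings, and the helper that scores every utterance against one image is a hypothetical illustration of "utterance-image relevance" features, not the authors' feature pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clipscore(text_emb, image_emb, w=2.5):
    """CLIPScore: w * max(cos(text, image), 0)."""
    return w * max(cosine(text_emb, image_emb), 0.0)


def utterance_image_features(utterance_embs, image_emb):
    """One relevance score per utterance for a single candidate image."""
    return [clipscore(u, image_emb) for u in utterance_embs]


# Toy embeddings (assumed); real ones would come from CLIP's text and image encoders.
image = [0.2, 0.9, 0.1]
utterances = [[0.1, 0.8, 0.2], [0.9, -0.1, 0.0], [-0.2, -0.9, -0.1]]
print(utterance_image_features(utterances, image))
```

Because negative similarities are clipped to zero, utterances that point away from the image contribute nothing, which keeps the feature non-negative and bounded by w.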
79.Towards Fewer Hallucinations in Knowledge-Grounded Dialogue Generation via Augmentative and Contrastive Knowledge-Dialogue
Bin Sun,Yitong Li,Fei Mi,Fanhu Bie,Yiwei Li,Kan Li
Download URL
https://aclanthology.org/2023.acl-short.148/
abstract
Existing knowledge-grounded open-domain dialogue generation models often face the hallucination problem, i.e. the dialogue generative model persists with inappropriate knowledge and generates responses that are inconsistent with the facts. We argue that this problem mainly stems from the polarized optimization objectives and weak knowledge generation ability. To mitigate the hallucination, we take inspiration from human communication, in which people reply with euphemistic responses to unclear or unrecognizable knowledge, and propose an Augmentative and Contrastive Knowledge Dialogue Expansion Framework (ACK-DEF). ACK-DEF constructs the augmentative and contrastive knowledge dialogue samples, which consist of knowledge with different degrees of errors and manually designed responses, to expand the original training set and smooth the polarized optimization objective, enabling models to generate the ground truth with or without gold knowledge. Beyond the knowledge, ACK-DEF also provides manually designed tactful responses corresponding to the incompletely correct knowledge. Experimental results on the Wizard of Wikipedia dataset show that employing ACK-DEF is effective in alleviating the hallucination problem.