This blog post summarizes speech-related research published at ACL 2023. In total, 54 papers at ACL 2023 relate to speech. Most of the authors are affiliated with top research institutes (Google Research, DeepMind, Meta FAIR) and universities (Stanford, Berkeley, MIT, CMU, and others).
Navigation
- 1.Rule By Example: Harnessing Logical Rules for Explainable Hate Speech Detection
- 2.Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation
- 3.A Theory of Unsupervised Speech Recognition
- 4.Why Aren't We NER Yet? Artifacts of ASR Errors in Named Entity Recognition in Spontaneous Speech Transcripts
- 5.BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
- 6.Helping a Friend or Supporting a Cause? Disentangling Active and Passive Cosponsorship in the U.S. Congress
- 7.WACO: Word-Aligned Contrastive Learning for Speech Translation
- 8.Back Translation for Speech-to-text Translation Without Transcripts
- 9.What the DAAM: Interpreting Stable Diffusion Using Cross Attention
- 10.Counterspeeches up my sleeve! Intent Distribution Learning and Persistent Fusion for Intent-Conditioned Counterspeech Generation
- 11.DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation
- 12.APOLLO: A Simple Approach for Adaptive Pretraining of Language Models for Logical Reasoning
- 13.OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
- 14.Prompting Language Models for Linguistic Structure
- 15.Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis
- 16.SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created through Human-Machine Collaboration
- 17.Automatic Annotation of Direct Speech in Written French Narratives
- 18.CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation
- 19.Speech-Text Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment
- 20.READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises
- 21.AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
- 22.SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks
- 23.BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric
- 24.NLPositionality: Characterizing Design Biases of Datasets and Models
- 25.CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training
- 26.How poor is the stimulus? Evaluating hierarchical generalization in neural networks trained on child-directed speech
- 27.Simple and Effective Unsupervised Speech Translation
- 28.MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages
- 29.Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning
- 30.How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases
- 31.Introducing Semantics into Speech Encoders
- 32.MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
- 33.From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
- 34.SLABERT Talk Pretty One Day: Modeling Second Language Acquisition with BERT
- 35.Towards Domain-Agnostic and Domain-Adaptive Dementia Detection from Spoken Language
- 36.Transforming Visual Scene Graphs to Image Captions
- 37.Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks
- 38.Language of Bargaining
- 39.CTC-based Non-autoregressive Speech Translation
- 40.Attention as a Guide for Simultaneous Speech Translation
- 41.MeetingQA: Extractive Question-Answering on Meeting Transcripts
- 42.From Dogwhistles to Bullhorns: Unveiling Coded Rhetoric with Language Models
- 43.Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition
- 44.Toward Interactive Dictation
- 45.UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
- 46.Understanding and Bridging the Modality Gap for Speech Translation
- 47.SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
- 48.A Weakly Supervised Classifier and Dataset of White Supremacist Language
- 49.An (unhelpful) guide to selecting the best ASR architecture for your under-resourced language
- 50.Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation
- 51.MOSPC: MOS Prediction Based on Pairwise Comparison
- 52.When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants
- 53.STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions
- 54.A Simple Concatenation can Effectively Improve Speech Translation
Paper List
1.Rule By Example: Harnessing Logical Rules for Explainable Hate Speech Detection
Christopher Clarke,Matthew Hall,Gaurav Mittal,Ye Yu,Sandra Sajeev,Jason Mars,Mei Chen
Download URL
https://aclanthology.org/2023.acl-long.22/
abstract
Classic approaches to content moderation typically apply a rule-based heuristic approach to flag content. While rules are easily customizable and intuitive for humans to interpret, they are inherently fragile and lack the flexibility or robustness needed to moderate the vast amount of undesirable content found online today. Recent advances in deep learning have demonstrated the promise of using highly effective deep neural models to overcome these challenges. However, despite the improved performance, these data-driven models lack transparency and explainability, often leading to mistrust from everyday users and a lack of adoption by many platforms. In this paper, we present Rule By Example (RBE): a novel exemplar-based contrastive learning approach for learning from logical rules for the task of textual content moderation. RBE is capable of providing rule-grounded predictions, allowing for more explainable and customizable predictions compared to typical deep learning-based approaches. We demonstrate that our approach is capable of learning rich rule embedding representations using only a few data examples. Experimental results on 3 popular hate speech classification datasets show that RBE is able to outperform state-of-the-art deep learning classifiers as well as the use of rules in both supervised and unsupervised settings while providing explainable model predictions via rule-grounding.
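The rule-grounding idea can be illustrated with a toy sketch (not the authors' RBE implementation): encode a few exemplars per rule into a prototype embedding, then return the nearest rule as the grounding for a prediction. The `encode` function, rule names, and exemplar texts below are all placeholders.

```python
import numpy as np

def encode(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a sentence encoder (hashing bag-of-words).
    In practice this would be a trained neural text encoder."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Hypothetical rules, each grounded by a few exemplar texts.
rules = {
    "slur_against_group": ["<exemplar hateful text 1>", "<exemplar hateful text 2>"],
    "threat_of_violence": ["<exemplar threatening text 1>"],
}

# Rule prototype = mean of its exemplar embeddings.
prototypes = {name: np.mean([encode(t) for t in exemplars], axis=0)
              for name, exemplars in rules.items()}

def classify(text: str, threshold: float = 0.3):
    """Return the closest rule if similarity exceeds a threshold, else 'benign'.
    The prediction is grounded in the rule whose exemplars it matched."""
    emb = encode(text)
    best_rule, best_sim = max(
        ((name, float(emb @ proto)) for name, proto in prototypes.items()),
        key=lambda x: x[1])
    return (best_rule, best_sim) if best_sim >= threshold else ("benign", best_sim)

print(classify("some user comment to moderate"))
```

In RBE itself the encoders are trained contrastively so that covered inputs land near their rule embeddings; the sketch only shows the prototype-matching inference step.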
2.Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation
Martijn Bartelds,Nay San,Bradley McDonnell,Dan Jurafsky,Martijn Wieling
Download URL
https://aclanthology.org/2023.acl-long.42/
abstract
The performance of automatic speech recognition (ASR) systems has advanced substantially in recent years, particularly for languages for which a large amount of transcribed speech is available. Unfortunately, for low-resource languages, such as minority languages, regional languages or dialects, ASR performance generally remains much lower. In this study, we investigate whether data augmentation techniques could help improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal). For all four languages, we examine the use of self-training, where an ASR system trained with the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system. For Gronings, for which there was a pre-existing text-to-speech (TTS) system available, we also examined the use of TTS to generate ASR training data from text-only sources. We find that using a self-training approach consistently yields improved performance (a relative WER reduction up to 20.5% compared to using an ASR system trained on 24 minutes of manually transcribed speech). The performance gain from TTS augmentation for Gronings was even stronger (up to 25.5% relative reduction in WER compared to a system based on 24 minutes of manually transcribed speech). In sum, our results show the benefit of using self-training or (if possible) TTS-generated data as an efficient solution to overcome the limitations of data availability for resource-scarce languages in order to improve ASR performance.
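The self-training recipe described above reduces to a small loop: pseudo-label the unlabeled audio with a seed model, then retrain on the union of real and pseudo-labeled data. A minimal, framework-agnostic sketch follows; `train_asr` and the returned transcription callable are hypothetical stand-ins for whatever ASR toolkit is actually used.

```python
from typing import Callable, List, Tuple

AudioPath = str
Transcript = str

def self_train_asr(
    labeled: List[Tuple[AudioPath, Transcript]],
    unlabeled: List[AudioPath],
    train_asr: Callable[[List[Tuple[AudioPath, Transcript]]], Callable[[AudioPath], Transcript]],
    rounds: int = 1,
):
    """Generic self-training loop: the seed model pseudo-labels unlabeled audio,
    and a new model is trained on the real + pseudo-labeled data."""
    model = train_asr(labeled)                              # seed model on the small labeled set
    for _ in range(rounds):
        pseudo = [(wav, model(wav)) for wav in unlabeled]   # generate pseudo-transcripts
        model = train_asr(labeled + pseudo)                 # retrain on the combined data
    return model
```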
3.A Theory of Unsupervised Speech Recognition
Liming Wang,Mark Hasegawa-Johnson,Chang Yoo
Download URL
https://aclanthology.org/2023.acl-long.67/
abstract
Unsupervised speech recognition (ASR-U) is the problem of learning automatic speech recognition (ASR) systems from unpaired speech-only and text-only corpora. While various algorithms exist to solve this problem, a theoretical framework is missing to study their properties and address such issues as sensitivity to hyperparameters and training instability. In this paper, we propose a general theoretical framework to study the properties of ASR-U systems based on random matrix theory and the theory of neural tangent kernels. Such a framework allows us to prove various learnability conditions and sample complexity bounds of ASR-U. Extensive ASR-U experiments on synthetic languages with three classes of transition graphs provide strong empirical evidence for our theory (code available at https://github.com/cactuswiththoughts/UnsupASRTheory.git).
4.Why Aren't We NER Yet? Artifacts of ASR Errors in Named Entity Recognition in Spontaneous Speech Transcripts
Piotr Szymański,Lukasz Augustyniak,Mikolaj Morzy,Adrian Szymczak,Krzysztof Surdyk,Piotr Żelasko
Download URL
https://aclanthology.org/2023.acl-long.98/
abstract
Transcripts of spontaneous human speech present a significant obstacle for traditional NER models. The lack of grammatical structure of spoken utterances and word errors introduced by the ASR make downstream NLP tasks challenging. In this paper, we examine in detail the complex relationship between ASR and NER errors which limit the ability of NER models to recover entity mentions from spontaneous speech transcripts. Using publicly available benchmark datasets (SWNE, Earnings-21, OntoNotes), we present the full taxonomy of ASR-NER errors and measure their true impact on entity recognition. We find that NER models fail spectacularly even if no word errors are introduced by the ASR. We also show why the F1 score is inadequate to evaluate NER models on conversational transcripts.
5.BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
Claytone Sikasote,Eunice Mukonde,Md Mahfuz Ibn Alam,Antonios Anastasopoulos
Download URL
https://aclanthology.org/2023.acl-long.115/
abstract
We present BIG-C (Bemba Image Grounded Conversations), a large multimodal dataset for Bemba. While Bemba is the most populous language of Zambia, it exhibits a dearth of resources which render the development of language technologies or language processing research almost impossible. The dataset is comprised of multi-turn dialogues between Bemba speakers based on images, transcribed and translated into English. There are more than 92,000 utterances/sentences, amounting to more than 180 hours of audio data with corresponding transcriptions and English translations. We also provide baselines on speech recognition (ASR), machine translation (MT) and speech translation (ST) tasks, and sketch out other potential future multimodal uses of our dataset. We hope that by making the dataset available to the research community, this work will foster research and encourage collaboration across the language, speech, and vision communities, especially for languages outside the "traditionally" used high-resourced ones. All data and code are publicly available: [https://github.com/csikasote/bigc](https://github.com/csikasote/bigc).
6.Helping a Friend or Supporting a Cause? Disentangling Active and Passive Cosponsorship in the U.S. Congress
Giuseppe Russo,Christoph Gote,Laurence Brandenberger,Sophia Schlosser,Frank Schweitzer
Download URL
https://aclanthology.org/2023.acl-long.166/
abstract
In the U.S. Congress, legislators can use active and passive cosponsorship to support bills. We show that these two types of cosponsorship are driven by two different motivations: the backing of political colleagues and the backing of the bill's content. To this end, we develop an Encoder+RGCN based model that learns legislator representations from bill texts and speech transcripts. These representations predict active and passive cosponsorship with an F1-score of 0.88. Applying our representations to predict voting decisions, we show that they are interpretable and generalize to unseen tasks.
7.WACO: Word-Aligned Contrastive Learning for Speech Translation
Siqi Ouyang,Rong Ye,Lei Li
Download URL
https://aclanthology.org/2023.acl-long.216/
abstract
End-to-end Speech Translation (E2E ST) aims to directly translate source speech into target text. Existing ST methods perform poorly when only extremely small speech-text data are available for training. We observe that an ST model's performance closely correlates with its embedding similarity between speech and source transcript. In this paper, we propose Word-Aligned COntrastive learning (WACO), a simple and effective method for extremely low-resource speech-to-text translation. Our key idea is bridging word-level representations for both speech and text modalities via contrastive learning. We evaluate WACO and other methods on the MuST-C dataset, a widely used ST benchmark, and on a low-resource direction Maltese-English from IWSLT 2023. Our experiments demonstrate that WACO outperforms the best baseline by 9+ BLEU points with only 1-hour parallel ST data. Code is available at https://github.com/owaski/WACO.
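The core objective sketched in the abstract is a word-level contrastive loss between speech and text. Below is a rough PyTorch illustration under the assumption that word-to-frame and word-to-token alignments are already available (e.g., from a forced aligner); it is not the authors' code.

```python
import torch
import torch.nn.functional as F

def word_aligned_contrastive_loss(speech_feats, text_feats, spans, temperature=0.1):
    """InfoNCE over word-level representations (rough sketch of the WACO idea).

    speech_feats: (T, d) frame-level speech encoder outputs
    text_feats:   (N, d) token-level text encoder outputs
    spans: list of (frame_start, frame_end, tok_start, tok_end) per word,
           assumed to come from an external aligner.
    """
    s_words, t_words = [], []
    for fs, fe, ts, te in spans:
        s_words.append(speech_feats[fs:fe].mean(dim=0))   # mean-pool frames of the word
        t_words.append(text_feats[ts:te].mean(dim=0))     # mean-pool its subword tokens
    S = F.normalize(torch.stack(s_words), dim=-1)          # (W, d)
    T = F.normalize(torch.stack(t_words), dim=-1)          # (W, d)
    logits = S @ T.t() / temperature                       # word-to-word similarities
    labels = torch.arange(S.size(0))                       # matching words are positives
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Example with random features and two "words"
speech = torch.randn(50, 256)
text = torch.randn(8, 256)
loss = word_aligned_contrastive_loss(speech, text, [(0, 20, 0, 3), (20, 50, 3, 8)])
```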
8.Back Translation for Speech-to-text Translation Without Transcripts
Qingkai Fang,Yang Feng
Download URL
https://aclanthology.org/2023.acl-long.251/
abstract
The success of end-to-end speech-to-text translation (ST) is often achieved by utilizing source transcripts, e.g., by pre-training with automatic speech recognition (ASR) and machine translation (MT) tasks, or by introducing additional ASR and MT data. Unfortunately, transcripts are only sometimes available since numerous unwritten languages exist worldwide. In this paper, we aim to utilize large amounts of target-side monolingual data to enhance ST without transcripts. Motivated by the remarkable success of back translation in MT, we develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data. To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units and achieve back translation by cascading a target-to-unit model and a unit-to-speech model. With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets. More experiments show that our method is especially effective in low-resource scenarios.
9.What the DAAM: Interpreting Stable Diffusion Using Cross Attention
Raphael Tang,Linqing Liu,Akshat Pandey,Zhiying Jiang,Gefei Yang,Karun Kumar,Pontus Stenetorp,Jimmy Lin,Ferhan Ture
Download URL
https://aclanthology.org/2023.acl-long.310/
abstract
Diffusion models are a milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce attribution maps, we upscale and aggregate cross-attention maps in the denoising module, naming our method DAAM. We validate it by testing its segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. On two generated datasets, we attain a competitive 58.8-64.8 mIoU on noun segmentation and fair to good mean opinion scores (3.4-4.2) on generalized attribution. Then, we apply DAAM to study the role of syntax in the pixel space across head-dependent heat map interaction patterns for ten common dependency relations. We show that, for some relations, the head map consistently subsumes the dependent, while the opposite is true for others. Finally, we study several semantic phenomena, focusing on feature entanglement; we find that the presence of cohyponyms worsens generation quality by 9%, and descriptive adjectives attend too broadly. We are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future research. Our code is at https://github.com/castorini/daam.
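The heat-map construction described ("upscale and aggregate cross-attention maps") can be sketched as follows, assuming the per-layer cross-attention tensors have already been captured from the denoising network (dummy tensors are used here; hooking Stable Diffusion itself is not shown).

```python
import torch
import torch.nn.functional as F

def aggregate_attention_maps(attn_maps, token_idx, out_size=(512, 512)):
    """Build a pixel-space heat map for one prompt token from cross-attention maps,
    in the spirit of DAAM: upscale each map to image resolution and sum them.

    attn_maps: list of tensors of shape (heads, H*W, num_tokens), one per
               cross-attention layer / denoising step (dummy tensors here).
    """
    heat = torch.zeros(out_size)
    for attn in attn_maps:
        heads, hw, _ = attn.shape
        side = int(hw ** 0.5)                               # maps are square (side x side)
        token_map = attn[:, :, token_idx].mean(dim=0)       # average over heads
        token_map = token_map.reshape(1, 1, side, side)
        up = F.interpolate(token_map, size=out_size, mode="bicubic", align_corners=False)
        heat += up[0, 0]
    return heat / heat.max()                                # normalize to [0, 1]

# Dummy attention from two layers at different resolutions, 77 prompt tokens
maps = [torch.rand(8, 64 * 64, 77), torch.rand(8, 32 * 32, 77)]
heatmap = aggregate_attention_maps(maps, token_idx=5)
```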
10.Counterspeeches up my sleeve! Intent Distribution Learning and Persistent Fusion for Intent-Conditioned Counterspeech Generation
Rishabh Gupta,Shaily Desai,Manvi Goel,Anil Bandhakavi,Tanmoy Chakraborty,Md. Shad Akhtar
Download URL
https://aclanthology.org/2023.acl-long.318/
abstract
Counterspeech has been demonstrated to be an efficacious approach for combating hate speech. While various conventional and controlled approaches have been studied in recent years to generate counterspeech, a counterspeech with a certain intent may not be sufficient in every scenario. Due to the complex and multifaceted nature of hate speech, utilizing multiple forms of counter-narratives with varying intents may be advantageous in different circumstances. In this paper, we explore intent-conditioned counterspeech generation. At first, we develop IntentCONAN, a diversified intent-specific counterspeech dataset with 6831 counterspeeches conditioned on five intents, i.e., informative, denouncing, question, positive, and humour. Subsequently, we propose QUARC, a two-stage framework for intent-conditioned counterspeech generation. QUARC leverages vector-quantized representations learned for each intent category along with PerFuMe, a novel fusion module to incorporate intent-specific information into the model. Our evaluation demonstrates that QUARC outperforms several baselines by an average of ~10% across evaluation metrics. An extensive human evaluation supplements our hypothesis of better and more appropriate responses than comparative systems.
11.DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation
Suraj Kothawade,Anmol Mekala,D.Chandra Sekhara Hetha Havya,Mayank Kothyari,Rishabh Iyer,Ganesh Ramakrishnan,Preethi Jyothi
Download URL
https://aclanthology.org/2023.acl-long.319/
abstract
State-of-the-art Automatic Speech Recognition (ASR) systems are known to exhibit disparate performance on varying speech accents. To improve performance on a specific target accent, a commonly adopted solution is to finetune the ASR model using accent-specific labeled speech. However, acquiring large amounts of labeled speech for specific target accents is challenging. Choosing an informative subset of speech samples that are most representative of the target accents becomes important for effective ASR finetuning. To address this problem, we propose DITTO (Data-efficient and faIr Targeted subseT selectiOn), which uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. An important feature of DITTO is that it supports fair targeting for multiple accents, i.e. it can automatically select representative data points from multiple accents when the ASR model needs to perform well on more than one accent. We show that compared to other speech selection methods, DITTO is 3-5 times as label-efficient for its improvements on the Indic-TTS and L2 datasets.
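As a rough illustration of SMI-based targeted selection (not the DITTO code), the sketch below greedily maximizes a facility-location mutual information objective between a candidate pool and a small target-accent query set, using cosine similarities over utterance embeddings; the embeddings, budget, and trade-off weight are placeholders.

```python
import numpy as np

def greedy_flmi_selection(pool, targets, budget, eta=1.0):
    """Greedy subset selection with a facility-location mutual information objective:
        f(A) = sum_q max_{a in A} s(q, a) + eta * sum_{a in A} max_q s(q, a)
    pool:    (n, d) candidate utterance embeddings
    targets: (m, d) embeddings of target-accent utterances
    Returns indices of the selected subset (size <= budget).
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = norm(targets) @ norm(pool).T               # (m, n) cosine similarities
    selected, covered = [], np.zeros(sim.shape[0])   # covered[q] = max sim of q to A
    for _ in range(budget):
        best_j, best_gain = None, -np.inf
        for j in range(sim.shape[1]):
            if j in selected:
                continue
            gain = np.maximum(covered, sim[:, j]).sum() - covered.sum() + eta * sim[:, j].max()
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        covered = np.maximum(covered, sim[:, best_j])
    return selected

subset = greedy_flmi_selection(np.random.randn(1000, 64), np.random.randn(20, 64), budget=50)
```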
12.APOLLO: A Simple Approach for Adaptive Pretraining of Language Models for Logical Reasoning
Soumya Sanyal,Yichong Xu,Shuohang Wang,Ziyi Yang,Reid Pryzant,Wenhao Yu,Chenguang Zhu,Xiang Ren
Download URL
https://aclanthology.org/2023.acl-long.347/
abstract
Logical reasoning over text is an important ability that requires understanding the semantics of the text and reasoning through them to arrive at correct inferences. Prior works on pretraining language models to improve the logical reasoning ability require complex processing of training data (e.g., aligning symbolic knowledge to text), yielding task-specific data augmentation that is not easy to adapt to any general text corpus. In this work, we propose APOLLO, a simple adaptive pretraining approach to improve the logical reasoning skills of language models. We select a subset of Wikipedia for adaptive pretraining using a set of logical inference keywords as filter words. Further, we propose two self-supervised loss functions for training. First, we modify the masked language modeling loss only to mask specific parts-of-speech words that likely require higher-order reasoning to predict them. Second, we propose a sentence-level classification loss that teaches the model to distinguish between entailment and contradiction types of sentences. The proposed pretraining paradigm is both simple and independent of task formats. We demonstrate the effectiveness of APOLLO by comparing it with prior baselines on two logical reasoning datasets. APOLLO performs comparably on ReClor and outperforms baselines on LogiQA.
13.OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
Xize Cheng,Tao Jin,Linjun Li,Wang Lin,Xinyu Duan,Zhou Zhao
Download URL
https://aclanthology.org/2023.acl-long.363/
abstract
Speech Recognition builds a bridge between the multimedia streaming (audio-only, visual-only or audio-visual) and the corresponding text transcription. However, when training the specific model of new domain, it often gets stuck in the lack of new-domain utterances, especially the labeled visual utterances. To break through this restriction, we attempt to achieve zero-shot modality transfer by maintaining the multi-modality alignment in phoneme space learned with unlabeled multimedia utterances in the high resource domain during the pre-training, and propose a training system Open-modality Speech Recognition (OpenSR) that enables the models trained on a single modality (e.g., audio-only) applicable to more modalities (e.g., visual-only and audio-visual). Furthermore, we employ a cluster-based prompt tuning strategy to handle the domain shift for the scenarios with only common words in the new domain utterances. We demonstrate that OpenSR enables modality transfer from one to any in three different settings (zero-, few- and full-shot), and achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods. To the best of our knowledge, OpenSR achieves the state-of-the-art performance of word error rate in LRS2 on audio-visual speech recognition and lip-reading with 2.7% and 25.0%, respectively.
14.Prompting Language Models for Linguistic Structure
Terra Blevins,Hila Gonen,Luke Zettlemoyer
Download URL
https://aclanthology.org/2023.acl-long.367/
abstract
Although pretrained language models (PLMs) can be prompted to perform a wide range of language tasks, it remains an open question how much this ability comes from generalizable linguistic understanding versus surface-level lexical patterns. To test this, we present a structured prompting approach for linguistic structured prediction tasks, allowing us to perform zero- and few-shot sequence tagging with autoregressive PLMs. We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking, demonstrating strong few-shot performance in all cases. We also find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels. These findings indicate that the in-context learning ability and linguistic knowledge of PLMs generalizes beyond memorization of their training data.
15.Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis
Agam Shah,Suvan Paturi,Sudheer Chava
Download URL
https://aclanthology.org/2023.acl-long.368/
abstract
Monetary policy pronouncements by the Federal Open Market Committee (FOMC) are a major driver of financial market returns. We construct the largest tokenized and annotated dataset of FOMC speeches, meeting minutes, and press conference transcripts in order to understand how monetary policy influences financial markets. In this study, we develop a novel task of hawkish-dovish classification and benchmark various pre-trained language models on the proposed dataset. Using the best-performing model (RoBERTa-large), we construct a measure of monetary policy stance for the FOMC document release days. To evaluate the constructed measure, we study its impact on the treasury market, stock market, and macroeconomic indicators. Our dataset, models, and code are publicly available on Huggingface and GitHub under a CC BY-NC 4.0 license.
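Hawkish-dovish classification is a standard sequence-classification setup, so a hedged sketch with Hugging Face transformers looks like the following; the checkpoint name and label set are placeholders rather than the released artifacts, and the classification head is untrained until fine-tuned on the dataset.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint and labels; the paper releases its own models on the Hub,
# but their exact names are not reproduced here.
name = "roberta-large"
labels = ["dovish", "hawkish", "neutral"]   # assumed label set

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=len(labels))

sentence = "The Committee decided to raise the target range for the federal funds rate."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
# The new classification head is randomly initialized, so the prediction is
# arbitrary until the model is fine-tuned on labeled FOMC sentences.
print(labels[int(logits.argmax(dim=-1))])
```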
16.SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created through Human-Machine Collaboration
Hwaran Lee,Seokhee Hong,Joonsuk Park,Takyoung Kim,Meeyoung Cha,Yejin Choi,Byoungpil Kim,Gunhee Kim,Eun-Ju Lee,Yong Lim,Alice Oh,Sangchul Park,Jung-Woo Ha
Download URL
https://aclanthology.org/2023.acl-long.370/
abstract
The potential social harms that large language models pose, such as generating offensive content and reinforcing biases, are steeply rising. Existing works focus on coping with this concern while interacting with ill-intentioned users, such as those who explicitly make hate speech or elicit harmful responses. However, discussions on sensitive issues can become toxic even if the users are well-intentioned. For safer models in such scenarios, we present the Sensitive Questions and Acceptable Response (SQuARe) dataset, a large-scale Korean dataset of 49k sensitive questions with 42k acceptable and 46k non-acceptable responses. The dataset was constructed leveraging HyperCLOVA in a human-in-the-loop manner based on real news headlines. Experiments show that acceptable response generation significantly improves for HyperCLOVA and GPT-3, demonstrating the efficacy of this dataset.
17.Automatic Annotation of Direct Speech in Written French Narratives
Noé Durandard,Viet Anh Tran,Gaspard Michel,Elena Epure
Download URL
https://aclanthology.org/2023.acl-long.393/
abstract
The automatic annotation of direct speech (AADS) in written text has been often used in computational narrative understanding. Methods based on either rules or deep neural networks have been explored, in particular for English or German languages. Yet, for French, our target language, not many works exist. Our goal is to create a unified framework to design and evaluate AADS models in French. For this, we consolidated the largest-to-date French narrative dataset annotated with DS per word; we adapted various baselines for sequence labelling or from AADS in other languages; and we designed and conducted an extensive evaluation focused on generalisation. Results show that the task still requires substantial efforts and emphasise characteristics of each baseline. Although this framework could be improved, it is a step further to encourage more research on the topic.
18.CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation
Yan Zhou,Qingkai Fang,Yang Feng
Download URL
https://aclanthology.org/2023.acl-long.436/
abstract
End-to-end speech translation (ST) is the task of translating speech signals in the source language into text in the target language. As a cross-modal task, end-to-end ST is difficult to train with limited data. Existing methods often try to transfer knowledge from machine translation (MT), but their performances are restricted by the modality gap between speech and text. In this paper, we propose Cross-modal Mixup via Optimal Transport (CMOT) to overcome the modality gap. We find the alignment between speech and text sequences via optimal transport and then mix up the sequences from different modalities at a token level using the alignment. Experiments on the MuST-C ST benchmark demonstrate that CMOT achieves an average BLEU of 30.0 in 8 translation directions, outperforming previous methods. Further analysis shows CMOT can adaptively find the alignment between modalities, which helps alleviate the modality gap between speech and text.
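The two steps named in the abstract, optimal-transport alignment followed by token-level mixup, can be sketched roughly as below: a Sinkhorn solver produces a transport plan between speech frames and text tokens, and each text token is mixed with its OT-weighted speech summary. This is an illustrative approximation, not the CMOT implementation, and the mixing ratio is a placeholder.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.1, iters=50):
    """Entropy-regularized OT between two uniform distributions (Sinkhorn iterations)."""
    m, n = cost.shape
    K = torch.exp(-cost / eps)
    a = torch.full((m,), 1.0 / m)
    b = torch.full((n,), 1.0 / n)
    v = torch.ones(n)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)       # transport plan, shape (m, n)

def cross_modal_mixup(speech_feats, text_feats, alpha=0.5):
    """Align speech frames to text tokens with OT, then mix each text token with
    its OT-weighted speech counterpart (a rough sketch of the CMOT idea)."""
    cost = 1 - F.normalize(speech_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
    plan = sinkhorn(cost)                            # (T_speech, N_text)
    weights = plan / plan.sum(dim=0, keepdim=True)   # per-token distribution over frames
    aligned_speech = weights.t() @ speech_feats      # (N_text, d) speech summary per token
    return alpha * text_feats + (1 - alpha) * aligned_speech

mixed = cross_modal_mixup(torch.randn(60, 512), torch.randn(12, 512))
```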
19.Speech-Text Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment
Tianshu Yu,Haoyu Gao,Ting-En Lin,Min Yang,Yuchuan Wu,Wentao Ma,Chao Wang,Fei Huang,Yongbin Li
Download URL
https://aclanthology.org/2023.acl-long.438/
abstract
Recently, speech-text pre-training methods have shown remarkable success in many speech and natural language processing tasks. However, most previous pre-trained models are usually tailored for one or two specific tasks, but fail to conquer a wide range of speech-text tasks. In addition, existing speech-text pre-training methods fail to explore the contextual information within a dialogue to enrich utterance representations. In this paper, we propose Speech-text Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first-ever speech-text dialog pre-training model. Concretely, to consider the temporality of speech modality, we design a novel temporal position prediction task to capture the speech-text alignment. This pre-training task aims to predict the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs, we generalize a response selection task from textual dialog pre-training to speech-text dialog pre-training scenarios. Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.
20.READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises
Chenglei Si,Zhengyan Zhang,Yingfa Chen,Xiaozhi Wang,Zhiyuan Liu,Maosong Sun
Download URL
https://aclanthology.org/2023.acl-long.460/
abstract
For many real-world applications, the user-generated inputs usually contain various noises due to speech recognition errors caused by linguistic variations or typographical errors (typos). Thus, it is crucial to test model performance on data with realistic input noises to ensure robustness and fairness. However, little study has been done to construct such benchmarks for Chinese, where various language-specific input noises happen in the real world. In order to fill this important gap, we construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises. READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input. We designed our annotation pipeline to maximize diversity, for example by instructing the annotators to use diverse input method editors (IMEs) for keyboard noises and recruiting speakers from diverse dialectical groups for speech noises. We experiment with a series of strong pretrained language models as well as robust training methods, and we find that these models often suffer significant performance drops on READIN even with robustness methods like data augmentation. As the first large-scale attempt in creating a benchmark with noises geared towards user-generated inputs, we believe that READIN serves as an important complement to existing Chinese NLP benchmarks. The source code and dataset can be obtained from https://github.com/thunlp/READIN.
21.AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Rongjie Huang,Huadai Liu,Xize Cheng,Yi Ren,Linjun Li,Zhenhui Ye,Jinzheng He,Lichao Zhang,Jinglin Liu,Xiang Yin,Zhou Zhao
Download URL
https://aclanthology.org/2023.acl-long.479/
abstract
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Despite the recent success, current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech (AV-S2ST) translation model without relying on intermediate text. AV-TranSpeech complements the audio stream with visual information to promote system robustness and opens up a host of practical applications: dictation or dubbing archival films. To mitigate the data scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representation, and 2) introduce cross-modal distillation with S2ST models trained on the audio-only corpus to further reduce the requirements of visual data. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings regardless of the type of noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation yields an improvement of 7.6 BLEU on average compared with baselines. Audio samples are available at https://AV-TranSpeech.github.io/.
22.SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks
Suwon Shon,Siddhant Arora,Chyi-Jiunn Lin,Ankita Pasad,Felix Wu,Roshan S Sharma,Wei-Lun Wu,Hung-yi Lee,Karen Livescu,Shinji Watanabe
Download URL
https://aclanthology.org/2023.acl-long.496/
abstract
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will release a new benchmark suite, including for each task (i) curated annotations for a relatively small fine-tuning set, (ii) reproducible pipeline (speech recognizer + text model) and end-to-end baseline models and evaluation metrics, (iii) baseline model performance in various types of systems for easy comparisons. We present the details of data collection and annotation and the performance of the baseline models. We also analyze the sensitivity of pipeline models' performance to the speech recognition accuracy, using more than 20 publicly available speech recognition models.
23.BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric
Mingda Chen,Paul-Ambroise Duquenne,Pierre Andrews,Justine Kao,Alexandre Mourachko,Holger Schwenk,Marta R. Costa-jussà
Download URL
https://aclanthology.org/2023.acl-long.504/
abstract
End-to-End speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output and reference into a shared embedding space and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that when evaluated at the sentence level, BLASER correlates significantly better with human judgment compared to ASR-dependent metrics including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows combining speech and text as inputs to BLASER does not increase the correlation with human scores, but best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
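The text-free idea is to score a translation directly from embeddings of the source speech, the system output, and the reference. A minimal unsupervised variant of that idea is sketched below with placeholder embedding tensors standing in for the shared multilingual multimodal encoder; BLASER's supervised variant instead trains a small regressor on such embeddings against human ratings.

```python
import torch
import torch.nn.functional as F

def embedding_based_score(src_emb, mt_emb, ref_emb):
    """Unsupervised, embedding-based translation score (a rough sketch of the
    text-free evaluation idea): combine cosine similarities between source,
    system output, and reference embeddings from a shared speech encoder."""
    sim_src = F.cosine_similarity(mt_emb, src_emb, dim=-1)   # adequacy w.r.t. source
    sim_ref = F.cosine_similarity(mt_emb, ref_emb, dim=-1)   # agreement with reference
    return (sim_src + sim_ref) / 2

# Placeholder 1024-dim embeddings for the three speech segments
score = embedding_based_score(torch.randn(1024), torch.randn(1024), torch.randn(1024))
```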
24.NLPositionality: Characterizing Design Biases of Datasets and Models
Sebastin Santy,Jenny Liang,Ronan Le Bras,Katharina Reinecke,Maarten Sap
Download URL
https://aclanthology.org/2023.acl-long.505/
abstract
Design biases in NLP systems, such as performance differences for different populations, often stem from their creator's positionality, i.e., views and lived experiences shaped by identity and background. Despite the prevalence and risks of design biases, they are hard to quantify because researcher, system, and dataset positionality is often unobserved. We introduce NLPositionality, a framework for characterizing design biases and quantifying the positionality of NLP datasets and models. Our framework continuously collects annotations from a diverse pool of volunteer participants on LabintheWild, and statistically quantifies alignment with dataset labels and model predictions. We apply NLPositionality to existing datasets and models for two tasks: social acceptability and hate speech detection. To date, we have collected 16,299 annotations in over a year for 600 instances from 1,096 annotators across 87 countries. We find that datasets and models align predominantly with Western, White, college-educated, and younger populations. Additionally, certain groups, such as non-binary people and non-native English speakers, are further marginalized by datasets and models as they rank least in alignment across all tasks. Finally, we draw from prior literature to discuss how researchers can examine their own positionality and that of their datasets and models, opening the door for more inclusive NLP systems.
25.CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training
Zhenhui Ye,Rongjie Huang,Yi Ren,Ziyue Jiang,Jinglin Liu,Jinzheng He,Xiang Yin,Zhou Zhao
Download URL
https://aclanthology.org/2023.acl-long.518/
abstract
Improving text representation has attracted much attention to achieve expressive text-to-speech (TTS). However, existing works only implicitly learn the prosody with masked token reconstruction tasks, which leads to low training efficiency and difficulty in prosody modeling. We propose CLAPSpeech, a cross-modal contrastive pre-training framework that learns from the prosody variance of the same text token under different contexts. Specifically, 1) with the design of a text encoder and a prosody encoder, we encourage the model to connect the text context with its corresponding prosody pattern in the joint multi-modal space; 2) we introduce a multi-scale pre-training pipeline to capture prosody patterns at multiple levels; and 3) we show how to incorporate CLAPSpeech into existing TTS models for better prosody. Experiments on three datasets not only show that CLAPSpeech could improve the prosody prediction for existing TTS methods, but also demonstrate its generalization ability to adapt to multiple languages and multi-speaker text-to-speech. We also deeply analyze the principle behind the performance of CLAPSpeech. Ablation studies demonstrate the necessity of each component in CLAPSpeech. Source code and audio samples are available at https://clapspeech.github.io.
26.How poor is the stimulus? Evaluating hierarchical generalization in neural networks trained on child-directed speech
Aditya Yedetore,Tal Linzen,Robert Frank,R. Thomas McCoy
Download URL
https://aclanthology.org/2023.acl-long.521/
abstract
When acquiring syntax, children consistently choose hierarchical rules over competing non-hierarchical possibilities. Is this preference due to a learning bias for hierarchical structure, or due to more general biases that interact with hierarchical cues in children's linguistic input? We explore these possibilities by training LSTMs and Transformers - two types of neural networks without a hierarchical bias - on data similar in quantity and content to children's linguistic input: text from the CHILDES corpus. We then evaluate what these models have learned about English yes/no questions, a phenomenon for which hierarchical structure is crucial. We find that, though they perform well at capturing the surface statistics of child-directed speech (as measured by perplexity), both model types generalize in a way more consistent with an incorrect linear rule than the correct hierarchical rule. These results suggest that human-like generalization from text alone requires stronger biases than the general sequence-processing biases of standard neural network architectures.
27.Simple and Effective Unsupervised Speech Translation
Changhan Wang,Hirofumi Inaguma,Peng-Jen Chen,Ilia Kulikov,Yun Tang,Wei-Ning Hsu,Michael Auli,Juan Pino
Download URL
https://aclanthology.org/2023.acl-long.602/
abstract
The amount of labeled data to train models for speech tasks is limited for most languages; the data scarcity is exacerbated for speech translation, which requires labeled data covering two different languages. To address this issue, we study a simple and effective approach to build speech translation systems without labeled data by leveraging recent advances in unsupervised speech recognition, machine translation and speech synthesis, either in a pipeline approach, or to generate pseudo-labels for training end-to-end speech translation models. Furthermore, we present an unsupervised domain adaptation technique for pre-trained speech models which improves the performance of downstream unsupervised speech recognition, especially for low-resource settings. Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art by 3.2 BLEU on the Libri-Trans benchmark. On CoVoST 2, our best systems outperform the best supervised end-to-end models (without pre-training) from only two years ago by an average of 5.0 BLEU over five X-En directions. We also report competitive results on MuST-C and CVSS benchmarks.
28.MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages
Cheikh M. Bamba Dione,David Ifeoluwa Adelani,Peter Nabende,Jesujoba Alabi,Thapelo Sindane,Happy Buzaaba,Shamsuddeen Hassan Muhammad,Chris Chinenye Emezue,Perez Ogayo,Anuoluwapo Aremu,Catherine Gitau,Derguene Mbaye,Jonathan Mukiibi,Blessing Sibanda,Bonaventure F. P. Dossou,Andiswa Bukula,Rooweither Mabuya,Allahsera Auguste Tapo,Edwin Munkoh-Buabeng,Victoire Memdjokam Koagne,Fatoumata Ouoba Kabore,Amelia Taylor,Godson Kalipe,Tebogo Macucwa,Vukosi Marivate,Tajuddeen Gwadabe,Mboning Tchiaze Elvis,Ikechukwu Onyenwe,Gratien Atindogbe,Tolulope Adelani,Idris Akinade,Olanrewaju Samuel,Marien Nahimana,Théogène Musabeyezu,Emile Niyomutabazi,Ester Chimhenga,Kudzai Gotosa,Patrick Mizha,Apelete Agbolo,Seydou Traore,Chinedu Uchechukwu,Aliyu Yusuf,Muhammad Abdullahi,Dietrich Klakow
Download URL
https://aclanthology.org/2023.acl-long.609/
abstract
In this paper, we present AfricaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the universal dependencies (UD) guidelines. We conducted extensive POS baseline experiments using both conditional random field and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in the UD. Evaluating on the AfricaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with parameter-fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems to be more effective for POS tagging in unseen languages.
29.Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning
Ye Wang,Wang Lin,Shengyu Zhang,Tao Jin,Linjun Li,Xize Cheng,Zhou Zhao
Download URL
https://aclanthology.org/2023.acl-long.611/
abstract
The task of spoken video grounding aims to localize moments in videos that are relevant to descriptive spoken queries. However, extracting semantic information from speech and modeling the cross-modal correlation pose two critical challenges. Previous studies solve them by representing spoken queries based on the matched video frames, which require tremendous effort for frame-level labeling. In this work, we investigate weakly-supervised spoken video grounding, i.e., learning to localize moments without expensive temporal annotations. To effectively represent the cross-modal semantics, we propose Semantic Interaction Learning (SIL), a novel framework consisting of the acoustic-semantic pre-training (ASP) and acoustic-visual contrastive learning (AVCL). In ASP, we pre-train an effective encoder for the grounding task with three comprehensive tasks, where the robustness task enhances stability by explicitly capturing the invariance between time- and frequency-domain features, the conciseness task avoids over-smooth attention by compressing long sequence into segments, and the semantic task improves spoken language understanding by modeling the precise semantics. In AVCL, we mine pseudo labels with discriminative sampling strategies and directly strengthen the interaction between speech and video by maximizing their mutual information. Extensive experiments demonstrate the effectiveness and superiority of our method.
30.How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases
Aaron Mueller,Tal Linzen
Download URL
https://aclanthology.org/2023.acl-long.629/
abstract
Accurate syntactic representations are essential for robust generalization in natural language. Recent work has found that pre-training can teach language models to rely on hierarchical syntactic features (as opposed to incorrect linear features) when performing tasks after fine-tuning. We test what aspects of pre-training are important for endowing encoder-decoder Transformers with an inductive bias that favors hierarchical syntactic generalizations. We focus on architectural features (depth, width, and number of parameters), as well as the genre and size of the pre-training corpus, diagnosing inductive biases using two syntactic transformation tasks: question formation and passivization, both in English. We find that the number of parameters alone does not explain hierarchical generalization: model depth plays a greater role than model width. We also find that pre-training on simpler language, such as child-directed speech, induces a hierarchical bias using an order-of-magnitude less data than pre-training on more typical datasets based on web text or Wikipedia; this suggests that in cognitively plausible language acquisition settings, neural language models may be more data-efficient than previously thought.
31.Introducing Semantics into Speech Encoders
Derek Xu,Shuyan Dong,Changhan Wang,Suyoun Kim,Zhaojiang Lin,Bing Liu,Akshat Shrivastava,Shang-Wen Li,Liang-Hsuan Tseng,Guan-Ting Lin,Alexei Baevski,Hung-yi Lee,Yizhou Sun,Wei Wang
Download URL
https://aclanthology.org/2023.acl-long.639/
abstract
Recent studies find existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined supervised automatic speech recognition (ASR) to large language model (LLM) systems achieve state-of-the-art results on semantic spoken language tasks by utilizing rich semantic representations from the LLM. These systems come at the cost of labeled audio transcriptions, which is expensive and time-consuming to obtain. We propose a task-agnostic unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve existing speech encoder spoken language understanding (SLU) performance by over 5% on intent classification (IC), with modest gains in named entity resolution (NER) and slot filling (SF), and spoken question answering (SQA) FF1 score by over 2%. Our approach, which uses no ASR data, achieves similar performance as methods trained on over 100 hours of labeled audio transcripts, demonstrating the feasibility of unsupervised semantic augmentations to existing speech encoders.
32.MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
Yuchen Hu,Chen Chen,Ruizhe Li,Heqing Zou,Eng Siong Chng
Download URL
https://aclanthology.org/2023.acl-long.649/
abstract
Audio-visual speech recognition (AVSR) attracts a surge of research interest recently by leveraging multimodal signals to understand human speech. Mainstream approaches addressing this task have developed sophisticated architectures and techniques for multi-modality fusion and representation learning. However, the natural heterogeneity of different modalities causes distribution gap between their representations, making it challenging to fuse them. In this paper, we aim to learn the shared representations across modalities to bridge their gap. Different from existing similar methods on other multimodal tasks like sentiment analysis, we focus on the temporal contextual dependencies considering the sequence-to-sequence task setting of AVSR. In particular, we propose an adversarial network to refine frame-level modality-invariant representations (MIR-GAN), which captures the commonality across modalities to ease the subsequent multimodal fusion process. Extensive experiments on public benchmarks LRS3 and LRS2 show that our approach outperforms the state-of-the-arts.
33.From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models
Shangbin Feng,Chan Young Park,Yuhan Liu,Yulia Tsvetkov
Download URL
https://aclanthology.org/2023.acl-long.656/
abstract
Language models (LMs) are pretrained on diverse data sources: news, discussion forums, books, online encyclopedias. A significant portion of this data includes facts and opinions which, on one hand, celebrate democracy and diversity of ideas, and on the other hand are inherently socially biased. Our work develops new methods to (1) measure media biases in LMs trained on such corpora, along social and economic axes, and (2) measure the fairness of downstream NLP models trained on top of politically biased LMs. We focus on hate speech and misinformation detection, aiming to empirically quantify the effects of political (social, economic) biases in pretraining data on the fairness of high-stakes social-oriented tasks. Our findings reveal that pretrained LMs do have political leanings which reinforce the polarization present in pretraining corpora, propagating social biases into hate speech predictions and media biases into misinformation detectors. We discuss the implications of our findings for NLP research and propose future directions to mitigate unfairness.
34.SLABERT Talk Pretty One Day: Modeling Second Language Acquisition with BERT
Aditya Yadavalli,Alekhya Yadavalli,Vera Tobin
Download URL
https://aclanthology.org/2023.acl-long.657/
abstract
Second language acquisition (SLA) research has extensively studied cross-linguistic transfer, the influence of linguistic structure of a speaker's native language [L1] on the successful acquisition of a foreign language [L2]. Effects of such transfer can be positive (facilitating acquisition) or negative (impeding acquisition). We find that NLP literature has not given enough attention to the phenomenon of negative transfer. To understand patterns of both positive and negative transfer between L1 and L2, we model sequential second language acquisition in LMs. Further, we build a Multilingual Age Ordered CHILDES (MAO-CHILDES), a dataset consisting of 5 typologically diverse languages, i.e., German, French, Polish, Indonesian, and Japanese, to understand the degree to which native Child-Directed Speech (CDS) [L1] can help or conflict with English language acquisition [L2]. To examine the impact of native CDS, we use the TILT-based cross-lingual transfer learning approach established by Papadimitriou and Jurafsky (2020) and find that, as in human SLA, language family distance predicts more negative transfer. Additionally, we find that conversational speech data shows greater facilitation for language acquisition than scripted speech data. Our findings call for further research using our novel Transformer-based SLA models and we would like to encourage it by releasing our code, data, and models.
35.Towards Domain-Agnostic and Domain-Adaptive Dementia Detection from Spoken Language
Shahla Farzana,Natalie Parde
Download URL
https://aclanthology.org/2023.acl-long.668/
abstract
Health-related speech datasets are often small and varied in focus. This makes it difficult to leverage them to effectively support healthcare goals. Robust transfer of linguistic features across different datasets orbiting the same goal carries potential to address this concern. To test this hypothesis, we experiment with domain adaptation (DA) techniques on heterogeneous spoken language data to evaluate generalizability across diverse datasets for a common task: dementia detection. We find that adapted models exhibit better performance across conversational and task-oriented datasets. The feature-augmented DA method achieves a 22% increase in accuracy adapting from a conversational to task-specific dataset compared to a jointly trained baseline. This suggests promising capacity of these techniques to allow for productive use of disparate data for a complex spoken language healthcare task.
36.Transforming Visual Scene Graphs to Image Captions
Xu Yang,Jiawei Peng,Zihua Wang,Haiyang Xu,Qinghao Ye,Chenliang Li,Songfang Huang,Fei Huang,Zhangzikang Li,Yu Zhang
Download URL
https://aclanthology.org/2023.acl-long.694/
abstract
We propose to TransForm Scene Graphs into more descriptive Captions (TFSGC). In TFSGC, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) for embedding scene graphs. After embedding, different graph embeddings contain diverse specific knowledge for generating the words with different part-of-speech, e.g., object/attribute embedding is good for generating nouns/adjectives. Motivated by this, we design a Mixture-of-Expert (MOE)-based decoder, where each expert is built on MHA, for discriminating the graph embeddings to generate different kinds of words. Since both the encoder and decoder are built based on the MHA, as a result, we construct a simple and homogeneous encoder-decoder unlike the previous heterogeneous ones which usually apply Fully-Connected-based GNN and LSTM-based decoder. The homogeneous architecture enables us to unify the training configuration of the whole model instead of specifying different training strategies for diverse sub-networks as in the heterogeneous pipeline, which releases the training difficulty. Extensive experiments on the MS-COCO captioning benchmark validate the effectiveness of our TFSGC. The code is in: https://anonymous.4open.science/r/ACL23_TFSGC.
37.Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks
Yun Tang,Anna Sun,Hirofumi Inaguma,Xinyue Chen,Ning Dong,Xutai Ma,Paden Tomasello,Juan Pino
Download URL
https://aclanthology.org/2023.acl-long.695/
abstract
Transducer and Attention based Encoder-Decoder (AED) are two widely used frameworks for speech-to-text tasks. They are designed for different purposes and each has its own benefits and drawbacks for speech-to-text tasks. In order to leverage strengths of both modeling methods, we propose a solution by combining Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks. The new method leverages AED's strength in non-monotonic sequence to sequence learning while retaining Transducer's streaming property. In the proposed framework, Transducer and AED share the same speech encoder. The predictor in Transducer is replaced by the decoder in the AED model, and the outputs of the decoder are conditioned on the speech inputs instead of outputs from an unconditioned language model. The proposed solution ensures that the model is optimized by covering all possible read/write scenarios and creates a matched environment for streaming applications. We evaluate the proposed approach on the MuST-C dataset and the findings demonstrate that TAED performs significantly better than Transducer for offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks. In the streaming case, TAED outperforms Transducer in the ASR task and one ST direction while comparable results are achieved in another translation direction.
38.Language of Bargaining
Mourad Heddaya,Solomon Dworkin,Chenhao Tan,Rob Voigt,Alexander Zentefis
Download URL
https://aclanthology.org/2023.acl-long.735/
abstract
Leveraging an established exercise in negotiation education, we build a novel dataset for studying how the use of language shapes bilateral bargaining. Our dataset extends existing work in two ways: 1) we recruit participants via behavioral labs instead of crowdsourcing platforms and allow participants to negotiate through audio, enabling more naturalistic interactions; 2) we add a control setting where participants negotiate only through alternating, written numeric offers. Despite the two contrasting forms of communication, we find that the average agreed prices of the two treatments are identical. But when subjects can talk, fewer offers are exchanged, negotiations finish faster, the likelihood of reaching agreement rises, and the variance of prices at which subjects agree drops substantially. We further propose a taxonomy of speech acts in negotiation and enrich the dataset with annotated speech acts. We set up prediction tasks to predict negotiation success and find that being reactive to the arguments of the other party is advantageous over driving the negotiation.
39.CTC-based Non-autoregressive Speech Translation
Chen Xu,Xiaoqian Liu,Xiaowen Liu,Qingxuan Sun,Yuhao Zhang,Murun Yang,Qianqian Dong,Tom Ko,Mingxuan Wang,Tong Xiao,Anxiang Ma,Jingbo Zhu
Download URL
https://aclanthology.org/2023.acl-long.744/
abstract
Combining end-to-end speech translation (ST) and non-autoregressive (NAR) generation is promising in language and speech processing for their advantages of less error propagation and low latency. In this paper, we investigate the potential of connectionist temporal classification (CTC) for non-autoregressive speech translation (NAST). In particular, we develop a model consisting of two encoders that are guided by CTC to predict the source and target texts, respectively. Introducing CTC into NAST on both language sides has obvious challenges: 1) the conditionally independent generation somewhat breaks the interdependency among tokens, and 2) the monotonic alignment assumption in standard CTC does not hold in translation tasks. In response, we develop a prediction-aware encoding approach and a cross-layer attention approach to address these issues. We also use curriculum learning to improve convergence of training. Experiments on the MuST-C ST benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×, which is comparable to the autoregressive counterpart and even outperforms the previous best result by 0.9 BLEU points.
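The core layout, two stacked encoders each supervised by a CTC objective over the source transcript and the target translation respectively, can be sketched in a few lines. The shapes, layer counts, and vocabularies below are assumptions for illustration; the paper's prediction-aware encoding and cross-layer attention are not reproduced here.

```python
# Rough sketch of the dual-encoder, dual-CTC layout for non-autoregressive ST.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualCTCNAST(nn.Module):
    def __init__(self, d=256, src_vocab=5000, tgt_vocab=5000):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d, 4, batch_first=True)
        self.src_encoder = nn.TransformerEncoder(layer(), 2)
        self.tgt_encoder = nn.TransformerEncoder(layer(), 2)
        self.src_ctc = nn.Linear(d, src_vocab)   # predicts source transcript
        self.tgt_ctc = nn.Linear(d, tgt_vocab)   # predicts target translation

    def forward(self, speech):
        h_src = self.src_encoder(speech)
        h_tgt = self.tgt_encoder(h_src)          # target encoder stacked on top
        return self.src_ctc(h_src), self.tgt_ctc(h_tgt)

model = DualCTCNAST()
src_logits, tgt_logits = model(torch.randn(2, 100, 256))

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
def ctc_loss(logits, targets, target_lens):
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (T, B, V) for CTCLoss
    input_lens = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
    return ctc(log_probs, targets, input_lens, target_lens)

loss = ctc_loss(src_logits, torch.randint(1, 5000, (2, 20)), torch.tensor([20, 18])) \
     + ctc_loss(tgt_logits, torch.randint(1, 5000, (2, 25)), torch.tensor([25, 22]))
print(loss.item())
```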
40.Attention as a Guide for Simultaneous Speech Translation
Sara Papi,Matteo Negri,Marco Turchi
Download URL
https://aclanthology.org/2023.acl-long.745/
abstract
In simultaneous speech translation (SimulST), effective policies that determine when to write partial translations are crucial to reach high output quality with low latency. Towards this objective, we propose EDAtt (Encoder-Decoder Attention), an adaptive policy that exploits the attention patterns between the audio source and the target textual translation to guide an offline-trained ST model during simultaneous inference. EDAtt exploits the attention scores modeling the audio-translation relation to decide whether to emit a partial hypothesis or wait for more audio input. This is done under the assumption that, if attention is focused towards the most recently received speech segments, the information they provide can be insufficient to generate the hypothesis (indicating that the system has to wait for additional audio input). Results on en→de and en→es show that EDAtt yields better results compared to the SimulST state of the art, with gains of up to 7 and 4 BLEU points for the two languages respectively, and with a reduction in computation-aware latency of up to 1.4s and 0.7s compared to existing SimulST policies applied to offline-trained models.
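The write/wait decision itself is simple to state: if the candidate token's cross-attention mass sits mostly on the newest audio frames, the hypothesis is probably premature. Below is a small sketch of that rule under stated assumptions; the threshold and window are illustrative values, not the paper's tuned hyperparameters.

```python
# Sketch of an EDAtt-style emission rule based on cross-attention mass.
import torch

def edatt_policy(cross_attention, last_frames=2, alpha=0.1):
    """
    cross_attention: (num_audio_frames,) attention of the candidate target token,
    averaged over heads/layers. Returns True if the token can be emitted now.
    """
    recent_mass = cross_attention[-last_frames:].sum().item()
    return recent_mass < alpha   # low mass on recent audio => hypothesis is stable

attn = torch.softmax(torch.randn(40), dim=0)
if edatt_policy(attn):
    print("WRITE: emit the partial translation token")
else:
    print("READ: wait for more audio before committing")
```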
41.MeetingQA: Extractive Question-Answering on Meeting Transcripts
Archiki Prasad,Trung Bui,Seunghyun Yoon,Hanieh Deilamsalehy,Franck Dernoncourt,Mohit Bansal
Download URL
https://aclanthology.org/2023.acl-long.837/
abstract
With the ubiquitous use of online meeting platforms and robust automatic speech recognition systems, meeting transcripts have emerged as a promising domain for natural language tasks. Most recent works on meeting transcripts primarily focus on summarization and extraction of action items. However, meeting discussions also have a useful question-answering (QA) component, crucial to understanding the discourse or meeting content, which can be used to build interactive interfaces on top of long transcripts. Hence, in this work, we leverage this inherent QA component of meeting discussions and introduce MeetingQA, an extractive QA dataset comprising questions asked by meeting participants and corresponding responses. As a result, questions can be open-ended and actively seek discussions, while the answers can be multi-span and distributed across multiple speakers. Our comprehensive empirical study of several robust baselines, including long-context language models and recent instruction-tuned models, reveals that models perform poorly on this task (F1 = 57.3) and severely lag behind human performance (F1 = 84.6), thus presenting a challenging new task for the community to improve upon.
42.From Dogwhistles to Bullhorns: Unveiling Coded Rhetoric with Language Models
Julia Mendelsohn,Ronan Le Bras,Yejin Choi,Maarten Sap
Download URL
https://aclanthology.org/2023.acl-long.845/
abstract
Dogwhistles are coded expressions that simultaneously convey one meaning to a broad audience and a second, often hateful or provocative, meaning to a narrow in-group; they are deployed to evade both political repercussions and algorithmic content moderation. For example, the word "cosmopolitan" in a sentence such as "we need to end the cosmopolitan experiment" can mean "worldly" to many but also secretly mean "Jewish" to a select few. We present the first large-scale computational investigation of dogwhistles. We develop a typology of dogwhistles, curate the largest-to-date glossary of over 300 dogwhistles with rich contextual information and examples, and analyze their usage in historical U.S. politicians' speeches. We then assess whether a large language model (GPT-3) can identify dogwhistles and their meanings, and find that GPT-3's performance varies widely across types of dogwhistles and targeted groups. Finally, we show that harmful content containing dogwhistles avoids toxicity detection, highlighting online risks presented by such coded language. This work sheds light on the theoretical and applied importance of dogwhistles in both NLP and computational social science, and provides resources to facilitate future research in modeling dogwhistles and mitigating their online harms.
43.Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition
Yuchen Hu,Ruizhe Li,Chen Chen,Chengwei Qin,Qiu-Shi Zhu,Eng Siong Chng
Download URL
https://aclanthology.org/2023.acl-long.848/
abstract
Audio-visual speech recognition (AVSR) provides a promising solution to ameliorate the noise-robustness of audio-only speech recognition with visual information. However, most existing efforts still focus on the audio modality to improve robustness, considering its dominance in the AVSR task, with noise adaptation techniques such as front-end denoising. Though effective, these methods are usually faced with two practical challenges: 1) lack of sufficient labeled noisy audio-visual training data in some real-world scenarios and 2) less-than-optimal model generality to unseen testing noises. In this work, we investigate the noise-invariant visual modality to strengthen the robustness of AVSR, which can adapt to any testing noise without dependence on noisy training data, a.k.a. unsupervised noise adaptation. Inspired by the human perception mechanism, we propose a universal viseme-phoneme mapping (UniVPM) approach to implement modality transfer, which can restore clean audio from visual signals to enable speech recognition under any noisy conditions. Extensive experiments on the public benchmarks LRS3 and LRS2 show that our approach achieves the state of the art under various noisy as well as clean conditions. In addition, we also outperform the previous state of the art on the visual speech recognition task.
44.Toward Interactive Dictation
Belinda Z. Li,Jason Eisner,Adam Pauls,Sam Thomson
Download URL
https://aclanthology.org/2023.acl-long.854/
abstract
Voice dictation is an increasingly important text input modality. Existing systems that allow both dictation and editing-by-voice restrict their command language to flat templates invoked by trigger words. In this work, we study the feasibility of allowing users to interrupt their dictation with spoken editing commands in open-ended natural language. We introduce a new task and dataset, TERTiUS, to experiment with such systems. To support this flexibility in real time, a system must incrementally segment and classify spans of speech as either dictation or command, and interpret the spans that are commands. We experiment with using large pre-trained language models to predict the edited text or, alternatively, to predict a small text-editing program. Experiments show a natural trade-off between model accuracy and latency: a smaller model achieves 30% end-state accuracy with 1.3 seconds of latency, while a larger model achieves 55% end-state accuracy with 7 seconds of latency.
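The control flow described in the abstract, incrementally classifying each incoming span as dictation or command and then interpreting the commands, is easy to picture with a toy loop. The classifier and interpreter below are trivial rule-based stand-ins for illustration only, not the TERTiUS models.

```python
# Toy sketch of the segment-and-interpret loop for interactive dictation.
def process_stream(spans, classify, interpret):
    """spans: incoming ASR text chunks; classify -> 'dictation' | 'command'."""
    document = ""
    for span in spans:
        if classify(span) == "dictation":
            document += span                       # append dictated text verbatim
        else:
            document = interpret(span, document)   # apply the edit to the document
    return document

# Hypothetical stand-ins: a real system would use learned models here.
classify = lambda s: "command" if s.strip().lower().startswith("delete") else "dictation"
interpret = lambda cmd, doc: doc.rsplit(" ", 1)[0] if "last word" in cmd else doc

print(process_stream(["hello world ", "and more", "delete the last word"],
                     classify, interpret))
```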
45.UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
Hirofumi Inaguma,Sravya Popuri,Ilia Kulikov,Peng-Jen Chen,Changhan Wang,Yu-An Chung,Yun Tang,Ann Lee,Shinji Watanabe,Juan Pino
Download URL
https://aclanthology.org/2023.acl-long.872/
abstract
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches for achieving fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and subsequently predicts discrete acoustic units. We enhance model performance through subword prediction in the first-pass decoder, an advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder on a self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with a 2.83x decoding speed-up. We show that the proposed methods boost performance even when predicting spectrograms in the second pass. However, predicting discrete units achieves a 2.51x decoding speed-up compared to that case.
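The two-pass structure can be summarized as: speech encoder → first-pass text decoder → a small encoder over the first-pass states → second-pass discrete-unit decoder. The sketch below shows only that wiring; dimensions, layer counts, and conditioning details are assumptions rather than the released implementation.

```python
# Conceptual two-pass sketch of a UnitY-style S2ST model.
import torch
import torch.nn as nn

class TwoPassS2ST(nn.Module):
    def __init__(self, d=256, text_vocab=8000, unit_vocab=1000):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(nn.TransformerEncoderLayer(d, 4, batch_first=True), 2)
        dec = lambda: nn.TransformerDecoder(nn.TransformerDecoderLayer(d, 4, batch_first=True), 2)
        self.speech_encoder = enc()
        self.text_decoder, self.text_out = dec(), nn.Linear(d, text_vocab)   # first pass
        self.t2u_encoder = enc()                                             # bridges the passes
        self.unit_decoder, self.unit_out = dec(), nn.Linear(d, unit_vocab)   # second pass
        self.text_embed, self.unit_embed = nn.Embedding(text_vocab, d), nn.Embedding(unit_vocab, d)

    def forward(self, speech, text_prev, unit_prev):
        h_speech = self.speech_encoder(speech)
        h_text = self.text_decoder(self.text_embed(text_prev), h_speech)
        h_bridge = self.t2u_encoder(h_text)              # re-encode first-pass states
        h_unit = self.unit_decoder(self.unit_embed(unit_prev), h_bridge)
        return self.text_out(h_text), self.unit_out(h_unit)

model = TwoPassS2ST()
text_logits, unit_logits = model(torch.randn(2, 80, 256),
                                 torch.randint(0, 8000, (2, 15)),
                                 torch.randint(0, 1000, (2, 60)))
print(text_logits.shape, unit_logits.shape)
```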
46.Understanding and Bridging the Modality Gap for Speech Translation
Qingkai Fang,Yang Feng
Download URL
https://aclanthology.org/2023.acl-long.884/
abstract
How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT, in which additional MT data can help to learn the source-to-target mapping. However, due to the differences between speech and text, there is always a gap between ST and MT. In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias. We find that the modality gap is relatively small during training except for some difficult cases, but keeps increasing during inference due to the cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically, we regularize the output predictions of ST and MT, whose target-side contexts are derived by sampling between ground truth words and self-generated words with a varying probability. Furthermore, we introduce token-level adaptive training which assigns different training weights to target tokens to handle difficult cases with large modality gaps. Experiments and analysis show that our approach effectively bridges the modality gap, and achieves significant improvements over a strong baseline in all eight directions of the MuST-C dataset.
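Two ingredients do most of the work here: target-side contexts that mix ground-truth tokens with the model's own predictions, and a regularizer that pulls the ST and MT output distributions together on those shared contexts. The following is a minimal sketch of those two ideas with assumed interfaces; `st_model` and `mt_model` are placeholders for any decoder returning (batch, length, vocab) logits, and the loss weighting is illustrative.

```python
# Minimal sketch of scheduled sampling plus cross-modal output regularization.
import torch
import torch.nn.functional as F

def scheduled_context(gold_tokens, predicted_tokens, sample_prob):
    """Replace each gold context token by the model's own prediction with prob. sample_prob."""
    mask = torch.rand_like(gold_tokens, dtype=torch.float) < sample_prob
    return torch.where(mask, predicted_tokens, gold_tokens)

def cress_step(st_model, mt_model, speech, src_text, gold_tokens, sample_prob):
    with torch.no_grad():                       # self-generated context tokens
        st_pred = st_model(speech, gold_tokens).argmax(-1)
    context = scheduled_context(gold_tokens, st_pred, sample_prob)

    st_logits = st_model(speech, context)       # speech -> target
    mt_logits = mt_model(src_text, context)     # transcript -> target
    ce = F.cross_entropy(st_logits.transpose(1, 2), gold_tokens)
    # cross-modal regularization: match ST predictions to MT predictions
    reg = F.kl_div(F.log_softmax(st_logits, -1), F.softmax(mt_logits, -1),
                   reduction="batchmean")
    return ce + reg

# toy usage with stand-in models that ignore their inputs
B, T, V = 2, 10, 100
dummy = lambda *args: torch.randn(B, T, V)
gold = torch.randint(0, V, (B, T))
print(cress_step(dummy, dummy, None, None, gold, sample_prob=0.3).item())
```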
47.SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
Paul-Ambroise Duquenne,Hongyu Gong,Ning Dong,Jingfei Du,Ann Lee,Vedanuj Goswami,Changhan Wang,Juan Pino,Benoît Sagot,Holger Schwenk
Download URL
https://aclanthology.org/2023.acl-long.899/
abstract
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models will be publicly released.
48.A Weakly Supervised Classifier and Dataset of White Supremacist Language
Michael Yoder,Ahmad Diab,David Brown,Kathleen Carley
Download URL
https://aclanthology.org/2023.acl-short.17/
abstract
We present a dataset and classifier for detecting the language of white supremacist extremism, a growing issue in online hate speech. Our weakly supervised classifier is trained on large datasets of text from explicitly white supremacist domains paired with neutral and anti-racist data from similar domains. We demonstrate that this approach improves generalization performance to new domains. Incorporating anti-racist texts as counterexamples to white supremacist language mitigates bias.
49.An (unhelpful) guide to selecting the best ASR architecture for your under-resourced language
Robert Jimerson,Zoey Liu,Emily Prud'hommeaux
Download URL
https://aclanthology.org/2023.acl-short.87/
abstract
Advances in deep neural models for automatic speech recognition (ASR) have yielded dramatic improvements in ASR quality for resource-rich languages, with English ASR now achieving word error rates comparable to those of human transcribers. The vast majority of the world's languages, however, lack the quantity of data necessary to approach this level of accuracy. In this paper we use four of the most popular ASR toolkits to train ASR models for eleven languages with limited ASR training resources: eleven widely spoken languages of Africa, Asia, and South America, one endangered language of Central America, and three critically endangered languages of North America. We find that no single architecture consistently outperforms any other. These differences in performance so far do not appear to be related to any particular feature of the datasets or characteristics of the languages. These findings have important implications for future research in ASR for under-resourced languages. ASR systems for languages with abundant existing media and available speakers may derive the most benefit simply by collecting large amounts of additional acoustic and textual training data. Communities using ASR to support endangered language documentation efforts, who cannot easily collect more data, might instead focus on exploring multiple architectures and hyperparameterizations to optimize performance within the constraints of their available data and resources.
50.Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation
Yuchen Han,Chen Xu,Tong Xiao,Jingbo Zhu
Download URL
https://aclanthology.org/2023.acl-short.115/
abstract
Pre-training and fine-tuning is a paradigm for alleviating the data scarcity problem in end-to-end speech translation (E2E ST). The commonplace "modality gap" between speech and text data often leads to inconsistent inputs between pre-training and fine-tuning. However, we observe that this gap occurs in the early stages of fine-tuning but does not have a major impact on the final performance. On the other hand, we find that there is another gap, which we call the "capacity gap": high-resource tasks (such as ASR and MT) always require a large model to fit, and when the model is reused for a low-resource task (E2E ST), it yields sub-optimal performance due to over-fitting. In a case study, we find that regularization plays a more important role than the well-designed modality adaption method, achieving 29.0 for en-de and 40.3 for en-fr on the MuST-C dataset.
51.MOSPC: MOS Prediction Based on Pairwise Comparison
Kexin Wang,Yunlong Zhao,Qianqian Dong,Tom Ko,Mingxuan Wang
Download URL
https://aclanthology.org/2023.acl-short.132/
abstract
As a subjective metric to evaluate the quality of synthesized speech, the Mean Opinion Score (MOS) usually requires multiple annotators to score the same speech. Such an annotation approach requires a lot of manpower and is also time-consuming. A MOS prediction model for automatic evaluation can significantly reduce labor cost. In previous works, it is difficult to accurately rank the quality of speech when the MOS scores are close. However, in practical applications, it is more important to correctly rank the quality of synthesis systems or sentences than to simply predict MOS scores. Meanwhile, as each annotator scores multiple audios during annotation, the score is probably a relative value based on the first or the first few speech scores given by the annotator. Motivated by the above two points, we propose a general framework for MOS prediction based on pairwise comparison (MOSPC), and we utilize the C-Mixup algorithm to enhance the generalization performance of MOSPC. Experiments on BVCC and VCC2018 show that our framework outperforms the baselines on most of the correlation coefficient metrics, especially on the KTAU metric related to quality ranking. Our framework also surpasses the strong baseline in ranking accuracy on each fine-grained segment. These results indicate that our framework contributes to improving the ranking accuracy of speech quality.
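The pairwise framing means the model is trained to order utterances by quality rather than to regress MOS values directly, which is one standard way to realize such a comparison objective. Here is an illustrative sketch using a margin ranking loss; the scorer, features, and margin are placeholders, not the MOSPC implementation (which also adds C-Mixup).

```python
# Illustrative pairwise-comparison training step for MOS ranking.
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))  # utterance -> score
rank_loss = nn.MarginRankingLoss(margin=0.1)

def pairwise_step(feats_a, feats_b, mos_a, mos_b):
    s_a, s_b = scorer(feats_a).squeeze(-1), scorer(feats_b).squeeze(-1)
    target = torch.sign(mos_a - mos_b)          # +1 if A rated higher, -1 otherwise
    return rank_loss(s_a, s_b, target)

# hypothetical utterance-level embeddings and their annotated MOS values
loss = pairwise_step(torch.randn(8, 128), torch.randn(8, 128),
                     torch.rand(8) * 4 + 1, torch.rand(8) * 4 + 1)
loss.backward()
print(loss.item())
```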
52.When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants
Anuj Diwan,Eunsol Choi,David Harwath
Download URL
https://aclanthology.org/2023.acl-short.141/
abstract
We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision. We identify input length thresholds (tipping points) at which efficient Transformer variants become more efficient than vanilla models, using a variety of efficiency metrics (latency, throughput, and memory). To conduct this analysis for speech, we introduce L-HuBERT, a novel local-attention variant of a self-supervised speech model. We observe that these thresholds are (a) much higher than typical dataset sequence lengths and (b) dependent on the metric and modality, showing that choosing the right model depends on modality, task type (long-form vs. typical context) and resource constraints (time vs. memory). By visualising the breakdown of the computational costs for transformer components, we also show that non-self-attention components exhibit significant computational costs. We release our profiling toolkit at https://github.com/ajd12342/profiling-transformers.
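The kind of measurement behind these tipping points is straightforward to reproduce on a small scale: sweep the input length and record latency (and, with more effort, memory and throughput) for each variant. The snippet below is a stripped-down, assumption-laden version using only a vanilla Transformer encoder; model size and lengths are arbitrary, and the authors' toolkit above covers far more.

```python
# Tiny latency sweep over input lengths for a vanilla Transformer encoder.
import time
import torch
import torch.nn as nn

model = nn.TransformerEncoder(nn.TransformerEncoderLayer(256, 4, batch_first=True), 4).eval()

def latency(seq_len, d=256, repeats=5):
    x = torch.randn(1, seq_len, d)
    with torch.no_grad():
        model(x)                                  # warm-up pass
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
    return (time.perf_counter() - start) / repeats

for length in (128, 512, 2048, 4096):
    print(f"seq_len={length:5d}  latency={latency(length) * 1000:.1f} ms")
```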
53.STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions
Michel Plüss,Jan Deriu,Yanick Schraner,Claudio Paonessa,Julia Hartmann,Larissa Schmidt,Christian Scheller,Manuela Hürlimann,Tanja Samardžić,Manfred Vogel,Mark Cieliebak
Download URL
https://aclanthology.org/2023.acl-short.150/
abstract
We present STT4SG-350, a corpus of Swiss German speech, annotated with Standard German text at the sentence level. The data is collected using a web app in which the speakers are shown Standard German sentences, which they translate to Swiss German and record. We make the corpus publicly available. It contains 343 hours of speech from all dialect regions and is the largest public speech corpus for Swiss German to date. Application areas include automatic speech recognition (ASR), text-to-speech, dialect identification, and speaker recognition. Dialect information, age group, and gender of the 316 speakers are provided. Genders are equally represented and the corpus includes speakers of all ages. Roughly the same amount of speech is provided per dialect region, which makes the corpus ideally suited for experiments with speech technology for different dialects. We provide training, validation, and test splits of the data. The test set consists of the same spoken sentences for each dialect region and allows a fair evaluation of the quality of speech technologies in different dialects. We train an ASR model on the training set and achieve an average BLEU score of 74.7 on the test set. The model beats the best published BLEU scores on two other Swiss German ASR test sets, demonstrating the quality of the corpus.
54.A Simple Concatenation can Effectively Improve Speech Translation
Linlin Zhang,Kai Fan,Boxing Chen,Luo Si
Download URL
https://aclanthology.org/2023.acl-short.153/
abstract
A speech translation triple comprises speech, transcription, and translation. In the end-to-end paradigm, text machine translation (MT) usually plays the role of a teacher model for speech translation (ST) via knowledge distillation. Parameter sharing with the teacher is often adopted to construct the ST model architecture; however, the two modalities are independently fed and trained via different losses. This situation does not match ST's properties across the two modalities and also limits the upper bound of the performance. Inspired by works on video Transformers, we propose a simple unified cross-modal ST method, which concatenates speech and text as the input and builds a teacher that can utilize cross-modal information from both modalities simultaneously. Experimental results show that in our unified ST framework, models can effectively utilize the auxiliary information from speech and text, and achieve compelling results on MuST-C datasets.
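The concatenation itself is the simple part: project speech features and embed text tokens into the same model dimension, join them along the length axis, and feed the joint sequence to a single cross-modal encoder. The sketch below shows only that step; the shapes, the lack of any speech downsampling, and the absence of the distillation objective are simplifying assumptions, not the paper's full recipe.

```python
# Toy sketch of length-wise concatenation of speech and text for a cross-modal teacher.
import torch
import torch.nn as nn

d = 256
speech_proj = nn.Linear(80, d)                   # project filterbank frames to model dim
text_embed = nn.Embedding(10000, d)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, 4, batch_first=True), 2)

speech = torch.randn(2, 120, 80)                 # (batch, frames, mel bins)
transcript = torch.randint(0, 10000, (2, 30))    # (batch, tokens)

joint_input = torch.cat([speech_proj(speech), text_embed(transcript)], dim=1)
joint_states = encoder(joint_input)              # (2, 150, 256) cross-modal representation
print(joint_states.shape)
```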