# Awesome-LLM-Interpretability

A curated list of LLM Interpretability related material.

## ToC

- [Tutorial](#tutorial)
- [History](#history)
- [Code](#code)
    - [Library](#library)
    - [Codebase](#codebase)
- [Survey](#survey)
- [Video](#video)
- [Paper & Blog](#paper--blog)
    - [By Source](#by-source)
    - [By Topic](#by-topic)
        - [Tools/Techniques/Methods](#toolstechniquesmethods)
            - [General](#general)
            - [Embedding Projection](#embedding-projection)
            - [Probing](#probing)
            - [Causal Intervention](#causal-intervention)
            - [Automation](#automation)
            - [Sparse Coding](#sparse-coding)
            - [Visualization](#visualization)
            - [Translation](#translation)
            - [Evaluation/Dataset/Benchmark](#evaluationdatasetbenchmark)
        - [Task Solving/Function/Ability](#task-solvingfunctionability)
            - [General](#general-1)
            - [Reasoning](#reasoning)
            - [Function](#function)
            - [Arithmetic Ability](#arithmetic-ability)
            - [In-context Learning](#in-context-learning)
            - [Factual Knowledge](#factual-knowledge)
            - [Multilingual/Crosslingual](#multilingualcrosslingual)
            - [Multimodal](#multimodal)
        - [Component](#component)
            - [General](#general-2)
            - [Attention](#attention)
            - [MLP/FFN](#mlpffn)
            - [Neuron](#neuron)
        - [Learning Dynamics](#learning-dynamics)
            - [General](#general-3)
            - [Phase Transition/Grokking](#phase-transitiongrokking)
            - [Fine-tuning](#fine-tuning)
        - [Feature Representation/Probing-based](#feature-representationprobing-based)
            - [General](#general-4)
            - [Linearity](#linearity)
        - [Application](#application)
            - [Inference-Time Intervention/Activation Steering](#inference-time-interventionactivation-steering)
            - [Knowledge/Model Editing](#knowledgemodel-editing)
            - [Hallucination](#hallucination)
            - [Pruning/Redundancy Analysis](#pruningredundancy-analysis)

## Tutorial

* **Concrete Steps to Get Started in Transformer Mechanistic Interpretability** [[Neel Nanda's blog]](https://www.neelnanda.io/mechanistic-interpretability/getting-started)
* **Mechanistic Interpretability Quickstart Guide** [[Neel Nanda's blog]](https://www.neelnanda.io/mechanistic-interpretability/getting-started)
* **ARENA Mechanistic Interpretability Tutorials by Callum McDougall** [[website]](https://arena-ch1-transformers.streamlit.app/)
* **200 Concrete Open Problems in Mechanistic Interpretability: Introduction by Neel Nanda** [[AlignmentForum]](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj)
* **Transformer-specific Interpretability** [[EACL 2023 Tutorial]](https://projects.illc.uva.nl/indeep/tutorial/)

## History

* **Mechanistic?** [[BlackBoxNLP workshop at EMNLP 2024]](https://arxiv.org/abs/2410.09087)
    * This paper explores the multiple definitions and uses of "mechanistic interpretability," tracing its evolution in NLP research and revealing a critical divide within the interpretability community.

## Code

### Library

* **TransformerLens** [[github]](https://github.com/neelnanda-io/TransformerLens)
    * A library for mechanistic interpretability of GPT-style language models
* **CircuitsVis** [[github]](https://github.com/alan-cooney/CircuitsVis)
    * Mechanistic Interpretability visualizations
* **baukit** [[github]](https://github.com/davidbau/baukit)
    * Contains some methods for tracing and editing internal activations in a network.
* **transformer-debugger** [[github]](https://github.com/openai/transformer-debugger)
    * Transformer Debugger (TDB) is a tool developed by OpenAI's Superalignment team with the goal of supporting investigations into specific behaviors of small language models. The tool combines automated interpretability techniques with sparse autoencoders.
* **pyvene** [[github]](https://github.com/stanfordnlp/pyvene)
    * Supports customizable interventions on a range of different PyTorch modules
    * Supports complex intervention schemes with an intuitive configuration format, and its interventions can be static or include trainable parameters.
* **ViT-Prisma** [[github]](https://github.com/soniajoseph/ViT-Prisma)
    * An open-source mechanistic interpretability library for vision and multimodal models.
* **pyreft** [[github]](https://github.com/stanfordnlp/pyreft)
    * A Powerful, Parameter-Efficient, and Interpretable way of fine-tuning
* **SAELens** [[github]](https://github.com/jbloomAus/SAELens)
    * Training and analyzing sparse autoencoders on Language Models

### Codebase

* **mamba interpretability** [[github]](https://github.com/Phylliida/mamba_interp)

## Survey

* **A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models** [[arxiv 2503]](https://arxiv.org/abs/2503.05613)
* **Representation Engineering for Large-Language Models: Survey and Research Challenges** [[arxiv 2502]](http://arxiv.org/abs/2502.17601)
* **Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks** [[SaTML 2023]](https://ieeexplore.ieee.org/abstract/document/10136140) [[arxiv 2207]](https://arxiv.org/abs/2207.13243)
* **Neuron-level Interpretation of Deep NLP Models: A Survey** [[TACL 2022]](https://aclanthology.org/2022.tacl-1.74)
* **Explainability for Large Language Models: A Survey** [[TIST 2024]](https://dl.acm.org/doi/10.1145/3639372) [[arxiv 2309]](https://arxiv.org/abs/2309.01029)
* **Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability** [[arxiv 2402]](http://arxiv.org/abs/2402.10688)
* **Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era** [[arxiv 2403]](http://arxiv.org/abs/2403.08946)
* **Mechanistic Interpretability for AI Safety -- A Review** [[arxiv 2404]](http://arxiv.org/abs/2404.14082)
* **A Primer on the Inner Workings of Transformer-based Language Models** [[arxiv 2405]](https://arxiv.org/abs/2405.00208)
* **A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models** [[arxiv 2407]](http://arxiv.org/abs/2407.02646)
* **Internal Consistency and Self-Feedback
in Large Language Models: A Survey** [[arxiv 2407]](https://arxiv.org/abs/2407.14507)
* **The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability** [[arxiv 2408]](https://arxiv.org/abs/2408.01416)
* **Attention Heads of Large Language Models: A Survey** [[arxiv 2409]](https://arxiv.org/abs/2409.03752) [[github]](https://github.com/IAAR-Shanghai/Awesome-Attention-Heads)

*Note: These Alignment surveys discuss the relation between Interpretability and LLM Alignment.*

* **Large Language Model Alignment: A Survey** [[arxiv 2309]](https://arxiv.org/abs/2309.15025)
* **AI Alignment: A Comprehensive Survey** [[arxiv 2310]](https://arxiv.org/abs/2310.19852) [[github]](https://github.com/PKU-Alignment/AlignmentSurvey) [[website]](https://alignmentsurvey.com/)

## Video

* **Neel Nanda's Channel** [[Youtube]](https://www.youtube.com/@neelnanda2469)
* **Chris Olah - Looking Inside Neural Networks with Mechanistic Interpretability** [[Youtube]](https://www.youtube.com/watch?v=2Rdp9GvcYOE)
* **Concrete Open Problems in Mechanistic Interpretability: Neel Nanda at SERI MATS** [[Youtube]](https://www.youtube.com/watch?v=FnNTbqSG8w4)
* **BlackboxNLP's Channel** [[Youtube]](https://www.youtube.com/@blackboxnlp)

## Paper & Blog

### By Source

* **ICML 2024 Workshop on Mechanistic Interpretability** [[openreview]](https://openreview.net/group?id=ICML.cc/2024/Workshop/MI#tab-accept-oral)
* **Transformer Circuits Thread** [[blog]](https://transformer-circuits.pub/)
* **BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP** [[workshop]](https://aclanthology.org/venues/blackboxnlp/)
* **AI Alignment Forum** [[forum]](https://www.alignmentforum.org/)
* **Lesswrong** [[forum]](https://www.lesswrong.com/)
* **Neel Nanda** [[blog]](https://www.neelnanda.io/) [[google scholar]](https://scholar.google.com/citations?user=GLnX3MkAAAAJ)
* **Mor Geva** [[google scholar]](https://scholar.google.com/citations?user=GxpQbSkAAAAJ)
* **David Bau**
[[google scholar]](https://scholar.google.com/citations?hl=en&user=CYI6cKgAAAAJ)
* **Jacob Steinhardt** [[google scholar]](https://scholar.google.com/citations?hl=en&user=LKv32bgAAAAJ)
* **Yonatan Belinkov** [[google scholar]](https://scholar.google.com/citations?user=K-6ujU4AAAAJ)

### By Topic

[[Interactive UI]](https://cooperleong00.github.io/llminterp/)

![https://cooperleong00.github.io/llminterp/](screenshot.png)

#### Tools/Techniques/Methods

##### General

* **A mathematical framework for transformer circuits** [[blog]](https://transformer-circuits.pub/2021/framework/index.html)
* **Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models** [[arxiv]](http://arxiv.org/abs/2401.06102)

##### Embedding Projection

* **interpreting GPT: the logit lens** [[Lesswrong 2020]](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens)
* **Analyzing Transformers in Embedding Space** [[ACL 2023]](https://aclanthology.org/2023.acl-long.893)
* **Eliciting Latent Predictions from Transformers with the Tuned Lens** [[arxiv 2303]](https://arxiv.org/abs/2303.08112)
* **An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l** [[arxiv 2310]](http://arxiv.org/abs/2310.07325)
* **Future Lens: Anticipating Subsequent Tokens from a Single Hidden State** [[CoNLL 2023]](https://aclanthology.org/2023.conll-1.37/)
* **SelfIE: Self-Interpretation of Large Language Model Embeddings** [[arxiv 2403]](https://arxiv.org/abs/2403.10949)
* **InversionView: A General-Purpose Method for Reading Information from Neural Activations** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=P7MW0FahEq)

##### Probing

* **Enhancing Neural Network Transparency through Representation Analysis** [[arxiv 2310]](https://arxiv.org/abs/2310.01405) [[openreview]](https://openreview.net/forum?id=aCgybhcZFi)

##### Causal Intervention

* **Analyzing And Editing Inner Mechanisms of Backdoored Language Models** [[arxiv
2302]](http://arxiv.org/abs/2302.12461)
* **Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations** [[arxiv 2303]](https://arxiv.org/abs/2303.02536)
* **Localizing Model Behavior with Path Patching** [[arxiv 2304]](https://arxiv.org/abs/2304.05969)
* **Interpretability at Scale: Identifying Causal Mechanisms in Alpaca** [[NIPS 2023]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/f6a8b109d4d4fd64c75e94aaf85d9697-Abstract-Conference.html)
* **Towards Best Practices of Activation Patching in Language Models: Metrics and Methods** [[ICLR 2024]](https://openreview.net/forum?id=Hf17y6u9BC)
* **Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching** [[ICLR 2024]](https://openreview.net/forum?id=Ebt7JgMHv1)
* **A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments** [[arxiv 2401]](https://arxiv.org/abs/2401.12631)
* **CausalGym: Benchmarking causal interpretability methods on linguistic tasks** [[arxiv 2402]](http://arxiv.org/abs/2402.12560)
* **How to use and interpret activation patching** [[arxiv 2404]](http://arxiv.org/abs/2404.15255)

##### Automation

* **Towards Automated Circuit Discovery for Mechanistic Interpretability** [[NIPS 2023]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/34e1dbe95d34d7ebaf99b9bcaeb5b2be-Abstract-Conference.html)
* **Neuron to Graph: Interpreting Language Model Neurons at Scale** [[arxiv 2305]](https://arxiv.org/abs/2305.19911) [[openreview]](https://openreview.net/forum?id=JBLHIR8kBZ)
* **Discovering Variable Binding Circuitry with Desiderata** [[arxiv 2307]](http://arxiv.org/abs/2307.03637)
* **Discovering Knowledge-Critical Subnetworks in Pretrained Language Models** [[openreview]](https://openreview.net/forum?id=Mkdwvl3Y8L)
* **Attribution Patching Outperforms Automated Circuit Discovery** [[arxiv 2310]](https://arxiv.org/abs/2310.10348)
* **AtP\*: An efficient and scalable method for localizing LLM
behaviour to components** [[arxiv 2403]](https://arxiv.org/abs/2403.00745)
* **Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms** [[arxiv 2403]](http://arxiv.org/abs/2403.17806)
* **Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models** [[arxiv 2403]](https://arxiv.org/abs/2403.19647)
* **Automatically Identifying Local and Global Circuits with Linear Computation Graphs** [[arxiv 2405]](https://arxiv.org/abs/2405.13868)
* **Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models** [[arxiv 2405]](http://arxiv.org/abs/2405.12522)
* **Hypothesis Testing the Circuit Hypothesis in LLMs** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=ibSNv9cldu)

##### Sparse Coding

* **Towards monosemanticity: Decomposing language models with dictionary learning** [[Transformer Circuits Thread]](https://transformer-circuits.pub/2023/monosemantic-features)
* **Sparse Autoencoders Find Highly Interpretable Features in Language Models** [[ICLR 2024]](https://openreview.net/forum?id=F76bwRSLeK)
* **Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small** [[Alignment Forum]](https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream)
* **Attention SAEs Scale to GPT-2 Small** [[Alignment Forum]](https://www.alignmentforum.org/posts/FSTRedtjuHa4Gfdbr/attention-saes-scale-to-gpt-2-small)
* **We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To** [[Alignment Forum]](https://www.alignmentforum.org/posts/xmegeW5mqiBsvoaim/we-inspected-every-head-in-gpt-2-small-using-saes-so-you-don)
* **Understanding SAE Features with the Logit Lens** [[Alignment Forum]](https://www.alignmentforum.org/posts/qykrYY6rXXM7EEs8Q/understanding-sae-features-with-the-logit-lens)
* **Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet** [[Transformer Circuits
Thread]](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
* **Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models** [[arxiv 2405]](http://arxiv.org/abs/2405.12522)
* **Scaling and evaluating sparse autoencoders** [[arxiv 2406]](https://arxiv.org/abs/2406.04093) [[code]](https://github.com/openai/sparse_autoencoder/)
* **Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=qzsDKwGJyB)
* **Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=JdrVuEQih5)
* **Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=bcV7rhBEcM)
* **Transcoders find interpretable LLM feature circuits** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=GWqzUR2dOX)
* **Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders** [[arxiv 2407]](http://arxiv.org/abs/2407.14435)
* **Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models** [[arxiv 2410]](https://arxiv.org/abs/2410.01280)
* **Mechanistic Permutability: Match Features Across Layers** [[arxiv 2410]](https://arxiv.org/abs/2410.07656)
* **Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models** [[arxiv 2410]](https://arxiv.org/abs/2410.06981)
* **Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs** [[arxiv 2410]](https://arxiv.org/abs/2410.12555)

##### Visualization

* **Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT** [[arxiv 2305]](http://arxiv.org/abs/2305.13417) [[github]](https://github.com/shacharKZ/Visualizing-the-Information-Flow-of-GPT)
* **Sparse AutoEncoder
Visualization** [[github]](https://github.com/callummcdougall/sae_vis)
* **SAE-VIS: Announcement Post** [[lesswrong]](https://www.lesswrong.com/posts/nAhy6ZquNY7AD3RkD/sae-vis-announcement-post-1)
* **LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models** [[arxiv 2404]](http://arxiv.org/abs/2404.07004) [[github]](https://github.com/facebookresearch/llm-transparency-tool)

##### Translation

* **Tracr: Compiled Transformers as a Laboratory for Interpretability** [[arxiv 2301]](http://arxiv.org/abs/2301.05062)
* **Opening the AI black box: program synthesis via mechanistic interpretability** [[arxiv 2402]](http://arxiv.org/abs/2402.05110)
* **An introduction to graphical tensor notation for mechanistic interpretability** [[arxiv 2402]](http://arxiv.org/abs/2402.01790)

##### Evaluation/Dataset/Benchmark

* **Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models** [[arxiv 2312]](http://arxiv.org/abs/2312.10091)
* **RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations** [[arxiv 2402]](https://arxiv.org/abs/2402.17700)
* **Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control** [[arxiv 2405]](http://arxiv.org/abs/2405.08366)
* **InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques** [[arxiv 2407]](http://arxiv.org/abs/2407.14494)

#### Task Solving/Function/Ability

##### General

* **Circuit Component Reuse Across Tasks in Transformer Language Models** [[ICLR 2024 spotlight]](https://openreview.net/forum?id=fpoAYV6Wsk)
* **Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures** [[arxiv 2410]](https://arxiv.org/abs/2410.06672)
* **From Tokens to Words: On the Inner Lexicon of LLMs** [[arxiv 2410]](https://arxiv.org/abs/2410.05864)

##### Reasoning

* **Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models** [[EMNLP
2023]](https://aclanthology.org/2023.emnlp-main.299)
* **How Large Language Models Implement Chain-of-Thought?** [[openreview]](https://openreview.net/forum?id=b2XfOm3RJa)
* **Do Large Language Models Latently Perform Multi-Hop Reasoning?** [[arxiv 2402]](http://arxiv.org/abs/2402.16837)
* **How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning** [[arxiv 2402]](https://arxiv.org/abs/2402.18312)
* **Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning** [[arxiv 2402]](https://arxiv.org/abs/2402.18344)
* **Iteration Head: A Mechanistic Study of Chain-of-Thought** [[arxiv 2406]](https://arxiv.org/abs/2406.02128)
* **From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency** [[arxiv 2410]](https://arxiv.org/abs/2410.05459)

##### Function

* **Interpretability in the wild: a circuit for indirect object identification in GPT-2 small** [[ICLR 2023]](https://openreview.net/forum?id=NpsVSN6o4ul)
* **Entity Tracking in Language Models** [[ACL 2023]](https://aclanthology.org/2023.acl-long.213)
* **How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model** [[NIPS 2023]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/efbba7719cc5172d175240f24be11280-Abstract-Conference.html)
* **Can Transformers Learn to Solve Problems Recursively?** [[arxiv 2305]](http://arxiv.org/abs/2305.14699)
* **Analyzing And Editing Inner Mechanisms of Backdoored Language Models** [[NeurIPS 2023 Workshop]](https://openreview.net/forum?id=e9F4fB23o0)
* **Does Circuit Analysis Interpretability Scale?
Evidence from Multiple Choice Capabilities in Chinchilla** [[arxiv 2307]](http://arxiv.org/abs/2307.09458)
* **Refusal mechanisms: initial experiments with Llama-2-7b-chat** [[AlignmentForum 2312]](https://www.alignmentforum.org/posts/pYcEhoAoPfHhgJ8YC/refusal-mechanisms-initial-experiments-with-llama-2-7b-chat)
* **Forbidden Facts: An Investigation of Competing Objectives in Llama-2** [[arxiv 2312]](http://arxiv.org/abs/2312.08793)
* **How do Language Models Bind Entities in Context?** [[ICLR 2024]](https://openreview.net/forum?id=zb3b6oKO77)
* **How Language Models Learn Context-Free Grammars?** [[openreview]](https://openreview.net/forum?id=qnbLGV9oFL)
* **A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity** [[arxiv 2401]](http://arxiv.org/abs/2401.01967)
* **Do Llamas Work in English? On the Latent Language of Multilingual Transformers** [[arxiv 2402]](http://arxiv.org/abs/2402.10588)
* **Evidence of Learned Look-Ahead in a Chess-Playing Neural Network** [[arxiv 2406]](https://arxiv.org/abs/2406.00877)
* **How much do contextualized representations encode long-range context?** [[arxiv 2410]](https://arxiv.org/abs/2410.12292)

##### Arithmetic Ability

* **Progress measures for grokking via mechanistic interpretability** [[ICLR 2023]](https://openreview.net/forum?id=9XFSbDPmdW)
* **The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks** [[NIPS 2023]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/56cbfbf49937a0873d451343ddc8c57d-Abstract-Conference.html)
* **Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition** [[openreview]](https://openreview.net/forum?id=VpCqrMMGVm)
* **Arithmetic with Language Models: from Memorization to Computation** [[openreview]](https://openreview.net/forum?id=YxzEPTH4Ny)
* **Carrying over Algorithm in Transformers** [[openreview]](https://openreview.net/forum?id=t3gOYtv1xV)
* **A simple and interpretable model of grokking modular
arithmetic tasks** [[openreview]](https://openreview.net/forum?id=0ZUKLCxwBo)
* **Understanding Addition in Transformers** [[ICLR 2024]](https://openreview.net/forum?id=rIx1YXVWZb)
* **Increasing Trust in Language Models through the Reuse of Verified Circuits** [[arxiv 2402]](http://arxiv.org/abs/2402.02619)
* **Pre-trained Large Language Models Use Fourier Features to Compute Addition** [[arxiv 2406]](https://arxiv.org/abs/2406.03445)

##### In-context Learning

* **In-context learning and induction heads** [[Transformer Circuits Thread]](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)
* **In-Context Learning Creates Task Vectors** [[EMNLP 2023 Findings]](https://aclanthology.org/2023.findings-emnlp.624)
* **Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning** [[EMNLP 2023]](https://aclanthology.org/2023.emnlp-main.609)
    * EMNLP 2023 best paper
* **LLMs Represent Contextual Tasks as Compact Function Vectors** [[ICLR 2024]](https://openreview.net/forum?id=AwyxtyMwaG)
* **Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions** [[ICLR 2024]](https://openreview.net/forum?id=ekeyCgeRfC)
* **Where Does In-context Machine Translation Happen in Large Language Models?** [[openreview]](https://openreview.net/forum?id=3i7iNGxw6r)
* **In-Context Learning in Large Language Models: A Neuroscience-inspired Analysis of Representations** [[openreview]](https://openreview.net/forum?id=UEdS2lIgfY)
* **Analyzing Task-Encoding Tokens in Large Language Models** [[arxiv 2401]](http://arxiv.org/abs/2401.11323)
* **How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning** [[arxiv 2402]](http://arxiv.org/abs/2402.02872)
* **Parallel Structures in Pre-training Data Yield In-Context Learning** [[arxiv 2402]](http://arxiv.org/abs/2402.12530)
* **What needs to go right for an induction head?
A mechanistic study of in-context learning circuits and their formation** [[arxiv 2404]](http://arxiv.org/abs/2404.07129)
* **Task Diversity Shortens the ICL Plateau** [[arxiv 2410]](https://arxiv.org/abs/2410.05448)
* **Inference and Verbalization Functions During In-Context Learning** [[arxiv 2410]](https://arxiv.org/abs/2410.09349)

##### Factual Knowledge

* **Dissecting Recall of Factual Associations in Auto-Regressive Language Models** [[EMNLP 2023]](https://aclanthology.org/2023.emnlp-main.751)
* **Characterizing Mechanisms for Factual Recall in Language Models** [[EMNLP 2023]](https://aclanthology.org/2023.emnlp-main.615/)
* **Summing Up the Facts: Additive Mechanisms behind Factual Recall in LLMs** [[openreview]](https://openreview.net/forum?id=P2gnDEHGu3)
* **A Mechanism for Solving Relational Tasks in Transformer Language Models** [[openreview]](https://openreview.net/forum?id=ZmzLrl8nTa)
* **Overthinking the Truth: Understanding how Language Models Process False Demonstrations** [[ICLR 2024 spotlight]](https://openreview.net/forum?id=Tigr1kMDZy)
* **Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level** [[AlignmentForum 2312]](https://www.alignmentforum.org/s/hpWHhjvjn67LJ4xXX/p/iGuwZTHWb6DFY3sKB)
* **Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models** [[arxiv 2402]](https://arxiv.org/abs/2402.18154)
* **Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals** [[arxiv 2402]](http://arxiv.org/abs/2402.11655)
* **A Glitch in the Matrix?
Locating and Detecting Language Model Grounding with Fakepedia** [[arxiv 2312]](http://arxiv.org/abs/2312.02073)
* **Mechanisms of non-factual hallucinations in language models** [[arxiv 2403]](https://arxiv.org/abs/2403.18167)
* **Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models** [[arxiv 2403]](https://arxiv.org/abs/2403.19521)
* **Locating and Editing Factual Associations in Mamba** [[arxiv 2404]](https://arxiv.org/abs/2404.03646)
* **Probing Language Models on Their Knowledge Source** [[arxiv 2410]](https://arxiv.org/abs/2410.05817)

##### Multilingual/Crosslingual

* **Do Llamas Work in English? On the Latent Language of Multilingual Transformers** [[arxiv 2402]](http://arxiv.org/abs/2402.10588)
* **Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models** [[arxiv 2402]](http://arxiv.org/abs/2402.16438)
* **How do Large Language Models Handle Multilingualism?** [[arxiv 2402]](https://arxiv.org/abs/2402.18815)
* **Large Language Models are Parallel Multilingual Learners** [[arxiv 2403]](https://arxiv.org/abs/2403.09073)
* **Understanding the role of FFNs in driving multilingual behaviour in LLMs** [[arxiv 2404]](http://arxiv.org/abs/2404.13855)
* **How do Llamas process multilingual text?
A latent exploration through activation patching** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=0ku2hIm4BS)
* **Concept Space Alignment in Multilingual LLMs** [[EMNLP 2024]](https://arxiv.org/abs/2410.01079)
* **On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task** [[EMNLP 2024 Findings]](https://arxiv.org/abs/2410.06496)

##### Multimodal

* **Interpreting CLIP's Image Representation via Text-Based Decomposition** [[ICLR 2024 oral]](https://openreview.net/forum?id=5Ca9sSzuDp)
* **Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)** [[NIPS 2024]](https://arxiv.org/abs/2402.10376)
* **Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines** [[arxiv 2403]](https://arxiv.org/abs/2403.05846)
* **The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?** [[arxiv 2403]](https://arxiv.org/abs/2403.09037)
* **Understanding Information Storage and Transfer in Multi-modal Large Language Models** [[arxiv 2406]](https://arxiv.org/abs/2406.04236)
* **Towards Interpreting Visual Information Processing in Vision-Language Models** [[arxiv 2410]](https://arxiv.org/abs/2410.07149)
* **Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models** [[arxiv 2410]](https://arxiv.org/abs/2410.12662)
* **Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models** [[arxiv 2410]](https://arxiv.org/abs/2410.12011)

#### Component

##### General

* **The Hydra Effect: Emergent Self-repair in Language Model Computations** [[arxiv 2307]](https://arxiv.org/abs/2307.15771)
* **Unveiling A Core Linguistic Region in Large Language Models** [[arxiv 2310]](http://arxiv.org/abs/2310.14928)
* **Exploring the Residual Stream of Transformers** [[arxiv 2312]](http://arxiv.org/abs/2312.12141)
* **Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation** [[arxiv 2312]](https://arxiv.org/abs/2312.01648)
* **Explorations of Self-Repair in Language Models** [[arxiv 2402]](http://arxiv.org/abs/2402.15390)
* **Massive Activations in Large Language Models** [[arxiv 2402]](https://arxiv.org/abs/2402.17762)
* **Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions** [[arxiv 2402]](https://arxiv.org/abs/2402.15055)
* **Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics** [[arxiv 2403]](https://arxiv.org/abs/2403.01509)
* **The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models** [[arxiv 2403]](http://arxiv.org/abs/2403.03942)
* **Localizing Paragraph Memorization in Language Models** [[arxiv 2403]](http://arxiv.org/abs/2403.19851)

##### Attention

* **Awesome-Attention-Heads** [[github]](https://github.com/IAAR-Shanghai/Awesome-Attention-Heads)
    * A carefully compiled list that summarizes the diverse functions of the attention heads.
* **In-context learning and induction heads** [[Transformer Circuits Thread]](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)
* **On the Expressivity Role of LayerNorm in Transformers' Attention** [[ACL 2023 Findings]](https://aclanthology.org/2023.findings-acl.895.pdf)
* **On the Role of Attention in Prompt-tuning** [[ICML 2023]](https://openreview.net/forum?id=qorOnDor89)
* **Copy Suppression: Comprehensively Understanding an Attention Head** [[ICLR 2024]](https://openreview.net/forum?id=g8oaZRhDcf)
* **Successor Heads: Recurring, Interpretable Attention Heads In The Wild** [[ICLR 2024]](https://openreview.net/forum?id=kvcbV8KQsi)
* **A phase transition between positional and semantic learning in a solvable model of dot-product attention** [[arxiv 2402]](http://arxiv.org/abs/2402.03902)
* **Retrieval Head Mechanistically Explains Long-Context Factuality** [[arxiv 2404]](http://arxiv.org/abs/2404.15574)
* **Iteration Head: A Mechanistic Study of Chain-of-Thought** [[arxiv
2406]](https://arxiv.org/abs/2406.02128)
* **When Attention Sink Emerges in Language Models: An Empirical View** [[arxiv 2410]](https://arxiv.org/abs/2410.10781)

##### MLP/FFN

* **Transformer Feed-Forward Layers Are Key-Value Memories** [[EMNLP 2021]](https://aclanthology.org/2021.emnlp-main.446)
* **Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space** [[EMNLP 2022]](https://aclanthology.org/2022.emnlp-main.3)
* **What does GPT store in its MLP weights? A case study of long-range dependencies** [[openreview]](https://openreview.net/forum?id=nUGFpDCu3W)
* **Understanding the role of FFNs in driving multilingual behaviour in LLMs** [[arxiv 2404]](http://arxiv.org/abs/2404.13855)

##### Neuron

* **Toy Models of Superposition** [[Transformer Circuits Thread]](https://transformer-circuits.pub/2022/toy_model/index.html)
* **Knowledge Neurons in Pretrained Transformers** [[ACL 2022]](https://aclanthology.org/2022.acl-long.581)
* **Polysemanticity and Capacity in Neural Networks** [[arxiv 2210]](http://arxiv.org/abs/2210.01892)
* **Finding Neurons in a Haystack: Case Studies with Sparse Probing** [[TMLR 2023]](https://openreview.net/forum?id=JYs1R9IMJr)
* **DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models** [[EMNLP 2023]](https://aclanthology.org/2023.emnlp-main.174)
* **Neurons in Large Language Models: Dead, N-gram, Positional** [[arxiv 2309]](http://arxiv.org/abs/2309.04827)
* **Universal Neurons in GPT2 Language Models** [[arxiv 2401]](http://arxiv.org/abs/2401.12181)
* **Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models** [[arxiv 2402]](http://arxiv.org/abs/2402.16438)
* **How do Large Language Models Handle Multilingualism?** [[arxiv 2402]](https://arxiv.org/abs/2402.18815)
* **PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits** [[arxiv 2404]](http://arxiv.org/abs/2404.06453)

#### Learning Dynamics

##### General

* **JoMA:
Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention** [[ICLR 2024]](https://openreview.net/forum?id=LbJqRGNYCf) * **Learning Associative Memories with Gradient Descent** [[arxiv 2402]](https://arxiv.org/abs/2402.18724) * **Mechanics of Next Token Prediction with Self-Attention** [[arxiv 2402]](http://arxiv.org/abs/2403.08081) * **The Garden of Forking Paths: Observing Dynamic Parameters Distribution in Large Language Models** [[arxiv 2403]](http://arxiv.org/abs/2403.08739) * **LLM Circuit Analyses Are Consistent Across Training and Scale** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=1WeLXvaNJP) * **Geometric Signatures of Compositionality Across a Language Model's Lifetime** [[arxiv 2410]](https://arxiv.org/abs/2410.01444) ##### Phase Transition/Grokking * **Progress measures for grokking via mechanistic interpretability** [[ICLR 2023]](https://openreview.net/forum?id=9XFSbDPmdW) * **A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations** [[ICML 2023]](https://openreview.net/forum?id=jCOrkuUpss) * **The Mechanistic Basis of Data Dependence and Abrupt Learning in an In-Context Classification Task** [[ICLR 2024 oral]](https://openreview.net/forum?id=aN4Jf6Cx69) * Highest scores at ICLR 2024: 10, 10, 8, 8. And by one author only! 
* **Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs** [[ICLR 2024 spotlight]](https://openreview.net/forum?id=MO5PiKHELW)
* **A simple and interpretable model of grokking modular arithmetic tasks** [[openreview]](https://openreview.net/forum?id=0ZUKLCxwBo)
* **Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition** [[arxiv 2402]](http://arxiv.org/abs/2402.15175)
* **Interpreting Grokked Transformers in Complex Modular Arithmetic** [[arxiv 2402]](https://arxiv.org/abs/2402.16726)
* **Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models** [[arxiv 2402]](https://arxiv.org/abs/2402.19465)
* **Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks** [[arxiv 2406]](https://arxiv.org/abs/2406.02550)
* **Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=ns8IH5Sn5y)

##### Fine-tuning

* **Studying Large Language Model Generalization with Influence Functions** [[arxiv 2308]](http://arxiv.org/abs/2308.03296)
* **Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks** [[ICLR 2024]](https://openreview.net/forum?id=A0HKeKl4Nl)
* **Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking** [[ICLR 2024]](https://openreview.net/forum?id=8sKcAWOf2D)
* **The Hidden Space of Transformer Language Adapters** [[arxiv 2402]](http://arxiv.org/abs/2402.13137)
* **Dissecting Fine-Tuning Unlearning in Large Language Models** [[EMNLP 2024]](https://arxiv.org/abs/2410.06606)

#### Feature Representation/Probing-based

##### General

* **Implicit Representations of Meaning in Neural Language Models** [[ACL 2021]](https://aclanthology.org/2021.acl-long.143)
* **All Roads Lead to Rome? Exploring the Invariance of Transformers' Representations** [[arxiv 2305]](http://arxiv.org/abs/2305.14555)
* **Observable Propagation: Uncovering Feature Vectors in Transformers** [[openreview]](https://openreview.net/forum?id=sNWQUTkDmA)
* **In-Context Learning in Large Language Models: A Neuroscience-inspired Analysis of Representations** [[openreview]](https://openreview.net/forum?id=UEdS2lIgfY)
* **Challenges with unsupervised LLM knowledge discovery** [[arxiv 2312]](https://arxiv.org/abs/2312.10029)
* **Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks** [[arxiv 2307]](http://arxiv.org/abs/2307.00175)
* **Position Paper: Toward New Frameworks for Studying Model Representations** [[arxiv 2402]](http://arxiv.org/abs/2402.03855)
* **How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study** [[arxiv 2402]](http://arxiv.org/abs/2402.16061)
* **More than Correlation: Do Large Language Models Learn Causal Representations of Space** [[arxiv 2312]](https://arxiv.org/abs/2312.16257)
* **Do Large Language Models Mirror Cognitive Language Processing?** [[arxiv 2402]](https://arxiv.org/abs/2402.18023)
* **On the Scaling Laws of Geographical Representation in Language Models** [[arxiv 2402]](https://arxiv.org/abs/2402.19406)
* **Monotonic Representation of Numeric Properties in Language Models** [[arxiv 2403]](http://arxiv.org/abs/2403.10381)
* **Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?** [[arxiv 2404]](http://arxiv.org/abs/2404.07066)
* **Simple probes can catch sleeper agents** [[Anthropic Blog]](https://www.anthropic.com/research/probes-catch-sleeper-agents)
* **PaCE: Parsimonious Concept Engineering for Large Language Models** [[arxiv 2406]](https://arxiv.org/abs/2406.04331)
* **The Geometry of Categorical and Hierarchical Concepts in Large Language Models** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=KXuYjuBzKo)
* **Concept Space Alignment in Multilingual LLMs** [[EMNLP 2024]](https://arxiv.org/abs/2410.01079)
* **Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models** [[arxiv 2410]](https://arxiv.org/abs/2410.06981)

##### Linearity

* **Actually, Othello-GPT Has A Linear Emergent World Representation** [[Neel Nanda's blog]](https://www.neelnanda.io/mechanistic-interpretability/othello)
* **Language Models Linearly Represent Sentiment** [[openreview]](https://openreview.net/forum?id=iGDWZFc7Ya)
* **Language Models Represent Space and Time** [[openreview]](https://openreview.net/forum?id=jE8xbmvFin)
* **The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets** [[openreview]](https://openreview.net/forum?id=CeJEfNKstt)
* **Linearity of Relation Decoding in Transformer Language Models** [[ICLR 2024]](https://openreview.net/forum?id=w7LU2s14kE)
* **The Linear Representation Hypothesis and the Geometry of Large Language Models** [[arxiv 2311]](https://arxiv.org/abs/2311.03658)
* **Language Models Represent Beliefs of Self and Others** [[arxiv 2402]](https://arxiv.org/abs/2402.18496)
* **On the Origins of Linear Representations in Large Language Models** [[arxiv 2403]](http://arxiv.org/abs/2403.03867)
* **Refusal in LLMs is mediated by a single direction** [[LessWrong 2024]](https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction)

#### Application

##### Training

* **Aligning Large Language Models with Human Preferences through Representation Engineering** [[arxiv 2312]](http://arxiv.org/abs/2312.15997)
* **ReFT: Representation Finetuning for Language Models** [[arxiv 2404]](https://arxiv.org/abs/2404.03592) [[github]](https://github.com/stanfordnlp/pyreft)
* **Direct Preference Optimization Using Sparse Feature-Level Constraints** [[arxiv 2411]](https://arxiv.org/abs/2411.07618)
* **LLM Pretraining with Continuous Concepts** [[arxiv 2502]](https://arxiv.org/abs/2502.08524)

##### Inference-Time Intervention/Activation Steering

* **Inference-Time Intervention: Eliciting Truthful Answers from a Language Model** [[NeurIPS 2023]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/81b8390039b7302c909cb769f8b6cd93-Abstract-Conference.html) [[github]](https://github.com/likenneth/honest_llama)
* **Activation Addition: Steering Language Models Without Optimization** [[arxiv 2308]](http://arxiv.org/abs/2308.10248)
* **Self-Detoxifying Language Models via Toxification Reversal** [[EMNLP 2023]](https://aclanthology.org/2023.emnlp-main.269)
* **DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models** [[arxiv 2309]](https://arxiv.org/abs/2309.03883)
* **In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering** [[arxiv 2311]](http://arxiv.org/abs/2311.06668)
* **Steering Llama 2 via Contrastive Activation Addition** [[arxiv 2312]](http://arxiv.org/abs/2312.06681)
* **A Language Model's Guide Through Latent Space** [[arxiv 2402]](http://arxiv.org/abs/2402.14433)
* **Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment** [[arxiv 2311]](https://arxiv.org/abs/2311.09433)
* **Extending Activation Steering to Broad Skills and Multiple Behaviours** [[arxiv 2403]](https://arxiv.org/abs/2403.05767)
* **Spectral Editing of Activations for Large Language Model Alignment** [[arxiv 2405]](http://arxiv.org/abs/2405.09719)
* **Controlling Large Language Model Agents with Entropic Activation Steering** [[arxiv 2406]](https://arxiv.org/abs/2406.00244)
* **Analyzing the Generalization and Reliability of Steering Vectors** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=akCsMk4dDL)
* **Towards Inference-time Category-wise Safety Steering for Large Language Models** [[arxiv 2410]](https://arxiv.org/abs/2410.01174)
* **A Timeline and Analysis for Representation Plasticity in Large Language Models** [[arxiv 2410]](https://arxiv.org/abs/2410.06225)
* **Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors** [[arxiv 2410]](https://arxiv.org/abs/2410.12299)

##### Knowledge/Model Editing

* **Locating and Editing Factual Associations in GPT** (*ROME*) [[NeurIPS 2022]](https://proceedings.neurips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html) [[github]](https://github.com/kmeng01/rome)
* **Memory-Based Model Editing at Scale** [[ICML 2022]](https://proceedings.mlr.press/v162/mitchell22a.html)
* **Editing models with task arithmetic** [[ICLR 2023]](https://openreview.net/forum?id=6t0Kwf8-jrj)
* **Mass-Editing Memory in a Transformer** [[ICLR 2023]](https://openreview.net/forum?id=MkbcAHIYgyS)
* **Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark** [[ACL 2023 Findings]](https://aclanthology.org/2023.findings-acl.733)
* **Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge** [[ACL 2023]](https://aclanthology.org/2023.acl-long.300)
* **Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models** [[NeurIPS 2023]](https://proceedings.neurips.cc/paper_files/paper/2023/hash/3927bbdcf0e8d1fa8aa23c26f358a281-Abstract-Conference.html)
* **Inspecting and Editing Knowledge Representations in Language Models** [[arxiv 2304]](http://arxiv.org/abs/2304.00740) [[github]](https://github.com/evandez/REMEDI)
* **Methods for Measuring, Updating, and Visualizing Factual Beliefs in Language Models** [[EACL 2023]](https://aclanthology.org/2023.eacl-main.199)
* **Editing Common Sense in Transformers** [[EMNLP 2023]](https://aclanthology.org/2023.emnlp-main.511)
* **DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models** [[EMNLP 2023]](https://aclanthology.org/2023.emnlp-main.174)
* **MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions** [[EMNLP 2023]](https://aclanthology.org/2023.emnlp-main.971)
* **PMET: Precise Model Editing in a Transformer** [[arxiv 2308]](http://arxiv.org/abs/2308.08742)
* **Untying the Reversal Curse via Bidirectional Language Model Editing** [[arxiv 2310]](http://arxiv.org/abs/2310.10322)
* **Unveiling the Pitfalls of Knowledge Editing for Large Language Models** [[ICLR 2024]](https://openreview.net/forum?id=fNktD3ib16)
* **A Comprehensive Study of Knowledge Editing for Large Language Models** [[arxiv 2401]](http://arxiv.org/abs/2401.01286)
* **Trace and Edit Relation Associations in GPT** [[arxiv 2401]](http://arxiv.org/abs/2401.02976)
* **Model Editing with Canonical Examples** [[arxiv 2402]](https://arxiv.org/abs/2402.06155)
* **Updating Language Models with Unstructured Facts: Towards Practical Knowledge Editing** [[arxiv 2402]](http://arxiv.org/abs/2402.18909)
* **Editing Conceptual Knowledge for Large Language Models** [[arxiv 2403]](https://arxiv.org/abs/2403.06259)
* **Editing the Mind of Giants: An In-Depth Exploration of Pitfalls of Knowledge Editing in Large Language Models** [[arxiv 2406]](https://arxiv.org/abs/2406.01436)
* **Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing** [[arxiv 2410]](https://arxiv.org/abs/2410.06331)
* **Keys to Robust Edits: from Theoretical Insights to Practical Advances** [[arxiv 2410]](https://arxiv.org/abs/2410.09338)

##### Hallucination

* **The Internal State of an LLM Knows When It's Lying** [[EMNLP 2023 Findings]](https://arxiv.org/abs/2304.13734)
* **Do Androids Know They're Only Dreaming of Electric Sheep?** [[arxiv 2312]](https://arxiv.org/abs/2312.17249)
* **INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection** [[ICLR 2024]](https://openreview.net/forum?id=Zj12nzlQbz)
* **TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space** [[arxiv 2402]](https://arxiv.org/abs/2402.17811)
* **Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension** [[arxiv 2402]](https://arxiv.org/abs/2402.18048)
* **Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models** [[arxiv 2402]](https://arxiv.org/abs/2402.19103)
* **In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation** [[arxiv 2403]](http://arxiv.org/abs/2403.01548)
* **Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models** [[arxiv 2403]](https://arxiv.org/abs/2403.06448)
* **Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories** [[arxiv 2406]](https://arxiv.org/abs/2406.00034)

##### Pruning/Redundancy Analysis

* **Not all Layers of LLMs are Necessary during Inference** [[arxiv 2403]](http://arxiv.org/abs/2403.02181)
* **ShortGPT: Layers in Large Language Models are More Redundant Than You Expect** [[arxiv 2403]](http://arxiv.org/abs/2403.03853)
* **The Unreasonable Ineffectiveness of the Deeper Layers** [[arxiv 2403]](http://arxiv.org/abs/2403.17887)
* **The Remarkable Robustness of LLMs: Stages of Inference?** [[ICML 2024 MI Workshop]](https://openreview.net/forum?id=R5unwb9KPc)
