# WeDLM-8B-Instruct

⭐ **WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model that performs parallel decoding under standard causal attention, fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B).

**Highlights:**
- 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks
- Outperforms Qwen3-8B-Instruct on most benchmarks
- Native KV cache compatible (FlashAttention, PagedAttention, CUDA Graphs)

For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base), which is based on Qwen3-8B-Base.

Paper (Coming Soon) | [Project Page](https://wedlm.github.io) | [GitHub](https://github.com/tencent/WeDLM)

## Model Details

| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base) |
| Parameters | 8B |
| Context Length | 32,768 |

## Quick Start (Recommended)

For **fast inference**, use the `wedlm` engine:

```bash
pip install git+https://github.com/tencent/WeDLM.git
```

```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```

### Multi-turn Conversation

```python
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of x² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
```

### Batch Inference

```python
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))

for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```

## HuggingFace Transformers

For **training** or simple forward passes:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)
```

> ⚠️ **Note:** The HuggingFace interface is for training/forward-pass convenience. For optimized inference throughput, use the `wedlm` engine above.
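To sanity-check the speedups reported in the Performance section below on your own hardware, you can time generation with the `wedlm` engine from the Quick Start. The sketch below is illustrative only: the timing and token-counting logic is ours (not part of the `wedlm` package), and the official numbers come from a dedicated benchmark harness, not this script.

```python
import time
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams  # engine from the Quick Start above

llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

# A GSM8K-style prompt: structured math reasoning is where parallel decoding helps most.
messages = [{"role": "user", "content": "Solve step by step: 12 workers build a wall in 9 days. How many days would 18 workers need?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

start = time.perf_counter()
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
elapsed = time.perf_counter() - start

# Rough tokens/sec estimate: re-tokenize the generated text (approximate, for comparison only).
generated = outputs[0]["text"]
n_tokens = len(tokenizer(generated)["input_ids"])
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/s")
```

Running the same prompt through a vLLM-served Qwen3-8B-Instruct and comparing tok/s gives a rough per-task speedup figure comparable to the table below.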
## Performance

### Generation Quality

| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 75.12 | **77.53** |

### Inference Speed

Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):

| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |

## Citation

(Coming soon)

## License

Apache 2.0
