# WeDLM-8B-Instruct ⭐
**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model that performs parallel decoding under standard causal attention, fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B).
**Highlights:**
- 3-6× faster than vLLM-optimized Qwen3-8B-Instruct on math reasoning tasks
- Outperforms Qwen3-8B-Instruct on most benchmarks
- Native KV cache compatible (FlashAttention, PagedAttention, CUDA Graphs)
For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base), which is based on Qwen3-8B-Base.
Paper (Coming Soon) | [Project Page](https://wedlm.github.io) | [GitHub](https://github.com/tencent/WeDLM)
## Model Details
| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base) |
| Parameters | 8B |
| Context Length | 32,768 |
## Quick Start (Recommended)
For **fast inference**, use the `wedlm` engine:
```bash
pip install git+https://github.com/tencent/WeDLM.git
```
```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```
### Multi-turn Conversation
```python
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of x² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
```
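To continue the conversation, append the generated reply as an `assistant` message before adding the next user turn. This is a minimal sketch that reuses `llm`, `tokenizer`, and the `outputs[0]["text"]` access pattern from the Quick Start example above; the follow-up question is only an illustration.
```python
# Continue the dialogue: feed the model's reply back in as an assistant turn.
messages.append({"role": "assistant", "content": outputs[0]["text"]})
messages.append({"role": "user", "content": "And the derivative of x^4?"})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```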
### Batch Inference
```python
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))

for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```
## HuggingFace Transformers
For **training** or simple forward passes:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)
```
> ⚠️ **Note:** The HuggingFace interface is for training/forward pass convenience. For optimized inference throughput, use the `wedlm` engine above.
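As a quick sanity check on the forward pass, the sketch below reads the output logits and decodes the most likely next token. It assumes the remote code returns a standard `CausalLM`-style output with a `logits` field, which this card does not spell out.
```python
import torch

# Assumption: the forward pass returns a standard CausalLM output with
# `logits` of shape [batch, seq_len, vocab_size].
with torch.no_grad():
    outputs = model(**inputs)

next_token_id = outputs.logits[0, -1].argmax(dim=-1).item()
print(tokenizer.decode(next_token_id))
```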
## Performance
### Generation Quality
| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 75.12 | **77.53** |
### Inference Speed
Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):
| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
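To get a rough sense of throughput on your own hardware, a timing loop like the sketch below can help. It reuses the `wedlm` objects from the Quick Start and approximates output-token counts by re-tokenizing the generated text, so treat the numbers as indicative rather than an official benchmark.
```python
import time

# Hypothetical micro-benchmark: 8 copies of a short math prompt.
prompts = ["Solve step by step: what is 17 * 24?"] * 8
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]

start = time.perf_counter()
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=256))
elapsed = time.perf_counter() - start

# Approximate output tokens by re-encoding the generated text.
generated = sum(len(tokenizer.encode(o["text"])) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s ≈ {generated / elapsed:.1f} tokens/s")
```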
## Citation (Coming soon)
## License
Apache 2.0