# POINTS-GUI


## News

- Upcoming: The End-to-End GUI Agent Model is currently under active development and will be released in a subsequent update. Stay tuned!
- 2026.02.06: We are pleased to present POINTS-GUI-G, our specialized GUI Grounding Model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in ./evaluation.

## Introduction

1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.
2. **Full-Stack Mastery**: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up using POINTS-1.5 (which initially lacked native grounding ability). We have mastered the complete technical pipeline, proving that a GUI specialist can be built from a general-purpose base model through targeted optimization.
3. **Refined Data Engineering**: Existing GUI datasets differ in coordinate systems and task formats, and contain substantial noise. We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single "locate UI element" formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases.

## Results

We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines.

![Results](images/results.png)

## Examples

### Prediction on desktop screenshots

![Desktop example 1](images/example_desktop_1.png)
![Desktop example 2](images/example_desktop_2.png)
![Desktop example 3](images/example_desktop_3.png)

### Prediction on mobile screenshots

![Mobile example](images/example_mobile.png)

### Prediction on web screenshots

![Web example 1](images/example_web_1.png)
![Web example 2](images/example_web_2.png)
![Web example 3](images/example_web_3.png)

## Getting Started

The following code snippets have been tested with this environment:

```
python==3.12.11
torch==2.9.1
transformers==4.57.1
cuda==12.6
```

### Run with Transformers

Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following commands:

```sh
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch

system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the point (x, y) representing the center of the target element\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x, y) without any additional text\n'
)
system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
)

system_prompt = system_prompt_point  # or system_prompt_bbox
user_prompt = None  # replace with your instruction (e.g., 'close the window')
image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-GUI-G'

# Load the model, tokenizer, and image processor
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             dtype=torch.bfloat16,
                                             device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)

content = [
    dict(type='image', image=image_path),
    dict(type='text', text=user_prompt)
]
messages = [
    {
        'role': 'system',
        'content': [dict(type='text', text=system_prompt)]
    },
    {
        'role': 'user',
        'content': content
    }
]
generation_config = {
    'max_new_tokens': 2048,
    'do_sample': False
}
response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
```
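With `system_prompt_point`, the model replies with a normalized point such as `(0.503, 0.241)`. Below is a minimal sketch of turning that reply back into pixel coordinates (for example, to drive a click); the parsing helper and the screenshot size are illustrative assumptions and not part of the released code:

```python
import re


def parse_normalized_point(response: str) -> tuple[float, float]:
    """Parse a '(x, y)' response with coordinates in [0, 1].

    Hypothetical helper for illustration; the model card specifies only
    the output format, not a parsing utility.
    """
    match = re.search(r'\(\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\)', response)
    if match is None:
        raise ValueError(f'Unexpected response format: {response!r}')
    return float(match.group(1)), float(match.group(2))


# Example: map the normalized point back to pixels on a 1920x1080 screenshot.
x_norm, y_norm = parse_normalized_point('(0.503, 0.241)')
width, height = 1920, 1080  # assumed screenshot size
x_px, y_px = round(x_norm * width), round(y_norm * height)
print(x_px, y_px)  # 966 260
```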
### Deploy with SGLang

We have created a [Pull Request](https://github.com/sgl-project/sglang/pull/17989) for SGLang. Until this PR is merged, you can check out that branch and install SGLang in editable mode by following the [official guide](https://docs.sglang.ai/get_started/install.html).

#### How to Deploy

You can deploy POINTS-GUI-G with SGLang using the following command:

```
python3 -m sglang.launch_server \
    --model-path tencent/POINTS-GUI-G \
    --tp-size 1 \
    --dp-size 1 \
    --chunked-prefill-size -1 \
    --mem-fraction-static 0.7 \
    --chat-template qwen2-vl \
    --trust-remote-code \
    --port 8081
```

#### How to Use

You can use the following code to obtain results from SGLang:

```python
import json
from typing import List

import requests


def call_wepoints(messages: List[dict],
                  temperature: float = 0.0,
                  max_new_tokens: int = 2048,
                  repetition_penalty: float = 1.05,
                  top_p: float = 0.8,
                  top_k: int = 20,
                  do_sample: bool = True,
                  url: str = 'http://127.0.0.1:8081/v1/chat/completions') -> str:
    """Query the WePOINTS model to generate a response.

    Args:
        messages (List[dict]): A list of messages to send, in the standard
            OpenAI chat format, e.g.:
            [
                {
                    'role': 'user',
                    'content': [
                        {
                            'type': 'text',
                            'text': 'Please describe this image in short'
                        },
                        {
                            'type': 'image_url',
                            'image_url': {'url': '/path/to/image.jpg'}
                        }
                    ]
                }
            ]
        temperature (float, optional): The sampling temperature. Defaults to 0.0.
        max_new_tokens (int, optional): The maximum number of new tokens to
            generate. Defaults to 2048.
        repetition_penalty (float, optional): The penalty for repetition.
            Defaults to 1.05.
        top_p (float, optional): The top-p probability threshold. Defaults to 0.8.
        top_k (int, optional): The top-k sampling vocabulary size. Defaults to 20.
        do_sample (bool, optional): Whether to use sampling or greedy decoding.
            Defaults to True.
        url (str, optional): The URL of the WePOINTS endpoint. Defaults to
            'http://127.0.0.1:8081/v1/chat/completions'.

    Returns:
        str: The generated response from WePOINTS.
    """
    data = {
        'model': 'WePoints',
        'messages': messages,
        'max_new_tokens': max_new_tokens,
        'temperature': temperature,
        'repetition_penalty': repetition_penalty,
        'top_p': top_p,
        'top_k': top_k,
        'do_sample': do_sample,
    }
    response = requests.post(url, json=data)
    response = json.loads(response.text)
    response = response['choices'][0]['message']['content']
    return response


system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the point (x, y) representing the center of the target element\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x, y) without any additional text\n'
)
system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
)

system_prompt = system_prompt_point  # or system_prompt_bbox
user_prompt = None  # replace with your instruction (e.g., 'close the window')

messages = [
    {
        'role': 'system',
        'content': [
            {
                'type': 'text',
                'text': system_prompt
            }
        ]
    },
    {
        'role': 'user',
        'content': [
            {
                'type': 'image_url',
                'image_url': {'url': '/path/to/image.jpg'}
            },
            {
                'type': 'text',
                'text': user_prompt
            }
        ]
    }
]
response = call_wepoints(messages)
print(response)
```
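The user message above passes a local file path in `image_url`, which assumes the SGLang server can read that path. If the server runs on a different machine, a common workaround for OpenAI-compatible vision endpoints is to embed the image as a base64 data URI. The helper below is a hypothetical sketch of that approach, not part of the released client:

```python
import base64
import mimetypes


def to_data_uri(image_path: str) -> str:
    """Encode a local image as a base64 data URI (hypothetical helper).

    OpenAI-compatible vision endpoints generally accept
    'data:image/...;base64,...' URLs in the image_url field, so the
    client does not need shared filesystem access with the server.
    """
    mime, _ = mimetypes.guess_type(image_path)
    mime = mime or 'image/jpeg'
    with open(image_path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('utf-8')
    return f'data:{mime};base64,{encoded}'


# Usage with the client above: replace the plain path in the user message with
# {'type': 'image_url', 'image_url': {'url': to_data_uri('/path/to/image.jpg')}}
```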
""" data = \{ 'model': 'WePoints', 'messages': messages, 'max_new_tokens': max_new_tokens, 'temperature': temperature, 'repetition_penalty': repetition_penalty, 'top_p': top_p, 'top_k': top_k, 'do_sample': do_sample, \} response = requests.post(url, json=data) response = json.loads(response.text) response = response['choices'][0]['message']['content'] return response system_prompt_point = ( 'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n' 'Requirements for the output:\n' '- Return only the point (x, y) representing the center of the target element\n' '- Coordinates must be normalized to the range [0, 1]\n' '- Round each coordinate to three decimal places\n' '- Format the output as strictly (x, y) without any additional text\n' ) system_prompt_bbox = ( 'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n' 'Requirements for the output:\n' '- Return only the bounding box coordinates (x0, y0, x1, y1)\n' '- Coordinates must be normalized to the range [0, 1]\n' '- Round each coordinate to three decimal places\n' '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n' ) system_prompt = system_prompt_point # system_prompt_bbox user_prompt = None # replace with your instruction (e.g., 'close the window') messages = [ \{ 'role': 'system', 'content': [ \{ 'type': 'text', 'text': system_prompt \} ] \}, \{ 'role': 'user', 'content': [ \{ 'type': 'image_url', 'image_url': \{'url': '/path/to/image.jpg'\} \}, \{ 'type': 'text', 'text': user_prompt \} ] \} ] response = call_wepoints(messages) print(response) \`\`\` ## Citation If you use this model in your work, please cite the following paper: \`\`\` @article\{zhao2026pointsguigguigroundingjourney, title = \{POINTS-GUI-G: GUI-Grounding Journey\}, author = \{Zhao, Zhongyin and Liu, Yuan and Liu, Yikun and Wang, Haicheng and Tian, Le and Zhou, Xiao and You, Yangxiu and Yu, Zilin and Yu, Yang and Zhou, Jie\}, journal = \{arXiv preprint arXiv:2602.06391\}, year = \{2026\} \} @inproceedings\{liu2025points, title=\{POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion\}, author=\{Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Xiao, Zhou and Yu, Yang and others\}, booktitle=\{Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing\}, pages=\{1576--1601\}, year=\{2025\} \} @article\{liu2024points1, title=\{POINTS1. 5: Building a Vision-Language Model towards Real World Applications\}, author=\{Liu, Yuan and Tian, Le and Zhou, Xiao and Gao, Xinyu and Yu, Kavio and Yu, Yang and Zhou, Jie\}, journal=\{arXiv preprint arXiv:2412.08443\}, year=\{2024\} \} @article\{liu2024points, title=\{POINTS: Improving Your Vision-language Model with Affordable Strategies\}, author=\{Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie\}, journal=\{arXiv preprint arXiv:2409.04828\}, year=\{2024\} \} @article\{liu2024rethinking, title=\{Rethinking Overlooked Aspects in Vision-Language Models\}, author=\{Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie\}, journal=\{arXiv preprint arXiv:2405.11850\}, year=\{2024\} \} \`\`\`
