# ui-tars

A Python package for parsing VLM-generated GUI action instructions into executable pyautogui code.
## Introduction
ui-tars is a Python package that parses VLM-generated GUI action instructions, automatically generates pyautogui scripts, and handles coordinate conversion and smart image resizing.
- Supports multiple VLM output formats (e.g., Qwen-VL, Seed-VL)
- Automatically handles coordinate scaling and format conversion (see the sketch below)
- One-click generation of pyautogui automation scripts
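
The coordinate handling can be pictured as a two-step conversion; the following is a minimal sketch, assuming point coordinates in a 0–`factor` model space (the helper name is hypothetical, not part of the package):

```python
# Minimal sketch of the coordinate conversion (illustrative, not the
# package's internal implementation). A model emits a point such as
# <point>200 300</point> in a 0..factor space; the parser normalizes it,
# and the code generator scales it back to pixels on the original image.
def model_point_to_pixels(
    x_model: float,
    y_model: float,
    factor: int,
    image_width: int,
    image_height: int,
) -> tuple[int, int]:
    # Normalize from model space (0..factor) to fractions (0..1)...
    x_norm, y_norm = x_model / factor, y_model / factor
    # ...then scale to pixel coordinates on the original image.
    return round(x_norm * image_width), round(y_norm * image_height)


print(model_point_to_pixels(200, 300, factor=1000, image_width=1920, image_height=1080))
# (384, 324)
```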
## Quick Start
### Installation
```bash
pip install ui-tars
# or
uv pip install ui-tars
```
### Parse output into structured actions
```python
from ui_tars.action_parser import (
    parse_action_to_structure_output,
    parsing_response_to_pyautogui_code,
)

# Raw model response: a thought followed by an action in the model's grammar.
response = "Thought: Click the button\nAction: click(point='<point>200 300</point>')"
original_image_width, original_image_height = 1920, 1080

# Parse the response into a list of structured action dicts.
parsed_dict = parse_action_to_structure_output(
    response,
    factor=1000,
    origin_resized_height=original_image_height,
    origin_resized_width=original_image_width,
    model_type="doubao",
)
print(parsed_dict)

# Convert the structured actions into an executable pyautogui script.
parsed_pyautogui_code = parsing_response_to_pyautogui_code(
    responses=parsed_dict,
    image_height=original_image_height,
    image_width=original_image_width,
)
print(parsed_pyautogui_code)
```
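
For the example above, `<point>200 300</point>` with `factor=1000` normalizes to (0.2, 0.3), so the generated script clicks at the corresponding pixel position, roughly along these lines (an illustration of the idea, not the verbatim output):

```python
import pyautogui

pyautogui.click(384, 324)  # 0.2 * 1920 = 384, 0.3 * 1080 = 324
```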
### Generate pyautogui automation script
```python
from ui_tars.action_parser import parsing_response_to_pyautogui_code

# `parsed_dict` is the output of parse_action_to_structure_output above.
pyautogui_code = parsing_response_to_pyautogui_code(
    parsed_dict, original_image_height, original_image_width
)
print(pyautogui_code)
```
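
The result is a plain Python source string, so one way to run it is `exec()` in a trusted environment with pyautogui installed; this is a usage sketch, not a documented package API:

```python
import pyautogui  # make sure pyautogui is importable before running the script

exec(pyautogui_code)  # runs the generated automation steps
```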
### Visualize coordinates on the image (optional)
```python
import ast

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image, ImageDraw

image = Image.open("your_image_path.png")

# start_box holds normalized coordinates serialized as a string,
# e.g. "(0.2, 0.3, 0.2, 0.3)".
start_box = parsed_dict[0]["action_inputs"]["start_box"]
coordinates = ast.literal_eval(start_box)  # safer than eval()

# Scale the normalized coordinates back to pixels on the original image.
x1 = int(coordinates[0] * original_image_width)
y1 = int(coordinates[1] * original_image_height)

# Mark the target point with a small red dot.
draw = ImageDraw.Draw(image)
radius = 5
draw.ellipse((x1 - radius, y1 - radius, x1 + radius, y1 + radius), fill="red", outline="red")

plt.imshow(np.array(image))
plt.axis("off")
plt.show()
```
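
If an action carries a full box rather than a single point (the parser handles box/point conversion, per the API docs below), the same approach can outline the region; a sketch assuming `start_box` holds four normalized values `(x1, y1, x2, y2)`:

```python
# Sketch: outline the full box when start_box has four normalized values.
# Reuses `coordinates`, `draw`, and the image dimensions from above.
if len(coordinates) == 4:
    x2 = int(coordinates[2] * original_image_width)
    y2 = int(coordinates[3] * original_image_height)
    draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
```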
## API Documentation
### `parse_action_to_structure_output`
```python
def parse_action_to_structure_output(
    text: str,
    factor: int,
    origin_resized_height: int,
    origin_resized_width: int,
    model_type: str = "qwen25vl",
    max_pixels: int = 16384 * 28 * 28,
    min_pixels: int = 100 * 28 * 28,
) -> list[dict]:
    ...
```
**Description:** Parses model output action instructions into structured dictionaries, automatically handling coordinate scaling and box/point format conversion.
**Parameters:**
- `text`: The model output string
- `factor`: Coordinate scaling factor
- `origin_resized_height` / `origin_resized_width`: Original image height/width
- `model_type`: Model type (e.g., `"qwen25vl"`, `"doubao"`)
- `max_pixels` / `min_pixels`: Upper/lower limits on the image pixel count
**Returns:** A list of structured actions, each a dict with fields such as `action_type`, `action_inputs`, and `thought`.
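
A hypothetical call for a Qwen2.5-VL-style response (the response string and its action grammar are illustrative, not guaranteed to match your model's prompt):

```python
from ui_tars.action_parser import parse_action_to_structure_output

# Illustrative response; the exact action grammar depends on your prompt.
response = "Thought: Open the menu\nAction: click(point='<point>512 384</point>')"

actions = parse_action_to_structure_output(
    response,
    factor=1000,
    origin_resized_height=1080,
    origin_resized_width=1920,
    model_type="qwen25vl",  # the default; max_pixels/min_pixels bound the smart resize
)
for action in actions:
    print(action["action_type"], action["action_inputs"])
```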
### `parsing_response_to_pyautogui_code`
```python
def parsing_response_to_pyautogui_code(
    responses: dict | list[dict],
    image_height: int,
    image_width: int,
    input_swap: bool = True,
) -> str:
    ...
```
**Description:** Converts structured actions into a pyautogui script string, supporting click, type, hotkey, drag, scroll, and more.
**Parameters:**
- `responses`: Structured actions (a dict or list of dicts)
- `image_height` / `image_width`: Image height/width
- `input_swap`: Whether to use clipboard paste for typing (default `True`)
**Returns:** A pyautogui script string, ready to execute.
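
For instance, a sketch that generates a script typing text via key presses rather than clipboard paste (reusing `parsed_dict` from the quick start):

```python
from ui_tars.action_parser import parsing_response_to_pyautogui_code

script = parsing_response_to_pyautogui_code(
    responses=parsed_dict,  # output of parse_action_to_structure_output
    image_height=1080,
    image_width=1920,
    input_swap=False,  # emit key presses instead of clipboard paste for type actions
)
print(script)
```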
## Contributing

Contributions, issues, and suggestions are welcome!
## License

Apache-2.0 License