Files
JoyD/TARS/UI-TARS/codes/README.md

147 lines
3.7 KiB
Markdown
Raw Normal View History

2025-10-31 11:12:44 +08:00
# ui-tars
A python package for parsing VLM-generated GUI action instructions into executable pyautogui codes.
---
## Introduction
`ui-tars` is a Python package for parsing VLM-generated GUI action instructions, automatically generating pyautogui scripts, and supporting coordinate conversion and smart image resizing.
- Supports multiple VLM output formats (e.g., Qwen-VL, Seed-VL)
- Automatically handles coordinate scaling and format conversion
- One-click generation of pyautogui automation scripts
---
## Quick Start
### Installation
```bash
pip install ui-tars
# or
uv pip install ui-tars
```
### Parse output into structured actions
```python
from ui_tars.action_parser import parse_action_to_structure_output, parsing_response_to_pyautogui_code
response = "Thought: Click the button\nAction: click(point='<point>200 300</point>')"
original_image_width, original_image_height = 1920, 1080
parsed_dict = parse_action_to_structure_output(
response,
factor=1000,
origin_resized_height=original_image_height,
origin_resized_width=original_image_width,
model_type="doubao"
)
print(parsed_dict)
parsed_pyautogui_code = parsing_response_to_pyautogui_code(
responses=parsed_dict,
image_height=original_image_height,
image_width=original_image_width
)
print(parsed_pyautogui_code)
```
### Generate pyautogui automation script
```python
from ui_tars.action_parser import parsing_response_to_pyautogui_code
pyautogui_code = parsing_response_to_pyautogui_code(parsed_dict, original_image_height, original_image_width)
print(pyautogui_code)
```
### Visualize coordinates on the image (optional)
```python
from PIL import Image, ImageDraw
import numpy as np
import matplotlib.pyplot as plt
image = Image.open("your_image_path.png")
start_box = parsed_dict[0]["action_inputs"]["start_box"]
coordinates = eval(start_box)
x1 = int(coordinates[0] * original_image_width)
y1 = int(coordinates[1] * original_image_height)
draw = ImageDraw.Draw(image)
radius = 5
draw.ellipse((x1 - radius, y1 - radius, x1 + radius, y1 + radius), fill="red", outline="red")
plt.imshow(np.array(image))
plt.axis("off")
plt.show()
```
---
## API Documentation
### parse_action_to_structure_output
```python
def parse_action_to_structure_output(
text: str,
factor: int,
origin_resized_height: int,
origin_resized_width: int,
model_type: str = "qwen25vl",
max_pixels: int = 16384 * 28 * 28,
min_pixels: int = 100 * 28 * 28
) -> list[dict]:
...
```
**Description:**
Parses output action instructions into structured dictionaries, automatically handling coordinate scaling and box/point format conversion.
**Parameters:**
- `text`: The output string
- `factor`: Scaling factor
- `origin_resized_height`/`origin_resized_width`: Original image height/width
- `model_type`: Model type (e.g., "qwen25vl", "doubao")
- `max_pixels`/`min_pixels`: Image pixel upper/lower limits
**Returns:**
A list of structured actions, each as a dict with fields like `action_type`, `action_inputs`, `thought`, etc.
---
### parsing_response_to_pyautogui_code
```python
def parsing_response_to_pyautogui_code(
responses: dict | list[dict],
image_height: int,
image_width: int,
input_swap: bool = True
) -> str:
...
```
**Description:**
Converts structured actions into a pyautogui script string, supporting click, type, hotkey, drag, scroll, and more.
**Parameters:**
- `responses`: Structured actions (dict or list of dicts)
- `image_height`/`image_width`: Image height/width
- `input_swap`: Whether to use clipboard paste for typing (default True)
**Returns:**
A pyautogui script string, ready for automation execution.
---
## Contribution
Contributions, issues, and suggestions are welcome!
---
## License
Apache-2.0 License