Image Matching

AI SDK

AI vision-based image matching using large language models via the Vercel AI SDK.

Overview

The @nut-tree/plugin-ai-sdk plugin enables AI vision-based image matching in nut.js using large language models via the Vercel AI SDK. Instead of pixel-based template matching, it uses multimodal LLMs to understand and locate UI elements by description.

AI Vision

Locate elements by natural language description

screen.find("login button")

Multiple Providers

OpenAI, Anthropic, and Ollama support

useOpenAIVisionProvider()

Configurable

Custom models, prompts, and matching options

{ model, systemPrompt }

Installation

Install the plugin along with the AI SDK provider for your preferred LLM:

shell
# Core plugin
npm install @nut-tree/plugin-ai-sdk

# Choose one or more AI SDK providers:
npm install @ai-sdk/openai      # OpenAI (GPT-5, etc.)
npm install @ai-sdk/anthropic   # Anthropic (Claude Opus 4.6, etc.)
npm install ollama-ai-provider   # Ollama (local models)

Subscription Required

This package is included in Solo and Team subscription plans.

Quick Reference

useOpenAIVisionProvider

useOpenAIVisionProvider(options?)
void

Activate OpenAI as the vision provider for image matching

useAnthropicVisionProvider

useAnthropicVisionProvider(options?)
void

Activate Anthropic as the vision provider for image matching

useOllamaVisionProvider

useOllamaVisionProvider(options?)
void

Activate Ollama as the vision provider for local AI matching


Provider Setup

OpenAI

typescript
import { screen } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({
    model: openai("gpt-5"),
});

// Find elements by description
const loginButton = await screen.find("the login button");

API Key Required

Set the OPENAI_API_KEY environment variable with your OpenAI API key.

Anthropic

typescript
import { screen } from "@nut-tree/nut-js";
import { useAnthropicVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { anthropic } from "@ai-sdk/anthropic";

useAnthropicVisionProvider({
    model: anthropic("claude-opus-4-6"),
});

const submitButton = await screen.find("the submit button");

API Key Required

Set the ANTHROPIC_API_KEY environment variable with your Anthropic API key.

Ollama (Local)

Use Ollama for fully local AI matching without external API calls:

typescript
import { screen } from "@nut-tree/nut-js";
import { useOllamaVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { ollama } from "ollama-ai-provider";

useOllamaVisionProvider({
    model: ollama("llava"),
});

const element = await screen.find("the search input field");

Local Setup

Make sure Ollama is running locally (ollama serve) and that you have pulled a vision-capable model such as llava or llava-llama3.

Configuration

All provider setup functions accept a configuration object:

typescript
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({
    // Required: the AI model to use
    model: openai("gpt-5"),

    // Optional: default confidence threshold (0-1)
    defaultConfidence: 0.7,

    // Optional: maximum matches to return per search
    defaultMaxMatches: 5,

    // Optional: matching strategy
    matching: "default",

    // Optional: custom system prompt for the AI model
    systemPrompt: "You are analyzing a desktop application screenshot.",
});

Options Reference

model

model: LanguageModelV1
required

The AI SDK language model instance to use for vision analysis

defaultConfidence

defaultConfidence?: number
optional

Default confidence threshold for matches (0-1). Can be overridden per search.

defaultMaxMatches

defaultMaxMatches?: number
optional

Maximum number of matches to return per search. Can be overridden per search.

matching

matching?: "default"
optional

Matching strategy to use. Per the type signature, "default" is currently the only supported value.

systemPrompt

systemPrompt?: string
optional

Custom system prompt to provide context to the AI model about what it is analyzing
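For example, a tailored systemPrompt can steer the model toward your application's UI. This is an illustrative sketch — the prompt wording and the invoicing-app scenario are hypothetical; only the model and systemPrompt options come from the reference above:

typescript
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({
    model: openai("gpt-5"),
    // Hypothetical context for an invoicing app; describe your own UI conventions.
    systemPrompt:
        "You are analyzing screenshots of an invoicing desktop application. " +
        "Primary action buttons are blue; form labels appear to the left of their inputs.",
});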

Usage Examples

Finding Elements

typescript
import { screen, mouse, centerOf, straightTo, Button } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({ model: openai("gpt-5") });

// Find by natural language description
const button = await screen.find("the blue Submit button");
await mouse.move(straightTo(centerOf(button)));
await mouse.click(Button.LEFT);

// Find multiple matches
const items = await screen.findAll("list items in the sidebar");
console.log(`Found ${items.length} sidebar items`);

// Wait for an element to appear
const dialog = await screen.waitFor("a confirmation dialog", 10000, 1000); // 10 s timeout, polling every 1 s

With Confidence Override

typescript
import { screen } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({
    model: openai("gpt-5"),
    defaultConfidence: 0.7,
});

// Override confidence for a specific search
const result = await screen.find("the navigation menu", {
    confidence: 0.9,
});

Best Practices

Descriptions

  • Be specific in your descriptions (e.g., "the blue Submit button" vs "a button")
  • Include visual characteristics like color, position, or text content
  • Use the systemPrompt option to give the model context about your application

Performance Considerations

  • AI vision matching is slower than template matching (nl-matcher) due to API latency
  • Cloud providers (OpenAI, Anthropic) require internet connectivity and incur API costs
  • Ollama provides local matching but requires a capable GPU for good performance
  • Consider using nl-matcher for speed-critical operations and AI SDK for complex visual understanding
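As a sketch of these points: since every AI search is a model round trip, you can locate a stable element once and reuse the returned region for subsequent interactions. The "Add Row button" target is hypothetical, and this assumes the element does not move between clicks:

typescript
import { screen, mouse, centerOf, straightTo, Button } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({ model: openai("gpt-5") });

// One AI call; reuse the region for repeated clicks instead of re-searching.
const addRowButton = await screen.find("the Add Row button");
for (let i = 0; i < 3; i++) {
    await mouse.move(straightTo(centerOf(addRowButton)));
    await mouse.click(Button.LEFT);
}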
