
OculiX MCP Server — lightweight remote control for visual automation

Oculix-specific

The OculiX MCP Server is a small Java daemon that exposes Oculix’s visual automation — template matching, OCR, mouse and keyboard control — as a remote-callable API. It is designed to be driven by external test runners, QA pipelines, internal tooling, and any other client that needs to control a screen from another process.

Under the hood, the wire protocol is Model Context Protocol — JSON-RPC over stdio or Streamable HTTP. We picked it because it is noticeably lighter than the XML-RPC layer used by classic Robot Framework remote libraries (jrobotremoteserver and the like). Less framing overhead, simpler client code, faster round-trips.

The fact that MCP clients also exist on the LLM side (Claude Desktop, Cursor, the MCP Inspector) is a bonus, not the point. If you want to drive Oculix from an LLM you can; if you want to drive it from a Python test script or a Java service, you can too. The protocol is the same in both cases.

Boundary — what Oculix provides vs what you provide


Oculix ships security primitives. It does not define your security policy. Read this before any deployment in a regulated environment, because the distinction matters legally and operationally.

| Oculix provides (mechanism) | You provide (policy) |
| --- | --- |
| ActionGate interface called on every tool call | Concrete ActionGate implementation (auto-approve, queue, allowlist, …) matching your threat model |
| Cryptographically signed audit trail (Ed25519 + hash chain) | Where the journal is archived, who reviews it, how often, what counts as a violation, retention period |
| Key rotation, session model, fail-fast on unsigned state | Rotation cadence, who is allowed to rotate, how revocation propagates to your other systems |
| Open vs confidential modes | The decision of which mode is required for which use case, enforced at process-launch time by your own configuration management |
| Refusal to fall back to unsigned mode (no silent degradation) | Monitoring that detects when the server is not running and alerts your on-call |
| Loopback-only by default, TLS termination flag for upstream proxies | Reverse proxy config, certificate rotation, WAF rules, network segmentation |
| No outbound traffic from the server itself | Outbound firewall rules on the host, allow/deny lists of LLM API endpoints, network isolation between the LLM host and the target screen host |

Oculix can demonstrate that an action was taken, by whom, with what arguments. Oculix cannot decide for your security team whether that action should have been allowed in the first place. That decision belongs to your ActionGate implementation, your network policy, your compliance officer, and your regulator-facing review process — all of which are your responsibility.

If you ship Oculix into a regulated environment without a non-trivial ActionGate implementation and a defined audit review process, you have not configured a secure system; you have configured an auditable one. The two are not the same, and conflating them is a common source of automation incidents.

The server treats any LLM-backed client (the Claude API, OpenAI, Mistral, Gemini, on-prem Ollama / vLLM — any provider, hosted or local) as a category that the deployer must explicitly enable through a validation mechanism before such a client can drive the server. The default posture is “a QA framework or a deterministic test runner is on the other end”.

This is independent of the LLM-agnosticism of the codebase (no provider SDK is bundled, no model-specific logic ships): that is a build-time property. The opt-in rule is a runtime activation rule that the deployer owns. A vanilla deployment refuses to negotiate with a client whose handshake identifies it as an LLM front-end; flipping that bit is a conscious operator decision tied to a documented policy, not an accidental side-effect of starting the daemon.

The point is to make the LLM-driven case impossible to enable by mistake. If you want it, you turn it on, with the configuration trail to prove it. If you do not, the server stays a plain QA backend regardless of what clients are running on the network.

Deployer attestation — turning activation into an audit entry


For activations that materially change the server’s security posture (enabling LLM clients, switching from confidential to open mode, deploying with AutoApproveGate in production, widening an allowlist), the server ships a deployer attestation template: a small structured form the operator fills in before the feature can be activated.

The template is not a legal NDA — Oculix has no contractual relationship with the deployer’s organisation. It is an internal attestation that the deployer issues to itself and to its own compliance function, and that becomes part of the audit chain.

Concretely:

  • The server provides a template (YAML/JSON) with required fields: operator identity, organisation, business justification, internal policy reference, target environment, expected duration, fallback path on revocation
  • The operator fills it in locally before activation
  • The server validates the template at activation time and, if valid, writes the template’s canonical hash and full contents as an activation_acknowledged entry, signed by the current audit key
  • Future verification (oculix-mcp verify) confirms the attestation entry alongside the operational tool calls; a missing or unsigned attestation for a sensitive mode is a verification failure
  • Rotating the audit key preserves the attestation: the entry chains across rotations like any other journal entry

For a compliance officer, this turns “did anyone enable LLM clients in March?” into a deterministic offline check rather than a conversation. For an external regulator, it produces an evidence artefact tying a specific human, with a specific policy reference, to the activation timestamp.

Implementation status: the activation slot exists in the audit schema; concrete templates and the per-feature attestation requirement are scoped for a near-term release. See What is not in V1.
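
Since the concrete templates are not shipped yet, the following is only a minimal sketch of what validation and canonical hashing of an attestation could look like. The field names are hypothetical (derived from the required-fields list above), and the canonical form (JSON with sorted keys and compact separators) is an assumption, not the server's actual schema.

```python
import hashlib
import json

# Hypothetical field names, derived from the required-fields list above.
REQUIRED_FIELDS = (
    "operator_identity", "organisation", "business_justification",
    "internal_policy_reference", "target_environment",
    "expected_duration", "fallback_on_revocation",
)

def missing_fields(attestation: dict) -> list:
    """Return the required fields that are absent or blank."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = attestation.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing

def canonical_hash(attestation: dict) -> str:
    """SHA-256 over an assumed canonical form: sorted keys, compact JSON, UTF-8."""
    canonical = json.dumps(attestation, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def activation_entry(attestation: dict) -> dict:
    """Build the unsigned body of an activation_acknowledged entry."""
    missing = missing_fields(attestation)
    if missing:
        raise ValueError("attestation incomplete: " + ", ".join(missing))
    return {
        "type": "activation_acknowledged",
        "attestation_sha256": canonical_hash(attestation),
        "attestation": attestation,
    }
```

Hashing a canonical form rather than the raw file means the same attestation yields the same hash regardless of key order or whitespace, which is what makes an offline check deterministic. Signing the entry with the audit key is left out of the sketch.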

A SikuliX script written in the IDE drives the screen of the same machine it runs on. That is fine for desktop scripts, but a lot of real-world setups want one of these:

  • A CI worker triggers screen actions on a dedicated automation VM sitting on a different host.
  • A test harness on a developer laptop drives a remote production-like terminal over VNC.
  • A legacy test framework (Robot Framework, Squish-style scripts, internal shell harnesses) needs to call screen actions from outside the SikuliX JVM.
  • An on-premise orchestration platform schedules screen automations on pools of machines.

All of these need the same thing: a stable, language-agnostic API to ask “click this image”, “wait for this text”, “screenshot this region”, and get a result back.

That is what oculix-mcp-server is.

The server registers nine tools by default. Each is invoked over the standard MCP tools/call method; arguments and return shape are described by the JSON schema returned by tools/list.

| Tool | What it does |
| --- | --- |
| oculix_find_image | Locate a reference image on screen, return coordinates (non-blocking) |
| oculix_click_image | Find a reference image and left-click it |
| oculix_exists_image | Non-blocking presence check |
| oculix_wait_for_image | Block until a reference image appears, up to a timeout |
| oculix_screenshot | Capture the full screen or a region, return as base64 PNG |
| oculix_type_text | Type literal text at the keyboard focus |
| oculix_key_combo | Press a keyboard combination (Ctrl+C, Cmd+Tab, F5, …) |
| oculix_find_text | OCR: locate a text string on screen, return bounding box |
| oculix_read_text_in_region | OCR: extract the text from a region (open mode only) |

In confidential mode (see below), the two tools that return screen content, oculix_screenshot and oculix_read_text_in_region, are replaced by _to_disk variants that keep pixels and text on the local host and return only metadata such as a path and a SHA-256 hash.

Each tool is a thin Java class in MCP/src/main/java/org/sikuli/mcp/tools/, wrapping the corresponding Oculix primitive (see Region, Pattern, Match, text and OCR).

If you already know jrobotremoteserver (the Java-side daemon used by community Sikuli libraries for Robot Framework), the comparison is straightforward.

| | jrobotremoteserver (XML-RPC) | OculiX MCP Server (JSON-RPC) |
| --- | --- | --- |
| Framing | Verbose XML envelopes | Compact JSON objects |
| Transport | HTTP only | Stdio (no socket needed) or Streamable HTTP |
| Per-call overhead | Higher | Lower |
| Native client surface | Robot Framework Remote Library | Any MCP client or raw JSON-RPC |
| Auth | App-specific | Layered: optional PAT + mandatory ephemeral bearer |
| Session model | Stateless per call | Explicit session, revocable via DELETE |
| Audit | Off the shelf: nothing | Ed25519 hash chain, signed per call, verifiable offline |

The MCP server covers the same kind of remote-control workflow that XML-RPC was carrying, over a lighter protocol. The audit chain and session model are bonuses on top.

Oculix is API-compatible with SikuliX: existing scripts and existing bindings (including Robot Framework setups using XML-RPC bridges) keep working unchanged. The MCP server is a separate, additional path — not a replacement.

Who can use this — and why it beats rolling your own


The MCP server gives any framework or service that wants visual recognition a complete, audited, language-agnostic backend over a single JSON-RPC connection. Concretely:

| If you… | The MCP server gives you… |
| --- | --- |
| Drive Selenium / Playwright / Cypress and bolt on OpenCV + Tesseract for visual asserts | A maintained engine with PaddleOCR + Tesseract fallback, exposed by 9 schema’d tools |
| Have a homemade Python harness using pyautogui + pytesseract | A stable Java backend over JSON-RPC, with audit trail and session model out of the box |
| Want to drive screens from a Java / Go / Rust / .NET service that has no SikuliX binding | Plain HTTP + JSON: no SDK, no JNI, no native build |
| Build a new test framework or QA harness from scratch | A visual-recognition backend you don’t have to write, so you can focus on your framework’s ergonomics |
| Run a regulated environment that needs proof of every UI action | Ed25519 signed journal verifiable offline, with verify subcommand |

The point: rather than every framework re-implementing template matching, OCR plumbing, and result-shaping for the hundredth time, plug them all into the same oculix-mcp-server and call its nine tools. The framework keeps its native ergonomics (pytest fixtures, Playwright expect-style, Robot keywords); only the visual primitives are delegated to a single audited backend.

mvn -pl MCP -am -DskipTests -Pmcp-fatjar clean package
# produces MCP/target/oculix-mcp-server.jar

Stdio mode (one process, one client, no socket):

java -jar MCP/target/oculix-mcp-server.jar run
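
For a script that wants to drive the stdio transport directly, here is a rough sketch. It relies on the MCP stdio convention of one JSON-RPC message per line. The helper names are illustrative, and the sketch assumes the server writes nothing but protocol messages to stdout and answers every request with exactly one line.

```python
import json
import subprocess

def encode_message(method, params=None, msg_id=None):
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    msg = {"jsonrpc": "2.0", "method": method}
    if msg_id is not None:
        msg["id"] = msg_id
    if params is not None:
        msg["params"] = params
    # json.dumps escapes newlines inside strings, so the frame stays on one line.
    return (json.dumps(msg, separators=(",", ":")) + "\n").encode("utf-8")

def call_over_stdio(jar_path, requests):
    """Spawn the server in stdio mode and exchange requests one line at a time.

    Every item in `requests` must carry an id, since only requests get a reply.
    """
    proc = subprocess.Popen(["java", "-jar", jar_path, "run"],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    replies = []
    try:
        for raw in requests:
            proc.stdin.write(raw)
            proc.stdin.flush()
            replies.append(json.loads(proc.stdout.readline()))
    finally:
        proc.stdin.close()
        proc.wait(timeout=10)
    return replies
```

A call could look like `call_over_stdio("MCP/target/oculix-mcp-server.jar", [encode_message("initialize", {}, 1), encode_message("tools/list", msg_id=2)])`.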

HTTP mode (multiple clients, addressable from outside the process):

# endpoint: http://127.0.0.1:7337/mcp (default host and port)
java -jar MCP/target/oculix-mcp-server.jar serve

On first start, the server generates an Ed25519 key pair in ~/.oculix-mcp/ (private key mode 0600). Subsequent runs reuse that pair.

java -jar oculix-mcp-server.jar verify

The verify subcommand walks every audit-*.jsonl file in ~/.oculix-mcp/journal/, recomputes hashes, verifies signatures, and checks chain continuity. It prints OK or FAIL per file.

oculix-mcp run                    Start over stdio (default)
oculix-mcp serve [flags]          Start over HTTP (Streamable HTTP)
    --host HOST                     (default 127.0.0.1)
    --port PORT                     (default 7337, 0 for auto)
    env OCULIX_MCP_TOKEN            optional client token
    env OCULIX_MCP_MODE             open | confidential
    env OCULIX_MCP_VAULT            confidential landing dir
    env OCULIX_MCP_TRUST_TLS_TERMINATION=1
                                    acknowledge upstream TLS
oculix-mcp rotate-key             Rotate the Ed25519 audit signing key
oculix-mcp rotate-session-key     Rotate the HMAC keyring for session tokens
oculix-mcp recover                Record an unsigned-gap and start a fresh chain
oculix-mcp verify [FILES...]      Verify audit journals (all by default)
oculix-mcp --help                 Show usage

HTTP transport — calling the server from anywhere


Once serve is running, any HTTP client can drive it.

# Start the server (default endpoint: http://127.0.0.1:7337/mcp)
java -jar MCP/target/oculix-mcp-server.jar serve
# → client token: DISABLED (any caller on loopback can initialize)

# 1. initialize: returns Mcp-Session-Id (header) + bearer (body)
RESP=$(curl -s -D /tmp/h.txt -X POST http://127.0.0.1:7337/mcp \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}')
SID=$(grep -i mcp-session-id /tmp/h.txt | awk '{print $2}' | tr -d '\r')
BEARER=$(echo "$RESP" | jq -r '.result._meta.bearer')

# 2. tools/list
curl -s -X POST http://127.0.0.1:7337/mcp \
  -H "Mcp-Session-Id: $SID" \
  -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":2,"method":"tools/list"}' | jq

# 3. call a tool
curl -s -X POST http://127.0.0.1:7337/mcp \
  -H "Mcp-Session-Id: $SID" \
  -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":3,"method":"tools/call",
       "params":{"name":"oculix_click_image",
                 "arguments":{"reference_path":"/tmp/button.png","similarity":0.85}}}' | jq

# 4. DELETE: revoke the session
curl -s -X DELETE http://127.0.0.1:7337/mcp \
  -H "Mcp-Session-Id: $SID" \
  -H "Authorization: Bearer $BEARER" -o /dev/null -w "%{http_code}\n"

No SDK needed — it’s plain HTTP + JSON. Python requests, Go net/http, JS fetch, Java HttpClient, all work directly.
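
As an illustration of that claim, here is a hedged sketch of the same flow using only Python's standard library. The function names are ours, not part of any shipped client, and the sketch assumes the server answers each POST with a plain JSON body (as the curl walkthrough piping into jq suggests) rather than an event stream.

```python
import json
import urllib.request

ENDPOINT = "http://127.0.0.1:7337/mcp"  # default host and port of `serve`

def rpc_request(method, params=None, msg_id=1, session_id=None, bearer=None):
    """Build (but do not send) a JSON-RPC POST for the /mcp endpoint."""
    body = {"jsonrpc": "2.0", "id": msg_id, "method": method}
    if params is not None:
        body["params"] = params
    headers = {"Content-Type": "application/json"}
    if session_id:
        headers["Mcp-Session-Id"] = session_id
    if bearer:
        headers["Authorization"] = "Bearer " + bearer
    return urllib.request.Request(ENDPOINT, data=json.dumps(body).encode("utf-8"),
                                  headers=headers, method="POST")

def open_session():
    """Send initialize and return (session_id, bearer)."""
    with urllib.request.urlopen(rpc_request("initialize", {}), timeout=10) as resp:
        session_id = resp.headers["Mcp-Session-Id"]
        bearer = json.load(resp)["result"]["_meta"]["bearer"]
    return session_id, bearer

def call_tool(session_id, bearer, name, arguments, msg_id=2):
    """Invoke one tool through tools/call and return the decoded response."""
    req = rpc_request("tools/call", {"name": name, "arguments": arguments},
                      msg_id, session_id, bearer)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

With a server running, `sid, token = open_session()` followed by `call_tool(sid, token, "oculix_exists_image", {"reference_path": "/tmp/button.png"})` mirrors steps 1 and 3 of the walkthrough. The argument names for each tool should be taken from tools/list rather than from this example.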

Layer 1: client credential (optional, PAT-style). Gates initialize. Configured via OCULIX_MCP_TOKEN. Unset → anyone on the bound interface can initialize, which is acceptable on loopback. Set it when exposing the server beyond localhost.

Layer 2: session token (mandatory, ephemeral). Minted by the server on every successful initialize, returned in result._meta.bearer. Format:

ocx.<kid>.<b64url(payload)>.<b64url(hmac)>

The client sends it back as Authorization: Bearer <value> on every subsequent request. Default TTL is 30 minutes; refresh by re-initializing.

A leaked token cannot be replayed against a different session id; a token whose session was DELETEd is rejected with 404 even if still crypto-valid; key rotation leaves outstanding tokens valid until they expire naturally.

oculix-mcp rotate-session-key
# Session-token keyring rotated.
# previous kid: k9QwM7eL
# new kid: kX3pT2aB
# ring size: 2

The HMAC keyring lives in ~/.oculix-mcp/session-hmac-keyring.json (mode 0600). Generating a new kid keeps old kids in the ring so already-issued tokens remain verifiable until they expire. Delete the file to force all outstanding sessions into hard failure.
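
To make the token format and the keyring behaviour concrete, here is a minimal sketch. It assumes a JSON payload with hypothetical `sid` and `exp` claims and an HMAC-SHA256 computed over `<kid>.<payload>`; the server's real payload layout and MAC input are not documented here and may differ. Server-side revocation (the 404 after DELETE) is not modelled.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def mint(keyring, kid, session_id, ttl_seconds=1800, now=None):
    """Issue ocx.<kid>.<payload>.<mac> bound to one session id (30 min default TTL)."""
    now = time.time() if now is None else now
    claims = {"sid": session_id, "exp": int(now) + ttl_seconds}
    payload = b64url(json.dumps(claims).encode("utf-8"))
    mac = hmac.new(keyring[kid], f"{kid}.{payload}".encode(), hashlib.sha256).digest()
    return f"ocx.{kid}.{payload}.{b64url(mac)}"

def verify(keyring, token, session_id, now=None):
    """Accept only tokens whose kid is in the ring, MAC matches, sid matches, not expired."""
    now = time.time() if now is None else now
    parts = token.split(".")
    if len(parts) != 4 or parts[0] != "ocx" or parts[1] not in keyring:
        return False
    _, kid, payload, mac = parts
    expected = hmac.new(keyring[kid], f"{kid}.{payload}".encode(), hashlib.sha256).digest()
    try:
        if not hmac.compare_digest(expected, b64url_decode(mac)):
            return False
        claims = json.loads(b64url_decode(payload))
    except ValueError:
        return False
    return claims.get("sid") == session_id and claims.get("exp", 0) > now
```

Because verification looks the key up by kid, adding a new kid to the ring leaves older tokens verifiable, while dropping a kid (or the whole keyring file) invalidates every token signed under it.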

Plain HTTP on anything other than a loopback address is refused at startup. If TLS is terminated upstream (nginx, Caddy, service mesh, WAF), set OCULIX_MCP_TRUST_TLS_TERMINATION=1 to acknowledge that responsibility before binding --host 0.0.0.0 or a non-local interface. In-process TLS is planned but not shipped — use a reverse proxy for now; certificate rotation and WAF policy belong there anyway.
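
The startup rule can be pictured as a small predicate. This is a sketch of the policy as described, not the server's code; treating the literal name `localhost` as loopback is our assumption.

```python
import ipaddress

def may_bind_plain_http(host: str, trust_tls_termination: bool) -> bool:
    """Loopback is always allowed; anything else needs the explicit TLS acknowledgement."""
    if host == "localhost":  # assumption: the name is accepted as loopback
        return True
    try:
        loopback = ipaddress.ip_address(host).is_loopback
    except ValueError:
        loopback = False  # other hostnames are treated as non-loopback
    return loopback or trust_tls_termination
```

Here `trust_tls_termination` stands for OCULIX_MCP_TRUST_TLS_TERMINATION=1 being set in the environment.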

Screen actions are serialized through a fair lock: no matter how many clients share one process, two oculix_click_image calls cannot interleave. This is independent of the transport.
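
For readers unfamiliar with the term, a fair lock admits waiters strictly in arrival order. The sketch below shows the idea as a ticket lock in Python; it illustrates the technique only and says nothing about how the Java server implements it (Java's ReentrantLock, for instance, offers a fairness option).

```python
import threading

class FairLock:
    """FIFO (ticket) lock: callers are admitted strictly in arrival order."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0   # next ticket to hand out
        self._now_serving = 0   # ticket currently allowed to proceed

    def __enter__(self):
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            while ticket != self._now_serving:
                self._cond.wait()
        return self

    def __exit__(self, *exc):
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()
        return False  # never swallow exceptions from the guarded action
```

Wrapping each screen action in `with lock:` guarantees that one action finishes before the next begins, and that no caller is starved by later arrivals. This sketch is not reentrant.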

By default, the server registers the 9 tools described above. Two of them return screen content over the wire:

  • oculix_screenshot — PNG bytes (base64) back to the caller.
  • oculix_read_text_in_region — OCR text back to the caller.

For regulated workflows where captured content must not leave the host (banking screens, healthcare records, internal admin consoles), start the server in confidential mode:

OCULIX_MCP_MODE=confidential \
OCULIX_MCP_TOKEN=... \
java -jar oculix-mcp-server.jar serve

In confidential mode the two content-bearing tools are not registered. The client’s tools/list will not even mention them — there is no filter to bypass, the capability is physically absent. They are replaced by:

  • oculix_screenshot_to_disk — captures a PNG into the local vault, returns {path, sha256, width, height, bytes}. Pixels stay local.
  • oculix_ocr_to_disk — writes the OCR output to a text file in the vault, returns {path, sha256, engine, line_count, char_count}. Text stays local.

Vault location: $OCULIX_MCP_VAULT if set, else ~/.oculix-mcp/vault/, restricted to owner only (0700 on POSIX).
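
The "content stays local, metadata goes back" pattern can be sketched as follows. The helper names and the file-naming scheme are hypothetical; only the vault location rule and the returned `path` / `sha256` / `bytes` fields come from the description above.

```python
import hashlib
import os
from pathlib import Path

def vault_dir() -> Path:
    """$OCULIX_MCP_VAULT if set, else ~/.oculix-mcp/vault/, restricted to the owner."""
    root = Path(os.environ.get("OCULIX_MCP_VAULT") or Path.home() / ".oculix-mcp" / "vault")
    root.mkdir(parents=True, exist_ok=True)
    os.chmod(root, 0o700)  # mkdir's mode is filtered by umask, so set it explicitly
    return root

def store_artifact(data: bytes, filename: str) -> dict:
    """Write bytes into the vault and return metadata only, never the content."""
    path = vault_dir() / filename
    # O_EXCL refuses to overwrite an existing artefact; 0o600 keeps it owner-only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return {"path": str(path), "sha256": hashlib.sha256(data).hexdigest(), "bytes": len(data)}
```

The caller receives enough to prove which artefact was captured (the hash) and where an authorised local process can find it (the path), without the pixels or text ever crossing the MCP channel.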

| Claim | Holds? |
| --- | --- |
| Screen pixels never leave the host via the MCP channel | yes |
| OCR results never leave the host via the MCP channel | yes |
| Local audit chain records SHA-256 of every captured artefact | yes |
| The remote caller cannot enumerate content-bearing tools | yes (they are absent from tools/list) |
| Caller-supplied arguments (e.g. type_text) never leak | no (the caller is the one writing them) |

Every tool call produces a journal entry containing:

  • Timestamp (UTC, microsecond precision) plus a monotonic per-file sequence number
  • Session id (UUID)
  • Client name and version (from the MCP handshake)
  • Tool name and full arguments
  • SHA-256 hash of the result payload
  • prev_hash chaining to the previous entry
  • entry_hash of the entry itself
  • Ed25519 signature of entry_hash

Rotation markers are signed by the outgoing key before a new key takes over, so the chain remains provable across key rotations. Deleting the private key manually triggers a fail-fast refusal on the next start — the server never falls back to unsigned mode.

{
  "type": "tool_call",
  "ts_utc": "2026-04-14T19:23:45.123456Z",
  "seq": 42,
  "session_id": "c4a1f8e0-3b20-4d6a-9b3f-1f6e4d6c8a11",
  "client": { "name": "qa-runner-internal", "version": "2.4.1" },
  "tool": "oculix_click_image",
  "args": { "reference_path": "/tmp/button.png", "similarity": 0.85 },
  "result_sha256": "3b1ab...",
  "extra": null,
  "prev_hash": "0000...",
  "entry_hash": "f8c2...",
  "signature": "a7e9..."
}
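
To show how prev_hash and entry_hash make tampering detectable, here is a minimal chain check. The hashing rule used (SHA-256 over the entry's canonical JSON without entry_hash and signature) is an assumption for illustration; the server's actual canonicalisation may differ, and the Ed25519 signature check is omitted because it needs a library outside the Python standard library.

```python
import hashlib
import json

def entry_digest(entry: dict) -> str:
    """Assumed rule: SHA-256 of the entry without its entry_hash and signature fields."""
    body = {k: v for k, v in entry.items() if k not in ("entry_hash", "signature")}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def first_broken_index(entries, genesis_hash):
    """Return the index of the first entry that breaks the chain, or None if intact."""
    prev_hash, prev_seq = genesis_hash, None
    for index, entry in enumerate(entries):
        if entry["prev_hash"] != prev_hash:
            return index  # an entry was removed, inserted, or reordered
        if entry["entry_hash"] != entry_digest(entry):
            return index  # the entry's own content was altered
        if prev_seq is not None and entry["seq"] <= prev_seq:
            return index  # sequence numbers must keep increasing within a file
        prev_hash, prev_seq = entry["entry_hash"], entry["seq"]
    return None
```

Editing any field of an entry changes its digest, and removing an entry leaves the next one pointing at a prev_hash that no longer matches, so both kinds of tampering surface at a specific index.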

Special entry types:

  • rotation_end / rotation_begin — pair of markers at key / file rotation
  • clock_regression — recorded when the wall clock regresses versus the monotonic reference (NTP sync going backward, VM pause)
  • recovery_gap — written by the recover subcommand as a separate file

To rotate the Ed25519 audit signing key:

oculix-mcp rotate-key

The command:

  1. Writes a rotation_end marker to the current journal, signed by the outgoing key. The marker includes the SHA-256 of the new public key so operators can anchor trust.
  2. Archives the outgoing key pair under ~/.oculix-mcp/archive/.
  3. Generates and installs a new Ed25519 key pair.
  4. Writes a rotation_begin marker to a fresh journal, chained to the closing marker of the previous file.

If the private key is genuinely lost (disk failure, operator error) and the chain cannot be continued:

oculix-mcp recover

The recover subcommand writes an UNSIGNED_GAP marker and starts a fresh chain under a new key. The discontinuity is explicit and visible to verify.

At startup the server checks the key pair against the journal. It starts only when the two are consistent; in every other case it refuses to start, with a clear message pointing at the recovery command:

| Situation | Behaviour |
| --- | --- |
| No key pair, no journal history | Initialise normally |
| No private key but journal history exists | REFUSE: run recover if intentional |
| Inconsistent key state (only one of priv/pub present) | REFUSE: run recover |
| Key pair present but unreadable / corrupted | REFUSE: restore from backup or recover |
| Current key does not verify the last journal entry | REFUSE: use rotate-key properly, or recover |
| Key pair present and consistent with journal | Start normally |
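
Read as code, the table amounts to a short decision function. The sketch below restates it under that reading; the parameter names and return values are illustrative, not taken from the server.

```python
def startup_decision(has_private, has_public, key_readable,
                     has_journal, key_verifies_last_entry):
    """Map key and journal state to INITIALISE, START, or REFUSE."""
    if not has_private and not has_public:
        # Nothing on disk: a fresh install initialises, otherwise history is orphaned.
        return "REFUSE" if has_journal else "INITIALISE"
    if has_private != has_public:
        return "REFUSE"  # only one half of the pair is present
    if not key_readable:
        return "REFUSE"  # pair present but unreadable or corrupted
    if has_journal and not key_verifies_last_entry:
        return "REFUSE"  # the key on disk is not the one that signed the chain
    return "START"
```

Every branch that is not a clean state ends in REFUSE, so there is no path that continues in unsigned mode.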

Silent degradation is a security anti-pattern.

Air-gapped LLM autonomy — the legitimate path


The combination most users will eventually ask about is “an LLM that drives our screens for QA”. That use case is real, but the only configuration where it makes sense in a regulated environment is fully on-prem: an on-host LLM, the MCP server next to it, no outbound traffic toward any cloud LLM API.

┌──────────────────────────┐       ┌──────────────────────────┐
│ Host A (LLM)             │  RPC  │ Host B (target screen)   │
│ - Mistral / Llama / etc. │◀─────▶│ - oculix-mcp-server      │
│   on local GPU           │ priv  │ - the application under  │
│ - no outbound Internet   │  net  │   test                   │
└──────────────────────────┘       │ - no outbound Internet   │
                                   └──────────────────────────┘
  outbound firewall:                 outbound firewall:
    DENY api.anthropic.com             DENY *
    DENY api.openai.com                (except localhost
    DENY api.mistral.ai                 paddleocr-server)
    DENY *.googleapis.com

What this configuration guarantees end-to-end:

  • Screen pixels never leave the host network
  • OCR results never leave the host network
  • Tool arguments (typed text, key combos) never leave the host network
  • The audit chain stays on Host B, signed by a key whose private half also stays on Host B
  • No third-party logging of inferences (no Anthropic / OpenAI / Google audit logs to subpoena)

What it does not protect against — these risks exist regardless of the LLM topology:

  • Prompt injection via screen content. If the target screen displays attacker-controlled text (a malicious email rendered in the app under test, a third-party SaaS panel embedding ads), an OCR call surfaces that text inside the LLM’s context. Mitigate with ActionGate (see What is not in V1), an allowlist of clickable regions, and a hard cap on actions per session.
  • OCR ambiguity. Tesseract can read Save as Send on noisy fonts; OCR-driven click targets are not deterministic. Pair with image references for sensitive actions.
  • Lateral motion within the host. The LLM can drive any pixel on Host B’s screen, not just the application under test. Run on a dedicated VM and segregate OCULIX_MCP_VAULT from the rest of the filesystem.
  • CVEs in OCR engines. Tesseract has had image-parser CVEs over the years. Pin the version, keep it patched.
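
As an illustration of the gate-based mitigations mentioned for prompt injection, here is a hypothetical policy object combining a tool allowlist with a hard cap on actions per session. The real ActionGate is a Java interface whose signature is not shown in this document; the class and method names below are invented for the sketch, and it allowlists tools rather than screen regions to stay short.

```python
class CappedAllowlistGate:
    """Hypothetical gate: approve only allowlisted tools, up to a per-session cap."""

    def __init__(self, allowed_tools, max_actions_per_session):
        self._allowed = frozenset(allowed_tools)
        self._max = max_actions_per_session
        self._counts = {}  # session id -> number of approved actions so far

    def approve(self, session_id: str, tool: str, arguments: dict) -> bool:
        if tool not in self._allowed:
            return False
        used = self._counts.get(session_id, 0)
        if used >= self._max:
            return False  # cap reached: every further call in this session is refused
        self._counts[session_id] = used + 1
        return True
```

A gate of this shape bounds the damage of an injected instruction: the model can only reach the tools the operator listed, and only a limited number of times before the session has to be re-established.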

Hardware cost — be honest about it. Running a 70B-class open-weight model with usable latency and decent reasoning needs ~80 GB of GPU memory (a single H100, or two A100 80GB, or a Mac Studio M3 Ultra). Smaller models (7B–13B) run on a single consumer GPU but lose enough quality on long tool-use chains that they often aren’t worth it for production QA. This is the real cost of the air-gapped story — it shifts spend from “API tokens at scale” to “one capable GPU in your DC”. For a Tier 1 bank doing thousands of nightly UI tests, that math frequently wins. For a small QA team with a handful of sessions per day, the cloud API + confidential mode + outbound-firewall combo may be cheaper. Both are valid; pick consciously.

Cloud LLM clients (Claude Desktop, Cursor, MCP Inspector) — for development only


The server speaks MCP, so any MCP client can connect to it during development — Claude Desktop, Cursor, the MCP Inspector. This is useful to iterate on prompts and to debug tool schemas. It is not a deployment configuration for any regulated environment, because every tool argument and (in open mode) every screenshot or OCR result is sent to the LLM provider’s API.

{
  "mcpServers": {
    "oculix": {
      "command": "java",
      "args": ["-jar", "/absolute/path/to/oculix-mcp-server.jar", "run"]
    }
  }
}

The server itself is LLM-agnostic by construction:

  • No Anthropic SDK, no OpenAI SDK
  • No model-specific prompt engineering
  • The _meta.llm fields in the audit trail are populated only if the client passes them — left null otherwise

oculix-mcp-server.jar is a fat jar (java -jar runnable):

  • Apertix / OpenCV 4.10 natives: bundled for Windows x86_64, Linux x86_64, macOS x64/aarch64. Template matching works out of the box.
  • Tesseract natives for Linux and macOS: NOT bundled (see oculix-org/oculix#110). Install Tesseract via your system package manager. On Windows, tess4j ships the natives so no extra install is needed.
  • PaddleOCR: runs as a separate Python microservice (paddleocr-server, Flask). The MCP server talks to it via HTTP on 127.0.0.1:5000 by default. If PaddleOCR is not reachable, the server falls back to Tesseract transparently.

What is not in V1

  • Human-in-the-loop approval: the ActionGate interface exists and is called for every tool call, but the default AutoApproveGate always approves. A queue-and-notify implementation can plug into the same interface for per-action operator approval.
  • Deployer attestation templates: the audit schema reserves an activation_acknowledged entry type for posture-changing activations; concrete templates (LLM clients, mode switch, wide allowlist) and the per-feature enforcement are scoped for a near-term release.
  • Multi-monitor region selection: the screen field exists in the region schema but defaults to 0. Full multi-screen UX is scoped for later.
  • Server-initiated SSE notifications: the GET /mcp stream is kept open but no notifications are pushed yet. Inspector treats the stream as “ready” and continues; adding progress events is a later enhancement.

License: MIT, same as OculiX.