OculiX MCP Server — lightweight remote control for visual automation
The OculiX MCP Server is a small Java daemon that exposes Oculix’s visual automation — template matching, OCR, mouse and keyboard control — as a remote-callable API. It is designed to be driven by external test runners, QA pipelines, internal tooling, and any other client that needs to control a screen from another process.
Under the hood, the wire protocol is Model Context Protocol — JSON-RPC over stdio or Streamable HTTP. We picked it because it is noticeably lighter than the XML-RPC layer used by classic Robot Framework remote libraries (jrobotremoteserver and the like). Less framing overhead, simpler client code, faster round-trips.
The fact that MCP clients also exist on the LLM side (Claude Desktop, Cursor, the MCP Inspector) is a bonus, not the point. If you want to drive Oculix from an LLM you can; if you want to drive it from a Python test script or a Java service, you can too. The server does not know nor care which one is on the other end.
Boundary — what Oculix provides vs what you provide
Oculix ships security primitives. It does not define your security policy. Read this before any deployment in a regulated environment, because the distinction matters legally and operationally.
| Oculix provides (mechanism) | You provide (policy) |
|---|---|
| ActionGate interface called on every tool call | Concrete ActionGate implementation (auto-approve, queue, allowlist, …) matching your threat model |
| Cryptographically signed audit trail (Ed25519 + hash chain) | Where the journal is archived, who reviews it, how often, what counts as a violation, retention period |
| Key rotation, session model, fail-fast on unsigned state | Rotation cadence, who is allowed to rotate, how revocation propagates to your other systems |
| Open vs confidential modes | The decision of which mode is required for which use case, enforced at process-launch time by your own configuration management |
| Refusal to fall back to unsigned mode (no silent degradation) | Monitoring that detects when the server is not running and alerts your on-call |
| Loopback-only by default, TLS termination flag for upstream proxies | Reverse proxy config, certificate rotation, WAF rules, network segmentation |
| No outbound traffic from the server itself | Outbound firewall rules on the host, allow/deny lists of LLM API endpoints, network isolation between the LLM host and the target screen host |
Oculix can demonstrate that an action was taken, by whom, with what arguments. Oculix cannot decide for your security team whether that action should have been allowed in the first place. That decision belongs to your ActionGate implementation, your network policy, your compliance officer, and your regulator-facing review process — all of which are your responsibility.
If you ship Oculix into a regulated environment without a non-trivial ActionGate implementation and a defined audit review process, you have not configured a secure system — you have configured an auditable one. The two are not the same, and conflating them is the source of most automation incidents.
LLM clients are opt-in, never default
The server treats any LLM-backed client (the Claude API, OpenAI, Mistral, Gemini, on-prem Ollama / vLLM; any provider, hosted or local) as a category that the deployer must explicitly enable through a validation mechanism before such a client can drive the server. The default posture is “a QA framework or a deterministic test runner is on the other end”.
This is independent of the LLM-agnosticism of the codebase (no provider SDK is bundled, no model-specific logic ships): that is a build-time property. The opt-in rule is a runtime activation rule that the deployer owns. A vanilla deployment refuses to negotiate with a client whose handshake identifies it as an LLM front-end; flipping that bit is a conscious operator decision tied to a documented policy, not an accidental side-effect of starting the daemon.
The point is to make the LLM-driven case impossible by mistake. If you want it, you turn it on, with the configuration trail to prove it. If you do not, the server stays a plain QA backend regardless of what clients are running on the network.
Deployer attestation — turning activation into an audit entry
For activations that materially change the server’s security posture (enabling LLM clients, switching from confidential to open mode, deploying with AutoApproveGate in production, widening an allowlist), the server ships a deployer attestation template: a small structured form the operator fills before the feature can be activated.
The template is not a legal NDA — Oculix has no contractual relationship with the deployer’s organisation. It is an internal attestation that the deployer issues to itself and to its own compliance function, and that becomes part of the audit chain.
Concretely:
- The server provides a template (YAML/JSON) with required fields: operator identity, organisation, business justification, internal policy reference, target environment, expected duration, fallback path on revocation
- The operator fills it locally before activation
- The server validates the template at activation time and, if valid, writes the template’s canonical hash and full contents as an activation_acknowledged entry, signed by the current audit key
- Future verification (oculix-mcp verify) confirms the attestation entry alongside the operational tool calls; a missing or unsigned attestation for a sensitive mode is a verification failure
- Rotating the audit key preserves the attestation: the entry chains across rotations like any other journal entry
For a compliance officer, this turns “did anyone enable LLM clients in March?” into a deterministic offline check rather than a conversation. For an external regulator, it produces an evidence artefact tying a specific human, with a specific policy reference, to the activation timestamp.
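As a rough illustration of the validate-then-hash step described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field key names, the choice of sorted-key JSON as the canonical form, and the shape of the resulting entry are hypothetical, since the concrete templates are not yet shipped.

```python
import hashlib
import json

# Hypothetical key names for the required fields listed above.
REQUIRED_FIELDS = (
    "operator_identity", "organisation", "business_justification",
    "internal_policy_reference", "target_environment",
    "expected_duration", "fallback_path_on_revocation",
)

def missing_fields(template: dict) -> list:
    """Return the required fields that are absent or blank."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = template.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing

def activation_entry(template: dict, feature: str) -> dict:
    """Validate the template and build an unsigned activation_acknowledged entry."""
    missing = missing_fields(template)
    if missing:
        raise ValueError("attestation incomplete: " + ", ".join(missing))
    # Assumed canonical form: UTF-8 JSON with sorted keys and no extra whitespace.
    canonical = json.dumps(template, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False).encode("utf-8")
    return {
        "type": "activation_acknowledged",
        "feature": feature,
        "template_sha256": hashlib.sha256(canonical).hexdigest(),
        "template": template,
    }
```

Because the hash is taken over a canonical form, two operators filling the same values in a different key order produce the same digest, which is what makes the later offline check deterministic. Signing the entry with the audit key is left out of the sketch.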
Implementation status: the activation slot exists in the audit schema; concrete templates and the per-feature attestation requirement are scoped for a near-term release. See What is not in V1.
Why a remote-control daemon
A SikuliX script written in the IDE drives the screen of the same machine it runs on. That is fine for desktop scripts, but a lot of real-world setups want one of these:
- A CI worker triggers screen actions on a dedicated automation VM sitting on a different host.
- A test harness on a developer laptop drives a remote production-like terminal over VNC.
- A legacy test framework (Robot Framework, Squish-style scripts, internal shell harnesses) needs to call screen actions from outside the SikuliX JVM.
- An on-premise orchestration platform schedules screen automations on pools of machines.
All of these need the same thing: a stable, language-agnostic API to ask “click this image”, “wait for this text”, “screenshot this region”, and get a result back.
That is what oculix-mcp-server is.
What you can call
The server registers nine tools by default. Each is invoked over the standard MCP tools/call method; arguments and return shape are described by the JSON schema returned by tools/list.
| Tool | What it does |
|---|---|
| oculix_find_image | Locate a reference image on screen, return coordinates (non-blocking) |
| oculix_click_image | Find a reference image and left-click it |
| oculix_exists_image | Non-blocking presence check |
| oculix_wait_for_image | Block until a reference image appears, up to a timeout |
| oculix_screenshot | Capture the full screen or a region, return as base64 PNG |
| oculix_type_text | Type literal text at the keyboard focus |
| oculix_key_combo | Press a keyboard combination (Ctrl+C, Cmd+Tab, F5, …) |
| oculix_find_text | OCR: locate a text string on screen, return bounding box |
| oculix_read_text_in_region | OCR: extract the text from a region (open mode only) |
In confidential mode (see below), the two content-bearing tools, oculix_screenshot and oculix_read_text_in_region, are replaced by _to_disk variants that keep pixels and text on the local host and return only a path and a SHA-256 hash.
Each tool is a thin Java class in MCP/src/main/java/org/sikuli/mcp/tools/, wrapping the corresponding Oculix primitive (see Region, Pattern, Match, text and OCR).
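To make the call shape concrete, the following Python sketch builds the JSON-RPC envelopes for tools/list and tools/call. The helper names are illustrative, not part of Oculix; the argument names reference_path and similarity are taken from the curl example later on this page, and the authoritative schema for each tool is whatever tools/list returns.

```python
import json
from itertools import count

# JSON-RPC request ids: any value unique per in-flight request works.
_ids = count(1)

def tools_list_request() -> dict:
    """Envelope asking the server for its tool catalogue and JSON schemas."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": "tools/list"}

def tools_call_request(name: str, arguments: dict) -> dict:
    """Envelope invoking one tool through the standard MCP tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

request = tools_call_request(
    "oculix_click_image",
    {"reference_path": "/tmp/button.png", "similarity": 0.85},
)
print(json.dumps(request, indent=2))
```

The same envelope is used over stdio and over HTTP; only the transport around it changes.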
Why MCP rather than XML-RPC
If you already know jrobotremoteserver (the Java-side daemon used by community Sikuli libraries for Robot Framework), the comparison is straightforward.
| | jrobotremoteserver (XML-RPC) | OculiX MCP Server (JSON-RPC) |
|---|---|---|
| Framing | Verbose XML envelopes | Compact JSON objects |
| Transport | HTTP only | Stdio (no socket needed) or Streamable HTTP |
| Per-call overhead | Higher | Lower |
| Native client surface | Robot Framework Remote Library | Any MCP client or raw JSON-RPC |
| Auth | App-specific | Layered: optional PAT + mandatory ephemeral bearer |
| Session model | Stateless per call | Explicit session, revocable via DELETE |
| Audit | Off the shelf: nothing | Ed25519 hash chain, signed per call, verifiable offline |
The MCP server is a drop-in upgrade for the kind of remote-control workflow XML-RPC was carrying. The audit chain and session model are bonuses on top.
Oculix is API-compatible with SikuliX: existing scripts and existing bindings (including Robot Framework setups using XML-RPC bridges) keep working unchanged. The MCP server is a separate, additional path — not a replacement.
Who can use this — and why it beats rolling your own
The MCP server gives any framework or service that wants visual recognition a complete, audited, language-agnostic backend over a single JSON-RPC connection. Concretely:
| If you… | The MCP server gives you… |
|---|---|
| Drive Selenium / Playwright / Cypress and bolt on OpenCV + Tesseract for visual asserts | A maintained engine with PaddleOCR + Tesseract fallback, exposed by 9 schema’d tools |
| Have a homemade Python harness using pyautogui + pytesseract | A stable Java backend over JSON-RPC, with audit trail and session model out of the box |
| Want to drive screens from a Java / Go / Rust / .NET service that has no SikuliX binding | Plain HTTP + JSON — no SDK, no JNI, no native build |
| Build a new test framework or QA harness from scratch | A visual-recognition backend you don’t have to write, so you can focus on your framework’s ergonomics |
| Run a regulated environment that needs proof of every UI action | Ed25519 signed journal verifiable offline, with verify subcommand |
The point: rather than every framework re-implementing template matching, OCR plumbing, and result-shaping for the hundredth time, plug them all into the same oculix-mcp-server and call its nine tools. The framework keeps its native ergonomics (pytest fixtures, Playwright expect-style, Robot keywords); only the visual primitives are delegated to a single audited backend.
Quick start
1. Build

    mvn -pl MCP -am -DskipTests -Pmcp-fatjar clean package
    # produces MCP/target/oculix-mcp-server.jar

2. Run

Stdio mode (one process, one client, no socket):

    java -jar MCP/target/oculix-mcp-server.jar run

HTTP mode (multiple clients, addressable from outside the process):

    java -jar MCP/target/oculix-mcp-server.jar serve

On first start, the server generates an Ed25519 key pair in ~/.oculix-mcp/ (private key mode 600). Subsequent runs reuse that pair.
3. Verify the audit trail
    java -jar oculix-mcp-server.jar verify

Walks every audit-*.jsonl file in ~/.oculix-mcp/journal/, re-computes hashes, verifies signatures, and checks chain continuity. Outputs OK / FAIL per file.
CLI subcommands
    oculix-mcp run                   Start over stdio (default)
    oculix-mcp serve [flags]         Start over HTTP (Streamable HTTP)
        --host HOST                  (default 127.0.0.1)
        --port PORT                  (default 7337, 0 for auto)
        env OCULIX_MCP_TOKEN         optional client token
        env OCULIX_MCP_MODE          open | confidential
        env OCULIX_MCP_VAULT         confidential landing dir
        env OCULIX_MCP_TRUST_TLS_TERMINATION=1
                                     acknowledge upstream TLS
    oculix-mcp rotate-key            Rotate the Ed25519 audit signing key
    oculix-mcp rotate-session-key    Rotate the HMAC keyring for session tokens
    oculix-mcp recover               Record an unsigned-gap and start a fresh chain
    oculix-mcp verify [FILES...]     Verify audit journals (all by default)
    oculix-mcp --help                Show usage

HTTP transport: calling the server from anywhere
Once serve is running, any HTTP client can drive it.
    java -jar MCP/target/oculix-mcp-server.jar serve
    # → client token: DISABLED (any caller on loopback can initialize)

    # 1. initialize: returns Mcp-Session-Id (header) + bearer (body)
    RESP=$(curl -s -D /tmp/h.txt -X POST http://127.0.0.1:7337/mcp \
      -H "Content-Type: application/json" \
      -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}')
    SID=$(grep -i mcp-session-id /tmp/h.txt | awk '{print $2}' | tr -d '\r')
    BEARER=$(echo "$RESP" | jq -r '.result._meta.bearer')

    # 2. tools/list
    curl -s -X POST http://127.0.0.1:7337/mcp \
      -H "Mcp-Session-Id: $SID" \
      -H "Authorization: Bearer $BEARER" \
      -H "Content-Type: application/json" \
      -d '{"jsonrpc":"2.0","id":2,"method":"tools/list"}' | jq

    # 3. call a tool
    curl -s -X POST http://127.0.0.1:7337/mcp \
      -H "Mcp-Session-Id: $SID" \
      -H "Authorization: Bearer $BEARER" \
      -H "Content-Type: application/json" \
      -d '{"jsonrpc":"2.0","id":3,"method":"tools/call",
           "params":{"name":"oculix_click_image",
                     "arguments":{"reference_path":"/tmp/button.png","similarity":0.85}}}' | jq

    # 4. DELETE: revoke the session
    curl -s -X DELETE http://127.0.0.1:7337/mcp \
      -H "Mcp-Session-Id: $SID" \
      -H "Authorization: Bearer $BEARER" \
      -o /dev/null -w "%{http_code}\n"

No SDK needed: it’s plain HTTP + JSON. Python requests, Go net/http, JS fetch, Java HttpClient, all work directly.
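The same initialize-then-call flow can be sketched with nothing but Python's standard library. This is a minimal illustration, not an official client: the OculixSession class and the injectable transport are hypothetical names, and the sketch assumes the server answers each POST with a plain JSON body, as the curl examples above suggest.

```python
import json
import urllib.request

def http_post(url, body, headers):
    """Default transport: POST a JSON body, return (lower-cased headers, parsed JSON)."""
    req = urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        resp_headers = {k.lower(): v for k, v in resp.headers.items()}
        return resp_headers, json.loads(resp.read() or b"null")

class OculixSession:
    """Keeps the session id and bearer returned by initialize and replays them."""

    def __init__(self, url="http://127.0.0.1:7337/mcp", transport=http_post):
        self.url = url
        self.transport = transport
        self.next_id = 1
        self.headers = {"Content-Type": "application/json"}

    def _rpc(self, method, params=None):
        body = {"jsonrpc": "2.0", "id": self.next_id, "method": method}
        if params is not None:
            body["params"] = params
        self.next_id += 1
        return self.transport(self.url, body, dict(self.headers))

    def initialize(self):
        resp_headers, body = self._rpc("initialize", {})
        self.headers["Mcp-Session-Id"] = resp_headers["mcp-session-id"]
        self.headers["Authorization"] = "Bearer " + body["result"]["_meta"]["bearer"]

    def call_tool(self, name, arguments):
        _, body = self._rpc("tools/call", {"name": name, "arguments": arguments})
        return body
```

Typical use would be `session = OculixSession(); session.initialize(); session.call_tool("oculix_click_image", {"reference_path": "/tmp/button.png", "similarity": 0.85})`. Passing a different transport callable makes the class testable without a running server; session revocation via DELETE is left out for brevity.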
Auth model — two layers
Layer 1: client credential (optional, PAT-style). Gates initialize. Configured via OCULIX_MCP_TOKEN. Unset → anyone on the bound interface can initialize, which is acceptable on loopback. Set it when exposing the server beyond localhost.
Layer 2: session token (mandatory, ephemeral). Minted by the server on every successful initialize, returned in result._meta.bearer. Format:

    ocx.<kid>.<b64url(payload)>.<b64url(hmac)>

The client sends it back as Authorization: Bearer <value> on every subsequent request. Default TTL is 30 minutes; refresh by re-initializing.
A leaked token cannot be replayed against a different session id; a token whose session was DELETEd is rejected with 404 even if still crypto-valid; key rotation leaves outstanding tokens valid until they expire naturally.
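The properties above follow naturally from an HMAC-signed token that names its key and binds a session. The Python sketch below shows one way such a token could be minted and checked. It is an assumption-heavy illustration: the payload fields (sid, exp), the use of HMAC-SHA256, unpadded base64url, and the exact bytes covered by the MAC are guesses, not the server's documented internals. Only the four-part ocx.<kid>.<payload>.<hmac> layout comes from this page.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def mint(kid: str, key: bytes, session_id: str, ttl_s: int = 1800, now=None) -> str:
    """Build ocx.<kid>.<payload>.<mac>; payload fields are assumed, not documented."""
    issued = int(time.time() if now is None else now)
    payload = json.dumps({"sid": session_id, "exp": issued + ttl_s}).encode("utf-8")
    head = f"ocx.{kid}.{b64url(payload)}"
    mac = hmac.new(key, head.encode("ascii"), hashlib.sha256).digest()
    return f"{head}.{b64url(mac)}"

def verify(token: str, keyring: dict, session_id: str, now=None) -> bool:
    """Accept only tokens signed by a key still in the ring, unexpired, for this session."""
    parts = token.split(".")
    if len(parts) != 4 or parts[0] != "ocx" or parts[1] not in keyring:
        return False
    head = ".".join(parts[:3])
    expected = hmac.new(keyring[parts[1]], head.encode("ascii"), hashlib.sha256).digest()
    try:
        if not hmac.compare_digest(expected, b64url_decode(parts[3])):
            return False
        payload = json.loads(b64url_decode(parts[2]))
    except ValueError:
        return False
    current = time.time() if now is None else now
    return payload.get("sid") == session_id and payload.get("exp", 0) > current
```

Keeping retired kids in the keyring is what lets already-issued tokens stay verifiable after a rotation, while removing a kid invalidates every token signed with it. Server-side revocation on DELETE would be an additional lookup on top of this check.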
Session-token key rotation
    oculix-mcp rotate-session-key
    # Session-token keyring rotated.
    # previous kid: k9QwM7eL
    # new kid: kX3pT2aB
    # ring size: 2

The HMAC keyring lives in ~/.oculix-mcp/session-hmac-keyring.json (mode 0600). Generating a new kid keeps old kids in the ring so already-issued tokens remain verifiable until they expire. Delete the file to force all outstanding sessions into hard failure.
Plain HTTP on anything other than a loopback address is refused at startup. If TLS is terminated upstream (nginx, Caddy, service mesh, WAF), set OCULIX_MCP_TRUST_TLS_TERMINATION=1 to acknowledge that responsibility before binding --host 0.0.0.0 or a non-local interface. In-process TLS is planned but not shipped — use a reverse proxy for now; certificate rotation and WAF policy belong there anyway.
Concurrency
Screen actions are serialized through a fair lock: no matter how many clients share one process, two oculix_click_image calls cannot interleave. This is independent of the transport.
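For readers unfamiliar with the term, a fair lock serves waiters in arrival order instead of letting whichever thread wins a race go next. The server is Java and its actual lock is not shown here; the Python sketch below only illustrates the idea with a ticket lock, and the FairLock and screen_action names are made up for the example.

```python
import threading

class FairLock:
    """FIFO ticket lock: callers are served strictly in the order they arrived."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0
        self._now_serving = 0

    def __enter__(self):
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            while ticket != self._now_serving:
                self._cond.wait()
        return self

    def __exit__(self, *exc):
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()
        return False

screen_lock = FairLock()
active = 0       # number of actions currently touching the screen
max_active = 0   # highest concurrency ever observed

def screen_action(_client_id: int) -> None:
    """Stand-in for a tool call such as oculix_click_image."""
    global active, max_active
    with screen_lock:
        active += 1
        max_active = max(max_active, active)
        active -= 1

threads = [threading.Thread(target=screen_action, args=(i,)) for i in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print("max concurrent screen actions:", max_active)
```

Fairness matters here because a plain mutex can starve a slow client when a fast one keeps re-acquiring the lock; with tickets, every queued call gets its turn.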
Modes: open vs confidential
By default, the server registers the 9 tools described above. Two of them return screen content over the wire:
- oculix_screenshot: PNG bytes (base64) back to the caller.
- oculix_read_text_in_region: OCR text back to the caller.
For regulated workflows where captured content must not leave the host (banking screens, healthcare records, internal admin consoles), start the server in confidential mode:
    OCULIX_MCP_MODE=confidential \
    OCULIX_MCP_TOKEN=... \
      java -jar oculix-mcp-server.jar serve

In confidential mode the two content-bearing tools are not registered. The client’s tools/list will not even mention them: there is no filter to bypass; the capability is physically absent. They are replaced by:
- oculix_screenshot_to_disk: captures a PNG into the local vault, returns {path, sha256, width, height, bytes}. Pixels stay local.
- oculix_ocr_to_disk: writes the OCR output to a text file in the vault, returns {path, sha256, engine, line_count, char_count}. Text stays local.
Vault location: $OCULIX_MCP_VAULT if set, else ~/.oculix-mcp/vault/, restricted to owner only (0700 on POSIX).
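The "write locally, return only a receipt" pattern behind the _to_disk tools can be sketched in a few lines of Python. This is an illustration under assumptions: the function names and the digest-based file naming are hypothetical, and the receipt carries only the fields that apply to any artefact (path, sha256, bytes), not the tool-specific ones such as width or engine.

```python
import hashlib
import os
import tempfile
from pathlib import Path

def default_vault() -> Path:
    """$OCULIX_MCP_VAULT if set, else ~/.oculix-mcp/vault/."""
    return Path(os.environ.get("OCULIX_MCP_VAULT") or Path.home() / ".oculix-mcp" / "vault")

def store_in_vault(data: bytes, suffix: str, vault: Path) -> dict:
    """Write the artefact into an owner-only vault and return a content-free receipt."""
    vault.mkdir(parents=True, exist_ok=True)
    os.chmod(vault, 0o700)  # owner-only, even if the directory already existed
    digest = hashlib.sha256(data).hexdigest()
    path = vault / f"{digest[:16]}{suffix}"  # hypothetical naming scheme
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return {"path": str(path), "sha256": digest, "bytes": len(data)}

# Demo against a throwaway directory rather than the real vault.
with tempfile.TemporaryDirectory() as tmp:
    receipt = store_in_vault(b"not-a-real-png", ".png", Path(tmp) / "vault")
    print(receipt["sha256"], receipt["bytes"])
```

The caller can later prove which artefact a test step produced by comparing the SHA-256 in the receipt with the one recorded in the audit chain, without the pixels or text ever crossing the MCP channel.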
| Claim | Holds? |
|---|---|
| Screen pixels never leave the host via the MCP channel | yes |
| OCR results never leave the host via the MCP channel | yes |
| Local audit chain records SHA-256 of every captured artefact | yes |
| The remote caller cannot enumerate content-bearing tools | yes — they are absent from tools/list |
| Caller-supplied arguments (e.g. type_text) never leak | no: the caller is the one writing them |
Audit trail
Every tool call produces a journal entry containing:
- Timestamp (UTC, microsecond precision) plus a monotonic per-file sequence number
- Session id (UUID)
- Client name and version (from the MCP handshake)
- Tool name and full arguments
- SHA-256 hash of the result payload
- prev_hash chaining to the previous entry
- entry_hash of the entry itself
- Ed25519 signature of entry_hash
Rotation markers are signed by the outgoing key before a new key takes over, so the chain remains provable across key rotations. Deleting the private key manually triggers a fail-fast refusal on the next start — the server never falls back to unsigned mode.
Example entry
    {
      "type": "tool_call",
      "ts_utc": "2026-04-14T19:23:45.123456Z",
      "seq": 42,
      "session_id": "c4a1f8e0-3b20-4d6a-9b3f-1f6e4d6c8a11",
      "client": { "name": "qa-runner-internal", "version": "2.4.1" },
      "tool": "oculix_click_image",
      "args": { "reference_path": "/tmp/button.png", "similarity": 0.85 },
      "result_sha256": "3b1ab...",
      "extra": null,
      "prev_hash": "0000...",
      "entry_hash": "f8c2...",
      "signature": "a7e9..."
    }

Special entry types:
- rotation_end / rotation_begin: pair of markers at key / file rotation
- clock_regression: recorded when the wall clock regresses versus the monotonic reference (NTP sync going backward, VM pause)
- recovery_gap: written by the recover subcommand as a separate file
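The prev_hash / entry_hash mechanics can be illustrated with a small Python sketch. It is a simplified model, not the journal's actual format: the canonical serialisation (sorted-key JSON), the all-zero genesis hash, and the helper names are assumptions, and the Ed25519 signature over entry_hash is omitted because it needs a non-stdlib library.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash of the very first entry

def entry_hash(entry: dict) -> str:
    """Hash every field except entry_hash and signature, in an assumed canonical form."""
    body = {k: v for k, v in entry.items() if k not in ("entry_hash", "signature")}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def append(journal: list, fields: dict) -> dict:
    """Chain a new entry to the tail of the journal."""
    prev = journal[-1]["entry_hash"] if journal else GENESIS
    entry = dict(fields, seq=len(journal), prev_hash=prev)
    entry["entry_hash"] = entry_hash(entry)
    journal.append(entry)
    return entry

def verify_chain(journal: list) -> bool:
    """Recompute every hash and check that each entry points at its predecessor."""
    prev = GENESIS
    for entry in journal:
        if entry["prev_hash"] != prev or entry["entry_hash"] != entry_hash(entry):
            return False
        prev = entry["entry_hash"]
    return True

journal = []
append(journal, {"type": "tool_call", "tool": "oculix_click_image"})
append(journal, {"type": "tool_call", "tool": "oculix_type_text"})
append(journal, {"type": "tool_call", "tool": "oculix_screenshot"})
print(verify_chain(journal))
```

Editing, deleting, or reordering any entry changes a hash that a later entry depends on, so tampering surfaces as a broken link. The signature adds the second guarantee: that the chain was written by the holder of the audit key rather than rebuilt from scratch by someone else.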
Key rotation
    oculix-mcp rotate-key

The command:
- Writes a rotation_end marker to the current journal, signed by the outgoing key. The marker includes the SHA-256 of the new public key so operators can anchor trust.
- Archives the outgoing key pair under ~/.oculix-mcp/archive/.
- Generates and installs a new Ed25519 key pair.
- Writes a rotation_begin marker to a fresh journal, chained to the closing marker of the previous file.
Recovery (broken chain)
If the private key is genuinely lost (disk failure, operator error) and the chain cannot be continued:

    oculix-mcp recover

Writes an UNSIGNED_GAP marker and starts a fresh chain under a new key. The discontinuity is explicit and visible to verify.
Startup refusal states
At startup the server checks the state of the key pair against the journal. It starts only in the two consistent states listed below; in every other state it refuses to start, with a clear message pointing at the recovery command:
| Situation | Behaviour |
|---|---|
| No key pair, no journal history | Initialise normally |
| No private key but journal history exists | REFUSE — run recover if intentional |
| Inconsistent key state (only one of priv/pub present) | REFUSE — run recover |
| Key pair present but unreadable / corrupted | REFUSE — restore from backup or recover |
| Current key does not verify the last journal entry | REFUSE — use rotate-key properly, or recover |
| Key pair present and consistent with journal | Start normally |
Silent degradation is a security anti-pattern.
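The table reads naturally as a decision function. The Python sketch below encodes it row by row; the function and enum names are illustrative, and the one case the table does not cover (a consistent key pair with no journal yet) is assumed here to start normally.

```python
from enum import Enum

class Startup(Enum):
    INITIALISE = "initialise"
    START = "start"
    REFUSE = "refuse"

def startup_decision(has_private: bool, has_public: bool, keys_readable: bool,
                     has_journal: bool, key_verifies_last_entry: bool) -> Startup:
    """Map the observed key/journal state to the behaviour listed in the table."""
    if not has_private and not has_public:
        # Fresh install, unless a journal proves a key used to exist.
        return Startup.REFUSE if has_journal else Startup.INITIALISE
    if has_private != has_public:
        return Startup.REFUSE  # only one half of the pair is present
    if not keys_readable:
        return Startup.REFUSE  # unreadable or corrupted key material
    if has_journal and not key_verifies_last_entry:
        return Startup.REFUSE  # key does not match the existing chain
    return Startup.START       # assumption: also covers "keys present, no journal yet"

print(startup_decision(False, False, False, False, False))
```

Note that there is no branch that returns "start unsigned": every inconsistent state ends in REFUSE, which is the no-silent-degradation property expressed as code.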
Air-gapped LLM autonomy — the legitimate path
The combination most users will eventually ask about is “an LLM that drives our screens for QA”. That use case is real, but the only configuration where it makes sense in a regulated environment is fully on-prem: an on-host LLM, the MCP server next to it, no outbound traffic toward any cloud LLM API.

    ┌──────────────────────────┐       ┌──────────────────────────┐
    │ Host A (LLM)             │  RPC  │ Host B (target screen)   │
    │ - Mistral / Llama / etc. │◀─────▶│ - oculix-mcp-server      │
    │   on local GPU           │ priv  │ - the application under  │
    │ - no outbound Internet   │  net  │   test                   │
    └──────────────────────────┘       │ - no outbound Internet   │
                                       └──────────────────────────┘
      outbound firewall:                 outbound firewall:
        DENY api.anthropic.com             DENY *
        DENY api.openai.com                (except localhost
        DENY api.mistral.ai                 paddleocr-server)
        DENY *.googleapis.com

What this configuration guarantees end-to-end:
- Screen pixels never leave the host network
- OCR results never leave the host network
- Tool arguments (typed text, key combos) never leave the host network
- The audit chain stays on Host B, signed by a key whose private half also stays on Host B
- No third-party logging of inferences (no Anthropic / OpenAI / Google audit logs to subpoena)
What it does not protect against — these risks exist regardless of the LLM topology:
- Prompt injection via screen content. If the target screen displays attacker-controlled text (a malicious email rendered in the app under test, a third-party SaaS panel embedding ads), an OCR call surfaces that text inside the LLM’s context. Mitigate with ActionGate (see What is not in V1), an allowlist of clickable regions, and a hard cap on actions per session.
- OCR ambiguity. Tesseract can read Save as Send on noisy fonts; OCR-driven click targets are not deterministic. Pair with image references for sensitive actions.
- Lateral motion within the host. The LLM can drive any pixel on Host B’s screen, not just the application under test. Run on a dedicated VM and segregate OCULIX_MCP_VAULT from the rest of the filesystem.
- CVEs in OCR engines. Tesseract has had image-parser CVEs over the years. Pin the version, keep it patched.
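Two of the mitigations named in the list, an allowlist and a hard cap on actions per session, are simple enough to sketch. The Python below shows only the policy logic; it is not the ActionGate interface, which is a Java interface whose signature is not reproduced on this page, and it allowlists tool names rather than screen regions to stay short. The class and method names are hypothetical.

```python
from collections import defaultdict

class CappedAllowlistPolicy:
    """Approve a call only if the tool is allowlisted and the session is under its cap."""

    def __init__(self, allowed_tools, max_actions_per_session: int):
        self.allowed_tools = frozenset(allowed_tools)
        self.max_actions = max_actions_per_session
        self._counts = defaultdict(int)  # approved actions per session id

    def approve(self, session_id: str, tool: str) -> bool:
        if tool not in self.allowed_tools:
            return False
        if self._counts[session_id] >= self.max_actions:
            return False
        self._counts[session_id] += 1
        return True

policy = CappedAllowlistPolicy({"oculix_find_image", "oculix_click_image"}, 2)
decisions = [
    policy.approve("sess-A", "oculix_click_image"),
    policy.approve("sess-A", "oculix_type_text"),    # not on the allowlist
    policy.approve("sess-A", "oculix_find_image"),
    policy.approve("sess-A", "oculix_click_image"),  # cap of 2 already reached
]
print(decisions)
```

A denied call does not consume budget, so an injected prompt that spams forbidden tools cannot exhaust the session on its own. A real gate would also want to log denials, since a burst of them is itself a signal worth reviewing.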
Hardware cost: be honest about it. Running a 70B-class open-weight model with usable latency and decent reasoning needs on the order of 80 GB of GPU memory, which roughly corresponds to 8-bit weights (about 1 byte per parameter, so about 70 GB) plus working memory. In practice that means a single H100, two A100 80GB, or a Mac Studio M3 Ultra. Smaller models (7B–13B) run on a single consumer GPU but lose enough quality on long tool-use chains that they often aren’t worth it for production QA. This is the real cost of the air-gapped story: it shifts spend from “API tokens at scale” to “one capable GPU in your DC”. For a Tier 1 bank doing thousands of nightly UI tests, that math frequently wins. For a small QA team with a handful of sessions per day, the cloud API + confidential mode + outbound-firewall combo may be cheaper. Both are valid; pick consciously.
Cloud LLM clients (Claude Desktop, Cursor, MCP Inspector) — for development only
The server speaks MCP, so any MCP client can connect to it during development: Claude Desktop, Cursor, the MCP Inspector. This is useful to iterate on prompts and to debug tool schemas. It is not a deployment configuration for any regulated environment, because every tool argument and (in open mode) every screenshot or OCR result is sent to the LLM provider’s API.

    {
      "mcpServers": {
        "oculix": {
          "command": "java",
          "args": ["-jar", "/absolute/path/to/oculix-mcp-server.jar", "run"]
        }
      }
    }

The server itself is LLM-agnostic by construction:
- No Anthropic SDK, no OpenAI SDK
- No model-specific prompt engineering
- The _meta.llm fields in the audit trail are populated only if the client passes them, and left null otherwise
What is in the jar
oculix-mcp-server.jar is a fat jar (java -jar runnable):
- Apertix / OpenCV 4.10 natives: bundled for Windows x86_64, Linux x86_64, macOS x64/aarch64. Template matching works out of the box.
- Tesseract natives for Linux and macOS: NOT bundled (see oculix-org/oculix#110). Install Tesseract via your system package manager. On Windows, tess4j ships the natives so no extra install is needed.
- PaddleOCR: runs as a separate Python microservice (paddleocr-server, Flask). The MCP server talks to it via HTTP on 127.0.0.1:5000 by default. If PaddleOCR is not reachable, the server falls back to Tesseract transparently.
What is not in V1
- Human-in-the-loop approval: the ActionGate interface exists and is called for every tool call, but the default AutoApproveGate always approves. A queue-and-notify implementation can plug into the same interface for per-action operator approval.
- Deployer attestation templates: the audit schema reserves an activation_acknowledged entry type for posture-changing activations; concrete templates (LLM clients, mode switch, wide allowlist) and the per-feature enforcement are scoped for a near-term release.
- Multi-monitor region selection: the screen field exists in the region schema but defaults to 0. Full multi-screen UX is scoped for later.
- Server-initiated SSE notifications: the GET /mcp stream is kept open but no notifications are pushed yet. Inspector treats the stream as “ready” and continues; adding progress events is a later enhancement.
License: MIT, same as OculiX.