Working with text and using OCR features

Added in 2.0.2

OCR Summary

SikuliX uses the Java library Tess4J, which makes the Tesseract features available at the Java level. Internally it depends on Tesseract.

For Tess4J/Tesseract features that are not mentioned here or not supported by SikuliX, you have to dive into the details on the home pages of the respective packages:

Tess4J

Tesseract

There are three feature groups (for method specs follow the links):

handling general options or settings including those of Tesseract (see below)

reading the text from a region on screen or from an image (OCR)

finding the position (Match) of a given text (string/RegEx) in a region on screen or in an image

Special Information for SikuliX version 2 - Lessons learned and BestPractices (incl. features under development and/or evaluation)

Accuracy of text recognition (confidence)

Following the Tesseract recommendations and experience reports found on the net, SikuliX optimizes an image before handing it over to OCR:

convert to grayscale

do some edge sharpening

invert images with light text on dark background to black on white

resize the image to an optimum size

Some comments on the last point (optimum size):

- with Tesseract 3 the recommendation was to resize the image to 200 - 300 dpi
- with Tesseract 4 the best practice seems to be to resize the image so that a capital letter (preferably the X) has an average height of about 30 pixels

Both variants have drawbacks, but the pixel approach seems to be the more promising one (own tests and experience reports from the net). The dpi approach also works, with slightly lower accuracy, but it depends too much on the current screen settings, which in turn are handled differently by Java in different system environments.

To keep usage simple in the average case, SikuliX comes with the following default:

The height of a capital X in the default font used by Java in the current screen environment is taken as the base for the resize to 30 pixels.
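The arithmetic behind the two sizing strategies can be sketched in a few lines of plain Python. This is only an illustration and not SikuliX's actual code: the function names are hypothetical, and the sample values 15.1 and 96 are taken from the status example further down, assuming that `height(15,1)` is written with a decimal comma.

```python
# Illustrative only: hypothetical helpers, not part of the SikuliX API.

TARGET_TEXT_HEIGHT = 30.0  # pixels, the Tesseract 4 rule of thumb quoted above


def factor_by_height(measured_cap_height, target=TARGET_TEXT_HEIGHT):
    """Scale factor that brings a capital letter to the target pixel height."""
    if measured_cap_height <= 0:
        raise ValueError("measured height must be positive")
    return target / float(measured_cap_height)


def factor_by_dpi(screen_dpi, target_dpi=300.0):
    """Scale factor for the older dpi-based recommendation (200 - 300 dpi)."""
    if screen_dpi <= 0:
        raise ValueError("screen dpi must be positive")
    return target_dpi / float(screen_dpi)


# A capital X measured at 15.1 pixels needs roughly a 2x enlargement.
height_factor = factor_by_height(15.1)
# A 96 dpi screen scaled up to 300 dpi needs a bit more than 3x.
dpi_factor = factor_by_dpi(96)
```

With these sample values the two strategies lead to clearly different scale factors, which illustrates why the choice of approach matters.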

There are functions/methods that allow you to tweak this optimization process - see Options.optimization.

If you have problems with accuracy, have a look at the Lessons learned and BestPractices before fiddling around with the height/size options.

There are functions/methods (classes Region, Image, OCR) that tell the OCR engine to treat the image as a single line, a single word or even a single character. In some cases their usage might help to get what you expect.

Generally it makes sense to try with a sample before investing in complex code:

TODO: Example script to be added

If you are interested in the reported accuracy (confidence), you have to use one of the SikuliX features that return text matches:

match.getScore()

which returns a decimal value between 0 and 1 (to be read as a percentage). Very good values are above 0.95, good values are above 0.90.

To get the text in such cases, simply use:

match.getText()
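The score bands mentioned above can be turned into a small helper for post-processing results. The sketch below is plain Python with made-up sample data; `rate_score` and `keep_confident` are hypothetical names, and the (text, score) pairs merely stand in for what match.getText() and match.getScore() would deliver.

```python
# Illustrative only: thresholds taken from the text above, names are hypothetical.

def rate_score(score):
    """Map a confidence score (0..1) to the rough quality bands named above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > 0.95:
        return "very good"
    if score > 0.90:
        return "good"
    return "check manually"


def keep_confident(pairs, minimum=0.90):
    """Keep only (text, score) pairs whose score is above the minimum."""
    return [(text, score) for text, score in pairs if score > minimum]


# Sample data standing in for (match.getText(), match.getScore()) pairs.
samples = [("Save", 0.97), ("Cancel", 0.92), ("He1p", 0.71)]
confident = keep_confident(samples)
```

The label for scores at or below 0.90 is an assumption of this sketch; the text above does not say how such results should be treated.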

Handling OCR options

There is one global options set (OCR.Options) that is used unless another options set is specified.

Using myOptions = OCR.Options() you can create a new options set, derived from the initial global options. It can be modified using the setters shown below (myOptions.XXX(value)) and later be passed to features that accept an options set.

You can also apply the setters to the global options (OCR.globalOptions().XXX(value)) to run OCR with specific defaults. At any time, you can reset the global options to their initial state using reset.

status reports the currently used global options (example for Windows 10 with standard screen settings):

Global settings OCR.options:
data = ...some-path.../tessdata
language(eng) oem(3) psm(3) height(15,1) factor(1,99) dpi(96) LINEAR
configs: conf1, conf2, ...
variables: key:value, ...

This information is usually only relevant when you want to report a problem or when you are using non-standard SikuliX OCR features. More details can be found below.
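When such a status line ends up in a problem report, it can be handy to split it into its fields. The following plain-Python sketch parses the sample line shown above; the helper names are hypothetical, and reading `15,1` as the decimal number 15.1 is an assumption about the locale that produced the output.

```python
import re

# Illustrative only: parses the sample status line shown above.
# Assumption: values such as "15,1" use a decimal comma.

_PAIR = re.compile(r"(\w+)\(([^)]*)\)")


def parse_status_line(line):
    """Turn 'name(value) name(value) ...' into a dict of strings."""
    return dict(_PAIR.findall(line))


def as_number(value):
    """Convert '15,1' or '96' to a float, accepting a decimal comma."""
    return float(value.replace(",", "."))


line = "language(eng) oem(3) psm(3) height(15,1) factor(1,99) dpi(96) LINEAR"
settings = parse_status_line(line)
height = as_number(settings["height"])
```

Tokens without parentheses, such as the trailing LINEAR, are ignored by this sketch.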

For a specific options set (created before using someOptions = OCR.Options()) you can use (Java) someOptions.toString() to get this information as text (use print someOptions in scripts).

The options setters can be chained:

myOptions = OCR.Options().setter(value).setter(value)...

or used alone:

myOptions = OCR.Options()
myOptions.setter(value)
myOptions.setter(value)
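Why both forms give the same result becomes clear from a minimal model of a chainable setter. The class below is a plain-Python stand-in written for this illustration; `DemoOptions` is a hypothetical name and not the SikuliX OCR.Options implementation.

```python
# Illustrative only: a stand-in class, not the SikuliX OCR.Options implementation.

class DemoOptions(object):
    """Tiny model of an options set with chainable setters."""

    def __init__(self):
        self._language = "eng"
        self._psm = 3

    def language(self, value):
        self._language = value
        return self  # returning self is what makes chaining possible

    def psm(self, value):
        self._psm = value
        return self

    def __str__(self):
        return "language(%s) psm(%d)" % (self._language, self._psm)


# Chained form
chained = DemoOptions().language("deu").psm(7)

# Step-by-step form, same result
stepwise = DemoOptions()
stepwise.language("deu")
stepwise.psm(7)
```

Because every setter returns the object it was called on, the chained form is only a shorter way of writing the step-by-step form.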

OCR engine mode (OEM)

Tesseract version 4 internally uses a new detection engine (LSTM) that has again raised accuracy and speed. If the corresponding language models are supplied at runtime (which is the case with SikuliX now), then this engine is used as the default (OEM = 3).

see Options.oem()

Normally there should be no need to run another engine mode.
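For orientation, the engine mode numbers defined by Tesseract 4 can be written down as a small lookup. The sketch is plain Python for illustration only; `OEM_NAMES` and `describe_oem` are hypothetical names and not part of SikuliX.

```python
# Engine mode numbers as defined by Tesseract 4; the names used here are hypothetical.
OEM_NAMES = {
    0: "legacy engine only",
    1: "LSTM engine only",
    2: "legacy and LSTM combined",
    3: "default, based on what is available",
}


def describe_oem(value):
    """Return a readable label for an OEM number, or raise for unknown values."""
    try:
        return OEM_NAMES[value]
    except KeyError:
        raise ValueError("unknown OEM value: %r" % (value,))


default_label = describe_oem(3)
```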

OCR page segmentation mode(PSM)

You can set the page segmentation mode (PSM), which tells Tesseract how to split the given image into rectangles that are supposed to contain readable text.

see Options.psm()

Only in special cases should there be a need to use something other than the default (3).
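The PSM numbers that appear in this document (3, 7, 8 and 10) can be collected in a small helper that picks a mode by the kind of content expected. This is a plain-Python sketch; the constant and function names are hypothetical, while the numbers are the ones Tesseract assigns to these modes.

```python
# PSM numbers as assigned by Tesseract; constant and function names are hypothetical.
PSM_AUTO = 3          # fully automatic page segmentation (the default)
PSM_SINGLE_LINE = 7   # treat the image as a single text line
PSM_SINGLE_WORD = 8   # treat the image as a single word
PSM_SINGLE_CHAR = 10  # treat the image as a single character

_BY_CONTENT = {
    "page": PSM_AUTO,
    "line": PSM_SINGLE_LINE,
    "word": PSM_SINGLE_WORD,
    "char": PSM_SINGLE_CHAR,
}


def psm_for(content):
    """Return the PSM number for 'page', 'line', 'word' or 'char'."""
    if content not in _BY_CONTENT:
        raise ValueError("unsupported content type: %r" % (content,))
    return _BY_CONTENT[content]


line_mode = psm_for("line")
```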

Switch to another language

By default, SikuliX runs the text features with the English language set, which is bundled with SikuliX. It is possible to add more languages to your SikuliX setup and switch between the installed languages at runtime.

These are the steps to switch to a language other than the standard English (eng):

Step 1: Find the folder SikulixTesseract/tessdata in your SikuliX <app-data> folder (see docs)

Step 2: Download the languages needed from Tesseract languages (only the files with .traineddata)

For SikuliX version 2.0.x+ we use the files for Tesseract 4 (preferably those from tessdata_fast)

For earlier versions up to 1.1.3 use the files for Tesseract 3 (no longer supported).

Step 3: Put the .traineddata files into the tessdata folder (Step 1)
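Before switching languages it can save time to verify that the files from the steps above really are in place. The following plain-Python sketch checks a folder for the expected <lang>.traineddata files; `missing_languages` is a hypothetical helper, and the demo uses a temporary folder instead of a real SikuliX tessdata folder.

```python
import os
import shutil
import tempfile

# Illustrative only: hypothetical helper, not part of the SikuliX API.


def missing_languages(tessdata_dir, languages):
    """Return the language codes that have no <lang>.traineddata in the folder."""
    missing = []
    for lang in languages:
        candidate = os.path.join(tessdata_dir, lang + ".traineddata")
        if not os.path.isfile(candidate):
            missing.append(lang)
    return missing


# Demo with a throwaway folder that contains only the English data file.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "eng.traineddata"), "w").close()
still_missing = missing_languages(demo_dir, ["eng", "deu"])
shutil.rmtree(demo_dir)
```

In a real setup the first argument would be the tessdata folder found in Step 1.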

In a script that should use the language, set it before using an OCR feature (see Options.language()):

OCR.globalOptions().language("xxx")

Another way to set a default language to be used after startup globally:

Settings.OcrLanguage = "xxx"

This is then recognized with each subsequent script start in the same IDE session.

Have your own Tesseract datapath

Instead of the standard location mentioned above, you can have your own folder with everything that Tesseract needs at runtime. If you want to do that, simply set:

Settings.OcrDataPath = <some absolute Path>

Set this before the text recognizer is started, and take care that all relevant files are in a subfolder tessdata.

This is then recognized with each subsequent script start in the same IDE session.

Use Options.dataPath() to switch the path dynamically:

OCR.globalOptions().dataPath(someAbsolutePath)

Other possibilities to tweak the Tesseract OCR process

For Tesseract variables, configurations, training and other gory details, you have to consult the Tesseract documentation.

But before you step into Tesseract, you should read about Lessons learned and BestPractices.

To set a variable, which is a single Tesseract setting that controls a specific aspect of the OCR process, use Options.variable().

To set a configuration, which is a file containing a set of variables that configure the behaviour of a tailored OCR process, use Options.configs().
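To make the relation between variables and configuration files more tangible, the sketch below reads the text of a config file into a dictionary of variables. It is plain Python and rests on assumptions: that each line holds a variable name followed by its value, and that blank lines and lines starting with `#` are skipped. `parse_config_text` is a hypothetical helper; the two variable names in the sample are existing Tesseract variables chosen only as examples.

```python
# Illustrative only: assumes one "name value" pair per line.


def parse_config_text(text):
    """Read 'name value' lines into a dict, skipping blanks and '#' comments."""
    variables = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        variables[name] = value
    return variables


sample = """
# digits only
tessedit_char_whitelist 0123456789
preserve_interword_spaces 1
"""
parsed = parse_config_text(sample)
```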

The Text and OCR features in Detail (Class OCR)

All methods of class OCR are static. There is no need to create or start an OCR instance — the engine is initialized on first use and reset between IDE script runs.

For each reading method there are two flavours: one that uses the global options (OCR.globalOptions()) and one that takes an explicit OCR.Options instance. Use the second when you need a one-off configuration without disturbing the global state.

The accepted source types (referred to in the signatures as SFIRBS) are: a file name (String), a File, an Image, a Region, a BufferedImage or a ScreenImage.

Reading text from a source

`OCR.readText(from)` / `OCR.readText(from, options)`

Read all text from the given source.

from — a SFIRBS source (see above)
options — an OCR.Options instance to use for this call only

Returns: the recognized text as a single String (lines separated by \n).

text = OCR.readText(some_region)
print text

`OCR.readLine(from)` / `OCR.readLine(from, options)`

Read the source assuming it contains a single line of text. Internally this sets PSM to 7 for the call, so the engine treats the whole image as one line.

from — a SFIRBS source
options — optional OCR.Options

Returns: the recognized text as a String.

`OCR.readWord(from)` / `OCR.readWord(from, options)`

Read the source assuming it contains a single word. Sets PSM to 8 for the call.

Returns: the recognized word as a String.

`OCR.readChar(from)` / `OCR.readChar(from, options)`

Read the source assuming it contains a single character. Sets PSM to 10 for the call.

Returns: the recognized character as a String.

`OCR.readLines(from)` / `OCR.readLines(from, options)`

Same as readText but returns the text split into lines as Match objects, so you keep the bounding box and confidence score of each line.

Returns: a List<Match>.

for m in OCR.readLines(some_region):
    print m.getText(), "@", m, "score=", m.getScore()

`OCR.readWords(from)` / `OCR.readWords(from, options)`

Same as readText but returns the text split into words as Match objects.

Returns: a List<Match>.

The OCR.Options class — configuration

An OCR.Options object groups all the knobs that influence the Tesseract engine: OEM, PSM, language, datapath, image-resize behaviour, Tesseract variables and configs files. There is one global singleton used by default, but you can create your own and pass it explicitly to any reading method.

Setters are chainable and always return the same Options instance.

opts = OCR.Options().language("fra").psm(7).textHeight(28)
text = OCR.readText(some_region, opts)

`OCR.globalOptions()`

Access the current global Options (singleton). Use this if you want to change the defaults for all subsequent reads.

Returns: the global Options instance.

`OCR.Options()` (constructor)

Create a new Options set, initialized from the current global options. Modifications to the returned instance do not affect the global state.
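The idea of an independent options set can be modelled with a deep copy. The sketch below is plain Python with a dictionary standing in for an options set; `GLOBAL_DEFAULTS` and `derive_options` are hypothetical names, and it does not show how SikuliX implements this internally.

```python
import copy

# Illustrative only: a dict stands in for an options set.
GLOBAL_DEFAULTS = {"language": "eng", "psm": 3, "variables": {}, "configs": []}


def derive_options(base):
    """Return an independent copy, including the nested dict and list."""
    return copy.deepcopy(base)


mine = derive_options(GLOBAL_DEFAULTS)
mine["language"] = "fra"
mine["variables"]["tessedit_char_whitelist"] = "0123456789"
```

A shallow copy would not be enough in this model, because the nested variables dictionary would still be shared with the global set.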

`Options.clone()`

Make a copy of this Options.

Returns: a new Options instance with the same settings.

`Options.reset()`

Reset this Options set to the initial SikuliX defaults:

oem        = OcrEngineMode.DEFAULT
psm        = PageSegMode.AUTO
language   = Settings.OcrLanguage
dataPath   = null         # resolved on first use
textHeight = getDefaultTextHeight()
variables  = {}
configs    = []

When dataPath is null, it will be resolved at the next OCR call to either the SikuliX default path or Settings.OcrDataPath (if set).

Returns: this Options.

`OCR.reset()`

Reset the global options to the initial defaults. Equivalent to OCR.globalOptions().reset().

Returns: the global Options.

`OCR.status()`

Print the current global options to the message area. Useful for problem reports and when you want to know what defaults you are starting from.

`Options.toString()`

Current state of this Options as some formatted lines of text:

OCR.Options:
data = .../tessdata
language(eng) oem(3) psm(3) height(15,1) factor(1,99) dpi(96)
configs: conf1, conf2, ...
variables: key:value, ...

In a script use print someOptions. Java: someOptions.toString()

Returns: the formatted string.

`Options.oem()` / `Options.oem(value)`

Get or set the OCR engine mode. The setter accepts an int or an OEM enum constant.

value — int or OEM enum constant

Returns: the current oem as int (getter) or this Options (setter).

`Options.psm()` / `Options.psm(value)`

Get or set the page segmentation mode. The setter accepts an int or a PSM enum constant.

value — int or PSM enum constant

Returns: the current psm as int (getter) or this Options (setter).

`Options.resetPSM()`

Sets the PSM to -1, which tells Tess4J not to set the PSM at all and leave whatever Tesseract decides internally. Only use it if you know what you are doing.

Returns: this Options.

`Options.asLine()` / `Options.asWord()` / `Options.asChar()`

Convenience setters that configure the Options to recognize, respectively, a single line (PSM 7), a single word (PSM 8) or a single character (PSM 10).

Returns: this Options.

`Options.language()` / `Options.language(lang)`

Get or set the current Tesseract language.

According to Tesseract, the language is a 3-lowercase-letters string like eng, deu, fra, rus, … In special cases it might look like xxx_yyy (chi_sim), xxx_yyyy (deu_frak) or even xxx_yyy_zzzz (chi_tra_vert), but always all lowercase.

Make sure the corresponding <lang>.traineddata file is present in your tessdata folder before the next OCR call. The setter must not be passed null or an empty string.
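The naming pattern described above lends itself to a quick plausibility check before a value is handed to the setter. The sketch is plain Python; the regular expression is derived only from the description in this section, and `is_plausible_language` is a hypothetical helper that does not check whether the language file actually exists.

```python
import re

# Pattern built from the description above: three lowercase letters,
# optionally followed by one or two "_xxx" / "_xxxx" suffixes.
_LANG = re.compile(r"^[a-z]{3}(_[a-z]{3,4}){0,2}\Z")


def is_plausible_language(code):
    """True if the value looks like a Tesseract language short string."""
    return isinstance(code, str) and bool(_LANG.match(code))


checked = [(code, is_plausible_language(code))
           for code in ("eng", "chi_sim", "deu_frak", "chi_tra_vert", "ENG", "")]
```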

lang — language short string

Returns: the language string (getter) or this Options (setter).

`Options.dataPath()` / `Options.dataPath(path)`

Get or set the folder where Tesseract looks for language files and configs. The path you give may omit the trailing /tessdata.

The getter may return null if no OCR call has been made yet — in that case the path is resolved on first use to the SikuliX default or to Settings.OcrDataPath (if set).

Take care that everything is in place by the time you actually trigger OCR.
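The two rules described above, the optional trailing tessdata and the fallback when no path is set, can be sketched in plain Python as follows. The helper names are hypothetical, and the lookup order (explicit path, then Settings.OcrDataPath, then the SikuliX default) is this sketch's reading of the description, not a statement about the actual implementation.

```python
import os

# Illustrative only: hypothetical helpers, not part of the SikuliX API.


def with_tessdata(path):
    """Append 'tessdata' unless the path already ends with it."""
    trimmed = path.rstrip("/\\") or path  # keep a bare root path intact
    if os.path.basename(trimmed) == "tessdata":
        return trimmed
    return os.path.join(trimmed, "tessdata")


def resolve_data_path(explicit, settings_path, default_path):
    """Pick the first path that is set: explicit, then settings, then default."""
    for candidate in (explicit, settings_path, default_path):
        if candidate:
            return with_tessdata(candidate)
    raise ValueError("no data path available")


# No explicit path given, so the settings value wins over the default.
resolved = resolve_data_path(None, "/data/custom", "/home/user/default")
```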

path — absolute path as a String

Returns: the current path (getter) or this Options (setter).

`Options.textHeight()` / `Options.textHeight(pixels)`

Get or set the target text height in pixels used for image optimization before OCR. SikuliX resizes the image so the average height of a capital letter approaches this value (default ≈ 30).

Only tweak this if the defaults do not give acceptable results.

pixels — number of pixels

Returns: the current height (getter) or this Options (setter).

`Options.fontSize(size)`

Convenience setter that takes a font size in points and converts it internally to the corresponding textHeight in pixels (measured on a capital X with the default Java font on the current screen).

Only tweak this if the defaults do not give acceptable results.

size — font size as int

Returns: this Options.

`Options.lightFont()` / `Options.isLightFont()`

Convenience: configure image optimization for a light font on dark background (no automatic inversion). isLightFont() returns the current state.

Returns: this Options (setter) / boolean (getter).

`Options.smallFont()`

Convenience: configure image optimization for small fonts (pixel height below 12, font size below 10). Might give better results when the text is tiny.

Returns: this Options.

`Options.variable(key, value)`

Set a single Tesseract variable. You should know what you are doing — consult the Tesseract docs.

key — Tesseract variable name
value — value as String

Returns: this Options.

`Options.variables()`

Returns: the Map<String,String> of variables currently set on this Options.

`Options.configs()` / `Options.configs(...names)` / `Options.configs(list)`

Get or set the list of Tesseract config-file names to apply. The setter accepts a varargs of strings or a List<String>. Each name refers to a file in the tessdata/configs folder.

Returns: the current list (getter) or this Options (setter).
