Domino on Linux/Unix, Troubleshooting, Best Practices, Tips and more ...

Daniel Nashed

tesseract -- Teaching Tika to read image formats

Daniel Nashed – 10 May 2025 09:37:14

While looking into what I can do with Domino IQ, I ran some experiments with LLMs in general.

Out of the box, Domino IQ does not support images today.
But sending images to an LLM might not even be the best option, depending on your use case.


If you are looking for real image processing a visual model might be a good choice.

But in many cases you are looking for "text processing" in images. For example when processing invoice scans or similar documents.


There is something that has been available for a very long time called "OCR", which we might have forgotten during the AI hype.

It turns out that OCR in combination with an LLM could be a far more efficient and effective way to get text out of images.


So the idea is to pre-process images before running an LLM query.


Domino uses Apache Tika in the background to extract data from many formats.

But out of the box Tika cannot process images.



Tesseract: an interesting project with a long history


It turns out that there is a free package which works stand-alone on the command line, is available as a C/C++ based lib to include in your applications, and also integrates into Tika.

In fact it is well integrated into Tika, even though it is a separate project.


One of the reasons it is separate is that, being C/C++ based, it does not fit into the Java-based Apache project.

But Tika automatically detects it when installed, and it is included, for example, in the Ubuntu distribution.


You can find details about the project here: https://tesseract-ocr.github.io
But let me show you how simple it is to use.


Once you have it installed on Linux, you can just run it from the command line.

The command line is pretty simple. You specify the image file and an output base name. Tesseract appends .txt to that name.


Example:


tesseract invoice.png invoice
cat invoice.txt
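If you have more than one scan, the same call can be wrapped in a small loop. This is only a sketch: the function name ocr_dir is made up for illustration, it only looks at *.png files, and it assumes the language packs you pass with -l (for example eng+deu) are installed.

```shell
# Hypothetical helper: OCR every PNG in a directory.
# Usage: ocr_dir <directory> [languages], e.g. ocr_dir ./scans eng+deu
ocr_dir() {
  dir=$1
  langs=${2:-eng}
  for img in "$dir"/*.png; do
    # Skip the unexpanded pattern if the directory has no PNG files
    [ -e "$img" ] || continue
    # Tesseract appends .txt to the output base name itself
    tesseract "$img" "${img%.png}" -l "$langs" || return 1
  done
}
```

Calling ocr_dir ./scans eng+deu should leave one .txt file next to each PNG.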


Tika directly integrates with it and finds it once installed.


Notes/Domino leverages Apache Tika in the background, running it on localhost.
You should not try to use the Domino Tika instance, because it is controlled by the full-text index back-end of Domino and is started and stopped as needed.

But you can start your own Tika instance.


You can either download the latest version or use the one included in Domino.

In this example I am downloading it manually before running it.



Running Tika stand-alone


curl -L https://dlcdn.apache.org/tika/3.1.0/tika-server-standard-3.1.0.jar -o tika-server.jar
java -jar tika-server.jar > tika.log 2>&1 &
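Because the server is started in the background, it can take a few seconds before it accepts requests. A minimal sketch to wait for it, assuming the default port 9998 and Tika's /version endpoint; the function name wait_for_tika and the default of 30 attempts are my own choices for illustration:

```shell
# Hypothetical helper: wait until a Tika server answers on its /version endpoint.
# Usage: wait_for_tika [base-url] [max-attempts]
wait_for_tika() {
  url=${1:-http://localhost:9998}
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    # -f makes curl return a non-zero status on HTTP errors
    if curl -sf -o /dev/null "$url/version"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo "Tika did not respond at $url" >&2
  return 1
}
```

You could then run wait_for_tika && echo "Tika is up" before sending the first file.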


Tika provides a simple REST based API. Notes/Domino uses the exact same interface.


With this interface you can get text from a file you send as the binary body of a request.
By the way, there are also other endpoints which classify attachments in detail.


For a full reference of the Tika REST API, check this link: https://cwiki.apache.org/confluence/display/TIKA/TikaServer

But in our case we just want to send a plain request to get the text from an image.

With Tesseract installed, Tika does support image formats.


This interface can be used from the command line or from your own applications -- provided you find a way to send binary data.

The LotusScript HTTP request class currently does not support sending binary data.

And it would be much more efficient to run the extraction on the server side.


But this is a general free option you can leverage in your applications not just for scanning images.

You can use Tika for your needs. But you need your own instance running on a different port (because the embedded instance is currently only usable by Domino FT indexing).



curl -T invoice.png http://localhost:9998/rmeta/text | jq
curl -T invoice.png http://localhost:9998/rmeta/text | jq -r '.[0]."X-TIKA:content"'
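If you only need the text and not the metadata, the /tika endpoint with an Accept: text/plain header returns it directly, so jq is not needed. A small sketch, again assuming your own instance on the default port 9998; the function name tika_text is hypothetical:

```shell
# Hypothetical helper: print the plain text Tika extracts from a file.
# Usage: tika_text <file> [base-url]
tika_text() {
  file=$1
  url=${2:-http://localhost:9998}
  # -T uploads the file as the request body using HTTP PUT
  curl -sf -T "$file" -H "Accept: text/plain" "$url/tika"
}
```

For example: tika_text invoice.png > invoice.txt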


Domino Tika supporting image indexing


But this does not only work on the command line. Once you have installed Tesseract, Notes/Domino can also index images in attachments.


You could install Tesseract OCR on Linux and have Domino Tika use image processing.


To get this working you also have to include those attachment types in FT indexing.
They are disabled by default because Tika cannot process them. So they are not sent to Tika for indexing.

But there is a way to specify your own list of attachment types.


In my testing Tesseract made the CPU quite busy, even for a couple of test attachments.

top showed that Tika invoked it multiple times in parallel during attachment re-indexing (updall -x db.nsf).



FT_USE_ATTACHMENT_WHITE_LIST=1

FT_INDEX_FILTER_ATTACHMENT_TYPES=*.jpg,*.png,*.pdf,*.pptx,*.ppt



Building a container image with Tesseract support


The Domino container project supports adding Linux packages. Sadly the package is not available on Red Hat UBI.

But you can use Ubuntu as the base image of your container and just have the packages added at build time.



./build.sh menu -from=ubuntu -linuxpkg "tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu"




Running your own Tika Server with Tesseract support


Here is a simple test using an Ubuntu docker container.

This could eventually be turned into its own container image.

You also need to download the Tika server separately. But in a Domino container you would already have Tika installed.


docker run --rm -it ubuntu bash

apt update

apt install -y openjdk-21-jdk curl jq tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu



Alpine Linux would be the better choice for a container


Alpine also supports Tesseract. But it does not include Tika directly either.
Here is a simple command line to install it. Alpine pulls in far fewer packages, as you will notice when you run those commands.



docker run --rm -it alpine sh

apk update
apk add openjdk21 curl jq tesseract-ocr tesseract-ocr-data-eng tesseract-ocr-data-deu
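In either container it is worth checking that the language packs really arrived before you start Tika. A minimal sketch based on tesseract --list-langs; the function name has_ocr_lang is made up for illustration:

```shell
# Hypothetical helper: succeed if Tesseract reports the given language pack.
# Usage: has_ocr_lang deu
has_ocr_lang() {
  # Some Tesseract versions print the list to stderr, so merge both streams
  tesseract --list-langs 2>&1 | grep -qx "$1"
}
```

For example: has_ocr_lang deu || echo "German language pack is missing"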




My Conclusion & your feedback


I would not add Tesseract to a Domino server for Tika and change the Tika indexing globally.

This was just to show how far we could go. And maybe HCL wants to look into the Tesseract option in some way.

It could also be built into Notes/Domino itself to allow text extraction from images.


I would look into Tika as a separate service you use for your own applications and leave FT indexing alone for now.

Tika itself with or without this extension is another tool in your arsenal for building cool applications.


The tika-server.jar comes with every Notes client and Domino server.
You could run it for your own applications today under your control.
The only challenge is really to send binary data post requests to Tika.


Local Tesseract support on a server could invoke the binary like Tika does.
Or you could use their C lib to add it to your own C-API based solutions.


I thought about building a DSAPI filter to provide Tika functionality.

And I would be interested to hear if this would make sense from your point of view.

I already have libcurl code to talk to Tika from a performance troubleshooting project.

It can run on databases to extract attachment data and write the results into a document.


This blog post is meant to raise awareness of Tika and Tesseract.
And it might be food for thought for your own integrations and requirements.


Did anyone work with Tesseract and/or Tika outside Domino before?

What is your feedback and what are your use cases?


I could build a simple Docker container for reuse putting both components into one new Tika service.

But again, the bigger challenge is how to access it from Notes/Domino.


