Domino 10 Attachment Indexing with Apache Tika
Daniel Nashed – 14 June 2019 04:38:17
Tika is the new engine used to extract text from attachments for full-text indexing in Notes/Domino 10 (client and server).
It replaces the previous KeyView package, which was integrated differently (it relied on the kvoop binary).
The Tika server is a standalone, Java-based, open source server component (http://tika.apache.org/), which the server processes access via curl requests (you can see it when looking into an NSD ;-).
Tika is just a single jar file which is started in a separate JVM instance. It is loosely coupled with Domino.
So it lives in its own "sandbox" and communicates over TCP/IP. It's not based on any Domino code and doesn't share memory or other resources like MQs etc.
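To illustrate what "loosely coupled over TCP/IP" means in practice, here is a minimal sketch of a client asking a Tika server for the plain text of a file. It is not how Domino itself talks to Tika. It assumes a stock Tika server on its default port 9998 with the standard `PUT /tika` endpoint; the port Domino uses and the helper names `build_extract_request` and `extract_text` are assumptions made for this example only.

```python
import urllib.request

# Assumption: a stock Tika server on its default port. The port used by the
# Tika instance that Domino starts is not covered here and may differ.
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build the HTTP request that asks Tika for the plain text of a document."""
    return urllib.request.Request(
        url,
        data=data,                        # raw attachment bytes as request body
        method="PUT",                     # the /tika endpoint takes the document via PUT
        headers={"Accept": "text/plain"}  # ask for extracted text, not HTML
    )


def extract_text(data: bytes, url: str = TIKA_URL, timeout_seconds: float = 60.0) -> str:
    """Send the document to the Tika server and return the extracted text."""
    request = build_extract_request(data, url)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a Tika server running, `extract_text(open("sample.pdf", "rb").read())` would return the text of a (hypothetical) `sample.pdf`. The client only needs HTTP, which is why the Tika process can be restarted or replaced independently of the caller.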
Because it is quite new in Domino, there are some open issues. Some of them have already been addressed in FP2, but some fixes had to wait for FP3.
The good news is that those limitations can be worked around.
I hope this helps you in your deployments. The parameters are only needed on the server side, but they would also work on the Notes client if really needed.
It's the same back-end code which is executed on the client and the indexing works in the same way on the client.
-- Daniel
Tika Shutdown Issues on Linux
One of the issues we ran into is a server shutdown hang. On Linux the Tika process currently does not terminate, which causes my start script to wait for this process to end.
I have added new logic to my start script which automatically kills the process at shutdown (after 30 seconds if no parameter is set; the default in the config file is 20 seconds).
This ensures that your server shutdown continues to complete in time. The logic also stays active for newer versions, just in case.
So we give the Tika process time to terminate before stopping it on OS level.
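The "wait first, then kill" idea can be sketched as follows. This is not the code from my start script (which is a shell script), just a small illustration of the pattern. The function names `process_alive` and `stop_after_grace` are made up for this example, and it assumes a Linux/UNIX system where the PID of the Tika process is already known.

```python
import os
import signal
import time


def process_alive(pid: int) -> bool:
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True


def stop_after_grace(pid: int, grace_seconds: float = 30.0, poll_interval: float = 0.5) -> bool:
    """Give the process time to end on its own, then kill it on OS level.

    Returns True if the process had to be killed, False if it ended by itself.
    """
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not process_alive(pid):
            return False
        time.sleep(poll_interval)

    try:
        os.kill(pid, signal.SIGKILL)  # grace period is over, stop it hard
    except ProcessLookupError:
        return False                  # it ended just before we sent the signal
    return True
```

One caveat of this sketch: if the process is a child of the caller and has not been reaped yet, it still counts as existing, so the check is most reliable for processes started by someone else.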
In addition, the start script has an option to stop/kill the process at run-time. In normal cases killing a Domino process isn't a good idea.
But because Tika is loosely coupled, you can kill it without any impact on the Domino server. The child-died signal also does not cause the server to panic in this case.
See version 3.2.2 update in a separate post for details of the new start script version --> http://blog.nashcom.de/nashcomblog.nsf/dx/new-start-script-version-3.2.2-with-a-tika-stop-server-work-around.htm
Here are the details from HCL support for this issue. This fix is planned for FP3 and Domino 11.
SPR - JPAIB6ZLKG reports Java Tika Process not terminating when Domino Shutdowns cleaning (NON-WINDOWS Platforms Only)
Tuning Tika
Besides that, one customer with very large attachments (mostly PDF) ran into a performance issue.
By default the memory for the Tika process might be too small, and the server does not honor the JVM size parameter.
But there is a general JVM options override parameter, which can be used to pass options to the Tika server.
More control, without this generic parameter, is planned for the next fixpack. But for now this option can be used to increase the memory.
In addition, there is a newer Tika server version 1.20, which might also give you better performance.
Domino uses version 1.18, but because it is a single jar and the server is loosely coupled, you can replace the jar file with the current version from the Tika project website.
It's not fully supported by IBM/HCL today but works in our tests. They are looking into updating the Tika server version for Domino 11 and maybe also for Domino 10 FP3, but that isn't committed.
In addition to this change, you might want to increase the Tika server timeout to give the server more time to respond for larger attachments and higher load on the server.
Here are the parameters that could help you. Note that these examples use the newer Tika version. The jar that ships with Domino has the same name without the additional version number.
Windows:
TIKA_JVM_OPTIONS_OVERRIDE= -Xmx1536m -jar "C:\Program Files\IBM\Domino\tika-server-1.20.jar"
Linux:
TIKA_JVM_OPTIONS_OVERRIDE= -Xmx1536m -jar "/opt/ibm/domino/notes/latest/linux/tika-server-1.20.jar"
DEBUG_TIKA_TIMEOUT=60000