Domino on Linux/Unix, Troubleshooting, Best Practices, Tips and more ...

Daniel Nashed

Tika in Notes/Domino

Daniel Nashed – 9 September 2021 19:01:12

At DNUG Domino this week there were some interesting questions about how Tika works and how it could be used to search attachments.
I explained that Notes/Domino uses Tika only in the back end to index attachments; it is not used for searching them.

So Tika feeds attachment text extracts into the indexing process and is not part of the search operations.
Tika replaced the legacy external "KeyView" package, which had been used for a long time. Tika is an open source Apache project.

When Tika was first introduced, I ran an NSD to see what happens under the covers.
You can see from the call stack below that the Notes/Domino indexing processes communicate with Tika using libcurl functionality.

After being started by Notes/Domino, Tika listens on localhost only, and any Notes/Domino process can send requests to it.


Call stack from indexing attachments in a database

############################################################
### thread 1/12: [ nUpdate:  0450:  0f84]
### FP=0x001ab4e8, PC=0x77b315b0, SP=0x001ab4e8
### stkbase=0x001b0000, total stksize=81920, used stksize=19224
############################################################
 [ 1] 0x77b315b0 kernel32.TlsGetValue+48 (1ab640,FFFFFFFFFB3B4C0,100000000,10)
 [ 2] 0x7FEFCF06FF9 mswsock+28665 (4c0,7FEFDB9D0D8,76C6F2242BD7,0)
 [ 3] 0x7FEFDB9507C WS2_32.select+348 (0,4c0,0,7FEFDB9507C)
 [ 4] 0x7FEFDB94FFD WS2_32.select+221 (4c1,0,FFFFFFFFFFFFFFF,a0)
@[ 5] 0x7FEF2044936 nnotes.Curl_socket_check+582 (4c1,0,0,0)
@[ 6] 0x7FEF2049B9F nnotes.Curl_readwrite+159 (0,15c73b20,1abf89,0)
@[ 7] 0x7FEF2035E41 nnotes.multi_runsingle+3393 (15c78ce0,1ac010,cc61600,0)
@[ 8] 0x7FEF2034436 nnotes.curl_multi_perform+118 (0,0,15c78ce0,3e8)
@[ 9] 0x7FEF20308B4 nnotes.easy_perform+404 (0,7FE00004E2B,7FE00000000,1)
@[10] 0x7FEF20267CB nnotes.GetChar+4651 (0,d104ae0,cc616d8,13)
@[11] 0x7FEF2029E37 nnotes.FTGetDocStream+519 (0,7FEF03A9C30,243,1accb0)
@[12] 0x7FEF03A9D08 nftgtr40.NotesStreamReadChar+216 (45,290021000C,d104ae0,B000007E2)
@[13] 0x7FEF202DDFE nnotes.FTLexMatch+142 (d104b88,1ad080,7FEF03A9C30,d1031c0)
@[14] 0x7FEF03A752A nftgtr40.FTGCreateIndex+1466 (d100003,1ad2f0,0,465687300000000)
@[15] 0x7FEF03A2F2E nftgtr40.CFTNoteIndexer::ProcessDoc+350 (d1031c0,90ef,121de,0)
@[16] 0x7FEF03A5F81 nftgtr40.FTGIndexIDProc+817 (0,121de,fe,0)
@[17] 0x7FEF275F20B nnotes.IDEnumerate+235 (20000466,1ad5f0,0,125833F003F0036)
@[18] 0x7FEF03A519D nftgtr40.FTGIndex+6701 (1cdcc48,1,1148,1cdcc48)
@[19] 0x7FEF2021690 nnotes.FTIndexExt2+4416 (243,20001148,0,0)
@[20] 0x13F51BE68 nUpdate.UpdateFullTextIndex+488 (0,3fff,1aeaf0,0)
@[21] 0x13F51BB26 nUpdate.UpdateCollectionsExt+3318 (0,1aeaf0,7FEF2740001,4000000)
@[22] 0x13F51AE27 nUpdate.UpdateCollections+135 (1aeb38,7FE00000001,1aef40,20000012)
@[23] 0x13F51483B nUpdate.PerformRequest+715 (0,1e,cd60098,0)
@[24] 0x13F517978 nUpdate.Update+3576 (3304,1,0,3)
@[25] 0x13F511181 nUpdate.AddInMain+385 (32cc38,0,11700001,0)
...

Tika documentation

The Tika service is fully documented and offers a simple REST-based interface.
See this link for the full documentation, including the REST interface: https://cwiki.apache.org/confluence/display/TIKA/TikaServer
This makes Tika available for attachment-based applications outside the standard use case.


Using Tika for your own text filtering

The discussion yesterday at the online conference showed that the customer needs a different approach.

They have to analyze attachments already in the routing phase to categorize and re-route messages based on sender, text, subject and also attachment content.

For an application like this, Tika could be a good candidate to extract the attachment data. This would need a custom solution to send the attachments to Tika.
I would probably use a separate Tika instance on another port. Sending attachments over a REST interface isn't rocket science if you know how to use libcurl.
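
A minimal sketch of such a request could look like the following. It is written in Python with the standard library rather than C with libcurl, and it assumes a Tika server that was started by hand on its default port 9998. The URL and the function names are placeholders for illustration, not what Notes/Domino uses internally. The attachment bytes are sent with PUT to the /tika endpoint and plain text is requested back.

```python
import urllib.request

# Assumption: a standalone Tika server started manually on its default port.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request for the Tika server /tika endpoint (text extraction)."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={
            # Ask for the extracted text as plain text.
            "Accept": "text/plain",
            # Generic type; Tika is expected to detect the real format itself.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path: str, base_url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Send one file to Tika and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a Tika server running, calling extract_text() on a local file returns the extracted text stream. A request built the same way against the /meta endpoint would return the meta information instead.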

I am thinking about building a flexible and customizable routing and processing add-on for Domino leveraging Tika in the back end.

This will need an extension manager hook to stop the mail in mail.box and a small server task to process the message.
And this would open new business cases for how Domino could be used to route mail for different types of purposes.

But back to the troubleshooting approach.


Troubleshooting and analysis

In a customer situation I needed to figure out how certain attachment types are handled by Tika and which results we get back.

There is debugging on the Notes/Domino side, which helps to trace requests.

notes.ini DEBUG_FILTER_TIMING=1

This setting results in console.log output that is easy to parse:

13.05.2021 12:34:54 Tika Attachment Filtering - took 4539 msecs Filtering Attachment 'domino_backup.odp' in '/local/notesdata/tika.nsf' (DocID = 2330), size = 402668, occurrences = 1785

This was our first step to look into what was going on in the customer environment.
I wrote some code to parse the log results into a Notes database.
I have similar tools for client_clock, server_clock, Domino iostat, sem debug and other files. All of these tools grew over the years out of on-site troubleshooting.

It just takes the data that was already there and puts it into documents that can be sorted.
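
A minimal sketch of this kind of parsing, in Python rather than a Notes database tool, could look like this. The regular expression is derived only from the single sample line shown above, so other log variants may need adjustments. The function and field names are made up for illustration.

```python
import re
from typing import Optional

# Pattern derived from the one sample line in this post; other variants may exist.
FILTER_LINE = re.compile(
    r"(?P<timestamp>\S+ \S+) Tika Attachment Filtering - "
    r"took (?P<msecs>\d+) msecs Filtering Attachment '(?P<attachment>.+?)' "
    r"in '(?P<database>.+?)' \(DocID = (?P<doc_id>\w+)\), "
    r"size = (?P<size>\d+), occurrences = (?P<occurrences>\d+)"
)


def parse_filter_line(line: str) -> Optional[dict]:
    """Return the fields of one timing line, or None if the line does not match."""
    match = FILTER_LINE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    # Numeric fields become integers so they can be sorted and summed.
    # The DocID stays a string because its number base is not stated in the log.
    for key in ("msecs", "size", "occurrences"):
        record[key] = int(record[key])
    return record
```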

This approach already provided more details about how Tika behaves with different attachment types.
But it made me curious to find out how Tika really works.
So I ended up writing my own C-API based analysis and performance troubleshooting tool.


Analysis application to benchmark and troubleshoot Tika

The tool crawls a whole NSF and sends all attachments to a Tika process that you can run manually on the same machine.
The requests are sent to Tika in much the same way Notes/Domino sends them.

All information returned from the request, including meta information coming from Tika, is logged into trace documents.
In addition, the tool hooks into the Tika process to get performance data directly from it.
Assuming we are the only thread sending requests, this data can be aligned with the other results.

This showed very detailed information about what is returned from Tika.
I am getting the full text stream back, but I am only checking its size.

I found out some very interesting details that might be similar in other environments.

Here is an example:

Image:Tika in Notes/Domino



Analyzing logs and getting best practices

Notes/Domino can only be as good as the data handling of the Tika process.
And every new Tika version might bring better results.

But there are some general rules for optimization.
  • You should exclude all graphics formats if not really required!
    Some graphic formats cause a lot of overhead and return very little useful text data.
  • All types of ZIP/compressed files should be avoided, because the exclusions always apply to the name of the attachment you pass, not to the files inside it.
    A text file extracted from a ZIP file might get you huge results back.
    For example, a zipped NSD is a small attachment. But expanded it can be huge!
  • PDFs took some time in my environment. But you can't avoid PDFs.
    It wasn't always just a matter of the attachment size.
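
To get a feeling for how much a compressed attachment can expand, a small Python sketch like the one below compares the stored and the expanded size of a ZIP file using the standard zipfile module. It is only an illustration of the size difference. It is not part of Notes/Domino or Tika, and the function name is made up.

```python
import zipfile


def zip_expansion(path_or_file) -> tuple:
    """Return (compressed_bytes, expanded_bytes) summed over all ZIP members."""
    with zipfile.ZipFile(path_or_file) as archive:
        members = archive.infolist()
        # Sizes come from the ZIP directory, so nothing is actually unpacked.
        compressed = sum(m.compress_size for m in members)
        expanded = sum(m.file_size for m in members)
    return compressed, expanded
```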

In my environment I have only limited analysis data.
But in a larger environment this could provide useful information for optimizing the Tika indexing.
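
One way to turn such data into candidates for exclusion is to sum the filter time per file extension. The following Python sketch assumes each record is a dict with the keys attachment, msecs and size. These field names are hypothetical and only chosen for this example.

```python
import os
from collections import defaultdict


def summarize_by_extension(records):
    """Sum request count, filter time and attachment size per file extension."""
    totals = defaultdict(lambda: {"count": 0, "msecs": 0, "size": 0})
    for record in records:
        ext = os.path.splitext(record["attachment"])[1].lower() or "(none)"
        entry = totals[ext]
        entry["count"] += 1
        entry["msecs"] += record["msecs"]
        entry["size"] += record["size"]
    # Slowest extensions first: these are the first candidates to review.
    return sorted(totals.items(), key=lambda item: item[1]["msecs"], reverse=True)
```

The result is a list of (extension, totals) pairs, so the extensions that cost the most filter time show up at the top.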

Obfuscating data

The data can be completely obfuscated. The attachment names and the database names can be obfuscated with a single switch.
The attachment extension remains the same, and so does the path name. But the file names are always obfuscated by turning them into a hash.

Because this data can be very sensitive, I added obfuscation from day one to allow the customer who first ran it to share data with me.
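
The idea can be sketched in a few lines of Python. This is an illustration only: the hash function (SHA-256, truncated to 16 hex digits) and the function name are assumptions, and the sketch only handles paths with forward slashes.

```python
import hashlib
import posixpath


def obfuscate_name(path: str) -> str:
    """Replace the file name with a hash, keeping directory and extension."""
    directory, filename = posixpath.split(path)
    stem, extension = posixpath.splitext(filename)
    # Assumption: SHA-256 truncated to 16 hex digits; a real tool may differ.
    digest = hashlib.sha256(stem.encode("utf-8")).hexdigest()[:16]
    return posixpath.join(directory, digest + extension)
```

The same input always produces the same obfuscated name, so results for one attachment or database can still be grouped without revealing its name.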


Conclusion

This would be a good tool if you look into larger full text index deployments, for example when you want to enable FT indexing including attachment indexing for your Verse users.

I have never been a fan of attachment indexing. But if you want to enable it in a larger environment, this tool might help you.

And for sure you should look into optimizing attachment indexing by excluding certain types of file extensions.

ZIP formats are problematic! They cannot be avoided in all types of environments, but you should try to!



