Domino on Linux/Unix, Troubleshooting, Best Practices, Tips and more ...

 

Daniel Nashed

 

Critical Problem in Domino 8.0.2 - Compact can cause data loss

Daniel Nashed  14 December 2008 20:53:10


Oops, my first posting, and it is directly about a problem. But I have the feeling that you should all know about this bug and what it can cause in your environment.


We ran into a critical problem at a customer and I have been working with Lotus Support to resolve it. I made sure that we got a support flash ASAP: TN #21329103, which was created 2 days after we opened the PMR. This alone shows how critical this problem is rated. The following describes the background of the issue.


There is an SPR for a low-level database issue that causes databases with the option "Optimize document table map" (last tab of the database properties) to be unable to store documents in some situations.

This bug only occurs in Notes and Domino 8.0.2 and only affects databases with this option set.


The official text for the SPR is the following:


SPR# WTUZ7EEL85 - Certain conditions can result in the UNKs not being resolved for a parent response note during the note update operation which result in not being able to properly resolve the bucket bitmap optimization.  

The fix allows that failure not to interfere with the updating of the note operation.  Without this fix certain operations would fail in such a way that subsequent access to the database result in:  "Page not buffered" error.


What happens in the background


Any kind of operation that needs to save or store a document can fail with an error.

The usual error is "Page not buffered" but there are other error messages that could occur.

One other error message is the internal error "02:0A", for example during a copy-style compact operation. But there might also be other error messages.


When does the problem occur?


The problem could occur, for example, when a user sends a mail and the document is not stored in his own mail database.

Or it could occur when the mail router is trying to deliver a mail to a mail database. But there could be a couple of other scenarios where this problem can occur.


Murphy's Law hit us with compact -c -i


There is a best practice to use compact -c -i to convert to a new ODS. The customer admin used this because he ran into the error number #02:0A when using compact.

A copy-style compact generates a TMP file and does a note-by-note copy of all notes and objects in the database. When the compact finishes without error the old database is deleted and the new database in the TMP file is renamed to be the new database.


If you specify the -i option ALL errors are ignored and the database is renamed in any case.

Even worse there is no error logging when documents cannot be copied and you don't know afterwards if all documents have been copied.
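The behaviour described above can be illustrated with a toy model. This is only a sketch of the logic as described in this posting, not Domino code: the names ToyDatabase, NoteStoreError and copy_style_compact are made up for illustration, and the assumption that the target stays broken after the first failed store is taken from what we saw in this case.

```python
from dataclasses import dataclass, field


class NoteStoreError(Exception):
    """Stands in for errors such as "Page not buffered" in this toy model."""


@dataclass
class ToyDatabase:
    # Maps a note ID to its content; a stand-in for an NSF file.
    notes: dict = field(default_factory=dict)
    # Hypothetical trigger: note IDs whose store operation hits the bug.
    failing_ids: set = field(default_factory=set)
    # Assumption: once a store has failed, every later store fails as well.
    broken: bool = False

    def store(self, note_id, content):
        if self.broken or note_id in self.failing_ids:
            self.broken = True
            raise NoteStoreError(f"cannot store note {note_id}")
        self.notes[note_id] = content


def copy_style_compact(source, failing_ids=(), ignore_errors=False):
    """Return the database that is live after the compact attempt."""
    tmp = ToyDatabase(failing_ids=set(failing_ids))
    for note_id, content in source.notes.items():
        try:
            tmp.store(note_id, content)
        except NoteStoreError:
            if not ignore_errors:
                # Without -i: give up, drop the TMP file, keep the original.
                return source
            # With -i: the error is swallowed and nothing is logged.
    # The TMP file replaces the original, complete or not.
    return tmp


original = ToyDatabase(notes={1: "memo", 2: "reply", 3: "profile", 4: "icon"})
kept = copy_style_compact(original, failing_ids={3})
lossy = copy_style_compact(original, failing_ids={3}, ignore_errors=True)
print(len(kept.notes), len(lossy.notes))
```

Without ignore_errors the original database survives untouched. With it, the truncated copy silently becomes the database.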


IMHO the -i option should not be a best practice but a last resort if you cannot get your compact operations through and have another replica from which you could pull back missing documents if needed. This would be similar to fixing up a database.


In our case the bug prevents documents from being stored in the target database (TMP database) at some point during the copy-style compact when the database has the "Optimize document table map" database option enabled.

As a result the remaining documents are not copied into the database. But the TMP file is still renamed back :-(


Problems occurring after compact -c -i


So the bug in combination with compact -c -i can cause a dramatic loss of documents, design elements, deletion stubs and profiles. And this leads to all sorts of errors that an admin normally does not relate to this compact.


-- Lost documents in the database

-- Lost design elements. For example the icon note!

-- Lost navigation in the mail-file

-- Lost profile documents (which cause other issues)


If you lose profile documents in a database all sorts of problems can occur.


For example you get an error when the Out of Office agent runs:


"AMgr: Agent ('OutOfOffice OutOfOffice' in 'mail\xxx.nsf') error message: Function requires a valid ADT argument."


Also, when the Calendar Profile document is lost, all sorts of problems can occur in a mail database.


How to fix the problem


First of all you need to either revert to Domino 8.0.1 or get a hotfix for SPR #WTUZ7EEL85 from IBM/Lotus support.


If you discover the problem soon and you have another server or a good backup you might be able to restore most of the data.

You can restore databases to a different server, clear the replication history and especially the cut-off date, and then replicate the missing documents back.

Deleting the replication history thru the Admin Client also deletes the cut-off date, which makes it easy to replicate the data back.
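Before replicating back it can help to see how much is actually missing. The following is a minimal sketch, assuming you have already exported the document UNIDs of the restored copy and of the compacted database into two lists by whatever means you prefer. The function name and the sample UNIDs are made up for illustration.

```python
def missing_unids(restored, compacted):
    """Return UNIDs found in the restored copy but not in the compacted database."""
    # UNIDs are compared case-insensitively and surrounding whitespace is ignored.
    def normalize(unids):
        return {u.strip().upper() for u in unids if u.strip()}

    return sorted(normalize(restored) - normalize(compacted))


# Hypothetical sample data, shortened for readability.
restored_copy = ["A1B2", "c3d4", "E5F6", "0718"]
after_compact = ["A1B2", "C3D4"]

for unid in missing_unids(restored_copy, after_compact):
    print("missing:", unid)
```

Documents that were deleted on purpose after the backup was taken will show up in this list as well, so treat the result as a hint and not as proof.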



Another Issue - New and Old Profile Documents


When compact -c -i has removed a profile document (without leaving a deletion stub), a new profile document is created on the fly the next time that profile is accessed.

When you try to replicate the original profile document back, the profile note is replicated back but it does not replace the newly created profile doc.

So the original profile note remains invisible and there is no way to get this profile doc back.


Profile documents are a special type of document which is referenced thru its name and cannot be searched using public APIs.

They are internally called ghost documents which are also cached in memory for performance.

To get the original profile document back we have to remove the new profile note and restore the original profile note as the current profile.


Working closely with IBM we found a way to fix the profile documents. But it needs a very low-level routine to get this sorted out.


I hope this information helps you to get a clear understanding of the SPR and what is happening in the background.

If you are hit by the same problem feel free to contact me.


-- Daniel
