Using multiple temporary indexes to improve indexing time (1.41.1)

Note

The underlying code is buggy in releases 1.38 to 1.41.0, and fixed in 1.41.1. The bug affects the storage of document texts inside the index, so it only affects snippet generation in result lists. If result list snippets are important to you, do not use this feature with an affected release.

In some cases, either when the input documents are simple and require little processing (e.g. HTML files), or possibly with a high number of available cores, the single-threaded Xapian index updates can become the performance bottleneck for indexing.

In this case, it is possible to configure the indexer to use multiple temporary indexes, which are merged at the end of the operation. This can provide a huge gain in performance but, unlike multithreading for document preparation, it can also have a (slight) negative impact in some cases, so it is not enabled by default.

In most cases, this should also be turned off after the initial index creation is done, because it is extremely detrimental to the speed of small incremental updates.

The parameter which controls the number of temporary indexes in recoll.conf is named thrTmpDbCnt. The default value is 0, meaning that no temporary indexes are used.
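As an illustration, a recoll.conf fragment enabling the feature could look like the following. The value 4 is only an example for a quad-core machine, not a recommendation; the appropriate value depends on your hardware and document set.

```
# Use four temporary indexes, merged at the end of the indexing run.
# The value is an example only: experiment to find what suits your machine.
thrTmpDbCnt = 4
```

Setting the value back to 0 (the default) after the initial index creation restores normal behaviour for incremental updates.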

If your document set is big and you are using a processor with many cores for indexing, especially if the input documents are simple, it may be worth experimenting with the value. For example, with a partial Wikipedia dump (many small HTML files), indexing time was reduced to almost a third by using four temporary indexes on a quad-core machine. More details are available in this article on the Recoll web site.

All the tests were performed on SSDs. It is quite probable that this approach would not work well on spinning disks, at least not in its current form.