General features
-
Runs on most Unix-based systems, MS-Windows, and Mac OS X.
-
Qt desktop GUI, WEB browser, command line, Gnome Shell Search Plugin, KDE KIO and krunner user interfaces.
-
Searches most common document types. Transparently handles decompression (zip, gzip, bzip2, etc.).
-
Processes all email attachments, and more generally any realistic level of container nesting (e.g. an MS Word attachment to a message inside a mailbox in a zip archive).
-
Powerful query facilities, with boolean searches, phrases, proximity, wildcards, filter on file types and directory tree, accessible through the query language or a GUI query builder interface.
-
Multi-language and multi-character set, internally using Unicode UTF-8.
-
Extensive documentation, with a complete user manual and manpages for each command.
-
Can use a Firefox extension to index visited Web pages history. See the Howto for more detail.
-
Multiple selectable databases.
-
Stemming performed at query time (can switch stemming language after indexing).
-
An indexer which can be set to run either as a periodic batch job, or as a real-time indexing permanent task.
-
Works equally well with very small or very big datasets. Someone is currently managing an 11 million document dataset, resulting in a 250GB index.
Document types
Recoll can index many document types (along with their compressed versions).
The MS-Windows installer includes the supporting applications, so no additional installation is needed. On Linux and MacOS, some types are processed internally, and some need a separate application to be installed to extract the text. Types that only need very common utilities (awk/sed/groff/iconv, Python etc.) are listed in the native section.
Many formats are processed by Python3 scripts. Formats which are processed using only the Python standard library are listed in the native section.
After installing a missing handler, you may need to tell recollindex to retry the failed files, by
adding option -k to the command line (this can also be done from the GUI File→Special indexing
menu). This is because recollindex in its default operation mode will not retry files which caused
an error during an earlier pass. In special cases, it may be useful to reset the data for a category
of files before indexing. See the recollindex manual page. If your index is not too big, it may be
simpler to just reset it.
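As a rough illustration, the retry pass could be scripted along the following lines. This is only a sketch: `-k` is the option described above, `-c` is the usual Recoll option for selecting a non-default configuration directory, and the function names are made up for this example.

```python
import shutil
import subprocess


def build_retry_command(confdir=None):
    """Return the argv list for a recollindex pass which retries failed files."""
    cmd = ["recollindex"]
    if confdir is not None:
        # -c selects a non-default configuration directory.
        cmd += ["-c", confdir]
    # -k asks recollindex to retry files which failed during an earlier pass.
    cmd.append("-k")
    return cmd


def retry_failed_files(confdir=None):
    """Run the retry pass. Returns the exit code, or None if recollindex
    is not found on the PATH."""
    if shutil.which("recollindex") is None:
        return None
    return subprocess.run(build_retry_command(confdir)).returncode


if __name__ == "__main__":
    # Only print the command here; call retry_failed_files() to actually run it.
    print(" ".join(build_retry_command()))
```

Running the pass from a script like this is mostly useful right after installing a batch of missing handlers.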
Note: By default, Recoll indexes Chinese and Korean texts by generating arbitrary n-gram terms. However, it can use a better text segmenter if one is available. See here for Chinese and here for Korean.
File types indexed natively
The formats in this section would be processed by a basic Recoll installation and its immediate dependencies without need for installing further applications or modules.
-
Plain text.
-
HTML.
-
Maildir, mh, and mailbox (Mozilla, Thunderbird, Evolution, etc.). Evolution note: be sure to remove .cache from the skippedNames list in the GUI Indexing preferences/Local Parameters pane if you want to index local copies of Imap mail. Outlook archives are processed with an external helper, see further.
-
Gaim and purple log files.
-
Scribus files.
-
Man pages (needs groff).
-
Mimehtml web archive format (this is based on the mail handler, which introduces some mild weirdness, but is still usable).
Recoll processes most XML-based formats internally (using the libxml2 and libxslt C libraries):
-
OpenOffice.
-
Microsoft Office Open XML.
-
Abiword.
-
Kword.
-
Fb2 ebooks.
-
SVG.
-
Gnumeric.
-
Okular annotations.
The following use Python3 and its standard library:
-
Excel and Powerpoint (classic formats, pre-Open XML).
-
Zip archives.
-
Joplin notes (see the details here).
-
Dia diagrams.
-
Tar archives (and their compressed versions). Tar file indexing is disabled by default (because tar archives don’t typically contain the kind of documents that people search for), so you will need to enable it explicitly, e.g., with the following in your $HOME/.recoll/mimeconf file:
    [index]
    application/x-tar = execm rcltar.py
Replace rcltar.py with rcltar for old recoll versions. You can check which exists by looking in the handlers directory, e.g., /usr/share/recoll/filters/.
-
Konqueror webarchive format (uses the tarfile Python standard library module).
-
MacOS webarchive format (not to be confused with the above). From Recoll 1.42.2, only on Mac OS (uses the Mac textutil program, which is installed by default, which is why we count the format as internal).
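For the tar archive entry above, the check for which handler name exists could look something like the following. This is a sketch only: the default directory is the example path given above and may differ on your system, and the function name is hypothetical.

```python
import os


def tar_handler_line(filters_dir="/usr/share/recoll/filters/"):
    """Return the mimeconf line enabling tar indexing, or None if neither
    handler script is present in filters_dir."""
    # Prefer the current name, then fall back to the one used by old versions.
    for name in ("rcltar.py", "rcltar"):
        if os.path.isfile(os.path.join(filters_dir, name)):
            return "application/x-tar = execm " + name
    return None


if __name__ == "__main__":
    line = tar_handler_line()
    if line is None:
        print("No tar handler found in the filters directory")
    else:
        # These two lines are what would go into $HOME/.recoll/mimeconf.
        print("[index]")
        print(line)
```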
File types indexed with external helpers
The following need miscellaneous helper programs or libraries to extract the document text.
-
PDF needs the pdftotext command, which comes with poppler. The package name is quite often poppler-utils. Note: the older pdftotext command which comes with xpdf is not compatible with Recoll. PDF has its own section further down, with details about metadata, attachments, annotations, OCR, and opening documents at the right page.
-
Microsoft Word is processed with antiword, which is not maintained much, but keeps working. I maintain a very slightly improved antiword version, it can extract a little extra data in some cases. In case antiword fails, which sometimes happens with very small files, recoll (>= 1.43.4) falls back to trying to use soffice (e.g. from LibreOffice), else if soffice is not found, or for older recoll versions, wvWare.
-
RTF files with unrtf. Note that up to version 0.21.3, unrtf mostly does not work with non western-european character sets. Many serious problems (crashes with serious security implications and infinite loops) were fixed in unrtf 0.21.8, so you really want to use this or a newer release. Building unrtf from source is quick and easy, but most distributions now have an up to date unrtf.
-
CHM (Microsoft help) files with chmlib. Recoll bundles the Python3 bindings for the library (ported from pychm which originally did not support Python3).
-
EPUB files with Python and the epub module, which is not packaged on Debian. The packaged version by the original author (0.5.2) is old and suffers from a lot of bitrot, so Recoll now bundles an unpackaged version, updated by Arthur Darcet: no need to install anything.
-
Audio tags: Recoll uses a Python script based on the mutagen package, which you need to install.
-
Image tags are extracted with perl and exiftool.
-
Microsoft Outlook .pst and .ost files are processed with libpff. We use a slightly modified version (to provide streaming output), stored in this repository. This is bundled with Recoll for Windows. You will need to build it on other platforms and ensure that the pffexport program is in the executable PATH when recoll/recollindex is run.
-
Hancom office Hanword .hwp format for Korean text processing, using the pyhwp Python module. See the module page. Use python3 -m pip install pyhwp to install on Linux. This is bundled with Recoll for Windows. On Debian, you also probably want to install the fonts-nanum package, which is not part of the default install.
-
Wordperfect is processed with the wpd2html command from the libwpd package. On some distributions, the command may come with a package named libwpd-tools or such, not the base libwpd package.
-
Jupyter notebooks need jupyter to be installed for conversion to HTML.
-
djvu with DjVuLibre.
-
GNU info files are processed with Python and the info command.
-
Lyx files need Lyx to be installed.
-
Rar archives with either the unrar or rarfile python package. The rarfile package is a wrapper over the unrar command and is packaged as python3-rarfile by Debian. The unrar package uses the libunrar library and works much better. The libunrar library can sometimes be found as a standard package (Ubuntu libunrar5) or be built from the unrar source code. The unrar package is generally not packaged, use pip3 install unrar to install it. Note that the free version of unrar (unrar-free) fails for many files with the message "Failed the read enough data".
-
7zip archives with py7zr. The Recoll handler can alternatively use pylzma, but this fails on some archives. Neither is packaged by all distributions, and you will probably have to use pip for installation.
Recent versions of py7zr are broken as far as I can see (mindless API change and issues with crc). Use 0.22:
    python3 -m pip install py7zr==0.22
-
iCalendar(.ics) files with the icalendar module.
-
Mozilla calendar data. See the Howto about this.
-
Postscript with the ps2pdf command from ghostscript, and pdftotext from poppler.
-
TeX with untex. If there is no untex package for your distribution, this site stores a source package, as untex has no obvious home. Will also work with detex if this is installed.
-
DVI with catdvi.
-
Midi karaoke files (.kar, .mid) need the chardet Python3 module.
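To get a quick idea of which of the helper commands above are missing from a system, something like the following could be used. This is a sketch, not part of Recoll: the table only covers a few of the commands named in the list, and the function name is made up for the example.

```python
import shutil

# A few of the helper commands named in the list above, keyed by format.
HELPERS = {
    "PDF": ["pdftotext"],
    "Microsoft Word": ["antiword"],
    "RTF": ["unrtf"],
    "Image tags": ["exiftool"],
    "Outlook .pst/.ost": ["pffexport"],
    "Wordperfect": ["wpd2html"],
    "Postscript": ["ps2pdf", "pdftotext"],
    "DVI": ["catdvi"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Map each format to the list of its helper commands not found on the PATH."""
    missing = {}
    for fmt, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[fmt] = absent
    return missing


if __name__ == "__main__":
    for fmt, commands in missing_helpers().items():
        print(fmt + ": missing " + ", ".join(commands))
```

Python modules such as mutagen or py7zr are not covered by this check, as they are imported by the handlers rather than run as commands.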
Desktop and WEB integration
The Recoll GUI has many features that help to specify an efficient search and to manage the results. However, it may sometimes be preferable to use a simpler tool with better integration with your desktop interfaces. Several solutions exist:
-
The Recoll Web UI lets you query a Recoll index from a WEB browser. The one linked here, from framagit.org, is a much updated version of the one on GitHub (by GitHub user koniu), and is the one to use.
-
The Recoll Gnome Shell Search Provider allows searching from the Gnome Shell.
-
The Recoll KIO module allows starting queries and viewing results from the Dolphin file manager, the Konqueror browser or other KDE applications Open dialogs.
-
The recollrunner module allows integrating Recoll search results into the Plasma workspace KRunner. Beware that there was a much older, now obsolete, module of the same name for an older KDE version, which is what you usually find from a Web search. The current KRunner module is packaged by, e.g. Fedora and Mageia. If your distribution does not package it, it is not too difficult to build once the main Recoll package has been installed. There is a README describing the process. Else, maybe petition your kind packager to build the module. It’s really easy after taking a look at the Fedora spec file, or the Debian one in the Recoll packaging directory …
Recoll also has a Python3 extension with full capability to query or update the index (and on which the WebUI and Joplin notes indexer are built).
There used to be a PHP extension too, it is currently unmaintained.
Stemming
Stemming is a process which transforms inflected words into their most basic form. For example, flooring, floors, floored would probably all be transformed to floor by a stemmer for the English language.
In many search engines, the stemming process occurs during indexing. The index will only contain the stemmed form of words, with exceptions for terms which are detected as being probably proper nouns (i.e. capitalized). At query time, the terms entered by the user are stemmed, then matched against the index.
This process results in a smaller index, but it has the serious drawback of irrevocably losing information during indexing.
Recoll works in a different way.
-
No destructive stemming is performed during indexing, so that all information gets into the index. The resulting index is bigger, but most people probably don’t care much about this nowadays, because they have a terabyte disk 95% full of binary data which does not get indexed.
-
At the end of an indexing pass, Recoll builds one or several stemming dictionaries, where all word stems are listed in correspondence to the list of their derivatives.
-
At query time, by default, user-entered terms are stemmed, then matched against the stem database, and the query is expanded to include all derivatives. This will yield search results analogous to those obtained by a classical engine.
The benefit of this approach is that stem expansion can be controlled instantly at query time in several ways:
-
It can be selectively turned off for any query term by capitalizing it (Floor).
-
The stemming language (e.g. english, french…) can be selected (this supposes that several stemming databases have been built, which can be configured as part of the indexing, or done later, in a reasonably fast way).
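The general idea of query-time stem expansion can be sketched as follows. This is an illustration of the principle only, not Recoll's actual code: the suffix-stripping stemmer is a toy which only handles a few English endings, all function names are hypothetical, and the index terms are assumed to be stored in lower case.

```python
from collections import defaultdict


def toy_stem(word):
    """Very crude English stemmer, for illustration only."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_stem_db(index_terms, stem=toy_stem):
    """Map each stem to the set of index terms which reduce to it.
    This would be done at the end of an indexing pass."""
    db = defaultdict(set)
    for term in index_terms:
        db[stem(term)].add(term)
    return db


def expand_term(term, stem_db, stem=toy_stem):
    """Expand a query term to all its derivatives, unless it is capitalized."""
    if term[:1].isupper():
        # Capitalized term: no stem expansion, only the word itself is searched.
        return {term.lower()}
    return stem_db.get(stem(term), set()) | {term}


if __name__ == "__main__":
    stem_db = build_stem_db(["floor", "floors", "floored", "flooring", "door"])
    print(sorted(expand_term("floors", stem_db)))
    print(sorted(expand_term("Floor", stem_db)))
```

Because the unstemmed terms stay in the index, switching to another stemming language only means building another stem database with a different `stem` function. The index itself is left untouched.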
Special features for PDF
Recoll can extract and index PDF attachments. As of Recoll 1.43.1, this is done with the pdfdetach command from poppler-utils, so no additional package is needed (this enables PDF attachment indexing on MS-Windows). Previously, the feature needed the pdftk command. The pdftk package is commonly found in the system repository for Linux distributions.
Recoll can extract and process custom XMP metadata fields, mapping them to query fields. See the manual. This needs the pdfinfo command, which usually comes in the same package as pdftotext. This is included in the Windows Recoll package.
Recoll can extract and index PDF annotations by using the poppler-glib Python bindings. These are normally available with the Poppler installation on Linux systems, by adding the Poppler GObject introspection data, which can come in variously named packages, for example gir1.2-poppler-0.18 on Ubuntu, typelib-1_0-Poppler-0_18 on OpenSuse. The actual versions may differ of course. This is not available on Windows at the moment.
The default configuration uses the evince application to open PDF files. evince has options for direct page access and pre-setting the search strings (hits will be highlighted). There are examples in the default mimeview for doing the same thing with qpdfview (qpdfview --search %s %f#%p). Okular does not have a search string option (but it does have a page number one). Evince is available for Windows, but you will need to install it separately.
Recoll can automatically perform OCR on pure image PDF files, and cache the results so that OCR is not needed for future indexing passes. The function can currently use either tesseract or ABBYY FineReader. See the manual. NOTE: using OCRmyPDF is probably a much better approach in most cases.
