Indexing visited Web pages with the Recoll Firefox extension

Overview

The Recoll-WE Firefox add-on downloads the Web pages you visit, for later caching and indexing by Recoll.

History

The extension works with Recoll in a slightly convoluted way because it has a complicated history.

It began its life as a complement to the Beagle indexer, and was then copied and modified to work with Recoll.
This first version was based on the Firefox XUL API, and was almost fully rewritten to use the WebExtensions API when XUL was made obsolete, a few years ago. The new extension is largely based on code stolen from the save-page-we extension.

The current version works with Recoll 1.23.5 or newer.

The repository for the current extension code is hosted on framagit.

How things work

Some of the values mentioned below can be changed in the Recoll configuration (see the Recoll Configuration section further down).

The extension saves a copy of the visited page data and metadata in two files inside your Firefox Downloads directory (or a subdirectory, see the configuration section). The files are named something like recoll-we-m-[url hash].rclwe for the metadata and recoll-we-c-[url hash].rclwe for the data.
When the Recoll indexer runs, it first processes the normal file system files, then, if Web indexing is enabled, it runs the recoll-we-move-files.py script (from the Recoll filters directory). The script moves the files from the Downloads directory to the Recoll Web indexing queue directory, normally ~/.recollweb/ToIndex. This step was introduced with the switch to WebExtensions, to avoid modifying the indexer itself too much and keep compatibility with the old extension.
The indexer then runs the specific Web pages indexing code proper. This:
- Indexes the data.
- Saves the data and metadata into the Recoll Web cache storage (by default ~/.recoll/webcache), and deletes the queue files.
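
The staging step described above can be sketched in Python. This is only an illustration, not the actual recoll-we-move-files.py script: the function names are hypothetical, the shape of the hash part of the file name is not assumed beyond "anything between the prefix and .rclwe", and skipping pages for which only one of the two files is present is an assumption.

```python
import re
import shutil
from pathlib import Path

# Names described above: recoll-we-m-<url hash>.rclwe (metadata) and
# recoll-we-c-<url hash>.rclwe (data). The hash format is not assumed.
NAME_RE = re.compile(r"^recoll-we-([mc])-(.+)\.rclwe$")


def find_pairs(downloads: Path) -> dict[str, dict[str, Path]]:
    """Group the extension's files by URL hash: {hash: {"m": path, "c": path}}."""
    pairs: dict[str, dict[str, Path]] = {}
    for path in downloads.iterdir():
        match = NAME_RE.match(path.name)
        if match and path.is_file():
            kind, url_hash = match.groups()
            pairs.setdefault(url_hash, {})[kind] = path
    return pairs


def move_complete_pairs(downloads: Path, queue: Path) -> list[str]:
    """Move pages that have both files into the queue; return their hashes."""
    queue.mkdir(parents=True, exist_ok=True)
    moved = []
    for url_hash, files in sorted(find_pairs(downloads).items()):
        if set(files) == {"m", "c"}:  # assumption: leave incomplete pairs alone
            for path in files.values():
                shutil.move(str(path), str(queue / path.name))
            moved.append(url_hash)
    return moved
```

With the default locations, this would be called as move_complete_pairs(Path.home() / "Downloads", Path.home() / ".recollweb" / "ToIndex").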

Note

The Recoll Web cache is not an archive

Old content will be deleted and become unsearchable when new content needs space. The maximum cache size is configurable. Pages that you want to archive permanently need to be saved elsewhere, as they will otherwise eventually disappear from the Recoll results. Recoll can index .maff files, which may be a better choice for archival usage, or also see the Save Page WE Firefox extension.

Note

Turning the cache into an archive

You could conceivably turn the Web cache into an archive by setting the maximum size absurdly high. While this will work, be aware that space will be wasted when a new version of a page is stored: storage for the old versions will be erased, but not reclaimed. It is possible to compact the cache by copying it, though. The code for this exists in a test driver (in the source set, but not built or packaged by default).
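
The behaviour described in the two notes above can be illustrated with a toy model. This is a teaching sketch only and says nothing about Recoll's actual on-disk cache format: the class ToyWebCache, its methods, and the oldest-first eviction order are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWebCache:
    """Size-bounded store; a toy model, not Recoll's cache implementation."""
    max_bytes: int
    # Each slot is [url, size, live]; erased slots still occupy space.
    slots: list = field(default_factory=list)

    def used(self) -> int:
        """Space occupied by all slots, live or erased."""
        return sum(size for _, size, _ in self.slots)

    def store(self, url: str, size: int) -> None:
        # A new version erases the old one but does not reclaim its space.
        for slot in self.slots:
            if slot[0] == url and slot[2]:
                slot[2] = False
        # Old content is dropped (oldest first) until the new entry fits.
        while self.slots and self.used() + size > self.max_bytes:
            self.slots.pop(0)
        self.slots.append([url, size, True])

    def searchable(self) -> list:
        """URLs whose current version is still in the cache."""
        return [url for url, _, live in self.slots if live]

    def compacted(self) -> "ToyWebCache":
        # Compaction by copying: only live entries reach the new cache.
        copy = ToyWebCache(self.max_bytes)
        copy.slots = [list(slot) for slot in self.slots if slot[2]]
        return copy
```

In this model, a cache with a very large max_bytes that stores the same URL three times keeps a single searchable entry but occupies three entries' worth of space until compacted() is called, whereas a small cache silently loses its oldest pages.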

Configuration

Extension Configuration

By default, after installation, the extension will store all the pages which you visit.

You can change this behaviour through configuration options in the add-on preferences page or through the context menu:

Download subdirectory: Subdirectory of the Firefox Downloads directory where the extension will create its files. By default, this is empty, and the recoll-we-move-files.py script expects to find its files under ~/Downloads. You can set a value to avoid pollution of the main Downloads directory (e.g. recoll-we). In this case, you also need to set the webdownloadsdir configuration variable (e.g. webdownloadsdir = ~/Downloads/recoll-we).
Automatically index pages: If this is not set, a page will only be saved if requested by clicking the toolbox button or selecting the submenu action. If this is set, pages will be automatically saved, subject to the rules below.
Also do it for pages with secure content (https): Enable/disable the same behaviour for https URLs.

Note: The options which follow only take effect if automatic indexing has been activated (see above).

Save by default (when no rule set matches): This is initially set. It is automatically unset the first time you add an inclusion rule.
Save when both rule sets match: If set, index the page when both the inclusion and exclusion rule sets match.
URL include rules: Rules to select the URLs which will be automatically indexed.
URL exclude rules: Rules to select URLs which will not be indexed.

Rules for both sets can be of three types: domain (just select by host name), wildcard, or regular expression. Rules added through the context menu are of the 'domain' type.
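
The way these options combine can be sketched as a small decision function. This is an interpretation of the descriptions above, not the extension's code: the names Rule and should_save are hypothetical, and the exact matching semantics (exact host name comparison for domain rules, shell-style wildcards and unanchored regular expressions applied to the whole URL) are assumptions.

```python
import fnmatch
import re
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Rule:
    kind: str      # "domain", "wildcard" or "regexp"
    pattern: str

    def matches(self, url: str) -> bool:
        if self.kind == "domain":
            # Assumption: exact host name comparison, case-insensitive.
            return (urlsplit(url).hostname or "") == self.pattern.lower()
        if self.kind == "wildcard":
            # Assumption: shell-style wildcard applied to the whole URL.
            return fnmatch.fnmatchcase(url, self.pattern)
        if self.kind == "regexp":
            # Assumption: the expression may match anywhere in the URL.
            return re.search(self.pattern, url) is not None
        raise ValueError(f"unknown rule kind: {self.kind}")


def should_save(url, includes, excludes, *, auto_index=True, index_https=True,
                save_by_default=True, save_when_both=False) -> bool:
    """Decide whether a visited page would be saved automatically."""
    if not auto_index:
        return False  # pages are then only saved on explicit request
    if urlsplit(url).scheme == "https" and not index_https:
        return False
    included = any(rule.matches(url) for rule in includes)
    excluded = any(rule.matches(url) for rule in excludes)
    if included and excluded:
        return save_when_both
    if included:
        return True
    if excluded:
        return False
    return save_by_default  # no rule set matches
```

The keyword arguments stand for the preferences described above; their default values here are chosen for the example and are not the extension's documented defaults.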

Recoll Configuration

The .rclwe extension should be added to the noContentSuffixes configuration variable. This was not the case by default until Recoll 1.27.6. You can add it by adding the following to the recoll.conf configuration file:

noContentSuffixes+ = .rclwe

This can also be done with the GUI preferences:

Preferences->Index configuration->Local Parameters->Ignored endings

processwebqueue

the Web pages saved by the extension will only be indexed if this is set in the Recoll configuration. This can be done by editing recoll.conf or in the GUI preferences:

Preferences->Index configuration->Web history

webcachemaxmbs

the Web cache maximum size can be set through this configuration parameter, which is also accessible from the GUI:

Preferences->Index configuration->Web history

webdownloadsdir

if your browser default downloads directory is not ~/Downloads, or if you set the subdirectory option in the extension preferences, you will need to set this to the appropriate value in 'recoll.conf' (e.g. webdownloadsdir = ~/Downloads/recoll-we).

webcachedir

you can set this to change the location where the Recoll Web cache is stored. By default, this is a webcache subdirectory of the Recoll configuration directory.

webqueuedir

this is the intermediate staging location where the recoll-we-move-files.py script stores the pages, and where they are retrieved by the Recoll indexer. The default value is ~/.recollweb/ToIndex. The variable is used by the script and by recollindex.
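
To check which directories a given configuration would select, the variables above can be resolved with a small helper. This is a simplified sketch, not Recoll's configuration parser: the functions parse_conf and web_dirs are hypothetical, sections and continuation lines in recoll.conf are ignored, values are assumed to be absolute or ~-relative paths, and the configuration directory is assumed to be ~/.recoll.

```python
from pathlib import Path


def parse_conf(text: str) -> dict[str, str]:
    """Parse 'name = value' lines; comments and blank lines are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip()
    return values


def web_dirs(conf_text: str, confdir: str = "~/.recoll") -> dict[str, Path]:
    """Return the effective Web indexing directories, using the stated defaults."""
    conf = parse_conf(conf_text)
    confdir_path = Path(confdir).expanduser()
    defaults = {
        "webdownloadsdir": "~/Downloads",
        "webqueuedir": "~/.recollweb/ToIndex",
        # Default: a webcache subdirectory of the configuration directory.
        "webcachedir": str(confdir_path / "webcache"),
    }
    return {
        name: Path(conf.get(name, default)).expanduser()
        for name, default in defaults.items()
    }
```

For example, web_dirs("webdownloadsdir = ~/Downloads/recoll-we") keeps the default queue and cache locations and only changes the directory where the move script looks for the extension's files.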