Overview

(There is currently no Korean translation for this new version of the page.)

Korean Hangul text is not processed well either by the whitespace-based word splitter used for Western languages or by the n-gram based splitter used for Chinese characters.
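To illustrate the problem, here is a minimal Python sketch. It is not Recoll's actual splitter code: the two functions are simplified, hypothetical stand-ins for whitespace splitting and character-bigram splitting, and the sample sentence ("I went to school") was chosen for illustration only.

```python
def whitespace_terms(text):
    """Split on whitespace, as is done for Western languages."""
    return text.split()


def char_bigrams(text):
    """Emit overlapping two-character terms for each whitespace-separated word."""
    terms = []
    for word in text.split():
        if len(word) == 1:
            terms.append(word)
        else:
            terms.extend(word[i:i + 2] for i in range(len(word) - 1))
    return terms


sample = "나는 학교에 갔다"
print(whitespace_terms(sample))  # ['나는', '학교에', '갔다']
print(char_bigrams(sample))      # ['나는', '학교', '교에', '갔다']
```

In 학교에, the noun 학교 (school) carries the particle 에. Whitespace splitting indexes only the combined form, so a search for 학교 alone would not match it as a term, while bigram splitting adds the meaningless fragment 교에. A morpheme analyzer can separate the noun from the particle.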

As of version 1.27, Recoll supports using an external text analyzer to split Korean text into appropriate terms.

The initial implementation was based on the Konlpy Python package, which has support for several morpheme analyzers for Korean (Hannanum, Kkma, Komoran, Mecab, Twitter/Okt).

Testing with the help of a kind Korean Recoll user led to choosing Mecab as the default, as it seemed to offer the best compromise between performance and quality.

The current Recoll implementation can still work with Konlpy, in case you want to experiment with different analyzers. However, the default setup now uses python-mecab-ko, which is a direct interface to Mecab-ko and avoids the multiple Konlpy dependencies.

Mecab-ko is written in C++, unlike the other analyzers, which are written in Java.
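As a rough idea of what the analyzer produces, the following sketch calls python-mecab-ko directly. It is not the code Recoll runs internally. The function name korean_morphemes is hypothetical, and the sketch assumes that the package exposes a MeCab class in a module named mecab, with a morphs() method, as shown in the package's own documentation.

```python
def korean_morphemes(text):
    """Return the list of morphemes in text, or None if python-mecab-ko is missing."""
    try:
        # Assumption: python-mecab-ko installs a module named "mecab".
        from mecab import MeCab
    except ImportError:
        return None
    tagger = MeCab()
    return tagger.morphs(text)


if __name__ == "__main__":
    # Prints None when the package is not installed yet.
    print(korean_morphemes("나는 학교에 갔다"))
```

The exact segmentation depends on the dictionary shipped with the package, so no expected output is shown here.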

The necessary modules are not bundled with the base Recoll installation. The installation steps for supporting Korean depend on the OS and are described below for Windows and Linux.

Installing Mecab-ko and python-mecab-ko on Windows

The bundled Recoll Python installation comes with the pip utility, so you can just install the module from PyPI.

  • In a command window:

"C:\Program Files\Recoll\Share\filters\python.exe" -m pip install python-mecab-ko
  • Edit the Recoll index configuration file (default: C:\Users\[me]\AppData\Local\Recoll\recoll.conf) with, e.g., Notepad, and add the following line:

hangultagger = Mecab
  • Reset the index.

The above pip command will in general install the mecab-related files under C:\Users\[me]\AppData\Roaming\Python\Python312\site-packages.
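To check where pip actually placed the module, a short script like the following can be run with the same python.exe that was used for the pip command. This is only a sketch: find_mecab_module is a hypothetical helper name, and it assumes that python-mecab-ko installs a module named mecab.

```python
import importlib.util


def find_mecab_module():
    """Return the file path of the installed "mecab" module, or None if absent."""
    # Assumption: python-mecab-ko is importable under the name "mecab".
    spec = importlib.util.find_spec("mecab")
    if spec is None:
        return None
    return spec.origin


print(find_mecab_module())
```

If this prints None, the module is not visible to that Python interpreter, and the Korean tagger will most likely not be usable from it either.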

Installing python-mecab-ko on Linux

  • Just use your system Python and pip to install python-mecab-ko from PyPI

  • Edit recoll.conf as for Windows above

Once this is done, reset the index to get the new Korean terms.
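The recoll.conf edit used on both platforms can also be scripted. The sketch below works on the configuration text only and leaves reading and writing the file to the caller. The helper name ensure_hangultagger is hypothetical, and the code assumes that a global parameter must appear before the first [section] header in recoll.conf, so it inserts the line there rather than appending it at the end.

```python
import re


def ensure_hangultagger(conf_text, value="Mecab"):
    """Return conf_text with 'hangultagger = value' set in the global part."""
    lines = conf_text.splitlines()

    # Assumption: everything before the first "[...]" line is the global part.
    global_end = len(lines)
    for i, line in enumerate(lines):
        if line.strip().startswith("["):
            global_end = i
            break

    setting = f"hangultagger = {value}"
    pattern = re.compile(r"^\s*hangultagger\s*=")
    for i in range(global_end):
        if pattern.match(lines[i]):
            lines[i] = setting  # update an existing global entry
            break
    else:
        lines.insert(global_end, setting)  # no global entry yet: add one

    return "\n".join(lines) + "\n"


print(ensure_hangultagger("topdirs = ~/docs\n[/tmp]\nskippedNames = x\n"), end="")
```

Applying the function twice gives the same result as applying it once, so it is safe to rerun. Back up recoll.conf before writing any modified text to it.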