Recoll for Korean

Overview

Korean Hangul text is not processed well either by the whitespace-based word splitter used for Western languages or by the n-gram based splitter used for Chinese characters.

As of version 1.27, Recoll supports using an external text analyzer to split Korean text into appropriate terms.

The initial implementation was based on the Konlpy Python package, which has support for several morpheme analyzers for Korean (Hannanum, Kkma, Komoran, Mecab, Twitter/Okt).

Testing with the help of a kind Korean Recoll user eventually led to choosing Mecab as the default, as it seemed to offer the best compromise between performance and quality.

The current Recoll implementation can still work with Konlpy, if you want to experiment with different analyzers. However, the default setup now uses python-mecab-ko, which is a direct interface to Mecab-ko and avoids the multiple Konlpy dependencies.

Mecab-ko is written in C++, unlike the others, which are Java-based.

The necessary modules are not bundled with the base Recoll installation. The installation steps for supporting Korean depend on the OS and are described below, for Windows and Linux.

Installing Mecab-ko and python-mecab-ko on Windows

As of 1.27.0 Recoll for Windows comes bundled with python-mecab-ko. You just need to install the main Mecab-ko package and its dictionaries. Fortunately, someone built packages for you.

Procedure

Download the zip files for mecab-ko-msvc and mecab-ko-dic-msvc.
Unzip both zip files under C:\Mecab. This location is currently mandatory; maybe I’ll check whether it can be made configurable one day.
Edit the Recoll index configuration file (default: C:\Users\[me]\AppData\Local\Recoll\recoll.conf) with, e.g., Notepad, and add the following line:

hangultagger = Mecab

Reset the index.

Installing Mecab-ko and python-mecab-ko on Linux

The following installs Mecab to /usr/local. Use the --prefix=/usr argument to the configure commands to install to /usr instead.

Create a directory to build Mecab-ko:

cd
mkdir mecab
cd mecab

Retrieve, extract, build and install the software itself:

wget https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.1.tar.gz
tar xvzf mecab-0.996-ko-0.9.1.tar.gz
cd mecab-0.996-ko-0.9.1
./configure
make
make check
sudo make install

Retrieve, extract, build and install the dictionary:

cd ..    # Now in the top mecab directory
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-1.6.1-20140814.tar.gz
tar xvzf mecab-ko-dic-1.6.1-20140814.tar.gz
cd mecab-ko-dic-1.6.1-20140814
./configure
# If you get an error about the version of automake files, re-bootstrap
# and run configure again. You will need autoconf and automake installed
# sh autogen.sh
# ./configure
make
# Tell mecab where its dictionary lives
sudo sh -c 'echo "dicdir=/usr/local/lib/mecab/dic/mecab-ko-dic" > /usr/local/etc/mecabrc'
sudo make install
# The following is necessary for the python-mecab-ko build to succeed if
# you installed mecab to /usr/local
sudo ln -s /usr/local/bin/mecab-config /usr/bin/mecab-config
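
Before moving on to the Python module, it may be worth checking that the Mecab binary and dictionary installed above actually work. The following is only a sketch: the have helper is a hypothetical name introduced here, and the sample sentence is arbitrary.

```shell
# Hypothetical helper (not part of Mecab or Recoll): true if a command is on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

if have mecab && have mecab-config; then
    mecab --version          # installed Mecab version
    mecab-config --dicdir    # base directory where dictionaries are installed
    # Analyze a short sample: one morpheme per line, terminated by an EOS line.
    echo "한국어 형태소 분석기" | mecab
else
    echo "mecab or mecab-config not found on PATH" >&2
fi
```

If mecab is found but fails to start with a shared library error after an install to /usr/local, running sudo ldconfig is a common remedy on many Linux systems; whether it is needed depends on your distribution.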

Install python-mecab-ko.

Later versions of the package (>= 1.0.9) do not currently work with the above version of Mecab. You need to build and install version 1.0.8:

sudo python3 -m pip install python-mecab-ko==1.0.8
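
A quick way to check the result of the pip install is to try the module from the command line. This sketch assumes that python-mecab-ko 1.0.8 provides a mecab module with a MeCab class and a morphs() method, as shown in the package's own README; the pymecab_status variable is a hypothetical name used only here.

```shell
# Assumption: python-mecab-ko installs a module named "mecab" exposing
# mecab.MeCab().morphs(). Adjust if your version differs.
if python3 -c 'import mecab' 2>/dev/null; then
    pymecab_status=ok
    # Print the list of morphemes found in a short sample sentence.
    python3 -c 'import mecab; print(mecab.MeCab().morphs("한국어 형태소 분석기"))'
else
    pymecab_status=missing
fi
echo "python-mecab-ko: $pymecab_status"
```

If the import works but creating the MeCab object fails, the dictionary location set in mecabrc is a likely place to look.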

Edit recoll.conf as described for Windows above (the default location on Linux is ~/.recoll/recoll.conf).

Done… Reset the index to get the new Korean terms.
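
The last two steps can also be done from a terminal. The following is a minimal sketch, assuming the default configuration directory ~/.recoll; enable_mecab_tagger is a hypothetical helper written for this example, not a Recoll command.

```shell
# Hypothetical helper: append the hangultagger line to the given recoll.conf,
# unless a hangultagger setting is already present.
enable_mecab_tagger() {
    conf="$1"
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    if ! grep -q '^[[:space:]]*hangultagger[[:space:]]*=' "$conf"; then
        # Make sure the existing file ends with a newline before appending.
        if [ -s "$conf" ] && [ -n "$(tail -c1 "$conf")" ]; then
            echo >> "$conf"
        fi
        echo 'hangultagger = Mecab' >> "$conf"
    fi
}

# Usage, assuming the default Linux configuration directory:
#   enable_mecab_tagger "$HOME/.recoll/recoll.conf"
#   recollindex -z    # reset the index and rebuild it from scratch
```

The helper leaves an existing hangultagger setting untouched, so if you previously selected another analyzer, edit that line by hand instead.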