Overview
Korean Hangul text is handled well neither by the whitespace-based word splitter used for Western languages nor by the n-gram-based splitter used for Chinese characters.
As of version 1.27, Recoll can use an external text analyzer to split Korean text into appropriate terms.
The initial implementation was based on the Konlpy Python package, which has support for several morpheme analyzers for Korean (Hannanum, Kkma, Komoran, Mecab, Twitter/Okt).
Testing with the help of a kind Korean Recoll user finally led to choosing Mecab as the default, as it seemed to offer the best compromise between performance and quality.
The current Recoll implementation can still work with Konlpy if you want to experiment with different analyzers, but the default setup now uses python-mecab-ko, a direct interface to Mecab-ko which avoids the multiple Konlpy dependencies.
Mecab-ko is written in C++, unlike the others, which are in Java.
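To make the problem concrete, here is a small sketch in plain Python comparing the two generic strategies on a Korean sentence. It is an illustration only: the function names and the sample sentence are made up for this example, and Recoll's own splitters are not implemented this way.

```python
import unicodedata

# Illustration only: hypothetical helpers, not Recoll code.

def whitespace_terms(text):
    """Split on whitespace, as is done for Western languages."""
    return text.split()

def bigram_terms(text):
    """Overlapping character bigrams inside each whitespace-separated chunk,
    similar in spirit to n-gram splitting of Chinese text."""
    terms = []
    for chunk in text.split():
        if len(chunk) < 2:
            terms.append(chunk)
        else:
            terms.extend(chunk[i:i + 2] for i in range(len(chunk) - 1))
    return terms

# "I go to school". NFC keeps each Hangul syllable as a single character.
sample = unicodedata.normalize("NFC", "나는 학교에 갑니다")

# Whitespace splitting leaves the particle glued to the noun: the term is
# "학교에" ("school" + "to"), so the bare noun "학교" is not a term by itself.
print(whitespace_terms(sample))

# Bigrams do yield "학교", but also fragments such as "교에" that are not words.
print(bigram_terms(sample))
```

A morpheme analyzer such as Mecab is meant to separate the noun from its particle instead, which is what makes it a better source of index terms.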
Installing Mecab-ko and python-mecab-ko on Windows
As of 1.27.0 Recoll for Windows comes bundled with python-mecab-ko. You just need to install the main Mecab-ko package and its dictionaries. Fortunately, someone built packages for you.
-
Download the zip files for: mecab-ko-msvc and mecab-ko-dic-msvc.
-
Unzip both zip files under C:\Mecab. The location is currently mandatory, maybe I’ll check if it can be made configurable one day.
-
Edit the Recoll index configuration file (default: C:\Users\[me]\Appdata\Local\Recoll\recoll.conf) with, e.g., Notepad, and add the following line:
hangultagger = Mecab
-
Reset the index.
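If you prefer to script the configuration change rather than use an editor, something like the following sketch could do it. The helper name ensure_hangultagger is hypothetical (it is not part of Recoll). The sketch assumes the file is UTF-8 text, and it adds the line at the top of the file on the assumption that global parameters have to come before any [section] header.

```python
from pathlib import Path

def ensure_hangultagger(conf_path, tagger="Mecab"):
    """Add 'hangultagger = <tagger>' at the top of conf_path unless the file
    already sets hangultagger. Returns True if the file was modified."""
    path = Path(conf_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue  # skip comments and lines which are not assignments
        if stripped.split("=", 1)[0].strip() == "hangultagger":
            return False  # already configured, leave the file alone
    path.write_text(f"hangultagger = {tagger}\n" + text, encoding="utf-8")
    return True

# Example call for the default Windows location mentioned above
# (commented out so that nothing is changed by accident):
# ensure_hangultagger(Path.home() / "AppData" / "Local" / "Recoll" / "recoll.conf")
```

Keep a copy of recoll.conf before trying this, and check the result in an editor afterwards.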
Installing Mecab-ko and python-mecab-ko on Linux
The following installs Mecab to /usr/local. Use the --prefix=/usr argument to the configure commands to install to /usr instead.
-
Create a directory to build Mecab-ko:
cd
mkdir mecab
cd mecab
-
Retrieve, extract, build and install the software itself:
wget https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.1.tar.gz
tar xvzf mecab-0.996-ko-0.9.1.tar.gz
cd mecab-0.996-ko-0.9.1
./configure
make
make check
sudo make install
-
Retrieve, extract, build and install the dictionary:
cd .. # Now in the top mecab directory
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-1.6.1-20140814.tar.gz
tar xvzf mecab-ko-dic-1.6.1-20140814.tar.gz
cd mecab-ko-dic-1.6.1-20140814
./configure
# If you get an error about the version of automake files, re-bootstrap
# and run configure again. You will need autoconf and automake installed
# sh autogen.sh
# ./configure
make
# Tell mecab where its dictionary lives
sudo sh -c 'echo "dicdir=/usr/local/lib/mecab/dic/mecab-ko-dic" > /usr/local/etc/mecabrc'
sudo make install
# The following is necessary for the python-mecab-ko build to succeed if
# you installed mecab to /usr/local
sudo ln -s /usr/local/bin/mecab-config /usr/bin/mecab-config
-
Install python-mecab-ko.
Later versions of the package (>= 1.0.9) do not currently work with the above version of Mecab. You need to build and install 1.0.8:
sudo python3 -m pip install python-mecab-ko==1.0.8
-
Edit recoll.conf as described for Windows above, adding the hangultagger = Mecab line.
Done… Reset the index to get the new Korean terms.
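Before resetting the index, it can be worth checking that the pieces installed above fit together. The sketch below is not part of Recoll: read_dicdir and try_tagger are hypothetical helper names, the mecabrc path is the /usr/local one used above, and the mecab.MeCab().morphs() call is assumed from the python-mecab-ko documentation, so adjust it if your version differs.

```python
import os

def read_dicdir(mecabrc_path="/usr/local/etc/mecabrc"):
    """Return the dicdir value from a mecabrc file, or None if the file
    cannot be read or has no dicdir entry."""
    try:
        with open(mecabrc_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # mecabrc comment lines start with ';'
                if line.startswith(";") or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                if key.strip() == "dicdir":
                    return value.strip()
    except OSError:
        return None
    return None

def try_tagger(sentence):
    """Return the morphemes found by python-mecab-ko, or None if the package
    or its dictionary cannot be loaded."""
    try:
        import mecab  # module assumed to be provided by python-mecab-ko
        return mecab.MeCab().morphs(sentence)
    except Exception as exc:
        print("Mecab is not usable:", exc)
        return None

dicdir = read_dicdir()
print("dicdir:", dicdir, "exists:", bool(dicdir) and os.path.isdir(dicdir))
print("morphemes:", try_tagger("나는 학교에 갑니다"))
```

If the dictionary directory is reported as missing, or the second line prints None, recheck the dictionary and python-mecab-ko steps before reindexing.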