Tinkering with Mastodon's full-text search code (Elasticsearch + Sudachi), updated for v3.3.0

Happy New Year. I look forward to another year with you all.

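Since late last year I have been reworking Mastodon's Elasticsearch-based full-text search code to suit my own use, drawing on the code of people who tackled this before me. I have now committed the result to my own repository on GitHub, so here is a write-up.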

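Mastodon uses Elasticsearch for full-text search, but the upstream code targets English, and out of the box searching in Japanese is not a pleasant experience. To address this, Zema-san (ぜまさん) of Kurage-don (クラゲ丼) worked out a way to plug Sudachi into Elasticsearch.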

github.com

kurage.cc

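For the most part you can simply follow that blog post, but a fair amount of code changed in v3.3.0, so a little extra work is needed if you want to keep the new upstream code intact.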

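Before modifying any code, a word on setting up the environment. Japanese full-text search here uses Elasticsearch + Sudachi, and the Sudachi plugin for Elasticsearch has to be downloaded as a file and installed by hand.
I will skip the installation of Elasticsearch itself. The version used here is 7.10.1.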

First, download and install the plugin. I ran everything here as the root user.

github.com

Since the target is Elasticsearch 7.10.1, download the build for 7.10.1.

wget https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v2.1.0/analysis-sudachi-7.10.1-2.1.0.zip
/usr/share/elasticsearch/bin/elasticsearch-plugin install file:///root/analysis-sudachi-7.10.1-2.1.0.zip

Next, download the dictionaries and put them in place. I keep them in /etc/elasticsearch/sudachi. The dictionaries are distributed as zip files, so install the unzip command first.

apt install unzip
wget https://oss.sonatype.org/content/repositories/snapshots/com/worksap/nlp/sudachi/0.1.2-SNAPSHOT/sudachi-0.1.2-20190401.094135-19-dictionary-core.zip
unzip sudachi-0.1.2-20190401.094135-19-dictionary-core.zip
wget https://oss.sonatype.org/content/repositories/snapshots/com/worksap/nlp/sudachi/0.1.2-SNAPSHOT/sudachi-0.1.2-20190401.094135-19-dictionary-full.zip
unzip sudachi-0.1.2-20190401.094135-19-dictionary-full.zip
mkdir /etc/elasticsearch/sudachi
cp system_*.dic /etc/elasticsearch/sudachi/

I want to use the full dictionary, so I place a copy of sudachi.json, edited to point to it, in /etc/elasticsearch/sudachi.

github.com

wget https://raw.githubusercontent.com/WorksApplications/Sudachi/develop/src/main/resources/sudachi.json
sed -i -e 's/system_core.dic/system_full.dic/g' ./sudachi.json
cp sudachi.json /etc/elasticsearch/sudachi/

I also use the ICU plugin, so I installed that as well.

/usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu

Once this is done, restart Elasticsearch.

Now, on to modifying the Mastodon source code.

First, the part of the index definition that deals with the toot text.

https://github.com/kaias1jp/mastodon/blob/master/app/chewy/statuses_index.rb

~/live/app/chewy/statuses_index.rb

 +    tokenizer: {
 +      sudachi_tokenizer: {
 +        type: 'sudachi_tokenizer',
 +        discard_punctuation: true,
 +        resources_path: '/etc/elasticsearch/sudachi',
 +        settings_path: '/etc/elasticsearch/sudachi/sudachi.json',
 +      },
 +    },

Zema-san's post includes the line "mode: 'search',", but this option can no longer be used with Elasticsearch 7.10.1, so I leave it out.

~/live/app/chewy/statuses_index.rb

 -      content: {
 -        tokenizer: 'uax_url_email',
 -        filter: %w(
 -          english_possessive_stemmer
 -          lowercase
 -          asciifolding
 -          cjk_width
 -          english_stop
 -          english_stemmer
 -        ),
 -      },

 +      content: {
 +        char_filter: %w(icu_normalizer),
 +        tokenizer: 'sudachi_tokenizer',
 +        type: 'custom',
 +        filter: %w(
 +          lowercase
 +          cjk_width
 +          sudachi_part_of_speech
 +          sudachi_ja_stop
 +          sudachi_baseform
 +          english_possessive_stemmer
 +          asciifolding
 +          english_stop
 +          english_stemmer
 +        ),
 +      },

This switches the tokenizer to Sudachi's and adds the Sudachi filters.
I also apply ICU normalization (the icu_normalizer char filter). With it, a toot whose text contains "㌔" can also be found by searching for "キロ".
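To get a feel for what this normalization does, here is a minimal Ruby sketch. It uses Ruby's built-in String#unicode_normalize with NFKC plus downcase, which is only an approximation of the icu_normalizer char filter (as far as I know its default mode also folds case, but treat that as an assumption). The helper name normalize_for_search is made up for illustration and is not part of Mastodon.

```ruby
# Rough stand-in for ICU normalization using only Ruby's standard library.
# normalize_for_search is a hypothetical helper, not Mastodon or Elasticsearch code.
def normalize_for_search(text)
  # NFKC replaces compatibility characters (squared katakana, full-width
  # Latin letters, and so on) with their ordinary forms.
  # downcase is a simple stand-in for case folding.
  text.unicode_normalize(:nfkc).downcase
end

puts normalize_for_search('㌔')     # squared "kiro" becomes plain katakana
puts normalize_for_search('ＡＢＣ') # full-width Latin becomes lowercase ASCII
```

Because the same normalization runs at index time and at search time, both the stored text and the keyword end up in the same form, which is why either spelling matches.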

Next is the main piece of code that runs when a search is performed.

https://github.com/kaias1jp/mastodon/blob/master/app/services/search_service.rb

~/live/app/services/search_service.rb

  def perform_statuses_search!
 -    definition = parsed_query.apply(StatusesIndex.filter(term: { searchable_by: @account.id }))
 +    definition = parsed_query.apply(StatusesIndex).order(id: :desc)

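I tried to preserve as much of the current master code as possible. There are two changes here. First, the search is no longer restricted by account.id. Second, I added an order clause so that results are sorted by date. The sort is actually on id, but that should be equivalent to sorting by date.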
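Why sorting by id should behave like sorting by date: as I understand it, Mastodon status ids are Snowflake-style, with a millisecond timestamp in the upper bits and a 16-bit sequence in the lower bits. The sketch below only illustrates that idea. The bit layout is my assumption rather than something checked against the v3.3.0 source here, and fake_snowflake_id and time_from_id are hypothetical helpers.

```ruby
# Assumption: id = (unix time in milliseconds << 16) | 16-bit sequence.
SEQUENCE_BITS = 16

# Build an id the way a Snowflake-style generator might (hypothetical helper).
def fake_snowflake_id(time, sequence = 0)
  milliseconds = (time.to_r * 1000).to_i
  (milliseconds << SEQUENCE_BITS) | (sequence & 0xFFFF)
end

# Recover the embedded timestamp from such an id (hypothetical helper).
def time_from_id(id)
  Time.at(Rational(id >> SEQUENCE_BITS, 1000)).utc
end

older = fake_snowflake_id(Time.utc(2021, 1, 1))
newer = fake_snowflake_id(Time.utc(2021, 1, 2))

# Sorting ids in descending order puts the newest status first.
sorted = [older, newer].sort.reverse
puts sorted.map { |id| time_from_id(id) }
```

Under this layout, order(id: :desc) returns the newest statuses first without needing a separate date field in the index.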

~/live/app/services/search_service.rb

  def relations_map_for_account(account, account_ids, domains)
    {
      blocking: Account.blocking_map(account_ids, account.id),
      blocked_by: Account.blocked_by_map(account_ids, account.id),
      muting: Account.muting_map(account_ids, account.id),
 -      following: Account.following_map(account_ids, account.id),
 +      #following: Account.following_map(account_ids, account.id),
      domain_blocking_by_domain: Account.domain_blocking_map_by_domain(domains, account.id),
    }
  end

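This comments out the follow entry in the map used by the reject step. Together with dropping the account.id restriction above, this should make public toots that flow through the federated timeline searchable as well.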

~/live/app/services/search_service.rb

  def parsed_query
 -    SearchQueryTransformer.new.apply(SearchQueryParser.new.parse(@query))
 +    SearchQueryTransformer.new.apply(SearchQueryParser.new.parse(@query.gsub(/(　)/," ")))
  end

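This change is purely for my own convenience. It replaces full-width spaces in the search string with half-width (ASCII) spaces, so that a full-width space is also treated as a keyword separator. Phrase search is in fact supported, but I could not think of a Japanese phrase search that would include a full-width space, so I simply replace them all.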
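The replacement can be tried on its own outside of Mastodon. The sketch below is a minimal standalone version; normalize_spaces is a hypothetical helper name, and the sample query is made up.

```ruby
# Replace full-width (ideographic) spaces, U+3000, with ASCII spaces.
# normalize_spaces is a hypothetical helper, not part of Mastodon.
def normalize_spaces(query)
  query.gsub("\u3000", ' ')
end

query = "mastodon\u3000全文検索"
# After normalization, the query splits into two keywords on the ASCII space.
p normalize_spaces(query).split(' ')
```

String#tr would work just as well for a single-character replacement like this; gsub is used here only to stay close to the change above.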

Zema-san's post replaces the multi_match clause with match; the following code does the same.

https://github.com/kaias1jp/mastodon/blob/master/app/lib/search_query_transformer.rb

~/live/app/lib/search_query_transformer.rb

    def clause_to_query(clause)
      case clause
      when TermClause
 -        { multi_match: { type: 'most_fields', query: clause.term, fields: ['text', 'text.stemmed'] } }
 +        { match: {  'text.stemmed': {query: clause.term} } }

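Indeed, Japanese search does not work well with the original code. For example, even if you want to search for the keyword "いわゆる", it is split into "い", "わ", "ゆ", "る" and searched that way. The modified code avoids this.
That is all for the changes related to Japanese search. Everything below is about making things more convenient for me personally.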
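To compare the two query shapes side by side, the sketch below builds both hashes and prints the JSON that would be sent to Elasticsearch. The two method names are hypothetical, and my reading that text.stemmed is the sub-field analyzed with the Sudachi-based content analyzer (while text falls back to a default analyzer) is an assumption, not something stated in the source.

```ruby
require 'json'

# Upstream shape: searches both text and text.stemmed (hypothetical helper name).
def original_term_query(term)
  { multi_match: { type: 'most_fields', query: term, fields: ['text', 'text.stemmed'] } }
end

# Modified shape: searches only text.stemmed (hypothetical helper name).
def modified_term_query(term)
  { match: { 'text.stemmed': { query: term } } }
end

puts JSON.pretty_generate(original_term_query('いわゆる'))
puts JSON.pretty_generate(modified_term_query('いわゆる'))
```

Restricting the query to one field means only one analyzer decides how the keyword is tokenized, which is the point of the change.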

https://github.com/kaias1jp/mastodon/blob/master/app/controllers/api/v2/search_controller.rb

~/live/app/controllers/api/v2/search_controller.rb

 -  RESULTS_LIMIT = 20
 +  RESULTS_LIMIT = 500

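This sets the maximum number of search results. Frankly, 20 feels too few. That said, 500 is quite likely to time out depending on the environment and the keyword, so I suspect around 100 at most is a sensible value.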

https://github.com/kaias1jp/mastodon/blob/master/lib/mastodon/search_cli.rb

~/live/lib/mastodon/search_cli.rb

                  # The following is an optimization for statuses specifically, since
                  # we want to de-index statuses that cannot be searched by anybody,
                  # but can't use Chewy's delete_if logic because it doesn't use
                  # crutches and our searchable_by logic depends on them
                  if type == StatusesIndex::Status
                    bulk_body.map! do |entry|
 -                      if entry[:index] && entry.dig(:index, :data, 'searchable_by').blank?
 -                        index_count  -= 1
 -                        delete_count += 1
 -
 -                        { delete: entry[:index].except(:data) }
 -                      else
 -                        entry
 -                      end
 +                  #    if entry[:index] && entry.dig(:index, :data, 'searchable_by').blank?
 +                  #      index_count  -= 1
 +                  #      delete_count += 1
 +
 +                  #      { delete: entry[:index].except(:data) }
 +                  #    else
 +                        entry
 +                  #    end
                    end
                  end

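This is the code that runs when you execute "bin/tootctl search deploy". I am somewhat torn on this one: "search deploy" looks at searchable_by and removes entries from the index accordingly. I have not read all of the source carefully, so I cannot say for sure, but I suspect this is inconsistent with how statuses are added to the index after "search deploy" has finished. "search deploy" loads data more reliably than "rails chewy:deploy", so I want to use it wherever possible. That is why I commented out the delete-related code.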
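To make the effect of the commented-out block concrete, here is a standalone sketch of what the upstream logic does to a batch. It avoids Rails helpers (blank?, except) so it runs in plain Ruby; the sample entries and the helper names blank_value? and convert_unsearchable_to_delete are hypothetical, and real entries are built by Chewy.

```ruby
# Plain-Ruby stand-in for ActiveSupport's blank? (hypothetical helper).
def blank_value?(value)
  value.nil? || (value.respond_to?(:empty?) && value.empty?)
end

# Mirrors the upstream block: index entries with an empty searchable_by
# are turned into delete operations (hypothetical helper name).
def convert_unsearchable_to_delete(bulk_body)
  bulk_body.map do |entry|
    if entry[:index] && blank_value?(entry.dig(:index, :data, 'searchable_by'))
      { delete: entry[:index].reject { |key, _| key == :data } }
    else
      entry
    end
  end
end

# Made-up sample batch for illustration.
bulk_body = [
  { index: { _id: 1, data: { 'text' => 'hello', 'searchable_by' => [42] } } },
  { index: { _id: 2, data: { 'text' => 'public toot', 'searchable_by' => [] } } },
]

p convert_unsearchable_to_delete(bulk_body) # upstream behaviour
p bulk_body                                 # with the block commented out, entries pass through as-is
```

With the modification, the second entry stays an index operation, so toots that nobody is listed for in searchable_by remain in the index.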

That covers the modifications. I tried to keep the original source intact while making Japanese search work properly. If anything is unclear, feel free to contact me at popn_ja@popon.pptdn.jp.

popon.pptdn.jp

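プログラミングなんてわからないんですけど〜 (I don't really get programming, though...)

A diary of off-hours programming by a former programmer. Meant to be one of three blogs, but software topics mainly live here.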
