Token filter reference - Dictionary decompounder

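Dictionary decompounder token filter

In most cases, we recommend using the faster hyphenation_decompounder token filter in place of this filter. However, you can use the dictionary_decompounder filter to check the quality of a word list before implementing it in the hyphenation_decompounder filter.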

Uses a specified list of words and a brute force approach to find subwords in compound words. If found, these subwords are included in the token output.
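The brute force search described above can be sketched in plain Python. This is only an illustration, not the filter's actual implementation: the function name decompound is hypothetical, matching is assumed to be an exact, case-sensitive substring lookup, and the two size limits simply reuse the default values documented for min_subword_size and max_subword_size.

```python
def decompound(token, word_list, min_subword_size=2, max_subword_size=15):
    """Return the original token followed by every dictionary subword found in it.

    Hypothetical helper for illustration; matching is assumed to be exact
    and case-sensitive.
    """
    dictionary = set(word_list)
    output = [token]  # the original token is always kept

    # Brute force: try every start offset and every allowed subword length.
    for start in range(len(token) - min_subword_size + 1):
        for size in range(min_subword_size, max_subword_size + 1):
            end = start + size
            if end > len(token):
                break  # candidate would run past the end of the token
            candidate = token[start:end]
            if candidate in dictionary:
                output.append(candidate)
    return output


print(decompound("Donaudampfschiff", ["Donau", "dampf", "meer", "schiff"]))
# prints ['Donaudampfschiff', 'Donau', 'dampf', 'schiff']
```

The sketch leaves out min_word_size and only_longest_match, so it should be read as a way to reason about which substrings are candidates rather than as a replacement for testing a word list against the real filter.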

This filter uses Lucene's DictionaryCompoundWordTokenFilter, which was built for Germanic languages.

Example

The following analyze API request uses the dictionary_decompounder filter to find subwords in Donaudampfschiff. The filter then checks these subwords against the specified list of words: Donau, dampf, meer, and schiff.

Python

resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        {
            "type": "dictionary_decompounder",
            "word_list": [
                "Donau",
                "dampf",
                "meer",
                "schiff"
            ]
        }
    ],
    text="Donaudampfschiff",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      {
        type: 'dictionary_decompounder',
        word_list: [
          'Donau',
          'dampf',
          'meer',
          'schiff'
        ]
      }
    ],
    text: 'Donaudampfschiff'
  }
)
puts response

Js

const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: [
    {
      type: "dictionary_decompounder",
      word_list: ["Donau", "dampf", "meer", "schiff"],
    },
  ],
  text: "Donaudampfschiff",
});
console.log(response);

Console

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "meer", "schiff"]
    }
  ],
  "text": "Donaudampfschiff"
}

The filter produces the following tokens:

Text

[ Donaudampfschiff, Donau, dampf, schiff ]

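Configurable parameters

word_list
(Required*, array of strings) A list of subwords to look for in the token stream. If found, the subword is included in the token output.
Either this parameter or word_list_path must be specified.
word_list_path
(Required*, string) Path to a file that contains a list of subwords to look for in the token stream. If found, the subword is included in the token output.
This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each token in the file must be separated by a line break.
Either this parameter or word_list must be specified.
max_subword_size
(Optional, integer) Maximum subword character length. Longer subword tokens are excluded from the output. Defaults to 15.
min_subword_size
(Optional, integer) Minimum subword character length. Shorter subword tokens are excluded from the output. Defaults to 2.
min_word_size
(Optional, integer) Minimum word character length. Shorter word tokens are excluded from the output. Defaults to 5.
only_longest_match
(Optional, Boolean) If true, only include the longest matching subword. Defaults to false.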
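To make the file format expected by word_list_path concrete (UTF-8 encoded, one token per line), the following sketch writes such a file and reads it back. The helper names write_word_list and read_word_list are hypothetical and not part of Elasticsearch or its clients, and the temporary directory only stands in for the config location.

```python
import os
import tempfile


def write_word_list(path, words):
    """Write one subword per line, UTF-8 encoded (hypothetical helper)."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in words:
            handle.write(word + "\n")


def read_word_list(path):
    """Read the file back for a quick sanity check, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Usage: a temporary directory stands in for the config location here.
with tempfile.TemporaryDirectory() as config_dir:
    path = os.path.join(config_dir, "example_word_list.txt")
    write_word_list(path, ["Donau", "dampf", "meer", "schiff"])
    print(read_word_list(path))
    # prints ['Donau', 'dampf', 'meer', 'schiff']
```

Reading the file back only confirms that each entry sits on its own line; it says nothing about how the filter itself treats blank lines or other irregularities.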

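Customize and add to an analyzer

To customize the dictionary_decompounder filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses a custom dictionary_decompounder filter to configure a new custom analyzer.

The custom dictionary_decompounder filter finds subwords in the analysis/example_word_list.txt file. Subwords longer than 22 characters are excluded from the token output.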

Python

resp = client.indices.create(
    index="dictionary_decompound_example",
    settings={
        "analysis": {
            "analyzer": {
                "standard_dictionary_decompound": {
                    "tokenizer": "standard",
                    "filter": [
                        "22_char_dictionary_decompound"
                    ]
                }
            },
            "filter": {
                "22_char_dictionary_decompound": {
                    "type": "dictionary_decompounder",
                    "word_list_path": "analysis/example_word_list.txt",
                    "max_subword_size": 22
                }
            }
        }
    },
)
print(resp)

Js

const response = await client.indices.create({
  index: "dictionary_decompound_example",
  settings: {
    analysis: {
      analyzer: {
        standard_dictionary_decompound: {
          tokenizer: "standard",
          filter: ["22_char_dictionary_decompound"],
        },
      },
      filter: {
        "22_char_dictionary_decompound": {
          type: "dictionary_decompounder",
          word_list_path: "analysis/example_word_list.txt",
          max_subword_size: 22,
        },
      },
    },
  },
});
console.log(response);

Console

PUT dictionary_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_dictionary_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_dictionary_decompound" ]
        }
      },
      "filter": {
        "22_char_dictionary_decompound": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "max_subword_size": 22
        }
      }
    }
  }
}