Token filter reference - Keyword marker

Keyword marker token filter

Marks specified tokens as keywords, which are not stemmed.

The keyword_marker filter assigns specified tokens a keyword attribute of true. Stemmer token filters, such as stemmer or porter_stem, skip tokens with a keyword attribute of true.

To work properly, the keyword_marker filter must be listed before any stemmer token filters in the analyzer configuration.

The keyword_marker filter uses Lucene's KeywordMarkerFilter.
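The interplay between the keyword attribute and a stemmer can be illustrated with a small, self-contained Python sketch. It is neither Lucene's KeywordMarkerFilter nor a real stemming algorithm: Token, keyword_marker, and toy_stemmer are hypothetical names, and the toy stemmer only strips a trailing "ing". The sketch is meant to show why the marker has to run before the stemmer.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    text: str
    keyword: bool = False  # mirrors the keyword attribute described above


def keyword_marker(tokens, keywords):
    """Set keyword=True on every token whose text is in `keywords`."""
    return [replace(t, keyword=True) if t.text in keywords else t for t in tokens]


def toy_stemmer(tokens):
    """Toy stand-in for a stemmer: strips a trailing "ing", skips keyword tokens."""
    out = []
    for t in tokens:
        if t.keyword or not t.text.endswith("ing") or len(t.text) <= 5:
            out.append(t)
            continue
        stem = t.text[:-3]
        # Collapse a doubled final consonant, e.g. "runn" -> "run".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        out.append(replace(t, text=stem))
    return out


tokens = [Token(w) for w in "fox running and jumping".split()]

# Marker before stemmer: "jumping" is protected.
marked_first = toy_stemmer(keyword_marker(tokens, {"jumping"}))
# Stemmer before marker: "jumping" is already "jump" when the marker runs.
stemmed_first = keyword_marker(toy_stemmer(tokens), {"jumping"})

print([t.text for t in marked_first])
print([t.text for t in stemmed_first])
```

With the marker first, the sketch prints ['fox', 'run', 'and', 'jumping']. With the order reversed, it prints ['fox', 'run', 'and', 'jump'], because the marker no longer finds a token that matches its keyword list.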

Example

To see how the keyword_marker filter works, you first need to produce a token stream that contains stemmed tokens.

The following analyze API request uses the stemmer filter to create stemmed tokens for fox running and jumping.

Python

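from elasticsearch import Elasticsearch

# Assumption: a local, unsecured cluster; adjust the URL and credentials as needed.
client = Elasticsearch("http://localhost:9200")

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        "stemmer"
    ],
    text="fox running and jumping",
)
print(resp)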

Ruby

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      'stemmer'
    ],
    text: 'fox running and jumping'
  }
)
puts response

Js

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: ["stemmer"],
  text: "fox running and jumping",
});
console.log(response);

Console

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [ "stemmer" ],
  "text": "fox running and jumping"
}

The request produces the following tokens. Note that running was stemmed to run and jumping was stemmed to jump.

Text

[ fox, run, and, jump ]

To prevent jumping from being stemmed, add the keyword_marker filter before the stemmer filter in the previous analyze API request. Specify jumping in the keywords parameter of the keyword_marker filter.

Python

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "keyword_marker",
            "keywords": [
                "jumping"
            ]
        },
        "stemmer"
    ],
    text="fox running and jumping",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'keyword_marker',
        keywords: [
          'jumping'
        ]
      },
      'stemmer'
    ],
    text: 'fox running and jumping'
  }
)
puts response

Js

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    {
      type: "keyword_marker",
      keywords: ["jumping"],
    },
    "stemmer",
  ],
  text: "fox running and jumping",
});
console.log(response);

Console

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": [ "jumping" ]
    },
    "stemmer"
  ],
  "text": "fox running and jumping"
}

The request produces the following tokens. running is still stemmed to run, but jumping is not stemmed.

Text

[ fox, run, and, jumping ]

To see the keyword attribute for these tokens, add the following arguments to the analyze API request:

explain: true
attributes: keyword

Python

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "keyword_marker",
            "keywords": [
                "jumping"
            ]
        },
        "stemmer"
    ],
    text="fox running and jumping",
    explain=True,
    attributes="keyword",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'keyword_marker',
        keywords: [
          'jumping'
        ]
      },
      'stemmer'
    ],
    text: 'fox running and jumping',
    explain: true,
    attributes: 'keyword'
  }
)
puts response

Js

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    {
      type: "keyword_marker",
      keywords: ["jumping"],
    },
    "stemmer",
  ],
  text: "fox running and jumping",
  explain: true,
  attributes: "keyword",
});
console.log(response);

Console

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": [ "jumping" ]
    },
    "stemmer"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}

The API returns the following response. Note that the jumping token has a keyword attribute of true.

Console-Result

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "whitespace",
      "tokens": [
        {
          "token": "fox",
          "start_offset": 0,
          "end_offset": 3,
          "type": "word",
          "position": 0
        },
        {
          "token": "running",
          "start_offset": 4,
          "end_offset": 11,
          "type": "word",
          "position": 1
        },
        {
          "token": "and",
          "start_offset": 12,
          "end_offset": 15,
          "type": "word",
          "position": 2
        },
        {
          "token": "jumping",
          "start_offset": 16,
          "end_offset": 23,
          "type": "word",
          "position": 3
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "__anonymous__keyword_marker",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": false
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": false
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          }
        ]
      },
      {
        "name": "stemmer",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": false
          },
          {
            "token": "run",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": false
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          }
        ]
      }
    ]
  }
}
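The explain output is verbose, so it can help to reduce it to token and keyword pairs per filter. The following Python sketch does that for a plain dict shaped like the response above. The helper name keyword_flags is hypothetical, and the sample is a trimmed copy of the response with offsets, types, and positions omitted.

```python
def keyword_flags(response):
    """Return {filter_name: [(token, keyword), ...]} from an explain response."""
    return {
        tf["name"]: [(tok["token"], tok.get("keyword")) for tok in tf["tokens"]]
        for tf in response["detail"]["tokenfilters"]
    }


# Trimmed copy of the response shown above.
sample = {
    "detail": {
        "tokenfilters": [
            {
                "name": "__anonymous__keyword_marker",
                "tokens": [
                    {"token": "fox", "keyword": False},
                    {"token": "running", "keyword": False},
                    {"token": "and", "keyword": False},
                    {"token": "jumping", "keyword": True},
                ],
            },
            {
                "name": "stemmer",
                "tokens": [
                    {"token": "fox", "keyword": False},
                    {"token": "run", "keyword": False},
                    {"token": "and", "keyword": False},
                    {"token": "jumping", "keyword": True},
                ],
            },
        ]
    }
}

flags = keyword_flags(sample)
# Tokens that left the stemmer with the keyword attribute still set.
protected = [token for token, is_keyword in flags["stemmer"] if is_keyword]
print(protected)
```

For the sample above, the sketch prints ['jumping'], the only token the stemmer left untouched because of its keyword attribute.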

Configurable parameters

ignore_case
(Optional, Boolean) If true, matching for the keywords and keywords_path parameters ignores letter case. Defaults to false.
keywords
(Required*, array of strings) Array of keywords. Tokens that match these keywords are not stemmed.
This parameter, keywords_path, or keywords_pattern must be specified. You cannot specify this parameter together with keywords_pattern.
keywords_path
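(Required*, string) Path to a file that contains a list of keywords. Tokens that match these keywords are not stemmed.
This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break.
This parameter, keywords, or keywords_pattern must be specified. You cannot specify this parameter together with keywords_pattern.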
keywords_pattern
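(Required*, string) Java regular expression used to match tokens. Tokens that match this expression are marked as keywords and not stemmed.
This parameter, keywords, or keywords_path must be specified. You cannot specify this parameter together with keywords or keywords_path.
Poorly written regular expressions can cause Elasticsearch to run slowly or trigger stack overflow errors, which can make the running node exit suddenly.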
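The rules above on which parameters are required and which cannot be combined can be checked before a request is sent. The following Python sketch is a client-side sanity check only, not Elasticsearch's own validation; check_keyword_marker is a hypothetical helper, and the pattern "jump.*" is an illustrative value.

```python
def check_keyword_marker(config):
    """Return a list of problems found in a keyword_marker filter definition."""
    problems = []
    sources = [
        name
        for name in ("keywords", "keywords_path", "keywords_pattern")
        if name in config
    ]
    # One of the three keyword sources has to be present.
    if not sources:
        problems.append("specify keywords, keywords_path, or keywords_pattern")
    # keywords_pattern cannot be combined with either of the other two.
    if "keywords_pattern" in sources and len(sources) > 1:
        problems.append(
            "keywords_pattern cannot be combined with keywords or keywords_path"
        )
    return problems


valid = check_keyword_marker({"type": "keyword_marker", "keywords": ["jumping"]})
missing = check_keyword_marker({"type": "keyword_marker", "ignore_case": True})
conflict = check_keyword_marker(
    {"type": "keyword_marker", "keywords": ["jumping"], "keywords_pattern": "jump.*"}
)

print(valid)
print(missing)
print(conflict)
```

The first definition passes with an empty list, the second is reported as missing a keyword source, and the third is reported as an invalid combination.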

Customize and add to an analyzer

To customize the keyword_marker filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses a custom keyword_marker filter and the porter_stem filter to configure a new custom analyzer.

The custom keyword_marker filter marks the tokens listed in the analysis/example_word_list.txt file as keywords. The porter_stem filter does not stem these tokens.

Python

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "my_custom_keyword_marker_filter",
                        "porter_stem"
                    ]
                }
            },
            "filter": {
                "my_custom_keyword_marker_filter": {
                    "type": "keyword_marker",
                    "keywords_path": "analysis/example_word_list.txt"
                }
            }
        }
    },
)
print(resp)

Ruby

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_custom_analyzer: {
            type: 'custom',
            tokenizer: 'standard',
            filter: [
              'my_custom_keyword_marker_filter',
              'porter_stem'
            ]
          }
        },
        filter: {
          my_custom_keyword_marker_filter: {
            type: 'keyword_marker',
            keywords_path: 'analysis/example_word_list.txt'
          }
        }
      }
    }
  }
)
puts response

Js

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_custom_analyzer: {
          type: "custom",
          tokenizer: "standard",
          filter: ["my_custom_keyword_marker_filter", "porter_stem"],
        },
      },
      filter: {
        my_custom_keyword_marker_filter: {
          type: "keyword_marker",
          keywords_path: "analysis/example_word_list.txt",
        },
      },
    },
  },
});
console.log(response);

Console

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "my_custom_keyword_marker_filter",
            "porter_stem"
          ]
        }
      },
      "filter": {
        "my_custom_keyword_marker_filter": {
          "type": "keyword_marker",
          "keywords_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
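The request above assumes that analysis/example_word_list.txt already exists under the config location. As a minimal sketch of the file format described for keywords_path (UTF-8, one word per line), the following Python snippet writes such a file. The helper write_word_list and the two sample words are hypothetical, and a temporary directory stands in for the real config directory.

```python
import tempfile
from pathlib import Path


def write_word_list(config_dir, relative_path, words):
    """Write one word per line, UTF-8 encoded, under the given config directory."""
    target = Path(config_dir) / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(words) + "\n", encoding="utf-8")
    return target


# Assumption: a temporary directory stands in for the Elasticsearch config directory.
with tempfile.TemporaryDirectory() as config_dir:
    path = write_word_list(
        config_dir, "analysis/example_word_list.txt", ["jumping", "running"]
    )
    written = path.read_text(encoding="utf-8").splitlines()

print(written)
```

In a real deployment the target directory would be the config location of the node, and the word list would hold whichever tokens should be protected from stemming.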