Tokenizer reference - Edge n-gram


Edge n-gram tokenizer

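The edge_ngram tokenizer first breaks text into words whenever it encounters one of a list of specified characters. It then emits N-grams of each word, with the start of every N-gram anchored to the beginning of the word.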

Edge N-grams are useful for search-as-you-type queries.

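When you need search-as-you-type for text that has a widely known order, such as movie or song titles, the completion suggester is a much more efficient choice than edge N-grams. Edge N-grams have the advantage when you are trying to autocomplete words that can appear in any order.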

Example output

With the default settings, the edge_ngram tokenizer treats the initial text as a single token and produces N-grams with a minimum length of 1 and a maximum length of 2:

Python

# Assumes `client` is an already configured elasticsearch.Elasticsearch instance.
resp = client.indices.analyze(
    tokenizer="edge_ngram",
    text="Quick Fox",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    tokenizer: 'edge_ngram',
    text: 'Quick Fox'
  }
)
puts response

Js

const response = await client.indices.analyze({
  tokenizer: "edge_ngram",
  text: "Quick Fox",
});
console.log(response);

Console

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}

The above text produces the following terms:

Text

[ Q, Qu ]

These default gram lengths are almost entirely useless. You need to configure the edge_ngram tokenizer before using it.

Configuration

The edge_ngram tokenizer accepts the following parameters:

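min_gram
Minimum length of characters in a gram. Defaults to 1.
max_gram
Maximum length of characters in a gram. Defaults to 2.
See "Limitations of the max_gram parameter".
token_chars
Character classes that should be included in a token. Elasticsearch splits on characters that do not belong to the specified classes. Defaults to [] (keep all characters).
Character classes may be any of the following:
- letter: for example a, b, ï or 京
- digit: for example 3 or 7
- whitespace: for example " " or "\n"
- punctuation: for example ! or "
- symbol: for example $ or √
- custom: custom characters, which must be set using the custom_token_chars setting.
custom_token_chars
Custom characters that should be treated as part of a token. For example, setting this to +-_ makes the tokenizer treat the plus, minus and underscore signs as part of a token.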
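
To make the interaction of these parameters concrete, here is a minimal pure-Python sketch that imitates the tokenizer's behavior. It is not Elasticsearch's implementation: the names edge_ngrams and _in_class are hypothetical, and the mapping from character classes to Unicode general categories is an assumption made for illustration.

```python
import unicodedata


def _in_class(ch, cls, custom_chars=""):
    """Approximate a token_chars class via Unicode categories (an assumption)."""
    cat = unicodedata.category(ch)
    if cls == "letter":
        return cat.startswith("L")
    if cls == "digit":
        return cat == "Nd"
    if cls == "whitespace":
        return ch.isspace()
    if cls == "punctuation":
        return cat.startswith("P")
    if cls == "symbol":
        return cat.startswith("S")
    if cls == "custom":
        return ch in custom_chars
    raise ValueError(f"unknown character class: {cls}")


def edge_ngrams(text, min_gram=1, max_gram=2, token_chars=(), custom_token_chars=""):
    # Step 1: split into words. An empty token_chars keeps every character,
    # so the whole text is treated as a single word.
    if not token_chars:
        words = [text] if text else []
    else:
        words, current = [], []
        for ch in text:
            if any(_in_class(ch, cls, custom_token_chars) for cls in token_chars):
                current.append(ch)
            elif current:
                words.append("".join(current))
                current = []
        if current:
            words.append("".join(current))

    # Step 2: emit the prefixes of each word, from min_gram to max_gram characters.
    grams = []
    for word in words:
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:n])
    return grams


print(edge_ngrams("Quick Fox"))  # ['Q', 'Qu']
print(edge_ngrams("a+b", 1, 3, ("letter", "custom"), "+-_"))  # ['a', 'a+', 'a+b']
```

With the defaults, the whole input is one token and only its first one and two characters are emitted. Words shorter than min_gram produce no grams at all in this sketch.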

Limitations of the max_gram parameter

The edge_ngram tokenizer's max_gram value limits the character length of tokens. When the edge_ngram tokenizer is used with an index analyzer, this means search terms longer than the max_gram length may not match any indexed terms.

For example, if max_gram is 3, a search for apple does not match the indexed term app.

To account for this, you can use the truncate token filter with a search analyzer to shorten search terms to the max_gram character length. However, this could return irrelevant results.

For example, if max_gram is 3 and search terms are truncated to three characters, the search term apple is shortened to app. This means a search for apple returns any indexed term matching app, such as apply, approximate and apple.
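
The trade-off can be reproduced with a small pure-Python simulation. This is only a sketch of the matching logic, not how Elasticsearch executes queries; the helper names edge_ngrams and search and the in-memory index are hypothetical.

```python
def edge_ngrams(word, min_gram, max_gram):
    """Return the set of prefixes of `word` between min_gram and max_gram characters."""
    return {word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)}


MAX_GRAM = 3
documents = ["apply", "approximate", "apple"]

# Index side: each document contributes only prefixes of up to MAX_GRAM characters.
index = {doc: edge_ngrams(doc, 1, MAX_GRAM) for doc in documents}


def search(term, truncate=False):
    """Return documents whose indexed grams contain the (optionally truncated) term."""
    if truncate:
        term = term[:MAX_GRAM]
    return [doc for doc, terms in index.items() if term in terms]


print(search("apple"))                 # []
print(search("apple", truncate=True))  # ['apply', 'approximate', 'apple']
```

Without truncation the five-character term never appears in the index, so nothing matches. With truncation everything that starts with app matches, including documents the user probably did not want.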

We recommend testing both approaches to see which best fits your use case and desired search experience.
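
For the truncation approach, the analysis settings could look roughly like the following Python dictionary, which would be passed as the settings argument of client.indices.create. The names prefix_tokenizer, truncate_to_max_gram, prefix_index and prefix_search are hypothetical, and the choice of the standard tokenizer for the search analyzer is an assumption; only the edge_ngram tokenizer type and the truncate filter with its length parameter come from Elasticsearch itself.

```python
MAX_GRAM = 3

# Sketch of index settings: index with edge n-grams, truncate search terms to MAX_GRAM.
settings = {
    "analysis": {
        "tokenizer": {
            "prefix_tokenizer": {  # hypothetical name
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": MAX_GRAM,
                "token_chars": ["letter"],
            }
        },
        "filter": {
            "truncate_to_max_gram": {  # hypothetical name
                "type": "truncate",
                "length": MAX_GRAM,
            }
        },
        "analyzer": {
            # Used at index time: emits prefixes of each word.
            "prefix_index": {
                "tokenizer": "prefix_tokenizer",
                "filter": ["lowercase"],
            },
            # Used at search time: whole words, cut down to MAX_GRAM characters.
            "prefix_search": {
                "tokenizer": "standard",
                "filter": ["lowercase", "truncate_to_max_gram"],
            },
        },
    }
}
```

A field mapping would then reference prefix_index as its analyzer and prefix_search as its search_analyzer. Keeping the truncate length tied to max_gram through a single constant avoids the two values drifting apart.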

Example configuration

In this example, we configure the edge_ngram tokenizer to treat letters and digits as tokens, and to produce grams with a minimum length of 2 and a maximum length of 10:

Python

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    text="2 Quick Foxes.",
)
print(resp1)

Ruby

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'edge_ngram',
            min_gram: 2,
            max_gram: 10,
            token_chars: [
              'letter',
              'digit'
            ]
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: '2 Quick Foxes.'
  }
)
puts response

Js

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "my_tokenizer",
        },
      },
      tokenizer: {
        my_tokenizer: {
          type: "edge_ngram",
          min_gram: 2,
          max_gram: 10,
          token_chars: ["letter", "digit"],
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_analyzer",
  text: "2 Quick Foxes.",
});
console.log(response1);

Console

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}

The above example produces the following terms:

Text

[ Qu, Qui, Quic, Quick, Fo, Fox, Foxe, Foxes ]

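Usually we recommend using the same analyzer at index time and at search time. In the case of the edge_ngram tokenizer, the advice is different. It only makes sense to use the edge_ngram tokenizer at index time, to ensure that partial words are available for matching in the index. At search time, just search for the terms the user has typed in, for instance: Quick Fo.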

Below is an example of how to set up a field for search-as-you-type.

Note that the max_gram value for the index analyzer is 10, which limits indexed terms to 10 characters. Search terms are not truncated, so search terms longer than 10 characters may not match any indexed terms.

Python

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "tokenizer": "autocomplete",
                    "filter": [
                        "lowercase"
                    ]
                },
                "autocomplete_search": {
                    "tokenizer": "lowercase"
                }
            },
            "tokenizer": {
                "autocomplete": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": [
                        "letter"
                    ]
                }
            }
        }
    },
    mappings={
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "autocomplete_search"
            }
        }
    },
)
print(resp)

resp1 = client.index(
    index="my-index-000001",
    id="1",
    document={
        "title": "Quick Foxes"
    },
)
print(resp1)

resp2 = client.indices.refresh(
    index="my-index-000001",
)
print(resp2)

resp3 = client.search(
    index="my-index-000001",
    query={
        "match": {
            "title": {
                "query": "Quick Fo",
                "operator": "and"
            }
        }
    },
)
print(resp3)

Ruby

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          autocomplete: {
            tokenizer: 'autocomplete',
            filter: [
              'lowercase'
            ]
          },
          autocomplete_search: {
            tokenizer: 'lowercase'
          }
        },
        tokenizer: {
          autocomplete: {
            type: 'edge_ngram',
            min_gram: 2,
            max_gram: 10,
            token_chars: [
              'letter'
            ]
          }
        }
      }
    },
    mappings: {
      properties: {
        title: {
          type: 'text',
          analyzer: 'autocomplete',
          search_analyzer: 'autocomplete_search'
        }
      }
    }
  }
)
puts response

response = client.index(
  index: 'my-index-000001',
  id: 1,
  body: {
    title: 'Quick Foxes'
  }
)
puts response

response = client.indices.refresh(
  index: 'my-index-000001'
)
puts response

response = client.search(
  index: 'my-index-000001',
  body: {
    query: {
      match: {
        title: {
          query: 'Quick Fo',
          operator: 'and'
        }
      }
    }
  }
)
puts response

Js

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        autocomplete: {
          tokenizer: "autocomplete",
          filter: ["lowercase"],
        },
        autocomplete_search: {
          tokenizer: "lowercase",
        },
      },
      tokenizer: {
        autocomplete: {
          type: "edge_ngram",
          min_gram: 2,
          max_gram: 10,
          token_chars: ["letter"],
        },
      },
    },
  },
  mappings: {
    properties: {
      title: {
        type: "text",
        analyzer: "autocomplete",
        search_analyzer: "autocomplete_search",
      },
    },
  },
});
console.log(response);

const response1 = await client.index({
  index: "my-index-000001",
  id: 1,
  document: {
    title: "Quick Foxes",
  },
});
console.log(response1);

const response2 = await client.indices.refresh({
  index: "my-index-000001",
});
console.log(response2);

const response3 = await client.search({
  index: "my-index-000001",
  query: {
    match: {
      title: {
        query: "Quick Fo",
        operator: "and",
      },
    },
  },
});
console.log(response3);

Console

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "title": "Quick Foxes"
}

POST my-index-000001/_refresh

GET my-index-000001/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Fo",
        "operator": "and"
      }
    }
  }
}


The `autocomplete` analyzer indexes the terms `[qu, qui, quic, quick, fo, fox, foxe, foxes]`.
The `autocomplete_search` analyzer searches for the terms `[quick, fo]`, both of which appear in the index.
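
The asymmetry between the two analyzers can be checked offline with a short pure-Python sketch. It only imitates the analysis chain and the and operator of the match query; the function names are hypothetical, and str.isalpha is used as an approximation of the letter character class.

```python
def letter_words(text):
    """Split text on every non-letter character (approximates token_chars: letter)."""
    return "".join(ch if ch.isalpha() else " " for ch in text).split()


def autocomplete(text, min_gram=2, max_gram=10):
    """Index-side analysis: edge n-grams of each word, then lowercase."""
    terms = []
    for word in letter_words(text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            terms.append(word[:n].lower())
    return terms


def autocomplete_search(text):
    """Search-side analysis: whole lowercased words, no n-grams."""
    return [word.lower() for word in letter_words(text)]


def matches(title, query):
    """Imitate a match query with operator 'and': every search term must be indexed."""
    indexed = set(autocomplete(title))
    return all(term in indexed for term in autocomplete_search(query))


print(autocomplete("Quick Foxes"))
print(autocomplete_search("Quick Fo"))
print(matches("Quick Foxes", "Quick Fo"))                  # True
print(matches("Approximately right", "approximately"))     # False
```

The last call shows the max_gram limitation in action: the 13-character search term is never indexed in full, because indexed grams stop at 10 characters.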