Token filter reference - Shingle

Shingle token filter
Example
Add to an analyzer
Configurable parameters
Customize

Shingle token filter

Adds shingles, or word [n-grams](https://en.wikipedia.org/wiki/N-gram), to a token stream by concatenating adjacent tokens. By default, the `shingle` token filter outputs two-word shingles and unigrams.

For example, many tokenizers convert `the lazy dog` to `[ the, lazy, dog ]`. You can use the `shingle` filter to add two-word shingles to this stream: `[ the, the lazy, lazy, lazy dog, dog ]`.
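To make the idea concrete, here is a minimal plain-Python sketch of two-word shingling. The helper name `two_word_shingles` is hypothetical; it only imitates the default output described above and is not how Lucene implements the filter.

```python
def two_word_shingles(tokens):
    """Return each unigram followed by the two-word shingle that starts at it."""
    out = []
    for i, token in enumerate(tokens):
        out.append(token)  # the original unigram
        if i + 1 < len(tokens):
            # shingle made of this token and the next one, joined by a space
            out.append(f"{token} {tokens[i + 1]}")
    return out


print(two_word_shingles(["the", "lazy", "dog"]))
# ['the', 'the lazy', 'lazy', 'lazy dog', 'dog']
```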

Shingles are often used to help speed up phrase queries, such as `match_phrase`. Rather than creating shingles with the `shingle` filter, we recommend using the `index_phrases` mapping parameter on the appropriate `text` field.

This filter uses Lucene's ShingleFilter.
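As a rough illustration of that recommendation, a mapping that enables `index_phrases` on a `text` field might look like the sketch below. The field name `body` is a hypothetical choice, and the commented-out client call assumes an 8.x-style Python client.

```python
import json

# Hypothetical field name "body"; index_phrases applies to text fields.
mappings = {
    "properties": {
        "body": {
            "type": "text",
            "index_phrases": True,
        }
    }
}

# Print the JSON form that would be sent as the "mappings" section.
print(json.dumps(mappings, indent=2))

# With the Python client this could be passed along the lines of:
# client.indices.create(index="my-index-000001", mappings=mappings)
```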

Example

The following analyze API request uses the `shingle` filter to add two-word shingles to the token stream for `quick brown fox jumps`:

Python

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        "shingle"
    ],
    text="quick brown fox jumps",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      'shingle'
    ],
    text: 'quick brown fox jumps'
  }
)
puts response

Js

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: ["shingle"],
  text: "quick brown fox jumps",
});
console.log(response);

Console

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [ "shingle" ],
  "text": "quick brown fox jumps"
}

The filter produces the following tokens:

Text

[ quick, quick brown, brown, brown fox, fox, fox jumps, jumps ]
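The analyze API returns more than the token text, so a small helper can reduce a response to a list like the one above. This is a sketch: `token_texts` is a hypothetical helper, and the sample dictionary is hand-written in the shape the response is assumed to have (a `tokens` list whose entries carry a `token` string), with other fields omitted.

```python
def token_texts(analyze_response):
    """Collect the "token" string of each entry in an analyze API response."""
    return [entry["token"] for entry in analyze_response["tokens"]]


# Hand-written, abbreviated sample; a real response has more fields per token.
sample = {
    "tokens": [
        {"token": "quick", "position": 0},
        {"token": "quick brown", "position": 0},
        {"token": "brown", "position": 1},
    ]
}

print(token_texts(sample))
# ['quick', 'quick brown', 'brown']
```

With the Python client, the same helper could be applied to the `resp` object from the request above, since it supports key access.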

To produce shingles of 2-3 words, add the following arguments to the analyze API request:

min_shingle_size: 2
max_shingle_size: 3

Python

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
        }
    ],
    text="quick brown fox jumps",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'shingle',
        min_shingle_size: 2,
        max_shingle_size: 3
      }
    ],
    text: 'quick brown fox jumps'
  }
)
puts response

Js

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    {
      type: "shingle",
      min_shingle_size: 2,
      max_shingle_size: 3,
    },
  ],
  text: "quick brown fox jumps",
});
console.log(response);

Console

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 3
    }
  ],
  "text": "quick brown fox jumps"
}

The filter produces the following tokens:

Text

[ quick, quick brown, quick brown fox, brown, brown fox, brown fox jumps, fox, fox jumps, jumps ]

To include only shingles in the output, add an `output_unigrams` argument of `false` to the request.

Python

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3,
            "output_unigrams": False
        }
    ],
    text="quick brown fox jumps",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'shingle',
        min_shingle_size: 2,
        max_shingle_size: 3,
        output_unigrams: false
      }
    ],
    text: 'quick brown fox jumps'
  }
)
puts response

Js

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    {
      type: "shingle",
      min_shingle_size: 2,
      max_shingle_size: 3,
      output_unigrams: false,
    },
  ],
  text: "quick brown fox jumps",
});
console.log(response);

Console

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 3,
      "output_unigrams": false
    }
  ],
  "text": "quick brown fox jumps"
}

The filter produces the following tokens:

Text

[ quick brown, quick brown fox, brown fox, brown fox jumps, fox jumps ]
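The interplay of `min_shingle_size`, `max_shingle_size`, and `output_unigrams` in the examples above can be imitated in plain Python. The sketch below is only an approximation for contiguous tokens: the function `shingles` and its keyword names are hypothetical, and it does not model position gaps or the other parameters of the real filter.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True, separator=" "):
    """Emit, for each start position, the optional unigram and then every
    shingle of min_size..max_size tokens that fits in the stream."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(separator.join(tokens[i:i + size]))
    return out


words = "quick brown fox jumps".split()

print(shingles(words, min_size=2, max_size=3))
# ['quick', 'quick brown', 'quick brown fox', 'brown', 'brown fox',
#  'brown fox jumps', 'fox', 'fox jumps', 'jumps']

print(shingles(words, min_size=2, max_size=3, output_unigrams=False))
# ['quick brown', 'quick brown fox', 'brown fox', 'brown fox jumps', 'fox jumps']
```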

Add to an analyzer

The following create index API request uses the `shingle` filter to configure a new custom analyzer.

Python

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "standard_shingle": {
                    "tokenizer": "standard",
                    "filter": [
                        "shingle"
                    ]
                }
            }
        }
    },
)
print(resp)

Ruby

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          standard_shingle: {
            tokenizer: 'standard',
            filter: [
              'shingle'
            ]
          }
        }
      }
    }
  }
)
puts response

Js

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        standard_shingle: {
          tokenizer: "standard",
          filter: ["shingle"],
        },
      },
    },
  },
});
console.log(response);

Console

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_shingle": {
          "tokenizer": "standard",
          "filter": [ "shingle" ]
        }
      }
    }
  }
}
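Once the index exists, the new analyzer can be tried out through the analyze API scoped to that index. The following is a sketch under assumptions: `analyze_with_standard_shingle` is a hypothetical helper, `client` is taken to be an already connected Python client as in the examples above, and the response is assumed to expose a `tokens` list with a `token` string per entry.

```python
def analyze_with_standard_shingle(client, text):
    """Run text through the standard_shingle analyzer of my-index-000001
    and return only the token strings."""
    resp = client.indices.analyze(
        index="my-index-000001",
        analyzer="standard_shingle",
        text=text,
    )
    return [entry["token"] for entry in resp["tokens"]]
```

A call such as `analyze_with_standard_shingle(client, "quick brown fox jumps")` requires a running cluster and the index created by the request above.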

Configurable parameters

max_shingle_size
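(Optional, integer) Maximum number of tokens to concatenate when creating shingles. Defaults to 2.
This value cannot be lower than the min_shingle_size argument, which defaults to 2. The difference between this value and the min_shingle_size argument cannot exceed the index.max_shingle_diff index-level setting, which defaults to 3.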
min_shingle_size
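(Optional, integer) Minimum number of tokens to concatenate when creating shingles. Defaults to 2.
This value cannot exceed the max_shingle_size argument, which defaults to 2. The difference between the max_shingle_size argument and this value cannot exceed the index.max_shingle_diff index-level setting, which defaults to 3.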
output_unigrams
(Optional, Boolean) If true, the output includes the original input tokens. If false, the output only includes shingles; the original input tokens are removed. Defaults to true.
output_unigrams_if_no_shingles
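(Optional, Boolean) If true, the output includes the original input tokens only if no shingles are produced; if shingles are produced, the output only includes shingles. Defaults to false.
If both this parameter and the output_unigrams parameter are true, only the output_unigrams argument is used.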
token_separator
(Optional, string) Separator used to concatenate adjacent tokens to form a shingle. Defaults to a space (" ").
filler_token
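(Optional, string) String used in shingles as a replacement for empty positions that do not contain a token. This filler token is only used in shingles, not in the original unigrams. Defaults to an underscore (_).
Some token filters, such as the stop filter, create empty positions when removing stop words with a position increment greater than one.
Example
In the following analyze API request, the stop filter removes the stop word a from fox jumps a lazy dog, creating an empty position. The subsequent shingle filter replaces this empty position with a plus sign (+) in shingles.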

Python

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "stop",
            "stopwords": [
                "a"
            ]
        },
        {
            "type": "shingle",
            "filler_token": "+"
        }
    ],
    text="fox jumps a lazy dog",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'stop',
        stopwords: [
          'a'
        ]
      },
      {
        type: 'shingle',
        filler_token: '+'
      }
    ],
    text: 'fox jumps a lazy dog'
  }
)
puts response

Js

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    {
      type: "stop",
      stopwords: ["a"],
    },
    {
      type: "shingle",
      filler_token: "+",
    },
  ],
  text: "fox jumps a lazy dog",
});
console.log(response);

Console

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "stop",
      "stopwords": [ "a" ]
    },
    {
      "type": "shingle",
      "filler_token": "+"
    }
  ],
  "text": "fox jumps a lazy dog"
}

The filter produces the following tokens:

Text

[ fox, fox jumps, jumps, jumps +, + lazy, lazy, lazy dog, dog ]
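The way a filler token stands in for an empty position can be imitated with a short Python sketch. It assumes the stream is represented as a list of positions in which `None` marks the gap left by a removed stop word; `shingles_with_gaps` is a hypothetical helper that covers only two-word shingles and is not the Lucene implementation.

```python
def shingles_with_gaps(positions, filler="_", separator=" "):
    """Two-word shingles over a list of positions; None marks an empty position."""
    out = []
    for i, token in enumerate(positions):
        if token is not None:
            out.append(token)  # unigrams never contain the filler
        if i + 1 < len(positions):
            pair = positions[i:i + 2]
            # skip windows made up entirely of empty positions
            if any(t is not None for t in pair):
                out.append(separator.join(filler if t is None else t for t in pair))
    return out


# "a" was removed by the stop filter, leaving an empty third position.
print(shingles_with_gaps(["fox", "jumps", None, "lazy", "dog"], filler="+"))
# ['fox', 'fox jumps', 'jumps', 'jumps +', '+ lazy', 'lazy', 'lazy dog', 'dog']
```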

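Customize

To customize the `shingle` filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses a custom `shingle` filter, `my_shingle_filter`, to configure a new custom analyzer.

The `my_shingle_filter` filter uses a `min_shingle_size` of 2 and a `max_shingle_size` of 5, meaning it produces shingles of 2-5 words. The filter also includes an `output_unigrams` argument of `false`, meaning that only shingles are included in the output.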

Python

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "en": {
                    "tokenizer": "standard",
                    "filter": [
                        "my_shingle_filter"
                    ]
                }
            },
            "filter": {
                "my_shingle_filter": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 5,
                    "output_unigrams": False
                }
            }
        }
    },
)
print(resp)

Ruby

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          en: {
            tokenizer: 'standard',
            filter: [
              'my_shingle_filter'
            ]
          }
        },
        filter: {
          my_shingle_filter: {
            type: 'shingle',
            min_shingle_size: 2,
            max_shingle_size: 5,
            output_unigrams: false
          }
        }
      }
    }
  }
)
puts response

Js

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        en: {
          tokenizer: "standard",
          filter: ["my_shingle_filter"],
        },
      },
      filter: {
        my_shingle_filter: {
          type: "shingle",
          min_shingle_size: 2,
          max_shingle_size: 5,
          output_unigrams: false,
        },
      },
    },
  },
});
console.log(response);

Console

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en": {
          "tokenizer": "standard",
          "filter": [ "my_shingle_filter" ]
        }
      },
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": false
        }
      }
    }
  }
}
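The sizes chosen for `my_shingle_filter` (2 and 5) differ by 3, which matches the default `index.max_shingle_diff` of 3 mentioned under the configurable parameters. A client-side pre-check of that documented rule could be sketched as follows. `check_shingle_sizes` is a hypothetical helper; the cluster performs its own validation, and this sketch only mirrors the constraints stated above.

```python
def check_shingle_sizes(min_size=2, max_size=2, max_shingle_diff=3):
    """Raise ValueError if the sizes break the documented constraints;
    otherwise return the difference between them."""
    if min_size > max_size:
        raise ValueError("min_shingle_size cannot exceed max_shingle_size")
    diff = max_size - min_size
    if diff > max_shingle_diff:
        raise ValueError(
            f"size difference {diff} exceeds index.max_shingle_diff ({max_shingle_diff})"
        )
    return diff


print(check_shingle_sizes(min_size=2, max_size=5))
# 3
```

Under these assumptions, a `max_shingle_size` of 6 with a `min_shingle_size` of 2 would be rejected unless `index.max_shingle_diff` is raised for the index.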