Built-in analyzer reference - Standard


Standard analyzer

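The standard analyzer is the default analyzer, used when no analyzer is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.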

Example output

Python

resp = client.indices.analyze(
    analyzer="standard",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response

Js

const response = await client.indices.analyze({
  analyzer: "standard",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response);

Console

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The sentence above produces the following terms:

Text

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
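
The raw response from the analyze API carries more than the terms listed above: each entry under `tokens` also records character offsets, a token type, and a position. The following is a minimal sketch of pulling out just the terms. The helper name `extract_terms` is hypothetical, and `sample_response` is a hand-written, truncated stand-in for a real response, not captured output.

```python
def extract_terms(analyze_response):
    """Return only the term strings from an _analyze-style response body."""
    return [entry["token"] for entry in analyze_response["tokens"]]


# Hand-written, truncated sample shaped like an _analyze response.
# Assumption: a real response for the example sentence lists all eleven tokens.
sample_response = {
    "tokens": [
        {"token": "the", "start_offset": 0, "end_offset": 3,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "2", "start_offset": 4, "end_offset": 5,
         "type": "<NUM>", "position": 1},
        {"token": "quick", "start_offset": 6, "end_offset": 11,
         "type": "<ALPHANUM>", "position": 2},
    ]
}

print(extract_terms(sample_response))
```

With the Python client, passing `resp` from the call above to the same helper should work, assuming the response object supports key access in the client version you use.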

Configuration

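The standard analyzer accepts the following parameters:

`max_token_length`	The maximum token length. If a token exceeds this length, it is split at `max_token_length` intervals. Defaults to `255`.
`stopwords`	A predefined stop word list such as `_english_`, or an array containing a list of stop words. Defaults to `_none_`.
`stopwords_path`	The path to a file containing stop words.

See the Stop Token Filter for more information about stop word configuration.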

Example configuration

In this example, the standard analyzer is configured with a `max_token_length` of 5 (for demonstration purposes) and the predefined list of English stop words:

Python

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_english_analyzer": {
                    "type": "standard",
                    "max_token_length": 5,
                    "stopwords": "_english_"
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_english_analyzer",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp1)

Ruby

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_english_analyzer: {
            type: 'standard',
            max_token_length: 5,
            stopwords: '_english_'
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_english_analyzer',
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response

Js

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_english_analyzer: {
          type: "standard",
          max_token_length: 5,
          stopwords: "_english_",
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_english_analyzer",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response1);

Console

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The example above produces the following terms:

Text

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
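
To see why `jumped` becomes `jumpe` and `d`, and why both occurrences of `the` disappear, the rough Python sketch below imitates the three stages for this one sentence: tokenize, split at `max_token_length`, then lowercase and drop stop words. It is an approximation, not the analyzer's implementation: the regular expression only stands in for UAX #29 segmentation on simple English text, `STOP_WORDS` is an illustrative subset rather than the full `_english_` list, and the function name `approximate_analyze` is hypothetical.

```python
import re

# Illustrative subset of English stop words (assumption: the real `_english_`
# list is longer; only "the" matters for this sentence).
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})

# Rough stand-in for UAX #29 word segmentation: runs of word characters,
# optionally joined by an apostrophe so that "dog's" stays in one piece.
WORD_RE = re.compile(r"\w+(?:'\w+)*")


def approximate_analyze(text, max_token_length=255, stop_words=frozenset()):
    terms = []
    for word in WORD_RE.findall(text):
        # Split over-long tokens at max_token_length intervals.
        for start in range(0, len(word), max_token_length):
            term = word[start:start + max_token_length].lower()
            # Drop stop words after lowercasing.
            if term not in stop_words:
                terms.append(term)
    return terms


text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(approximate_analyze(text, max_token_length=5, stop_words=STOP_WORDS))
# ['2', 'quick', 'brown', 'foxes', 'jumpe', 'd', 'over', 'lazy', "dog's", 'bone']
```

Calling `approximate_analyze(text)` with the defaults leaves `the` and `jumped` intact, which matches the unconfigured output shown earlier.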

Definition

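The standard analyzer consists of:

Tokenizer
- Standard Tokenizer
Token filters
- Lower Case Token Filter
- Stop Token Filter (disabled by default)

If you need to customize the standard analyzer beyond its configuration parameters, recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in standard analyzer, and you can use it as a starting point: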

Python

resp = client.indices.create(
    index="standard_example",
    settings={
        "analysis": {
            "analyzer": {
                "rebuilt_standard": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
)
print(resp)

Ruby

response = client.indices.create(
  index: 'standard_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          rebuilt_standard: {
            tokenizer: 'standard',
            filter: [
              'lowercase'
            ]
          }
        }
      }
    }
  }
)
puts response

Js

const response = await client.indices.create({
  index: "standard_example",
  settings: {
    analysis: {
      analyzer: {
        rebuilt_standard: {
          tokenizer: "standard",
          filter: ["lowercase"],
        },
      },
    },
  },
});
console.log(response);

Console

PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}


In the `filter` array, add any further token filters after `lowercase`.
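
As an illustration of adding filters after `lowercase`, the sketch below builds the settings body in Python. The helper name `rebuilt_standard_settings` is hypothetical, and `asciifolding` is used only as an example of a token filter you might append; which filters you add depends on your use case.

```python
def rebuilt_standard_settings(extra_filters=()):
    """Build index settings for a custom analyzer that mirrors `standard`.

    `extra_filters` are appended after `lowercase`, in the order given.
    """
    return {
        "analysis": {
            "analyzer": {
                "rebuilt_standard": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", *extra_filters],
                }
            }
        }
    }


# Example: fold accented characters to ASCII after lowercasing.
settings = rebuilt_standard_settings(["asciifolding"])
print(settings["analysis"]["analyzer"]["rebuilt_standard"]["filter"])
# ['lowercase', 'asciifolding']
```

The resulting dictionary could be passed as the `settings` argument of `client.indices.create`, in the same way as the Python example above.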