Tokenizer reference - Classic

Classic tokenizer

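The classic tokenizer is a grammar-based tokenizer suited to English-language documents. It has heuristics for special treatment of acronyms, company names, email addresses, and internet host names. However, these rules do not always work, and the tokenizer does not work well for most languages other than English:

- It splits words at most punctuation characters and removes the punctuation. However, a dot that is not followed by whitespace is considered part of a token.
- It splits words at hyphens, unless the token contains a number, in which case the whole token is interpreted as a product number and is not split.
- It recognizes email addresses and internet host names as one token.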
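
The three rules above can be illustrated with a rough, pure-Python approximation. This is only a sketch for building intuition: the function name `approx_classic_tokenize`, the `EMAIL` pattern, and the choice to keep apostrophes inside tokens (inferred from the `dog's` token in the example output) are assumptions made here, and the real tokenizer uses its own grammar, so its behavior may differ on other inputs.

```python
import re

# Hypothetical, simplified approximation of the listed rules.
# It is NOT the grammar used by the real classic tokenizer.
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def approx_classic_tokenize(text):
    tokens = []
    for chunk in text.split():
        # Drop punctuation at the edges, e.g. the final "." in "bone."
        # (a dot followed by whitespace is not part of the token).
        chunk = chunk.strip(".,;:!?\"()[]{}")
        if not chunk:
            continue
        # Rule 3: an email address stays one token.
        if EMAIL.match(chunk):
            tokens.append(chunk)
            continue
        # Rule 2: a hyphenated chunk containing a digit is kept whole,
        # as if it were a product number.
        if "-" in chunk and any(c.isdigit() for c in chunk):
            tokens.append(chunk)
            continue
        # Rule 1: split on remaining punctuation (including hyphens), but
        # keep inner dots (host names) and apostrophes (assumption).
        tokens.extend(p for p in re.split(r"[^\w.']+", chunk) if p)
    return tokens


print(approx_classic_tokenize("Mail john.doe@example.com about XL-5000 at www.example.com."))
# ['Mail', 'john.doe@example.com', 'about', 'XL-5000', 'at', 'www.example.com']
```

To see what the real tokenizer does with a given input, send that input to the `_analyze` API instead of relying on this sketch.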

Example output

Python

resp = client.indices.analyze(
    tokenizer="classic",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    tokenizer: 'classic',
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response

Js

const response = await client.indices.analyze({
  tokenizer: "classic",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response);

Console

POST _analyze
{
  "tokenizer": "classic",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence produces the following terms:

Text

[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
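
The term list above is a condensed view. The `_analyze` API returns a JSON object whose `tokens` array holds one entry per token, with the term under the `token` key. A minimal sketch for pulling out just the terms follows; the helper name `terms_from_analyze_response` and the hand-written `sample` response are illustrative assumptions, not output captured from a cluster.

```python
def terms_from_analyze_response(resp):
    """Return only the token strings from an _analyze response body."""
    return [entry["token"] for entry in resp["tokens"]]


# Hand-written, trimmed stand-in for a real response. Real entries also
# carry fields such as offsets, type, and position, omitted here.
sample = {"tokens": [{"token": "The"}, {"token": "2"}, {"token": "QUICK"}]}

print(terms_from_analyze_response(sample))  # ['The', '2', 'QUICK']
```

The same helper should work on the `resp` object returned by the Python example above, since that object supports dictionary-style access.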

Configuration

The classic tokenizer accepts the following parameters:

`max_token_length`	The maximum token length. If a token exceeds this length, it is split at `max_token_length` intervals. Defaults to `255`.
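
The splitting rule for over-long tokens can be sketched in a few lines of Python. This only illustrates the "split at `max_token_length` intervals" behavior described above; the function name `split_long_token` is a hypothetical helper, not part of any client library.

```python
def split_long_token(token, max_token_length=255):
    """Cut a token into consecutive pieces of at most max_token_length characters."""
    if max_token_length < 1:
        raise ValueError("max_token_length must be at least 1")
    return [
        token[i:i + max_token_length]
        for i in range(0, len(token), max_token_length)
    ]


print(split_long_token("jumped", 5))  # ['jumpe', 'd']
print(split_long_token("QUICK", 5))   # ['QUICK']
```

With a limit of 5, only tokens longer than five characters are affected, so `jumped` becomes `jumpe` and `d` while five-character tokens such as `QUICK` pass through unchanged.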

Example configuration

In this example, we configure the classic tokenizer to have a `max_token_length` of 5 (for demonstration purposes):

Python

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "classic",
                    "max_token_length": 5
                }
            }
        }
    },
)
print(resp)
resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp1)

Ruby

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'classic',
            max_token_length: 5
          }
        }
      }
    }
  }
)
puts response
response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response

Js

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "my_tokenizer",
        },
      },
      tokenizer: {
        my_tokenizer: {
          type: "classic",
          max_token_length: 5,
        },
      },
    },
  },
});
console.log(response);
const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_analyzer",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response1);

Console

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "classic",
          "max_token_length": 5
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above example produces the following terms:

Text

[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]