Tokenizer reference - Character group

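Character group tokenizer

The `char_group` tokenizer breaks text into terms whenever it encounters a character that belongs to a defined set. It is mostly useful when simple custom tokenization is desired and the overhead of the `pattern` tokenizer is not acceptable.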
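The splitting rule described above can be sketched in plain Python. This is a minimal illustration, not Elasticsearch's implementation: the `char_group_split` function and the `GROUPS` table are hypothetical names, and the predicates used for the named character groups only approximate the Unicode classes the server may apply.

```python
import unicodedata

# Predicates approximating the named character groups.
# Assumption: these mappings are illustrative, not Elasticsearch's own definitions.
GROUPS = {
    "whitespace": str.isspace,
    "letter": str.isalpha,
    "digit": str.isdigit,
    "punctuation": lambda c: unicodedata.category(c).startswith("P"),
    "symbol": lambda c: unicodedata.category(c).startswith("S"),
}


def char_group_split(text, tokenize_on_chars):
    """Split text wherever a character matches the configured set."""
    predicates = [GROUPS[item] for item in tokenize_on_chars if item in GROUPS]
    literals = {item for item in tokenize_on_chars if item not in GROUPS}

    tokens, current = [], []
    for ch in text:
        if ch in literals or any(p(ch) for p in predicates):
            # A split character ends the current token; empty tokens are dropped.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(char_group_split("The QUICK brown-fox", ["whitespace", "-", "\n"]))
# ['The', 'QUICK', 'brown', 'fox']
```

Because each character is checked against a small set of predicates in a single pass, no regular expression engine is involved, which is the kind of overhead the comparison with the `pattern` tokenizer refers to.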

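Configuration

The `char_group` tokenizer accepts the following parameters:

`tokenize_on_chars`	A list of characters on which to tokenize the string. Whenever a character from this list is encountered, a new token is started. Each entry is either a single character, such as `-`, or a character group: `whitespace`, `letter`, `digit`, `punctuation`, `symbol`.

`max_token_length`	The maximum token length. If a token exceeds this length, it is split at `max_token_length` intervals. Defaults to `255`.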
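To make the `max_token_length` rule concrete, here is a small sketch of splitting an over-long token at fixed intervals. The helper name `split_at_max_length` is hypothetical, and the code only illustrates the behavior described above rather than the server's actual code path.

```python
def split_at_max_length(token, max_token_length=255):
    """Cut a token into consecutive pieces of at most max_token_length characters."""
    if max_token_length < 1:
        raise ValueError("max_token_length must be positive")
    return [
        token[i:i + max_token_length]
        for i in range(0, len(token), max_token_length)
    ]


print(split_at_max_length("abcdefgh", max_token_length=3))
# ['abc', 'def', 'gh']
```

With the default of 255, a 600-character run with no split characters would come out as pieces of 255, 255, and 90 characters.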

Example output

Python

# Assumes `client` is an already configured Elasticsearch Python client.
resp = client.indices.analyze(
    tokenizer={
        "type": "char_group",
        "tokenize_on_chars": [
            "whitespace",
            "-",
            "\n"
        ]
    },
    text="The QUICK brown-fox",
)
print(resp)

Ruby

# Assumes `client` is an already configured Elasticsearch Ruby client.
response = client.indices.analyze(
  body: {
    tokenizer: {
      type: 'char_group',
      tokenize_on_chars: [
        'whitespace',
        '-',
        "\n"
      ]
    },
    text: 'The QUICK brown-fox'
  }
)
puts response

JavaScript

// Assumes `client` is an already configured Elasticsearch JavaScript client.
const response = await client.indices.analyze({
  tokenizer: {
    type: "char_group",
    tokenize_on_chars: ["whitespace", "-", "\n"],
  },
  text: "The QUICK brown-fox",
});
console.log(response);

Console

POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}

The request returns:

Console result

{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
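The offsets and positions in this response can be reproduced offline with a short sketch. The `analyze_sketch` function is a hypothetical helper written for illustration; it assumes offsets are character indexes into the input text and that positions simply count emitted tokens, which matches the response shown here.

```python
def analyze_sketch(text, is_split_char):
    """Return token dicts shaped like the entries of the _analyze response."""
    tokens = []
    start = None  # index where the current token began, or None between tokens
    for i in range(len(text) + 1):
        at_boundary = i == len(text) or is_split_char(text[i])
        if at_boundary and start is not None:
            tokens.append({
                "token": text[start:i],
                "start_offset": start,
                "end_offset": i,  # exclusive end, as in the response
                "type": "word",
                "position": len(tokens),
            })
            start = None
        elif not at_boundary and start is None:
            start = i
    return tokens


for tok in analyze_sketch("The QUICK brown-fox", lambda c: c.isspace() or c == "-"):
    print(tok)
```

Note that the split characters themselves never appear in a token: the space at offset 3 and the hyphen at offset 15 fall in the gaps between one token's `end_offset` and the next token's `start_offset`.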