Built-in analyzer reference - Pattern


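Pattern analyzer

The pattern analyzer uses a regular expression to split text into terms. The regular expression should match the token separators, not the tokens themselves. It defaults to \W+ (that is, all non-word characters).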
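To make the "match the separators, not the tokens" rule concrete, here is a minimal local sketch that approximates the behaviour with Python's `re` module. The helper name `pattern_analyze` is hypothetical, and Python's regex dialect is not identical to Java's, so treat this as an illustration rather than a reimplementation of the analyzer.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Approximate the pattern analyzer: split on separator matches."""
    # re.split cuts the text wherever the separator pattern matches;
    # empty strings (e.g. after a trailing separator) are dropped.
    tokens = [t for t in re.split(pattern, text) if t]
    return [t.lower() for t in tokens] if lowercase else tokens


tokens = pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(tokens)
```

Because the pattern describes what sits between tokens, the hyphen and the apostrophe each act as a split point, which is why `Brown-Foxes` and `dog's` both come out as two terms.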

Example output

Python

resp = client.indices.analyze(
   analyzer="pattern",
   text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
   analyzer: 'pattern',
   text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response

Js

const response = await client.indices.analyze({
  analyzer: "pattern",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response);

Console

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence produces the following terms:

Text

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Configuration

The pattern analyzer accepts the following parameters:


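`pattern`	A Java regular expression. Defaults to `\W+`.
`flags`	Java regular expression flags. Separate multiple flags with a pipe, for example `"CASE_INSENSITIVE|COMMENTS"`.
`lowercase`	Whether terms should be lowercased. Defaults to `true`.
`stopwords`	A predefined stop word list such as `_english_`, or an array containing a list of stop words. Defaults to `_none_`.
`stopwords_path`	The path to a file containing stop words.

See the Stop Token Filter for more information about stop word configuration.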
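The sketch below illustrates how the `flags`, `lowercase`, and `stopwords` parameters interact, again using Python's `re` locally. The `FLAG_MAP` table is an assumption that maps a few Java flag names to their closest Python counterparts, and `pattern_analyze` is a hypothetical helper, not part of any client library.

```python
import re

# Assumed mapping from Java flag names to the nearest Python equivalents.
FLAG_MAP = {
    "CASE_INSENSITIVE": re.IGNORECASE,
    "COMMENTS": re.VERBOSE,
    "MULTILINE": re.MULTILINE,
    "DOTALL": re.DOTALL,
}


def parse_flags(flags):
    """Turn a pipe-separated flag string into a combined re flag value."""
    value = 0
    for name in filter(None, flags.split("|")):
        value |= FLAG_MAP[name]
    return value


def pattern_analyze(text, pattern=r"\W+", flags="", lowercase=True, stopwords=()):
    """Split on the pattern, optionally lowercase, then drop stop words."""
    regex = re.compile(pattern, parse_flags(flags))
    tokens = [t for t in regex.split(text) if t]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    stop = set(stopwords)
    return [t for t in tokens if t not in stop]


# With CASE_INSENSITIVE, the separator "x" also matches "X".
tokens = pattern_analyze("10x20X30", pattern="x", flags="CASE_INSENSITIVE")
print(tokens)

# Stop words are removed after lowercasing, so "The" is filtered by "the".
filtered = pattern_analyze("The quick fox", stopwords=["the"])
print(filtered)
```

The ordering in this sketch (lowercase first, stop words second) mirrors the filter order listed for the analyzer, which is why a lowercase stop list is enough to catch capitalized input.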

Example configuration

In this example, we configure the pattern analyzer to split email addresses on non-word characters or on underscores (\W|_), and to lowercase the result:

Python

resp = client.indices.create(
   index="my-index-000001",
   settings={
   "analysis": {
   "analyzer": {
   "my_email_analyzer": {
   "type": "pattern",
   "pattern": "\\W|_",
   "lowercase": True
   }
   }
   }
   },
)
print(resp)
resp1 = client.indices.analyze(
   index="my-index-000001",
   analyzer="my_email_analyzer",
   text="John_Smith@foo-bar.com",
)
print(resp1)

Ruby

response = client.indices.create(
  index: 'my-index-000001',
  body: {
   settings: {
   analysis: {
   analyzer: {
   my_email_analyzer: {
   type: 'pattern',
   pattern: '\\W|_',
   lowercase: true
   }
   }
   }
   }
  }
)
puts response
response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
   analyzer: 'my_email_analyzer',
   text: 'John_Smith@foo-bar.com'
  }
)
puts response

Js

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
   analysis: {
   analyzer: {
   my_email_analyzer: {
   type: "pattern",
   pattern: "\\W|_",
   lowercase: true,
   },
   },
   },
  },
});
console.log(response);
const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_email_analyzer",
  text: "John_Smith@foo-bar.com",
});
console.log(response1);

Console

PUT my-index-000001
{
  "settings": {
   "analysis": {
   "analyzer": {
   "my_email_analyzer": {
   "type":      "pattern",
   "pattern":   "\\W|_",
   "lowercase": true
   }
   }
   }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}

Backslashes in the pattern must be escaped when the pattern is specified as a JSON string.

The above example produces the following terms:

Text

[ john, smith, foo, bar, com ]
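The escaping rule can be checked locally. The following sketch serializes the pattern with Python's `json` module to show the doubled backslash, then splits a sample address with `re` as a rough stand-in for the analyzer. The address `John_Smith@foo-bar.com` is an assumed sample input chosen to be consistent with the terms listed above.

```python
import json
import re

# The regex itself contains a single backslash.
pattern = r"\W|_"

# Serialized as JSON, that backslash is escaped and appears doubled.
body = json.dumps({"pattern": pattern})
print(body)

# Rough local equivalent of the analyzer: split, drop empties, lowercase.
sample = "John_Smith@foo-bar.com"  # assumed sample address
terms = [t.lower() for t in re.split(pattern, sample) if t]
print(terms)
```

Building the request body from a native structure and letting the JSON library do the escaping avoids hand-counting backslashes.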

CamelCase tokenizer

The following more complicated example splits CamelCase text into tokens:

Python

resp = client.indices.create(
   index="my-index-000001",
   settings={
   "analysis": {
   "analyzer": {
   "camel": {
   "type": "pattern",
   "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
   }
   }
   }
   },
)
print(resp)
resp1 = client.indices.analyze(
   index="my-index-000001",
   analyzer="camel",
   text="MooseX::FTPClass2_beta",
)
print(resp1)

Ruby

response = client.indices.create(
  index: 'my-index-000001',
  body: {
   settings: {
   analysis: {
   analyzer: {
   camel: {
   type: 'pattern',
   pattern: '([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])'
   }
   }
   }
   }
  }
)
puts response
response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
   analyzer: 'camel',
   text: 'MooseX::FTPClass2_beta'
  }
)
puts response

Js

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
   analysis: {
   analyzer: {
   camel: {
   type: "pattern",
   pattern:
   "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
   },
   },
   },
  },
});
console.log(response);
const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "camel",
  text: "MooseX::FTPClass2_beta",
});
console.log(response1);

Console

PUT my-index-000001
{
  "settings": {
   "analysis": {
   "analyzer": {
   "camel": {
   "type": "pattern",
   "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
   }
   }
   }
  }
}
GET my-index-000001/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}

The above example produces the following terms:

Text

[ moose, x, ftp, class, 2, beta ]

The regular expression above is easier to understand when written out as:

Regex

([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
   [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
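The commented regex can be tried out locally with a simplified variant. Python's standard `re` module supports neither `\p{L}` nor the `&&` class intersection, so this sketch substitutes ASCII ranges (`[a-z]`, `[A-Z]`) and therefore only approximates the Java pattern for ASCII input. It also relies on `re.split` splitting at empty matches, which needs Python 3.7 or later.

```python
import re

# ASCII-only approximation of the CamelCase pattern (an assumption:
# [a-z] stands in for "lower case letter", [A-Z] for \p{Lu}).
CAMEL = re.compile(
    r"""
      [^A-Za-z\d]+                # swallow non letters and numbers,
    | (?<=\D)(?=\d)               # or non-number followed by number,
    | (?<=\d)(?=\D)               # or number followed by non-number,
    | (?<=[a-z])(?=[A-Z])         # or lower case followed by upper case,
    | (?<=[A-Z])(?=[A-Z][a-z])    # or upper case followed by upper then lower
    """,
    re.VERBOSE,
)


def camel_terms(text):
    """Split on the CamelCase boundaries, then lowercase each term."""
    return [t.lower() for t in CAMEL.split(text) if t]


tokens = camel_terms("MooseX::FTPClass2_beta")
print(tokens)
```

Most alternatives are zero-width lookarounds: they mark a boundary between two characters without consuming either, which is how `FTPClass` is cut into `FTP` and `Class` with no letters lost.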

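Definition

The pattern analyzer consists of:

Tokenizer
- Pattern Tokenizer
Token Filters
- Lower Case Token Filter
- Stop Token Filter (disabled by default)

If you need to customize the pattern analyzer beyond its configuration parameters, recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in pattern analyzer, and you can use it as a starting point for further customization: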

Python

resp = client.indices.create(
   index="pattern_example",
   settings={
   "analysis": {
   "tokenizer": {
   "split_on_non_word": {
   "type": "pattern",
   "pattern": "\\W+"
   }
   },
   "analyzer": {
   "rebuilt_pattern": {
   "tokenizer": "split_on_non_word",
   "filter": [
   "lowercase"
   ]
   }
   }
   }
   },
)
print(resp)

Ruby

response = client.indices.create(
  index: 'pattern_example',
  body: {
   settings: {
   analysis: {
   tokenizer: {
   split_on_non_word: {
   type: 'pattern',
   pattern: '\\W+'
   }
   },
   analyzer: {
   rebuilt_pattern: {
   tokenizer: 'split_on_non_word',
   filter: [
   'lowercase'
   ]
   }
   }
   }
   }
  }
)
puts response

Js

const response = await client.indices.create({
  index: "pattern_example",
  settings: {
   analysis: {
   tokenizer: {
   split_on_non_word: {
   type: "pattern",
   pattern: "\\W+",
   },
   },
   analyzer: {
   rebuilt_pattern: {
   tokenizer: "split_on_non_word",
   filter: ["lowercase"],
   },
   },
   },
  },
});
console.log(response);

Console

PUT /pattern_example
{
  "settings": {
   "analysis": {
   "tokenizer": {
   "split_on_non_word": {
   "type":       "pattern",
   "pattern":    "\\W+"
   }
   },
   "analyzer": {
   "rebuilt_pattern": {
   "tokenizer": "split_on_non_word",
   "filter": [
   "lowercase"
   ]
   }
   }
   }
  }
}


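	The default pattern is `\W+`, which splits on non-word characters. The tokenizer's `pattern` setting is where you would change it.
	Add any other token filters after `lowercase` in the analyzer's `filter` list.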
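As one possible customization, the sketch below builds a settings body that appends the built-in `stop` token filter after `lowercase`. The analyzer name `rebuilt_pattern_with_stop` is hypothetical. The block only constructs and prints the body, so it runs without a cluster; in practice the dict would presumably be passed as the `settings` argument of `client.indices.create`, as in the Python examples above.

```python
import json

settings = {
    "analysis": {
        "tokenizer": {
            # Same tokenizer as the rebuilt analyzer: split on non-word characters.
            "split_on_non_word": {"type": "pattern", "pattern": r"\W+"},
        },
        "analyzer": {
            # Hypothetical name for the customized analyzer.
            "rebuilt_pattern_with_stop": {
                "tokenizer": "split_on_non_word",
                # Extra filters go after "lowercase"; "stop" is added here.
                "filter": ["lowercase", "stop"],
            }
        },
    }
}

# json.dumps escapes the backslash in the pattern for the request body.
body = json.dumps(settings, indent=2)
print(body)
```

Placing `stop` after `lowercase` means terms are already lowercased by the time they are compared against the stop word list.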