Built-in analyzer reference - Fingerprint

Fingerprint analyzer

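The fingerprint analyzer implements a fingerprinting algorithm that the OpenRefine project uses to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted, deduplicated, and concatenated into a single token. If a stop word list is configured, stop words are also removed.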
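The steps described above can be approximated in plain Python. This is a minimal sketch, not the analyzer's actual implementation: the regular expression is only a rough stand-in for the standard tokenizer, and the `ascii_fold` and `fingerprint` helper names are hypothetical.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Approximate ASCII folding: decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text: str) -> str:
    """Return a single fingerprint token for the given text."""
    # Rough stand-in for the standard tokenizer: runs of word characters.
    tokens = re.findall(r"\w+", text)
    # Lowercase each token, then fold accented characters to a plain base form.
    normalized = [ascii_fold(token.lower()) for token in tokens]
    # Deduplicate, sort, and join with a single space.
    return " ".join(sorted(set(normalized)))


print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
# and consistent godel is said sentence this yes
```

Because the tokens are sorted and deduplicated, inputs that differ only in word order, case, punctuation, or accents collapse to the same token, which is what makes the result useful as a clustering key.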

Example output

Python

resp = client.indices.analyze(
    analyzer="fingerprint",
    text="Yes yes, Gödel said this sentence is consistent and.",
)
print(resp)

Ruby

response = client.indices.analyze(
  body: {
    analyzer: 'fingerprint',
    text: 'Yes yes, Gödel said this sentence is consistent and.'
  }
)
puts response

Js

const response = await client.indices.analyze({
  analyzer: "fingerprint",
  text: "Yes yes, Gödel said this sentence is consistent and.",
});
console.log(response);

Console

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

The above sentence produces the following single term:

Text

[ and consistent godel is said sentence this yes ]
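To read the term out of the analyze response in Python, something like the following may help. It assumes the response body carries a `tokens` list whose entries have a `token` field; the `extract_terms` helper is hypothetical, and `sample_response` is a trimmed stand-in for a real response that shows only that one field.

```python
from typing import Any, Dict, List


def extract_terms(analyze_response: Dict[str, Any]) -> List[str]:
    """Collect the token text from an _analyze-style response body."""
    return [entry["token"] for entry in analyze_response.get("tokens", [])]


# Hypothetical, trimmed response body: only the "token" field is shown.
sample_response = {
    "tokens": [
        {"token": "and consistent godel is said sentence this yes"},
    ]
}

print(extract_terms(sample_response))
```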

Configuration

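The fingerprint analyzer accepts the following parameters:

`separator`	The character used to concatenate the terms. Defaults to a space.
`max_output_size`	The maximum token size to emit. Defaults to `255`. Tokens larger than this size are discarded.
`stopwords`	A predefined stop word list such as `_english_`, or an array containing a list of stop words. Defaults to `_none_`.
`stopwords_path`	The path to a file containing stop words.

For more information about stop word configuration, see the stop token filter.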
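As a rough illustration of how these parameters interact, here is a plain-Python sketch. It is not the analyzer's code: the function name `fingerprint_with_options` is hypothetical, the three-word stop list is an illustrative subset rather than the full `_english_` list, and it assumes an output longer than `max_output_size` is dropped as a whole.

```python
import re
import unicodedata
from typing import Iterable, Optional


def fingerprint_with_options(
    text: str,
    separator: str = " ",
    max_output_size: int = 255,
    stopwords: Iterable[str] = (),
) -> Optional[str]:
    """Sketch of fingerprinting with separator, size limit, and stop words."""
    stop = set(stopwords)
    tokens = []
    for raw in re.findall(r"\w+", text):
        lowered = raw.lower()
        # Fold accented characters by dropping combining marks.
        folded = "".join(
            ch
            for ch in unicodedata.normalize("NFKD", lowered)
            if not unicodedata.combining(ch)
        )
        # Stop words are removed after lowercasing and folding.
        if folded not in stop:
            tokens.append(folded)
    output = separator.join(sorted(set(tokens)))
    # Assumption: an output longer than max_output_size is discarded entirely.
    if len(output) > max_output_size:
        return None
    return output


sample = "Yes yes, Gödel said this sentence is consistent and."
print(fingerprint_with_options(sample, stopwords={"and", "is", "this"}))
# consistent godel said sentence yes
print(fingerprint_with_options(sample, separator="+"))
# and+consistent+godel+is+said+sentence+this+yes
print(fingerprint_with_options(sample, max_output_size=10))
# None
```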

Example configuration

In this example, the fingerprint analyzer is configured to use the predefined list of English stop words:

Python

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_fingerprint_analyzer": {
                    "type": "fingerprint",
                    "stopwords": "_english_"
                }
            }
        }
    },
)
print(resp)
resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_fingerprint_analyzer",
    text="Yes yes, Gödel said this sentence is consistent and.",
)
print(resp1)

Ruby

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_fingerprint_analyzer: {
            type: 'fingerprint',
            stopwords: '_english_'
          }
        }
      }
    }
  }
)
puts response
response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_fingerprint_analyzer',
    text: 'Yes yes, Gödel said this sentence is consistent and.'
  }
)
puts response

Js

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_fingerprint_analyzer: {
          type: "fingerprint",
          stopwords: "_english_",
        },
      },
    },
  },
});
console.log(response);
const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_fingerprint_analyzer",
  text: "Yes yes, Gödel said this sentence is consistent and.",
});
console.log(response1);

Console

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

The above example produces the following term:

Text

[ consistent godel said sentence yes ]

Definition

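The fingerprint analyzer consists of:

Tokenizer
- Standard tokenizer
Token filters (in order)
- Lowercase token filter
- ASCII folding token filter
- Stop token filter (disabled by default)
- Fingerprint token filter

If you need to customize the fingerprint analyzer beyond its configuration parameters, recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in fingerprint analyzer, which you can use as a starting point for further customization: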

Python

resp = client.indices.create(
    index="fingerprint_example",
    settings={
        "analysis": {
            "analyzer": {
                "rebuilt_fingerprint": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "fingerprint"
                    ]
                }
            }
        }
    },
)
print(resp)

Ruby

response = client.indices.create(
  index: 'fingerprint_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          rebuilt_fingerprint: {
            tokenizer: 'standard',
            filter: [
              'lowercase',
              'asciifolding',
              'fingerprint'
            ]
          }
        }
      }
    }
  }
)
puts response

Js

const response = await client.indices.create({
  index: "fingerprint_example",
  settings: {
    analysis: {
      analyzer: {
        rebuilt_fingerprint: {
          tokenizer: "standard",
          filter: ["lowercase", "asciifolding", "fingerprint"],
        },
      },
    },
  },
});
console.log(response);

Console

PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
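One possible customization is to place a stop token filter before the fingerprint filter, matching the filter order listed in the definition. The sketch below only builds the settings body. The helper name `build_custom_fingerprint_settings` and the filter name `english_stop` are hypothetical; `stop` and `_english_` are the filter type and predefined list mentioned earlier.

```python
from typing import Any, Dict


def build_custom_fingerprint_settings() -> Dict[str, Any]:
    """Settings for a rebuilt fingerprint analyzer with an English stop filter."""
    return {
        "analysis": {
            "filter": {
                # Hypothetical filter name; it wraps the built-in stop filter type.
                "english_stop": {"type": "stop", "stopwords": "_english_"},
            },
            "analyzer": {
                "rebuilt_fingerprint": {
                    "tokenizer": "standard",
                    # Stop words are removed before the tokens are fingerprinted.
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "english_stop",
                        "fingerprint",
                    ],
                }
            },
        }
    }


settings = build_custom_fingerprint_settings()
print(settings["analysis"]["analyzer"]["rebuilt_fingerprint"]["filter"])
```

The resulting dictionary could then be passed as the `settings` argument of `client.indices.create`, in the same way as the Python example for `fingerprint_example`.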