Field data types - Sparse vector

Sparse vector field type

A sparse_vector field indexes features and their weights so that documents can later be queried with a sparse_vector query. The field can also be used with the legacy text_expansion query.

sparse_vector is the field type to use with ELSER mappings.

Python

# Assumes `client` is an already-configured Elasticsearch client.
resp = client.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "text.tokens": {
                "type": "sparse_vector"
            }
        }
    },
)
print(resp)

Ruby

# Assumes `client` is an already-configured Elasticsearch client.
response = client.indices.create(
  index: 'my-index',
  body: {
    mappings: {
      properties: {
        'text.tokens' => {
          type: 'sparse_vector'
        }
      }
    }
  }
)
puts response

Js

// Assumes `client` is an already-configured Elasticsearch client.
const response = await client.indices.create({
  index: "my-index",
  mappings: {
    properties: {
      "text.tokens": {
        type: "sparse_vector",
      },
    },
  },
});
console.log(response);

Console

PUT my-index
{
  "mappings": {
    "properties": {
      "text.tokens": {
        "type": "sparse_vector"
      }
    }
  }
}

See semantic search with ELSER for a complete example of adding documents to a sparse_vector mapped field using ELSER.

Multi-value sparse vectors

When an array of sparse vector values is passed in, the maximum value is selected for features that share the same name.

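The paper "Adapting Learned Sparse Retrieval for Long Documents" (https://arxiv.org/pdf/2305.18494.pdf) discusses this in more detail. In summary, the research findings support that representation aggregation typically outperforms score aggregation.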
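The max-per-feature behavior described above can be sketched in plain Python. This is only an illustration of the stated rule, not the Elasticsearch implementation; the helper name merge_sparse_vectors is hypothetical.

```python
from typing import Dict, Iterable


def merge_sparse_vectors(vectors: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Combine feature->weight maps, keeping the largest weight per feature name.

    Hypothetical helper that mirrors the rule stated in the text.
    """
    merged: Dict[str, float] = {}
    for vector in vectors:
        for feature, weight in vector.items():
            # Keep the weight only if the feature is new or this weight is larger.
            if feature not in merged or weight > merged[feature]:
                merged[feature] = weight
    return merged


impact = [
    {"delicious": 1.2, "terribly": 0.01, "carrots": 0.8},
    {"delicious": 0.02, "terribly": 2.01, "carrots": 0.4},
]
merged = merge_sparse_vectors(impact)
print(merged)  # {'delicious': 1.2, 'terribly': 2.01, 'carrots': 0.8}
```

Note that the merged vector no longer records which input a weight came from, which is why overlapping names that must stay distinct need separate fields.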

If you need to keep features with overlapping names distinct, store them in separate fields or use nested fields.
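As a rough sketch of the nested-field option, the mapping and document below keep one sparse vector per nested object. The field names sentiments, label, and tokens are hypothetical and chosen only for illustration; the bodies are plain dictionaries that could be passed to a client's index-creation and indexing calls.

```python
# Hypothetical mapping: "sentiments", "label", and "tokens" are illustrative names.
nested_mappings = {
    "properties": {
        "sentiments": {
            "type": "nested",
            "properties": {
                "label": {"type": "keyword"},
                "tokens": {"type": "sparse_vector"},
            },
        }
    }
}

# Each nested object carries its own sparse vector, so the weight of
# "delicious" under "positive" is kept apart from the one under "negative".
nested_document = {
    "text": "I had some terribly delicious carrots.",
    "sentiments": [
        {"label": "positive", "tokens": {"terribly": 0.01, "delicious": 1.2}},
        {"label": "negative", "tokens": {"terribly": 2.01, "delicious": 0.02}},
    ],
}
```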

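The following example passes in a document with overlapping feature names. Two categories exist in this example: positive sentiment and negative sentiment. For retrieval purposes, however, we also want the overall impact rather than a specific sentiment. Here, impact is stored as a multi-value sparse vector, and only the maximum value of each overlapping name is stored. More specifically, the final GET query returns a _score of ~1.2, which is max(impact.delicious[0], impact.delicious[1]); the value is approximate because stored weights carry a relative error of about 0.4%.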

Python

# Assumes `client` is an already-configured Elasticsearch client.
# Create the index: one multi-value field (impact) and two single-value fields.
resp = client.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "standard"
            },
            "impact": {
                "type": "sparse_vector"
            },
            "positive": {
                "type": "sparse_vector"
            },
            "negative": {
                "type": "sparse_vector"
            }
        }
    },
)
print(resp)

# Index a document whose impact field holds two vectors with overlapping names.
resp1 = client.index(
    index="my-index-000001",
    document={
        "text": "I had some terribly delicious carrots.",
        "impact": [
            {
                "I": 0.55,
                "had": 0.4,
                "some": 0.28,
                "terribly": 0.01,
                "delicious": 1.2,
                "carrots": 0.8
            },
            {
                "I": 0.54,
                "had": 0.4,
                "some": 0.28,
                "terribly": 2.01,
                "delicious": 0.02,
                "carrots": 0.4
            }
        ],
        "positive": {
            "I": 0.55,
            "had": 0.4,
            "some": 0.28,
            "terribly": 0.01,
            "delicious": 1.2,
            "carrots": 0.8
        },
        "negative": {
            "I": 0.54,
            "had": 0.4,
            "some": 0.28,
            "terribly": 2.01,
            "delicious": 0.02,
            "carrots": 0.4
        }
    },
)
print(resp1)

# Query the merged impact field for the "delicious" feature.
resp2 = client.search(
    index="my-index-000001",
    query={
        "term": {
            "impact": {
                "value": "delicious"
            }
        }
    },
)
print(resp2)

Js

// Assumes `client` is an already-configured Elasticsearch client.
// Create the index: one multi-value field (impact) and two single-value fields.
const response = await client.indices.create({
  index: "my-index-000001",
  mappings: {
    properties: {
      text: {
        type: "text",
        analyzer: "standard",
      },
      impact: {
        type: "sparse_vector",
      },
      positive: {
        type: "sparse_vector",
      },
      negative: {
        type: "sparse_vector",
      },
    },
  },
});
console.log(response);

// Index a document whose impact field holds two vectors with overlapping names.
const response1 = await client.index({
  index: "my-index-000001",
  document: {
    text: "I had some terribly delicious carrots.",
    impact: [
      {
        I: 0.55,
        had: 0.4,
        some: 0.28,
        terribly: 0.01,
        delicious: 1.2,
        carrots: 0.8,
      },
      {
        I: 0.54,
        had: 0.4,
        some: 0.28,
        terribly: 2.01,
        delicious: 0.02,
        carrots: 0.4,
      },
    ],
    positive: {
      I: 0.55,
      had: 0.4,
      some: 0.28,
      terribly: 0.01,
      delicious: 1.2,
      carrots: 0.8,
    },
    negative: {
      I: 0.54,
      had: 0.4,
      some: 0.28,
      terribly: 2.01,
      delicious: 0.02,
      carrots: 0.4,
    },
  },
});
console.log(response1);

// Query the merged impact field for the "delicious" feature.
const response2 = await client.search({
  index: "my-index-000001",
  query: {
    term: {
      impact: {
        value: "delicious",
      },
    },
  },
});
console.log(response2);

Console

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "standard"
      },
      "impact": {
        "type": "sparse_vector"
      },
      "positive": {
        "type": "sparse_vector"
      },
      "negative": {
        "type": "sparse_vector"
      }
    }
  }
}

POST my-index-000001/_doc
{
  "text": "I had some terribly delicious carrots.",
  "impact": [
    {"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
    {"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4}
  ],
  "positive": {"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
  "negative": {"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4}
}

GET my-index-000001/_search
{
  "query": {
    "term": {
      "impact": {
        "value": "delicious"
      }
    }
  }
}

sparse_vector fields cannot be included in indices that were created on Elasticsearch versions between 8.0 and 8.10.

sparse_vector fields only support strictly positive values. Negative values are rejected.
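A client-side check before indexing can surface offending features early. The sketch below is a hypothetical helper, not part of any Elasticsearch client. It assumes that "strictly positive" also excludes zero and non-finite values; the text above only states explicitly that negative values are rejected.

```python
import math
from typing import Dict, List


def find_invalid_weights(vector: Dict[str, float]) -> List[str]:
    """Return the feature names whose weights are not strictly positive.

    Assumption: zero, NaN, and infinity are treated as invalid as well.
    """
    return [
        feature
        for feature, weight in vector.items()
        if not (math.isfinite(weight) and weight > 0)
    ]


candidate = {"delicious": 1.2, "terribly": -0.5, "some": 0.0}
invalid = find_invalid_weights(candidate)
print(invalid)  # ['terribly', 'some']
```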

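sparse_vector fields do not support analyzers, general querying, sorting, or aggregations. They can only be used within specialized queries. The recommended query for these fields is the sparse_vector query; they can also be used within the legacy text_expansion query.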
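For illustration, a sparse_vector query body with precomputed token weights might be assembled as follows. The helper name is hypothetical, and the field and query_vector parameter names reflect my understanding of the sparse_vector query rather than anything stated on this page, so verify them against the query reference for your Elasticsearch version.

```python
from typing import Dict


def build_sparse_vector_query(field: str, query_vector: Dict[str, float]) -> dict:
    """Build a query body that targets a sparse_vector field.

    Hypothetical helper; the "field" and "query_vector" keys are assumed
    parameter names of the sparse_vector query.
    """
    return {"sparse_vector": {"field": field, "query_vector": query_vector}}


query = build_sparse_vector_query("text.tokens", {"delicious": 1.0, "carrots": 0.5})
print(query)
```

With a configured client, the resulting dictionary could be passed as the query argument of a search call, in the same way as the term query in the example above.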

sparse_vector fields preserve only 9 significant bits of precision, which corresponds to a relative error of about 0.4%.
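To get a feel for what 9 significant bits means, the sketch below truncates a 32-bit float to 9 significant bits and measures the resulting relative error. It assumes the low-order bits are simply dropped; the text above only states how many bits are kept, not how the value is encoded or rounded. The function name is hypothetical.

```python
import struct


def truncate_to_significant_bits(value: float, bits: int = 9) -> float:
    """Keep `bits` significant bits of a positive float32 and zero the rest.

    Assumption: truncation (not rounding) of the low-order mantissa bits.
    """
    # A float32 has 24 significant bits: 23 stored plus 1 implicit leading bit.
    drop = 24 - bits
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    raw = (raw >> drop) << drop
    (result,) = struct.unpack(">f", struct.pack(">I", raw))
    return result


stored = truncate_to_significant_bits(1.2)
relative_error = abs(1.2 - stored) / 1.2
print(stored)          # 1.19921875
print(relative_error)  # well below the 2**-8 (about 0.39%) worst case
```

Under this assumption the worst-case relative error is 2**-8, roughly 0.39%, which is consistent with the figure of about 0.4% and with the approximate _score of ~1.2 in the multi-value example.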