Notes on getting badly stuck trying to ingest a number as a string with fluent-plugin-elasticsearch

Introduction

I got badly stuck on this, so here are some quick notes.
The punchlines: "wrap strings in double quotes" and "log design is part of system design".

Background

The symptom

Take an ID number made up only of digits, like this...

"test_id" : 1234567890123456789

After the application sends it through fluent-logger and a default-configured fluent-plugin-elasticsearch into Elasticsearch, viewing it in Kibana, elasticsearch-head, and the like shows...

"test_id" : 1234567890123456800

Huh... the low-order digits get rounded off, as shown above.

First thoughts

Maybe the field type has ended up as something other than string?
Maybe pinning the field's mapping to string in a mapping template will do the trick?

Investigation and response

First, a mapping template

Pin the field's mapping to string.

curl -XPUT localhost:9200/_template/template_1 -d '
{
    "template" : "huga*",
    "mappings" : {
      "huga" : {
        "properties" : {
          "@timestamp" : { "type":"date", "format":"dateOptionalTime" },
          "test_id" : {
            "type" : "string"
          }
        }
      }
    }
  }
'

Putting data in by hand...

curl -XPUT 'http://localhost:9200/huga-test/huga/1' -d '{
   "@timestamp" : "2014-03-13T23:20:00+09:00",
   "test_id" : 1234567890123456789
}'

The problem does not reproduce... and when the data is put in through fluent-plugin-elasticsearch...

{
  "_index" : "huga-test",
  "_type" : "huga",
  "_id" : "1",
  "_version" : 1,
  "found" : true, "_source" : {
   "@timestamp" : "2014-03-13T23:20:00+09:00",
   "test_id" : 1234567890123456789
}
}

Hmm, again it does not reproduce.

How does fluent-plugin-elasticsearch put data into Elasticsearch?

As the following excerpt shows, it uses the Bulk API.

    bulk_message << ""

    http = Net::HTTP.new(@host, @port.to_i)
    request = Net::HTTP::Post.new('/_bulk', {'content-type' => 'application/json; charset=utf-8'})
    request.body = bulk_message.join("\n")
    http.request(request).value
  end

So let's try putting it in with the Bulk API

Prepare test.json as follows.

{ "index" : { "_index" : "huga-test", "_type" : "huga", "_id" : "2" } }
{"@timestamp" : "2014-03-13T23:25:00+09:00","test_id" : 1234567890123456789}

Send it to Elasticsearch like this.

curl -s -XPOST localhost:9200/_bulk --data-binary @test.json;echo
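
For reference, the same kind of bulk body can be built programmatically. The Ruby sketch below is not the plugin's code: `bulk_body` is a hypothetical helper, and the index, type, and ID values simply mirror test.json. It only builds the newline-delimited payload and does not send anything.

```ruby
require 'json'

# Hypothetical helper (not part of fluent-plugin-elasticsearch):
# builds a newline-delimited Bulk API body from [meta, source] pairs.
def bulk_body(pairs)
  lines = pairs.flat_map do |meta, source|
    [JSON.generate({ "index" => meta }), JSON.generate(source)]
  end
  # The Bulk API expects every line, including the last one, to end with "\n".
  lines.join("\n") + "\n"
end

body = bulk_body([
  [
    { "_index" => "huga-test", "_type" => "huga", "_id" => "2" },
    { "@timestamp" => "2014-03-13T23:25:00+09:00", "test_id" => 1234567890123456789 }
  ]
])
puts body
```

Ruby's JSON generator writes the integer with all of its digits, which fits the observation that the value arrives in Elasticsearch intact.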

Checking the result...

{
  "_index" : "huga-test",
  "_type" : "huga",
  "_id" : "2",
  "_version" : 1,
  "found" : true, "_source" : {"@timestamp" : "2014-03-13T23:25:00+09:00","test_id" : 1234567890123456789}
}

The problem does not reproduce...

Looking at it in Kibana and friends...

In Kibana...

[Screenshot: Kibana (f:id:inokara:20140313235732p:plain)]

In elasticsearch-head...

[Screenshot: elasticsearch-head (f:id:inokara:20140313235742p:plain)]

Reproduced.

Come to think of it, strings in JSON...

I recalled that a string in JSON is supposed to be wrapped in " double quotes, so I changed the request as follows.

curl -XPUT 'http://localhost:9200/huga-test/huga/4' -d '{
   "@timestamp" : "2014-03-13T23:28:00+09:00",
   "test_id" : "1234567890123456789"
}'
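
The only difference between the two payloads is the JSON type of the value. A small Ruby sketch (the variable names are mine, purely for illustration) makes that visible:

```ruby
require 'json'

as_number = JSON.parse('{"test_id": 1234567890123456789}')
as_string = JSON.parse('{"test_id": "1234567890123456789"}')

# Unquoted digits are a JSON number; quoted digits are a JSON string.
puts as_number["test_id"].class
puts as_string["test_id"].class

# A string carries no numeric precision to lose, whatever the consumer does with it.
puts as_string["test_id"]
```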

After registering it properly as a string, Kibana shows...

[Screenshot: Kibana (f:id:inokara:20140313235754p:plain)]

Oh, it is not rounded.

Conclusion, or rather, what I did

Setting aside whether there were grown-up reasons that made a quick fix on the application side difficult, I handled it as follows.

Countermeasure 1: pin the mapping to string

Pin the field's mapping with a template, as follows.

{
    "template" : "huga*",
    "mappings" : {
      "huga" : {
        "properties" : {
          "@timestamp" : { "type":"date", "format":"dateOptionalTime" },
          "test_id" : {
            "type" : "string"
          }
        }
      }
    }
  }

Countermeasure 2: coerce the field type with fluent-plugin-typecast

Pinning the field's mapping alone had already been shown not to help, so I used fluent-plugin-typecast to coerce the type of the data flowing through.

  <store>
    type typecast
    item_types test_id:string
    prefix filtered
  </store>

fluent-plugin-typecast has helped me out before, in an earlier post...
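
Conceptually, the cast turns the numeric ID into a string before the record reaches Elasticsearch. As far as I understand, Elasticsearch returns `_source` as it was sent, which would explain why the mapping alone does not change what a client displaying `_source` sees. The Ruby sketch below only illustrates the idea and is not the plugin's actual implementation; `cast_to_string` is a hypothetical helper.

```ruby
require 'json'

# Hypothetical helper: return a copy of the record with the given keys
# converted to strings (keys missing from the record are left alone).
def cast_to_string(record, keys)
  record.merge(
    keys.select { |k| record.key?(k) }.to_h { |k| [k, record[k].to_s] }
  )
end

record = { "@timestamp" => "2014-03-13T23:25:00+09:00", "test_id" => 1234567890123456789 }
casted = cast_to_string(record, ["test_id"])

# The ID is now emitted inside double quotes.
puts JSON.generate(casted)
```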

But still...

If the value is meant to be treated as a string, I intend to ask for the application to be fixed so that it is wrapped in " double quotes.

Closing

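That said, why do the results fetched with curl on the command line differ from what Kibana and the like show...
Log output, like system design, deserves proper thought about what it is for and how it will be used.
And do not forget the double quotes.
I learned a lot.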
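
A likely explanation for the curl versus Kibana difference, offered as my own assumption rather than something verified in this post: Kibana and elasticsearch-head run in the browser, where JavaScript numbers are IEEE 754 doubles, and a double cannot represent every integer above 2^53. curl simply prints the bytes Elasticsearch returns. The Ruby sketch below imitates the double conversion with `to_f`; it does not run any Kibana code.

```ruby
id = 1234567890123456789

# Converting to a double and back yields the nearest representable value,
# which differs from the original ID in its last digits.
as_double = id.to_f
puts as_double.to_i

# In a double, consecutive integers are exact only up to 2**53.
limit = 2**53
still_exact = (limit + 1).to_f.to_i == limit + 1
puts still_exact

# Sending the ID as a string sidesteps the conversion entirely.
puts id.to_s
```

A JavaScript client would then print the shortest decimal that round-trips to that double, which would account for the trailing zeros in the displayed value.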

ようへいの日々精進XP