Logstash+Elasticsearch: Best way to handle JSON arrays

Before I start with the solution, let's review the problem we're trying to solve. Suppose these two JSON documents are pushed to Elasticsearch:-

{
    "test": {
        "steps": [{
            "response_time": "100"
        }, {
            "response_time": "101"
        }]
    }
}


{
    "test": {
        "steps": [{
            "response_time": "101"
        }, {
            "response_time": "100"
        }]
    }
}

And you write a Kibana query like:

test.steps.response_time:101


# Full ES query in the background
{
    "query": {
        "query_string": {
           "query": "test.steps.response_time:101"
        }
    }
}

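It'll match both documents. Why? Because Elasticsearch flattens arrays of objects internally: the response_time values from every element end up in a single multi-valued field, test.steps.response_time, and the position of each element is lost. More details:- https://www.elastic.co/guide/en/elasticsearch/guide/current/complex-core-fields.html#object-arrays and https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html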
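To make the flattening concrete, here is a rough Ruby model of it. This is only a sketch of the idea, not Elasticsearch's actual indexing code; `flatten_fields` and the `DOC_A`/`DOC_B` constants are hypothetical names made up for this illustration.

```ruby
require 'set'

# Hypothetical helper: models how an array of objects collapses into
# one multi-valued field per dotted path. Not Elasticsearch's real code.
def flatten_fields(value, path = nil, out = Hash.new { |h, k| h[k] = Set.new })
  case value
  when Hash
    value.each { |k, v| flatten_fields(v, [path, k].compact.join('.'), out) }
  when Array
    # Array positions are dropped: every element shares the same path.
    value.each { |item| flatten_fields(item, path, out) }
  else
    out[path] << value
  end
  out
end

DOC_A = { 'test' => { 'steps' => [{ 'response_time' => '100' }, { 'response_time' => '101' }] } }
DOC_B = { 'test' => { 'steps' => [{ 'response_time' => '101' }, { 'response_time' => '100' }] } }

# Both documents produce the same field with the same set of values,
# so a query on test.steps.response_time cannot tell them apart.
puts flatten_fields(DOC_A) == flatten_fields(DOC_B)
```

In this model the two documents are indistinguishable once flattened, which is why the query above returns both.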

Not just that: there is no way to query for documents with response_time=101 in the second element of the array, which would logically be test.steps[1].response_time:101.

To fix this, we can simply create a Logstash filter that recursively converts arrays to hashes keyed by element index, i.e., all arrays are converted to hashes, even the nested ones. Hence, we want a filter that converts arrays like this.

Before:-

{
  "foo": "bar",
  "test": {
    "steps": [
      {
        "response_time": "100"
      },
      {
        "response_time": "101",
        "more_nested": [
          {
            "hello": "world"
          },
          {
            "hello2": "world2"
          }
        ]
      }
    ]
  }
}

After:-

{
  "foo": "bar",
  "test": {
    "steps": {
      "0": {
        "response_time": "100"
      },
      "1": {
        "response_time": "101",
        "more_nested": {
          "0": {
            "hello": "world"
          },
          "1": {
            "hello2": "world2"
          }
        }
      }
    }
  }
}

The filter that can do this is shared below:-

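ruby {
	init => "
		# Recursively replace every array with a hash keyed by element index ('0', '1', ...).
		# Hashes are updated in place; arrays are rebuilt as new hashes.
		def arrays_to_hash(value)
		  case value
		  when Array
			value_hash = {}
			value.each_with_index do |item, i|
			  value_hash[i.to_s] = arrays_to_hash(item)
			end
			value_hash
		  when Hash
			# Reassigning existing keys while iterating is safe; no keys are added or removed.
			value.each do |k, v|
			  value[k] = arrays_to_hash(v)
			end
			value
		  else
			value
		  end
		end
	  "
	  # Assumption: event.to_hash exposes the event's underlying hash, so in-place changes stick.
	  # If your Logstash version returns a copy instead, the converted values must be written back to the event.
	  code => "arrays_to_hash(event.to_hash)"
}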
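If you want to check the conversion outside Logstash, a standalone sketch like the one below can help. It is an illustrative variant, not the filter itself: `arrays_to_index_hash` and `BEFORE` are hypothetical names, and unlike the filter it returns a new structure instead of modifying its input.

```ruby
require 'json'

# Illustrative variant of the conversion: returns a new structure
# in which every array becomes a hash keyed by element index.
def arrays_to_index_hash(value)
  case value
  when Array
    result = {}
    value.each_with_index { |item, i| result[i.to_s] = arrays_to_index_hash(item) }
    result
  when Hash
    value.each_with_object({}) { |(k, v), h| h[k] = arrays_to_index_hash(v) }
  else
    value
  end
end

# The "Before" document from above.
BEFORE = JSON.parse(<<~DOC)
  {
    "foo": "bar",
    "test": {
      "steps": [
        { "response_time": "100" },
        { "response_time": "101",
          "more_nested": [ { "hello": "world" }, { "hello2": "world2" } ] }
      ]
    }
  }
DOC

puts JSON.pretty_generate(arrays_to_index_hash(BEFORE))
```

The printed result should have the same shape as the "After" document, with "0" and "1" keys in place of array positions.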

Now it's simple to search for documents that contain response_time=101 in the second element of the array:

test.steps.1.response_time:101
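The reason this works is that each array position is now part of the field path. The sketch below mimics that lookup in plain Ruby on the two example documents after conversion; `lookup`, `CONVERTED_A` and `CONVERTED_B` are hypothetical names used only for illustration, and this is not how Elasticsearch evaluates queries internally.

```ruby
# Hypothetical helper: resolve a dotted field path against a nested hash.
def lookup(doc, dotted_path)
  dotted_path.split('.').reduce(doc) { |node, key| node.is_a?(Hash) ? node[key] : nil }
end

# The two example documents from the start of the post, after conversion.
CONVERTED_A = { 'test' => { 'steps' => { '0' => { 'response_time' => '100' },
                                         '1' => { 'response_time' => '101' } } } }
CONVERTED_B = { 'test' => { 'steps' => { '0' => { 'response_time' => '101' },
                                         '1' => { 'response_time' => '100' } } } }

puts lookup(CONVERTED_A, 'test.steps.1.response_time') # prints 101
puts lookup(CONVERTED_B, 'test.steps.1.response_time') # prints 100
```

Only the first document has 101 at position 1, so only that document matches the query.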

Happy ELKing!