Delete Duplicate Documents with Elasticsearch and Ruby

Do you remember the good old days of delete_by_query in Elasticsearch? If you want to delete documents selected by a complex query or an aggregation, you are on your own. I needed to delete documents that share the same date (the same @timestamp), so I used the following script.

require 'elasticsearch'
client = Elasticsearch::Client.new

begin
  # find duplicate documents by @timestamp
  result = client.search(
    index: 'my_index*', 
    body: {
      aggs: {
        duplicateCount: {
          terms: {
            field: "@timestamp",
            min_doc_count: 2,
            size: 100
          },
          aggs: {
            duplicateDocuments: {
              top_hits: {}
            }
          }
        }
      }
    }
  )['aggregations']['duplicateCount']['buckets'].map do |bucket|
    # take the first document of each duplicate group; the surrounding loop
    # repeats until only one copy per @timestamp is left
    bucket['duplicateDocuments']['hits']['hits'].first
  end

  result.each do |doc|
    client.delete(index: doc['_index'], type: doc['_type'], id: doc['_id'])
  end
  client.indices.refresh(index: 'my_index*')
end until result.empty?
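
The script above deletes one document per duplicate group on each pass and issues one delete request per document. A possible variation, sketched here under a few assumptions, is to keep the first hit of each bucket, collect all the other hits, and send them in a single bulk request. The helper names `surplus_duplicates` and `bulk_delete_body` and the sample ids are made up for illustration. Note that top_hits returns only three hits per bucket by default, so larger groups would still need the loop or a bigger top_hits size, and on Elasticsearch versions that still use mapping types the delete action would also need `_type`.

```ruby
# Hypothetical helper: from a terms + top_hits aggregation response, return
# every hit except the first of each bucket, so one copy per group survives.
def surplus_duplicates(response, agg: 'duplicateCount', hits_agg: 'duplicateDocuments')
  buckets = response.dig('aggregations', agg, 'buckets') || []
  buckets.flat_map do |bucket|
    hits = bucket.dig(hits_agg, 'hits', 'hits') || []
    hits.drop(1) # keep the first hit, mark the rest for deletion
  end
end

# Hypothetical helper: build the body for one bulk call,
# with one delete action per document.
def bulk_delete_body(docs)
  docs.map { |doc| { delete: { _index: doc['_index'], _id: doc['_id'] } } }
end

# Sample response shaped like the aggregation result above (made-up ids).
sample = {
  'aggregations' => {
    'duplicateCount' => {
      'buckets' => [
        {
          'key' => 1,
          'doc_count' => 3,
          'duplicateDocuments' => { 'hits' => { 'hits' => [
            { '_index' => 'my_index-1', '_id' => 'a' },
            { '_index' => 'my_index-1', '_id' => 'b' },
            { '_index' => 'my_index-2', '_id' => 'c' }
          ] } }
        }
      ]
    }
  }
}

body = bulk_delete_body(surplus_duplicates(sample))
puts body.inspect

# Against a live cluster this would be sent as:
#   client.bulk(body: body) unless body.empty?
```

Keeping the selection logic in plain functions makes it possible to check which documents would be removed before anything is actually deleted.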
