Delete Duplicate Documents with Elasticsearch and Ruby
Do you remember the good old days of delete_by_query in Elasticsearch? If you want to delete documents selected by a complex query or an aggregation, you are on your own. I needed to delete documents that share the same date, so I used the following script.
require 'elasticsearch'

client = Elasticsearch::Client.new

# Each pass deletes one document per duplicated @timestamp; repeat until
# the aggregation no longer returns any duplicates.
begin
  # find duplicate documents by @timestamp
  result = client.search(
    index: 'my_index*',
    body: {
      size: 0, # only the aggregation is needed, not the search hits
      aggs: {
        duplicateCount: {
          terms: {
            field: '@timestamp',
            min_doc_count: 2, # only buckets with at least two documents
            size: 100         # at most 100 buckets per pass
          },
          aggs: {
            duplicateDocuments: {
              top_hits: {}
            }
          }
        }
      }
    }
  )['aggregations']['duplicateCount']['buckets'].map do |bucket|
    # use the first document of the duplicates
    bucket['duplicateDocuments']['hits']['hits'].first
  end

  result.each do |doc|
    client.delete(index: doc['_index'], type: doc['_type'], id: doc['_id'])
  end

  # make the deletions visible to the next search
  client.indices.refresh(index: 'my_index*')
end until result.empty?
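
The script above sends one delete request per document. If that turns out to be slow, a possible variant is to collect the deletions and send them through the bulk API. The following is only a sketch, not part of the original script: bulk_delete_actions is a hypothetical helper, and the sample buckets are made up to mirror the shape of the duplicateCount aggregation. Unlike the loop above, it keeps the first hit of each bucket and marks the others for deletion, so it assumes top_hits was given a size large enough to return all duplicates. On older Elasticsearch versions that still use mapping types, each action would presumably also need a _type entry.

```ruby
# Hypothetical helper (not from the original script): turns the buckets of
# the duplicateCount aggregation into a list of bulk delete actions.
# The first hit of every bucket is kept, all remaining hits are deleted.
def bulk_delete_actions(buckets)
  buckets.flat_map do |bucket|
    hits = bucket['duplicateDocuments']['hits']['hits']
    hits.drop(1).map do |doc|
      { delete: { _index: doc['_index'], _id: doc['_id'] } }
    end
  end
end

# Made-up sample data, shaped like the aggregation response used above.
sample_buckets = [
  {
    'key' => 1,
    'doc_count' => 3,
    'duplicateDocuments' => {
      'hits' => {
        'hits' => [
          { '_index' => 'my_index-1', '_id' => 'a' },
          { '_index' => 'my_index-1', '_id' => 'b' },
          { '_index' => 'my_index-2', '_id' => 'c' }
        ]
      }
    }
  }
]

actions = bulk_delete_actions(sample_buckets)

# With a running cluster, the actions could then be sent in a single request:
#   client.bulk(body: actions) unless actions.empty?
```

Because the helper only works on plain hashes, it can be tried out without a cluster before pointing it at real data.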