
osdump

osdump is a high-performance tool for extracting documents from OpenSearch indexes and saving them to files.

Features

  • As fast as a single-worker solution can be
  • OpenSearch queries are based on search_after
  • Uses fastjson for faster JSON parsing
  • Built-in support for compressing the output with brotli
  • Has some built-in sanity checks to ensure smooth operation
  • Comes with an example of a performance-optimized query

Installation

$ go install github.com/mikkolehtisalo/osdump@latest

Usage

The configuration options:

$ ~/go/bin/osdump -h
Usage of ./osdump:
  -base string
        opensearch base url (default "https://localhost:9200")
  -brotli
        compress using brotli
  -ca string
        CA certificate (default "ca.pem")
  -debug
        debug logging
  -file string
        target file for export (default "graylog_0.json")
  -index string
        opensearch index (default "graylog_0")
  -password string
        opensearch user (default "password")
  -quality int
        brotli quality setting (default 2)
  -query string
        query template file
  -size int
        search window size (default 1000)
  -user string
        opensearch user (default "graylog")

Example run:

$ ~/go/bin/osdump -user admin -password mysecretpassword -size 1000
2024/12/30 21:08:30 osdump.go:296: Starting to dump graylog_0
2024/12/30 21:08:30 osdump.go:300: Index graylog_0 has 272905 documents to dump
2024/12/30 21:09:53 osdump.go:320: Dumped 272905 records in 82 seconds, average speed 3314/second
2024/12/30 21:09:53 osdump.go:321: Finished dumping graylog_0

Requirements

  • Go 1.22+
  • Access to an OpenSearch instance

Limitations

  • Large dumps may require large amounts of disk space
  • Brotli compression is a CPU-heavy operation
  • Assumes OpenSearch security is configured (TLS enabled, and username/password required)
  • Single worker for querying OpenSearch, for now

Performance notes

  • Smaller window sizes seem to perform worse. Start with 1000 (the default) and experiment with larger sizes, up to 10,000 (the maximum OpenSearch allows by default).
  • search_after always requires a sort field. That field should never depend on fielddata, because fielddata is loaded fully into memory and only then sorted. If your cluster has a lot of activity, the loaded data will be evicted from the caches quickly.
  • Fields of keyword type perform significantly better for sorting, so try to always use them for sorting. Depending on the architecture and load of your OpenSearch cluster, the performance difference can range from 2x to 1000x.
  • You should never query OpenSearch without a filter. A filter with match_all performs better than a query without a filter. Having a filter disables scoring and enables most of the caching features of OpenSearch.
  • OpenSearch also has a request cache, which attempts to cache the results of a specific request. I forced it on with request_cache=true, but it will probably have no effect unless I implement retry logic at some stage.

With the performance notes in mind, I designed the default query as follows. You will probably have to change it for your needs (command line option -query):

{
	"size": {{.Size}},
	"query": {"bool": {"must": {"match_all": {}}}},{{if .After}}
	"search_after": ["{{.After}}"],{{end}}
	"sort": [
		{ "gl2_message_id": "asc" }  
	]
}

Contributing

This works for me. If you need more features or find a bug, please open a PR or an issue.

License

osdump is licensed under the MIT License.
