osdump is a high-performance tool for extracting documents from OpenSearch indexes and saving them into files.
- As high performance as a single worker solution can be
- OpenSearch queries are based on search_after
- Uses fastjson for faster JSON parsing
- Built-in support for compressing the output using brotli
- Has some built-in sanity checks to ensure smooth operation
- Comes with an example of a performance-optimized query
$ go install github.com/mikkolehtisalo/osdump@latest
The configuration options:
$ ~/go/bin/osdump -h
Usage of ./osdump:
-base string
opensearch base url (default "https://localhost:9200")
-brotli
compress using brotli
-ca string
CA certificate (default "ca.pem")
-debug
debug logging
-file string
target file for export (default "graylog_0.json")
-index string
opensearch index (default "graylog_0")
-password string
opensearch user (default "password")
-quality int
brotli quality setting (default 2)
-query string
query template file
-size int
search window size (default 1000)
-user string
opensearch user (default "graylog")
Example run:
$ ~/go/bin/osdump -user admin -password mysecretpassword -size 1000
2024/12/30 21:08:30 osdump.go:296: Starting to dump graylog_0
2024/12/30 21:08:30 osdump.go:300: Index graylog_0 has 272905 documents to dump
2024/12/30 21:09:53 osdump.go:320: Dumped 272905 records in 82 seconds, average speed 3314/second
2024/12/30 21:09:53 osdump.go:321: Finished dumping graylog_0
- Go 1.22+
- Access to an OpenSearch instance
- Large dumps may require large amounts of disk space
- Brotli compression is a CPU-heavy operation
- Assumes OpenSearch security is configured (TLS enabled, and username/password required)
- Single worker for querying OpenSearch, for now
- Smaller window sizes seem to perform worse. Start with 1000 (the default) and experiment with larger sizes up to 10 000 (the maximum supported by OpenSearch).
- search_after always requires a sort field. The sort field should never be of fielddata type, because it will be loaded fully into memory and sorted only after that. If your cluster has a lot of activity, the loaded data will be evicted from the caches quickly.
- Fields of keyword type perform significantly better for sorting, so try to always use them for sorting. The performance difference may vary between 2x and 1000x, depending on the architecture and load of your OpenSearch cluster.
- You should never query OpenSearch without a filter. A filter with match_all performs better than a query without a filter. Having a filter disables scoring and enables most of the caching features of OpenSearch.
- OpenSearch also has a request cache, which attempts to cache the results for a specific request. I forced it to be enabled with request_cache=true, but it probably will not have any effect unless I implement retrying logic at some stage.
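The search_after paging idea can be sketched without a cluster. The following is a minimal illustration, not osdump's actual code: doc, makeDocs, fetchPage and dumpAll are hypothetical names, and fetchPage stands in for one OpenSearch request by slicing an in-memory list that is already sorted by a keyword-like ID. Stopping on an empty page is also an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// doc is a hypothetical minimal document with a keyword-like sort key.
type doc struct {
	ID string
}

// makeDocs builds n documents whose zero-padded IDs sort lexicographically.
func makeDocs(n int) []doc {
	docs := make([]doc, 0, n)
	for i := 0; i < n; i++ {
		docs = append(docs, doc{ID: fmt.Sprintf("%05d", i)})
	}
	return docs
}

// fetchPage stands in for one search request. It returns up to size
// documents whose ID sorts strictly after the cursor. The input slice
// must already be sorted by ID in ascending order.
func fetchPage(sorted []doc, after string, size int) []doc {
	start := sort.Search(len(sorted), func(i int) bool {
		return sorted[i].ID > after
	})
	end := start + size
	if end > len(sorted) {
		end = len(sorted)
	}
	return sorted[start:end]
}

// dumpAll pages through the data, feeding the last sort value of each
// page back in as the next cursor, until an empty page comes back.
func dumpAll(sorted []doc, size int) (out []doc, requests int) {
	after := "" // an empty cursor means "start from the beginning"
	for {
		page := fetchPage(sorted, after, size)
		requests++
		if len(page) == 0 {
			return out, requests
		}
		out = append(out, page...)
		after = page[len(page)-1].ID
	}
}

func main() {
	out, requests := dumpAll(makeDocs(2500), 1000)
	fmt.Printf("dumped %d documents in %d requests\n", len(out), requests)
}
```

Because each request only needs the last sort value of the previous page, the cost of a page does not grow with how deep into the index the dump has progressed, which is why a cheap-to-sort keyword field matters so much here.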
Keeping these performance notes in mind, I designed the default query as follows, but you will probably have to change it (command line option -query) for your needs:
{
"size": {{.Size}},
"query": {"bool": {"must": {"match_all": {}}}},{{if .After}}
"search_after": ["{{.After}}"],{{end}}
"sort": [
{ "gl2_message_id": "asc" }
]
}
This works for me. If you need more features or find a bug, please open a PR or an issue.
osdump is licensed under the MIT License.