
Minutes from 10/31/14

Attendees: Krab, Jon, Michal, Greg, Heather

  • loaded_upgrade may be a test that can be leveraged. Joe may have additional tests. Action: Jon to get Joe to send any additional tests he may have.
  • Capture stats: baked into basho_bench. Action: Greg to investigate.
  • Jordan had a test to prepopulate data. Action: Kresten to get info from Jordan.
  • Dataset size: 1B objects, replication factor of 3, object size ~1 KB.

Week 45, Status from Trifork - 11/10/14

  • Spent last week figuring out the ins and outs of rt_cloud, Amazon, tooling, etc.
  • We can now run both mainstream (2.0) and specific branches of riak_ee/riak_test on Amazon (see the sketch below).
  • Mikael (miklix) has been working on a test that includes a background load using basho_bench.
  • Rune (rsltrifork) has been working on loading large datasets onto S3 using Basho's setup.
  • We still need to figure out how to get stats out of the rt_cloud environment.
  • All in all, we believe we are now ready to run the "actual test".
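
For reference, a single run against the Amazon setup looks roughly like the sketch below; the config section name (rtcloud) and the test name (replication2) are only examples and have to match what is actually defined in ~/.riak_test.config.

    # Build riak_test, then run a single test against the cloud harness.
    # "rtcloud" and "replication2" are example names; use the config section
    # and test module from ~/.riak_test.config.
    cd riak_test
    make
    ./riak_test -c rtcloud -t replication2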

REPL Test Plan (L2.5)

The large-scale test is intended to verify:

  • large-scale repl (fullsync + realtime)
  • realtime balancing working
  • fullsync not halted by nodes going up/down or being added/removed
  • realtime not halted by nodes going up/down or being added/removed

Output artifacts:

  • performance graph: #keys vs. fullsync time

Setup

The basic setup involves two clusters of 5 nodes, plus a test driver.

The test is run with a "base load" and a "test script".
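
As a rough sketch of the wiring (the cluster names, IPs and the default cluster-manager port 9080 below are assumptions, not decided values), the two clusters would be named and connected both ways like this:

    # Name the clusters (run on any node in each cluster).
    # Cluster A (eu-west-1):
    riak-repl clustername cluster_a
    # Cluster B (us-east-1):
    riak-repl clustername cluster_b

    # Connect them both ways via the cluster manager (default port 9080).
    # On a cluster A node (10.0.2.1 = a cluster B node, example IP):
    riak-repl connect 10.0.2.1:9080
    # On a cluster B node (10.0.1.1 = a cluster A node, example IP):
    riak-repl connect 10.0.1.1:9080

    # Verify both connections are up.
    riak-repl connections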

The Base Load

For each test, there is a "base load" (which simulates an ongoing real-time load) and then an operations script (the actual test).

For any given hardware configuration, the base load should:

  • Use ~50% of available Memory, CPU and IOPS

I.e., there is little point in testing a system that is not being exercised.
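
A possible way to drive the base load is basho_bench with a config along the lines sketched below; the IPs, operation mix and concurrency are placeholders that have to be tuned per hardware configuration until the cluster sits at about 50% utilization, while the ~1 KB value size and ~1B key space come from the dataset numbers agreed above.

    # Run the background load from the test node; tune concurrency, rate and
    # operation mix until the riak nodes sit at roughly 50% CPU/memory/IOPS.
    # base_load.config is a normal basho_bench config (Erlang terms), e.g.:
    #   {driver, basho_bench_driver_riakc_pb}.
    #   {riakc_pb_ips, [{"10.0.1.1", 8087}, {"10.0.1.2", 8087}, ...]}.
    #   {key_generator, {int_to_bin_bigendian, {uniform_int, 1000000000}}}.  %% ~1B keys
    #   {value_generator, {fixed_bin, 1024}}.                                %% ~1 KB objects
    #   {operations, [{get, 4}, {put, 1}]}.
    cd basho_bench
    ./basho_bench base_load.config      # results accumulate in tests/current/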

The Test Script

The test script involves running various operations on top of the base load:

Active-Active Scenario

In this scenario, the base load involves

  • Taking writes in both clusters.
  • Realtime replication enabled in both directions.

Clusters may be in the same or different availability zones.
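
Assuming the cluster names from the setup sketch above (cluster_a / cluster_b), enabling realtime in both directions amounts to something like:

    # On a cluster A node: ship realtime writes to cluster B.
    riak-repl realtime enable cluster_b
    riak-repl realtime start cluster_b

    # On a cluster B node: ship realtime writes to cluster A.
    riak-repl realtime enable cluster_a
    riak-repl realtime start cluster_a

    # Inspect realtime queues / connection state on either side.
    riak-repl status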

Operations

Now, the interesting tests involve

  • Running fullsync periodically.
  • ... while adding/removing, starting/stopping nodes.
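
In practice the operations boil down to commands along these lines (node names and timings are placeholders; fullsync could also be scheduled periodically via the repl config instead of started by hand):

    # On a cluster A node: enable and kick off a fullsync to cluster B.
    riak-repl fullsync enable cluster_b
    riak-repl fullsync start cluster_b

    # ...meanwhile, stop/start a node in the source cluster:
    riak stop && sleep 300 && riak start      # run on e.g. riak@10.0.1.3

    # ...and/or add and remove a node:
    riak-admin cluster join riak@10.0.1.1     # run on the joining node
    riak-admin cluster plan
    riak-admin cluster commit

    riak-admin cluster leave                  # run on the leaving node
    riak-admin cluster plan
    riak-admin cluster commit

    # Watch whether fullsync survives the membership changes.
    riak-repl status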

Test outcome

  • time to complete fullsync (as a function of the number of keys)
  • does fullsync have to be restarted when nodes are added/removed? Hopefully not.
  • validate, after stopping the base load generator, that the data sets in the two clusters are equal.

In addition to the graphs/stats generated by basho_bench (used to generate the base load), we also need to capture CPU, Memory, IOPS load stats for the riak nodes.
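
A low-tech way to capture those node stats, in case nothing better turns out to be baked into rt_cloud, is to poll the standard OS tools and Riak's HTTP /stats endpoint on every node; the one-minute interval and file names below are arbitrary:

    # On each riak node: sample CPU/memory and disk IOPS once a minute.
    vmstat 60      > vmstat.log &
    iostat -dxk 60 > iostat.log &

    # Riak KV stats via the HTTP /stats endpoint, repl progress via riak-repl status.
    while true; do
      curl -s http://127.0.0.1:8098/stats >> riak_stats.jsonl; echo >> riak_stats.jsonl
      riak-repl status >> repl_status.log
      sleep 60
    done &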

Test design

  1. Set up two 5-node clusters connected both ways by RT and FS. The clusters are in different regions: eu-west-1 and us-east-1. Also set up a test node for each cluster for running the test and bench.
  2. Start basho_bench instances for both clusters, working all 5 nodes with the background load (~50%).
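
The #keys vs. fullsync-time graph has to be assembled from the riak-repl status samples, but the latency/throughput graphs for the base load come straight out of basho_bench, e.g.:

    # On each test node, during or after a run: render the standard basho_bench
    # latency/throughput graphs from the most recent results directory.
    cd basho_bench
    Rscript --vanilla priv/summary.r -i tests/current
    # (equivalent to "make results"; output ends up as tests/current/summary.png)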

Out of scope for first run

  • Does realtime rebalance correctly when nodes are added/removed?
  • Master-Slave repl.