
2015-02-05 Trifork Test Work Summary and Conclusions

Trifork has improved rtcloud in the following areas:

  • Provisioning of pre-made EBS volumes with Riak data.
  • Automated build and provisioning of riak_ee from github.
  • Automated build and provisioning of riak_test from github.
  • Ansible-driven riak_test runs on provisioned clusters.
  • Test report generation to capture config and output of riak_test runs.
  • Test report upload to S3.
  • AWS Command line utility.
  • Python code restructuring. (RSL: TODO: Elaborate on each point.)

12/12/2014 Week Status

  • Large scale test can now be executed by script.
  • Some EC2 issues encountered: hanging EBS volumes and beam processes dying for no apparent reason.
  • The VPN problem is still not resolved, meaning no inter-datacenter test has been possible so far. According to Joe/Engel, two VPN implementations exist in rtcloud, one of which works. Engel and Rune will get in contact on Monday.
  • The (almost) 1B-key data load is under way. It was stopped temporarily by a hanging EBS volume but is running again.
  • The riak_repl 2.0 branch is currently being cleaned up, so use another branch for the large-scale test until the 2.0 branch stabilizes.
  • HipChat is a good way to stay/get in touch, so use it.
  • Trifork will try to get hold of Greg to coordinate CI integration of the large-scale rtcloud test with him.
  • Test report from a run on empty clusters: https://docs.google.com/document/d/1IOZxpUKLdu8z4Fj_oLjQ5lMgGvrw7sPBblb2D24ceik/edit
  • We're still struggling to get the test to complete on the prefilled clusters due to the EC2 issues, and will produce a report for such a test run once we have one.

12/1/2014 Week Status

  • Overall

    • Good progress on L2.5, the large-scale cluster test
    • Example run without any preloaded data on the estimate branch; basho_bench results
    • Some progress on the new FS (MDC4/FS)
  • Three variants of fullsync were discussed:

    1. keyspace sync, requiring same N-values for buckets on both ends
    2. keyspace sync, permitting different N-values
    3. clone sync

We will implement #1 above, which requires the same N-values across clusters but does permit different ring sizes in the two clusters. Buckets with the same N-value are synced together: all N=1 buckets together, all N=3 buckets together, etc. If some nodes are down, buckets with lower N-values could make the fullsync fail partially. A conceptual sketch of the grouping follows below.
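To make the grouping concrete, here is a conceptual Python sketch (not the actual riak_repl implementation) of how buckets would be partitioned into one sync pass per distinct N-value; the bucket names are made up:

```python
from collections import defaultdict

# Conceptual sketch only: variant #1 groups buckets by N-value and runs
# one fullsync pass per group, which is why both clusters must agree on
# N-values (ring sizes may still differ).

def plan_fullsync(buckets):
    """buckets: dict mapping bucket name -> n_val."""
    passes = defaultdict(list)
    for bucket, n_val in buckets.items():
        passes[n_val].append(bucket)
    return dict(passes)

print(plan_fullsync({"users": 3, "events": 3, "cache": 1}))
# => {3: ['users', 'events'], 1: ['cache']}
# If nodes are down, a low-N pass can fail while higher-N passes still
# complete, giving the partial fullsync failure described above.
```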

Implementing #2 would require all nodes in both clusters to be up during the sync, because syncs need to happen between the primary responsible nodes (which are guaranteed to host merkle trees for all relevant data).

We still need to understand the requirements for #3 (clone sync). If performance is the priority, it makes sense to implement it only for equal Q and N on both sides.

The new repl/aae will be on a branch labeled krab-tagged-aae, off the 2.0 branch.

  • Expect final delivery mid-January.

11/21/2014 Week Status

Minutes from 11/14/14

  • We will run a number of smaller tests next week (30M keys).
  • Throw in large objects (1-50MB) at a rate of 1 in 1000.
  • Get a baseline for these numbers.
  • Micha will find tests that validate RT and FS while nodes come up/down.
  • Start building a 1B-key dataset.
  • Meeting with Greg to explain the setup (Rune + miklix).
  • Do a writeup on the test and how to run it.

Minutes from 10/31/14

Attendees: Krab, Jon, Michal, Greg, Heather

  • The loaded_upgrade test may be one that can be leveraged. Joe may have additional tests. Action: Jon to get Joe to send any additional tests he has.
  • Capture stats: baked into basho_bench. Action: Greg to investigate.
  • Jordan had a test to prepopulate data. Action: Kresten to get info from Jordan.
  • Dataset size: 1B objects, replication factor of 3, size ~1K.

Week 45 Status from Trifork - 11/10/14

  • Spent last week figuring out ins and outs of rt_cloud, amazon, tools, etc.
  • We can now run both mainstream (2.0) and specific branches of riak_ee/riak_test on Amazon
  • Mikael (miklix) has been working on a test that includes a background load using basho_bench.
  • Rune (rsltrifork) has been working on loading large datasets onto S3 using Basho's setup.
  • We still need to figure out how to get stats out from the rt_cloud environment
  • All in all, we believe we're now ready to do the "actual test".

REPL Test Plan (L2.5)

The large-scale test is to verify:

  • large scale repl (fullsync + realtime)
  • realtime balance working
  • fullsync not halted by nodes up/down/add/remove
  • realtime not halted by nodes up/down/add/remove

Output artifacts:

  • performance graph: fullsync time as a function of # keys (a plotting sketch follows below)
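As a sketch of how that artifact could be produced, assuming the fullsync timings have been collected elsewhere as (key count, seconds) pairs (the numbers below are purely illustrative):

```python
# Illustrative only: plot fullsync duration against dataset size.
import matplotlib.pyplot as plt

timings = [(100_000_000, 1200), (500_000_000, 5900), (1_000_000_000, 11800)]

keys, seconds = zip(*timings)
plt.plot(keys, seconds, marker="o")
plt.xlabel("# keys")
plt.ylabel("fullsync time (s)")
plt.savefig("fullsync_vs_keys.png")
```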

Setup

The basic setup involves two clusters of 5 nodes, plus a test driver.

The test is run with a "base load" and a "test script".

The Base Load

For each test, there is a "base load" (that simulates a real time load), and then an operations script (the actual test).

For any given hardware configuration, the base load should:

  • Use ~50% of available Memory, CPU and IOPS

I.e., it's not much fun to test something that does not exercise the system.

The Test Script

The test script involves running various operations under the base load:

Active-Active Scenario

In this scenario, the base load involves

  • Taking writes in both clusters.
  • Realtime replication enabled in both directions.

Clusters may be in the same or different availability zones.
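A sketch of how the test driver could wire up two-way realtime replication using the riak-repl CLI (version 3 replication); the host names, cluster names, and use of ssh are assumptions (9080 is the default cluster manager port):

```python
import subprocess

# Assumed helper: run a riak-repl command on a node over ssh.
def repl(node, *args):
    subprocess.check_call(["ssh", node, "riak-repl"] + list(args))

# Name each cluster (run on any node in it).
repl("east-node1", "clustername", "east")
repl("west-node1", "clustername", "west")

# Connect the clusters and enable + start realtime in both directions.
repl("east-node1", "connect", "west-node1:9080")
repl("east-node1", "realtime", "enable", "west")
repl("east-node1", "realtime", "start", "west")

repl("west-node1", "connect", "east-node1:9080")
repl("west-node1", "realtime", "enable", "east")
repl("west-node1", "realtime", "start", "east")
```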

Operations

Now, the interesting tests involve

  • Running fullsync periodically (sketched below).
  • ... while adding/removing and starting/stopping nodes.
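A sketch of such an operations loop, driving fullsyncs while bouncing nodes; host names, iteration count, and sleep times are placeholders:

```python
import random
import subprocess
import time

NODES = ["east-node%d" % i for i in range(2, 6)]  # keep node1 as contact point

def ssh(node, *cmd):
    subprocess.check_call(["ssh", node] + list(cmd))

for _ in range(10):
    # Kick off a fullsync towards the other cluster...
    ssh("east-node1", "riak-repl", "fullsync", "start", "west")
    # ...and bounce a random node while it runs.
    victim = random.choice(NODES)
    ssh(victim, "riak", "stop")
    time.sleep(300)
    ssh(victim, "riak", "start")
    time.sleep(300)
```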

Test outcome

  • time to complete a fullsync (as a function of the # of keys)
  • does a fullsync have to be restarted when nodes are added/removed? Hopefully not.
  • after stopping the base load generator, validate that the data sets are equal (see the sketch below).
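A rough sketch of the equality check using Basho's Python client; the bucket name is hypothetical, and since listing keys is expensive, a real check on the 1B-key dataset would rather sample keys or compare AAE hashes:

```python
import riak  # Basho's Python client

east = riak.RiakClient(host="east-node1", pb_port=8087)
west = riak.RiakClient(host="west-node1", pb_port=8087)

def all_keys(client, bucket_name):
    keys = set()
    for batch in client.bucket(bucket_name).stream_keys():
        keys.update(batch)
    return keys

east_keys = all_keys(east, "test-bucket")  # hypothetical bucket name
west_keys = all_keys(west, "test-bucket")
print("missing in west:", len(east_keys - west_keys))
print("missing in east:", len(west_keys - east_keys))
```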

In addition to the graphs/stats generated by basho_bench (used to generate the base load), we also need to capture CPU, Memory, IOPS load stats for the riak nodes.
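One possible way to capture those node stats is a small sampler run on each riak node; psutil is an assumption here, and rtcloud/Ansible could collect the output:

```python
import time
import psutil

# Sample CPU%, memory% and disk IOPS every `interval` seconds; the
# interval and output format are arbitrary choices.
def sample(interval=10):
    prev = psutil.disk_io_counters()
    psutil.cpu_percent()  # prime the CPU counter
    while True:
        time.sleep(interval)
        cur = psutil.disk_io_counters()
        iops = (cur.read_count + cur.write_count
                - prev.read_count - prev.write_count) / float(interval)
        prev = cur
        print("%s cpu=%.1f%% mem=%.1f%% iops=%.0f" % (
            time.strftime("%H:%M:%S"),
            psutil.cpu_percent(),
            psutil.virtual_memory().percent,
            iops))

if __name__ == "__main__":
    sample()
```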

Test Data

We need a realistic (large) number of keys to simulate real-world performance of bulk operations like fullsyncs, and also to make sure we don't fit all data in in-memory caches/buffers.

  • Init a 5-node Riak cluster with 1B keys (avg. 500B random data each) with riak's data-dir mounted on 5x600GB EBS volumes.
  • Snapshot the 5 EBS vols to S3 (sketched below).
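A sketch of the snapshot step using boto3; the volume IDs are placeholders, and rtcloud's own provisioning code may do this differently:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Placeholder IDs for the 5 volumes holding riak's data dirs.
VOLUME_IDS = ["vol-aaaa1111", "vol-bbbb2222", "vol-cccc3333",
              "vol-dddd4444", "vol-eeee5555"]

snapshot_ids = []
for vol_id in VOLUME_IDS:
    snap = ec2.create_snapshot(
        VolumeId=vol_id,
        Description="riak 1B-key dataset (%s)" % vol_id)
    snapshot_ids.append(snap["SnapshotId"])
print(snapshot_ids)
```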

Test run

  • Set up two 5-node test clusters connected both ways by RT and FS. Clusters are in different regions: eu-west-1 and us-east-1. Also set up a test node for each cluster for running the test and bench.
  • Create 2x5 600GB EBS provisioned-IOPS (1000 IOPS) volumes from the snapshots (see the sketch after this list).
  • Mount the EBS vols.
  • Start basho_bench instances for both clusters, working all 5 nodes with the background load (~50% of the max achievable with 2-way RT enabled).
  • Run the remaining test operations: start/stop/add/remove nodes and fullsyncs.
  • Create a test report with background load bench results and fullsync timings and machine metrics.
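A sketch of the volume-restore step using boto3; the snapshot IDs and availability zones are placeholders, and note that EBS snapshots are regional, so restoring in the other region requires copying the snapshots there first (ec2.copy_snapshot):

```python
import boto3

# Placeholder snapshot IDs (for us-east-1 these would be the copies).
SNAPSHOT_IDS = ["snap-11111111", "snap-22222222", "snap-33333333",
                "snap-44444444", "snap-55555555"]

def create_volumes(region, az, snapshot_ids):
    ec2 = boto3.client("ec2", region_name=region)
    volume_ids = []
    for snap_id in snapshot_ids:
        vol = ec2.create_volume(SnapshotId=snap_id, AvailabilityZone=az,
                                Size=600, VolumeType="io1", Iops=1000)
        volume_ids.append(vol["VolumeId"])
    return volume_ids

eu_vols = create_volumes("eu-west-1", "eu-west-1a", SNAPSHOT_IDS)
us_vols = create_volumes("us-east-1", "us-east-1a", SNAPSHOT_IDS)
```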

Out of scope for first run

  • Does realtime rebalance correctly when nodes are added/removed?
  • Master-Slave repl.