Further writeback optimisation possible? #71
Comments
I have no plan for further optimization because writing back data in FIFO order fits log-structured caching, as I explained before. I also wonder whether such an extravagant optimization would pay off. Does it track all the unwritten data? How much memory footprint does that require? And how long does it take to find sequentiality if the cache device is very big? Please understand that HDDs are quite smart devices. I measured the effect of the sorting and it was quite positive even though the input was random. This means HDDs perform well if the data is sorted in ascending order, even when the blocks aren't sequential. In other words, what I don't trust is the I/O scheduler, not the HDDs. |
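To illustrate the point about ascending order, here is a minimal userspace sketch (not dm-writeboost's actual code; `wb_req` and the function names are invented): a batch of writeback requests is sorted by backing-device sector before submission, so the HDD sees monotonically increasing offsets even though the blocks are not contiguous.

```c
/* Illustrative sketch only: sort a writeback batch by destination sector
 * so the HDD receives ascending offsets, shortening seek distances. */
#include <stdlib.h>

typedef unsigned long long sector_t;

struct wb_req {
    sector_t sector;   /* destination sector on the backing device */
    void *data;        /* 4KB payload read from the caching device */
};

static int cmp_by_sector(const void *a, const void *b)
{
    const struct wb_req *x = a, *y = b;
    if (x->sector < y->sector) return -1;
    if (x->sector > y->sector) return 1;
    return 0;
}

static void sort_batch(struct wb_req *reqs, size_t n)
{
    qsort(reqs, n, sizeof(*reqs), cmp_by_sector);
    /* submit reqs[0..n-1] in this order: ascending, though not sequential */
}
```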
Mhm, makes sense - it seems there is no other way than FIFO. bcache works fine currently; it's just not maintained by Kent anymore since he has started focusing on bcachefs, so we have to keep a lot of extra patches ourselves. I don't know enough about the internals; it keeps some hash tree to calculate and measure this. |
That's too bad. In my opinion, bcache's codebase is too complicated and too huge; only Kent can grasp everything in his software (dm-cache is in a similar state). dm-writeboost, on the other hand, keeps the codebase as small as possible and it's only 5k lines. It has also been forked by a researcher for his research work (https://bitbucket.org/yongseokoh/dm-src). I want other developers to join dm-writeboost so I can reduce my tasks. |
Yes, the bcache code is very complex. That's why I like dm-writeboost: it looks lightweight. Another question: does dm-writeboost not track the written blocks? What happens if I write block A and then read block A while block A is still in the RAM buffer or on the caching device? How does dm-writeboost know it has to answer the read from somewhere other than the backing device? |
You are welcome.
dm-writeboost manages a structure called metablock that corresponds to a 4KB cache block. A metablock manages the "dirtiness" of its cache block:

```c
struct dirtiness {
	bool is_dirty;
	u8 data_bits;
};

struct metablock {
	sector_t sector; /* The original aligned address */

	u32 idx; /* Const. Index in the metablock array */

	struct hlist_node ht_list; /* Linked to the hash table */

	struct dirtiness dirtiness;
};
```

The member dirtiness has is_dirty and data_bits. is_dirty indicates whether the block is still dirty, which is almost the same as whether it needs to be written back (e.g. if the cache block is the result of read caching, the flag is false). data_bits has 8 bits to manage, for each of the 8 sectors in the 4KB cache block, whether that sector's data is cached. For deeper understanding, please read |
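As a rough illustration of how the 8 bits of data_bits could be consulted, here is a sketch under the assumption that bit i corresponds to sector i of the 4KB block; the helper names are hypothetical, not dm-writeboost's API.

```c
/* Illustrative sketch only: interpreting data_bits, where each of the 8 bits
 * marks whether one 512B sector of the 4KB cache block holds cached data. */
#include <stdbool.h>
#include <stdint.h>

/* Returns true if sector index i (0..7) within the 4KB block is cached. */
static bool sector_is_cached(uint8_t data_bits, unsigned i)
{
    return (data_bits >> i) & 1;
}

/* A read of the whole 4KB block can be served from the cache only when
 * all eight sector bits are set. */
static bool block_fully_cached(uint8_t data_bits)
{
    return data_bits == 0xff;
}
```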
I made some tests with fio and compared with EnhanceIO. As for what I found, I've just made a util to count writeback flush-out time, so I'll come back soon with actual data. |
@bash99 It's quite dependent on the I/O amount and its distribution. Please give me the details of your benchmark. By the way, it's not dw-boostwrite but dm-writeboost you must be using. |
@bash99 Also, please share how you set up your dm-writeboost'd device. It's important to know max_batched_writeback in particular. |
I don't know what kind of optimization EnhanceIO does, but generally speaking, this could be happening because dm-writeboost's writeback is restricted to proceed from the older segments to the newer segments. Think about what happens if a newer segment is written back before older ones and the caching device suddenly breaks. |
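A minimal sketch of that ordering constraint (illustrative only; the types and helpers below are invented, not the actual dm-writeboost code): segments are written back strictly in ascending segment id, and the record of what has safely reached the backing device only advances after each older segment completes, so a sudden loss of the caching device can never leave a newer segment persisted while an older one is missing.

```c
#include <stddef.h>
#include <stdint.h>

struct segment {
    uint64_t id;   /* monotonically increasing log position */
    /* ... dirty metablocks ... */
};

/* Hypothetical helpers standing in for the real writeback machinery. */
void writeback_segment(struct segment *seg);
void wait_for_writeback(struct segment *seg);

void writeback_in_order(struct segment *segs, size_t nr,
                        uint64_t *last_writeback_id)
{
    for (size_t i = 0; i < nr; i++) {
        /* segs[] is ordered by id; a newer segment is never persisted
         * to the backing device before every older one has landed. */
        writeback_segment(&segs[i]);
        wait_for_writeback(&segs[i]);
        *last_writeback_id = segs[i].id;  /* advance only after success */
    }
}
```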
Basically it's an official testbox: a 240G Samsung 843T (/dev/sda), a 500G Seagate Barracuda 7200.14 (/dev/sdb), and an i5-4590 with 8G memory. I use a cache size of 31.2GB and a backing device size of 150GB.

I use /dev/sda3 and /dev/sdb5 for the writeboost setup, and /dev/sda2 and /dev/sdb4 for the EIO setup.

I use this function to wait until the backing device has flushed down:

```sh
wait_for_io()
{
	dev=$1
	awk -v target="$2" '
	$14 ~ /^[0-9.]+$/ {
		if($14 <= target) { exit(0); }
	}' < <(iostat -xy $dev 3)
}
```

I use the script below to kick off the EIO cleanup shortly after fio starts:

```sh
sleep 3
sysctl -w dev.enhanceio.enchanceio_test.do_clean=1
```

I use the commands below to make sure the dirty blocks have really been cleaned:

```sh
grep -i dirty /proc/enhanceio/enchanceio_test/stats
dmsetup status webhd | wb_status | grep dirty
```

The main test scripts are below, one for writeboost and one for EIO; the total IOPS is limited with rate_iops.

```sh
./waitio.sh sda 1; time fio --direct=1 --filename=/dev/mapper/webhd --name fio_randw --refill_buffers --ioengine=libaio --rw=randwrite --bs=4k --size=4G --nrfiles=1 --thread --numjobs=16 --iodepth=32 --time_based --runtime=5 --group_reporting --norandommap --rate_iops=500; time ./waitio.sh sda 1; iostat -xy sda 1 1; date; time ./waitio.sh sda 1;
```

```sh
./waitio.sh sda 1; ./sleepkick.sh ; time fio --direct=1 --filename=/dev/sda2 --name fio_randw --refill_buffers --ioengine=libaio --rw=randwrite --bs=4k --size=4G --nrfiles=1 --thread --numjobs=16 --iodepth=32 --time_based --runtime=5 --group_reporting --norandommap --rate_iops=500; time ./waitio.sh sda 1; iostat -xy sda 1 1; date; time ./waitio.sh sda 1;
```

I've varied the runtime from 5 to 10 and changed the size from 48G to 4G. (Changing /sys/block/sda/queue/scheduler between cfq, noop, and deadline makes no difference.)

Below are all the util scripts. |
@akiradeveloper I'm not sure EnhanceIO is totally safe in write-back mode if the cache device crashes, but what they say in the README is
Btw, I've got some funny results from ZFS and its ZIL cache, but I need more tests. I've shared my tests in Google Docs. |
@bash99 How do you write back the dirty caches with wb? And how do you know when it has completed? |
@bash99 As the baseline you need the result with the HDD only. I don't think dmwb gives any performance gain with that workload. Usually, client applications don't write in such a completely sparse way, so it's meaningless to optimize for such a workload. |
Is it possible to limit the I/O range with "--size=4G"?
I have done a similar experiment before (this test isn't workable now because it's not maintained):

```scala
test("writeback sorting effect") {
  val amount = 128 >> 1
  slowDevice(Sector.G(2)) { backing =>
    fastDevice(Sector.M(129)) { caching =>
      Seq(4, 32, 128, 256).foreach { batchSize =>
        Writeboost.sweepCaches(caching)
        Writeboost.Table(backing, caching, Map("nr_max_batched_writeback" -> batchSize)).create { s =>
          XFS.format(s)
          XFS.Mount(s) { mp =>
            reportTime(s"batch size = ${batchSize}") {
              Shell.at(mp)(s"fio --name=test --rw=randwrite --ioengine=libaio --direct=1 --size=${amount}m --ba=4k --bs=4k --iodepth=32")
              Shell("sync")
              Kernel.dropCaches
              s.dropTransient()
              s.dropCaches()
            }
          }
        }
      }
    }
  }
}
```

The result was (at the time I ran this):
This means it takes 85 seconds with the HDD only but only 36 seconds with dmwb, which clearly shows that dmwb boosts writeback. The dirty data amount was 64MB. I think max_batched_writeback is 32 in your case, and it should take 90 seconds or so to finish 160MB. |
@akiradeveloper Yes, I'm not sure it's worth doing more optimization for this case; dmwb is faster than the raw HDD, about 3 times faster. |
That must be so, but it's just a trade-off. Since dmwb is log-structured, the writeback thread can easily know that some data is older or newer than other data according to the segment id. The reason I decided to write back from the older segments is that this property is very important when it comes to production use. FYI, there is a wiki page explaining this: https://github.com/akiradeveloper/dm-writeboost/wiki/Log-structured-caching-explained
It's not guaranteed but very likely so. |
Dm-writeboost (dmwb) is great in its current version, and I think there might be room to improve its performance further. I did some log-structured garbage collection (GC) work on SSDs while working as an IBM researcher, and some of that experience may apply to the FIFO writeback procedure of dmwb. I would like to offer my humble ideas in the following.

I would totally agree with Akira that the log-structured nature of dmwb should be strictly kept when writing back data to the backend. The suggested process could be as follows:

1) read back all dirty blocks on the $max_batched_writeback oldest segments;
2) filter out blocks that are already obsolete (have been re-written);
3) sort and merge them by LBA address;
4) find neighboring dirty blocks in other segments, read them out (un-dirtying them when the writeback succeeds), and then issue larger sequential I/O requests to the backend.

Of course this may take effort to implement, but it has the potential to make dmwb the best-performing and most robust write cache ever, given its log-structured nature. If we implement the above, it will make sense to always keep a pre-defined number of segments in the dmwb device for the sake of sequential writeback. Any comments? |
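A rough sketch of what steps 1-4 might look like (illustrative only; every name here is invented and this is not dm-writeboost's implementation; the cross-segment neighbor lookup in step 4 is reduced to simple coalescing of adjacent LBAs):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t sector_t;

struct dirty_block {
    sector_t lba;        /* destination LBA (in 512B sectors) on the backing device */
    uint64_t segment_id; /* log segment the block came from */
    void *data;          /* 4KB payload */
};

/* Hypothetical helpers standing in for the real machinery. */
size_t read_oldest_segments(struct dirty_block *out, size_t cap,
                            size_t max_batched_writeback);
int block_is_obsolete(const struct dirty_block *b);          /* re-written later? */
void submit_merged_write(struct dirty_block *run, size_t n); /* one sequential I/O */

static int cmp_lba(const void *a, const void *b)
{
    const struct dirty_block *x = a, *y = b;
    return (x->lba > y->lba) - (x->lba < y->lba);
}

void writeback_batch(size_t max_batched_writeback)
{
    static struct dirty_block blocks[4096];  /* fixed-size batch buffer for the sketch */

    /* 1) read back all dirty blocks from the oldest segments */
    size_t n = read_oldest_segments(blocks, 4096, max_batched_writeback);

    /* 2) drop blocks that were overwritten by a newer segment */
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (!block_is_obsolete(&blocks[i]))
            blocks[m++] = blocks[i];

    /* 3) sort by LBA so adjacent blocks can be merged */
    qsort(blocks, m, sizeof(blocks[0]), cmp_lba);

    /* 4) coalesce contiguous LBAs into larger sequential writes */
    size_t start = 0;
    for (size_t i = 1; i <= m; i++) {
        if (i == m || blocks[i].lba != blocks[i - 1].lba + 8) { /* 8 sectors = 4KB */
            submit_merged_write(&blocks[start], i - start);
            start = i;
        }
    }
}
```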
Currently dm-wb is able to write back 1.5-2MB/s of totally random 4k I/O by ordering the segments.
bcache is able to write back at a much higher speed by merging and ordering I/Os sequentially. I think this is possible because it keeps a pool of unwritten segments. Currently I work with a permanent pool of 10GB of data, so it always keeps 10GB of data on the writeback device and then selects mergeable and sequential data to write back.
Is something like this possible?