SW-RAID & performance | segfault.segfault.digital

Index

Introduction
Base
Preparations for tests

Introduction

Here are some notes I took when I recently set up my software RAID for the 3rd time, with a focus on performance.

A lot of factors influence the performance of your system (e.g. the CPU you're using, the disks that will host your RAID, the type of files that you'll store on it, the throughput of the bus that your disks use, the number of parallel accesses to the RAID, the most frequent type of access) and finding the right combination right at the beginning is very tricky. Therefore my recommendation is:
test, test, test, and then test some more until you have excluded all potentially bad settings, ideally leaving only 1 good candidate, which will be the one that you end up using.

By "test" I don't mean just running one of the benchmark utilities.
You should try to create a "test set" that emulates what your typical usage will be. For example, don't just benchmark the write of a 4GB file created with "dd" (e.g. dd if=/dev/zero of=mybigfile bs=64k count=65536) when you know that you'll mostly deal with much smaller files, assuming that the results will be the same for them: they won't.

Base

Assuming that all your HDDs are the same model and are connected the same way to the NAS/PC/server, create a partition on one of them, format it and run a typical workload on it.

When you format a partition (this applies both when running the tests and later when setting up your final RAID), double-check that it was fully formatted, to avoid performance degradation later when you write to it.
For example, when using ext4, format the partition with the options "-E lazy_itable_init=0,lazy_journal_init=0" to avoid deferred initialization (which otherwise runs in the background after the filesystem is mounted, both while it is idle and while you are writing to it).

In my case, as I knew that the RAID would have to handle big files, I created a big file on it using "time dd if=/dev/zero of=mybigfile bs=64k count=65536 && time sync && rm mybigfile && sync" and added up the two runtimes (dd plus sync). I repeated this a couple of times to double-check that the results were reproducible => this told me the maximum sequential write speed of my HDD.
If a run took less than ~30 seconds, I made the file bigger and ran the test again (measurements that take less than 30 seconds are not reliable enough).
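
The write test above can be wrapped in a small function so that every run is measured the same way. This is only a sketch: "seq_write_test" and its arguments are names made up for this example, and it assumes that your "date" supports "+%s" (GNU and BSD do). It counts whole seconds, which is precise enough for runs of 30 seconds or more.

```shell
#!/bin/sh
# Sketch: time a sequential write including the final sync.
# seq_write_test is a hypothetical helper, not part of any tool.
# Usage: seq_write_test FILE BLOCK_SIZE_KIB COUNT
seq_write_test() {
    file=$1; bs_kib=$2; count=$3
    start=$(date +%s)
    # Write the file; the sync that follows is part of the measurement,
    # otherwise data still sitting in the cache would not be counted.
    dd if=/dev/zero of="$file" bs="${bs_kib}k" count="$count" 2>/dev/null || return 1
    sync
    end=$(date +%s)
    # Clean up so that the next run starts from the same state.
    rm -f "$file" && sync
    elapsed=$((end - start))
    mib=$((bs_kib * count / 1024))
    echo "wrote ${mib} MiB in ${elapsed} s"
    if [ "$elapsed" -lt 30 ]; then
        echo "run too short (under 30 s): increase COUNT and repeat"
    else
        echo "throughput: $((mib / elapsed)) MiB/s"
    fi
}
```

For the 4GB file used above the call would be "seq_write_test mybigfile 64 65536", run from a directory on the partition under test.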

After that I created a big file again (same as above: "dd if=/dev/zero of=mybigfile bs=64k count=65536 && sync"). Either make it bigger than the amount of RAM you have, or clear the filesystem cache after creating it (with "echo 1 > /proc/sys/vm/drop_caches"). Then I ran "time cat mybigfile > /dev/null" => this told me the maximum sequential read speed of my HDD.

Using the two results collected above and taking into account the type of RAID that I would use, I knew the best-case performance that my RAID could achieve. In my case I wanted a RAID5 with 4 HDDs, where one disk's worth of each stripe holds parity, so the theoretical maximum throughput is 3 times that of a single HDD: with a single-HDD write speed of e.g. 140MB/s, the RAID could deliver at most 420MB/s in the best case.
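
The estimate is simple arithmetic: in a RAID5 with N disks, one disk's worth of every stripe is parity, so at best N-1 disks deliver data at the same time. A tiny helper makes it easy to redo the calculation for other setups (the name "raid5_max_throughput" is made up for this sketch, and the result is an upper bound, not a prediction):

```shell
#!/bin/sh
# Sketch: best-case RAID5 throughput from the single-disk throughput.
# Usage: raid5_max_throughput NUMBER_OF_DISKS SINGLE_DISK_MB_PER_S
raid5_max_throughput() {
    disks=$1; single=$2
    if [ "$disks" -lt 3 ]; then
        echo "RAID5 needs at least 3 disks" >&2
        return 1
    fi
    # One disk's worth of each stripe is parity, so N-1 disks carry data.
    echo $(( (disks - 1) * single ))
}

raid5_max_throughput 4 140   # prints 420
```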

I also knew that I would store quite a lot of small files, so I tested that scenario as well: I packed a lot of small files into a single ZIP file, read it once with "cat myzip.zip > /dev/null" so that it was in the filesystem cache (otherwise the HDD would have to read the ZIP while writing its contents, which would invalidate the benchmark results), and then unpacked + sync'ed it, timing it the same way as above.

RAID test preparation

You don't have to use the final partition (and therefore RAID) sizes for your tests.
Create and use small partitions (e.g. 50GB) that can be sync'ed quickly => this way you won't lose a lot of time waiting for the RAID to be synchronized (monitor the progress using "cat /proc/mdstat").
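
Instead of checking by hand, a script can wait until the synchronization is over. The sketch below relies on /proc/mdstat showing a progress line containing "resync" or "recovery" while an array is being synchronized; the function names and the 30-second polling interval are made up for this example.

```shell
#!/bin/sh
# Sketch: block until no md array reports a running resync/recovery.
# Both functions are hypothetical helpers for illustration.

# Succeeds (exit status 0) while FILE, by default /proc/mdstat,
# mentions a resync or a recovery.
md_sync_in_progress() {
    grep -Eq 'resync|recovery' "${1:-/proc/mdstat}"
}

# Polls every 30 seconds until the synchronization is finished.
wait_for_md_sync() {
    while md_sync_in_progress "$@"; do
        sleep 30
    done
}
```

Calling "wait_for_md_sync" right after creating the array and before formatting it keeps the initial synchronization out of the measurements.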

Once the RAID is synchronized, make sure that the filesystem on it is fully formatted as well (e.g. see above about the "lazy" formatting options for ext4).

Chunk

When you create the RAID you can select the so-called "chunk size". If you don't, a default value is used, which you can check by issuing "cat /proc/mdstat"; in the following example the chunk size is 512k:
==============
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde1[2] sdd1[0] sdc1[4] sdb1[1]
      11603447232 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
==============

512k is what my system selected when I did not specify a chunk size.

This option will have an impact on your system performance => you'll have to try out the different sizes to see which one gives you the best performance.

In my case the 512k chunk size performed worse than smaller chunk sizes, both with a few big files and with many small files, and I ended up choosing a chunk size of 64k. This is a good example of why you should test the settings yourself: some guides & posts I read claimed that a bigger chunk size was optimal for small files.
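
Trying out several chunk sizes means repeating the same steps for each candidate. The sketch below only prints the commands for each round instead of running them, because they destroy the data on the partitions involved. The function name, the device names and the list of candidate sizes are assumptions for illustration, and ext4 is assumed as the filesystem. "--chunk" takes the size in KiB, and "--zero-superblock" wipes the old RAID metadata so that the partitions can be reused in the next round.

```shell
#!/bin/sh
# Sketch (dry run): print the commands for one test round per chunk size.
# print_chunk_test_plan is a hypothetical helper; nothing is executed.
# Usage: print_chunk_test_plan MD_DEVICE PARTITION...
print_chunk_test_plan() {
    md=$1; shift
    ndev=$#
    # Candidate chunk sizes in KiB (an arbitrary selection).
    for chunk in 16 32 64 128 256 512; do
        echo "mdadm --create $md --level=5 --raid-devices=$ndev --chunk=$chunk $*"
        echo "# wait until /proc/mdstat shows that the sync has finished"
        echo "mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 $md"
        echo "# mount $md, run your test set, note the results, unmount"
        echo "mdadm --stop $md"
        echo "mdadm --zero-superblock $*"
    done
}
```

For the array shown above the call would be "print_chunk_test_plan /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1". Review the printed commands before running any of them by hand.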

Filesystem options

1)
Decide which filesystem you want to use.

2)
Decide which block size you want to use in your filesystem.

3)
Then check which options you have available for your filesystem in relation to the RAID and compute their optimal values.
For example, the ext filesystems (ext2/3/4) use the options "stride" and "stripe-width" to align the filesystem to the characteristics of the RAID.
In this case you can use the following PHP-script to compute the two values: