SW-Raid & performance
First published on: 31.Mar.2014

Last change: 31.Mar.2014

Introduction

Here are some notes I took when I recently set up my software RAID for the 3rd time, this time with a focus on performance.

Many factors influence the performance of your system (e.g. the CPU you're using, the disks that will host your RAID, the type of files that you'll store on it, the throughput of the bus that your disks use, the number of parallel accesses to the RAID, the most frequent type of access), and finding the right combination right at the beginning is very tricky. My recommendation therefore is:
test, test, test, and then test some more until you have excluded all potentially bad settings, ideally leaving only 1 good candidate, which will be the one you end up using.

By "test" I don't mean just running one of the benchmark utilities.
You should try to create a "test set" that emulates your typical usage. For example, don't just benchmark the write of a 4GB file created with "dd" (e.g. dd if=/dev/zero of=mybigfile bs=64k count=65536) when you know that you'll mostly deal with much smaller files, assuming that the results for them will be the same: they won't.


Base

Assuming that all your HDDs are the same model and are connected the same way to the NAS/PC/server, create a partition on one of them, format it and run a typical workload on it.

When you format a partition (this applies both when running the tests and later when setting up your final RAID), double-check that it was fully formatted, to avoid performance degradation later when you write to it.
For example, when using ext4, format the partition with the options "-E lazy_itable_init=0,lazy_journal_init=0" to avoid deferred formatting (which would otherwise take place later, both while the filesystem is idle and while you write to it).

In my case, as I knew that the RAID would have to handle big files, I created a big file on it using "time dd if=/dev/zero of=mybigfile bs=64k count=65536 && time sync && rm mybigfile && sync" and summed up the two runtimes. I did this a couple of times to double-check that the results were reproducible => this told me the maximum sequential write speed of my HDD.
If the test took less than ~30 seconds, I made the file bigger and ran the test again (measurements that take less than 30 seconds are not reliable enough).
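
The write test above can be wrapped in a small helper so that every run is timed the same way. This is only a rough sketch: the function names "throughput_mbps" and "seq_write_test" are made up for illustration, 1 MB is assumed to be 1'000'000 bytes, and the timing has a resolution of one second.

```shell
#!/bin/sh
# Sketch: sequential-write benchmark that includes the final sync in the timing.
# throughput_mbps and seq_write_test are hypothetical helper names.

# Integer throughput in MB/s (1 MB = 1,000,000 bytes) from bytes and seconds.
throughput_mbps() {
    echo $(( $1 / $2 / 1000000 ))
}

# Write "count" blocks of 64 KiB to "target", sync, report, then clean up.
seq_write_test() {
    target=$1
    count=$2
    start=$(date +%s)
    dd if=/dev/zero of="$target" bs=64k count="$count" 2>/dev/null
    sync
    end=$(date +%s)
    elapsed=$(( end - start ))
    if [ "$elapsed" -lt 30 ]; then
        echo "only ${elapsed}s: too short to be reliable, increase count" >&2
    fi
    [ "$elapsed" -gt 0 ] || elapsed=1   # avoid dividing by zero on tiny runs
    echo "$(throughput_mbps $(( count * 65536 )) "$elapsed") MB/s"
    rm -f "$target"
    sync
}
```

Calling "seq_write_test /mnt/test/mybigfile 65536" (the path is a placeholder) writes the same 4GB file as the dd command above; run it several times and compare the results.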

After having done the above a couple of times, I created a big file again (same as above: "dd if=/dev/zero of=mybigfile bs=64k count=65536 && sync"). Either make it bigger than the amount of RAM you have, or clear the filesystem cache after creating it (with "echo 1 > /proc/sys/vm/drop_caches"). I then ran "time cat mybigfile > /dev/null", which told me the maximum sequential read speed of my HDD.

Using the two results collected above and taking into account the type of RAID that I would use, I knew the best-case performance that my RAID could achieve. As I wanted to use a RAID5 with 4 HDDs, the theoretical maximum throughput was 3 times that of a single HDD: with a single-HDD write speed of e.g. 140MB/s, the RAID was expected to deliver at most 420MB/s in the best case.
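
The arithmetic behind that estimate, as a tiny sketch (the function name "raid5_best_case" is made up for illustration; it assumes an ideal RAID5 where the equivalent of one disk holds parity and the others carry data):

```shell
# Best-case RAID5 throughput: the equivalent of one disk holds parity,
# so only (disks - 1) disks carry data. raid5_best_case is a hypothetical name.
raid5_best_case() {
    single_mbps=$1
    disks=$2
    echo $(( single_mbps * (disks - 1) ))
}
raid5_best_case 140 4   # prints 420
```

Real-world results will stay below this figure; it is only useful as an upper bound to compare your measurements against.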

I also knew that I would store quite a lot of small files, so I tested that scenario as well: I packed a lot of small files into a single ZIP file, read it once with "cat myzip.zip > /dev/null" to get it into the filesystem cache (so that the HDD would not have to read the ZIP while writing its contents to disk, which would invalidate the results), and then unpacked and sync'ed it, similar to the above.


RAID test preparation

You don't have to use the final partition (and therefore RAID) sizes for your tests.
Create and use small partitions (e.g. 50GB) that can be sync'ed quickly => this saves you a lot of time waiting for the RAID to be synchronized (monitor the progress using "cat /proc/mdstat").

Once the RAID is synchronized, ensure that the filesystem on it is fully formatted (see above about the "lazy" formatting options for ext4).
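
As a rough sketch of the preparation for one test round, the following helper only prints the commands instead of running them, so that you can review them first. The function name "print_raid5_setup", the md device, the partition names and the chunk size are all placeholders; "mdadm --chunk" takes the size in KB and "mdadm --wait" blocks until the resync has finished.

```shell
#!/bin/sh
# Sketch: print (not run) the commands for one small test array.
# Device names, md device and chunk size are placeholders.
print_raid5_setup() {
    chunk_kb=$1; shift        # chunk size in KB, as mdadm --chunk expects
    md=$1; shift              # e.g. /dev/md0; the remaining args are partitions
    echo "mdadm --create $md --level=5 --raid-devices=$# --chunk=$chunk_kb $*"
    echo "mdadm --wait $md"
    echo "mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 $md"
}
print_raid5_setup 64 /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
```

Printing instead of executing is a deliberate choice here: all three commands destroy data on the given partitions, so it is worth checking the device names before pasting them into a root shell.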

Chunk

When you create the RAID you can select the so-called "chunk size". If you don't, a default value is selected, which you can check by issuing "cat /proc/mdstat". The following example shows a chunk size of 512k:
==============
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde1[2] sdd1[0] sdc1[4] sdb1[1]
      11603447232 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>
==============

512k was what my system selected when I did not specify a chunk size.
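
When you test many chunk sizes it helps to read the active value from a script rather than by eye. A minimal sketch, assuming the mdstat layout shown above ("md_chunk" is a made-up helper name; the optional second argument lets you point it at a saved copy of the output instead of /proc/mdstat):

```shell
# Sketch: print the chunk size of one md device from mdstat-formatted text.
# md_chunk is a hypothetical helper; the file defaults to /proc/mdstat.
md_chunk() {
    dev=$1
    file=${2:-/proc/mdstat}
    awk -v dev="$dev" '
        $1 == dev && $2 == ":" { found = 1; next }   # header line of the device
        found && /chunk/ {
            # the size is the word right before "chunk"
            for (i = 2; i <= NF; i++)
                if ($i ~ /^chunk/) { print $(i - 1); exit }
        }
        NF == 0 { found = 0 }   # an empty line ends the device entry
    ' "$file"
}
```

For the output shown above, "md_chunk md0" prints 512k.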

This option will have an impact on your system performance => you'll have to try out the different sizes to see which one gives you the best performance.

In my case the 512k chunk size performed worse than smaller chunk sizes, both with a few big files and with many small files, and I ended up choosing a chunk size of 64k. This is a good example of why you should test the settings yourself, as I had read in some guides & posts that a bigger chunk size was optimal for small files.

Filesystem options

1)
Decide which filesystem you want to use.

2)
Decide which block size you want to use in your filesystem.

3)
Then check which options you have available for your filesystem in relation to the RAID and compute their optimal values.
For example the ext-filesystems use the options "stride" and "stripe-width" to align the filesystem to the characteristics of the RAID.
In this case you can use the following PHP-script to compute the two values:


#!/usr/bin/php
<?php
//Computes ext-filesystem parameter values to format a RAID5.
 
//---CHANGE ME - START---
//Put here what "cat /proc/mdstat" shows for "chunk" and convert it into bytes
//$iRaidChunkBytes = 524288; //512KB
//$iRaidChunkBytes = 262144; //256KB
//$iRaidChunkBytes = 131072; //128KB
//$iRaidChunkBytes = 65536; //64KB
$iRaidChunkBytes = 32768; //32KB
 
//How big will a filesystem block be? (in bytes)
$iFsBlocksizeBytes = 4096;
 
//How many disks in total do you have in your raid5 (incl. parity)?
$iTotalNbrDisks = 4;
//---CHANGE ME - END---
 
$iResultStride = 0;
$iResultStripeWidth = 0;
 
$iResultStride = $iRaidChunkBytes / $iFsBlocksizeBytes;
$iResultStripeWidth = $iResultStride * ($iTotalNbrDisks-1);
 
 
echo "Use: \"-E stride=" . $iResultStride . ",stripe-width=" . $iResultStripeWidth . "\"\n";
?>

I used ext4 and every time it automatically detected the RAID and used the optimal values without me having to specify anything.

4)
Decide which mount options you want to use. In my case I mount my ext4 filesystem with "-o noatime,async,barrier=1,nodiscard,journal_async_commit,nodelalloc,data=writeback", which in my opinion gives a good tradeoff between performance and reliability.
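
To confirm which stride and stripe-width values from step 3 actually ended up in the filesystem (whether you passed them yourself or ext4 detected them), you can inspect the superblock. The sketch below filters the output of "dumpe2fs -h"; it assumes that dumpe2fs labels the two fields "RAID stride" and "RAID stripe width", and "show_raid_geometry" is a made-up helper name.

```shell
# Sketch: extract the RAID geometry from "dumpe2fs -h" output read on stdin.
# show_raid_geometry is a hypothetical helper name; the field labels are an
# assumption about the dumpe2fs output format.
show_raid_geometry() {
    awk -F: '
        /^RAID stride/       { gsub(/[ \t]/, "", $2); stride = $2 }
        /^RAID stripe width/ { gsub(/[ \t]/, "", $2); width = $2 }
        END { print "stride=" stride ",stripe-width=" width }
    '
}
# Typical call (needs root): dumpe2fs -h /dev/md0 2>/dev/null | show_raid_geometry
```

For a 64KB chunk, 4KB filesystem blocks and a 4-disk RAID5, the PHP script above computes stride=16 and stripe-width=48, so those are the values to expect here.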


Testing

If you also want to experiment with the filesystem block size (I didn't and kept ext4's default of 4KB), you'll have to repeat the tests for all combinations of RAID chunk size and filesystem block size.

In my case I tested RAID chunk sizes of 512/256/128/64/32/8 and 4KB and ended up seeing:

  • when writing many small files, the 64KB size gave a 13% performance improvement compared to 512KB, 6% compared to 256KB and 3% compared to 128KB, with no further improvement below 64KB.
  • when writing big files, the 64KB size gave a 23% performance improvement compared to 512KB, and so on down to 4KB, which wrote fastest but had a 20% performance penalty compared to 64KB when reading.

After every set of tests, unmount your filesystem, stop the RAID with "mdadm --stop /dev/md0" (or whichever md-device number you're using) and remove all traces of the RAID by issuing "mdadm --zero-superblock /dev/sd[hddID][partitionID]" for every HDD partition that was part of the RAID.
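
The cleanup between two rounds can be sketched the same way as a helper that only prints the commands for review. The name "print_raid_teardown", the mount point and the device names are placeholders:

```shell
# Sketch: print (not run) the teardown commands between two test rounds.
# Mount point and device names are placeholders.
print_raid_teardown() {
    mnt=$1; shift
    md=$1; shift              # the remaining args are the member partitions
    echo "umount $mnt"
    echo "mdadm --stop $md"
    for part in "$@"; do
        echo "mdadm --zero-superblock $part"
    done
}
print_raid_teardown /mnt/test /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
```

After this the partitions are ready to be assembled into a new array with the next chunk size.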


Results

In my case I ended up choosing for my RAID5 with 4 HDDs a chunk size of 64KB and for the time being I'm happy with it.

Big files are now written at 128MB/s on my small NAS and at 370MB/s on my big NAS (with a 512KB chunk size it used to be ~240MB/s).
The test set of 154'212 small files (avg size 5.6KB) and 25'595 directories is written within 32 seconds and deleted within 7 seconds. When I sync the two NAS using rsync, the speedup is ~2x compared to the old setup.