Redundancy with ZFS

05 Nov, 2024 (updated: 05 Nov, 2024)
2061 words | 10 min to read | 6 hr, 11 min to write

In the last part of our self-hosting journey, we set up the basics of our home server with Ansible-NAS. Now, it’s time to dig deeper into redundancy and reliability, the core pillars of any storage solution that aims to protect data long-term. This article will focus on using ZFS for creating a redundant and resilient storage setup.

What is ZFS?

ZFS (Zettabyte File System) is popular in home labs and enterprise solutions alike, and for good reason. ZFS brings a unique combination of redundancy, performance, and data integrity checks, making it ideal for home server environments where data protection is crucial. With ZFS, you can set up various RAID configurations (mirroring and RAID-Z) and use snapshots to back up and restore your data efficiently. Before we dive into setting up ZFS, let’s understand what RAID is.

What is RAID?

RAID (Redundant Array of Independent Disks) is a technology that combines multiple hard drives into a single logical unit to improve redundancy, performance, or both. There are several RAID levels, each with its own advantages and disadvantages. The most common RAID levels are:

  • RAID 0: Striping without Redundancy
  • RAID 1: Mirroring
  • RAID 5: Striping with Parity
  • RAID 6: Striping with Double Parity
  • RAID 10: Mirroring and Striping

ZFS actually adds one more level, RAID-Z3, which is sometimes described as RAID 7. It’s similar to RAID 6 but with triple parity.

I will not go into the details of each RAID level, especially because there is a perfectly good wiki article about it.

But I’ll list some properties of each RAID level for you to understand the differences:

RAID Level | ZFS name       | Minimum Disks | Fault Tolerance | Read Performance | Write Performance | Restore Performance
RAID 0     | Striped        | 2             | 0               | High             | High              | n/a
RAID 1     | Mirror         | 2             | 1               | High             | Low               | High
RAID 5     | RAID-Z1        | 3             | 1               | High             | Medium            | Low
RAID 6     | RAID-Z2        | 4             | 2               | High             | Low               | Low
RAID 7     | RAID-Z3        | 5             | 3               | High             | Low               | Low
RAID 10    | Striped mirror | 4             | 1               | High             | High              | High

Fault tolerance means how many disks can fail without losing data. Read, write, and restore performance are self-explanatory.

The obvious choices for a NAS are RAID 1 and RAID 6. RAID 1 is simple and provides good read performance, while RAID 6 provides more fault tolerance (RAID 5 seems to be simply inferior to it) and, once you have enough disks, better storage efficiency. But RAID 6 is more complex and requires more disks, and more disks consume more energy and generate more heat and noise. On top of that, if a disk fails in a mirror, you can just replace it and rebuild the mirror by copying all the blocks from the healthy disk. In RAID 6 the parity has to be recalculated and written to the new disk, which can take a long time, and you’d better keep your fingers crossed that no other disk fails during that window.

Yes, RAID 6 does give you better fault tolerance, but I’ll be doing backups anyway, so I don’t need that much of it. I prefer the simplicity of RAID 1 and the convenience of buying fewer, larger disks, which are usually cheaper per terabyte.
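To make the comparison concrete, here is a rough sketch of what creating each layout looks like (the pool name tank and the sdX device names are placeholders, not my actual setup):

# two-disk mirror: either disk can die, rebuild is a straight copy
zpool create tank mirror /dev/sdb /dev/sdc

# four-disk RAID-Z2: any two disks can die, but rebuilds involve parity calculation
zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde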

How to ZFS

What I like about ZFS is its simplicity for the end user. ZFS provides two utilities, zpool and zfs: one for managing pools, the other for managing datasets.

zpool

Creating a ZFS pool is super easy; all it takes is the bare minimum of information: the name of the pool (which by default also acts as the mount point), an optional RAID level, and a list of devices. I created my mirror (RAID 1) with the following command:

zpool create rust mirror \
/dev/disk/by-id/wwn-0x5000c500db125622 \
/dev/disk/by-id/wwn-0x5000c500db13fa95

The reason I listed my devices by ID is that these names stay stable no matter which SATA port or cable each disk is plugged into. But I could totally do zpool create rust mirror sda sdb instead.
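If you’re not sure which by-id name corresponds to which physical disk, something like this helps (output omitted; the wwn-* and ata-* symlinks point at the same disks as the sdX names):

ls -l /dev/disk/by-id/
lsblk -o NAME,SIZE,SERIAL,WWN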

ZFS mounted my pool at /rust. I could have provided a different mount point using the -m flag. To get more information on how to work with zpool, just run man zpool; the man page documentation is really good.
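For example, a hypothetical variant of the command above with a custom mount point (not what I actually ran) would look like this:

zpool create -m /mnt/storage rust mirror \
/dev/disk/by-id/wwn-0x5000c500db125622 \
/dev/disk/by-id/wwn-0x5000c500db13fa95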

Now it’s time to create some datasets.

zfs datasets

As soon as the pool is created and (auto)mounted, it’s perfectly usable by itself. I could create all the directories under /rust and be done with it. But not all directories should have the same properties: some should be encrypted, some will store large files and benefit from larger block sizes, some will hold loads of frequently changed files and might benefit from disabled atime, and so on. And this is what datasets are for.
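Creating a dataset is a single command, and it shows up as a regular directory under the pool’s mount point (rust/exchange here is just one of the datasets from my layout below):

zfs create rust/exchange
ls /rust   # exchange is now there, as a separate dataset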

Some of the most useful properties you can set on the zfs dataset are:

  • compression
  • recordsize
  • quota and reservation
  • atime
  • sync
  • copies
  • encryption

The zfsprops(7) man page will tell you much more about all the different options and how they work than I can in a single article. If you are curious how I configured my datasets, here it is, all my datasets and options:

# zfs get all -s local
NAME            PROPERTY     VALUE   SOURCE
rust            compression  lz4     local
rust            xattr        sa      local
rust            dedup        off     local
rust            acltype      posix   local
rust            relatime     on      local
rust/db/data    recordsize   16K     local
rust/documents  copies       2       local
rust/documents  keylocation  prompt  local
rust/downloads  recordsize   4K      local
rust/encrypted  keylocation  prompt  local
rust/media      recordsize   1M      local

Keep in mind that nested datasets inherit from their parent, e.g. I have rust/docker, which simply inherits all of rust’s properties:

# zfs get all rust/docker -s local,inherited
NAME         PROPERTY     VALUE  SOURCE
rust/docker  compression  lz4    inherited from rust
rust/docker  xattr        sa     inherited from rust
rust/docker  dedup        off    inherited from rust
rust/docker  acltype      posix  inherited from rust
rust/docker  relatime     on     inherited from rust
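The flip side of inheritance: if you’ve set a property locally and later want to fall back to the inherited (or default) value, zfs inherit clears the local setting. Purely as an illustration (I’m keeping my 4K setting on rust/downloads):

zfs inherit recordsize rust/downloads   # would drop the local 4K value and fall back
zfs inherit -r atime rust               # -r applies the reset recursively down the tree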

Typically, you want encryption=on on all datasets that store sensitive data, atime=off on all datasets unless you actually need access times recorded, and recordsize=1M (or larger) on datasets that mostly store large files (e.g. movies). And you definitely want compression: it comes with so little overhead that there is virtually no reason not to enable it at the top level (in my case, rust).
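One caveat worth knowing: encryption can only be enabled when a dataset is created, not later with zfs set. A rough sketch of how datasets like my rust/encrypted and rust/documents could be created (the exact flags are my reconstruction, not a copy of what I ran; with keyformat=passphrase the keylocation defaults to prompt):

zfs create -o encryption=on -o keyformat=passphrase rust/encrypted
# copies=2 keeps two copies of every block, on top of the mirror redundancy
zfs create -o encryption=on -o keyformat=passphrase -o copies=2 rust/documents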

If you came across dedup and are thinking it might be a great option to turn on, think twice. Deduplication comes with tremendous overhead and is generally not advised.

Definitely read “OpenZFS deduplication is good now and you shouldn’t use it” first!

Working with datasets’ options

You can specify options using the -o flag when creating a dataset, e.g.:

zfs create -o recordsize=1M rust/media

or using zfs set:

zfs set recordsize=16K rust/db

To read the options of a specific dataset (or all datasets), use zfs get <property> <dataset>. The command supports a few other flags; feel free to explore them with man zfs-get. My personal most-used commands are:

zfs get all
zfs get all -s local
zfs get all <dataset>
zfs get all -s local <dataset>

And finally another useful command you should know about is listing all datasets. It’s as easy as:

zfs list

Once again, man zfs-list has all the information you need to know about using the command.
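A couple of variations I find handy (the column names go to zfs list’s -o flag, and -r lists children recursively):

zfs list -r rust
zfs list -o name,used,avail,mountpoint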

Home server file structure

I ended up using the following file structure:

# tree /rust -L 1
/rust
├── db
├── docker
├── documents
├── downloads
├── encrypted
├── exchange
└── media

This should be fairly self-explanatory. All the Docker-mounted volumes are in rust/docker; the only exception is the databases’ data and log volumes.

# zfs get all rust/db/data -s local
NAME          PROPERTY    VALUE  SOURCE
rust/db/data  recordsize  16K    local

The 16K recordsize exactly matches MySQL’s page size, and I mainly use MySQL. I also have a couple of PostgreSQL instances storing data in the same dataset; PostgreSQL’s page size is 8K, so a 16K recordsize is sub-optimal for it, but I was too lazy to create a dataset for Postgres alone. Another option was to set recordsize to 8K and make MySQL use 8K pages via the innodb_page_size option, as per the MySQL documentation:

Each tablespace consists of database pages. Every tablespace in a MySQL instance has the same page size. By default, all tablespaces have a page size of 16KB; you can reduce the page size to 8KB or 4KB by specifying the innodb_page_size option when you create the MySQL instance. You can also increase the page size to 32KB or 64KB. For more information, refer to the innodb_page_size documentation.

But I was too lazy even for that approach xD
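For completeness, the approach I skipped would have looked roughly like this: an 8K recordsize on the ZFS side, and a matching page size chosen when the MySQL instance is first initialised (it can’t be changed afterwards):

zfs set recordsize=8K rust/db/data
# and in my.cnf, before initialising the data directory:
#   [mysqld]
#   innodb_page_size=8k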

scrub

One of the coolest features of ZFS is its ability to find and fix data corruption automagically, thanks to the scrub function. A scrub is a thorough check of all data in the ZFS pool, verifying that each block is consistent with its checksum. If corruption is detected, ZFS can attempt to repair it using redundant copies in the mirror or parity data in RAID-Z configurations.

To start manual scrub on a zpool, use the following command:

zpool scrub rust # or whatever your pool name is

This initiates a scrub of the entire rust pool, scanning every block for errors. For larger pools, this process may take some time, but ZFS is designed to run scrubs in the background with minimal performance impact on other tasks.

Automating scrubs

It’s a good idea to automate scrubs so you don’t forget to run them regularly. Depending on your server’s setup, you can add a cron job or systemd timer to automatically initiate a scrub, for example, every month. Here’s a simple cron entry to run a monthly scrub on the first day of each month at 3:00 AM:

0 3 1 * * /sbin/zpool scrub rust

This cron job helps keep your data safe with regular, automated checks, so any data issues are quickly caught and corrected.
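If you prefer systemd timers over cron, a minimal sketch looks like this (the zfs-scrub unit names are my own, not something ZFS ships; some distros also provide their own scrub timer or cron entry, so check before adding another):

cat <<'EOF' | sudo tee /etc/systemd/system/zfs-scrub.service
[Unit]
Description=Scrub the rust zpool

[Service]
Type=oneshot
ExecStart=/sbin/zpool scrub rust
EOF

cat <<'EOF' | sudo tee /etc/systemd/system/zfs-scrub.timer
[Unit]
Description=Monthly scrub of the rust zpool

[Timer]
OnCalendar=monthly
Persistent=true

[Install]
WantedBy=timers.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now zfs-scrub.timer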

Monitoring scrub status

To check the status of a scrub, use:

zpool status rust

The output is plain English, so it’s easy to read:

pool: rust
state: ONLINE
scan: scrub in progress since Tue Nov 5 00:51:58 2024
511G scanned at 10.7G/s, 6.28G issued at 134M/s, 2.24T total
0B repaired, 0.27% done, 04:51:54 to go

Once the scrub is complete, the output will show if any data errors were detected and whether they were successfully repaired.
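For a quick health check across all pools, the -x flag prints just a single line when everything is fine, which makes it handy in scripts or login messages:

zpool status -x   # prints "all pools are healthy" when there is nothing to worry about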

Wrapping up

With your ZFS pool set up, datasets configured, and a regular scrub schedule in place, you’ve created a robust foundation for your home server. ZFS’s combination of redundancy, compression, and data integrity checks makes it a powerful tool for protecting your data long-term.

There’s a lot I didn’t cover about ZFS in this article, but what I did cover should give you a solid starting point for setting up a reliable storage solution. The essentials (pool creation, dataset configuration, and regular scrubbing) are the backbone of any ZFS setup, providing data protection and consistency with minimal maintenance.

Just to give you a hint on what’s not covered (and a good refresher for myself), here is a list:

  • Performance optimisation. We briefly touched on fine-tuning recordsize and compression, but one thing I haven’t mentioned is using an NVMe SSD for caching (which I haven’t implemented myself yet, but am looking forward to).
  • Snapshots and Clones. These let you create point-in-time copies of datasets. Snapshots are invaluable for rolling back accidental changes and for creating space-efficient clones for testing.
  • Replication. ZFS replication sends datasets to a remote server for an extra layer of backup and disaster recovery. I’ll write a separate article about this once I get around to backing up my valuable datasets off-site.
  • Encryption. I briefly touched on encryption in this article, but there’s more to explore.

To explore ZFS further, I’d recommend this article on the Arch Linux wiki.

That’s it for ZFS. In the next article, I’ll show what software I run on my home server.

This article is part of the Let's self-host! series:
  1. Building a NAS / Home Server
  2. Choosing Home Server OS and setting up ansible-nas
  3. Redundancy with ZFS