ZFS


ZFS is an advanced file system created by Sun Microsystems (now acquired by Oracle Corporation) and released in November 2005 as part of OpenSolaris.

Features of ZFS include: pooled storage (an integrated volume management system called a "zpool"), copy-on-write snapshots, data integrity verification and automatic repair (scrubbing), RAID-Z, a maximum 16 exabyte file size, a maximum 256×10¹⁵ zettabyte storage size, and no limit on the number of file systems (datasets) or files [1]. ZFS is licensed under the Common Development and Distribution License (CDDL).

Described as "the last word in filesystems", ZFS is stable, fast, secure, and future-proof. Because it is licensed under the CDDL, which is incompatible with the GPL, ZFS cannot be distributed along with the Linux kernel. This requirement, however, does not prevent a third party from developing and distributing a native Linux kernel module, as is the case with OpenZFS (previously known as ZFS on Linux (ZOL)).

ZOL is a project funded by the Lawrence Livermore National Laboratory to develop a native Linux kernel module for its massive storage requirements and supercomputers.

Note:

Due to the potential legal incompatibility between the CDDL license of the ZFS code and the GPL of the Linux kernel ([2], CDDL-GPL, ZFS in Linux), ZFS development is not supported by the kernel.

As a result:

  • The ZFSonLinux project must keep up with Linux kernel versions. After ZFSonLinux releases a stable version, it is released by the Arch ZFS maintainers.
  • This situation sometimes blocks the normal rolling update process because of unsatisfied dependencies, when the new kernel version being updated to is not yet supported by ZFSonLinux.

Installation

General

Warning: Unless you use the DKMS versions of these packages, the ZFS and SPL kernel modules are tied to a specific kernel version. It is not possible to apply kernel updates until updated packages are uploaded to the AUR or the archzfs repository.
Tip: You can downgrade your Linux version to the one in the archzfs repository if your current kernel is newer.

Install from the archzfs repository or the Arch User Repository:

These branches depend on the zfs-utils package.

Test the installation by issuing zpool status on the command line. If an "insmod" error is produced, try depmod -a.

Root on ZFS

See Install Arch Linux on ZFS.

DKMS

Users can use DKMS to rebuild the ZFS modules automatically on every kernel upgrade.

Note: When installing dkms, see Dynamic Kernel Module Support.

Install zfs-dkmsAUR or zfs-dkms-gitAUR.

Tip: Add an IgnorePkg entry to pacman.conf to prevent these packages from upgrading when doing a regular update.
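A minimal sketch of such an entry (the package names shown are examples used elsewhere on this page; adjust them to the ZFS packages you actually installed):

/etc/pacman.conf
[options]
IgnorePkg = zfs-dkms zfs-utils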

Experimenting with ZFS

Users wishing to experiment with ZFS on virtual block devices (known in ZFS terms as VDEVs), which can be simple files like ~/zfs0.img, ~/zfs1.img, ~/zfs2.img and so on, with no possibility of real data loss, may find the Experimenting with ZFS article useful. It covers common tasks such as building a RAIDZ array, deliberately corrupting data and recovering it, and snapshotting datasets.

Configuration

ZFS is considered a "zero administration" file system by its creators; therefore, configuring ZFS is very easy. Configuration is done primarily with two commands: zfs and zpool.

Automatic start

For ZFS to live up to its "zero administration" name, you must enable zfs-import-cache.service to import the pools and zfs-mount.service to mount the file systems available in the pools. A benefit of this is that it is not necessary to mount ZFS file systems in /etc/fstab, since zfs-import-cache.service automatically imports the pools based on the /etc/zfs/zpool.cache file.

Run the following command for each pool you want automatically imported by zfs-import-cache.service:

# zpool set cachefile=/etc/zfs/zpool.cache <pool>
Note: Since version 0.6.5.8 of OpenZFS the ZFS service unit files have changed, and you must explicitly enable any ZFS services you want to run. See ArchZFS issue 72 for more information.

Enable the relevant service (zfs-import-cache.service) and targets (zfs.target and zfs-import.target) so the pools are automatically imported at boot time:
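For example (a sketch enabling exactly the units named above):

# systemctl enable zfs-import-cache.service zfs-import.target zfs.target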

To mount the ZFS file systems, you have two choices:

Using the zfs-mount.service

In order to mount ZFS file systems automatically on boot, you need to enable zfs-mount.service and zfs.target.
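For example (a sketch, mirroring the units named above):

# systemctl enable zfs-mount.service zfs.target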


Using zfs-mount-generator

You can also use zfs-mount-generator to create systemd mount units for your ZFS file systems at boot. systemd will automatically mount the file systems based on the mount units without using zfs-mount.service. To do that, you need to:

  1. Create the /etc/zfs/zfs-list.cache directory.
  2. Enable the required ZFS Event Daemon (ZED) script (called a ZEDLET) to create a list of mountable ZFS file systems. (This link is created automatically if you are using OpenZFS >= 2.0.0.)
    # ln -s /usr/lib/zfs/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d
  3. Enable zfs.target and start/enable the ZFS Event Daemon (zfs-zed.service). This service is responsible for running the script mentioned in the previous step.
  4. Create an empty file named after your pool in the /etc/zfs/zfs-list.cache directory. The ZEDLET only updates the list of file systems if the file for the pool already exists.
    # touch /etc/zfs/zfs-list.cache/<pool-name>
  5. Check the contents of /etc/zfs/zfs-list.cache/<pool-name>. If the file is empty, make sure that zfs-zed.service is running and change the canmount property of one of your ZFS file systems by running:
    zfs set canmount=off zroot/fs1
    Changing this property causes ZFS to raise an event which is captured by ZED, which in turn runs the ZEDLET to update the file in /etc/zfs/zfs-list.cache. If the file in /etc/zfs/zfs-list.cache has been updated, you can change the canmount property of the file system back by running:
    zfs set canmount=on zroot/fs1

You need to add a file for each ZFS pool on your system into the /etc/zfs/zfs-list.cache directory. Make sure the pools are imported by enabling zfs-import-cache.service and zfs-import.target as described above.
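A condensed sketch of the steps above, assuming a single pool named zroot:

# mkdir /etc/zfs/zfs-list.cache
# ln -s /usr/lib/zfs/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d
# systemctl enable zfs.target
# systemctl enable --now zfs-zed.service
# touch /etc/zfs/zfs-list.cache/zroot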

Storage pools

It is not necessary to partition the drives before creating the ZFS file system. It is recommended to point ZFS at an entire disk (e.g. /dev/sdx rather than a single partition like /dev/sdx1), which will automatically create a GPT (GUID Partition Table) and add an 8 MB reserved partition at the start of the disk for legacy bootloaders. However, you can specify a partition or a file within an existing file system if you wish to create multiple volumes with different redundancy properties.

Note: If any of the drives in the pool were previously part of a software RAID set, you should first clean up any old RAID configuration information.
Warning: For Advanced Format disks with a 4 KB sector size, an ashift value of 12 is recommended for best performance. For compatibility with legacy systems, Advanced Format disks emulate a 512-byte sector size, which causes ZFS to sometimes pick a non-ideal ashift value. Once the pool has been created, the only way to change the ashift option is to recreate the pool. Note that an ashift of 12 also slightly reduces usable capacity. See the OpenZFS FAQ: Performance Considerations, Advanced Format Disks, and ZFS and Advanced Format disks.

Identifying disks

OpenZFS recommends using device IDs when creating ZFS storage pools with fewer than 10 devices [3]. Use Persistent block device naming#by-id and by-path to determine the list of drives to be used for the ZFS pool.

The disk IDs should look similar to the following:

$ ls -lh /dev/disk/by-id/
lrwxrwxrwx 1 root root  9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0JKRR -> ../../sdc
lrwxrwxrwx 1 root root  9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0JTM1 -> ../../sde
lrwxrwxrwx 1 root root  9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0KBP8 -> ../../sdd
lrwxrwxrwx 1 root root  9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0KDGY -> ../../sdb
Warning: If you create zpools using device names (e.g. /dev/sda, /dev/sdb, ...), ZFS might intermittently fail to detect the zpools at boot.

Using GPT labels

Disk labels and UUIDs can also be used for ZFS mounts by using GPT partitions. ZFS drives have labels, but Linux is unable to read them at boot. Unlike MBR partitions, GPT partitions directly support both UUIDs and labels, independently of the format inside the partition. Partitioning, rather than using the whole disk for ZFS, offers two additional advantages: the OS does not generate bogus partition numbers from whatever unpredictable data ZFS has written to the partition sectors, and, if desired, you can easily over-provision SSDs (and slightly over-provision HDDs) to ensure that the zpool can accept a replacement model with a slightly different sector count into your mirror. This way, existing tools and techniques can be used to configure and control ZFS at no extra cost.

Use gdisk to partition all or part of the drive as a single partition. gdisk does not automatically name partitions, so if partition labels are desired, use the gdisk command "c" to label the partitions. Some reasons you might prefer labels over UUIDs are: labels are easy to control, labels make the purpose of each disk apparent at a glance, and labels are shorter and easier to type. These are all advantages when a server goes down and the pressure is on. GPT partition labels have plenty of space and can store most international characters (zhwp:GUID_Partition_Table#Partition_entries), allowing large data pools to be labeled in an organized fashion.

Drives partitioned with GPT have labels and UUIDs that look like the following:

$ ls -l /dev/disk/by-partlabel
lrwxrwxrwx 1 root root 10 Apr 30 01:44 zfsdata1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Apr 30 01:44 zfsdata2 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Apr 30 01:59 zfsl2arc -> ../../sda1
$ ls -l /dev/disk/by-partuuid
lrwxrwxrwx 1 root root 10 Apr 30 01:44 148c462c-7819-431a-9aba-5bf42bb5a34e -> ../../sdd1
lrwxrwxrwx 1 root root 10 Apr 30 01:59 4f95da30-b2fb-412b-9090-fc349993df56 -> ../../sda1
lrwxrwxrwx 1 root root 10 Apr 30 01:44 e5ccef58-5adf-4094-81a7-3bac846a885f -> ../../sdc1
Tip: To minimize typing and copy/paste errors, set a local variable for the target PARTUUID: $ UUID=$(lsblk --noheadings --output PARTUUID /dev/sdXY)

Creating ZFS pools

To create a ZFS pool, use the following command:

# zpool create -f -m <mount> <pool> [raidz(2|3)|mirror] <ids>
Tip: You may want to read about Advanced Format disks first, as it is recommended to set ashift at pool creation.
  • create: the subcommand to create the pool.
  • -m: the mount point of the pool. If it is not specified, the pool will be mounted to /<pool>.
  • pool: the name of the pool.
  • raidz(2|3)|mirror: the type of virtual device (vdev) created from the list of devices. RAID-Z is single-disk parity (similar to RAID 5), RAID-Z2 is 2-disk parity (similar to RAID 6), and RAID-Z3 is 3-disk parity. There is also mirror, which is similar to RAID 1 or RAID 10 but is not constrained to 2 devices. If no type is specified, each device is added as its own vdev, which is similar to RAID 0. After creation, a device can be attached to each single-disk vdev to turn it into a mirror, which is useful for migrating data.
  • ids: the IDs of the drives or partitions to include in the pool.

Create a pool with a single RAID-Z vdev:

# zpool create -f -m /mnt/data bigdata \
               raidz \
                  ata-ST3000DM001-9YN166_S1F0KDGY \
                  ata-ST3000DM001-9YN166_S1F0JKRR \
                  ata-ST3000DM001-9YN166_S1F0KBP8 \
                  ata-ST3000DM001-9YN166_S1F0JTM1

Create a pool with two mirror vdevs:

# zpool create -f -m /mnt/data bigdata \
               mirror \
                  ata-ST3000DM001-9YN166_S1F0KDGY \
                  ata-ST3000DM001-9YN166_S1F0JKRR \
               mirror \
                  ata-ST3000DM001-9YN166_S1F0KBP8 \
                  ata-ST3000DM001-9YN166_S1F0JTM1

Advanced Format disks

At pool creation, ashift=12 should always be used, except with SSDs that have 8K sectors, where ashift=13 is correct instead. A vdev of 512 byte disks using 4k sectors will not experience performance issues, but a 4k disk using 512 byte sectors will. Since ashift cannot be changed after pool creation, even a pool with only 512 byte disks should use 4k because those disks may need to be replaced with 4k disks or the pool may be expanded by adding a vdev composed of 4k disks. Because correct detection of 4k disks is not reliable, -o ashift=12 should always be specified during pool creation. See the OpenZFS FAQ for more details.

Tip: Use blockdev(8) (part of util-linux) to print the sector size reported by the device's ioctls: blockdev --getpbsz /dev/sdXY as the root user.

Create pool with ashift=12 and single raidz vdev:

# zpool create -f -o ashift=12 -m /mnt/data bigdata \
               raidz \
                  ata-ST3000DM001-9YN166_S1F0KDGY \
                  ata-ST3000DM001-9YN166_S1F0JKRR \
                  ata-ST3000DM001-9YN166_S1F0KBP8 \
                  ata-ST3000DM001-9YN166_S1F0JTM1

GRUB-compatible pool creation

By default, zpool create enables all features on a pool. If /boot resides on ZFS when using GRUB you must only enable features supported by GRUB otherwise GRUB will not be able to read the pool. ZFS includes compatibility files (see /usr/share/zfs/compatibility.d) to assist in creating pools at specific feature sets, of which grub2 is an option.

You can create a pool with only the compatible features enabled:

# zpool create -o compatibility=grub2 $POOL_NAME $VDEVS

Verifying pool status

If the command succeeds, there will be no output. Using the mount command will show that the pool is mounted. Using zpool status will show that the pool has been created:

# zpool status -v
  pool: bigdata
 state: ONLINE
 scan: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        bigdata                                    ONLINE       0     0     0
          -0                                       ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0KDGY-part1  ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0JKRR-part1  ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0KBP8-part1  ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0JTM1-part1  ONLINE       0     0     0

errors: No known data errors

At this point it is recommended to reboot the machine to verify that the ZFS pool is mounted at boot. It is best to deal with all errors before transferring data to the pool.

Importing a pool created by id

Eventually a pool may fail to auto mount and you need to import to bring your pool back. Take care to avoid the most obvious solution.

警告: Do not run zpool import pool! This will import your pools using /dev/sd? which will lead to problems the next time you rearrange your drives. This may be as simple as rebooting with a USB drive left in the machine.

Adapt one of the following commands to import your pool so that pool imports retain the persistence they were created with:

# zpool import -d /dev/disk/by-id bigdata
# zpool import -d /dev/disk/by-partlabel bigdata
# zpool import -d /dev/disk/by-partuuid bigdata
Note: Use the -l flag when importing a pool that contains encrypted dataset keys, e.g.:
# zpool import -l -d /dev/disk/by-id bigdata

Finally check the state of the pool:

# zpool status -v bigdata

Destroying a storage pool

ZFS makes it easy to destroy a mounted storage pool, removing all metadata about the ZFS devices.

Warning: This command destroys all data contained in the pool and/or dataset.

To destroy a pool:

# zpool destroy <pool>

To destroy a dataset:

# zfs destroy <pool>/<dataset>

And finally check the status of the pools:

# zpool status
no pools available

Exporting a storage pool

If a storage pool is to be used on another system, it will first need to be exported. It is also necessary to export a pool if it has been imported from the archiso, as the hostid is different in the archiso than in the booted system. The zpool command will refuse to import any storage pools that have not been exported. It is possible to force the import with the -f argument, but this is considered bad form.

Any attempts made to import an un-exported storage pool will result in an error stating the storage pool is in use by another system. This error can be produced at boot time abruptly abandoning the system in the busybox console and requiring an archiso to do an emergency repair by either exporting the pool, or adding the zfs_force=1 to the kernel boot parameters (which is not ideal). See #On boot the zfs pool does not mount stating: "pool may be in use from other system".

To export a pool:

# zpool export <pool>

Extending an existing zpool

A device (a partition or a disk) can be added to an existing zpool:

# zpool add <pool> <device-id>

To import a pool which consists of multiple devices:

# zpool import -d <device-id-1> -d <device-id-2> <pool>

or simply:

# zpool import -d /dev/disk-by-id/ <pool>

Attaching a device to (create) a mirror

A device (a partition or a disk) can be attached aside an existing device to be its mirror (similar to RAID 1):

# zpool attach <pool> <device-id|mirror> <new-device-id>

You can attach the new device to an already existing mirror vdev (e.g. to upgrade from a 2-device to a 3-device mirror) or attach it to a single device to create a new mirror vdev.

Renaming a zpool

Renaming a zpool that is already created is accomplished in 2 steps:

# zpool export oldname
# zpool import oldname newname

Setting a different mount point

The mount point for a given zpool can be moved at will with one command:

# zfs set mountpoint=/foo/bar poolname

Upgrading zpools

When using a newer version of the zfs module, zpools may display an upgrade notice:

$ zpool status -v
pool: bigdata
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features. See zpool-features(5) for details.
Note:
  • Zpools of a newer version cannot be imported by an older version of the zfs module.
  • When important data is involved, it is recommended to create a backup before running zpool upgrade.

To upgrade the zpool named bigdata:

# zpool upgrade bigdata

To upgrade all zpools:

# zpool upgrade -a

Creating datasets

Rather than creating directories inside a pool, users can create datasets within the pool. In addition to snapshots, datasets offer greater control, such as quotas. Before a dataset can be created and mounted, a directory with the same name must not already exist in the pool. To create a dataset, use:

# zfs create <pool>/<dataset>

It is then possible to apply ZFS-specific attributes to the dataset. For example, you could assign a quota limit to a specific directory within a dataset:

# zfs set quota=20G <pool>/<dataset>/<directory>

For more ZFS commands, consult zfs(8) and zpool(8).

Native encryption

ZFS offers the following supported encryption options: aes-128-ccm, aes-192-ccm, aes-256-ccm, aes-128-gcm, aes-192-gcm and aes-256-gcm. When encryption is set to on, aes-256-gcm will be used. See zfs-change-key(8) for a description of the native encryption, including its limitations.

The following key formats are supported: passphrase, raw, hex.

One can also specify/increase the default iterations of PBKDF2 when using passphrase with -o pbkdf2iters <n>, although it may increase the decryption time.
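For instance, a sketch combining this with the passphrase example shown further below (pool and dataset names are placeholders):

# zfs create -o encryption=on -o keyformat=passphrase -o pbkdf2iters=1000000 <pool>/<dataset>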

Note:
  • To import a pool with keys, one needs to specify the -l flag, without this flag encrypted datasets will be left unavailable until the keys are loaded. See #Importing a pool created by id.
  • Native ZFS encryption has been made available in the stable 0.8.0 release or newer. Previously it was only available in development versions provided by packages like zfs-linux-gitAUR, zfs-dkms-gitAUR or other development builds. Users who were only using the development versions for the native encryption, may now switch to the stable releases if they wish.
  • The default encryption suite was changed from aes-256-ccm to aes-256-gcm in the 0.8.4 release.

To create a dataset encrypted with a passphrase:

# zfs create -o encryption=on -o keyformat=passphrase <pool>/<dataset>

To use a key instead of a passphrase:

# dd if=/dev/random of=/path/to/key bs=1 count=32
# zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///path/to/key <pool>/<dataset>

The easy way to make a key in human-readable form (keyformat=hex):

# od -Anone -x -N 32 -w64 /dev/random | tr -d '[:blank:]' > /path/to/hex.key

To verify the key location:

# zfs get keylocation <pool>/<dataset>

To change the key location:

# zfs set keylocation=file:///path/to/key <pool>/<dataset>

You can also load the key manually with one of the following commands:

# zfs load-key <pool>/<dataset> # load key for a specific dataset
# zfs load-key -a # load all keys
# zfs load-key -r zpool/dataset # load all keys in a dataset

To mount the encrypted dataset:

# zfs mount <pool>/<dataset>

Unlock/Mount at boot time: systemd

It is possible to automatically unlock pool datasets at boot by using a systemd unit. For example, create the following service to unlock a specific dataset:

/etc/systemd/system/zfs-load-key@.service
[Unit]
Description=Load %I encryption keys
Before=systemd-user-sessions.service zfs-mount.service
After=zfs-import.target
Requires=zfs-import.target
DefaultDependencies=no

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/bash -c 'until (systemd-ask-password "Encrypted ZFS password for %I" --no-tty | zfs load-key %I); do echo "Try again!"; done'

[Install]
WantedBy=zfs-mount.service

Then enable/start the service for each encrypted dataset (e.g. zfs-load-key@pool0-dataset0.service). Note that - is mapped to / in systemd unit names; see systemd-escape(1) for details.
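For instance, systemd-escape can show the instance name to use for a given dataset (the dataset name here is only illustrative):

$ systemd-escape "pool0/dataset0"
pool0-dataset0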

Note: The Before=systemd-user-sessions.service ensures that systemd-ask-password is invoked before the local IO devices are handed over to the desktop environment.

An alternative approach is to load all possible keys:

/etc/systemd/system/zfs-load-key.service
[Unit]
Description=Load encryption keys
DefaultDependencies=no
After=zfs-import.target
Before=zfs-mount.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/zfs load-key -a
StandardInput=tty-force

[Install]
WantedBy=zfs-mount.service

Then enable/start zfs-load-key.service.

Unlock at login time: PAM

If you are not encrypting the root volume, but only the home volume or a user-specific volume, another idea is to wait until login to decrypt it. The advantages of this method are that the system boots uninterrupted, and that when the user logs in, the same password can be used both to authenticate and to decrypt the home volume, so that the password is only entered once.

First set the mountpoint to legacy to avoid having it mounted by zfs mount -a:

zfs set mountpoint=legacy zroot/data/home

Ensure that it is in /etc/fstab so that mount /home will work:

/etc/fstab
zroot/data/home         /home           zfs             rw,xattr,posixacl,noauto        0 0

On a single-user system, with only one /home volume having the same encryption password as the user's password, it can be decrypted at login as follows: first create /sbin/mount-zfs-homedir

/sbin/mount-zfs-homedir
#!/bin/bash

# simplified from https://talldanestale.dk/2020/04/06/zfs-and-homedir-encryption/

set -eu

# Password is given to us via stdin, save it in a variable for later
PASS=$(cat -)

VOLNAME="zroot/data/home"

# Unlock and mount the volume
zfs load-key "$VOLNAME" <<< "$PASS" || exit 0  # do not block login if the key cannot be loaded
zfs mount "$VOLNAME" || true # ignore errors

do not forget chmod a+x /sbin/mount-zfs-homedir; then get PAM to run it by adding the following line to /etc/pam.d/system-auth:

/etc/pam.d/system-auth
auth       optional                    pam_exec.so          expose_authtok /sbin/mount-zfs-homedir

Now it will transparently decrypt and mount the /home volume when you log in anywhere: on the console, via ssh, etc. A caveat is that since your ~/.ssh directory is not mounted, if you log in via ssh, you must use the default password authentication the first time rather than relying on ~/.ssh/authorized_keys.

If you want to have separate volumes for each user, each encrypted with the user's password, try the linked method.

Swap volume

Warning: Using a ZVOL as swap can lead to system lockups under heavy memory pressure.

ZFS does not allow the use of swap files, but a ZFS volume (ZVOL) can be used as swap space. It is important to set the ZVOL block size to match the system page size, which can be obtained with the getconf PAGESIZE command (the default on x86_64 is 4 KiB). Disabling write caching on the ZVOL can also help the system run better in low-memory situations.

Create an 8 GiB ZFS volume:

# zfs create -V 8G -b $(getconf PAGESIZE) -o compression=zle \
              -o logbias=throughput -o sync=always\
              -o primarycache=metadata -o secondarycache=none \
              -o com.sun:auto-snapshot=false <pool>/swap

Format it as swap and enable it:

# mkswap -f /dev/zvol/<pool>/swap
# swapon /dev/zvol/<pool>/swap

To make it permanent, edit /etc/fstab. ZVOLs support discard, which can potentially help ZFS's block allocator and reduce fragmentation on all other datasets when/if swap is not full.

Add a line to /etc/fstab:

/dev/zvol/<pool>/swap none swap discard 0 0

Access Control Lists

To use ACL on a dataset:

# zfs set acltype=posixacl <nameofzpool>/<nameofdataset>
# zfs set xattr=sa <nameofzpool>/<nameofdataset>

Setting xattr is recommended for performance reasons [4].

It may be preferable to enable ACL on the zpool as datasets will inherit the ACL parameters. Setting aclinherit=passthrough may be wanted as the default mode is restricted [5]; however, it is worth noting that aclinherit does not affect POSIX ACLs [6]:

# zfs set aclinherit=passthrough <nameofzpool>
# zfs set acltype=posixacl <nameofzpool>
# zfs set xattr=sa <nameofzpool>

Databases

ZFS, unlike most other file systems, has a variable record size, or what is commonly referred to as a block size. By default, the recordsize on ZFS is 128KiB, which means it will dynamically allocate blocks of any size from 512B to 128KiB depending on the size of the file being written. This can often help fragmentation and file access, at the cost that ZFS would have to allocate new 128KiB blocks each time only a few bytes are written.

Note: At least MariaDB uses a default page size of 16 KiB; check the defaults of your specific DBMS before setting this value.


Most RDBMSes work in 8KiB-sized blocks by default. Although the block size is tunable for MySQL/MariaDB, PostgreSQL, and Oracle database, all three of them use an 8KiB block size by default. For both performance concerns and keeping snapshot differences to a minimum (for backup purposes, this is helpful), it is usually desirable to tune ZFS instead to accommodate the databases, using a command such as:

# zfs set recordsize=8K <pool>/postgres

These RDBMSes also tend to implement their own caching algorithm, often similar to ZFS's own ARC. In the interest of saving memory, it is best to simply disable ZFS's caching of the database's file data and let the database do its own job:

Note: L2ARC requires primarycache to function, because it is fed with data evicted from primarycache. If you intend to use the L2ARC, do not set the option below, otherwise no actual data will be cached in L2ARC.
# zfs set primarycache=metadata <pool>/postgres

ZFS uses the ZIL for crash recovery, but databases are often syncing their data files to the file system on their own transaction commits anyway. The end result of this is that ZFS will be committing data twice to the data disks, and it can severely impact performance. You can tell ZFS to prefer to not use the ZIL, and in which case, data is only committed to the file system once. However, doing so on non-solid state storage (e.g. HDDs) can result in decreased read performance due to fragmentation (OpenZFS Wiki) -- with mechanical hard drives, please consider using a dedicated SSD as ZIL rather than setting the option below. In addition, setting this for non-database file systems, or for pools with configured log devices, can also negatively impact the performance, so beware:

# zfs set logbias=throughput <pool>/postgres

These can also be done at file system creation time, for example:

# zfs create -o recordsize=8K \
             -o primarycache=metadata \
             -o mountpoint=/var/lib/postgres \
             -o logbias=throughput \
              <pool>/postgres

Please note: these kinds of tuning parameters are ideal for specialized applications like RDBMSes. You can easily hurt ZFS's performance by setting these on a general-purpose file system such as your /home directory.

/tmp

If you would like to use ZFS to store your /tmp directory, which may be useful for storing arbitrarily-large sets of files or simply keeping your RAM free of idle data, you can generally improve performance of certain applications writing to /tmp by disabling file system sync. This causes ZFS to ignore an application's sync requests (eg, with fsync or O_SYNC) and return immediately. While this has severe application-side data consistency consequences (never disable sync for a database!), files in /tmp are less likely to be important and affected. Please note this does not affect the integrity of ZFS itself, only the possibility that data an application expects on-disk may not have actually been written out following a crash.

# zfs set sync=disabled <pool>/tmp

Additionally, for security purposes, you may want to disable setuid and devices on the /tmp file system, which prevents some kinds of privilege-escalation attacks or the use of device nodes:

# zfs set setuid=off <pool>/tmp
# zfs set devices=off <pool>/tmp

Combining all of these for a create command would be as follows:

# zfs create -o setuid=off -o devices=off -o sync=disabled -o mountpoint=/tmp <pool>/tmp

Please note, also, that if you want /tmp on ZFS, you will need to mask (disable) systemd's automatic tmpfs-backed /tmp (tmp.mount), else ZFS will be unable to mount your dataset at boot-time or import-time.
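For example, the masking step can be done as follows:

# systemctl mask tmp.mount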

Transmitting snapshots with ZFS Send and ZFS Recv

It is possible to pipe ZFS snapshots to an arbitrary target by pairing zfs send and zfs recv. This is done through standard output, which allows the data to be sent to any file, device, across the network, or manipulated mid-stream by incorporating additional programs in the pipe.

Below are examples of common scenarios:

Basic ZFS Send

First, let's create a snapshot of some ZFS filesystem:

# zfs snapshot zpool0/archive/books@snap

Now let's send the snapshot to a new location on a different zpool

# zfs send -v zpool0/archive/books@snap | zfs recv zpool4/library

The contents of zpool0/archive/books@snap are now live at zpool4/library

Tip: See man zfs-send and man zfs-recv for details on flags.

To and from files

First, let's create a snapshot of some ZFS filesystem:

# zfs snapshot zpool0/archive/books@snap

Write the snapshot to a gzip file:

# zfs send zpool0/archive/books@snap | gzip > /tmp/mybooks.gz
Warning: Make sure to run zfs send with the -w flag if you wish to preserve encryption during the send.

Now restore the snapshot from the file:

# gzcat /tmp/mybooks.gz | zfs recv -F zpool0/archive/books

Send over ssh

First, let's create a snapshot of some ZFS filesystem:

# zfs snapshot zpool1/filestore@snap

Next we pipe our "send" traffic over an ssh session running "recv":

# zfs send -v zpool1/filestore@snap | ssh $HOST zfs recv coldstore/backups

The -v flag prints information about the datastream being generated. If you are using a passphrase or passkey, you will be prompted to enter it.

Incremental Backups

You may wish to update a previously sent ZFS filesystem without retransmitting all of the data over again. Alternatively, it may be necessary to keep a filesystem online during a lengthy transfer and it is now time to send writes that were made since the initial snapshot.

First, let's create a snapshot of some ZFS filesystem:

# zfs snapshot zpool1/filestore@initial

Next we pipe our "send" traffic over an ssh session running "recv":

# zfs send -v -R zpool1/filestore@initial | ssh $HOST zfs recv coldstore/backups

Once changes are written, make another snapshot:

# zfs snapshot zpool1/filestore@snap2

The following will send the differences that exist locally between zpool1/filestore@initial and zpool1/filestore@snap2 and create an additional snapshot for the remote filesystem coldstore/backups:

# zfs send -v -R -i zpool1/filestore@initial zpool1/filestore@snap2 | ssh $HOST zfs recv coldstore/backups

Now both zpool1/filestore and coldstore/backups have the @initial and @snap2 snapshots.

On the remote host, you may now promote the latest snapshot to become the active filesystem:

# zfs rollback coldstore/backups@snap2

Tuning

General

Many parameters are available for ZFS pools and datasets, and they can be tuned.

Note: Apart from quotas and reservations, all settable properties are inherited from the parent dataset.

To retrieve the current parameter state of a pool:

# zfs get all <pool>

To retrieve the parameter state of a specified dataset:

# zfs get all <pool>/<dataset>

To disable access time (atime), which is enabled by default:

# zfs set atime=off <pool>

To disable access time (atime) on a specific dataset:

# zfs set atime=off <pool>/<dataset>

Instead of turning off atime completely, you may use relatime. This brings the default ext4/XFS atime semantics to ZFS, where the access time is only updated if the modification or change time changes, or if the existing access time has not been updated within the past 24 hours. It is a compromise between atime=off and atime=on. This property only takes effect if atime is on:

# zfs set atime=on <pool>
# zfs set relatime=on <pool>

Compression is just that, transparent compression of data. ZFS supports several different algorithms; lz4 is currently the default. gzip is a good choice for data that is not written frequently but is highly compressible. Consult the OpenZFS Wiki for more information.

To enable compression:

# zfs set compression=on <pool>

To reset a property of a pool and/or dataset to its default state, use zfs inherit:

# zfs inherit -rS atime <pool>
# zfs inherit -rS atime <pool>/<dataset>
Note: Using the -r flag will recursively reset all datasets of the pool.

Scrubbing

Whenever ZFS encounters an error while reading data, it silently repairs the data when possible, writes it back to disk, and logs it so you can obtain an overview of errors in your pools. There is no fsck-like tool for ZFS. Instead, ZFS provides a feature known as scrubbing. It traverses all the data in the pool and verifies that all blocks can be read.

To scrub a pool:

# zpool scrub <pool>

To cancel a running scrub:

# zpool scrub -s <pool>

How often should this be run?

From the Oracle blog post Disk Scrub - Why and When?:

This question is challenging for Support, because the most appropriate answer is "it depends". So before I give a generic answer, here are a few tips to help you create an answer better tailored to your use pattern.
  • What is the expiration of your oldest backup? You should probably scrub your data at least as often as your oldest backups expire, so that you have a known-good restore point.
  • How often are you experiencing disk failures? While the recruitment of a hot-spare disk invokes a "resilver" -- a targeted scrub of just the VDEV which lost a disk -- you should probably scrub at least as often as you experience disk failures on average in your specific environment.
  • How often is the oldest piece of data on your disks read? You should scrub occasionally to prevent very old data from suffering bit rot without you noticing.
If any of your answers to the above are "I do not know", the generic answer is: you should probably scrub your zpool at least once per month. This schedule works well for most use cases, provides enough time for scrubs to complete before starting up again on all but the busiest and largest systems, and is more frequent than disk failures tend to occur even on very large zpools (192+ disks).

According to Aaron Toponce's ZFS Administration Guide, he recommends scrubbing consumer-grade disks once a week.

Running from a service or timer

Note: Starting with OpenZFS 2.1.3, weekly and monthly systemd timers/services are provided. Enable/start zfs-scrub-weekly@pool-to-scrub.timer or zfs-scrub-monthly@pool-to-scrub.timer for the pool you want scrubbed.

Using a systemd timer/service it is possible to automatically scrub pools.

To perform scrubbing monthly on a particular pool:

/etc/systemd/system/zfs-scrub@.timer
[Unit]
Description=Monthly zpool scrub on %i

[Timer]
OnCalendar=monthly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=multi-user.target
/etc/systemd/system/zfs-scrub@.service
[Unit]
Description=zpool scrub on %i

[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/usr/bin/zpool scrub %i

[Install]
WantedBy=multi-user.target

Enable/start the zfs-scrub@pool-to-scrub.timer unit for monthly scrubbing of the specified zpool.

Enabling TRIM

To check whether your vdevs support TRIM, add -t to zpool status to include TRIM information in the output:

$ zpool status -t tank
pool: tank
 state: ONLINE
  scan: none requested
 config:

	NAME                                     STATE     READ WRITE CKSUM
	tank                                     ONLINE       0     0     0
	  ata-ST31000524AS_5RP4SSNR-part1        ONLINE       0     0     0  (trim unsupported)
	  ata-CT480BX500SSD1_2134A59B933D-part1  ONLINE       0     0     0  (untrimmed)

errors: No known data errors

ZFS can TRIM supported devices either manually or periodically via the autotrim property.

To manually TRIM a zpool:

 # zpool trim <zpool>

To enable periodic TRIM on all supported vdevs in a pool:

 # zpool set autotrim=on <zpool>
Note: Because automatic TRIM operates differently from a full zpool trim, you may want to run a manual TRIM occasionally.

To run a full zpool trim monthly on a particular pool using a systemd timer/service:

/etc/systemd/system/zfs-trim@.timer
[Unit]
Description=Monthly zpool trim on %i

[Timer]
OnCalendar=monthly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=multi-user.target
/etc/systemd/system/zfs-trim@.service
[Unit]
Description=zpool trim on %i
Documentation=man:zpool-trim(8)
Requires=zfs.target
After=zfs.target
ConditionACPower=true
ConditionPathIsDirectory=/sys/module/zfs

[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/bin/sh -c '\
if /usr/bin/zpool status %i | grep "trimming"; then\
exec /usr/bin/zpool wait -t trim %i;\
else exec /usr/bin/zpool trim -w %i; fi'
ExecStop=-/bin/sh -c '/usr/bin/zpool trim -s %i 2>/dev/null || true'

[Install]
WantedBy=multi-user.target

Enable/start the zfs-trim@pool-to-trim.timer unit for monthly TRIM of the specified zpool.

SSD Caching

If your pool has no configured log devices, ZFS reserves space on the pool's data disks for its intent log (the ZIL, also called SLOG). If your data disks are slow (e.g. HDD) it is highly recommended to configure the ZIL on solid state drives for better write performance and also to consider a layer 2 adaptive replacement cache (L2ARC). The process to add them is very similar to adding a new VDEV.

All of the below references to device-id are the IDs from /dev/disk/by-id/*.

ZIL

To add a mirrored ZIL:

 # zpool add <pool> log mirror <device-id-1> <device-id-2>

Or to add a single device ZIL (unsafe):

 # zpool add <pool> log <device-id>

Because the ZIL device stores data that has not been written to the pool, it is important to use devices that can finish writes when power is lost. It is also important to use redundancy, since a device failure can cause data loss. In addition, the ZIL is only used for sync writes, so may not provide any performance improvement when your data drives are as fast as your ZIL drive(s).

L2ARC

To add an L2ARC:

# zpool add <pool> cache <device-id>

L2ARC is a read-only cache, so no redundancy is needed. Since ZFS version 2.0.0, L2ARC is persisted across reboots. [7]

L2ARC is generally only useful when the amount of hot data is larger than system memory, but small enough to fit into L2ARC. The L2ARC is indexed by the ARC in system memory, consuming 70 bytes per record (128 KiB by default). Thus, the amount of RAM used for the index is given by:

(L2ARC size) / (record size) * 70 bytes
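As a worked example, a 256 GiB L2ARC device with the default 128 KiB record size holds about 2,097,152 records, so its index consumes roughly 2,097,152 × 70 bytes ≈ 140 MiB of ARC memory.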

For this reason, L2ARC can in certain situations degrade storage performance, since it takes memory away from the ARC.

ZVOLs

ZFS volumes (ZVOLs) can suffer from the same block size-related issues as RDBMSes, but it is worth noting that the default recordsize for ZVOLs is 8 KiB already. If possible, it is best to align any partitions contained in a ZVOL to your recordsize (current versions of fdisk and gdisk by default automatically align at 1MiB segments, which works), and file system block sizes to the same size. Other than this, you might tweak the recordsize to accommodate the data inside the ZVOL as necessary (though 8 KiB tends to be a good value for most file systems, even when using 4 KiB blocks on that level).
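A minimal sketch of aligning a ZVOL's block size with the file system created inside it (the name tank/vm0 and the sizes are only illustrative):

# zfs create -V 32G -o volblocksize=8K tank/vm0
# mkfs.ext4 -b 4096 /dev/zvol/tank/vm0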

RAIDZ and Advanced Format physical disks

Each block of a ZVOL gets its own parity disks, and if you have physical media with logical block sizes of 4096B, 8192B, or so on, the parity needs to be stored in whole physical blocks, and this can drastically increase the space requirements of a ZVOL, requiring 2× or more physical storage capacity than the ZVOL's logical capacity. Setting the recordsize to 16k or 32k can help reduce this footprint drastically.

See OpenZFS issue #1807 for details.

I/O Scheduler

While ZFS is expected to work well with modern schedulers, including mq-deadline and none, experimenting with manually setting the I/O scheduler on ZFS disks may yield performance gains.
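One way to experiment is with a udev rule; the sketch below (file name and matching pattern are only illustrative) sets none for whole SATA disks:

/etc/udev/rules.d/60-zfs-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"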

Troubleshooting

Creating a zpool fails

If the following error occurs then it can be fixed.

# the kernel failed to rescan the partition table: 16
# cannot label 'sdc': try using parted(8) and then provide a specific slice: -1

One reason this can occur is because ZFS expects pool creation to take less than 1 second[8][9]. This is a reasonable assumption under ordinary conditions, but in many situations it may take longer. Each drive will need to be cleared again before another attempt can be made.

# parted /dev/sda rm 1
# parted /dev/sda rm 2
# dd if=/dev/zero of=/dev/sda bs=512 count=1
# zpool labelclear /dev/sda

A brute force creation can be attempted over and over again, and with some luck the zpool creation will take less than 1 second. One cause for creation slowdown can be slow burst read/writes on a drive. By reading from the disk in parallel with zpool creation, it may be possible to increase burst speeds.

# dd if=/dev/sda of=/dev/null

This can be done with multiple drives by saving the above command for each drive to a file on separate lines and running

# cat $FILE | parallel

Then run ZPool creation at the same time.

ZFS is using too much RAM

By default, ZFS caches file operations (ARC) using up to two-thirds of available system memory on the host. To adjust the ARC size, add the following to the kernel parameters list:

zfs.zfs_arc_max=536870912 # (for 512MiB)

In case the default value of zfs_arc_min (1/32 of system memory) is higher than the specified zfs_arc_max, it is also necessary to add the following to the kernel parameters list:

zfs.zfs_arc_min=268435456 # (for 256MiB, needs to be lower than zfs.zfs_arc_max)
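The same limits can also be set as module options instead of kernel parameters, for example (values as above; the file name is only a convention, and the initramfs should be regenerated if the zfs module is included there):

/etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=536870912 zfs_arc_min=268435456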

For a more detailed description, as well as other configuration options, see Gentoo:ZFS#ARC.

Does not contain an EFI label

The following error will occur when attempting to create a zfs filesystem,

/dev/disk/by-id/<id> does not contain an EFI label but it may contain partition

The way to overcome this is to use -f with the zfs create command.

No hostid found

An error that occurs at boot with the following lines appearing before initscript output:

ZFS: No hostid found on kernel command line or /etc/hostid.

This warning occurs because the ZFS module does not have access to the spl hostid. There are two solutions for this. Either place the spl hostid in the kernel parameters in the boot loader, for example by adding spl.spl_hostid=0x00bab10c.

The other solution is to make sure that there is a hostid in /etc/hostid, and then regenerate the initramfs image, which will copy the hostid into the initramfs image.
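A sketch of the second approach (zgenhostid is shipped with the ZFS userspace tools; mkinitcpio -P regenerates all presets):

# zgenhostid $(hostid)
# mkinitcpio -P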

Pool cannot be found while booting from SAS/SCSI devices

In case you are booting from SAS/SCSI-based storage, you might occasionally get boot problems where the pool you are trying to boot from cannot be found. A likely reason for this is that your devices are initialized too late in the process. That means that zfs cannot find any devices at the time when it tries to assemble your pool.

In this case you should force the scsi driver to wait for devices to come online before continuing. You can do this by putting this into /etc/modprobe.d/zfs.conf:

/etc/modprobe.d/zfs.conf
options scsi_mod scan=sync

Afterwards, regenerate the initramfs.

This works because the zfs hook will copy the file at /etc/modprobe.d/zfs.conf into the initcpio which will then be used at build time.

On boot the zfs pool does not mount stating: "pool may be in use from other system"

Unexported pool

If the new installation does not boot because the zpool cannot be imported, chroot into the installation and properly export the zpool. See #Emergency chroot repair with archzfs.

Once inside the chroot environment, load the ZFS module and force import the zpool,

# zpool import -a -f

now export the pool:

# zpool export <pool>

To see the available pools, use,

# zpool status

It is necessary to export a pool because of the way ZFS uses the hostid to track the system the zpool was created on. The hostid is generated partly based on the network setup. During the installation in the archiso the network configuration could be different generating a different hostid than the one contained in the new installation. Once the zfs filesystem is exported and then re-imported in the new installation, the hostid is reset. See Re: Howto zpool import/export automatically? - msg#00227.

If ZFS complains about "pool may be in use" after every reboot, properly export pool as described above, and then regenerate the initramfs in normally booted system.

Incorrect hostid

Double check that the pool is properly exported. Exporting the zpool clears the hostid marking the ownership. So during the first boot the zpool should mount correctly. If it does not there is some other problem.

Reboot again. If the zfs pool refuses to mount, it means the hostid is not yet correctly set in the early boot phase and it confuses zfs. Manually tell zfs the correct number; once the hostid is consistent across reboots, the zpool will mount correctly.

Boot using zfs_force and write down the hostid. This one is just an example.

$ hostid
0a0af0f8

This number has to be added to the kernel parameters as spl.spl_hostid=0x0a0af0f8. Another solution is writing the hostid inside the initramfs image; see the installation guide's explanation of this.

Users can always ignore the check by adding zfs_force=1 to the kernel parameters, but this is not advisable as a permanent solution.

Devices have different sector alignment

Once a drive has become faulted it should be replaced A.S.A.P. with an identical drive.

# zpool replace bigdata ata-ST3000DM001-9YN166_S1F0KDGY ata-ST3000DM001-1CH166_W1F478BD -f

but in this instance, the following error is produced:

cannot replace ata-ST3000DM001-9YN166_S1F0KDGY with ata-ST3000DM001-1CH166_W1F478BD: devices have different sector alignment

ZFS uses the ashift option to adjust for physical block size. When replacing the faulted disk, ZFS is attempting to use ashift=12, but the faulted disk is using a different ashift (probably ashift=9) and this causes the resulting error.

For Advanced Format Disks with 4KB blocksize, an ashift of 12 is recommended for best performance. See OpenZFS FAQ: Performance Considerations and ZFS and Advanced Format disks.

Use zdb to find the ashift of the zpool (running zdb with no arguments prints the cached pool configuration, including ashift), then use the -o argument to set the ashift of the replacement drive:

# zpool replace bigdata ata-ST3000DM001-9YN166_S1F0KDGY ata-ST3000DM001-1CH166_W1F478BD -o ashift=9 -f

Check the zpool status for confirmation:

# zpool status -v
pool: bigdata
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Jun 16 11:16:28 2014
    10.3G scanned out of 5.90T at 81.7M/s, 20h59m to go
    2.57G resilvered, 0.17% done
config:

        NAME                                   STATE     READ WRITE CKSUM
        bigdata                                DEGRADED     0     0     0
        raidz1-0                               DEGRADED     0     0     0
            replacing-0                        OFFLINE      0     0     0
            ata-ST3000DM001-9YN166_S1F0KDGY    OFFLINE      0     0     0
            ata-ST3000DM001-1CH166_W1F478BD    ONLINE       0     0     0  (resilvering)
            ata-ST3000DM001-9YN166_S1F0JKRR    ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0KBP8    ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0JTM1    ONLINE       0     0     0

errors: No known data errors

Pool resilvering stuck/restarting/slow?

According to the ZFSonLinux github it is a known issue since 2012 with ZFS-ZED which causes the resilvering process to constantly restart, sometimes get stuck and be generally slow for some hardware. The simplest mitigation is to stop zfs-zed.service until the resilver completes.

Fix slow boot caused by failed import of unavailable pools in the initramfs zpool.cache

Your boot time can be significantly impacted if you update your initramfs (e.g. when doing a kernel update) while you have additional but non-permanently attached pools imported, because these pools will get added to your initramfs zpool.cache and ZFS will attempt to import these extra pools on every boot, regardless of whether you have exported them and removed them from your regular zpool.cache.

If you notice ZFS trying to import unavailable pools at boot, first run:

$ zdb -C

To check your zpool.cache for pools you do not want imported at boot. If this command is showing (a) additional, currently unavailable pool(s), run:

# zpool set cachefile=/etc/zfs/zpool.cache zroot

To clear the zpool.cache of any pools other than the pool named zroot. Sometimes there is no need to refresh your zpool.cache, but instead all you need to do is regenerate the initramfs.

ZFS Command History

ZFS logs changes to a pool's structure natively as a log of executed commands in a ring buffer (which cannot be turned off). The log may be helpful when restoring a degraded or failed pool.

# zpool history zpool
History for 'zpool':
2023-02-19.16:28:44 zpool create zpool raidz1 /scratch/disk_1.img /scratch/disk_2.img /scratch/disk_3.img
2023-02-19.16:31:29 zfs set compression=lz4 zpool
2023-02-19.16:41:45 zpool scrub zpool
2023-02-19.17:00:57 zpool replace zpool /scratch/disk_1.img /scratch/bigger_disk_1.img
2023-02-19.17:01:34 zpool scrub zpool
2023-02-19.17:01:42 zpool replace zpool /scratch/disk_2.img /scratch/bigger_disk_2.img
2023-02-19.17:01:46 zpool replace zpool /scratch/disk_3.img /scratch/bigger_disk_3.img

Tips and tricks

Creating an Archiso image with ZFS support

The steps for creating an Arch Linux live CD/DVD/USB image are described in Archiso. To include ZFS support in the image, you can either build the PKGBUILDs from the AUR manually or include prebuilt packages from one of the unofficial user repositories.

Building your own ZFS packages from the AUR

Build the ZFS packages you need by following the normal procedure. If you are unsure which packages you need, zfs-dkmsAUR and zfs-utilsAUR will support most changes you might make to the Archiso image. Next, create a local repository and add it to the Pacman configuration of your new profile.

Add the built packages to the list of packages to be installed. The example below assumes you only want to install zfs-dkmsAUR and zfs-utilsAUR:

packages.x86_64
...
zfs-dkms
zfs-utils

If you added any DKMS packages, make sure you also add the headers package for the kernel used by the ISO (linux-headers for the default kernel).

Using the archzfs unofficial user repository

Add the archzfs unofficial user repository to pacman.conf in your new Archiso profile.

Add the archzfs-linux package group to the list of packages to be installed (the packages in the archzfs repository only support the x86_64 architecture):

packages.x86_64
...
archzfs-linux
Note: If you later get an error when running modprobe zfs, add the linux-headers package to packages.x86_64.

Finishing up

Whichever method you used, you finally need to build the ISO image.

Automatic snapshots

zrepl

The zreplAUR package provides a ZFS automatic replication service, which could also be used as a snapshotting service much like snapper.

For details on how to configure the zrepl daemon, see the zrepl documentation. The configuration file should be located at /etc/zrepl/zrepl.yml. Then, run zrepl configcheck to make sure that the syntax of the config file is correct. Finally, enable zrepl.service.

sanoid

sanoidAUR is a policy-driven tool for taking snapshots. Sanoid also includes syncoid, which is for replicating snapshots. It comes with systemd services and a timer.

Sanoid only prunes snapshots on the local system. To prune snapshots on the remote system, run sanoid there as well with prune options. Either use the --prune-snapshots command line option or use the --cron command line option together with the autoprune = yes and autosnap = no configuration options.

ZFS Automatic Snapshot Service for Linux

Note: zfs-auto-snapshot-gitAUR has not seen any updates since 2019, and its functionality is extremely limited. You are advised to switch to a newer tool like zreplAUR.

The zfs-auto-snapshot-gitAUR package provides a shell script to automate the management of snapshots, with each named by date and label (hourly, daily, etc), giving quick and convenient snapshotting of all ZFS datasets. The package also installs cron tasks for quarter-hourly, hourly, daily, weekly, and monthly snapshots. Optionally adjust the --keep parameter from the defaults depending on how far back the snapshots are to go (the monthly script by default keeps data for up to a year).

To prevent a dataset from being snapshotted at all, set com.sun:auto-snapshot=false on it. Likewise, set more fine-grained control as well by label, if, for example, no monthlies are to be kept on a snapshot, for example, set com.sun:auto-snapshot:monthly=false.

Note: zfs-auto-snapshot-git will not create snapshots during scrubbing. It is possible to override this by editing the provided systemd unit and removing --skip-scrub from the ExecStart line. The consequences of doing so are not known.

Once the package has been installed, enable and start the selected timers (zfs-auto-snapshot-{frequent,daily,weekly,monthly}.timer).

Creating shares

ZFS supports creating NFS and SMB shares.

NFS

First, make sure NFS is installed and configured on the system. Note: there is no need to edit /etc/exports. For NFS shares, make sure nfs-server.service and zfs-share.service are started.

To make a pool available on the network:

# zfs set sharenfs=on <pool>

To make a dataset available on the network:

# zfs set sharenfs=on <pool>/<dataset>

To enable read/write access for specific IP ranges:

# zfs set sharenfs="rw=@192.168.1.100/24,rw=@10.0.0.0/24" <pool>/<dataset>

To check if the dataset is exported successfully:

# showmount -e `hostname`
Export list for hostname:
/path/of/dataset 192.168.1.100/24

To view the details of the current export state:

# exportfs -v
/path/of/dataset
    192.168.1.100/24(sync,wdelay,hide,no_subtree_check,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)

To view the current NFS share list by ZFS:

# zfs get sharenfs

SMB

When sharing through SMB, using usershares in /etc/samba/smb.conf will allow ZFS to setup and create the shares. See Samba#Enable Usershares for details.

/etc/samba/smb.conf
[global]
    usershare path = /var/lib/samba/usershares
    usershare max shares = 100
    usershare allow guests = yes
    usershare owner only = no

Create and set permissions on the user directory as root

# mkdir /var/lib/samba/usershares
# chmod +t /var/lib/samba/usershares

To make a pool available on the network:

# zfs set sharesmb=on nameofzpool

To make a dataset available on the network:

# zfs set sharesmb=on nameofzpool/nameofdataset

To check if the dataset is exported successfully:

# smbclient -L localhost -U%
        Sharename       Type      Comment
        ---------       ----      -------
        IPC$            IPC       IPC Service (SMB Server Name)
        nameofzpool_nameofdataset        Disk      Comment: path/of/dataset
SMB1 disabled -- no workgroup available

To view the current SMB share list by ZFS:

# zfs get sharesmb

Encryption in ZFS using dm-crypt

Before OpenZFS version 0.8.0, ZFS did not support encryption directly (See #Native encryption). Instead, zpools can be created on dm-crypt block devices. Since the zpool is created on the plain-text abstraction, it is possible to have the data encrypted while having all the advantages of ZFS like deduplication, compression, and data robustness. Furthermore, utilizing dm-crypt will encrypt the zpools metadata, which the native encryption can inherently not provide.[10]

dm-crypt, possibly via LUKS, creates devices in /dev/mapper and their names are fixed. So you just need to change the zpool create commands to point to those names. The idea is to configure the system to create the /dev/mapper block devices and import the zpools from there. Since zpools can be created on multiple devices (raid, mirroring, striping, ...), it is important that all the devices are encrypted, otherwise the protection might be partially lost.

For example, an encrypted zpool can be created using plain dm-crypt (without LUKS) with:

# cryptsetup --hash=sha512 --cipher=twofish-xts-plain64 --offset=0 \
             --key-file=/dev/sdZ --key-size=512 open --type=plain /dev/sdX enc
# zpool create zroot /dev/mapper/enc

In the case of a root filesystem pool, the mkinitcpio.conf HOOKS line will enable the keyboard for the password, create the devices, and load the pools. It will contain something like:

HOOKS="... keyboard encrypt zfs ..."

Since the /dev/mapper/enc name is fixed no import errors will occur.

Creating encrypted zpools works fine. But if you need encrypted directories, for example to protect your users' homes, ZFS loses some functionality.

ZFS will see the encrypted data, not the plain-text abstraction, so compression and deduplication will not work. The reason is that encrypted data always has high entropy, which makes compression ineffective, and even identical input produces different output (thanks to salting), making deduplication impossible. To reduce the unnecessary overhead it is possible to create a sub-filesystem for each encrypted directory and use eCryptfs on it.

For example to have an encrypted home: (the two passwords, encryption and login, must be the same)

# zfs create -o compression=off -o dedup=off -o mountpoint=/home/<username> <zpool>/<username>
# useradd -m <username>
# passwd <username>
# ecryptfs-migrate-home -u <username>
<log in user and complete the procedure with ecryptfs-unwrap-passphrase>

Emergency chroot repair with archzfs

To get into the ZFS filesystem from live system for maintenance, there are two options:

  1. Build custom archiso with ZFS as described in #Create an Archiso image with ZFS support.
  2. Boot the latest official archiso and bring up the network. Then enable archzfs repository inside the live system as usual, sync the pacman package database and install the archzfs-archiso-linux package.

To start the recovery, load the ZFS kernel modules:

# modprobe zfs

Import the pool:

# zpool import -a -R /mnt

Mount the boot partition and EFI system partition (if any):

# mount /dev/sda2 /mnt/boot
# mount /dev/sda1 /mnt/efi

Chroot into the ZFS filesystem:

# arch-chroot /mnt /bin/bash

Check the kernel version:

# pacman -Qi linux
# uname -r

uname will show the kernel version of the archiso. If they are different, run depmod (in the chroot) with the correct kernel version of the chroot installation:

# depmod -a 3.6.9-1-ARCH (version gathered from pacman -Qi linux but using the matching kernel modules directory name under the chroot's /lib/modules)

This will load the correct kernel modules for the kernel version installed in the chroot installation.

Regenerate the initramfs. There should be no errors.

Bind mount

Here a bind mount from /mnt/zfspool to /srv/nfs4/music is created. The configuration ensures that the zfs pool is ready before the bind mount is created.

fstab

See systemd.mount(5) for more information on how systemd converts fstab into mount unit files with systemd-fstab-generator(8).

/etc/fstab
/mnt/zfspool		/srv/nfs4/music		none	bind,defaults,nofail,x-systemd.requires=zfs-mount.service	0 0

Monitoring / Mailing on Events

See ZED: The ZFS Event Daemon for more information.

An email forwarder, such as S-nail, is required to accomplish this. Test it to be sure it is working correctly.

Uncomment the following in the configuration file:

/etc/zfs/zed.d/zed.rc
 ZED_EMAIL_ADDR="root"
 ZED_EMAIL_PROG="mailx"
 ZED_NOTIFY_VERBOSE=0
 ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"

Update 'root' in ZED_EMAIL_ADDR="root" to the email address you want to receive notifications at.

If you are keeping your mailrc in your home directory, you can tell mail to get it from there by setting MAILRC:

/etc/zfs/zed.d/zed.rc
export MAILRC=/home/<user>/.mailrc

This works because ZED sources this file, so mailx sees this environment variable.

If you want to receive an email no matter the state of your pool, you will want to set ZED_NOTIFY_VERBOSE=1. You will need to do this temporarily for testing.

Start and enable zfs-zed.service.

With ZED_NOTIFY_VERBOSE=1, you can test by running a scrub as root: zpool scrub <pool-name>.

Wrap shell commands in pre & post snapshots

Since it is so cheap to make a snapshot, we can use this as a measure of security for sensitive commands such as system and package upgrades. If we make a snapshot before, and one after, we can later diff these snapshots to find out what changed on the filesystem after the command executed. Furthermore we can also rollback in case the outcome was not desired.

znp

E.g.:

# zfs snapshot -r zroot@pre
# pacman -Syu
# zfs snapshot -r zroot@post
# zfs diff zroot@pre zroot@post 
# zfs rollback zroot@pre

A utility that automates the creation of pre and post snapshots around a shell command is znp.

E.g.:

# znp pacman -Syu
# znp find / -name "something*" -delete

and you would get snapshots created before and after the supplied command, and also output of the commands logged to file for future reference so we know what command created the diff seen in a pair of pre/post snapshots.

Remote unlocking of ZFS encrypted root

As of PR #261, archzfs supports SSH unlocking of natively-encrypted ZFS datasets. This section describes how to use this feature, and is largely based on dm-crypt/Specialties#Busybox based initramfs (built with mkinitcpio).

  1. Install mkinitcpio-netconf to provide hooks for setting up early user space networking.
  2. Choose an SSH server to use in early user space. The options are mkinitcpio-tinyssh or mkinitcpio-dropbear, and are mutually exclusive.
    1. If using mkinitcpio-tinyssh, it is also recommended to install tinyssh or tinyssh-convert-gitAUR. This tool converts an existing OpenSSH hostkey to the TinySSH key format, preserving the key fingerprint and avoiding connection warnings. The TinySSH and Dropbear mkinitcpio install scripts will automatically convert existing hostkeys when generating a new initcpio image.
  3. Decide whether to use an existing OpenSSH key or generate a new one (recommended) for the host that will be connecting to and unlocking the encrypted ZFS machine. Copy the public key into /etc/tinyssh/root_key or /etc/dropbear/root_key. When generating the initcpio image, this file will be added to authorized_keys for the root user and is only valid in the initrd environment.
  4. Add the ip= 內核參數 to your boot loader configuration. The ip string is highly configurable. A simple DHCP example is shown below.
    ip=:::::eth0:dhcp
  5. Edit /etc/mkinitcpio.conf to include the netconf, dropbear or tinyssh, and zfsencryptssh hooks before the zfs hook:
    HOOKS=(... netconf <tinyssh>|<dropbear> zfsencryptssh zfs ...)
  6. Regenerate the initramfs.
  7. Reboot and try it out!

Changing the SSH server port

By default, mkinitcpio-tinyssh and mkinitcpio-dropbear listen on port 22. You may wish to change this.

For TinySSH, copy /usr/lib/initcpio/hooks/tinyssh to /etc/initcpio/hooks/tinyssh, and find/modify the following line in the run_hook() function:

/etc/initcpio/hooks/tinyssh
/usr/bin/tcpserver -HRDl0 0.0.0.0 <new_port> /usr/sbin/tinysshd -v /etc/tinyssh/sshkeydir &

For Dropbear, copy /usr/lib/initcpio/hooks/dropbear to /etc/initcpio/hooks/dropbear, and find/modify the following line in the run_hook() function:

/etc/initcpio/hooks/dropbear
 /usr/sbin/dropbear -E -s -j -k -p <new_port>

Regenerate the initramfs.

Unlocking from a Windows machine using PuTTY/Plink

First, we need to use puttygen.exe to import and convert the OpenSSH key generated earlier into PuTTY's .ppk private key format. Let us call it zfs_unlock.ppk for this example.

The mkinitcpio-netconf process above does not set up a shell (nor do we need one). However, because there is no shell, PuTTY will immediately close after a successful connection. This can be disabled in the PuTTY SSH configuration (Connection > SSH > [X] Do not start a shell or command at all), but it still does not allow us to see stdout or enter the encryption passphrase. Instead, we use plink.exe with the following parameters:

plink.exe -ssh -l root -i c:\path\to\zfs_unlock.ppk <hostname>

The plink command can be put into a batch script for ease of use.

See also