ZFS

From the Arch Linux Chinese wiki

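ZFS is an advanced file system originally developed and released by Sun Microsystems for the Solaris operating system. Today the name ZFS usually refers to the OpenZFS fork, which ports the original implementation to other operating systems, including Linux, while continuing the development of Solaris ZFS. This article treats ZFS and OpenZFS as synonyms.

ZFS has a rich feature set, including a better page-cache algorithm (ARC), deduplication, pooled storage, snapshots, cloning, data integrity verification with automatic repair (scrubbing), RAID-Z, and more.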

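Concepts

Unlike common file systems that store data on a single block device, ZFS stores data in storage pools. A pool is made up of vdevs (virtual devices), which in turn are made up of block devices. The pool always writes to the vdev with the largest proportion of free space, so data ends up striped across the vdevs. A vdev can itself use a more complex layout such as RAIDZ or a mirror. In the simplest configuration, a pool is created on a single vdev consisting of a single partition and behaves much like a conventional file system.

Once a pool exists, storage can be allocated from it. These resources are grouped into units called datasets, of which there are four types: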

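  1. File system: essentially a directory tree that can be mounted into the system namespace like a conventional file system
  2. Volume (zvol): a dataset exposed as a block device
  3. Snapshot: a snapshot of a file system or volume
  4. Bookmark: a snapshot that holds no data, used for incremental replication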

A dataset is identified by a unique path of the following form:

pool(/segment)+((#|@)bookmark/snapshot name)?

where # introduces a bookmark and @ introduces a snapshot.
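As an illustration of the naming rule above, the sketch below classifies a dataset path purely by its separator. The function name dataset_kind is hypothetical, and the path alone cannot distinguish a file system from a volume, so both are reported together.

```shell
# Hypothetical helper: classify a ZFS dataset path by its separator.
#   pool/seg@name -> snapshot
#   pool/seg#name -> bookmark
#   anything else -> filesystem-or-volume (the path cannot tell these apart)
dataset_kind() {
    case "$1" in
        *@*)   echo "snapshot" ;;
        *'#'*) echo "bookmark" ;;
        *)     echo "filesystem-or-volume" ;;
    esac
}
```

For example, dataset_kind 'tank/home@monday' reports a snapshot, while dataset_kind 'tank/home' reports a file system or volume.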

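Note: The above is a brief summary of zpoolconcepts(7) and zfsconcepts(7). Reading both manual pages is strongly recommended to become familiar with the concepts and with the technical terms not covered here.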

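Problems caused by being out-of-tree

Because of complex legal issues, the Linux kernel maintainers refuse to merge ZFS into the Linux kernel, so ZFS is developed as an out-of-tree kernel module. One consequence is that kernel upgrades frequently change kernel APIs that ZFS uses. When that happens, ZFS has to adapt its code to the new API, which means that for some time ZFS will be incompatible with the latest mainline kernel release.

Tip: Since the linux package closely tracks the latest stable kernel, it is best to use linux-lts unless you want to pin the linux package to a specific version.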

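Installation

As an out-of-tree module, ZFS can be installed from two kinds of packages: binary kernel modules built for a specific kernel version, or the source installed as a DKMS module that is rebuilt automatically whenever the kernel is updated.

Besides the kernel module, you also need userspace tools such as zpool(8) and zfs(8). They are usually packaged together in a single package named zfs-utils*.

All the kernel module packages mentioned below already declare the required zfs-utils* package as a dependency, so you only need to satisfy that dependency when installing.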

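Binary kernel module

Comparison of packages

Package name            | Repository  | ZFS release | Target kernel | Binary package | Notes
zfs-linux-lts           | AUR         | stable      | linux-lts     |                | It is strongly recommended to build new versions of the ZFS package with devtools, so that the installed ZFS package does not have to be removed when upgrading the kernel.
zfs-linux               | AUR         | stable      | linux         |                |
zfs-linux-lts-poscat    | archlinuxcn | stable      | linux-lts     |                | Automatically checked for updates and rebuilt by a builder every few hours; ships a mkinitcpio hook for systemd-based initrds.
zfs-linux-lts-rc-poscat | archlinuxcn | pre-release | linux-lts     |                |
zfs-linux               | archzfs     | stable      | linux         |                |
zfs-linux-lts           | archzfs     | stable      | linux-lts     |                |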

DKMS

Using ZFS as the root partition

See Installing Arch Linux on ZFS.

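Configuration

Importing pools at boot

ZFS provides systemd services for importing pools automatically, and targets that let other units know when ZFS has been initialized:

  • zfs.target: reached once all ZFS services have finished
  • zfs-import.target: reached once the ZFS pools have been imported
  • zfs-volumes.target: reached once all zvols have appeared under /dev
  • zfs-import-scan.service: scans devices with libblkid and imports pools
  • zfs-import-cache.service: imports the pools listed in zpool.cache
  • zfs-volume-wait.service: waits for all zvols to become available

Choose one of zfs-import-scan.service and zfs-import-cache.service, and enable all the other services and targets.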

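zfs-import-scan

zfs-import-scan.service uses the default logic of zpool import: it scans devices with blkid, so it does not need a zpool.cache file. Since zpool.cache is deprecated, this is the recommended method.

zfs-import-scan.service will not start if zpool.cache exists and is not empty, so make sure none of your pools is imported with the cachefile option set; enabling the zfs_autoimport_disable option of the zfs module achieves this. You also need to delete any existing zpool.cache, or set the cachefile option to none on every pool imported at boot.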

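zfs-import-cache

zfs-import-cache.service imports pools with zpool import -c <zpool.cache>, which reads the device paths from zpool.cache.

Device paths may change after a reboot or a hardware change, so with this method pay attention to the device paths used when creating the pool; otherwise the cache becomes stale and the import fails. See Persistent block device naming for how to choose persistent device paths.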

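Mounting file systems automatically

zfs-import-scan.service and zfs-import-cache.service import the pools but do not mount any file system. Depending on whether your file systems use mountpoint=legacy, there are two ways to mount them at boot. If you mix legacy and non-legacy file systems, you need to use both.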

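zfs-mount-generator

If your file systems use non-legacy mounts, zfs-mount-generator is recommended. It is a systemd.generator(7) that creates systemd mount units for every file system with canmount=on in the imported ZFS pools, so that they are mounted at boot. It relies on a cache of zfs list output, so out of the box zfs-mount-generator does nothing; set it up as follows:

  1. Enable and start zfs-zed.service
  2. Create the /etc/zfs/zfs-list.cache directory
  3. In /etc/zfs/zfs-list.cache, create an empty file named after your pool. The ZFS Event Daemon (ZED) only updates the file system list if the pool's file exists and is writable:
    # touch /etc/zfs/zfs-list.cache/<pool-name>
  4. Check the contents of /etc/zfs/zfs-list.cache/<pool-name>. If the file is empty, change the canmount property of any ZFS file system to generate a ZFS event that triggers ZED:
    # zfs set canmount=off zroot/fs1
    then run
    # zfs inherit canmount zroot/fs1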
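Steps 2 and 3 can be scripted for several pools at once. The following is only a sketch: the function name prepare_list_cache is hypothetical, and the cache directory is passed as an argument so the function can be tried on a scratch directory before being pointed at /etc/zfs/zfs-list.cache as root.

```shell
# Hypothetical helper: create the list-cache directory and one empty,
# writable file per pool so that ZED can fill them in later.
# Usage: prepare_list_cache CACHE_DIR POOL...
prepare_list_cache() {
    _plc_dir=$1
    shift
    mkdir -p "$_plc_dir" || return 1
    for _plc_pool in "$@"; do
        # touch creates the file if missing and leaves existing content alone
        touch "$_plc_dir/$_plc_pool" || return 1
    done
}
```

On a real system this would be called as prepare_list_cache /etc/zfs/zfs-list.cache zroot, after which step 4 still applies.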

fstab

If your file systems use legacy mounts, specify their mount points in fstab. The device field takes the full dataset path of the file system, and the dump and fsck fields must be 0.
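A minimal sketch of what such an fstab entry looks like, assuming the mount options default to "defaults" when none are given. The function name zfs_fstab_line is hypothetical; it only prints a line and does not touch /etc/fstab.

```shell
# Hypothetical helper: print an fstab line for a legacy-mounted ZFS file system.
# $1 = dataset path, $2 = mount point, $3 = mount options (optional)
zfs_fstab_line() {
    # dump and fsck fields are fixed at 0, as required for ZFS
    printf '%s\t%s\tzfs\t%s\t0 0\n' "$1" "$2" "${3:-defaults}"
}
```

For instance, zfs_fstab_line zroot/data/home /home prints a tab-separated entry with type zfs and both trailing fields set to 0.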

Creating the hostid file

This is usually not required, but creating /etc/hostid is still recommended:

# zgenhostid $(hostid)

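Storage pools

Experimenting with ZFS

Users who want to try ZFS without risking data loss can refer to ZFS/Virtual disks.

Creating ZFS pools

Tip: You may want to read #ashift property first, since setting ashift when creating the pool is recommended.

To create a ZFS pool: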

# zpool create -R <root> -o <poolopts> -O <dsetprops> <pool> <vdevs>

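where a vdev is either a single device or takes the following form:

<vdev type> <device> ... <device>

  • -R: mount all file systems under this directory, so that the running system is not affected
  • -o: set a pool property; may be given multiple times. Some properties, such as ashift, cannot be changed after creation (strictly speaking ashift is per-vdev, but setting it for an individual vdev requires zpool add).
  • -O: set a property on the pool's root dataset; may be given multiple times. Some properties, such as normalization, cannot be changed after creation.
  • pool: the name of the pool
  • vdev type: see zpoolconcepts(7) for the list of supported vdev types
  • device: a block device, given either as a full path or as the file name part of the path
Note: Depending on how the pool is imported at boot, you may need to pay attention to the device paths used when creating it.

For example, to create a pool on a single device (a whole disk here; a partition is passed the same way):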

# zpool create -R /mnt pool /dev/sda

To create a pool with a single raidz1 vdev:

# zpool create -R /mnt pool \
               raidz1 \
                  ata-ST3000DM001-9YN166_S1F0KDGY \
                  ata-ST3000DM001-9YN166_S1F0JKRR \
                  ata-ST3000DM001-9YN166_S1F0KBP8 \
                  ata-ST3000DM001-9YN166_S1F0JTM1

To create a pool with two mirror vdevs:

# zpool create -R /mnt pool \
               mirror \
                  ata-ST3000DM001-9YN166_S1F0KDGY \
                  ata-ST3000DM001-9YN166_S1F0JKRR \
               mirror \
                  ata-ST3000DM001-9YN166_S1F0KBP8 \
                  ata-ST3000DM001-9YN166_S1F0JTM1
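The vdev part of the mirror example follows a regular pattern: the keyword mirror followed by the devices of one mirror, repeated per vdev. The sketch below builds that argument string for two-way mirrors from a flat device list; the function name mirror_vdevs is hypothetical, and it only prints text, leaving the actual zpool create call to you.

```shell
# Hypothetical helper: group devices pairwise into "mirror A B mirror C D ...".
# Fails if the number of devices is zero or odd.
mirror_vdevs() {
    if [ $# -eq 0 ] || [ $(( $# % 2 )) -ne 0 ]; then
        echo "mirror_vdevs: need an even, non-zero number of devices" >&2
        return 1
    fi
    _mv_out=""
    while [ $# -gt 0 ]; do
        _mv_out="$_mv_out mirror $1 $2"
        shift 2
    done
    # strip the leading space left by the first append
    echo "${_mv_out# }"
}
```

The printed string is what would follow the pool name in a command such as zpool create -R /mnt pool mirror A B mirror C D.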

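ashift property

ashift is an immutable, per-vdev property that sets the (logical) sector size to 2^ashift bytes. For good performance, the logical sector size should be greater than or equal to the physical sector size of the disks.

By default, zpool create detects the physical sector size of the devices automatically, which is adequate for single-disk setups.

However, if you use (or plan to use) vdevs such as mirror or raidzX, in which failed disks can be replaced, always setting ashift=12 is recommended. A 4 KiB logical sector size works fine on a disk with 512-byte sectors, whereas the reverse degrades performance (unless you have one of the less common SSDs with 8 KiB sectors).

Tip: Use
$ lsblk --filter 'TYPE=="disk"' -o NAME,PHY-SEC

to check the physical sector size of your disks.

In addition, NVMe drives can sometimes be formatted with a better-performing LBA format (see nvme-format(1)).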
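Since the sector size is 2^ashift bytes, the ashift matching a given sector size is its base-2 logarithm. The sketch below computes it with integer arithmetic; the function name ashift_for_sector is hypothetical, and the input is assumed to be a sector size in bytes such as the PHY-SEC value reported by lsblk.

```shell
# Hypothetical helper: print the ashift whose sector size (2^ashift) equals $1.
# Rejects values that are not positive powers of two.
ashift_for_sector() {
    _as_size=$1
    _as_shift=0
    [ "$_as_size" -ge 1 ] 2>/dev/null || { echo "invalid sector size: $1" >&2; return 1; }
    while [ "$_as_size" -gt 1 ]; do
        [ $(( _as_size % 2 )) -eq 0 ] || { echo "not a power of two: $1" >&2; return 1; }
        _as_size=$(( _as_size / 2 ))
        _as_shift=$(( _as_shift + 1 ))
    done
    echo "$_as_shift"
}
```

A 512-byte sector gives 9, 4096 bytes gives 12 and 8192 bytes gives 13, which matches the ashift=12 recommendation for 4 KiB sectors.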

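Creating GRUB-compatible pools

By default, zpool create enables all features on a pool. If /boot is placed on ZFS and booted with GRUB, every feature that GRUB does not support must be disabled, otherwise GRUB cannot read the pool. ZFS ships compatibility files (see /usr/share/zfs/compatibility.d) that help create pools restricted to a given feature set, and grub2 is one of them.

A pool limited to such a feature set can be created with: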

# zpool create -o compatibility=grub2 $POOL_NAME $VDEVS

Verifying pool status

If the command succeeds, it prints nothing. The mount command shows that the pool is mounted, and zpool status shows that the pool has been created:

# zpool status
  pool: bigdata
 state: ONLINE
 scan: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        bigdata                                    ONLINE       0     0     0
          -0                                       ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0KDGY-part1  ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0JKRR-part1  ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0KBP8-part1  ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0JTM1-part1  ONLINE       0     0     0

errors: No known data errors

Destroying a storage pool

To destroy an entire pool:

# zpool destroy <pool>

Then check the pool status:

# zpool status
no pools available

Exporting a storage pool

To export a pool:

# zpool export <pool>

Extending an existing zpool

A device (a partition or a disk) can be added to an existing zpool with:

# zpool add <pool> <device-id>

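Attaching a device as a mirror

A device (a partition or a disk) can be attached to an existing device as a mirror (similar to RAID 1):

# zpool attach <pool> <device-id|mirror> <new-device-id>

You can attach the new device to an existing mirror vdev (e.g. to go from a 2-way to a 3-way mirror), or attach it to a single device to form a new mirror vdev.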

Renaming a zpool

An existing pool can be renamed in two steps:

# zpool export oldname
# zpool import oldname newname

Setting a different mount point

The mount point of a zpool can be changed with:

# zfs set mountpoint=/foo/bar poolname

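Upgrading zpools

After upgrading ZFS to a new version, new features may become available. For compatibility reasons, ZFS does not enable new features on existing pools automatically; they have to be enabled manually for each pool.

To check whether an upgrade is available: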

$ zpool upgrade
This system supports ZFS pool feature flags.

All pools are formatted using feature flags.

Every feature flags pool has all supported and requested features enabled.

If a pool can be upgraded, the output looks like this:

This system supports ZFS pool feature flags.

All pools are formatted using feature flags.


Some supported features are not enabled on the following pools. Once a
feature is enabled the pool may become incompatible with software
that does not support the feature. See zpool-features(7) for details.

Note that the pool 'compatibility' feature can be used to inhibit
feature upgrades.

POOL  FEATURE

rpool redaction_list_spill raidz_expansion fast_dedup longname large_microzap

To upgrade a single pool:

# zpool upgrade <pool>

To upgrade all pools:

# zpool upgrade -a

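Creating datasets

Instead of creating directories in a pool, you can create datasets. Besides snapshots, datasets offer finer control, such as quotas. Before creating and mounting a dataset, make sure that no directory with the same name exists in the pool. To create a dataset:

# zfs create <pool>/<dataset>

ZFS-specific properties can be applied to a dataset. For example, to set a quota on a child dataset (quotas apply to datasets, not to plain directories inside them):

# zfs set quota=20G <pool>/<dataset>/<child-dataset>

For more ZFS commands, see zfs(8) and zpool(8).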

Native encryption

ZFS supports the following encryption suites: aes-128-ccm, aes-192-ccm, aes-256-ccm, aes-128-gcm, aes-192-gcm and aes-256-gcm. With encryption=on, aes-256-gcm is used. See zfs-change-key(8) for a description of native encryption, including its limitations.

The following key formats are supported: passphrase, raw and hex.

One can also specify/increase the default iterations of PBKDF2 when using passphrase with -o pbkdf2iters <n>, although it may increase the decryption time.

Note:
  • To import a pool with keys, one needs to specify the -l flag; without this flag, encrypted datasets will be left unavailable until the keys are loaded. See #Importing a pool created by id.
  • Native ZFS encryption has been made available in the stable 0.8.0 release or newer. Previously it was only available in development versions provided by packages like zfs-linux-git and zfs-dkms-git (both in the AUR) or other development builds. Users who were only using the development versions for the native encryption may now switch to the stable releases if they wish.
  • The default encryption suite was changed from aes-256-ccm to aes-256-gcm in the 0.8.4 release.

To create a dataset encrypted with a passphrase:

# zfs create -o encryption=on -o keyformat=passphrase <pool>/<dataset>

To use a key file instead of a passphrase:

# dd if=/dev/random of=/path/to/key bs=32 count=1 iflag=fullblock
# zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///path/to/key <pool>/<dataset>

The easy way to make a key in human-readable form (keyformat=hex):

# od -Anone -x -N 32 -w64 /dev/random | tr -d '[:blank:]' > /path/to/hex.key
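A hex key encodes 32 bytes, so the key file should contain exactly 64 hexadecimal characters. The following sketch checks that before the file is handed to zfs create; the function name is_valid_hex_key is hypothetical, and it deliberately ignores whitespace such as the trailing newline left by the od pipeline above.

```shell
# Hypothetical helper: succeed if the file holds exactly 64 hex characters
# once whitespace (including the trailing newline) is removed.
is_valid_hex_key() {
    _hk_key=$(tr -d '[:space:]' < "$1") || return 1
    printf '%s\n' "$_hk_key" | grep -Eq '^[0-9A-Fa-f]{64}$'
}
```

A call such as is_valid_hex_key /path/to/hex.key returns a zero exit status for a well-formed key and non-zero otherwise.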

To verify the key location:

# zfs get keylocation <pool>/<dataset>

To change the key location:

# zfs set keylocation=file:///path/to/key <pool>/<dataset>

You can also load the keys manually with any of the following commands:

# zfs load-key <pool>/<dataset> # load key for a specific dataset
# zfs load-key -a # load all keys
# zfs load-key -r zpool/dataset # load all keys in a dataset

To mount an encrypted dataset:

# zfs mount <pool>/<dataset>

Unlock/mount at boot time: systemd

A systemd unit can unlock datasets automatically at boot. For example, the following service unlocks a specific dataset:

/etc/systemd/system/zfs-load-key@.service
[Unit]
Description=Load %I encryption keys
Before=systemd-user-sessions.service zfs-mount.service
After=zfs-import.target
Requires=zfs-import.target
DefaultDependencies=no

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/bash -c 'until (systemd-ask-password "Encrypted ZFS password for %I" --no-tty | zfs load-key %I); do echo "Try again!"; done'

[Install]
WantedBy=zfs-mount.service

Then start/enable the service for each encrypted dataset (e.g. zfs-load-key@pool0-dataset0.service). Note that - in a systemd instance name stands for /; see systemd-escape(1) for details.
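The mapping from dataset path to instance name can be sketched as follows. This is a simplified stand-in for systemd-escape(1), not a replacement: it assumes dataset names made of letters, digits, /, -, _ and ., and handles only the two characters that matter here (a literal - becomes \x2d, then / becomes -). The function name zfs_load_key_unit is hypothetical.

```shell
# Hypothetical helper: print the zfs-load-key@ instance unit for a dataset.
# Simplified escaping: "-" -> "\x2d" first, then "/" -> "-".
zfs_load_key_unit() {
    _lk_esc=$(printf '%s' "$1" | sed -e 's/-/\\x2d/g' -e 's#/#-#g')
    printf 'zfs-load-key@%s.service\n' "$_lk_esc"
}
```

With this, pool0/dataset0 maps to zfs-load-key@pool0-dataset0.service, while a dataset such as tank/my-data needs the escaped form zfs-load-key@tank-my\x2ddata.service.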

Note: The Before=systemd-user-sessions.service ensures that systemd-ask-password is invoked before the local IO devices are handed over to the desktop environment.

An alternative is to load all possible keys:

/etc/systemd/system/zfs-load-key.service
[Unit]
Description=Load encryption keys
DefaultDependencies=no
After=zfs-import.target
Before=zfs-mount.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/zfs load-key -a
StandardInput=tty-force

[Install]
WantedBy=zfs-mount.service

Then start/enable zfs-load-key.service.

Unlock at login time: PAM

If you are not encrypting the root volume, but only the home volume or a user-specific volume, another idea is to wait until login to decrypt it. The advantages of this method are that the system boots uninterrupted, and that when the user logs in, the same password can be used both to authenticate and to decrypt the home volume, so that the password is only entered once.

First set the mountpoint to legacy to avoid having it mounted by zfs mount -a:

# zfs set mountpoint=legacy zroot/data/home

Ensure that it is in /etc/fstab so that mount /home will work:

/etc/fstab
zroot/data/home         /home           zfs             rw,xattr,posixacl,noauto        0 0

Alternatively, you can keep using ZFS mounts if you use both:

# zfs set canmount=noauto zroot/data/home
# zfs set org.openzfs.systemd:ignore=on zroot/data/home

The first will stop ZFS automatically mounting it, and the second systemd, but you will still be able to manually (or through the following scripts) mount it. If you have child datasets, org.openzfs.systemd:ignore=on will be inherited, but you will need to set canmount=noauto on each as it is not inheritable, otherwise they will try to mount without a mountpoint.

On a single-user system, with only one /home volume having the same encryption password as the user's password, it can be decrypted at login as follows: first create /usr/local/bin/mount-zfs-homedir

/usr/local/bin/mount-zfs-homedir
#!/bin/bash
set -eu

# $PAM_USER will be the username of the user, you can use it for per-user home volumes.
HOME_VOLUME="zroot/data/home" 

if [ "$(zfs get keystatus "${HOME_VOLUME}" -Ho value)" != "available" ]; then
  PASSWORD=$(cat -)
  # Ignore a failed unlock (e.g. wrong password) so the script still reaches the mount loop.
  zfs load-key "${HOME_VOLUME}" <<< "$PASSWORD" || true
fi

# This will also mount any child datasets, unless they use a different key.
echo "$(zfs list -rHo name,keystatus,mounted "${HOME_VOLUME}")" | while IFS=$'\t' read -r NAME KEYSTATUS MOUNTED; do
  if [ "${MOUNTED}" != "yes" ] && [ "${KEYSTATUS}" == "available" ]; then
    zfs mount "${NAME}" || true
  fi
done

Do not forget to make it executable; then get PAM to run it by adding the following line to /etc/pam.d/system-auth:

/etc/pam.d/system-auth
auth       optional                    pam_exec.so          expose_authtok /usr/local/bin/mount-zfs-homedir

Now it will transparently decrypt and mount the /home volume when you log in anywhere: on the console, via ssh, etc.

SSH

A caveat is that since your ~/.ssh directory is not mounted, if you log in via ssh, you must use password authentication the first time rather than relying on ~/.ssh/authorized_keys.

If you do not wish to enable (insecure) password authentication, you can instead move ~/.ssh/authorized_keys to a new location. Make /etc/ssh/user_config/ and inside it a folder for each user, owned by that user and with 700 permissions. Then move each user's authorized_keys into their respective folders, and edit the system sshd configuration:

/etc/ssh/sshd_config
AuthorizedKeysFile /etc/ssh/user_config/%u/authorized_keys

Then restart sshd.service. You can also optionally make a link for each user from ~/.ssh/authorized_keys to the new location so users can still edit it as they are used to.

This will let you log in, but your home partition will not be mounted, and you will need to do so manually. There are multiple options to work around this:

SSH Key & Password when required

It is possible to set up PAM to only prompt for a password via SSH when it is necessary to decrypt your home partition. You will need to enable both publickey and keyboard-interactive authentication methods:

/etc/ssh/sshd_config
PubkeyAuthentication yes
KbdInteractiveAuthentication yes
AuthenticationMethods publickey,keyboard-interactive

## Example of excluding a certain user who does not have an encrypted home directory.
#Match User nohome
#  KbdInteractiveAuthentication no
#  AuthenticationMethods publickey
Warning: Note the comma in AuthenticationMethods publickey,keyboard-interactive: it means that you need to pass both authentication methods to log in with SSH. The very similar AuthenticationMethods publickey keyboard-interactive means you can use either to log in, which would let someone bypass your public key auth.
Note: You may ask why keyboard-interactive and not password? password is done client-side, so even if the auth is skipped, the user is still prompted and the password is just thrown away. With keyboard-interactive the user does not get prompted at all when we skip it.

This will mean it asks for the password after validating the key, but using PAM we can stop it asking for the password when not needed. We make a script that will fail when the key is not available to us:

/usr/local/bin/require-encrypted-homedir
#!/bin/bash
set -eu

HOME_VOLUME="zroot/data/home" # You can use $PAM_USER to use the username in the volume for a per-user solution.

if [ "$(zfs get keystatus "${HOME_VOLUME}" -Ho value)" != "available" ]; then
  exit 27 # PAM_TRY_AGAIN
elif [[ "${SSH_AUTH_INFO_0:-""}" =~ ^"publickey " ]]; then
  exit 0
else
  # If this happens, it implies a configuration error: either you are allowing auth without a public 
  # key, or have enabled this in a non-SSH PAM service. Both are dangerous and this should block it, 
  # but if you see it, fix your configuration.
  exit 3 # PAM_SERVICE_ERR
fi

And make it executable.

Now we want to configure PAM to call this, and skip asking for the password if the script succeeds because we already have the key available. Add this line above the existing auth line(s) you want to skip (all of them unless you have something else set up) for the SSH service:

/etc/pam.d/sshd
auth sufficient pam_exec.so /usr/local/bin/require-encrypted-homedir
Warning: This is for /etc/pam.d/sshd, not /etc/pam.d/system-auth as above. You do not want local users without a public key to be able to skip the password. There is a safeguard in the script against this, but it is still best to be careful.
Note: When using private keys, the auth step is skipped in PAM as the private key authentication is handled entirely by sshd. This means that the script we are adding here will never be run for private keys and they cannot be skipped; however, we still do a check for defence-in-depth to try and ensure a key has been checked.

With this, you will be prompted for a password only when the key is not loaded.

SSH Key & Password

A simpler option is to just enable both methods, meaning your key still gets checked, but then you have to type the password too, which will decrypt your home partition.

/etc/ssh/sshd_config
PubkeyAuthentication yes
PasswordAuthentication yes
AuthenticationMethods publickey,password
Warning: Note the comma in AuthenticationMethods publickey,password: it means that you need to pass both authentication methods to log in with SSH. The very similar AuthenticationMethods publickey password means you can use either to log in, which would let someone bypass your public key auth.

This works (and will not let anyone authenticate with just a password), but has the downside of requiring your password every time.

You can also specify something like:

AuthenticationMethods publickey password,publickey

This allows clients to use either just a public key, or a public key and a password. Which one the client uses depends on the PreferredAuthentications option: -o PreferredAuthentications=password,publickey will ask for the password, while -o PreferredAuthentications=publickey will not. This is more manual than automated fallback, but has fewer moving parts, and avoids asking you every time if you prefer publickey by default (you can use host-specific options on clients to simplify setting these options).

Swap volume

警吿:

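ZFS does not allow swap files, but you can use a ZFS volume (ZVOL) as swap. It is important to set the ZVOL block size to match the system page size, which can be obtained with getconf PAGESIZE (4 KiB by default on x86_64). Not caching the ZVOL data also helps the system keep running well in low-memory situations.

To create an 8 GiB ZFS volume: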

# zfs create -V 8G -b $(getconf PAGESIZE) -o compression=zle \
              -o logbias=throughput -o sync=always\
              -o primarycache=metadata -o secondarycache=none \
              -o com.sun:auto-snapshot=false <pool>/swap

Prepare it as a swap partition and activate it:

# mkswap -f /dev/zvol/<pool>/swap
# swapon /dev/zvol/<pool>/swap

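To make it permanent, edit /etc/fstab. ZVOLs support discard, which can help ZFS's block allocator and reduce fragmentation in other datasets while the swap is not full.

Add a line like the following to /etc/fstab: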

/dev/zvol/<pool>/swap none swap discard 0 0

Access Control Lists (ACL)

To use ACLs on a dataset:

# zfs set acltype=posixacl <nameofzpool>/<nameofdataset>
# zfs set xattr=sa <nameofzpool>/<nameofdataset>

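Setting xattr=sa is recommended for performance reasons [1].

Since datasets inherit ACL parameters, it is preferable to enable ACLs on the zpool. The default aclinherit mode is restricted, which you may want to change to aclinherit=passthrough [2]; note, however, that aclinherit does not affect POSIX ACLs [3]: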

# zfs set aclinherit=passthrough <nameofzpool>
# zfs set acltype=posixacl <nameofzpool>
# zfs set xattr=sa <nameofzpool>

Databases

ZFS, unlike most other file systems, has a variable record size, or what is commonly referred to as a block size. By default, the recordsize on ZFS is 128KiB, which means it will dynamically allocate blocks of any size from 512B to 128KiB depending on the size of file being written. This can often help fragmentation and file access, at the cost that ZFS would have to allocate new 128KiB blocks each time only a few bytes are written to.

Most RDBMSes work in fixed-size blocks (pages) that are much smaller than 128KiB. The size is tunable for MySQL/MariaDB, PostgreSQL, and Oracle database, and the defaults differ: PostgreSQL and Oracle use 8KiB blocks by default, whereas InnoDB (MySQL/MariaDB) uses 16KiB pages, so check your specific DBMS before choosing a value. For both performance concerns and keeping snapshot differences to a minimum (for backup purposes, this is helpful), it is usually desirable to tune ZFS instead to accommodate the database, using a command such as:

# zfs set recordsize=8K <pool>/postgres
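Because the right recordsize depends on the database engine, a small lookup can keep the choice explicit. This is a sketch under stated assumptions: the function name recordsize_for_db is hypothetical, and the two values are the stock defaults of PostgreSQL (8 KiB blocks) and InnoDB (16 KiB pages); an installation built or configured with a different page size needs a different value.

```shell
# Hypothetical helper: print a recordsize matching the engine's default page size.
# Assumed defaults: PostgreSQL 8 KiB, InnoDB (MySQL/MariaDB) 16 KiB.
recordsize_for_db() {
    case "$1" in
        postgresql) echo "8K" ;;
        innodb)     echo "16K" ;;
        *)          echo "unknown engine: $1" >&2; return 1 ;;
    esac
}
```

The result can be substituted into the command above, for example zfs set recordsize="$(recordsize_for_db postgresql)" <pool>/postgres.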

These RDBMSes also tend to implement their own caching algorithm, often similar to ZFS's own ARC. In the interest of saving memory, it is best to simply disable ZFS's caching of the database's file data and let the database do its own job:

Note: L2ARC requires primarycache to function, because it is fed with data evicted from primarycache. If you intend to use the L2ARC, do not set the option below, otherwise no actual data will be cached on L2ARC.
# zfs set primarycache=metadata <pool>/postgres

ZFS uses the ZIL for crash recovery, but databases are often syncing their data files to the file system on their own transaction commits anyway. The end result of this is that ZFS will be committing data twice to the data disks, and it can severely impact performance. You can tell ZFS to prefer to not use the ZIL, and in which case, data is only committed to the file system once. However, doing so on non-solid state storage (e.g. HDDs) can result in decreased read performance due to fragmentation (OpenZFS Wiki) -- with mechanical hard drives, please consider using a dedicated SSD as ZIL rather than setting the option below. In addition, setting this for non-database file systems, or for pools with configured log devices, can also negatively impact the performance, so beware:

# zfs set logbias=throughput <pool>/postgres

These can also be done at file system creation time, for example:

# zfs create -o recordsize=8K \
             -o primarycache=metadata \
             -o mountpoint=/var/lib/postgres \
             -o logbias=throughput \
              <pool>/postgres

Please note: these kinds of tuning parameters are ideal for specialized applications like RDBMSes. You can easily hurt ZFS's performance by setting these on a general-purpose file system such as your /home directory.

/tmp

If you would like to use ZFS to store your /tmp directory, which may be useful for storing arbitrarily-large sets of files or simply keeping your RAM free of idle data, you can generally improve performance of certain applications writing to /tmp by disabling file system sync. This causes ZFS to ignore an application's sync requests (eg, with fsync or O_SYNC) and return immediately. While this has severe application-side data consistency consequences (never disable sync for a database!), files in /tmp are less likely to be important and affected. Please note this does not affect the integrity of ZFS itself, only the possibility that data an application expects on-disk may not have actually been written out following a crash.

# zfs set sync=disabled <pool>/tmp

Additionally, for security purposes, you may want to disable setuid and devices on the /tmp file system, which prevents some kinds of privilege-escalation attacks or the use of device nodes:

# zfs set setuid=off <pool>/tmp
# zfs set devices=off <pool>/tmp

Combining all of these for a create command would be as follows:

# zfs create -o setuid=off -o devices=off -o sync=disabled -o mountpoint=/tmp <pool>/tmp

Please note, also, that if you want /tmp on ZFS, you will need to mask (disable) systemd's automatic tmpfs-backed /tmp (tmp.mount), else ZFS will be unable to mount your dataset at boot-time or import-time.

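Transmitting snapshots with ZFS Send and ZFS Recv

By pairing zfs send with zfs recv, a ZFS snapshot can be transferred to an arbitrary target. The stream goes through standard output, so it can be sent to any file, device or network destination, and other programs can be placed in the pipeline to process it along the way.

Common use cases are shown below: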

Basic ZFS Send

First, create a snapshot of a ZFS file system:

# zfs snapshot zpool0/archive/books@snap

Then send the snapshot to a new location on a different zpool:

# zfs send -v zpool0/archive/books@snap | zfs recv zpool4/library

The contents of zpool0/archive/books@snap are now available in zpool4/library.

Tip: See man zfs-send and man zfs-recv for details on the command-line options.

Sending to and receiving from files

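First, create a snapshot of a ZFS file system:

# zfs snapshot zpool0/archive/books@snap

Write the snapshot to a gzip-compressed file:

# zfs send zpool0/archive/books@snap | gzip > /tmp/mybooks.gz
Warning: To preserve encryption when sending, pass the -w option to zfs send.

Restore the snapshot from the file:

# gzip -dc /tmp/mybooks.gz | zfs recv -F zpool0/archive/books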

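Sending over SSH

First, create a snapshot of a ZFS file system:

# zfs snapshot zpool1/filestore@snap

Next, pipe the "send" stream into an ssh session running "recv":

# zfs send -v zpool1/filestore@snap | ssh $HOST zfs recv coldstore/backups

The -v flag prints information about the data stream. If you use a password or a passphrase-protected key, you will be prompted to enter it.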

Incremental backups

You may wish to update a previously sent ZFS filesystem without retransmitting all of the data over again. Alternatively, it may be necessary to keep a filesystem online during a lengthy transfer and it is now time to send writes that were made since the initial snapshot.

First, create a snapshot of a ZFS file system:

# zfs snapshot zpool1/filestore@initial

Next, pipe the "send" stream into an ssh session running "recv":

# zfs send -v -R zpool1/filestore@initial | ssh $HOST zfs recv coldstore/backups

Once changes have been written, create another snapshot:

# zfs snapshot zpool1/filestore@snap2

Next, send only the difference between the local snapshots zpool1/filestore@initial and zpool1/filestore@snap2, which also creates the additional snapshot on the remote file system coldstore/backups. The -i option takes the older snapshot as its argument and is followed by the newer one:

# zfs send -v -R -i zpool1/filestore@initial zpool1/filestore@snap2 | ssh $HOST zfs recv coldstore/backups

Both zpool1/filestore and coldstore/backups now hold the @initial and @snap2 snapshots.
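When the incremental step is repeated regularly, it helps to generate the pipeline from its five variable parts. The sketch below only prints the command so it can be reviewed before being run; the function name incremental_send_cmd and its argument order are hypothetical.

```shell
# Hypothetical helper: print (not run) an incremental replication pipeline.
# $1 = source dataset, $2 = older snapshot name, $3 = newer snapshot name,
# $4 = ssh host, $5 = target dataset on the remote side
incremental_send_cmd() {
    printf 'zfs send -v -R -i %s@%s %s@%s | ssh %s zfs recv %s\n' \
        "$1" "$2" "$1" "$3" "$4" "$5"
}
```

For the example in this section, incremental_send_cmd zpool1/filestore initial snap2 backuphost coldstore/backups prints the pipeline with both snapshot names filled in, where backuphost stands for whatever $HOST is in your setup.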

You may want to make the latest snapshot the active file system on the remote host:

# zfs rollback coldstore/backups@snap2

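Tuning

General

ZFS pools and datasets can be further adjusted using parameters.

Note: All settable properties, with the exception of quotas and reservations, inherit their value from the parent dataset.

To retrieve the current parameter state of a pool:

# zfs get all <pool>

To retrieve the current parameter state of a specific dataset:

# zfs get all <pool>/<dataset>

To disable access time (atime), which is enabled by default:

# zfs set atime=off <pool>

To disable access time (atime) on a particular dataset:

# zfs set atime=off <pool>/<dataset>

Instead of turning atime off completely, you can use relatime. It gives ZFS the default ext4/XFS atime semantics: the access time is only updated if the modification or change time has changed, or if the access time has not been updated within the past 24 hours. It is a compromise between atime=off and atime=on. The property only takes effect when atime is on: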

# zfs set atime=on <pool>
# zfs set relatime=on <pool>
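The relatime rule can be written down as a small predicate. This is a model of the rule as described above (and as Linux applies it on other file systems), not ZFS's own code; the function name relatime_would_update is hypothetical and all arguments are timestamps in seconds since the epoch.

```shell
# Hypothetical model of the relatime rule: succeed if an access at time $4
# would update atime.
# $1 = current atime, $2 = mtime, $3 = ctime, $4 = now
relatime_would_update() {
    # update if the file was modified or changed since the last recorded access,
    # or if the recorded access time is at least 24 hours (86400 s) old
    [ "$2" -ge "$1" ] || [ "$3" -ge "$1" ] || [ $(( $4 - $1 )) -ge 86400 ]
}
```

A file read again shortly after its last access, with no modification in between, therefore causes no atime write, which is where the savings over atime=on come from.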

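Compression is transparent compression of data. ZFS supports several compression algorithms; lz4 is currently the default. gzip is better suited to data that is written infrequently and compresses well. See the OpenZFS Wiki for more information.

To enable compression:

# zfs set compression=on <pool>

To reset a property of a pool and/or dataset to its default, use zfs inherit:

# zfs inherit -rS atime <pool>
# zfs inherit -rS atime <pool>/<dataset>
Note: The -r flag resets the property recursively on all datasets in the zpool.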

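Scrubbing

Whenever ZFS detects an error while reading data, it silently repairs the data where possible, writes it back to disk and logs the event, giving you an overview of errors in the pool. ZFS has no fsck-like tool, but provides a feature called scrubbing, which walks all data in the pool and verifies that every block can be read.

To scrub a pool: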

# zpool scrub <pool>

To cancel a running scrub:

# zpool scrub -s <pool>

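How often should I do this?

From the Oracle blog post Disk Scrub - Why and When?:

This question is challenging for Support to answer, because the most accurate answer is "it depends". So before I offer a general guideline, here are a few tips to help you create an answer more tailored to your needs.
  • What is the expiration of your oldest backup? You should scrub your data at least as often as your oldest backups expire, so that you have a known-good restore point.
  • How often are you experiencing disk failures? While the recruitment of a hot-spare disk invokes a "resilver" -- a targeted scrub of just the VDEV which lost a disk -- you should probably scrub at least as often as you experience disk failures on average in your specific environment.
  • How often is the oldest piece of data on your disk read? You should scrub occasionally to prevent very old data from suffering bit rot without you knowing it.
If any of your answers to the above are "I do not know", the general guideline is: you should scrub your zpool at least once per month. This schedule works well for most use cases, leaves enough time for scrubs to complete on all but the most heavily loaded systems, and even on very large zpools (192+ disks) should complete more often than disks fail.

In the ZFS Administration Guide, Aaron Toponce advises scrubbing consumer disks once a week.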
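Whichever interval you settle on, the decision "is a scrub due?" is a simple comparison of timestamps. The sketch below shows it with the interval in days as a parameter; the function name scrub_due is hypothetical, and obtaining the time of the last scrub (for example from zpool status output) is left out on purpose.

```shell
# Hypothetical helper: succeed if at least $3 days have passed since the last scrub.
# $1 = time of last scrub, $2 = current time (both seconds since the epoch),
# $3 = desired interval in days (e.g. 30 for monthly, 7 for weekly)
scrub_due() {
    [ $(( $2 - $1 )) -ge $(( $3 * 86400 )) ]
}
```

With a 30-day interval, a scrub that finished 31 days ago counts as due, while one from 10 days ago does not.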

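Start with a service or timer

Note: Since OpenZFS 2.1.3, weekly and monthly systemd timers/services are provided. To use them, enable/start zfs-scrub-weekly@pool-to-scrub.timer or zfs-scrub-monthly@pool-to-scrub.timer for the pool in question.

A systemd timer/service can be used to scrub pools automatically.

To scrub a specific pool monthly: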

/etc/systemd/system/zfs-scrub@.timer
[Unit]
Description=Monthly zpool scrub on %i

[Timer]
OnCalendar=monthly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=multi-user.target

/etc/systemd/system/zfs-scrub@.service
[Unit]
Description=zpool scrub on %i

[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/usr/bin/zpool scrub %i

[Install]
WantedBy=multi-user.target

Enable/start the zfs-scrub@pool-to-scrub.timer unit to scrub a specific zpool monthly.

Enabling TRIM

To check whether your vdevs support TRIM, pass -t to zpool status to add TRIM information to its output:

$ zpool status -t tank
pool: tank
 state: ONLINE
  scan: none requested
 config:

	NAME                                     STATE     READ WRITE CKSUM
	tank                                     ONLINE       0     0     0
	  ata-ST31000524AS_5RP4SSNR-part1        ONLINE       0     0     0  (trim unsupported)
	  ata-CT480BX500SSD1_2134A59B933D-part1  ONLINE       0     0     0  (untrimmed)

errors: No known data errors

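ZFS can trim supported devices either on demand or periodically via the autotrim property.

To trim a zpool manually:

# zpool trim <zpool>

To enable automatic TRIM on all supported vdevs in a pool:

# zpool set autotrim=on <zpool>
Note: Since automatic TRIM works differently from zpool trim, you may still want to run a full trim manually from time to time.

To run a full zpool trim on a specific pool monthly with a systemd timer/service: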

/etc/systemd/system/zfs-trim@.timer
[Unit]
Description=Monthly zpool trim on %i

[Timer]
OnCalendar=monthly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=multi-user.target

/etc/systemd/system/zfs-trim@.service
[Unit]
Description=zpool trim on %i
Documentation=man:zpool-trim(8)
Requires=zfs.target
After=zfs.target
ConditionACPower=true
ConditionPathIsDirectory=/sys/module/zfs

[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/bin/sh -c '\
if /usr/bin/zpool status %i | grep "trimming"; then\
exec /usr/bin/zpool wait -t trim %i;\
else exec /usr/bin/zpool trim -w %i; fi'
ExecStop=-/bin/sh -c '/usr/bin/zpool trim -s %i 2>/dev/null || true'

[Install]
WantedBy=multi-user.target

Enable/start the zfs-trim@pool-to-trim.timer unit to trim a specific pool monthly.

SSD Caching

If your pool has no configured log devices, ZFS reserves space on the pool's data disks for its intent log (the ZIL). If your data disks are slow (e.g. HDD), it is highly recommended to place the ZIL on a separate log device (SLOG) backed by solid state drives for better write performance, and also to consider a layer 2 adaptive replacement cache (L2ARC). The process to add them is very similar to adding a new VDEV.

All of the below references to device-id are the IDs from /dev/disk/by-id/*.

ZIL

To add a mirrored ZIL:

 # zpool add <pool> log mirror <device-id-1> <device-id-2>

Or to add a single device ZIL:

 # zpool add <pool> log <device-id>

Because the ZIL device stores data that has not been written to the pool, it is important to use devices that can finish writes when power is lost. It is also important to use redundancy, since a device failure can cause data loss. In addition, the ZIL is only used for sync writes, so may not provide any performance improvement when your data drives are as fast as your ZIL drive(s).

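L2ARC

To add an L2ARC:

# zpool add <pool> cache <device-id>

L2ARC is a read-only cache, so it needs no redundancy. Since ZFS 2.0.0 the L2ARC persists across reboots. [4]

L2ARC is generally only useful when the hot data set is larger than system memory but small enough to fit into the L2ARC. The L2ARC is indexed by the ARC in system memory, consuming 70 bytes of memory per record (128KiB by default). The memory consumed can therefore be estimated as:

(L2ARC size) / (record size) * 70 bytes

Because the L2ARC takes memory away from the ARC, it can degrade storage performance in some cases.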
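To get a feel for the numbers, the formula can be evaluated with shell arithmetic. The function name l2arc_header_bytes is hypothetical; both sizes are given in bytes, and the record size defaults to 131072 (128 KiB) when omitted.

```shell
# Hypothetical helper: estimate the ARC memory used to index an L2ARC.
# $1 = L2ARC size in bytes, $2 = record size in bytes (default 131072 = 128 KiB)
l2arc_header_bytes() {
    # (L2ARC size) / (record size) * 70 bytes per record
    echo $(( $1 / ${2:-131072} * 70 ))
}
```

A 512 GiB L2ARC (549755813888 bytes) filled with 128 KiB records costs 293601280 bytes, i.e. 280 MiB of RAM. The same device filled with 8 KiB records costs 4697620480 bytes, about 4.4 GiB, which shows why small records make an L2ARC expensive.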

ZVOLs

ZFS volumes (ZVOLs) can suffer from the same block size-related issues as RDBMSes. Note that ZVOLs use the volblocksize property rather than recordsize; its default was 8 KiB for a long time and is 16 KiB since OpenZFS 2.2. If possible, it is best to align any partitions contained in a ZVOL to the volblocksize (current versions of fdisk and gdisk by default automatically align at 1MiB segments, which works), and file system block sizes to the same size. Other than this, you might choose a different volblocksize when creating the ZVOL (it cannot be changed afterwards) to accommodate the data inside it as necessary (though 8 KiB tends to be a good value for most file systems, even when using 4 KiB blocks on that level).

RAIDZ and Advanced Format physical disks

Each block of a ZVOL gets its own parity, and on physical media with sector sizes of 4096 B, 8192 B, and so on, the parity has to be stored in whole physical sectors. This can drastically increase the space requirements of a ZVOL, to 2× or more of the ZVOL's logical capacity. Setting the volblocksize to 16k or 32k can reduce this footprint considerably.

See OpenZFS issue #1807 for details.
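To make the overhead described above concrete, here is a rough sketch of how much raw space one ZVOL block (its volblocksize) occupies on a RAIDZ vdev. It assumes the allocation rule as commonly described for OpenZFS RAIDZ: data sectors, plus parity sectors for each stripe row, rounded up to a multiple of (parity + 1). The function name and the 5-disk raidz1 layout are illustrative assumptions, not taken from the issue.

```shell
# Raw bytes allocated on a RAIDZ vdev for one block.
# Usage: raidz_asize <block size in bytes> <ashift> <disks in vdev> <parity level>
# Assumption: mirrors the commonly described RAIDZ allocation rule.
raidz_asize() {
    block=$1
    ashift=$2
    cols=$3
    parity=$4
    # Number of data sectors needed for the block.
    sectors=$(( ((block - 1) >> ashift) + 1 ))
    # Add parity sectors: one set per stripe row of (cols - parity) data sectors.
    sectors=$(( sectors + parity * ((sectors + cols - parity - 1) / (cols - parity)) ))
    # Round up to a multiple of (parity + 1) sectors.
    sectors=$(( (sectors + parity) / (parity + 1) * (parity + 1) ))
    echo $(( sectors << ashift ))
}

# Hypothetical layout: 5-disk raidz1 with 4 KiB sectors (ashift=12).
for vbs in 8192 16384 32768; do
    asize=$(raidz_asize "$vbs" 12 5 1)
    echo "block=${vbs} allocated=${asize} ($(( asize * 100 / vbs ))%)"
done
# prints:
# block=8192 allocated=16384 (200%)
# block=16384 allocated=24576 (150%)
# block=32768 allocated=40960 (125%)
```

Under these assumptions an 8 KiB block costs twice its logical size, while 16 KiB and 32 KiB blocks cost 1.5× and 1.25× respectively.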

I/O Scheduler

While ZFS is expected to work well with modern schedulers, including mq-deadline and none, experimenting with manually setting the I/O scheduler on ZFS disks may yield performance gains. The ZFS recommendation is that "[...] users leave the default scheduler 'unless you're encountering a specific problem, or have clearly measured a performance improvement for your workload'".[5]

Troubleshooting

Creating a zpool fails

Pool creation may fail with the following error, which can be worked around:

# the kernel failed to rescan the partition table: 16
# cannot label 'sdc': try using parted(8) and then provide a specific slice: -1

One reason this can occur is that ZFS expects pool creation to take less than 1 second[6][7]. This is a reasonable assumption under ordinary conditions, but in many situations it may take longer. Each drive needs to be cleared again before another attempt can be made.

# parted /dev/sda rm 1
# parted /dev/sda rm 1
# dd if=/dev/zero of=/dev/sdb bs=512 count=1
# zpool labelclear /dev/sda

Creation can be retried by brute force, and with some luck the zpool creation will take less than 1 second. One cause of slow creation can be slow burst reads and writes on a drive. Reading from the disk in parallel with zpool creation may increase burst speeds.

# dd if=/dev/sda of=/dev/null

This can be done with multiple drives by saving the above command for each drive to a file on separate lines and running

# cat $FILE | parallel

Then run zpool creation at the same time.

ZFS is using too much RAM

By default, ZFS caches file operations (ARC) using up to half of available system memory on the host. To adjust the ARC size, add the following to the Kernel parameters list:

zfs.zfs_arc_max=536870912 # (for 512MiB)

If the default value of zfs_arc_min (1/32 of system memory) is higher than the specified zfs_arc_max, also add the following to the kernel parameters list:

zfs.zfs_arc_min=268435456 # (for 256MiB, needs to be lower than zfs.zfs_arc_max)

You may also want to increase zfs_arc_sys_free instead (in this example to 8GiB):

# echo $((8*1024**3)) > /sys/module/zfs/parameters/zfs_arc_sys_free

For a more detailed description, as well as other configuration options, see Gentoo:ZFS#ARC.

ZFS should release ARC memory as applications request more RAM, but some applications still get confused, and the amount of free RAM reported by the system does not account for reclaimable ARC. If all your applications work as intended and you have no problems, there is no need to change the ARC settings.
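The byte values passed to zfs_arc_max and zfs_arc_min above are plain MiB-to-byte conversions. A small sketch, with a helper name chosen purely for illustration, that derives them and prints the resulting kernel parameters:

```shell
# Convert a size in MiB to bytes (helper name is illustrative).
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

arc_max=$(mib_to_bytes 512)
arc_min=$(mib_to_bytes 256)

# zfs_arc_min must stay below zfs_arc_max, as noted above.
if [ "$arc_min" -ge "$arc_max" ]; then
    echo "zfs_arc_min must be lower than zfs_arc_max" >&2
else
    echo "zfs.zfs_arc_max=${arc_max} zfs.zfs_arc_min=${arc_min}"
fi
# prints: zfs.zfs_arc_max=536870912 zfs.zfs_arc_min=268435456
```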

No hostid found

The following message may appear at boot, before the initscript output:

ZFS: No hostid found on kernel command line or /etc/hostid.

This warning occurs because the ZFS module does not have access to the SPL hostid. There are two solutions. The first is to pass the SPL hostid as a kernel parameter in the boot loader, for example by adding spl.spl_hostid=0x00bab10c.

The other solution is to make sure that there is a hostid in /etc/hostid and then regenerate the initramfs image, which copies the hostid into the initramfs image.

Pool cannot be found while booting from SAS/SCSI devices

If you are booting from SAS/SCSI devices, you might occasionally get boot problems where the pool you are trying to boot from cannot be found. A likely reason is that your devices are initialized too late in the boot process, so ZFS cannot find any devices when it tries to assemble your pool.

In this case, force the SCSI driver to wait for devices to come online before continuing by putting the following into /etc/modprobe.d/zfs.conf:

/etc/modprobe.d/zfs.conf
options scsi_mod scan=sync

Afterwards, regenerate the initramfs.

This works because the zfs hook copies /etc/modprobe.d/zfs.conf into the initramfs when the image is built, so the option is applied during early boot.

On boot the zfs pool does not mount stating: "pool may be in use from other system"

Unexported pool

If the new installation does not boot because the zpool cannot be imported, chroot into the installation and properly export the zpool. See #Emergency chroot repair with archzfs.

Once inside the chroot environment, load the ZFS module and force import the zpool:

# zpool import -a -f

Now export the pool:

# zpool export <pool>

To see the available pools, use:

# zpool status

It is necessary to export a pool because of the way ZFS uses the hostid to track the system the zpool was created on. The hostid is generated partly from the network setup. During installation from the archiso, the network configuration may differ, producing a different hostid from the one in the new installation. Once the zfs filesystem is exported and then re-imported in the new installation, the hostid is reset. See Re: Howto zpool import/export automatically? - msg#00227.

If ZFS complains about "pool may be in use" after every reboot, properly export the pool as described above, and then regenerate the initramfs in the normally booted system.

Incorrect hostid

Double check that the pool is properly exported. Exporting the zpool clears the hostid marking its ownership, so the zpool should mount correctly during the first boot. If it does not, there is some other problem.

Reboot again. If the zfs pool refuses to mount, the hostid is not yet set correctly in the early boot phase, which confuses zfs. Tell zfs the correct number manually; once the hostid is consistent across reboots, the zpool will mount correctly.

Boot using zfs_force and write down the hostid. The value below is just an example.

$ hostid
0a0af0f8

This number has to be added to the kernel parameters as spl.spl_hostid=0x0a0af0f8. Another solution is to write the hostid into the initramfs image; see the installation guide[broken link: invalid section] explanation about this.

Users can always skip the check by adding zfs_force=1 to the kernel parameters, but this is not advisable as a permanent solution.
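The spl.spl_hostid parameter above is simply the output of hostid with a 0x prefix. A minimal sketch, with an illustrative helper name, that formats it so it can be pasted into the boot loader configuration:

```shell
# Format a hostid (as printed by hostid(1)) as an SPL kernel parameter.
# The helper name is illustrative, not part of ZFS.
hostid_kernel_param() {
    printf 'spl.spl_hostid=0x%s\n' "$1"
}

# Using the example value from the text:
hostid_kernel_param 0a0af0f8
# prints: spl.spl_hostid=0x0a0af0f8

# On the installed system, pass the real value instead:
#   hostid_kernel_param "$(hostid)"
```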

Devices have different sector alignment

Once a drive has become faulted, it should be replaced as soon as possible with an identical drive.

# zpool replace bigdata ata-ST3000DM001-9YN166_S1F0KDGY ata-ST3000DM001-1CH166_W1F478BD -f

but in this instance, the following error is produced:

cannot replace ata-ST3000DM001-9YN166_S1F0KDGY with ata-ST3000DM001-1CH166_W1F478BD: devices have different sector alignment

ZFS uses the ashift option to adjust for physical block size. When replacing the faulted disk, ZFS is attempting to use ashift=12, but the faulted disk is using a different ashift (probably ashift=9) and this causes the resulting error.

For Advanced Format disks with 4 KiB block size, an ashift of 12 is recommended for best performance. See OpenZFS FAQ: Performance Considerations and ZFS and Advanced Format disks.

Use zdb to find the ashift of the zpool, then use the -o option to set the ashift of the replacement drive:

# zpool replace bigdata ata-ST3000DM001-9YN166_S1F0KDGY ata-ST3000DM001-1CH166_W1F478BD -o ashift=9 -f
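The ashift value is the base-2 logarithm of the sector size, so 512-byte sectors correspond to ashift=9 and 4 KiB sectors to ashift=12. The sketch below shows that relationship and one way to pull the ashift values out of zdb output. Both helper names are illustrative, and the sample input only assumes that zdb prints lines of the form "ashift: N".

```shell
# Compute the ashift (log2) for a power-of-two sector size in bytes.
ashift_for_sector_size() {
    size=$1
    bits=0
    while [ "$size" -gt 1 ]; do
        size=$(( size / 2 ))
        bits=$(( bits + 1 ))
    done
    echo "$bits"
}

# Print the distinct ashift values found in zdb output read from stdin.
pool_ashift_values() {
    awk '$1 == "ashift:" { print $2 }' | sort -un
}

ashift_for_sector_size 512     # prints: 9
ashift_for_sector_size 4096    # prints: 12

# Sample input standing in for real zdb output:
printf '        ashift: 9\n        asize: 3000\n        ashift: 9\n' | pool_ashift_values
# prints: 9

# Against a real pool (requires ZFS): zdb -C | pool_ashift_values
```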

Check the zpool status for confirmation:

# zpool status -v
pool: bigdata
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Jun 16 11:16:28 2014
    10.3G scanned out of 5.90T at 81.7M/s, 20h59m to go
    2.57G resilvered, 0.17% done
config:

        NAME                                   STATE     READ WRITE CKSUM
        bigdata                                DEGRADED     0     0     0
        raidz1-0                               DEGRADED     0     0     0
            replacing-0                        OFFLINE      0     0     0
            ata-ST3000DM001-9YN166_S1F0KDGY    OFFLINE      0     0     0
            ata-ST3000DM001-1CH166_W1F478BD    ONLINE       0     0     0  (resilvering)
            ata-ST3000DM001-9YN166_S1F0JKRR    ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0KBP8    ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0JTM1    ONLINE       0     0     0

errors: No known data errors

Pool resilvering stuck/restarting/slow?

According to ZFS issue #840, this is a known issue since 2012 with ZFS-ZED which causes the resilvering process to constantly restart, sometimes get stuck and be generally slow for some hardware. The simplest mitigation is to stop zfs-zed.service until the resilver completes.

Fix slow boot caused by failed import of unavailable pools in the initramfs zpool.cache

Your boot time can be significantly impacted if you update your initramfs (e.g. during a kernel update) while additional, non-permanently attached pools are imported. These pools get added to the zpool.cache in your initramfs, and ZFS will attempt to import them on every boot, regardless of whether you have since exported them and removed them from your regular zpool.cache.

If you notice ZFS trying to import unavailable pools at boot, first check your zpool.cache for pools you do not want imported at boot:

$ zdb -C

If this command shows additional, currently unavailable pools, clear the zpool.cache of any pools other than the pool named zroot:

# zpool set cachefile=/etc/zfs/zpool.cache zroot

Sometimes there is no need to refresh your zpool.cache, and all you need to do is regenerate the initramfs.

ZFS Command History

ZFS logs changes to a pool's structure natively as a log of executed commands in a ring buffer (which cannot be turned off). The log may be helpful when restoring a degraded or failed pool.

# zpool history zpool
History for 'zpool':
2023-02-19.16:28:44 zpool create zpool raidz1 /scratch/disk_1.img /scratch/disk_2.img /scratch/disk_3.img
2023-02-19.16:31:29 zfs set compression=lz4 zpool
2023-02-19.16:41:45 zpool scrub zpool
2023-02-19.17:00:57 zpool replace zpool /scratch/disk_1.img /scratch/bigger_disk_1.img
2023-02-19.17:01:34 zpool scrub zpool
2023-02-19.17:01:42 zpool replace zpool /scratch/disk_2.img /scratch/bigger_disk_2.img
2023-02-19.17:01:46 zpool replace zpool /scratch/disk_3.img /scratch/bigger_disk_3.img
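When a history grows long, it can help to narrow it down to one kind of operation. The sketch below filters history text read from stdin for a given subcommand. The helper name is an illustrative assumption, and the sample lines are taken from the output above.

```shell
# Keep only history lines that contain the given command, e.g. "zpool replace".
# Reads `zpool history` output from stdin.
filter_history() {
    grep -F -- " $1 "
}

# Sample input copied from the history shown above:
filter_history "zpool replace" <<'EOF'
2023-02-19.16:41:45 zpool scrub zpool
2023-02-19.17:00:57 zpool replace zpool /scratch/disk_1.img /scratch/bigger_disk_1.img
2023-02-19.17:01:34 zpool scrub zpool
2023-02-19.17:01:42 zpool replace zpool /scratch/disk_2.img /scratch/bigger_disk_2.img
EOF
# prints the two "zpool replace" lines

# Against a real pool (requires ZFS): zpool history zpool | filter_history "zpool replace"
```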

Tips and tricks

Create an Archiso image with ZFS support

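The procedure for creating an Arch Linux live CD/DVD/USB image is described in Archiso. To add ZFS support to the image, you can either build the PKGBUILDs from the AUR yourself or add prebuilt packages from an unofficial user repository.

Building the ZFS packages from the AUR

Build the ZFS packages you need following the usual procedure. If you are unsure which packages you need, zfs-dkmsAUR and zfs-utilsAUR will work with most changes you make to the Archiso image. Next, create a local repository and add it to the Pacman configuration file of the new profile.

Add the built packages to the list of packages to be installed. The example below assumes you only want to install the zfs-dkmsAUR and zfs-utilsAUR packages: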

packages.x86_64
...
zfs-dkms
zfs-utils

If you add any DKMS package, make sure you also add the headers package for the kernel used by the ISO (linux-headers for the default kernel).

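Using the archzfs unofficial user repository

Add the archzfs unofficial user repository to the pacman.conf file of the new Archiso profile.

Add the archzfs-linux package group to the list of packages to be installed (the packages provided by the archzfs repository only support the x86_64 architecture):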

packages.x86_64
...
archzfs-linux
Note: If you later get an error when running modprobe zfs, add the linux-headers package to packages.x86_64.

Finishing up

Whichever method you used, the last step is to build the ISO image.

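Automatic snapshots

zrepl

The zreplAUR package provides a ZFS automatic replication service, which can also be used as a snapshotting service much like snapper.

For details on how to configure the zrepl daemon, see the zrepl documentation. The configuration file is located at /etc/zrepl/zrepl.yml. Once done, run zrepl configcheck to check the configuration file for syntax errors, then enable zrepl.service.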

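sanoid

sanoidAUR is a policy-driven snapshot tool. It also provides syncoid, which can be used to replicate snapshots, and ships with systemd services and a timer.

Sanoid only prunes snapshots on the local system. To prune snapshots on a remote system, run sanoid there with prune options: either the --prune-snapshots command line option, or the --cron command line option together with the autoprune = yes and autosnap = no configuration options.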

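ZFS Automatic Snapshot Service for Linux

Note: zfs-auto-snapshot-gitAUR has not been updated since 2019 and is limited in functionality. Using a more recent tool such as zreplAUR is recommended.

The zfs-auto-snapshot-gitAUR package provides a shell script that automatically creates and manages snapshots of all ZFS datasets, named by date and label (hourly, daily, etc.). The package also installs cron tasks that take snapshots every fifteen minutes, hourly, daily, weekly and monthly. You can adjust the --keep parameter to set how long snapshots are retained (the monthly script keeps data for one year by default).

To prevent a dataset from being snapshotted, set com.sun:auto-snapshot=false on it. Finer-grained control by label is also possible; for example, to skip monthly snapshots for a dataset, set com.sun:auto-snapshot:monthly=false.

Note: zfs-auto-snapshot-git does not create snapshots while a scrub is in progress. You can work around this by editing the provided systemd unit and removing --skip-scrub from the ExecStart line, but the consequences of doing so are unclear; please add details if you know them.

Once the package has been installed, enable and start the timers you want (zfs-auto-snapshot-{frequent,daily,weekly,monthly}.timer).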

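Creating a share

ZFS supports creating shares via NFS or SMB.

NFS

First, make sure NFS has been installed and configured on the system. Note that there is no need to edit /etc/exports. For NFS sharing, make sure nfs-server.service and zfs-share.service have been started.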

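To share a pool on the network:

# zfs set sharenfs=on nameofzpool

To share a dataset on the network:

# zfs set sharenfs=on nameofzpool/nameofdataset

To enable read/write access for specific IP ranges:

# zfs set sharenfs="rw=@192.168.1.100/24,rw=@10.0.0.0/24" nameofzpool/nameofdataset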
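The sharenfs value above is a comma-separated list with one rw=@network entry per allowed range. When the list of networks is long or generated, a sketch like the following can assemble the string. The helper name is an illustrative assumption; the output format follows the command above.

```shell
# Build a sharenfs option string granting read/write access to each network.
# Usage: sharenfs_rw <network/prefix>...
sharenfs_rw() {
    opts=""
    for net in "$@"; do
        # Append with a comma separator after the first entry.
        opts="${opts:+$opts,}rw=@${net}"
    done
    echo "$opts"
}

sharenfs_rw 192.168.1.100/24 10.0.0.0/24
# prints: rw=@192.168.1.100/24,rw=@10.0.0.0/24

# Applying it (requires ZFS):
#   zfs set sharenfs="$(sharenfs_rw 192.168.1.100/24 10.0.0.0/24)" nameofzpool/nameofdataset
```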

To check whether the dataset has been exported successfully:

# showmount -e `hostname`
Export list for hostname:
/path/of/dataset 192.168.1.100/24

To view the current exports in more detail:

# exportfs -v
/path/of/dataset
    192.168.1.100/24(sync,wdelay,hide,no_subtree_check,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)

To view the current NFS share list through ZFS:

# zfs get sharenfs

SMB

Note: SMB functionality is very limited. The usershare path must be /var/lib/samba/usershares and the only supported sharesmb options are on and off. Enabling guest access via sharesmb=guest_ok=y is not supported.

When sharing through SMB, enabling usershares in /etc/samba/smb.conf allows ZFS to set up and create the shares. See Samba#Enable Usershares for details.

/etc/samba/smb.conf
[global]
    usershare path = /var/lib/samba/usershares
    usershare max shares = 100
    usershare allow guests = yes
    usershare owner only = no

Create the usershares directory and set its permissions as root:

# mkdir /var/lib/samba/usershares
# chmod +t /var/lib/samba/usershares

To make a pool available on the network:

# zfs set sharesmb=on nameofzpool

To make a dataset available on the network:

# zfs set sharesmb=on nameofzpool/nameofdataset

To check if the dataset is exported successfully:

# smbclient -L localhost -U%
        Sharename       Type      Comment
        ---------       ----      -------
        IPC$            IPC       IPC Service (SMB Server Name)
        nameofzpool_nameofdataset        Disk      Comment: path/of/dataset
SMB1 disabled -- no workgroup available

To view the current SMB share list by ZFS:

# zfs get sharesmb

Encryption in ZFS using dm-crypt

Before OpenZFS version 0.8.0, ZFS did not support encryption directly (see #Native encryption). Instead, zpools can be created on dm-crypt block devices. Since the zpool is created on the plain-text abstraction, the data can be encrypted while keeping all the advantages of ZFS such as deduplication, compression, and data robustness. Furthermore, dm-crypt also encrypts the zpool's metadata, which native encryption inherently cannot do.[8]

dm-crypt, possibly via LUKS, creates devices in /dev/mapper with fixed names, so you only need to change the zpool create commands to point to those names. The idea is to configure the system to create the /dev/mapper block devices and import the zpools from there. Since zpools can span multiple devices (raid, mirroring, striping, ...), it is important that all the devices are encrypted; otherwise the protection might be partially lost.

For example, an encrypted zpool can be created using plain dm-crypt (without LUKS) with:

# cryptsetup open --type=plain --hash=sha256 --cipher=aes-xts-plain64 --offset=0 \
             --key-file=/dev/sdZ --key-size=512 /dev/sdX enc
# zpool create zroot /dev/mapper/enc

In the case of a root filesystem pool, the mkinitcpio.conf HOOKS line will enable the keyboard for the password, create the devices, and load the pools. It will contain something like:

HOOKS=(... keyboard encrypt zfs ...)

Since the /dev/mapper/enc name is fixed, no import errors will occur.

Creating encrypted zpools works fine. But if you need encrypted directories, for example to protect your users' homes, ZFS loses some functionality.

ZFS will see the encrypted data, not the plain-text abstraction, so compression and deduplication will not work. Encrypted data always has high entropy, which makes compression ineffective, and the same input produces different output (thanks to salting), which makes deduplication impossible. To reduce the unnecessary overhead, it is possible to create a sub-filesystem for each encrypted directory and use eCryptfs on it.

For example, to have an encrypted home (the two passwords, encryption and login, must be the same):

# zfs create -o compression=off -o dedup=off -o mountpoint=/home/<username> <zpool>/<username>
# useradd -m <username>
# passwd <username>
# ecryptfs-migrate-home -u <username>
<log in user and complete the procedure with ecryptfs-unwrap-passphrase>

Emergency chroot repair with archzfs

To get into the ZFS filesystem from a live system for maintenance, there are two options:

  1. Build a custom archiso with ZFS as described in #Create an Archiso image with ZFS support.
  2. Boot the latest official archiso and bring up the network. Then enable the archzfs repository inside the live system as usual, sync the pacman package database and install the archzfs-archiso-linux package.

To start the recovery, load the ZFS kernel modules:

# modprobe zfs

Import the pool:

# zpool import -a -R /mnt

Mount the boot partition and EFI system partition (if any):

# mount /dev/sda2 /mnt/boot
# mount /dev/sda1 /mnt/efi

Chroot into the ZFS filesystem:

# arch-chroot /mnt /bin/bash

Check the kernel version:

# pacman -Qi linux
# uname -r

uname will show the kernel version of the archiso. If they are different, run depmod (in the chroot) with the correct kernel version of the chroot installation:

# depmod -a 3.6.9-1-ARCH (version gathered from pacman -Qi linux but using the matching kernel modules directory name under the chroot's /lib/modules)

This regenerates the module dependency information for the kernel version installed in the chroot installation.

Regenerate the initramfs. There should be no errors.
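The version string passed to depmod above has to match a directory name under the chroot's /lib/modules. The sketch below lists those directory names so the right one can be picked; the helper name is an illustrative assumption.

```shell
# List kernel version directories under a modules root (default: /lib/modules).
list_module_dirs() {
    modules_root=${1:-/lib/modules}
    for dir in "$modules_root"/*/; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -d "$dir" ] || continue
        basename "$dir"
    done
}

# Inside the chroot, this prints one line per installed kernel:
list_module_dirs

# The chosen name can then be passed to depmod (requires root):
#   depmod -a "<version>"
```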

Bind mount

Here a bind mount from /mnt/zfspool to /srv/nfs4/music is created. The configuration ensures that the zfs pool is ready before the bind mount is created.

fstab

See systemd.mount(5) for more information on how systemd converts fstab into mount unit files with systemd-fstab-generator(8).

/etc/fstab
/mnt/zfspool		/srv/nfs4/music		none	bind,defaults,nofail,x-systemd.requires=zfs-mount.service	0 0

Monitoring / Mailing on Events

See ZED: The ZFS Event Daemon for more information.

An email forwarder, such as S-nail, is required to accomplish this. Test it to be sure it is working correctly.

Uncomment the following in the configuration file:

/etc/zfs/zed.d/zed.rc
 ZED_EMAIL_ADDR="root"
 ZED_EMAIL_PROG="mailx"
 ZED_NOTIFY_VERBOSE=0
 ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"

Update 'root' in ZED_EMAIL_ADDR="root" to the email address you want to receive notifications at.

If you are keeping your mailrc in your home directory, you can tell mail to get it from there by setting MAILRC:

/etc/zfs/zed.d/zed.rc
export MAILRC=/home/<user>/.mailrc

This works because ZED sources this file, so mailx sees this environment variable.

If you want to receive an email no matter the state of your pool, set ZED_NOTIFY_VERBOSE=1. You will need to do this temporarily to test.

Start and enable zfs-zed.service.

With ZED_NOTIFY_VERBOSE=1, you can test by running a scrub as root: zpool scrub <pool-name>.

Wrap shell commands in pre & post snapshots

Since snapshots are so cheap to make, they can serve as a safety net for sensitive commands such as system and package upgrades. If we take one snapshot before and one after, we can later diff these snapshots to find out what changed on the filesystem after the command executed. Furthermore, we can also roll back in case the outcome was not desired.

znp

E.g.:

# zfs snapshot -r zroot@pre
# pacman -Syu
# zfs snapshot -r zroot@post
# zfs diff zroot@pre zroot@post 
# zfs rollback zroot@pre

A utility that automates the creation of pre and post snapshots around a shell command is znp.

E.g.:

# znp pacman -Syu
# znp find / -name "something*" -delete

This creates snapshots before and after the supplied command, and also logs the command's output to a file for future reference, so that we know which command created the diff seen in a pair of pre/post snapshots.
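The pre/post pattern described in this section can also be sketched as a small shell function. This is not znp itself and does not reproduce its logging; the function name and the timestamped snapshot naming are assumptions made for illustration.

```shell
# Run a command between a recursive "pre" and "post" snapshot of a dataset.
# Usage: with_snapshots <dataset> <command> [args...]
# Illustrative sketch only; not the znp implementation.
with_snapshots() {
    dataset=$1
    shift
    stamp=$(date +%Y%m%d-%H%M%S)

    # Snapshot before running the command; stop if this fails.
    zfs snapshot -r "${dataset}@pre-${stamp}" || return 1

    # Run the wrapped command and remember its exit status.
    "$@"
    rc=$?

    # Snapshot afterwards, even if the command failed, so the change can be diffed.
    zfs snapshot -r "${dataset}@post-${stamp}" || return 1
    return "$rc"
}

# Example (requires ZFS and root):
#   with_snapshots zroot pacman -Syu
```

Keeping the same timestamp in both names makes it easy to find the matching pair later for zfs diff or zfs rollback.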

Remote unlocking of ZFS encrypted root

As of PR #261, archzfs supports SSH unlocking of natively-encrypted ZFS datasets. This section describes how to use this feature, and is largely based on dm-crypt/Specialties#Busybox based initramfs (built with mkinitcpio).

  1. Install mkinitcpio-netconf to provide hooks for setting up early user space networking.
  2. Choose an SSH server to use in early user space. The options are mkinitcpio-tinyssh or mkinitcpio-dropbear, and are mutually exclusive.
    1. If using mkinitcpio-tinyssh, it is also recommended to install tinyssh or tinyssh-convert-gitAUR. This tool converts an existing OpenSSH hostkey to the TinySSH key format, preserving the key fingerprint and avoiding connection warnings. The TinySSH and Dropbear mkinitcpio install scripts will automatically convert existing hostkeys when generating a new initcpio image.
  3. Decide whether to use an existing OpenSSH key or generate a new one (recommended) for the host that will be connecting to and unlocking the encrypted ZFS machine. Copy the public key into /etc/tinyssh/root_key or /etc/dropbear/root_key. When generating the initcpio image, this file will be added to authorized_keys for the root user and is only valid in the initrd environment.
  4. Add the ip= 內核參數 to your boot loader configuration. The ip string is highly configurable. A simple DHCP example is shown below.
    ip=:::::eth0:dhcp
  5. Edit /etc/mkinitcpio.conf to include the netconf, dropbear or tinyssh, and zfsencryptssh hooks before the zfs hook:
    HOOKS=(... netconf <tinyssh>|<dropbear> zfsencryptssh zfs ...)
  6. Regenerate the initramfs.
  7. Reboot and try it out!
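For step 4 above, the ip= string can also describe a static address instead of DHCP. The sketch below assembles the parameter, assuming the field order of the Linux kernel's ip= option (client IP, server IP, gateway, netmask, hostname, device, autoconf). The helper name, addresses and hostname are hypothetical.

```shell
# Assemble an ip= kernel parameter.
# Assumed field order: client-ip:server-ip:gateway-ip:netmask:hostname:device:autoconf
netconf_ip_param() {
    printf 'ip=%s:%s:%s:%s:%s:%s:%s\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# The DHCP example from the list above:
netconf_ip_param "" "" "" "" "" eth0 dhcp
# prints: ip=:::::eth0:dhcp

# A hypothetical static configuration:
netconf_ip_param 192.168.1.50 "" 192.168.1.1 255.255.255.0 zfsbox eth0 none
# prints: ip=192.168.1.50::192.168.1.1:255.255.255.0:zfsbox:eth0:none
```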

Changing the SSH server port

By default, mkinitcpio-tinyssh and mkinitcpio-dropbear listen on port 22. You may wish to change this.

For TinySSH, copy /usr/lib/initcpio/hooks/tinyssh to /etc/initcpio/hooks/tinyssh, and find/modify the following line in the run_hook() function:

/etc/initcpio/hooks/tinyssh
/usr/bin/tcpserver -HRDl0 0.0.0.0 <new_port> /usr/sbin/tinysshd -v /etc/tinyssh/sshkeydir &

For Dropbear, copy /usr/lib/initcpio/hooks/dropbear to /etc/initcpio/hooks/dropbear, and find/modify the following line in the run_hook() function:

/etc/initcpio/hooks/dropbear
 /usr/sbin/dropbear -E -s -j -k -p <new_port>

Regenerate the initramfs.

Unlocking from a Windows machine using PuTTY/Plink

First, use puttygen.exe to import and convert the OpenSSH key generated earlier into PuTTY's .ppk private key format. We will call it zfs_unlock.ppk for this example.

The mkinitcpio-netconf process above does not set up a shell (nor do we need one). However, because there is no shell, PuTTY will immediately close after a successful connection. This can be disabled in the PuTTY SSH configuration (Connection > SSH > [X] Do not start a shell or command at all), but it still does not allow us to see stdout or enter the encryption passphrase. Instead, we use plink.exe with the following parameters:

plink.exe -ssh -l root -i c:\path\to\zfs_unlock.ppk <hostname>

The plink command can be put into a batch script for ease of use.

Enabling bclone support

To use cp --reflink and other commands needing bclone support, it is necessary to upgrade the feature flags if coming from a version prior to 2.2.2. This allows the pool to support bclone. This is done with zpool upgrade, if the status of the pool shows this is possible.

It is also required to enable a module parameter, otherwise userspace apps will not be able to use this feature. You can do this by putting this into /etc/modprobe.d/zfs.conf:

/etc/modprobe.d/zfs.conf
options zfs zfs_bclone_enabled=1

Check that it is working, and how much space is being saved, with the command: zpool get all POOLNAME | grep clon

See also