ZFS messed up a bit

I was attempting to update my BIOS, so I'd unplugged all USB drives except for the one USB stick with the BIOS update.

Of course, the update didn't work (I think the Windows program creates the cryptographic key on the fly; the update USB stick I made had the cryptographic key found in the HP file, but it wasn't the correct one), so I'm going to use my wife's Windows computer to make a proper USB stick rather than the one I cobbled together by perusing HP's code.

But on to the point of this post... before I rebooted, I shut down and plugged all 7 drives back in the way they'd been connected before.

I wanted to see if ZFS handled it alright, so I did a sudo zpool status... the rpool had swapped out the 32 GB memory stick that I'd designated as a ZFS L2ARC cache drive for a 4 TB hard drive!

And that despite the fact that I used the proper designator as the device ID (rather than the device path and node, which can change). That designator (a combination of manufacturer and serial number) should be a symlink to the device path and node, so even if the device path and node changes (as it did in this case), ZFS connects the right drives to the right pools.

But something got switched up. I'm not sure whether the reason is that the original rpool cache drive's designator is really long (the bpool cache drive's designator is just as long, and it stayed on the right pool, so... ???).

From:
sudo ls -al /dev/disk/by-id
... one gets the proper drive designator to use. It's a combination of the manufacturer and the serial number... almost guaranteed to be a unique set of characters.
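For example (not a step I took above, just a quick sanity check), readlink shows which device node a given designator currently points to:

readlink -f /dev/disk/by-id/usb-USB_SanDisk_3.2Gen1_040191ba36b4278c904c8e14fbcd88246946ab2e0b32e810102d37fedccd5e584aa800000000000000000000fe160dd1ff938218815581074e2c9df1-0:0-part1
# prints the current node, e.g. /dev/sdc1; the node can change between boots, the by-id name shouldn't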

One then adds the drive to the pool thusly, using the proper designator:

sudo zpool add rpool cache usb-USB_SanDisk_3.2Gen1_040191ba36b4278c904c8e14fbcd88246946ab2e0b32e810102d37fedccd5e584aa800000000000000000000fe160dd1ff938218815581074e2c9df1-0:0-part1

I attempted to remove the L2ARC cache drive from the rpool after I'd discovered it was corrupted:

sudo zpool remove rpool cache usb-USB_SanDisk_3.2Gen1_040191ba36b4278c904c8e14fbcd88246946ab2e0b32e810102d37fedccd5e584aa800000000000000000000fe160dd1ff938218815581074e2c9df1-0:0-part1

cannot remove cache: no such device in pool
cannot remove usb-USB_SanDisk_3.2Gen1_040191ba36b4278c904c8e14fbcd88246946ab2e0b32e810102d37fedccd5e584aa800000000000000000000fe160dd1ff938218815581074e2c9df1-0:0-part1: no such device in pool

That means ZFS is getting the path and device node of the drive from the proper designator, then replacing the proper designator with the path and device node. So on the next reboot where the drive path and node assignment changes (which is going to happen when external USB drives are being plugged and unplugged, even if your zpool drives themselves are not unplugged), the pool is corrupted.

It shouldn't do that. It should continue to look at the designator, sym-linking that to find the device path and node, no matter how that device path and node changes.
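As an aside, if you want to see exactly which device path zpool has recorded for each vdev (rather than the shortened names), the -P flag prints full paths:

sudo zpool status -P rpool

(The output just below is from the plain zpool status, without -P.)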

  pool: rpool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: ZFS Message ID: ZFS-8000-4J
  scan: scrub repaired 0B in 0 days 00:04:13 with 0 errors on Sat Dec 3 19:04:17 2022

        NAME                                    STATE     READ WRITE CKSUM
        rpool                                   ONLINE       0     0     0
          5f52e75c-505f-9941-a9c4-da071f9836f0  ONLINE       0     0     0
        cache
          sdc1                                  FAULTED      0     0     0  corrupted data

But here's the thing... if I attempt to remove sdc1 from the rpool, it says:

sudo zpool remove rpool cache sdc1
cannot remove cache: no such device in pool

But after attempting to remove it by the proper designator and by the device path and node, it does indeed get removed, despite zpool's protestations.
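For what it's worth, zpool remove appears to want just the pool name and the device name exactly as zpool status shows it (the cache keyword is only used with zpool add), so the removal that matches the output above would look something like this:

sudo zpool remove rpool sdc1
sudo zpool status rpool    # confirm the cache device is gone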

I then add it back in and SCRUB it, whereupon I get a healthy report:

sudo zpool status

  pool: bpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:00:06 with 0 errors on Sat Dec 3 19:32:08 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        bpool                                     ONLINE       0     0     0
          c0714ccb-bc6f-5a4c-80bd-46777d248b07    ONLINE       0     0     0
        cache
          usb-USB_SanDisk_3.2Gen1_040109ab39a31603e14cf811677fa184acc1d2f9da793b38870d29dd2ded3103020300000000000000000000bc9cab2500826a188155810768ad7a27-0:0-part1  ONLINE  0  0  0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:02:30 with 0 errors on Sat Dec 3 19:34:37 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        rpool                                     ONLINE       0     0     0
          5f52e75c-505f-9941-a9c4-da071f9836f0    ONLINE       0     0     0
        cache
          usb-USB_SanDisk_3.2Gen1_040191ba36b4278c904c8e14fbcd88246946ab2e0b32e810102d37fedccd5e584aa800000000000000000000fe160dd1ff938218815581074e2c9df1-0:0-part1  ONLINE  0  0  0

errors: No known data errors

So there's a code glitch in zpool.

The strange thing is, there's an identical drive (but with a different serial number and thus a different proper designator) set up as an L2ARC cache drive on the bpool, and it's been rock-solid.

I'm going to try it a different way. Instead of using sudo ls -al /dev/disk/by-id to find the proper designator, I'm going to use the partition UUID (PARTUUID), which blkid reports:

sudo blkid
/dev/sda2: UUID="5405d924-0153-498f-b466-9cf91df59f81" TYPE="swap" PARTUUID="57f335ca-40e4-cb42-bccf-bb24519812a6"

/dev/sda1: UUID="DBE5-80D4" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="7ab5ce65-2f43-43f4-b813-aeb1ba251aef"

/dev/sda3: LABEL="bpool" UUID="1760198046083221750" UUID_SUB="12635589907017844748" TYPE="zfs_member" PARTUUID="c0714ccb-bc6f-5a4c-80bd-46777d248b07"

/dev/sda4: LABEL="rpool" UUID="1542370796205579292" UUID_SUB="6254009844733705338" TYPE="zfs_member" PARTUUID="5f52e75c-505f-9941-a9c4-da071f9836f0"

/dev/sdb1: LABEL_FATBOOT="EFI" LABEL="EFI" UUID="67E3-17ED" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="3de5981c-50d7-4273-a5a3-3788dd7fc06b"

/dev/sdb2: LABEL="Backup Plus" UUID="5ED5-6F3F" TYPE="exfat" PARTUUID="c5caf1fd-9b23-4c67-afd7-0770fe33db8e"

/dev/sdc1: UUID_SUB="18248089801033944235" TYPE="zfs_member" PARTLABEL="zfs-dd67f5ea3e0b1a1b" PARTUUID="112ff53e-6b95-874a-bb76-a2a3d1978ebf"

/dev/sdd1: UUID_SUB="14019601759092750984" TYPE="zfs_member" PARTLABEL="zfs-2be3f409caf9df9e" PARTUUID="8bba85ae-6f8d-1f4a-9251-1082cbe4c197"

/dev/sde1: LABEL="WinBackUp" UUID="43F3679C1F30B96E" TYPE="ntfs" PTTYPE="dos" PARTUUID="cbabdc98-01"

/dev/sdf1: LABEL_FATBOOT="HP_TOOLS" LABEL="HP_TOOLS" UUID="E732-4098" TYPE="vfat" PARTUUID="14ddedfc-01"

/dev/sdg1: LABEL_FATBOOT="OldRB" LABEL="OldRB" UUID="2023-08B2" TYPE="vfat" PARTUUID="d90c4c30-01"

/dev/sdc9: PARTUUID="b6295787-c695-0d4d-a8d1-20f8d34862bf"

/dev/sdd9: PARTUUID="f3dfc00f-700e-f34c-a54f-8c07063be336"

sdc1 = 112ff53e-6b95-874a-bb76-a2a3d1978ebf

sdd1 = 8bba85ae-6f8d-1f4a-9251-1082cbe4c197
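Those PARTUUIDs also show up as symlinks under /dev/disk/by-partuuid, which is a quick way to double-check the mapping before adding anything:

ls -l /dev/disk/by-partuuid/112ff53e-6b95-874a-bb76-a2a3d1978ebf /dev/disk/by-partuuid/8bba85ae-6f8d-1f4a-9251-1082cbe4c197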

So we want to add sdc1 as an L2ARC cache drive on the bpool:

sudo zpool add bpool cache 112ff53e-6b95-874a-bb76-a2a3d1978ebf

And we want to add sdd1 as an L2ARC cache drive on the rpool:

sudo zpool add rpool cache 8bba85ae-6f8d-1f4a-9251-1082cbe4c197

Now... we'll see if ZFS is stable with this configuration.

That gives:

sudo zpool status

  pool: bpool
 state: ONLINE

        NAME                                      STATE     READ WRITE CKSUM
        bpool                                     ONLINE       0     0     0
          c0714ccb-bc6f-5a4c-80bd-46777d248b07    ONLINE       0     0     0
        cache
          112ff53e-6b95-874a-bb76-a2a3d1978ebf    ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE

        NAME                                      STATE     READ WRITE CKSUM
        rpool                                     ONLINE       0     0     0
          5f52e75c-505f-9941-a9c4-da071f9836f0    ONLINE       0     0     0
        cache
          8bba85ae-6f8d-1f4a-9251-1082cbe4c197    ONLINE       0     0     0

errors: No known data errors

zpool iostat -v

                                           capacity     operations     bandwidth
pool                                     alloc   free   read  write   read  write
---------------------------------------  -----  -----  -----  -----  -----  -----
bpool                                     487M  1.40G      0      0  4.43K  11.1K
  c0714ccb-bc6f-5a4c-80bd-46777d248b07    487M  1.40G      0      0  4.43K  11.1K
cache                                        -      -      -      -      -      -
  112ff53e-6b95-874a-bb76-a2a3d1978ebf    500K  28.7G      0      0  1.16K  1.33K
---------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                    9.91G   910G     49     33  1.52M  1.01M
  5f52e75c-505f-9941-a9c4-da071f9836f0   9.91G   910G     49     33  1.52M  1.01M
cache                                        -      -      -      -      -      -
  8bba85ae-6f8d-1f4a-9251-1082cbe4c197    607M  28.1G      0     11  13.6K  1.47M
---------------------------------------  -----  -----  -----  -----  -----  -----

I'm looking into ways of 'warming up' the L2ARC caches, given the small size of the bpool. That would force more files onto the cache drives, so the combined bandwidth of the main drive and the cache drives delivers higher data throughput.
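The simplest approach I've come up with (my own assumption, not something out of the ZFS docs) is just to read the files I care about so they land in the ARC, which is what feeds the L2ARC:

# read everything on the boot pool so it can migrate into bpool's L2ARC
sudo find /boot -type f -exec cat {} + > /dev/null

# same idea for frequently used data on the root pool (adjust the path to taste)
find ~/Documents -type f -exec cat {} + > /dev/null

Note that with l2arc_noprefetch left at its default of 1, prefetched (sequential) reads mostly bypass the L2ARC unless that parameter is set to 0.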


I'm not too familiar with ZFS as I haven't used it, but I do know it has issues when used in conjunction with hardware RAID. Are you using a RAID card or onboard mobo RAID features?

No, no RAID card or onboard mobo RAID features. Zorin OS running on a laptop. One main drive of 1 TB, plus two 32 GB memory sticks acting as ZFS L2ARC persistent cache drives: one for the boot pool (bpool), one for the root pool (rpool).

I tested the new configuration... I shut down, moved the rpool L2ARC cache drive to another USB plug, then rebooted.

That would normally result (in both Windows and Linux) in a different node being assigned to that drive; in Windows, the drive letter would change. Sure enough, the node changed from /dev/sdc1 to /dev/sdf1.

I started Terminal and issued sudo zpool status... all was well, no problems whatsoever. ZFS had correctly connected to the UUID of the correct partition (the PARTUUID), no matter the drive path (/dev/) and node (sd??).

You can see all the attributes of ZFS by:
sudo zfs get all bpool
sudo zfs get all rpool

I set:
sudo zfs set sync=always bpool
sudo zfs set sync=always rpool

That ensures the data is safely committed to stable storage on the rust pool (the spinning disk) itself, via the ZFS intent log, before the write is reported as complete. (The L2ARC is a read cache, so it isn't part of the write path.)
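You can confirm what the pools (and their child datasets, which inherit the setting) are currently using with:

sudo zfs get sync bpool rpool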

[EDIT]
That slows things down by quite a lot, so if you don't absolutely need it for data security, don't use it. Set it back to normal:
sudo zfs set sync=standard bpool
sudo zfs set sync=standard rpool
[/EDIT]

I set:
sudo zfs set atime=off bpool
sudo zfs set atime=off rpool

For each file access, ZFS records the access time, and that access time must then be written to disk, which of course takes time. Disabling atime improves I/O performance on file systems with lots of small files that are accessed frequently. (The gentler alternative is relatime: with atime=on and relatime=on, the access time is only updated if it is more than 24 hours old or if the file has been modified since it was last read, rather than on every access.)
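If you'd rather keep access times but avoid most of that write overhead, relatime is also available as a ZFS property; roughly:

sudo zfs set atime=on rpool
sudo zfs set relatime=on rpool
sudo zfs get atime,relatime rpool    # verify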

sudo gedit
Browse to /etc/logrotate.d.

For each file in that directory, remove (or comment out) the compress option, then save the file.

https://openzfs.github.io/openzfs-docs/Getting%20Started/Ubuntu/Ubuntu%2022.04%20Root%20on%20ZFS.html

As /var/log is already compressed by ZFS, logrotate’s compression is going to burn CPU and disk I/O for (in most cases) very little gain. Also, if you are making snapshots of /var/log , logrotate’s compression will actually waste space, as the uncompressed data will live on in the snapshot. You can edit the files in /etc/logrotate.d by hand to comment out compress...
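Rather than editing every file in /etc/logrotate.d by hand, something like this should comment the option out in one pass (the .bak backups are my own addition; check the results before removing them):

sudo sed -i.bak -E 's/^([[:space:]]*)compress$/\1#compress/' /etc/logrotate.d/*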

sudo gedit /sys/module/zfs/parameters/l2arc_noprefetch
Change 1 to 0
Save the file

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-noprefetch

Setting to 0 can increase L2ARC hit rates for workloads where the ARC is too small for a read workload that benefits from prefetching. Also, if the main pool devices are very slow, setting to 0 can improve some workloads such as backups.
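If gedit balks at writing into /sys (it sometimes does), tee does the same job; either way the change only lasts until the next reboot:

echo 0 | sudo tee /sys/module/zfs/parameters/l2arc_noprefetch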

sudo gedit /sys/module/zfs/parameters/l2arc_headroom

I've set this to 10. Combined with the default l2arc_write_max of 8 MB, that lets the L2ARC feed from roughly the 80 MB tail-end of the ARC.

The default is 2. With the default setting of l2arc_write_max = 8388608 (8 MB), that means the L2ARC cache can only read the 16 MB tail-end of the ARC to get the files which populate the L2ARC.

We don't want to increase l2arc_write_max when using flash memory sticks as L2ARC cache drives, because that would wear them out more quickly. But we do want to populate the L2ARC cache more, so increasing l2arc_headroom is called for.
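To make both module-parameter changes survive a reboot, the usual route on Ubuntu-based systems (an assumption on my part; adjust if you already keep your ZFS options in another file) is a modprobe options file:

echo "options zfs l2arc_noprefetch=0" | sudo tee -a /etc/modprobe.d/zfs.conf
echo "options zfs l2arc_headroom=10" | sudo tee -a /etc/modprobe.d/zfs.conf
sudo update-initramfs -u    # root-on-ZFS loads the zfs module from the initramfs, so rebuild it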
