I recently moved my files to a new zfs-pool and used that chance to properly configure my datasets.

This led me to discovering zfs-deduplication.

As most of my storage is used by my jellyfin library (~7-8Tb), which is mostly uncompressed bluray rips I thought I might be able to save some storage using deduplication in addition to compression.

Has anyone here used that for similar files before? What was your experience with it?

I am not too worried about performance. The dataset in question is rarely changed. Basically only when I add more media every couple of months. I also have overshot my cpu-target when originally configuring my server so there is a lot of headroom there. I have 32Gb of ram which is not really fully utilized either (but I also would not mind upgrading to 64 too much).

My main concern is that I am unsure it is useful. I suspect just because of the amount of data and similarity in type there would statistically be a lot of block-level duplication but I could not find any real world data or experiences on that.

  • greyfox@lemmy.world
    link
    fedilink
    English
    arrow-up
    3
    ·
    9 days ago

    Like most have said it is best to stay away from ZFS deduplication. Especially if your data set is media the chances of an entire ZFS block being the same as any other is small unless you somehow have multiple copies of the same content.

    Imagine two mp3s with the exact same music content but with slightly different artist metadata. A single bit longer or shorter at the beginning of the file and even if the file spans multiple blocks ZFS won’t be able to duplicate a single byte. A single bit offsetting the rest of the file just a little is enough to throw off the block checksums across every block in the file.

    To contrast with ZFS, enterprise backup/NAS appliances with deduplication usually do a lot more than block level checks. They usually check for data with sliding window sizes/offsets to find more duplicate data.

    There are still some use cases where ZFS can help. Like if you were doing multiple full backups of VMs. A VM image has a fixed size so the offset issue above isn’t an issue, but if beware that enabling deduplication for even a single ZFS filesystem affects the entire pool, even ZFS filesystems that have deduplication disabed. The deduplication table is global for the pool and once you have turned it on you really can’t get rid of it. If you get into a situation where you don’t have enough memory to keep the deduplication table in memory ZFS will grind to a halt and the only way to completely remove deduplication is to copy all of your data to a new ZFS pool.

    If you think this feature would still be useful for you, you might want to wait for 2.3 to release (which isn’t too far off) for the new fast dedup feature which fixes or at least prevents a lot of the major issues with ZFS dedup

    More info on the fast dedup feature here https://github.com/openzfs/zfs/discussions/15896