Bug 2350 - ZFS->ZFS storage_migrate does not work if zfs feature@encryption=enabled
Summary: ZFS->ZFS storage_migrate does not work if zfs feature@encryption=enabled
Status: ASSIGNED
Alias: None
Product: pve
Classification: Unclassified
Component: Storage
Version: 6
Hardware: PC Linux
Importance: --- bug
Assignee: Bugs
URL: https://github.com/zfsonlinux/zfs/iss...
Duplicates: 2351
Depends on:
Blocks: 2714 4857
Reported: 2019-08-30 10:08 CEST by yakuraku
Modified: 2024-04-25 08:35 CEST (History)
CC List: 28 users

See Also:


Description yakuraku 2019-08-30 10:08:28 CEST
When replicating from one PVE node to another, from and to an encrypted ZFS dataset, the process is cancelled with the message:

cannot send rpool/disks/vm-100-disk-0@__replicate_100-0_1566629820__: encrypted dataset rpool/disks/vm-100-disk-0 may not be sent with properties without the raw flag
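For reference, the failing operation boils down to a non-raw send of an encrypted dataset. A minimal sketch, reusing the dataset and snapshot names from the message above:

# fails with the error above: property-preserving (-R/-p) sends of encrypted
# datasets are rejected unless the raw flag is given
zfs send -Rpv rpool/disks/vm-100-disk-0@__replicate_100-0_1566629820__ > /dev/null

# the same send is accepted as a raw stream (-w), i.e. the ciphertext is sent as-is
zfs send -Rpvw rpool/disks/vm-100-disk-0@__replicate_100-0_1566629820__ > /dev/null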
Comment 1 Fabian Grünbichler 2019-09-04 12:46:32 CEST
*** Bug 2351 has been marked as a duplicate of this bug. ***
Comment 2 Fabian Grünbichler 2019-09-04 12:50:20 CEST
merged both bugs and adapted title and component accordingly.

sending regular streams with -p should be feasible, but requires patching ZFS code. we are currently investigating the needed changes and will propose them upstream.

see https://github.com/zfsonlinux/zfs/issues/8169
Comment 3 Mark Rottler 2020-05-29 20:25:54 CEST
I am also running into this with Proxmox 6.2. I cannot find the upstream ZFS issue you linked. Has there been any progress on this? Any new upstream issue ID?
Comment 4 Fabian Grünbichler 2020-06-02 09:16:41 CEST
it seems like this issue was lost when upstream moved their github repository. AFAIK there hasn't been any progress on this front yet.
Comment 5 Juho Ylikorpi 2020-06-06 10:18:52 CEST
Would it be possible to just use a keyfile and synchronize it between hosts?

The encrypted dataset is created with:

zfs create -o encryption=aes-128-gcm -o keyformat=raw -o keylocation=file:///zfs-encrypt/key zfs-pool1/encrypted

...and a script creates a tmpfs at /zfs-encrypt and synchronizes /zfs-encrypt/key between hosts on boot (via a systemd service).
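A rough sketch of such an on-boot sync (hypothetical; the peer host name, tmpfs size and the use of scp are examples, not part of the actual setup):

mkdir -p /zfs-encrypt
mount -t tmpfs -o size=1M,mode=0700 tmpfs /zfs-encrypt
scp root@peer-node:/zfs-encrypt/key /zfs-encrypt/key
chmod 0400 /zfs-encrypt/key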

This approach only needs the following modifications to the code:

--- /usr/share/perl5/PVE/Storage/ZFSPoolPlugin.pm.orig  2020-01-28 21:17:44.331998473 +0200
+++ /usr/share/perl5/PVE/Storage/ZFSPoolPlugin.pm       2020-01-28 22:45:12.344902193 +0200
@@ -718,7 +718,7 @@
     # For zfs we always create a replication stream (-R) which means the remote
     # side will always delete non-existing source snapshots. This should work
     # for all our use cases.
-    my $cmd = ['zfs', 'send', '-Rpv'];
+    my $cmd = ['zfs', 'send', '-Rpvw'];
     if (defined($base_snapshot)) {
        my $arg = $with_snapshots ? '-I' : '-i';
        push @$cmd, $arg, $base_snapshot;
@@ -771,6 +771,9 @@
        die $err;
     }

+    eval { run_command(['zfs', 'set', 'keylocation=file:///zfs-encrypt/key', $zfspath]) };
+    eval { run_command(['zfs', 'load-key', $zfspath]) };
+
     return;
 }
Comment 6 Fabian Grünbichler 2020-06-08 08:58:55 CEST
(In reply to Juho Ylikorpi from comment #5)
> Would it be possible to just use keyfile and syncronize it between hosts?
> 
> Pool is created with,
> 
> zfs create -o encryption=aes-128-gcm -o keyformat=raw -o
> keylocation=file:///zfs-encrypt/key zfs-pool1/encrypted
> 
> ...and script is creating tmpfs /zfs-encrypt & syncronizing /zfs-encrypt/key
> between hosts on boot (by systemd service).
> 
> This approach needs only following modifications to code,
> 
> --- /usr/share/perl5/PVE/Storage/ZFSPoolPlugin.pm.orig  2020-01-28
> 21:17:44.331998473 +0200
> +++ /usr/share/perl5/PVE/Storage/ZFSPoolPlugin.pm       2020-01-28
> 22:45:12.344902193 +0200
> @@ -718,7 +718,7 @@
>      # For zfs we always create a replication stream (-R) which means the
> remote
>      # side will always delete non-existing source snapshots. This should
> work
>      # for all our use cases.
> -    my $cmd = ['zfs', 'send', '-Rpv'];
> +    my $cmd = ['zfs', 'send', '-Rpvw'];
>      if (defined($base_snapshot)) {
>         my $arg = $with_snapshots ? '-I' : '-i';
>         push @$cmd, $arg, $base_snapshot;
> @@ -771,6 +771,9 @@
>         die $err;
>      }
> 
> +    eval { run_command(['zfs', 'set',
> 'keylocation=file:///zfs-encrypt/key', $zfspath]) };
> +    eval { run_command(['zfs', 'load-key', $zfspath]) };
> +
>      return;
>  }

it's unfortunately not that easy, since we can't just switch from regular to raw sends for existing replications.
Comment 7 2b 2020-06-08 21:23:36 CEST
If we just remove the -Rp flags from the zfs send command in ZFSPoolPlugin.pm, it seems replication starts working without the --raw flag, and there is no need to load the same encryption keys on the destination node. Are there any drawbacks to such a workaround?

As stated in the zfs send man page, the -p flag is implicit when -R is specified. So the only question is how important the -R flag is for Proxmox's migration functionality.
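For illustration, the workaround amounts to a plain (non-recursive, non-raw) incremental send, which is accepted for encrypted datasets because no properties are included in the stream. A sketch with example dataset/snapshot/host names, assuming @base already exists on the target:

# data is decrypted on the source; on the target it is stored according to the
# properties (including encryption) of the receiving dataset's parent
zfs send -v -I rpool/data/vm-100-disk-0@base rpool/data/vm-100-disk-0@current \
    | ssh target-node zfs receive rpool/data/vm-100-disk-0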
Comment 8 Fabian Grünbichler 2020-06-09 09:29:25 CEST
(In reply to 2b from comment #7)
> If we just remove -Rp flags from zfs send command in ZFSPoolPlugin.pm, it
> seems replication starts work without --raw flag, and no need to load same
> encryption keys on the destination node. Is there any drawbacks for such
> workaround?
> 
> As stated in zfs send's man pages, -p flag is implicit when -R is specified.
> So the only question is, how important -R flag is for proxmox's migration
> functional.

very important unfortunately, as we need to keep properties like volblocksize, quotas, etc. in sync. we also want to send intermediate snapshots, and to delete no-longer-existing snapshots on the target side. I think unless we find an easy way to fix this on the ZFS side (e.g., to get the behaviour of -R split into multiple flags), the only other option would be to detect encryption and switch to a 'manual -Rp' mode if it's enabled:

- read out list of properties
- filter out those we don't want to transfer (e.g. encryption related, read-only, ...)
- send with -I and a list of -o settings generated from previous steps
- read out list of snapshots
- delete snapshots no longer existing on source node which exist on target node

obviously this is much more involved (with more potentially blocking/load generating calls to 'zfs'), and I am also not sure how properties of intermediate snapshots are handled or whether we need to sync each one as a separate incremental stream, which would complicate error handling.
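to make the above concrete, a rough shell sketch of such a 'manual -Rp' mode (purely illustrative - the dataset names, the property filter and the receive-side -o handling are assumptions, not existing PVE code):

src=rpool/data/vm-100-disk-0
dst=target-pool/data/vm-100-disk-0

# steps 1+2: read locally-set properties, drop the ones we must not transfer
zfs get -H -o property,value -s local all "$src" \
    | awk '$1 !~ /^(encryption|keyformat|keylocation|pbkdf2iters|readonly)$/' > props.txt

# step 3: non-recursive incremental send of all intermediate snapshots,
# re-applying the filtered properties on the receive side via -o
# (values containing whitespace would need extra quoting, omitted here)
zfs send -v -I "$src@base" "$src@current" \
    | ssh target-node zfs receive $(awk '{printf "-o %s=%s ", $1, $2}' props.txt) "$dst"

# steps 4+5: compare 'zfs list -t snapshot' output on both sides and destroy
# snapshots on the target that no longer exist on the source (not shown)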

@Fabi: any plans of looking at this again in the near future? or should we re-assign?
Comment 9 Fiona Ebner 2020-06-09 09:43:28 CEST
> @Fabi: any plans of looking at this again in the near future? or should we
> re-assign?

I'm busy with other stuff for now, so feel free to re-assign.
Comment 10 Stoiko Ivanov 2020-06-09 13:34:44 CEST
Short summary from an off-list/tracker discussion:

if trying to implement this inside ZFS
* maybe the simplest way would be to add an extra flag on the receive side, which excludes the encryption properties (zfs receive has the -x flag for this, but it is not implemented in combination with -R)
* else the splitting of -R would need:
** one option to sync the snapshots on both sides
** one option to implement a recursive descent
** -p is already present for the properties

I'll try to get this in some presentable shape
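for reference, the receive-side exclusion mentioned above would conceptually look like this (a sketch of the desired end state only - this combination is exactly what is not supported today for encrypted datasets):

zfs send -Rpv rpool/data/vm-100-disk-0@snap \
    | ssh target-node zfs receive -x encryption -x keyformat -x keylocation target-pool/data/vm-100-disk-0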
Comment 11 Andreas Palm 2020-09-14 12:41:25 CEST
As there isn't any progress on this upstream yet (please correct me if I'm wrong), I would like to pick up the idea of "Juho Ylikorpi" again.

According to the ZFS man page, "If the -R flag is used to send encrypted datasets, then -w must also be specified". So we could add the -w flag if and only if the dataset to be exported is encrypted. This would lead to a working storage migration.
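A possible detection, sketched in shell (hypothetical - not the actual ZFSPoolPlugin.pm change, and the dataset/snapshot names are examples):

enc=$(zfs get -H -o value encryption rpool/data/vm-100-disk-0)
if [ "$enc" = "off" ]; then
    flags="-Rpv"
else
    flags="-Rpvw"   # raw send, required when combining -R with encrypted datasets
fi
zfs send $flags rpool/data/vm-100-disk-0@__replicate_100-0_1566629820__ > /dev/null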

Of course this solution has many drawbacks:
- the dataset is replicated raw, so any further sends have to be raw; if a better solution exists in the future, it could be necessary to destroy and recreate such replicas
- on the target, the dataset will form a dedicated encryptionroot which needs explicit key-loading

Whether this solution is feasible depends on the use case. I would prefer a solution like this until a better one is available, because this way I lose some usability but at least have my datasets replicated. In case the replicated node fails, I can bring up the replicas much faster than with no replication at all.

Maybe we could make this solution only available via an opt-in option in pvesm export... but this would complicate this temporary solution further.

I did a small test on my system to verify my idea and it is working. So let me know what you think about it.
Comment 12 Tin Vo 2021-01-08 03:31:41 CET
I would change the end of the volume_import function to this instead:

>   if (my $err = $@) {
>	if (defined($base_snapshot)) {
>	    eval { run_command(['zfs', 'rollback', '-r', '--', "$zfspath\@$base_snapshot"]) };
>	} else {
>	    eval { run_command(['zfs', 'destroy', '-r', '--', $zfspath]) };
>	}
>	die $err;
>    }
>    my $msg = '';
>    my $output = sub { $msg .= "$_[0]" };
>    run_command(['zfs', 'get', '-Ho', 'value', 'keystatus', $zfspath], outfunc => $output);
>    if ($msg eq "unavailable") {
>      eval { run_command(['zfs', 'set', 'keylocation=file:///run/encryption.key', $zfspath]) };
>      eval { run_command(['zfs', 'load-key', $zfspath]) };
>      eval { run_command(['zfs', 'change-key', '-i', $zfspath]) };
>    }
>    return "$storeid:$dataset";
The encryption key is stored in the file at /run/encryption.key

This would set and load the encryption key and change it to inherit from the parent, but only when the received dataset is encrypted and its key is not yet loaded.
Comment 13 Andreas Palm 2021-01-25 16:32:23 CET
Are there any other drawbacks in using raw streams besides the fact that you can't switch from regular to raw and vice versa?

As you can't switch a dataset from unencrypted to encrypted, using -w just for encrypted datasets in volume_export seems to be no problem in practice.

I added -w in my personal setup and this seems to be working fine. Together with the changes to volume_import from "Tin Vo" this seems to be an acceptable solution for now.

Just leave a positive answer if you want me to create a patch and PR.
Comment 14 venzislav@gmail.com 2021-02-03 13:20:14 CET
Please do it, and if you can, add more description of the solution and how to apply it.
We also need this functionality, because we just discovered that it is not possible to migrate or replicate between encrypted volumes, and encryption is needed in case of stolen hardware.
Comment 15 Tobias McNulty 2021-04-23 00:56:53 CEST
Is there a workaround for this bug (that does not require editing code), when the use case is simply to perform an offline migration of a VM from a failing node to another?
Comment 16 Fabian Grünbichler 2021-04-23 10:20:37 CEST
(In reply to Tobias McNulty from comment #15)
> Is there a workaround for this bug (that does not require editing code),
> when the use case is simply to perform an offline migration of a VM from a
> failing node to another?

if all your disks are on ZFS (and you feel comfortable with manually doing lower-level operations), you can manually zfs send/recv, power the guest off, (incremental) send/recv again, then mv the config file in /etc/pve and start on the new node.
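roughly along these lines (an illustrative sketch only - VMID, dataset, snapshot and node names are examples, and the config path assumes a qemu guest):

zfs snapshot rpool/data/vm-100-disk-0@migrate1
zfs send rpool/data/vm-100-disk-0@migrate1 | ssh target-node zfs receive rpool/data/vm-100-disk-0

# power the guest off, then send only the delta
zfs snapshot rpool/data/vm-100-disk-0@migrate2
zfs send -i @migrate1 rpool/data/vm-100-disk-0@migrate2 | ssh target-node zfs receive rpool/data/vm-100-disk-0

# move the guest configuration inside the clustered config filesystem
mv /etc/pve/nodes/source-node/qemu-server/100.conf /etc/pve/nodes/target-node/qemu-server/100.conf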
Comment 17 ornanovitch 2021-11-15 12:03:31 CET
Hi.

Are there any plans to implement this soon?

We want to try the patch from Tin Vo & Andreas Palm, but at the very least we need to be sure it's still safe on PVE 7.0.

For the record, upstream issue https://github.com/openzfs/zfs/issues/10507 is still open, and the feature request is leaning towards a new flag (something like --no-preserve-encryption) compatible with -R, in order to decrypt while sending and re-encrypt while receiving.
Comment 18 wasd@exxor.de 2022-02-19 01:00:58 CET
(In reply to ornanovitch from comment #17)

> (something like --no-preserve-encryption)

Personally I think in a usual Proxmox setup a flag like this would be bad due to the overhead of the decryption on the source and re-encryption with a different key on the target. Proxmox replications should be as fast as possible, so these seconds matter.

Having the same key for the encrypted (root) dataset on every Proxmox node on the other hand should be no problem in most setups. So I think an approach in that direction would be better.

However, neither seems trivial and I'd rather have any working replication & encryption than none. So I'd also appreciate it if this issue wouldn't get dropped/neglected.
Comment 19 Fabian Grünbichler 2022-02-21 10:07:24 CET
the problem is that encryption (especially in combination with zfs send/recv) is both not very ergonomic yet and regularly hit by (sometimes catastrophic) bugs upstream. until these are ironed out, supporting native ZFS encryption with replication on our end is not really feasible, as much as we'd like to "make it work".
Comment 22 roland.kletzing 2023-03-09 14:47:23 CET
the title of this ticket may lead to false assumptions.

to clarify - you can use file-backed virtual machines on ZFS datasets, and storage migration should work even with different encryption settings on source and target.

but you are out of luck regarding replication, as replication is currently only possible with zvol-backed VMs.
Comment 23 Fabian Grünbichler 2023-03-09 15:04:57 CET
yes, it "only" affects the ZFS storage type, not a directory storage on top of a ZFS dataset (in which case, PVE doesn't know about the ZFS underneath, and also doesn't use zfs send/recv to migrate anything).
Comment 24 venzislav@gmail.com 2023-03-09 15:09:57 CET
I am just testing a new installation in a "shady" datacenter, and we have a cluster of Proxmox servers with encrypted ZFS storage.
We can migrate an online (running) VM between servers without any problem.
There are two major problems:
 1. if the VM has a cloud-init drive, the transfer fails even though the machine is online;
 2. an offline machine cannot be transferred between Proxmox servers.

Why can we transfer an online machine but not an offline one?
We are using the same encryption key for all machines.

This functionality is a must if you host servers in a remote datacenter and care about the safety of the data! PLEASE PLEASE give us a fix! :)
Comment 25 Fabian Grünbichler 2023-03-09 15:21:36 CET
there won't be support for native encryption until the bugs are ironed out for this very reason. if you require encryption at rest, please use LUKS below your zpool.
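for illustration, encryption at rest with LUKS below the pool looks roughly like this (a sketch - device and pool names are placeholders):

cryptsetup luksFormat /dev/sdb
cryptsetup open /dev/sdb crypt_sdb
zpool create tank /dev/mapper/crypt_sdb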
Comment 26 Alexander K 2023-03-29 18:34:24 CEST
(In reply to Fabian Grünbichler from comment #25)
> there won't be support for native encryption until the bugs are ironed out
> for this very reason. if you require encryption at rest, please use LUKS
> below your zpool.


Hello Fabian. 
To be honest, ZFS does not forgive sharing a drive. In my experience it does not tolerate any loop/LUKS/dm-crypt layer below it, nor any other filesystem on the same drive. All such attempts ended up with huge performance degradation and "task hung for more than X seconds" messages. This is what I learned the very hard way.
Comment 27 Mark Rottler 2023-03-29 18:52:20 CEST
After running into this bug back in mid-2020, I ended up running mirrored ZFS on top of LUKS. In my case, I also experienced performance problems at first. I was able to fix those performance issues by adding an Intel Optane drive for the ZFS Intent Log.

I honestly don't know if ZFS on top of LUKS was the source of my performance problems, but the Intel Optane ZIL made an immediate and significant difference. I run what I would consider a moderately loaded three node cluster, so it's certainly possible that I would again encounter performance issues if the cluster was put under high load.

That's my experience, for what it is worth....
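For reference, adding a dedicated log device boils down to something like the following (pool and device names are placeholders):

zpool add rpool log /dev/nvme0n1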
Comment 28 Fabian Grünbichler 2023-03-30 08:34:14 CEST
lots of people are running LUKS below ZFS without any issues. obviously there is some overhead (it does encryption after all), so if your CPU is not fast, you will feel it, but you would hit the same penalty (or even worse) with ZFS encryption as well.
Comment 29 Alexander K 2023-03-30 08:57:43 CEST
Hi Fabian and Mark, thanks for your comments. IMO, that performance degradation is not the result of the linear overhead caused by storage subsystem stacking itself: e.g. when you put a loop device inside a loop device inside a loop device, there will obviously be a slight overhead.

As you might notice, Proxmox says in bold: no HW RAID for ZFS. The reason is how ZFS's I/O works.
Let's imagine the following, which is not the actual mechanism, but it illustrates why stacking is bad.
Blocks go to a buffer and are flushed to disk when certain conditions are met. Because of the rather complex and, again, stacked path to the storage (SLOG, ZIL, ARC, etc.), ZFS employs its own I/O scheduling.
When a block is flushed to storage, ZFS flushes it and waits for the storage to confirm. With raw storage underneath, ZFS gets the confirmation as soon as the write is done. However, if another I/O scheduler sits underneath, it may decide not to flush right away but to flush later. In that case ZFS keeps waiting. When many blocks end up in this situation, the server shows 100% iowait without any noticeable disk throughput.

I faced it a couple of years ago when I tried dm-crypt/LUKS below ZFS (before native encryption was released). The result was:
- 64 GB of ARC on a server with 192 GB of memory
- a 512 GB SATA SSD drive
- 99% iowait
- and... 512 KB/s SSD read/write utilization

A couple of days ago I set up Proxmox on another server, with an ext3 OS partition and a ZFS encrypted partition for VMs on the same SATA SSD. The result was quite similar (not as drastic, but similar): iowait was jumping up to 99% far too often, and the average read/write utilization of that SSD was about 24 MB/s when the SSD could handle up to 512 MB/s. Yesterday I moved the ZFS partition to a dedicated drive and now everything runs smoothly.
So, LUKS is not an option for ZFS encryption.
Comment 30 Alexander K 2023-03-30 09:01:55 CEST
Some further additions:
I spent weeks trying to make it work with dm-crypt - tweaking I/O schedulers, caching and other parameters. Nothing helped. Every time it resulted in:

- huge iowait
- no I/O activity on the drive
- no CPU utilization

Things may have improved since then; however, my attempt two days ago to put ZFS encrypted storage alongside the PMX system partition suggests not much has changed.
Comment 31 Esi Y 2023-12-16 20:18:48 CET
(In reply to Fabian Grünbichler from comment #4)
> it seems like this issue was lost when upstream moved their github
> repository. AFAIK there hasn't been any progress on this front yet.

Current issue link with reasoning why it's not going to go away:
https://github.com/openzfs/zfs/issues/10507#issuecomment-651162104