We’re converting our workplace lab to Proxmox and it’s a great ramp for eventually leaving VMware. Great system.
Can you share any details? You say “our workplace”, so I assume you are talking about work.
Yes, we are a medical/dental/pharmacy university, and because of some of our org’s specific data needs we have a large on-prem ecosystem. We are currently a VMware shop, but Broadcom’s business strategies have made us look for alternatives. I’ve used Proxmox in the homelab for years, and as it’s gotten more and more polished I’ve come to feel it’s ready to be considered for production work. Currently we have a lab environment of previous-gen hardware, which I want to use as a test bed for possible production platform moves.
Proxmox isn’t VMware yet, but it’s close. The HA doesn’t work the same way, and I’ve struggled with getting something akin to DRS. If you use on-host storage, you have to constantly run replication to keep the hosts synced, and even then a failover is essentially a storage rollback to the last sync. If you use iSCSI storage, you have to be very careful. Snapshotting only works on a few of the storage types, and we use ZFS. ZFS over iSCSI is somewhat brittle, but we have a TrueNAS device which supports it here.
We use Veeam as our enterprise backup solution, and I have no idea how these will work together. Veeam talks directly to our Nimble storage, does storage-based snapshots, and replicates them to our other site. Veeam theoretically does talk to TrueNAS, but without Proxmox support I don’t know what the backup/recovery flow would look like. Veeam is looking into this: https://community.veeam.com/discussion-boards-66/veeam-researching-support-for-vmware-alternative-proxmox-as-backup-buyers-fret-about-broadcom-6530
We tried to use TrueNAS ZFS snapshots as a general VM semi-backup, but it doesn’t work well: you have to make separate snapshot tasks for each specific zvol/dataset, otherwise you’re rolling your whole dataset back. I also tried mounting a snapshot, hoping to share it as an iSCSI extent, attach it to a VM, and pull out a specific file. This didn’t work at all; I can’t get the UI to show the promoted clone, so I can’t even try to present it to the host.
When coming back from a power-off, if your Proxmox hosts are in a cluster, there’s no cluster-aware startup order (HA disables the entire startup delay system). That’s not great, because our apps have SQL dependencies which need to be started first.
Those are the issues, and it sounds negative, but ultimately, for a zero-cost hypervisor that’s under active development, they need to be viewed through the lens of the overwhelming achievement that the project is and continues to be.
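To make the dependency problem concrete, here is a minimal sketch of working out a safe boot order from a dependency map. The guest names are hypothetical, and this is not a Proxmox feature or API, just the ordering logic that a cluster-aware startup would need to apply. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical guest names; each key lists the guests it must wait for.
depends_on = {
    "sql-01": [],
    "app-01": ["sql-01"],
    "app-02": ["sql-01"],
    "web-01": ["app-01", "app-02"],
}


def startup_order(deps):
    """Return the guests in an order where every dependency comes first.

    Raises graphlib.CycleError if the dependencies form a loop.
    """
    return list(TopologicalSorter(deps).static_order())


order = startup_order(depends_on)
print(order)
```

With this map the SQL guest always comes out first and the web guest last; the relative order of the two app guests is not significant.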
Veeam was apparently looking into officially supporting Proxmox. I don’t remember seeing any timetable though.
For the dependency issues I used systemd in my homelab. I have not tested HA, as I only have gigabit networking and limited hardware, so performance would take a hit.
The 2 biggest issues I’ve noticed with Proxmox are:
When the cluster loses quorum, it is really hard to reestablish consensus.
Proxmox is a normal Linux system, which sounds good, but updating individual packages with apt can be problematic and doing a version upgrade is a hassle. It would be better if it were immutable so that you could upgrade and downgrade easily. Ideally it would be automated, so that nodes could be upgraded automatically and then tested for stability. If an upgrade fails, it should be rolled back.
Those are all fair. Also, the entire Open vSwitch setup is very clunky. I always avoid the UI and just edit /etc/network/interfaces directly, especially for VLAN networks. I dislike that it wants 3 nodes, but I understand why; still, 2 nodes in the homelab is pretty reasonable. In general I wish the HA were more configurable, robust, and intuitive.
For HA you are always going to need at least 3 nodes. Most HA systems need 5 or more.
Does that need to be true though? For true “counting in how many 9’s” HA, of course. But there’s nothing technically preventing high availability with 2 nodes; if the storage is shared and there’s a process to keep the memory in sync, it should be possible for 2 nodes to have some degree of high availability, even if it comes with big warnings.
The problem with 2 nodes is that there is no way to identify which node has the issue. From a host’s perspective, all it “knows” is that the other node isn’t reachable. Technically it could assume that it is the functional one, but there is also the possibility that both machines assume they are the working one and then both spin up the same VM.
You can cluster two nodes, but as soon as one node can’t reach the other, everything freezes to prevent loss of consensus.
The reason I suggest 5 nodes is that 3 only allows for one node to fail. If one fails and the remaining 2 then can’t sort out what is happening, the cluster freezes to prevent loss of consensus. Having 5 machines also gives you more flexibility.
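To put numbers on that, here is a small sketch of the majority arithmetic, assuming every node has exactly one equal-weight vote and no extra tie-breaker vote is configured. The function names are just for illustration.

```python
def quorum(votes: int) -> int:
    """Smallest strict majority of `votes` equal-weight votes."""
    return votes // 2 + 1


def tolerated_failures(votes: int) -> int:
    """How many voters can drop out while the rest still form a majority."""
    return votes - quorum(votes)


for n in (2, 3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, can lose {tolerated_failures(n)}")
```

Under those assumptions 2 nodes tolerate no failures, 3 and 4 nodes both tolerate one, and 5 nodes tolerate two, which is why going from 3 to 4 nodes adds capacity but no extra failure tolerance.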
I also want to point out that you need fast networking for HA, but I’m sure you already know that.
Have you looked at Proxmox Backup Server?
https://www.proxmox.com/en/proxmox-backup-server/overview
Yes, but it’s not an option yet. We’re heavily invested in Veeam and are not looking to replace that piece yet.