
TOPIC:

URGENT: VMs corrupted after transferred to HAL 6 years 7 months ago #1406

  • Salvatore Costantino
  • Posts: 722
We recreated the scenario from your migration without HA-Lizard (two pools, live-migrating a VM from pool 1 to pool 2).

Before the migration is complete, a new VM appears on the target pool in the OFF state. The migration is still ongoing at this point (for an additional minute or so).

If left alone, the VM will complete its transfer and start on the new pool.

If a user manually starts the VM on the target host while the migration is still ongoing, it will crash the VM. This is essentially what happened in your case. The only difference is that HA-Lizard performed the start.

There is no trigger or flag (that we could identify) available on the new VM to tell us that we should not start it, so HA-Lizard starts it because it is configured to do so.

You may want to consider raising this as a possible bug with Citrix, since the VM should NOT be startable on the target pool while the migration is ongoing.

In the interim, you could also change your HA-Lizard default settings to prevent this from happening. By default, HA-Lizard will start any VM that is not running. This should really be changed so that you can control which VMs should not be started. In the case of a live migration from a non-HAL pool to a HAL pool, this would not have occurred if GLOBAL_VM_HA had been disabled.
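
To make the difference between the two settings concrete, here is a minimal sketch of the start decision described above. It is not HA-Lizard's actual code: the names `Vm`, `ha_enabled`, and `should_start` are hypothetical, and the per-VM opt-in flag is an assumption about how VMs are selected once GLOBAL_VM_HA is disabled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vm:
    name: str
    power_state: str                   # "running" or "halted"
    ha_enabled: Optional[bool] = None  # hypothetical per-VM opt-in; None = never set


def should_start(vm: Vm, global_vm_ha: bool) -> bool:
    """Simplified model of whether an HA agent would start this VM."""
    if vm.power_state == "running":
        return False
    if global_vm_ha:
        # Global mode: every halted VM is a start candidate, including the
        # placeholder VM that appears on the target pool mid-migration.
        return True
    # Per-VM mode (assumed): only VMs explicitly opted in are started.
    return vm.ha_enabled is True


# The VM that shows up in the OFF state while the migration is still running.
incoming = Vm(name="migrating-vm", power_state="halted")

print(should_start(incoming, global_vm_ha=True))   # True
print(should_start(incoming, global_vm_ha=False))  # False
```

With the global setting on, the half-migrated VM looks like any other halted VM and gets started. With it off, a VM that has never been opted in is left alone until the migration finishes.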


URGENT: VMs corrupted after transferred to HAL 6 years 2 months ago #1521

I know this is old, but I want to corroborate Salvatore's comments above.
So far today, I have successfully live migrated 17 VMs (with disk sizes from 8 GB to 350 GB) from a traditional 4-host shared-storage pool to a new 2-node HAL cluster without any issues. I have GLOBAL_VM_HA set to 0.


URGENT: VMs corrupted after transferred to HAL 6 years 2 months ago #1523

  • Mauritz
  • Topic Author
  • Posts: 43
Hi Rob,

In September last year, out of the blue and with no prior warning, we lost a handful of virtual machines to what I can only assume was split brain. We woke up that morning to a number of VMs spitting out kernel warnings, and on closer inspection one of the hosts had lost access to multipath and was running at loads of over 30. I need to reiterate that both hosts were still running, so there was never a point at which HA failover could actually take place; one host was simply unable to connect to, or keep its connection to, the shared storage repo. Now, it's not HAL's fault at all that the host ran under heavy load (we could never quite figure out why). But the fact that the host ran under severe load, lost its connection to the storage, and then had pretty much all of our VMs corrupted was enough for us to get rid of HAL and just run standalone hosts. HAL is not recommended by XenServer, as they already have a working HA model in place, so what you're ultimately doing is installing an additional unsupported layer into your hypervisor. If you're running production machines for clients, you need to weigh whether your clients would prefer having HA against the risk that, if something out of the ordinary happens, they may lose the VM entirely (if you did not make backups).

The product itself works, sure, but there are too many different kinds of issues that can occur. We simply cannot risk losing VMs to a broad spectrum of failure scenarios in exchange for covering the one scenario where host A goes down and host B takes over. Most people considering HAL would be smaller companies that cannot necessarily fork out for a true "HA" hardware model, so if something were to go wrong (and trust me, it can), will you be able to fully recover your environment, keep clients happy, and get all systems up and running fast enough? Remember that if disaster strikes, you first have to pinpoint the issue in HAL, then in XenServer, and then make decisions based on that. If you follow my previous forum messages (some very nervous and erratic), where days went past without any feedback, is that something you can truly afford? You will not find answers on HAL anywhere but here unless you pay the support fee, so the rest of the XenServer community would probably not be able to assist you with any actual concerns. These are the unfortunate facts we faced AFTER disaster struck.

Unfortunately for us, we relied so heavily on HAL (as that was our "backup") that we never took snapshots of VMs, so we had to manually set up the different VMs again to get everything working. We've since reverted to just running VMs on isolated hosts and making daily snapshots. If a host were to go down, it's a lot faster (and safer) to just import from a backup than to have disaster strike, try to figure out what's going on, wait days for responses, and then make decisions. That would nullify the point of having HAL in the first place.

In a testing environment, sure, or if you can confidently say "if corruption takes place because one of my hosts loses access to the storage repo, I will be okay", then go for HAL. Otherwise you should really read the fine print first and weigh whether your environment is so perfect that the occasional spike or unforeseen error won't affect you.

All the best, and I hope that what happened to us never happens to you.


URGENT: VMs corrupted after transferred to HAL 6 years 2 months ago #1524

Hi Mauritz,
I hate to hear what happened to you. I've been in those situations, and it's truly a bad day when this happens. However, I'm not convinced that this was HAL's fault in your case; it suggests that an underlying hardware issue of some sort (L1 network, storage RAID?) caused the problem.
As I'm sure you're aware now, a replication strategy like iSCSI-HA/DRBD/HAL is not a backup in and of itself; it's a downtime mitigation strategy. No HA system is 100% reliable. One should always perform scheduled backups of one's data :(
That being said, I can't imagine any company not shelling out for support if it's needed; most vendors require you to maintain a support contract to get any support on a product at all.

Again, I hate to hear what happened. With any luck, things are back to normal now.
