
TOPIC:

Master Up, slave lost network & VM stuck on slave 7 years 7 months ago #970

  • Tobias Kreidl (Topic Author)
  • Posts: 14
Situation:
2-server pool, the master is up and running. The slave server lost all network connectivity and has a VM running on it. Settings are:
DISABLED_VAPPS=()
ENABLE_LOGGING=1
FENCE_ACTION=stop
FENCE_ENABLED=0
FENCE_FILE_LOC=/etc/ha-lizard/fence
FENCE_HA_ONFAIL=0
FENCE_HEURISTICS_IPS=10.15.9.1
FENCE_HOST_FORGET=0
FENCE_IPADDRESS=
FENCE_METHOD=POOL
FENCE_MIN_HOSTS=2
FENCE_PASSWD=
FENCE_QUORUM_REQUIRED=1
FENCE_REBOOT_LONE_HOST=0
FENCE_USE_IP_HEURISTICS=1

Added Note: Changing "FENCE_HA_ONFAIL=1" made no difference in the behavior.
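
A small sketch for reading individual values out of a settings listing like the one above, for example to check whether fencing is switched on. It assumes the settings are available as plain KEY=VALUE lines on standard input; the helper names `setting_value` and `fencing_enabled` are hypothetical and are not ha-cfg subcommands.

```shell
# Read KEY=VALUE lines on stdin and print the value of the requested key.
# setting_value is a hypothetical helper, not part of HA-Lizard.
setting_value() {
    key="$1"
    # Strip the "KEY=" prefix from the matching line; print the first match only.
    sed -n "s/^${key}=//p" | head -n 1
}

# Succeeds (exit status 0) only when FENCE_ENABLED is set to 1.
fencing_enabled() {
    [ "$(setting_value FENCE_ENABLED)" = "1" ]
}
```

With the settings saved to a file, `fencing_enabled < settings.txt && echo on || echo off` would report the current state (the file name here is only an example).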

The one VM stuck on the networkless slave server did not restart on the master, even though it is marked true for HA. Worse, the VM has disappeared from view in XenCenter! It is still running, though, according to "xe vm-list", and it shows up as enabled for HA via "ha-cfg get-vm-ha". Pool OP_Mode is set to 2. A clean shutdown of the slave node does force the migration to work, but not this state, where the server is running and has lost only its network connectivity.

What settings do I need to modify to allow the VM to be restarted on the master server automatically? And why did the VM disappear altogether from view in XenCenter?! We're running XenServer 6.5 SP1, fully patched up to XS65ESP1034.

Thank you for any assistance!

--Tobias


Last edit: by Tobias Kreidl.

Master Up, slave lost network & VM stuck on slave 7 years 7 months ago #971

  • Salvatore Costantino
  • Posts: 722
Looks like you have fencing disabled (FENCE_ENABLED=0). Try setting this to 1 and retry your test scenario:
"ha-cfg set fence_enabled 1"

You can verify this by reviewing the logs on the master. If you parse /var/log/messages by grepping for "ha-lizard", you will get a fairly readable log that should walk you through all the logic leading up to the failed HA event.
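
A minimal sketch of that log filter, assuming the syslog location mentioned above. The function name `ha_lizard_log` and its optional path argument are illustrative only and are not part of HA-Lizard.

```shell
# Print only the HA-Lizard entries from a syslog file.
# The path defaults to /var/log/messages; pass another file (for example a
# rotated log) as the first argument. The function name is hypothetical.
ha_lizard_log() {
    log_file="${1:-/var/log/messages}"
    grep -- "ha-lizard" "$log_file"
}
```

Run on the master, something like `ha_lizard_log | tail -n 200` keeps the output to the most recent entries around the failed HA event.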
The following user(s) said Thank You: Tobias Kreidl


Master Up, slave lost network & VM stuck on slave 7 years 7 months ago #972

  • Tobias Kreidl
  • Posts: 14
Many thanks, Salvatore, this may have done it. I am not on the premises where the servers are, but I did an ifconfig down on the active Ethernet NICs and, other than having to refresh XenCenter, the migrations took place as they should! I will test tomorrow with the physical networks disconnected to verify. Best regards,
--Tobias


Master Up, slave lost network & VM stuck on slave 7 years 7 months ago #973

  • Tobias Kreidl
  • Posts: 14
Progress. That took care of the case where the network is severed on the host that is not the pool master. When the pool master loses its network, however, an emergency transition to pool master is required to promote the slave server, and even after that it took a very long timeout before XenCenter could connect to the new pool master. At that point, the only VMs still running were those that had originally been running on that server. Any attempt to contact the other host of course fails because there is no network present, so it is not possible to kill, restart, etc. any VMs left dangling on that now network-isolated host (at least not via xe). If you log onto it via a KVM or iDRAC connection, you see the VMs are still running, but totally isolated from the pool.

Is there any other optional flag that can be set to immediately kill and restart the VMs on the remaining host and declare it to be master in the event of a sudden total network loss to the slave host? The hope is to allow the pool to keep all designated VMs running, no matter what. Thanks again for your assistance.


Master Up, slave lost network & VM stuck on slave 7 years 7 months ago #974

  • Salvatore Costantino
  • Posts: 722
In an HA event where the master fails, the slave should perform the following:

1- fence
2- if successful, become master, power off all VMs that were on the master, and start them on the slave.

No manual intervention should be required other than reconnecting XenCenter to the slave IP if desired.
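
The sequence above can be outlined as a dry run. This is only an illustration of the ordering described in this post, not HA-Lizard's actual code: the function name `slave_failover_plan` and its "ok" argument are hypothetical, the failure branch is an assumption drawn from the "if success" condition, and the xe commands named in the comments are the usual manual equivalents rather than a statement of what HA-Lizard runs internally.

```shell
# Dry-run outline of what the slave does when the master is unreachable.
# It only prints the steps; nothing is executed against the pool.
# $1 is the outcome of the fencing attempt: "ok" means fencing succeeded.
slave_failover_plan() {
    fence_result="$1"
    if [ "$fence_result" != "ok" ]; then
        # Assumption: without a successful fence, no promotion and no VM starts.
        echo "fencing failed: remain slave, start nothing"
        return 1
    fi
    # Manual equivalent: xe pool-emergency-transition-to-master
    echo "1. promote this host to pool master"
    # Manual equivalent: xe vm-reset-powerstate (acts on the pool's record of
    # the VM, so it does not need to reach the isolated host)
    echo "2. mark VMs that were on the old master as powered off"
    # Manual equivalent: xe vm-start
    echo "3. start those VMs on this host"
}
```

Calling `slave_failover_plan ok` prints the three steps in order; any other argument prints the failure message and returns a non-zero status.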

Can you re-run your test, capture the logs from the slave, and attach them here so we can see exactly what is going on?


Master Up, slave lost network & VM stuck on slave 7 years 7 months ago #975

  • Tobias Kreidl
  • Posts: 14
Of course, I'd be happy to oblige. Let me bring things back up to where they were so I can replicate that exact event. I'll grep for any ha-lizard entries in the messages file, sanitize any host names, and attach a file containing all that info for you. Thank you. Give me a little time here to take care of all that.

I don't see how it can power off any VMs on the old master if there is no network connectivity to it, though... I'm talking about something like the cables being cut in two or the network switch failing altogether.


Last edit: by Tobias Kreidl.