HA-lizard Version 2.2.0 Trial 5 years 3 months ago #1723

  • Nathan Scannell (Topic Author)
Hi Salvatore,

I've tested a couple of disaster scenarios using the RPM upgrade package, and for the most part it has held up well.

Thought I'd just start a new thread here to discuss it in detail.

First Test:
Simulated Master host failure (Power supply failure on Master)
The slave auto-promoted and could be accessed in GUI *tick*
Fixed the old Master and rebooted. Came back online as a Slave *tick*

Second Test:
Simulated a further failure, this time the NEW Master (old slave)
The Slave (Old Master) auto-promoted and came online *tick*
The failed new Master (the original Slave) came back online as a Slave, BUT DRBD failed to connect to the peer node, printing this message at boot (interfering with the Plymouth screen):

***************************************************************
DRBD's startup script waits for the peer node(s) to appear.
- If this node was already a degraded cluster before the
reboot, the timeout is 0 seconds. [degr-wfc-timeout]
- If the peer was available before the reboot, the timeout
is 0 seconds. [wfc-timeout]
(These values are for resource 'iscsi1'; 0 sec -> wait forever)
To abort waiting enter 'yes' [ 14]:
***************************************************************

I've attempted to manually repair the DRBD connection at this point, but it stubbornly refuses to establish a connection to the Master, which in turn prevents the tgtd service from starting.

Any idea how to rectify this? I know the case simulated here is a little obtuse, but theoretically the hosts should be able to fail and recover indefinitely. This particular case could feasibly arise, e.g. two power supply units from the same batch are bad and fail in close succession.
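For reference, the timeouts quoted in that boot prompt come from the resource's startup section. A sketch of what that stanza typically looks like (the file path and layout here are illustrative, not taken from this pool's actual config):

```
# /etc/drbd.d/iscsi1.res -- illustrative; a value of 0 means "wait forever"
resource iscsi1 {
    startup {
        wfc-timeout      0;   # wait-for-connection timeout at boot
        degr-wfc-timeout 0;   # same, when the node was already degraded
    }
}
```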


HA-lizard Version 2.2.0 Trial 5 years 3 months ago #1725

  • Salvatore Costantino
It could be one of two things:

- We have seen sporadic issues with the Linux bridge over a bond, where ARP peer notifications occasionally fail. This is purely kernel-related. We have worked around it in two ways: 1) don't use a bond for replication, or 2) switch the bond type to LACP. You can verify whether this is your issue by simply pinging the peer replication IP from each host. If the ping fails, then this is likely the cause.
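As a quick sketch of that check (the replication IP below is hypothetical; substitute the address configured in your pool, and run it from each host):

```shell
#!/bin/sh
# Hypothetical replication IP of the peer host -- adjust to your pool.
PEER_IP="10.10.10.2"

# 3 pings, 2s timeout each; DRBD replication rides this link, so it must answer.
if ping -c 3 -W 2 "$PEER_IP" >/dev/null 2>&1; then
    echo "replication link to $PEER_IP OK"
else
    echo "replication link to $PEER_IP DOWN -- suspect bond/ARP or cabling"
fi
```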

- If it's not the above, then it could be DRBD split brain. DRBD will do this if it cannot reliably determine which host is most up to date; it does so intentionally to preserve data integrity. If this is the issue, we ship a script that will resolve it. Run the following on each host and follow the instructions: /etc/iscsi-ha/scripts/drbd_sb_tool
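For anyone following along, the manual steps that the bundled drbd_sb_tool automates follow DRBD's documented split-brain recovery procedure. A sketch, assuming the resource name iscsi1 from the boot message above (run the discard side ONLY on the host whose data you are willing to lose):

```shell
#!/bin/sh
# Manual DRBD split-brain recovery sketch. Resource name 'iscsi1' is taken
# from the boot message in this thread; adjust to your own resource.
RES="iscsi1"

if command -v drbdadm >/dev/null 2>&1; then
    # On the "victim" host (its changes will be DISCARDED):
    drbdadm secondary "$RES"
    drbdadm connect --discard-my-data "$RES"

    # On the surviving host (run separately there):
    #   drbdadm connect iscsi1

    # Then watch the resync progress:
    cat /proc/drbd
else
    # Not a DRBD node -- nothing to do; this script is only a sketch.
    echo "drbdadm not present (sketch only)"
fi
```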


Last edit: by Salvatore Costantino.

HA-lizard Version 2.2.0 Trial 5 years 3 months ago #1726

  • Nathan Scannell (Topic Author)
Thanks...

It's not split brain. That was my first suspicion, but there was never any sign of split brain occurring.

I ran the script anyway, but it just automated my previous manual attempts.

Also, I've avoided using bonds, but that reminds me: I never switched over to Linux bridge. Is Open vSwitch supported yet? I thought the installer would warn me if not.


I'll run the simulation again in bridge mode anyway just to document the result.


Last edit: by Nathan Scannell.

HA-lizard Version 2.2.0 Trial 5 years 3 months ago #1727

  • Salvatore Costantino
OVS was a problem in 6.x; it's been working fine in 7.x. Are you able to ping the replication IPs from each of the hosts? It could be a firewall issue too.
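A hedged sketch of both checks, run from each host (the IP and port below are placeholders; 7788 is only DRBD's customary first resource port, so match it to your resource config):

```shell
#!/bin/sh
# Hypothetical peer replication IP and DRBD port -- adjust to your pool.
PEER_IP="10.10.10.2"
DRBD_PORT=7788

# 1) Basic reachability over the replication link.
ping -c 3 -W 2 "$PEER_IP" >/dev/null 2>&1 \
    && echo "ping to $PEER_IP OK" \
    || echo "ping to $PEER_IP FAILED"

# 2) Is anything filtering the replication port? (iptables on a dom0 host)
if command -v iptables >/dev/null 2>&1; then
    iptables -L -n 2>/dev/null | grep -q "$DRBD_PORT" \
        && echo "firewall has a rule mentioning port $DRBD_PORT" \
        || echo "no explicit rule for port $DRBD_PORT found"
fi
```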


HA-lizard Version 2.2.0 Trial 5 years 3 months ago #1728

  • Nathan Scannell (Topic Author)
OMG... Bad crossover cable. :blush:

I made it myself because I ran out of proper stock hah

:whistle:


HA-lizard Version 2.2.0 Trial 5 years 3 months ago #1730

  • Nathan Scannell (Topic Author)
So... It wasn't my cable after all...

Intermittently, my adapters are not lighting up after reboot. It seems they have trouble negotiating a link.

The problem is resolved by installing a switch in place of the crossover cable.

They are Intel PRO/1000 Desktop PCI-E NICs. Might pay to avoid these in crossover scenarios.
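For anyone hitting the same symptom, link negotiation can be inspected with ethtool. A sketch, where eth1 is a placeholder for the replication NIC:

```shell
#!/bin/sh
# Check whether the NIC actually negotiated a link after boot.
# 'eth1' is a placeholder -- substitute your replication interface.
IFACE="eth1"

if command -v ethtool >/dev/null 2>&1 && [ -e "/sys/class/net/$IFACE" ]; then
    ethtool "$IFACE" | grep -E 'Speed|Duplex|Link detected'
else
    echo "ethtool or $IFACE not available (sketch only)"
fi

# If the link stays down on a direct cable, forcing speed/duplex sometimes
# helps -- note that gigabit copper normally requires autonegotiation, so
# try 100 Mb full duplex first:
#   ethtool -s eth1 autoneg off speed 100 duplex full
```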
