TOPIC:

server lockup ; drbd not replicating now 4 years 4 months ago #1941

  • Rob Hall
Good afternoon everyone!
So, we have a two-server pool running XenServer 7.3 and ha-lizard / iscsi-ha. This has been running flawlessly for over a year, but this past weekend the master partially locked up - I say partially because it stopped responding to all requests and the VMs stopped responding, but the toolstack was alive just enough to prevent the other host from failing over to master. After I power cycled the affected host, the secondary did indeed become the master, iscsi-ha took ownership of the floating IP, and everything was working again. However, DRBD replication is broken. This is what iscsi-cfg status shows on the secondary host:
| iSCSI-HA Version IHA_2.1.5 |
| Wed Oct 30 16:15:35 EDT 2019 |

| iSCSI-HA Status: Running 13054 |
| Last Updated: Wed Oct 30 16:15:22 EDT 2019 |
| HOST ROLE: SLAVE |
| VIRTUAL IP: 10.100.4.1 is not local |
| ISCSI TARGET: Stopped [expected stopped] |
| DRBD ROLE: iscsi1=Secondary |
| DRBD CONNECTION: iscsi1 in WFConnection state |
Control + C to exit


| DRBD Status |

| version: 8.4.5 (api:1/proto:86-101) |
| srcversion: 2A6B2FA4F0703B49CA9C727 |
| 1: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r |
| ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:5066752 |


The primary shows:
| iSCSI-HA Version IHA_2.1.5 |
| Wed Oct 30 16:16:02 EDT 2019 |

| iSCSI-HA Status: Running 6924 |
| Last Updated: Wed Oct 30 16:15:58 EDT 2019 |
| HOST ROLE: MASTER |
| DRBD ROLE: iscsi1=Primary |
| DRBD CONNECTION: iscsi1 in WFConnection state |
| ISCSI TARGET: Running [expected running] |
| VIRTUAL IP: 10.100.4.1 is local |
Control + C to exit


| DRBD Status |

| version: 8.4.5 (api:1/proto:86-101) |
| srcversion: 2A6B2FA4F0703B49CA9C727 |
| 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r |
| ns:0 nr:107308992 dw:90494016 dr:2131559628 al:750459 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:798008972 |


The Primary/Secondary roles seem correct, so I don't think it's split brain - any other ideas on where I need to go from here?

Thanks!
--
Rob


server lockup ; drbd not replicating now 4 years 4 months ago #1942

  • Salvatore Costantino
Hi Rob,
There are 2 possibilities here.

1) The replication link between the hosts did not come up. If you are using a bonded interface (not LACP), there is a known issue where ARP is occasionally blocked on links that have a bridge stacked on top of a bond (as is the case with XenServer). You can check for this condition by simply trying to ping across the replication link. If ping fails, unplugging any one of the replication ports and replugging it typically resolves this. You can do this by pulling a single cable or, if the hosts are remote, by running "ifconfig <bondinterface> down && ifconfig <bondinterface> up"

An example would be "ifconfig eth4 down && ifconfig eth4 up"
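The ping check and the interface bounce described above can be combined so the interface is only bounced when the peer is actually unreachable. This is a rough sketch, not part of iscsi-ha: the function name check_and_bounce is made up for illustration, and the peer address in the usage comment is a placeholder for your own replication IP. Bouncing the interface interrupts traffic on it, so only run this against the replication link.

```shell
#!/bin/sh
# Sketch only: check_and_bounce is a hypothetical helper, not an iscsi-ha command.
# $1 = replication IP of the peer host (placeholder, use your own)
# $2 = interface to bounce if the peer does not answer (e.g. eth4)
check_and_bounce() {
    peer_ip="$1"
    iface="$2"
    # Three pings, two-second timeout each; any reply counts as "link up".
    if ping -c 3 -W 2 "$peer_ip" >/dev/null 2>&1; then
        echo "replication link OK: $peer_ip reachable"
        return 0
    fi
    echo "no reply from $peer_ip, bouncing $iface"
    ifconfig "$iface" down && ifconfig "$iface" up
}

# Example with placeholder values:
# check_and_bounce 192.0.2.2 eth4
```

After the bounce, re-run the ping and check iscsi-cfg status again to see whether DRBD has left the WFConnection state.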

2) If it's not the replication interface, it could be DRBD split brain (which is not the same as a cluster split brain). DRBD goes into split brain as a protection mechanism when it can't work out which host has the most up-to-date data. If this is the case, we ship a recovery script that you can run on each host to remedy the issue. Run the following script on each host and follow the prompts. This typically corrects any DRBD issues. Also, I highly recommend that you have backups of your data before running this script.
/etc/iscsi-ha/scripts/drbd-sb-tool
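Before running the tool, the connection state (cs:) in the DRBD status can help tell the two cases apart. The sketch below pulls a field out of a /proc/drbd resource line like the ones posted earlier in this thread. The helper names drbd_field and drbd_hint are made up for illustration, and the hints are rules of thumb rather than a diagnosis: in DRBD 8.4 a detected split brain usually leaves at least one host in StandAlone, while WFConnection on both hosts more often points at the link.

```shell
#!/bin/sh
# Sketch only: drbd_field and drbd_hint are hypothetical helpers.

# Print the value of one field (cs, ro or ds) from a /proc/drbd resource line.
# $1 = field name, $2 = status line
drbd_field() {
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1://p" | head -n 1
}

# Print a rough hint based on the connection state in the status line.
drbd_hint() {
    case "$(drbd_field cs "$1")" in
        Connected)    echo "connected: replication link is up" ;;
        WFConnection) echo "waiting for peer: check the replication link first" ;;
        StandAlone)   echo "standalone: possible DRBD split brain, check the kernel log" ;;
        *)            echo "other state: inspect /proc/drbd and the kernel log" ;;
    esac
}

# Usage on a live host (device minor 1, as in the output above):
# drbd_hint "$(grep '^ *1: cs:' /proc/drbd)"
```

Run it on both hosts and compare the results, since the two sides can report different states.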


server lockup ; drbd not replicating now 4 years 4 months ago #1943

  • Rob Hall
Hi Salvatore,
Ok, I feel like an idiot. You're right, I've run into the bonded link issue before, and it completely slipped my mind. That is exactly what the problem was. Thank you for reminding me!


server lockup ; drbd not replicating now 4 years 4 months ago #1944

  • Salvatore Costantino
Hi Rob,
FYI - we have addressed the bond issue in the latest version of iscsi-ha. The condition, if present, will be detected and corrected within 2 minutes after all services start on a host. It is designed to work with active/active and active/backup bonds (not LACP).

halizard.org/release/iscsi-ha/iscsi-ha-2.2.3-2.rpm

