iSCSI SR Broken if I reboot Slave Server 8 years 2 months ago #657

  • Adam Ward
Hi,

We are about to put HA-Lizard into production, but have an issue I'd like to clear up first:

If we cleanly reboot the Slave XenServer, the iSCSI Storage Repository is marked as "Broken".

No matter what I do (leave it alone, try and replug it using Xen or the HA-Lizard replug script) the Slave iSCSI SR remains as "Broken".

The only way I can fix this is to shut both servers down, restart the Pool Master, then restart the Slave. The iSCSI SR eventually comes back online...

I do get warnings in XenServer ("Failed to attach storage on server start"), and sometimes the Pool Master requires a "Repair Storage" to connect to the iSCSI SR. Is this correct?

Can anyone explain why this happens / help me fix it?

Thanks in advance,

Adam


iSCSI SR Broken if I reboot Slave Server 8 years 2 months ago #659

  • Salvatore Costantino
Can you reproduce the issue and check whether the slave can ping the floating storage IP (10.10.10.3 by default)? If that is OK, try to telnet from the slave to the storage IP of the master with:
"telnet <storage ip> 3260"

If telnet fails to connect, something at the networking layer is preventing the slave from seeing the storage IP.
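
The two checks above can also be wrapped in a couple of small shell helpers. This is only a sketch: the function names are hypothetical (they are not part of HA-Lizard), and it assumes bash (for /dev/tcp) and the coreutils `timeout` command are available on the slave. The addresses in the usage comments are the ones mentioned in this thread.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of HA-Lizard.

# Return 0 if the host answers at least one of two pings.
can_ping() {
    ping -c 2 -W 2 "$1" >/dev/null 2>&1
}

# Return 0 if a TCP connection to host:port can be opened.
# Port 3260 is the standard iSCSI target port.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" >/dev/null 2>&1
}

# Usage on the slave:
#   can_ping 10.10.10.3 && echo "floating IP reachable" || echo "floating IP unreachable"
#   port_open 10.10.10.1 3260 && echo "iSCSI port open" || echo "iSCSI port closed"
```

Unlike an interactive telnet session, `port_open` returns immediately with an exit status, which makes it easier to run repeatedly while rebooting the slave.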


iSCSI SR Broken if I reboot Slave Server 8 years 2 months ago #662

  • Adam Ward
Hi,

No, the Slave Server cannot ping 10.10.10.3.

It's weird: if I reboot it, sometimes it starts and connects, other times it does not.

I CAN telnet from the slave server to 10.10.10.1, e.g.:
from the Slave I can "telnet 10.10.10.1 3260" and it drops me to a telnet prompt.

Any ideas?

Best regards,

Adam


iSCSI SR Broken if I reboot Slave Server 8 years 2 months ago #663

  • Adam Ward
A couple more tests:

Started the Pool Master from a Cold Boot - Master connected to the SR correctly. Allowed 5 minutes and then Started the Slave from Cold Boot.

The Slave started but could not connect to the SR, and I couldn't Ping 10.10.10.3 from the Slave console.

Rebooted the Slave - which then connected to the SR! Just to check, I rebooted the Slave again and it came back online and connected to the SR...

Weird...


iSCSI SR Broken if I reboot Slave Server 8 years 2 months ago #664

  • Salvatore Costantino
We've tried to reproduce the issue with 6 slave reboots (cold and warm) and could not reproduce it.

Can you try the following to check whether this is an ARP-related issue:

Reproduce the issue such that the slave fails to connect to the storage on reboot.

On the Master, find your DRBD interface name (assuming your local replication IP is 10.10.10.1):

ip a | grep 10.10.10.1

You should get something like this:

"inet 10.10.10.1/24 brd 10.10.10.255 scope global xenbr1" where the interface name is "xenbr1"

From the master shell, run the following (assuming your interface name is xenbr1 and the floating IP is 10.10.10.3):

arping -I xenbr1 -U 10.10.10.3 -c 2 -w 2

See whether the slave fixes itself and report back your results.
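
The lookup and the arping step can be combined, roughly as sketched below. The helper names are hypothetical, and the sketch assumes the interface name is the last field of the matching "inet" line, as in the sample output above (an address added with a label may show the label there instead). The arping flags are the same ones given above.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of HA-Lizard.

# Read `ip a` output on stdin and print the interface carrying the given IPv4 address.
# Matching on "<ip>/" avoids false hits such as 10.10.10.10 when looking for 10.10.10.1.
iface_for_ip() {
    awk -v ip="$1" '$1 == "inet" && index($2, ip "/") == 1 { print $NF; exit }'
}

# Send unsolicited (gratuitous) ARP for the floating IP from the given interface.
# $1 = interface name, $2 = floating IP
announce_ip() {
    arping -I "$1" -U "$2" -c 2 -w 2
}

# Usage on the master:
#   ifname=$(ip a | iface_for_ip 10.10.10.1)
#   announce_ip "$ifname" 10.10.10.3
```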


iSCSI SR Broken if I reboot Slave Server 8 years 2 months ago #665

  • Adam Ward
Hi,

I now cannot get it to break! I have cold booted the servers; the only difference is that I let the Master come online, left it alone for 15 minutes, and THEN brought the Slave online. Since doing this, I cannot break the iSCSI SR.

Thanks for your help - I think this may be OK now.

regards,

Adam
