My team in the States reported an issue with a Red Hat iSCSI initiator having trouble connecting to a volume exported by a ZFS server.
There is an issue filed on GitLab.
As I always do when I troubleshoot a problem, I create a forensic post-mortem document recording everything I do, so that later others can learn how I fixed it, or follow the steps I took to troubleshoot it.
Please note: some IP addresses have been manually edited.
2019-08-09 10:20:10 Start of the investigation
I log into the server with IP address xxx.yyy.16.30. It is an All-Flash-Array server running RHEL 6.10 and DRAID v.08091350.
htop shows normal/low activity.
I check the addresses on the iSCSI initiator (the client) to make sure it is connecting to the right server.
[root@Host-164 ~]# ip addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:25:90:c5:1e:ea brd ff:ff:ff:ff:ff:ff
inet xxx.yyy.13.164/16 brd xxx.yyy.255.255 scope global eno1
valid_lft forever preferred_lft forever
inet6 fe80::225:90ff:fec5:1eea/64 scope link
valid_lft forever preferred_lft forever
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
link/ether 00:25:90:c5:1e:eb brd ff:ff:ff:ff:ff:ff
4: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 24:8a:07:a4:94:9c brd ff:ff:ff:ff:ff:ff
inet 192.168.100.164/24 brd 192.168.100.255 scope global enp3s0f0
valid_lft forever preferred_lft forever
inet6 fe80::268a:7ff:fea4:949c/64 scope link
valid_lft forever preferred_lft forever
5: enp3s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 24:8a:07:a4:94:9d brd ff:ff:ff:ff:ff:ff
inet 192.168.200.164/24 brd 192.168.200.255 scope global enp3s0f1
valid_lft forever preferred_lft forever
inet6 fe80::268a:7ff:fea4:949d/64 scope link
valid_lft forever preferred_lft forever
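As a quicker sanity check (an optional sketch), the one-line-per-address form of `ip` makes it easy to confirm the host has an address on the storage network without scanning the full output:

```shell
# List only IPv4 addresses, one interface per line.
# In this setup, 192.168.100.0/24 is the iSCSI data network.
ip -o -4 addr list | awk '{print $2, $4}'
```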
I see the LUNs on the host, connected to the server's 10 Gbps interface:
[root@Host-164 ~]# iscsiadm -m session
tcp: [10] 192.168.100.30:3260,1 iqn.2003-01.org.linux-iscsi:vol4 (non-flash)
tcp: [11] 192.168.100.30:3260,1 iqn.2003-01.org.linux-iscsi:vol5 (non-flash)
tcp: [7] 192.168.100.30:3260,1 iqn.2003-01.org.linux-iscsi:vol1 (non-flash)
tcp: [8] 192.168.100.30:3260,1 iqn.2003-01.org.linux-iscsi:vol2 (non-flash)
tcp: [9] 192.168.100.30:3260,1 iqn.2003-01.org.linux-iscsi:vol3 (non-flash)
Finding the mysteries…
The output of cat /proc/partitions is a bit strange compared to mount:
[root@Host-164 ~]# cat /proc/partitions
major minor #blocks name
8 0 125034840 sda
8 1 512000 sda1
8 2 124521472 sda2
253 0 12505088 dm-0
253 1 112013312 dm-1
8 32 104857600 sdc
8 16 104857600 sdb
8 48 104857600 sdd
8 64 104857600 sde
8 80 104857600 sdf
While mount shows this:
/dev/sdg1 on /mnt/large type ext4 (ro,relatime,seclabel,data=ordered)
lsblk shows that /dev/sdg is not present:
[root@Host-164 ~]# lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda 8:0 0 119.2G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 118.8G 0 part
├─rhel-swap 253:0 0 11.9G 0 lvm [SWAP]
└─rhel-root 253:1 0 106.8G 0 lvm /
sdb 8:16 0 100G 0 disk
sdc 8:32 0 100G 0 disk
sdd 8:48 0 100G 0 disk
sde 8:64 0 100G 0 disk
sdf 8:80 0 100G 0 disk
And as expected:
[root@Host-164 ~]# ls -al /mnt/large
ls: reading directory /mnt/large: Input/output error
total 0
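A helper like the following (a sketch; the device name and patterns come from the symptoms above) filters the kernel log for signs of the lost iSCSI disk:

```shell
# Filter a kernel log stream for signs of the lost iSCSI disk.
# The device name (sdg) is the one that disappeared from lsblk.
filter_disk_errors() {
    grep -iE 'sdg|i/o error|connection.*(lost|error)' | tail -n 20
}
# Typical usage on the affected host:
#   dmesg | filter_disk_errors
```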
I see that the volumes appear not to have been partitioned:
[root@Host-164 ~]# fdisk /dev/sdf
Welcome to fdisk (util-linux 2.23.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0xddf99f40.
Command (m for help): p
Disk /dev/sdf: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xddf99f40
Device Boot      Start         End      Blocks   Id  System
Command (m for help): q
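The partitioning step itself is not captured above; a non-interactive equivalent of what was done in fdisk (a sketch, assuming a single partition spanning the whole disk) would be:

```shell
# Create an MSDOS label and a single primary partition covering the
# whole device, non-interactively. The argument is the disk to partition.
create_single_partition() {
    local device="$1"
    parted -s "$device" mklabel msdos
    parted -s "$device" mkpart primary ext2 1MiB 100%
}
# Usage on the affected host (destructive, wipes the partition table!):
#   create_single_partition /dev/sdb
```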
I create a partition and format it with ext2:
[root@Host-164 ~]# mke2fs /dev/sdb1
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
6553600 inodes, 26214144 blocks
1310707 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
800 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
I mount:
[root@Host-164 ~]# mount /dev/sdb1 /mnt/vol1
I fill the volume from the client, and it works. I check the activity on the server with iostat, and more MB/s are being written to the server's drives than the actual copy speed on the client.
I completely fill the 100 GB, but the speed is slow. We are working on a 10 Gbps network, so I expected more.
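To put a number on the write speed (a sketch; path and size are illustrative), a dd run with a final fdatasync avoids the page cache inflating the result:

```shell
# Write a 256 MB test file, syncing before dd reports throughput,
# then clean up. The target directory is passed as an argument.
measure_write_speed() {
    local target_dir="$1"
    dd if=/dev/zero of="$target_dir/speedtest.bin" bs=1M count=256 conv=fdatasync 2>&1 | tail -n 1
    rm -f "$target_dir/speedtest.bin"
}
# Usage: measure_write_speed /mnt/vol1
```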
I check the connections to the server:
[root@obs4602-1810 ~]# netstat | grep -v "unix"
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55300 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55298 ESTABLISHED
tcp 0 0 xxx.yyy.18.10:ssh xxx.yyy.12.154:57137 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55304 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55301 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55306 ESTABLISHED
tcp 0 0 xxx.yyy.18.10:ssh xxx.yyy.12.154:56395 ESTABLISHED
tcp 0 0 xxx.yyy.18.10:ssh xxx.yyy.14.52:57330 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55296 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55305 ESTABLISHED
tcp 0 0 xxx.yyy.18.10:ssh xxx.yyy.12.154:57133 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55303 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55299 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.176:57542 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55302 ESTABLISHED
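Counting established connections per remote host makes the pattern obvious at a glance (a small sketch that pipes netstat-style output through awk):

```shell
# Count ESTABLISHED TCP connections per foreign address.
# Expects `netstat -tn`-style output on stdin
# (Proto Recv-Q Send-Q Local Foreign State).
count_peers() {
    awk '$6 == "ESTABLISHED" { split($5, a, ":"); print a[1] }' | sort | uniq -c | sort -rn
}
# Usage: netstat -tn | count_peers
```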
I see many connections from host 180. I check, and another member of the team is using that client to run vdbench tests against the server.
This explains the slower speed I was getting.
Conclusions
- There was a local problem on the host. The disconnection problems appear to be related to a lost connection (sdg). All that information was written to the iSCSI buffer, not to the server. In fact, that volume was mapped in the system under another letter; sdg was no longer in use.
- Speed was slow because another client was pushing data to the server at the same time.
- Windows clients with the auto-reconnect option do not report timeouts, while on Red Hat clients the iSCSI connection times out. The timeout should be increased.
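On the Red Hat side, the relevant knob lives in /etc/iscsi/iscsid.conf (the value below is illustrative, not a recommendation tested here):

```
# /etc/iscsi/iscsid.conf
# Seconds to wait for a lost session to re-establish before failing
# commands back to the upper layers (the default is 120).
node.session.timeo.replacement_timeout = 180
```

Note that this file only affects newly discovered targets; for already-known nodes the setting has to be updated with iscsiadm as well.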
2020-03-10 22:16 IST TIP: At that time we were using Google Suite and Skype to communicate internally with the different team members across the world. If we had used a tool like Slack, with a channel like #engineering or #sanjoselab, I could have pinged the channel and asked: “Is somebody using obs4602-1810?”