One of my colleagues showed me dstat, a very nice tool for system monitoring and for monitoring the bandwidth of drives. ifstat, as a complement to iftop, is also very nice for the Network. This functionality is also available in CTOP.py.
As I shared in past news on the blog, I'm resuming my contributions to the ZFS Community.
A long time ago I created some ZFS tools that I want to release soon as Open Source.
I equipped myself with the proper Hardware to test on SAS and SATA:
12G Internal PCI-E SAS/SATA HBA RAID Controller Card, Broadcom's SAS 3008, compatible with SAS 9300-8i. This is just an HBA (Host Bus Adapter), it doesn't support RAID. It only connects up to 8 drives, or 1,024 through an expander, to my computer. It has a bandwidth of 9,600 MB/s, which guarantees me that I'll be able to add 12 SAS SSD Enterprise grade drives at almost the max speed of the drives. Those drives perform at 900 MB/s, so if I'm using all of them at the same time, like if I have a pool of 8 + 3 and I rebuild a broken drive or I just push Data, I would be using 12×900 = 10,800 MB/s. Close. Fair enough.
VANDESAIL Mini-SAS Cables, 1m Internal Mini-SAS to 4x SAS SATA Forward Breakout Cable Hard Drive Data Transfer Cable (SAS Cable).
SilverStone SST-FS212B – Aluminium Trayless Hot Swap Mobile Rack Backplane / Internal Hard Drive Enclosure for 12x 2.5 Inch SAS/SATA HDD or SSD, fit in any 3x 5.25 Inch Drive Bay, with Fan and Lock, black
Terminator is here. I ordered this T-800 head a while ago and finally arrived.
Finally I will have my empty USB keys located and protected. ;)
If you use ZFS with spinning drives and you share iSCSI, you will need to use a SLOG device for the ZIL, otherwise you'll see your iSCSI connections interrupted.
What is a ZIL?
ZIL: Acronym for ZFS Intent Log. It logs synchronous operations to disk.
SLOG: Acronym for (S)eparate (LOG) Device.
In ZFS, Data is first written and stored in-memory, then it's flushed to the drives. This can take 10 seconds normally, a bit more on certain occasions.
So without a SLOG it can happen that if a power loss occurs, you may lose the last 10 seconds of Data submitted.
The SLOG device brings the security that, if there is a power loss, after remounting the pool the information in the SLOG, already acknowledged to the iSCSI clients, is not lost and is flushed to the Hard drives that form the pool. Basically this device keeps the writes that come from the network, flushes them to the Hard drives, and then clears this data from the SLOG.
The SLOG also allows ZFS to sort how the transactions will be written, to do it in a more efficient way.
Normally I describe configurations with a fast device for the SLOG ZIL, like one NVMe drive or SAS SSD, or most commonly a pair of them in mirror, for a pool of 12 HDD drives or more, preferably SAS, maybe SATA, with 14TB or more each.
As the SLOG device will persist your Data if there is a power off, and submit the accepted transactions to the pool, it is clear that you cannot spare yourself from having a SLOG ZIL device (or better, a mirror of them). It is needed to bring security when writing remotely.
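As a quick reference, adding a mirrored SLOG to an existing pool is a one-liner; the pool name and the NVMe device names below are just placeholders for this sketch:
zpool add mypool log mirror /dev/nvme0n1 /dev/nvme1n1
# Verify the log vdev shows up as a mirror
zpool status mypool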
But what happens if we have a kind of business where we don't care if the last 10 seconds of writes may be lost? (ZFS will never get corrupted thanks to its kind of journal system.) Maybe because we are filling a Server as fast as possible, migrating from another one, or because we are running workloads that can be resumed if some data is lost… do we really need to have the speed constraint of an SSD? Examples are a Hadoop node, or a SETI@Home client. Tasks will be resumed if something failed.
Or maybe you fill your servers with sync=always, so writing is safe, and then you use them only for reads, or as static Internet caches (CDNs like Akamai, Cloudflare…), or you use them for storing Backups, write once read many. You don't really need the speed constraint of a ZIL running at 800 MB/s.
Let me put it another way: we have 2 NICs of 100Gbps in bonding, so 200Gbps (equivalent to 25 GB/s, Gigabytes per second), 90 HDD drives that can work in parallel at up to 250 MB/s each (22.5 GB/s), and our Server has a pair of SAS SSD ZIL in mirror that writes at 900 MB/s (Megabytes per second, so 0.9 GB/s), so our bottleneck or constraint is the SLOG ZIL.
Adding one RAMDISK, or better two RAMDISKs in mirror, we can get to much higher speeds. I cannot tell you how much, but in my tests with regular configurations (8D+3P) I was achieving more than 2 GB (Gigabytes) per second sustained of Data to the pool. Take into account that the speed writing to the pool does not only depend on the speed of the ZIL and the speed of the HDD spinning drives (slow, between 100 and 250 MB/s), but also on the config of the pool (number of vdevs, distribution of data and parity drives), the throughput of your IOC (Input Output Controller), and the number of them.
Real live scenarios tend to be more in the line of having 2x 10GbE cards, combined in bonding making 20Gbps, so being able to transmit 2.5 GB/s. So to get the max speed of our Network this Ramdrive will do it. NVMe devices used as ZIL will do it too.
The problem with NVMe is that they are connected to the PCI Express bus, and so they are not hot swap. If one dies, you cannot replace it without stopping the Server.
The problem with SSDs is that they are not made for writing; they will die, so you need at least a mirror, and for heavy IO I strongly recommend you to go with Enterprise grade SAS SSD drives. Those are made to last.
Enterprise grade SSDs are double the price of a common SSD, but that peace of mind and extra endurance is worth it. And you don't need a very big device, it only has to hold 10 seconds of Data at max speed. So if you can ingest Data through the Network at 20 Gbps (2.5 GB/s) you only need approximately 25 GB of space for the SLOG. 50 GB if you want to be more than safe.
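The sizing arithmetic from the paragraph above, as a tiny sketch you can adapt to your own Network speed (the values are the ones from the example):
# SLOG only needs to hold a few seconds of writes at the max Network ingest speed
NETWORK_GBPS=20              # 2x 10Gbps NICs in bonding
SECONDS_TO_HOLD=10           # roughly how long ZFS keeps transactions in-memory before flushing
# Multiply before dividing so the integer arithmetic does not lose the fraction
SLOG_SIZE_GB=$((NETWORK_GBPS*SECONDS_TO_HOLD/8))
echo "Minimum SLOG size: ${SLOG_SIZE_GB} GB (double it if you want to be more than safe)"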
Also you can use partitions instead of complete devices for the SLOG (like for the ZFS pool, where you can add complete drives, or partitions).
If you write locally, and you have 4 IOCs capable of delivering 8 GB/s each, and you write to a Dataset in the pool, and not to a ZVOL (which are slow by nature), you can get an astonishing combined speed writing to the drives. If you are migrating a Server to a new one, where you can resume if power goes down, then it's safe to disable sync (set async) while this process runs, and turn sync on when going live to production. If you use async you don't need to use a SLOG.
4 IOCs able to deliver 8 GB/s each are enough to provide sustained speed to 90 HDD SAS drives: 90x200MB/s=18GB/s required at max speed, or 90x250MB/s=22.5GB/s.
The HDD drives provide different speeds in the inner and in the outer areas of the drive, so normally those drives up to 8TB perform between 100 and 200 MB/s, and the drives from 10TB SAS to 14TB SAS perform between 145 and 250 MB/s. I cannot tell about the 16 TB as I’ve not tested them.
The instructions to create a Ramdrive and assign it to a pool are like this:
#!/usr/bin/env bash
RAM_GB=1
# Note: rd_size for the brd module is expressed in KiB (1024-byte blocks)
RAM_DRIVE_SIZE_IN_KB=$((RAM_GB*1048576))
if [[ $(id -u) -ne 0 ]]; then
    echo "Please run as root"
    exit 1
fi
modprobe brd rd_nr=1 rd_size=${RAM_DRIVE_SIZE_IN_KB} max_part=0
echo "Use it like: zpool add carlespool log ram0"
If you created more than one Ramdisk you can add a mirror for the SLOG to the pool with something like the following:
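Assuming two Ramdisks named ram0 and ram1 and the same example pool name as above, the command would be along these lines:
zpool add carlespool log mirror ram0 ram1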
You can partition the Ramdrive and add a partition but we want to add the whole ram device.
Obviously you cannot put other things to that Ramdisk (like the Metadata) as you need persistence for that.
In any case, please, avoid JBODs loaded with big HDD drives with low bandwidth micro SATA, like 3Gbps per channel, to the Server, and RAID. The bandwidth is too low. Your rebuilds will take forever.
With ZFS you’ll resilver (rebuild) only the actual data, not the whole drive.
This is a shell script I made a long time ago and I use it to monitor in real time the total or individual bandwidth and the maximum bandwidth achieved, for READ and WRITE, of Hard drives and NVMe devices.
It uses iostat to capture the metrics, and then processes the maximum values and the combined speed of all the drives. It also has an interesting feature to leave out the boot device. That's very handy for Rack Servers where you boot from an SSD card or an SD, and you want to monitor the speed of the other (SAS probably) devices.
I used it to monitor the total bandwidth achieved by our 4U60 and 4U90 Servers, the All-Flash-Arrays 2U and the NVMe 1U units in Sanmina, and the real throughput of the IOCs (Input Output Controllers).
I also used it to compare the real data written to ZFS and mdraid RAID systems, and to disks, and the combined speed with different pool configurations, as well as the efficiency of iSCSI and NFS from clients to the Servers.
You can specify how many times the information will be printed, whether you want to keep the max speed of each device separately, and specify a drive to exclude. Normally it will be the boot drive.
If you want to test performance metrics you should make sure that other programs are not running or using the swap, to prevent bias. You should exclude the boot drive if it doesn't form part of your tests (like in the 4U60 with an SSD boot drive on a card, and 60 SAS or SATA hard drive bays).
You may find useful tools like iotop.
You can find the code here, and in my gitlab repo:
#!/usr/bin/env bash
AUTHOR="Carles Mateo"
VERSION="1.4"
# Changelog
# 1.4
# Added support for NVMe drives
# 1.3
# Fixed Decimals in KB count that were causing errors
# 1.2
# Added new parameter to output per drive stats
# Counting is performed in KB
# Leave boot device empty if you want to add its activity to the results
# Especially thinking about boot SD card or SSD devices versus SAS drives bandwidth calculation.
# Otherwise use e.g.: s_BOOT_DEVICE="sdcv"
s_BOOT_DEVICE=""
# If this value is positive the loop will run n times
# If it is negative, e.g.: -1, it will loop forever
i_LOOP_TIMES=-1
# Display all drives separately
i_ALL_SEPARATEDLY=0
# Display in KB or MB
s_DISPLAY_UNIT="M"
# Init variables
i_READ_MAX=0
i_WRITE_MAX=0
s_READ_MAX_DATE=""
s_WRITE_MAX_DATE=""
i_IOSTAT_READ_KB=0
i_IOSTAT_WRITE_KB=0
# Internal variables
i_NUMBER_OF_DRIVES=0
s_LIST_OF_DRIVES=""
i_UNKNOWN_OPTION=0
# So if you run in screen you see colors :)
export TERM=xterm
# ANSI colors
s_COLOR_RED='\033[0;31m'
s_COLOR_BLUE='\033[0;34m'
s_COLOR_NONE='\033[0m'
for i in "$@"
do
case $i in
-b=*|--boot_device=*)
s_BOOT_DEVICE="${i#*=}"
shift # past argument=value
;;
-l=*|--loop_times=*)
i_LOOP_TIMES="${i#*=}"
shift # past argument=value
;;
-a=*|--all_separatedly=*)
i_ALL_SEPARATEDLY="${i#*=}"
shift # past argument=value
;;
*)
# unknown option
i_UNKNOWN_OPTION=1
;;
esac
done
if [[ "${i_UNKNOWN_OPTION}" -eq 1 ]]; then
echo -e "${s_COLOR_RED}Unknown option${s_COLOR_NONE}"
echo "Use: [-b|--boot_device=sda -l|--loop_times=-1 -a|--all-separatedly=1]"
exit 1
fi
if [ -z "${s_BOOT_DEVICE}" ]; then
i_NUMBER_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | wc --lines`
s_LIST_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | awk '{printf $1" ";}'`
else
echo -e "${s_COLOR_BLUE}Excluding Boot Device:${s_COLOR_NONE} ${s_BOOT_DEVICE}"
# Add a space after the name of the device to prevent something like booting with sda leaving out drives like sdaa sdab sdac...
i_NUMBER_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | grep -v "${s_BOOT_DEVICE} " | wc --lines`
s_LIST_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | grep -v "${s_BOOT_DEVICE} " | awk '{printf $1" ";}'`
fi
AR_DRIVES=(${s_LIST_OF_DRIVES})
i_COUNTER_LOOP=0
for s_DRIVE in ${AR_DRIVES};
do
AR_DRIVES_VALUES_AVG[i_COUNTER_LOOP]=0
AR_DRIVES_VALUES_READ_MAX[i_COUNTER_LOOP]=0
AR_DRIVES_VALUES_WRITE_MAX[i_COUNTER_LOOP]=0
i_COUNTER_LOOP=$((i_COUNTER_LOOP+1))
done
echo -e "${s_COLOR_BLUE}Bandwidth for drives:${s_COLOR_NONE} ${i_NUMBER_OF_DRIVES}"
echo -e "${s_COLOR_BLUE}Devices:${s_COLOR_NONE} ${s_LIST_OF_DRIVES}"
echo ""
while [ "${i_LOOP_TIMES}" -lt 0 ] || [ "${i_LOOP_TIMES}" -gt 0 ] ;
do
s_READ_PRE_COLOR=""
s_READ_POS_COLOR=""
s_WRITE_PRE_COLOR=""
s_WRITE_POS_COLOR=""
# In MB
# s_IOSTAT_OUTPUT_ALL_DRIVES=`iostat -d -m -y 1 1 | grep "sd\|nvm"`
# In KB
s_IOSTAT_OUTPUT_ALL_DRIVES=`iostat -d -y 1 1 | grep "sd\|nvm"`
if [ -z "${s_BOOT_DEVICE}" ]; then
s_IOSTAT_OUTPUT=`printf "${s_IOSTAT_OUTPUT_ALL_DRIVES}" | awk '{sum_read += $3} {sum_write += $4} END {printf sum_read"|"sum_write"\n"}'`
else
# Add a space after the name of the device to prevent something like booting with sda leaving out drives like sdaa sdab sdac...
s_IOSTAT_OUTPUT=`printf "${s_IOSTAT_OUTPUT_ALL_DRIVES}" | grep -v "${s_BOOT_DEVICE} " | awk '{sum_read += $3} {sum_write += $4} END {printf sum_read"|"sum_write"\n"}'`
fi
if [ "${i_ALL_SEPARATEDLY}" -eq 1 ]; then
i_COUNTER_LOOP=0
for s_DRIVE in ${AR_DRIVES};
do
s_IOSTAT_DRIVE=`printf "${s_IOSTAT_OUTPUT_ALL_DRIVES}" | grep $s_DRIVE | head --lines=1 | awk '{sum_read += $3} {sum_write += $4} END {printf sum_read"|"sum_write"\n"}'`
i_IOSTAT_READ_KB=`printf "%s" "${s_IOSTAT_DRIVE}" | awk -F '|' '{print $1;}'`
i_IOSTAT_WRITE_KB=`printf "%s" "${s_IOSTAT_DRIVE}" | awk -F '|' '{print $2;}'`
if [ "${i_IOSTAT_READ_KB%.*}" -gt ${AR_DRIVES_VALUES_READ_MAX[i_COUNTER_LOOP]%.*} ]; then
AR_DRIVES_VALUES_READ_MAX[i_COUNTER_LOOP]=${i_IOSTAT_READ_KB}
echo -e "New Max Speed Reading for ${s_COLOR_BLUE}$s_DRIVE${s_COLOR_NONE} at ${s_COLOR_RED}${i_IOSTAT_READ_KB} KB/s${s_COLOR_NONE}"
echo
fi
if [ "${i_IOSTAT_WRITE_KB%.*}" -gt ${AR_DRIVES_VALUES_WRITE_MAX[i_COUNTER_LOOP]%.*} ]; then
AR_DRIVES_VALUES_WRITE_MAX[i_COUNTER_LOOP]=${i_IOSTAT_WRITE_KB}
echo -e "New Max Speed Writing for ${s_COLOR_BLUE}$s_DRIVE${s_COLOR_NONE} at ${s_COLOR_RED}${i_IOSTAT_WRITE_KB} KB/s${s_COLOR_NONE}"
fi
i_COUNTER_LOOP=$((i_COUNTER_LOOP+1))
done
fi
i_IOSTAT_READ_KB=`printf "%s" "${s_IOSTAT_OUTPUT}" | awk -F '|' '{print $1;}'`
i_IOSTAT_WRITE_KB=`printf "%s" "${s_IOSTAT_OUTPUT}" | awk -F '|' '{print $2;}'`
# CAST to Integer
if [ "${i_IOSTAT_READ_KB%.*}" -gt ${i_READ_MAX%.*} ]; then
i_READ_MAX=${i_IOSTAT_READ_KB%.*}
s_READ_PRE_COLOR="${s_COLOR_RED}"
s_READ_POS_COLOR="${s_COLOR_NONE}"
s_READ_MAX_DATE=`date`
i_READ_MAX_MB=$((i_READ_MAX/1024))
fi
# CAST to Integer
if [ "${i_IOSTAT_WRITE_KB%.*}" -gt ${i_WRITE_MAX%.*} ]; then
i_WRITE_MAX=${i_IOSTAT_WRITE_KB%.*}
s_WRITE_PRE_COLOR="${s_COLOR_RED}"
s_WRITE_POS_COLOR="${s_COLOR_NONE}"
s_WRITE_MAX_DATE=`date`
i_WRITE_MAX_MB=$((i_WRITE_MAX/1024))
fi
if [ "${s_DISPLAY_UNIT}" == "M" ]; then
# Get MB
i_IOSTAT_READ_UNIT=${i_IOSTAT_READ_KB%.*}
i_IOSTAT_WRITE_UNIT=${i_IOSTAT_WRITE_KB%.*}
i_IOSTAT_READ_UNIT=$((i_IOSTAT_READ_UNIT/1024))
i_IOSTAT_WRITE_UNIT=$((i_IOSTAT_WRITE_UNIT/1024))
fi
# When a MAX is detected it will be displayed in RED
echo -e "READ ${s_READ_PRE_COLOR}${i_IOSTAT_READ_UNIT} MB/s ${s_READ_POS_COLOR} (${i_IOSTAT_READ_KB} KB/s) Max: ${i_READ_MAX_MB} MB/s (${i_READ_MAX} KB/s) (${s_READ_MAX_DATE})"
echo -e "WRITE ${s_WRITE_PRE_COLOR}${i_IOSTAT_WRITE_UNIT} MB/s ${s_WRITE_POS_COLOR} (${i_IOSTAT_WRITE_KB} KB/s) Max: ${i_WRITE_MAX_MB} MB/s (${i_WRITE_MAX} KB/s) (${s_WRITE_MAX_DATE})"
if [ "$i_LOOP_TIMES" -gt 0 ]; then
i_LOOP_TIMES=$((i_LOOP_TIMES-1))
fi
done
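For reference, this is how I would invoke it based on the options the script accepts (the script file name here is just an example):
# Exclude the boot drive sda, loop 300 times and also report per-drive maximums
./bandwidth_drives.sh -b=sda -l=300 -a=1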
This article is more an exercise, like a game, so you get to know certain things about Linux, and follow my mental process to uncover this. It is nothing mysterious for Senior Engineers, but Junior Sys Admins may enjoy this reading. :)
Ok, so the first thing is that I wrote a script in order to completely back up my NVMe hard drive to a gzipped file, and then I will use this as a motivation to go deep into investigation in order to understand.
So basically, we are going to restart the computer, boot with Linux Live USB Key, mount the Seagate Hard Drive, and run the script.
We are booting with a Live Linux CD in order to have our partition unmounted and unmodified while we do the backup. This is to avoid corruption or data loss, as a live Filesystem is getting modifications as we read it.
The problem with this first script is that it will generate a big gzip file.
By big I mean much bigger than 2GB. Not all physical supports allow files bigger than 2GB or 4GB, but even if they do, it's a pain to transfer this over the Network, or on USB drives, so we are going to do a slight modification.
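A minimal sketch of what that modification looks like, assuming the NVMe device is /dev/nvme0n1 and the destination drive is mounted under /mnt/backup (both names are assumptions for this example): the compressed stream is cut into 1 GB chunks as it is produced.
# Back up the whole NVMe drive, compress on the fly and split into 1 GB files
dd if=/dev/nvme0n1 bs=1M status=progress | gzip -c | split -b 1G - /mnt/backup/nvme.img.gz.
# To restore: cat /mnt/backup/nvme.img.gz.* | gunzip -c | dd of=/dev/nvme0n1 bs=1M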
Then one may ask himself: wait, if pipes use STDOUT and STDIN and dd is displaying to the screen, will our gz file get corrupted?
I like when people question things, and investigate, so let’s answer this question.
If it was a young member of my Team I would ask:
Ok, try it. Check the output file to see if it is corrupted.
So they can do zcat or zless to inspect the file, see if it has errors, and to make sure:
gzip -v -t nvme.img.gz
nvme.img.gz: OK
Ok, so what happened? Because we were seeing output on the screen.
Assuming the young Engineer does not know the answer, I would have told them:
Ok, so you know that if dd printed to STDOUT you wouldn't see it, because it would be sent to the pipe, so there is something more you're missing. Let's check the source code of dd to see what status=progress does.
And then look for “progress”.
Soon you'll find things like this everywhere:
if (progress_time)
fputc ('\r', stderr);
Ok, pay attention to where the data is written: stderr. So basically the answer is: dd status=progress does not corrupt STDOUT and prints to the screen because it uses STDERR.
Other funny ways to get the progress would be to use:
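For example, since the backup is being produced as numbered 1 GB chunks, something as simple as watching the destination directory grow will do (the path is the same assumed /mnt/backup as before):
# Refresh every 10 seconds: show the latest chunks and the free space left
watch -n 10 'ls -lh /mnt/backup/nvme.img.gz.* | tail -n 3; df -h /mnt/backup'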
So you would see in real time what the advance was, and finally 512GB were compressed to around 336GB in 336 files of 1 GB each (except the last one).
Another funny way would have been sending the USR1 signal to the dd process:
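Something along these lines; GNU dd prints its transfer statistics to STDERR when it receives SIGUSR1:
# Print dd's current statistics once
kill -USR1 $(pgrep -x dd)
# Or repeat it every 10 seconds
watch -n 10 'kill -USR1 $(pgrep -x dd)'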
Hope you enjoyed this little exercise about the importance of going deep, to the end, to understand what's going on in the system. :)
Instead of gzip you can use bzip2 or pixz. pixz is very handy if you want to just compress a file, as it uses multiple processors in parallel for the task.
xz or lrzip are other compressors. lrzip aims to compress very large files, especially source code.
You do df -h or ls / and the terminal freezes and not even CTRL + C works: you have a lock.
Normally this is due to a lock of the system trying to perform an IO.
It could be a physical spinning disk failing, but the most probable cause nowadays is that you have a network mount point and it is timing out.
If you execute mount and you get a timeout, and when you finally see the list you see an NFS, iSCSI or another kind of Network mount (you will see an IP Address), check for errors.
To do this in CentOS/RHEL you can do as root:
dmesg | grep -i "timed"
or depending on the System
cat /var/log/messages | grep -i "timed"
You’ll get something like this:
[root@compute01 carles]# dmesg -T | grep timed | head -n5
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:45 2020] nfs: server storage07 not responding, timed out
Please note I use dmesg -T in order to have human readable date instead of Unix Epoch.
My Team in The States reported an issue with a Red Hat iSCSI Initiator having issues connecting to a Volume exported by a ZFS Server.
There is an issue on GitLab.
As I always do when I troubleshoot a problem, I create a forensics post-mortem document recording everything I do, so later others can learn how I fixed it, or they can learn the steps I followed in order to troubleshoot.
Please note: Some IP addresses have been manually edited.
2019-08-09 10:20:10 Start of the investigation
I log into the Server, with IP Address: xxx.yyy.16.30. It is an All-Flash-Array Server with RHEL 6.10 and DRAID v.08091350.
Htop shows normal/low activity.
I check the addresses in the iSCSI Initiator (client), to make sure it is connecting to the right Server.
[root@Host-164 ~]# ip addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:25:90:c5:1e:ea brd ff:ff:ff:ff:ff:ff
inet xxx.yyy.13.164/16 brd xxx.yyy.255.255 scope global eno1
valid_lft forever preferred_lft forever
inet6 fe80::225:90ff:fec5:1eea/64 scope link
valid_lft forever preferred_lft forever
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
link/ether 00:25:90:c5:1e:eb brd ff:ff:ff:ff:ff:ff
4: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 24:8a:07:a4:94:9c brd ff:ff:ff:ff:ff:ff
inet 192.168.100.164/24 brd 192.168.100.255 scope global enp3s0f0
valid_lft forever preferred_lft forever
inet6 fe80::268a:7ff:fea4:949c/64 scope link
valid_lft forever preferred_lft forever
5: enp3s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 24:8a:07:a4:94:9d brd ff:ff:ff:ff:ff:ff
inet 192.168.200.164/24 brd 192.168.200.255 scope global enp3s0f1
valid_lft forever preferred_lft forever inet6 fe80::268a:7ff:fea4:949d/64 scope link
valid_lft forever preferred_lft forever
I see the LUNs on the host, connected through the 10Gbps interfaces of the Server:
/dev/sdg1 on /mnt/large type ext4 (ro,relatime,seclabel,data=ordered)
Lsblk shows that /dev/sdg is not present:
[root@Host-164 ~]# lsblk
NAME
MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 119.2G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 118.8G 0 part
├─rhel-swap 253:0 0 11.9G 0 lvm [SWAP]
└─rhel-root 253:1 0 106.8G 0 lvm /
sdb 8:16 0 100G 0 disk
sdc 8:32 0 100G 0 disk
sdd 8:48 0 100G 0 disk
sde 8:64 0 100G 0 disk
sdf 8:80 0 100G 0 disk
And as expected:
[root@Host-164 ~]# ls -al /mnt/large
ls: reading directory /mnt/large: Input/output error
total 0
I see that the Volumes appear to not have been partitioned:
[root@Host-164 ~]# fdisk /dev/sdf
Welcome to fdisk (util-linux 2.23.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0xddf99f40.
Command (m for help): p
Disk /dev/sdf: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xddf99f40
Device Boot      Start         End      Blocks   Id  System
Command (m for help): q
I create a partition and format it with ext2:
[root@Host-164 ~]# mke2fs /dev/sdb1
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
6553600 inodes, 26214144 blocks
1310707 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
800 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
I mount:
[root@Host-164 ~]# mount /dev/sdb1 /mnt/vol1
I fill the volume from the client, and it works. I check the activity in the Server with iostat and there are more MB/s written to the Server’s drives than actually speed copying in the client.
I completely fill 100GB but speed is slow. We are working on a 10Gbps Network so I expected more speed.
I check the connections to the Server:
[root@obs4602-1810 ~]# netstat | grep -v "unix"
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55300 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55298 ESTABLISHED
tcp 0 0 xxx.yyy.18.10:ssh xxx.yyy.12.154:57137 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55304 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55301 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55306 ESTABLISHED
tcp 0 0 xxx.yyy.18.10:ssh xxx.yyy.12.154:56395 ESTABLISHED
tcp 0 0 xxx.yyy.18.10:ssh xxx.yyy.14.52:57330 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55296 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55305 ESTABLISHED
tcp 0 0 xxx.yyy.18.10:ssh xxx.yyy.12.154:57133 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55303 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55299 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.176:57542 ESTABLISHED
tcp 0 0 192.168.10.10:iscsi-target 192.168.10.180:55302 ESTABLISHED
I see many connections from Host 180. I check that, and another member of the Team is using that client to test with vdbench against the Server.
This explains the slower speed I was getting.
Conclusions
There was a local problem on the Host. The problems with the disconnection seem to be related to a connection that was lost (sdg). All that information was written to the iSCSI buffer, not to the Server. In fact, that volume was mapped in the system with another letter; sdg was not in use.
Speed was slow due to another client pushing Data to the Server too.
Windows clients with the auto reconnect option are not reporting timeouts, while in Red Hat clients the iSCSI connection times out. The timeout should be increased.
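On Red Hat / CentOS with open-iscsi, the relevant knob lives in /etc/iscsi/iscsid.conf; a sketch of raising it (the value 120 is only an example, pick what suits your recovery expectations):
# /etc/iscsi/iscsid.conf
# Seconds to wait for a session re-establishment before failing SCSI commands
node.session.timeo.replacement_timeout = 120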
2020-03-10 22:16 IST TIP: At that time we were using Google suite and Skype to communicate internally with the different members across the world. If we had used a tool like Slack, and we had a channel like #engineering for example or #sanjoselab, then I could have paged and asked “Is somebody using obs4602-1810?“
1- Make sure the zfs file exists under zfs/contrib/initramfs/scripts/local-top/
If it does not exist, create a file called zfs under zfs/contrib/initramfs/scripts/local-top/ and add the following to that file:
#!/bin/sh
PREREQ="mdadm mdrun multipath"

prereqs()
{
    echo "$PREREQ"
}

case $1 in
    # get pre-requisites
    prereqs)
        prereqs
        exit 0
        ;;
esac

#
# Helper functions
#
message()
{
    if [ -x /bin/plymouth ] && plymouth --ping; then
        plymouth message --text="$@"
    else
        echo "$@" >&2
    fi
    return 0
}

udev_settle()
{
    # Wait for udev to be ready, see https://launchpad.net/bugs/85640
    if [ -x /sbin/udevadm ]; then
        /sbin/udevadm settle --timeout=30
    elif [ -x /sbin/udevsettle ]; then
        /sbin/udevsettle --timeout=30
    fi
    return 0
}

activate_vg()
{
    # Sanity checks
    if [ ! -x /sbin/lvm ]; then
        [ "$quiet" != "y" ] && message "lvm is not available"
        return 1
    fi

    # Detect and activate available volume groups
    /sbin/lvm vgscan
    /sbin/lvm vgchange -a y --sysinit
    return $?
}
I came up with this solution when one of my 4U60 Servers had two slots broken. You'll not use this in Production, as the SLOG loses its function, but I managed to use one $40K USD broken Server and to demonstrate that the Speed of the SLOG device (ZFS Intent Log or ZIL device) sets the constraints for the writing speed.
The ZFS DRAID config I was using required 60 drives, basically 58 14TB Spinning drives and 2 SSD for the SLOG ZIL. As I only had 58 working slots, I came up with this idea.
This trick can be very useful if you have a box full of Spinning drives, and when sharing zvols by iSCSI you get disconnected on the iSCSI Initiator side. This is typical when ZFS has only Spinning drives and it has no SLOG drives (dedicated fast devices for the ZIL, the ZFS Intent Log).
Create a single Ramdrive of 10GB of RAM:
modprobe brd rd_nr=1 rd_size=10485760 max_part=0
Confirm ram0 device exists now:
ls /dev/ram*
Confirm that the pool is imported:
zpool list
Add to the pool:
zpool add carles-N58-C3-D16-P2-S4 log ram0
In case you want to have two ram devices as SLOG devices in mirror, you can do it as shown below.
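A sketch of that, reusing the same pool name as above; rd_nr=2 makes the brd module create ram0 and ram1:
modprobe brd rd_nr=2 rd_size=10485760 max_part=0
zpool add carles-N58-C3-D16-P2-S4 log mirror ram0 ram1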
It is interesting to know that you can work with partitions instead of drives. So for this test we could have partitioned ram0 with 2 partitions and made it work in mirror. You'll see how much faster the iSCSI communication goes over the network. The writing speed of the ZIL SLOG device is the constraint for ingesting Data from the Network to the Server.
Creating a partition bigger than 2TiB
Master Boot Record (MBR) based partitioning is limited to 2TiB however GUID Partition Table (GPT) has a limit of 8 ZiB.
That's something very simple, but it makes you lose time if you're partitioning big iSCSI Shares, or ZFS Zvols, so here is the trick:
[root@CTRLA-18 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
[root@CTRLA-18 ~]# parted /dev/zvol/N58-C19-D2-P1-S1/vol54854gb
GNU Parted 3.1
Using /dev/zd0
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
Warning: The existing disk label on /dev/zd0 will be destroyed and all data on this disk will be lost. Do you want to continue?
Yes/No? y
(parted) print
Model: Unknown (unknown)
Disk /dev/zd0: 58.9TB
Sector size (logical/physical): 512B/65536B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
(parted) mkpart primary 0GB 58.9TB
(parted) print
Model: Unknown (unknown)
Disk /dev/zd0: 58.9TB
Sector size (logical/physical): 512B/65536B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 58.9TB 58.9TB primary
(parted) quit
Information: You may need to update /etc/fstab.
[root@CTRLA-18 ~]# mkfs
mkfs mkfs.btrfs mkfs.cramfs mkfs.ext2 mkfs.ext3 mkfs.ext4 mkfs.minix mkfs.xfs
[root@CTRLA-18 ~]# mkfs.ext4 /dev/zvol/N58-C19-D2-P1-S1/vol54854gb
mke2fs 1.42.9 (28-Dec-2013)
....
[root@CTRLA-18 ~]# mount /dev/zvol/N58-C19-D2-P1-S1/vol54854gb /Data
[root@CTRLA-18 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 50G 2.5G 48G 5% /
devtmpfs 126G 0 126G 0% /dev
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 126G 1.1G 125G 1% /run
tmpfs 126G 0 126G 0% /sys/fs/cgroup
/dev/sdp1 1014M 151M 864M 15% /boot
/dev/mapper/rhel-home 65G 33M 65G 1% /home
logs 49G 349M 48G 1% /logs
mysql 9.7G 128K 9.7G 1% /mysql
tmpfs 26G 0 26G 0% /run/user/0
/dev/zd0 54T 20K 51T 1% /Data
ZFS is unable to use a disk
Sometimes, after creating many pools, ZFS may be unable to create a new pool using a drive that is perfectly fine. In this situation, the ideal is to wipe the first areas of it, or all of it if you want. If it's an SSD, that is very fast:
dd if=/dev/zero of=/dev/sdc bs=1M status=progress
The status=progress will show a nice progress bar.
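If you only want to wipe the first areas (enough to clear the old labels and partition table), you can limit the amount written; the count here is just an example value:
dd if=/dev/zero of=/dev/sdc bs=1M count=100 status=progress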
Filling a half Petabyte pool as fast as possible
To fill a pool of 60 drives composed of 10TB or 14TB spinning drives, so more than half a PB, in order to test with real data, you can use this trick:
First, write to the Dataset directly, that's way faster than using zvols.
Secondly, disable the ZIL, set sync=disabled.
Third, use a file in memory to avoid the penalty of reading the file from disk.
Fourth, increase the recordsize to 1M for faster filling (in my experience).
You can use this script of mine that does everything for you; normally you would like to run it inside a screen session, and create a Dataset called Data. The script will mount it in /Data (zfs set mountpoint=/Data YOURPOOL/Data):
#!/usr/bin/env bash
# Created by Carles Mateo
FILE_ORIGINAL="/run/urandom.1GB"
FILE_PATTERN="/Data/urandom.1GB-clone."
# POOL="N56-C5-D8-P3-S1"
POOL="N58-C3-D16-P3-S1"
# The starting number, if you interrupt the filling process, you can update it just by updating this number to match the last partially written file
i_COPYING_INITIAL_NUMBER=1
# For 75% of 10TB (3x(16+3)+1 has 421TiB, so 75% of 421TiB or 431,104GiB is 323,328) use 323328
# i_COPYING_FINAL_NUMBER=323328
# For 75% of 10TB, 5x(8+3)+1 ZFS sees 352TiB, so 75% use 270336
# For 75% of 14TB, 3x(16+3)+1, use 453120
i_COPYING_FINAL_NUMBER=453120
# Creating an array that will hold the speed of the latest 1 minute
a_i_LATEST_SPEEDS=(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
i_POINTER_SPEEDS=0
i_COUNTER_SPEEDS=-1
i_ITEMS_KEPT_SPEEDS=60
i_AVG_SPEED=0
i_FILES_TO_BE_COPIED=$((i_COPYING_FINAL_NUMBER-i_COPYING_INITIAL_NUMBER))
get_average_speed () {
# Calculates the Average Speed
i_AVG_SPEED=0
for i_index in {0..59..1}
do
i_SPEED=$((a_i_LATEST_SPEEDS[i_index]))
i_AVG_SPEED=$((i_AVG_SPEED + i_SPEED))
done
i_AVG_SPEED=$((i_AVG_SPEED/((i_COUNTER_SPEEDS)+1)))
}
echo "Bash version ${BASH_VERSION}..."
echo "Disabling sync in the pool $POOL for faster speed"
zfs set sync=disabled $POOL
echo "Maximizing performance with recordsize"
zfs set recordsize=1M ${POOL}
zfs set recordsize=1M ${POOL}/Data
echo "Mounting the Dataset Data"
zfs set mountpoint=/Data ${POOL}/Data
zfs mount ${POOL}/Data
echo "Checking if file ${FILE_ORIGINAL} exists..."
if [[ -f ${FILE_ORIGINAL} ]]; then
ls -al ${FILE_ORIGINAL}
sha1sum ${FILE_ORIGINAL}
else
echo "Generating file..."
dd if=/dev/urandom of=${FILE_ORIGINAL} bs=1M count=1024 status=progress
fi
echo "Starting filling process..."
echo "We are going to copy ${i_FILES_TO_BE_COPIED} , starting from: ${i_COPYING_INITIAL_NUMBER} to: ${i_COPYING_FINAL_NUMBER}"
for ((i_NUMBER=${i_COPYING_INITIAL_NUMBER}; i_NUMBER<=${i_COPYING_FINAL_NUMBER}; i_NUMBER++));
do
s_datetime_ini=$(($(date +%s%N)/1000000))
DATE_NOW=`date '+%Y-%m-%d_%H-%M-%S'`
echo "${DATE_NOW} Copying ${FILE_ORIGINAL} to ${FILE_PATTERN}${i_NUMBER}"
cp ${FILE_ORIGINAL} ${FILE_PATTERN}${i_NUMBER}
s_datetime_end=$(($(date +%s%N)/1000000))
MILLISECONDS=$(expr "$s_datetime_end" - "$s_datetime_ini")
if [[ ${MILLISECONDS} -lt 1 ]]; then
BANDWIDTH_MBS="Unknown (too fast)"
# That should not happen, but if it did, we don't account crazy speeds
else
BANDWIDTH_MBS=$((1000*1024/MILLISECONDS))
# Make sure the Array space has been allocated
if [[ ${i_POINTER_SPEEDS} -gt ${i_COUNTER_SPEEDS} ]]; then
# Add item to the Array the first times only
a_i_LATEST_SPEEDS[i_POINTER_SPEEDS]=${BANDWIDTH_MBS}
i_COUNTER_SPEEDS=$((i_COUNTER_SPEEDS+1))
else
a_i_LATEST_SPEEDS[i_POINTER_SPEEDS]=${BANDWIDTH_MBS}
fi
i_POINTER_SPEEDS=$((i_POINTER_SPEEDS+1))
if [[ ${i_POINTER_SPEEDS} -ge ${i_ITEMS_KEPT_SPEEDS} ]]; then
i_POINTER_SPEEDS=0
fi
get_average_speed
fi
i_FILES_TO_BE_COPIED=$((i_FILES_TO_BE_COPIED-1))
i_REMAINING_TIME=$((1024*i_FILES_TO_BE_COPIED/i_AVG_SPEED))
i_REMAINING_HOURS=$((i_REMAINING_TIME/3600))
echo "File cloned in ${MILLISECONDS} milliseconds at ${BANDWIDTH_MBS} MB/s"
echo "Avg. Speed: ${i_AVG_SPEED} MB/s Remaining Files: ${i_FILES_TO_BE_COPIED} Remaining seconds: ${i_REMAINING_TIME} s. (${i_REMAINING_HOURS} h.)"
done
echo "Enabling sync=always"
zfs set sync=always ${POOL}
echo "Setting back recordsize to 128K"
zfs set recordsize=128K ${POOL}
zfs set recordsize=128K ${POOL}/Data
echo "Unmounting /Data"
zfs set mountpoint=none ${POOL}/Data
Creating a Sparse file that you can partition or create a loopback on it
I know, your laptop has 512GB of M.2 SSD or NVMe, so that's it.
Well, you can create a sparse file much bigger than your capacity, and it will use practically 0 bytes of it at all.
For example:
truncate -s 1600GB file_disk0.img
If the files are stored in / then you can add a loop device:
sudo losetup -f /file_disk0.img
I do the same with the 5 files I created.
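As a sketch, the whole thing for the 5 files in one loop (the names just follow the pattern above, stored in /):
# Create the five sparse files and attach a loop device to each one
for i in 0 1 2 3 4; do
    sudo truncate -s 1600GB /file_disk${i}.img
    sudo losetup -f /file_disk${i}.img
done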
Then you can check that they exist with:
lsblk
or
cat /proc/partitions
The loop devices will appear under /dev/ now.
For some tests I did this in a VirtualBox Virtual Machine.
In coding theory, an erasure code is a forward error correction (FEC) code under the assumption of bit erasures (rather than bit errors), which transforms a message of k symbols into a longer message (code word) with n symbols such that the original message can be recovered from a subset of the n symbols. The fraction r = k/n is called the code rate. The fraction k’/k, where k’ denotes the number of symbols required for recovery, is called reception efficiency.
So RAID systems applied to drives are Erasure Codes too.
But I want to talk about Erasure Codes for the needs of organizations like Instagram, that need to store huge amounts of files and cannot afford to lose the data simply because several drives, or a whole Server, fail.
So what is the way to make sure of this if you have thousands of Servers?
Many Startups that need to host files cannot afford to have every file duplicated or triplicated on other systems.
So how do you do this in a cheap and efficient way?
Here is where Erasure Coding comes to play.
Erasure Coding works as simply as:
Take a given file, for example, 1 video of 10 MB
We apply the Erasure Coding to encode the file
We select, for example, to generate 3 additional chunks
So our original 10MB file will be split into 13 blocks (13 new files), each block will have approx. 1MB
We can rebuild the original file by combining any 10 of those 13 files
That means that we can afford to lose 3 blocks (1MB files) and we will still be able to reconstruct the original file.
Examples:
Ok, so now imagine we have 13 identical Servers, and we encode all our files, using Erasure Coding. Imagine that we store each block in a different Server. That means that we can lose 3 Servers and still have all our information intact.
Imagine we have 100 Servers, and we split all those files across the Servers that have more free space available. We could lose 3 Servers and still not have lost any information. If we are really lucky (or the SDS – Software Defined Storage is very clever) we could lose more than 3 Servers.
Now imagine we have 100 Racks full of Servers. Our SDS selects the Rack that has more free space and places one of the blocks in there, and the same for the other 12 blocks. We could afford to lose 3 racks without losing any Data. That’s more manageable for Google or Yahoo than managing at Server Level.
We can use Erasure Coding with different configs like 8+3, or 10+4… The example I chose, 10+3, is easy to understand, as we clearly see that it will occupy only 30% of additional space.
Those blocks can conveniently be stored in different Servers, across different regions too, for example, using a config of 9+3 you can have 4 different Cloud Providers in different geographic regions, and each holding 25% of the required files, so 3 files each. Then, you only require 3 Cloud providers to rebuild the original file (you only precise 9 surviving blocks, not all 12). Possibilities are infinite.
When one Rack is down, you can rebalance all the blocks that were there to another rack.
Also you can have different Servers, with different capacity… Your SDS should be clever enough to accommodate the blocks for protection and space efficiency, and to checksum them to ensure there is no corruption in the block as it was stored or transported over the network. Your SDS Software should be clever enough to be able to add new nodes and Racks, and to remove nodes, to Rebalance, to checksum the blocks in the Servers… and to store the information effectively on the local Servers (not too many files per folder…), to use Commodity Hardware with low memory, or even VMs… If your System is good enough it will even put the Servers that are not in use (typically the Servers that are full) to sleep, to save energy, until they are required.
Also, when in need to recover a file, the clever SDS Software, using multithreading, will ask the 9 locations at the same time, in parallel, so using all the available bandwidth, in order to fetch the blocks and rebuild the original file really quickly. This can also be implemented with no single point of failure, with all the nodes being able to be the head node.
That’s exactly what my Erasure Coding solution did.
I invented a lot of technologies to scale out since I created my messenger in 1996.
You can do it yourself, or use existing Erasure Coding solutions. The best known is OpenStack Swift, although in my opinion it is a pain to configure and to maintain.
This is a story that happened to me some time ago, and these are the commands I used to troubleshoot it. The purpose is to share knowledge in an interactive way. There are some hidden gems that you'll acquire if you have the patience to go over all the document and read it all…
I had qualified Intel Xeon single processor platform to run my DRAID (ZFS Declustered RAID) project for my employer.
The platforms I qualified were:
1) single processor for Cold Storage (SAS Spinning drives): 4U60, newest models 4602
2) for multiprocessor: the 4U90 (90 Spinning drives) and Flash: All-Flash-Arrays.
The amounts of RAM I was using for my tests ranged from 64GB to 384GB.
Somebody in the company, at executive level, assembled an experimental config that was totally new for us and wanted to try it on their own. It was the 4602 with multiprocessor and 32GB of RAM.
When they were unable to make it work at the expected speed, they required me to troubleshoot it and make it work.
The 4602 single processor had two IOC (Input Output Controller, LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02) ), while the 4602 double processor had four IOC, so given that each of those IOC can perform at peaks of 6GB/s, with a maximum total of 24 GB/s, the performance when reading/writing from all the drives should be better.
But this Server was returning double the time for Rebuilding, with respect to the single processor version, which didn't make any sense.
I had to check everything. These are the commands I ran:
Check the upgrade of the CPU:
htop
lscpu
Changing the Zoning.
Those Servers use dual ported SAS drives, which means that two different computers can be connected to the same drive and operate at the same time. It is up to you to make sure you don't introduce corruption. Those systems are used mainly for HA (High Availability).
Those Systems can be configured in different zoning modes. That's the way each of the two servers (Controllers) sees the disks. In one zoning each Controller sees only 30 drives; in another, each IOC sees all the drives (for redundancy, but performance constrained to the speed of 1 IOC).
The config I set is that each IOC will see 15 drives, so each one of the 4 IOCs will have 6GB/s for 15 drives. Given that these spinning drives perform at 265MB/s in the outermost cylinders, that means that at maximum speed one IOC will be using 3.97 GB/s, let's say 4GB/s. Plenty of bandwidth.
Note: Spinning drives have different performance depending on whether you are reading the outer or the inner cylinders. In the innermost part the speed goes below 145 MB/s, and if you read the whole drive sequentially with dd it will return an average speed of 145 MB/s.
With this command you can see live how it performs and the average read speed in real time. Use skip to jump to that position (relative to bs) in the drive, so you can test directly the speed at the innermost cylinders of the drive.
dd if=/dev/sda of=/dev/null bs=1M status=progress
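For example, to sample the innermost area of one of these 14TB drives you can skip most of the device first; the skip and count values here are rough assumptions (a 14TB drive is roughly 13,350,000 MiB):
# Read 10GB near the end of the drive to measure the innermost speed
dd if=/dev/sda of=/dev/null bs=1M skip=13000000 count=10240 status=progress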
I saw that the zoning was not the right one, so I set it correctly.
The sleeps after rebooting the expanders are recommended, and rebooting the Operating System too, to avoid problems with some Software, as the expanders changed live.
If you have ZFS pools or workloads stop them and export the pool before messing with the expanders.
In order to check which drives are connected to each IOC:
I do this for all the drives at the same time with iostat:
iostat -y 1 1
I check the status of the memory with:
slabtop
free
htop
I checked the memory and htop during a Rebuild. Memory was more than enough. However, CPU usage was higher than expected.
The red bars in the image correspond to kernel processes, in this case the DRAID Rebuild. I see that the load is higher than the usual one with a single processor.
I capture all the parameters from ZFS with:
zfs get all
All this information is logged into my forensics document, so later it can be checked by my Team, or I can share it with other Architects or other members of the company. I started this methodology after I learned how Google does their SRE forensics / postmortem documents. Also, it is useful for me in the future to have a log of the commands I executed and a verbose output of the results.
I install the smp_utils package:
yum install smp_utils
Check things:
ls -al /dev/bsg/
total 0
drwxr-xr-x. 2 root root 3020 May 22 10:16 .
drwxr-xr-x. 20 root root 8680 May 22 10:16 ..
crw-------. 1 root root 248, 76 May 22 10:00 1:0:0:0
crw-------. 1 root root 248, 126 May 22 10:00 10:0:0:0
crw-------. 1 root root 248, 127 May 22 10:00 10:0:1:0
crw-------. 1 root root 248, 136 May 22 10:00 10:0:10:0
crw-------. 1 root root 248, 137 May 22 10:00 10:0:11:0
crw-------. 1 root root 248, 138 May 22 10:00 10:0:12:0
crw-------. 1 root root 248, 139 May 22 10:00 10:0:13:0
[...]
There are some errors, and I check with the Hardware Team, which passes a battery of tests on the machine and says that the machine passes them. They tell me that if the errors counted were in the order of millions then it would be a problem, but having a few of them is usual.
My colleagues previously reported that the memory was performing well, and the CPU too. They told me that the speed was exactly double with respect to a platform with one single CPU of the same kind.
Even if they told me that, I ran cmips tests to make sure.
git clone https://github.com/cmips/cmips_bin
It scored 16,000. The performance was Ok in general terms, but the problem is that I didn't have a baseline for that processor in a single processor configuration, so I could not make sure that the memory bandwidth was Ok. The performance was less than an Amazon c3.8xlarge. The system I was testing was a two processor system, but each CPU is cheap, around USD $400.
Still, my gut feeling was telling me that this double processor server should score more.
lscpu
[root@DRAID-1135-14TB-2CPU ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 2299.951
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4199.73
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts spec_ctrl intel_stibp
I check the memory configuration with:
dmidecode -t memory
I examined the results, and I see that the processor can only operate the DDR4 ECC 2400 Memory at 2133 and… I see something! This Controller was before a single processor with 2 Memory Sticks of 16GB each, dual rank.
I see that now I have the same number of sticks in that machine, but I have two CPUs! So 2 Memory sticks in total, for 2 CPUs.
That’s no good. The memory must be in pairs and in the right slots to get the maximum performance.
1 memory module for 1 CPU doesn't allow Dual Channel and probably is affecting the performance. Many Servers will not even boot if you add an odd number of memory sticks per CPU.
And many Servers can operate at full speed only if all the banks are filled.
I requested the Engineers in Silicon Valley to add 4 modules in the right slots. They did, and I repeated the tests, and the performance then doubled.
After some days I had some time with the machine, I repeated the test and I got a CMIPS Score of around 20,000.
The multiprocessor world is far more complicated than single processor. Sometimes things do not work as expected, and it is not evident; for example, the cache pipeline can act differently for a program working on a multiprocessor versus a single processor. Or the QPI could be saturated.
After this I shared my forensics document with as many Engineers as I could, so they could learn how I troubleshot the problem and what the origin of it was, and I asked them to do the same so we can track their steps and progress if something needs to be troubleshot.
After proper intensive testing the Server was qualified. The lesson here is that changes cannot be committed quickly; they need their time.