Category Archives: NFS

Adding a RAMDISK as SLOG ZIL to ZFS

If you use ZFS with spinning drives and you share iSCSI, you will need to use a SLOG device for ZIL otherwise you’ll see your iSCSI connections interrupted.

What is a ZIL?

  • ZIL: Acronym for ZFS Intended Log. Logs synchronous operations to disk
  • SLOG: Acronym for (S)eperate (LOG) Device

In ZFS Data is first written and stored in-memory, then it’s flushed to drives. This can take 10 seconds normally, a bit more in certain occasions.

So without SLOG it can happen that if a power loss occurs, you may loss the last 10 seconds of Data submitted.

The SLOG device brings security that if there is a power loss, after remounting the pool, the information in the SLOG, acknowledged to iSCSI clients, is not lost and flushed to the Hard drives conforming the pool. Basically this device keeps the writings that come from network and flushes to the Hard drives and then clears this data from the SLOG.

The SLOG also allows ZFS to sort how the transactions will be written, to do in a more efficient way.

Normally I’m describing configurations with a fast device for SLOG ZIL, like one or a pair of NVMe drive or SAS SSD, most commonly in mirror a pool of 12 HDD drives or more SAS preferentially, maybe SATA, with 14TB or more each.

As the SLOG device will persist your Data if there is a power off, and submit to the pool the accepted transactions, it is clear that you cannot spare yourself from having a SLOG ZIL device (or better a mirror). It is needed to bring security when remotely writing.

But what happens if we have a kind of business where we don’t care about that the last 10 seconds writings may be lost? (ZFS will never get corrupted due to its kinda journal system), just because we are filling a Server the fastest possible, migrating from another, or because we are running workouts that can be retaken is some data is lost… do we really need to have the speed constrain of an SSD?. Examples are a Hadoop node, or a SETI@Home client. Tasks will be resumed if something failed.

Or maybe you fill your servers with sync=always, so writing it’s safe, and then you use them only for read, or for a Statics Internet Caches (CDNs like Akamai, Cloudfare…) or you use it for storing Backups, write once read many. You don’t really need the constraint speed of a ZIL running at 800 MB/s.

Let me put in another way, we have 2 NIC 100Gbps, in bonding, so 200Gbps (equivalent to (25GB/s Gigabytes per second), 90 HDD drives that can work in parallel up to 250 MB/s each (22.5GB/s) and our Server has a pair or SAS SSD ZIL in mirror, that writes at 900 MB/s (Megabytes per second, so 0.9 GB/s), so our bottleneck or constraint is the SLOG ZIL.

Adding one RAMDISK, or better two RAMDISKs in mirror, we can get to much more highers speeds. I cannot tell you how much, but in my tests with regular configurations (8D+3P) I was achieving more than 2 GB (Gigabytes) per second sustained of Data to the pool. Take in count that the speed writing to the pool does not only depend on the speed on the ZIL, and the speed of the HDD spinning drives (slow, between 100 and 250 MB/s), but also about the config of the pool (number of vdevs, distributions of data and parity drives) and the throughput of your IOC (Input Output Controller), and the number of them.

Live real scenarios use to be more in the line of having 2x10GbpE cards, combined in bonding making 20Gbps, so being able to transmit 2.5GB/s. So to get the max speed of our Network this Ramdrive will do it. Also NVMe devices used as ZIL will do it.

The problem with the NVMe is that they are connected to the PCI Express bus, and so they are not hot swap. If one dies, you cannot replace without stopping the Server.

The problem with the SSD is that they are not made for writing, they will die, so you need at least a mirror and for heavy IO I strongly recommend you to go with Enterprise grade SAS SSD drives. Those are made to last.

SSD Enterprise grade are double price versus one common SSD, but that peace of mind and extra lasting is worth it. And you don’t need a very big device, only has to hold 10 seconds of Data at max speed. So if you can ingest Data through the Network at 20 Gbps (2.5GB/s) you only need approximately 25 GB of space of the SLOG. 50 GB if you want to be more than safe.

Also you can use partitions instead of complete devices for the SLOG (like for the ZFS pool, where you can add complete drives, or partitions).

If you write locally, and you have 4 IOC’s capable of delivering 8 GB/s each, and you write to a Dataset to the pool, and not to a ZVOL which are slow by nature, you can get astonishing combined speed writing to the drives. If you are migrating a Server to another new, where you can resume if power goes down, then it’s safe to disable sync (set async) while this process runs, and turn sync on when going live to production. If you use async you don’t need to use a SLOG.

4 IOC’s able to deliver 8 GB/s are enough to provide sustained speed to 90 HDD SAS drives. 90x200MB/s=18GB/s required at max speed or 90x250MB/s=22.5GB/s.

The HDD drives provide different speeds in the inner and in the outer areas of the drive, so normally those drives up to 8TB perform between 100 and 200 MB/s, and the drives from 10TB SAS to 14TB SAS perform between 145 and 250 MB/s. I cannot tell about the 16 TB as I’ve not tested them.

The instructions to set a Ramdrive and to assign to a pool are like this:

#!/usr/bin/env bash
RAM_GB=1
RAM_DRIVE_SIZE_IN_BYTES=$((RAM_GB*1048576))

if [[ $(id -u) -ne 0 ]] ; then
    echo "Please run as root"
    exit 1
fi

modprobe brd rd_nr=1 rd_size=${RAM_DRIVE_SIZE_IN_BYTES} max_part=0

echo "Use it like: zpool add carlespool log ram0"

If you created more than one Ramdisk you can add a mirror for the slog to the pool with:

zpool add carlespool log mirror /dev/ram0 /dev/ram1

You can partition the Ramdrive and add a partition but we want to add the whole ram device.

Obviously you cannot put other things to that Ramdisk (like the Metadata) as you need persistence for that.

In any case, please, avoid JBODs loaded of big HDD drives with low bandwidth micro SATA like 3Gbps per channel to the Server, and RAID. The bandwidth is too low. Your rebuilds will take forever.

With ZFS you’ll resilver (rebuild) only the actual data, not the whole drive.

iostat_bandwitdth.sh – Utility to calculate the bandwidth used by all your drives

This is a shell script I made long time ago and I use it to monitor in real time what’s the total or individual bandwidth and maximum bandwidth achieved, for READ and WRITE, of Hard drives and NMVe devices.

It uses iostat to capture the metrics, and then processes the maximum values, the combined speed of all the drives… has also an interesting feature to let out the booting device. That’s very handy for Rack Servers where you boot from an SSD card or and SD, and you want to monitor the speed of the other (SAS probably) devices.

I used it to monitor the total bandwidth achieved by our 4U60 and 4U90 Servers, the All-Flash-Arrays 2U and the NVMe 1U units in Sanmina and the real throughput of IOC (Input Output Controllers).

I used also to compare what was the real data written to ZFS and mdraid RAID systems, and to disks and the combined speed with different pool configurations, as well as the efficiency of iSCSI and NFS from clients to the Servers.

You can specify how many times the information will be printed, whether you want to keep the max speed of each device per separate, and specify a drive to exclude. Normally it will be the boot drive.

If you want to test performance metrics you should make sure that other programs are not running or using the swap, to prevent bias. You should disable the boot drive if it doesn’t form part of your tests (like in the 4U60 with an SSD boot drive in a card, and 60 hard drive bays SAS or SATA).

You may find useful tools like iotop.

You can find the code here, and in my gitlab repo:

https://gitlab.com/carles.mateo/blog.carlesmateo.com-source-code/-/blob/master/iostat_bandwidth.sh

#!/usr/bin/env bash

AUTHOR="Carles Mateo"
VERSION="1.4"

# Changelog
# 1.4
# Added support for NVMe drives
# 1.3
# Fixed Decimals in KB count that were causing errors
# 1.2
# Added new parameter to output per drive stats
# Counting is performed in KB

# Leave boot device empty if you want to add its activity to the results
# Specially thinking about booting SD card or SSD devices versus SAS drives bandwidth calculation.
# Otherwise use i.e.: s_BOOT_DEVICE="sdcv"
s_BOOT_DEVICE=""
# If this value is positive the loop will be kept n times
# If is negative ie: -1 it will loop forever
i_LOOP_TIMES=-1
# Display all drives separatedly
i_ALL_SEPARATEDLY=0
# Display in KB or MB
s_DISPLAY_UNIT="M"

# Init variables
i_READ_MAX=0
i_WRITE_MAX=0
s_READ_MAX_DATE=""
s_WRITE_MAX_DATE=""
i_IOSTAT_READ_KB=0
i_IOSTAT_WRITE_KB=0

# Internal variables
i_NUMBER_OF_DRIVES=0
s_LIST_OF_DRIVES=""
i_UNKNOWN_OPTION=0

# So if you run in screen you see colors :)
export TERM=xterm

# ANSI colors
s_COLOR_RED='\033[0;31m'
s_COLOR_BLUE='\033[0;34m'
s_COLOR_NONE='\033[0m'

for i in "$@"
do
    case $i in
        -b=*|--boot_device=*)
        s_BOOT_DEVICE="${i#*=}"
        shift # past argument=value
        ;;
        -l=*|--loop_times=*)
        i_LOOP_TIMES="${i#*=}"
        shift # past argument=value
        ;;
        -a=*|--all_separatedly=*)
        i_ALL_SEPARATEDLY="${i#*=}"
        shift # past argument=value
        ;;
        *)
        # unknown option
        i_UNKNOWN_OPTION=1
        ;;
    esac
done

if [[ "${i_UNKNOWN_OPTION}" -eq 1 ]]; then
    echo -e "${s_COLOR_RED}Unknown option${s_COLOR_NONE}"
    echo "Use: [-b|--boot_device=sda -l|--loop_times=-1 -a|--all-separatedly=1]"
    exit 1
fi

if [ -z "${s_BOOT_DEVICE}" ]; then
    i_NUMBER_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | wc --lines`
    s_LIST_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | awk '{printf $1" ";}'`
else
    echo -e "${s_COLOR_BLUE}Excluding Boot Device:${s_COLOR_NONE} ${s_BOOT_DEVICE}"
    # Add an space after the name of the device to prevent something like booting with sda leaving out drives like sdaa sdab sdac...
    i_NUMBER_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | grep -v "${s_BOOT_DEVICE} " | wc --lines`
    s_LIST_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | grep -v "${s_BOOT_DEVICE} " | awk '{printf $1" ";}'`
fi

AR_DRIVES=(${s_LIST_OF_DRIVES})
i_COUNTER_LOOP=0
for s_DRIVE in ${AR_DRIVES};
do
    AR_DRIVES_VALUES_AVG[i_COUNTER_LOOP]=0
    AR_DRIVES_VALUES_READ_MAX[i_COUNTER_LOOP]=0
    AR_DRIVES_VALUES_WRITE_MAX[i_COUNTER_LOOP]=0
    i_COUNTER_LOOP=$((i_COUNTER_LOOP+1))
done


echo -e "${s_COLOR_BLUE}Bandwidth for drives:${s_COLOR_NONE} ${i_NUMBER_OF_DRIVES}"
echo -e "${s_COLOR_BLUE}Devices:${s_COLOR_NONE} ${s_LIST_OF_DRIVES}"
echo ""

while [ "${i_LOOP_TIMES}" -lt 0 ] || [ "${i_LOOP_TIMES}" -gt 0 ] ;
do
    s_READ_PRE_COLOR=""
    s_READ_POS_COLOR=""
    s_WRITE_PRE_COLOR=""
    s_WRITE_POS_COLOR=""
    # In MB
    # s_IOSTAT_OUTPUT_ALL_DRIVES=`iostat -d -m -y 1 1 | grep "sd\|nvm"`
    # In KB
    s_IOSTAT_OUTPUT_ALL_DRIVES=`iostat -d -y 1 1 | grep "sd\|nvm"`
    if [ -z "${s_BOOT_DEVICE}" ]; then
        s_IOSTAT_OUTPUT=`printf "${s_IOSTAT_OUTPUT_ALL_DRIVES}" | awk '{sum_read += $3} {sum_write += $4} END {printf sum_read"|"sum_write"\n"}'`
    else
        # Add an space after the name of the device to prevent something like booting with sda leaving out drives like sdaa sdab sdac...
        s_IOSTAT_OUTPUT=`printf "${s_IOSTAT_OUTPUT_ALL_DRIVES}" | grep -v "${s_BOOT_DEVICE} " | awk '{sum_read += $3} {sum_write += $4} END {printf sum_read"|"sum_write"\n"}'`
    fi

    if [ "${i_ALL_SEPARATEDLY}" -eq 1 ]; then
        i_COUNTER_LOOP=0
        for s_DRIVE in ${AR_DRIVES};
        do
            s_IOSTAT_DRIVE=`printf "${s_IOSTAT_OUTPUT_ALL_DRIVES}" | grep $s_DRIVE | head --lines=1 | awk '{sum_read += $3} {sum_write += $4} END {printf sum_read"|"sum_write"\n"}'`
            i_IOSTAT_READ_KB=`printf "%s" "${s_IOSTAT_DRIVE}" | awk -F '|' '{print $1;}'`
            i_IOSTAT_WRITE_KB=`printf "%s" "${s_IOSTAT_DRIVE}" | awk -F '|' '{print $2;}'`
            if [ "${i_IOSTAT_READ_KB%.*}" -gt ${AR_DRIVES_VALUES_READ_MAX[i_COUNTER_LOOP]%.*} ]; then
                AR_DRIVES_VALUES_READ_MAX[i_COUNTER_LOOP]=${i_IOSTAT_READ_KB}
                echo -e "New Max Speed Reading for ${s_COLOR_BLUE}$s_DRIVE${s_COLOR_NONE} at ${s_COLOR_RED}${i_IOSTAT_READ_KB} KB/s${s_COLOR_NONE}"
            echo
            fi
            if [ "${i_IOSTAT_WRITE_KB%.*}" -gt ${AR_DRIVES_VALUES_WRITE_MAX[i_COUNTER_LOOP]%.*} ]; then
                AR_DRIVES_VALUES_WRITE_MAX[i_COUNTER_LOOP]=${i_IOSTAT_WRITE_KB}
                echo -e "New Max Speed Writing for ${s_COLOR_BLUE}$s_DRIVE${s_COLOR_NONE} at ${s_COLOR_RED}${i_IOSTAT_WRITE_KB} KB/s${s_COLOR_NONE}"
            fi

            i_COUNTER_LOOP=$((i_COUNTER_LOOP+1))
        done
    fi

    i_IOSTAT_READ_KB=`printf "%s" "${s_IOSTAT_OUTPUT}" | awk -F '|' '{print $1;}'`
    i_IOSTAT_WRITE_KB=`printf "%s" "${s_IOSTAT_OUTPUT}" | awk -F '|' '{print $2;}'`

    # CAST to Integer
    if [ "${i_IOSTAT_READ_KB%.*}" -gt ${i_READ_MAX%.*} ]; then
        i_READ_MAX=${i_IOSTAT_READ_KB%.*}
        s_READ_PRE_COLOR="${s_COLOR_RED}"
        s_READ_POS_COLOR="${s_COLOR_NONE}"
        s_READ_MAX_DATE=`date`
        i_READ_MAX_MB=$((i_READ_MAX/1024))
    fi
    # CAST to Integer
    if [ "${i_IOSTAT_WRITE_KB%.*}" -gt ${i_WRITE_MAX%.*} ]; then
        i_WRITE_MAX=${i_IOSTAT_WRITE_KB%.*}
        s_WRITE_PRE_COLOR="${s_COLOR_RED}"
        s_WRITE_POS_COLOR="${s_COLOR_NONE}"
        s_WRITE_MAX_DATE=`date`
        i_WRITE_MAX_MB=$((i_WRITE_MAX/1024))
    fi

    if [ "${s_DISPLAY_UNIT}" == "M" ]; then
        # Get MB
        i_IOSTAT_READ_UNIT=${i_IOSTAT_READ_KB%.*}
        i_IOSTAT_WRITE_UNIT=${i_IOSTAT_WRITE_KB%.*}
        i_IOSTAT_READ_UNIT=$((i_IOSTAT_READ_UNIT/1024))
        i_IOSTAT_WRITE_UNIT=$((i_IOSTAT_WRITE_UNIT/1024))
    fi

    # When a MAX is detected it will be displayed in RED
    echo -e "READ  ${s_READ_PRE_COLOR}${i_IOSTAT_READ_UNIT} MB/s ${s_READ_POS_COLOR} (${i_IOSTAT_READ_KB} KB/s) Max: ${i_READ_MAX_MB} MB/s (${i_READ_MAX} KB/s) (${s_READ_MAX_DATE})"
    echo -e "WRITE ${s_WRITE_PRE_COLOR}${i_IOSTAT_WRITE_UNIT} MB/s ${s_WRITE_POS_COLOR} (${i_IOSTAT_WRITE_KB} KB/s) Max: ${i_WRITE_MAX_MB} MB/s (${i_WRITE_MAX} KB/s) (${s_WRITE_MAX_DATE})"
    if [ "$i_LOOP_TIMES" -gt 0 ]; then
        i_LOOP_TIMES=$((i_LOOP_TIMES-1))
    fi
done

Troubleshooting a shell prompt irresponsible that locks/hangs intermittently

You do df -h or ls / and the terminal freezes and not even CTRL + C works, you have a lock.

Normally this is due to a lock of the system trying to perform an IO.

Could be a physical spinning disk failing, but the most probably nowadays is that you have a network mount point and it is timing out.

If you execute mount and you get a timeout, and when you finally see the list you see a NFS, iSCSI or another kind of Network mount (you will see an Ip Address), check for errors.

To do this in CentOS/RHEL you can do as root:

dmesg | grep -i "timed"

or depending on the System

cat /var/log/messages | grep -i "timed"

You’ll get something like this:

[root@compute01 carles]# dmesg -T | grep timed | head -n5
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:45 2020] nfs: server storage07 not responding, timed out

Please note I use dmesg -T in order to have human readable date instead of Unix Epoch.

You can count the errors today:

[root@compute01 carles]# dmesg -T | grep time | grep "Mon Apr 6" | wc --lines
3123