Tag Archives: NVMe

Twitch Stream about ZFS, zpool scrubbing, Hard drives, Data Centers, NVMe, Rack Servers…

Twitch stream on 2022-06-06 10:50 IST

In this very long session we went through actual errors in a ZFS pool, we check the Kernel, we remove and reinsert the drive, conduct zpool scrub… in the meantime I talked about Rack, Rack Servers, PSU, redundant components, ECC RAM…

Adding a RAMDISK as SLOG ZIL to ZFS

If you use ZFS with spinning drives and you share iSCSI, you will need to use a SLOG device for ZIL otherwise you’ll see your iSCSI connections interrupted.

What is a ZIL?

  • ZIL: Acronym for ZFS Intended Log. Logs synchronous operations to disk
  • SLOG: Acronym for (S)eperate (LOG) Device

In ZFS Data is first written and stored in-memory, then it’s flushed to drives. This can take 10 seconds normally, a bit more in certain occasions.

So without SLOG it can happen that if a power loss occurs, you may loss the last 10 seconds of Data submitted.

The SLOG device brings security that if there is a power loss, after remounting the pool, the information in the SLOG, acknowledged to iSCSI clients, is not lost and flushed to the Hard drives conforming the pool. Basically this device keeps the writings that come from network and flushes to the Hard drives and then clears this data from the SLOG.

The SLOG also allows ZFS to sort how the transactions will be written, to do in a more efficient way.

Normally I’m describing configurations with a fast device for SLOG ZIL, like one or a pair of NVMe drive or SAS SSD, most commonly in mirror a pool of 12 HDD drives or more SAS preferentially, maybe SATA, with 14TB or more each.

As the SLOG device will persist your Data if there is a power off, and submit to the pool the accepted transactions, it is clear that you cannot spare yourself from having a SLOG ZIL device (or better a mirror). It is needed to bring security when remotely writing.

But what happens if we have a kind of business where we don’t care about that the last 10 seconds writings may be lost? (ZFS will never get corrupted due to its kinda journal system), just because we are filling a Server the fastest possible, migrating from another, or because we are running workouts that can be retaken is some data is lost… do we really need to have the speed constrain of an SSD?. Examples are a Hadoop node, or a SETI@Home client. Tasks will be resumed if something failed.

Or maybe you fill your servers with sync=always, so writing it’s safe, and then you use them only for read, or for a Statics Internet Caches (CDNs like Akamai, Cloudfare…) or you use it for storing Backups, write once read many. You don’t really need the constraint speed of a ZIL running at 800 MB/s.

Let me put in another way, we have 2 NIC 100Gbps, in bonding, so 200Gbps (equivalent to (25GB/s Gigabytes per second), 90 HDD drives that can work in parallel up to 250 MB/s each (22.5GB/s) and our Server has a pair or SAS SSD ZIL in mirror, that writes at 900 MB/s (Megabytes per second, so 0.9 GB/s), so our bottleneck or constraint is the SLOG ZIL.

Adding one RAMDISK, or better two RAMDISKs in mirror, we can get to much more highers speeds. I cannot tell you how much, but in my tests with regular configurations (8D+3P) I was achieving more than 2 GB (Gigabytes) per second sustained of Data to the pool. Take in count that the speed writing to the pool does not only depend on the speed on the ZIL, and the speed of the HDD spinning drives (slow, between 100 and 250 MB/s), but also about the config of the pool (number of vdevs, distributions of data and parity drives) and the throughput of your IOC (Input Output Controller), and the number of them.

Live real scenarios use to be more in the line of having 2x10GbpE cards, combined in bonding making 20Gbps, so being able to transmit 2.5GB/s. So to get the max speed of our Network this Ramdrive will do it. Also NVMe devices used as ZIL will do it.

The problem with the NVMe is that they are connected to the PCI Express bus, and so they are not hot swap. If one dies, you cannot replace without stopping the Server.

The problem with the SSD is that they are not made for writing, they will die, so you need at least a mirror and for heavy IO I strongly recommend you to go with Enterprise grade SAS SSD drives. Those are made to last.

SSD Enterprise grade are double price versus one common SSD, but that peace of mind and extra lasting is worth it. And you don’t need a very big device, only has to hold 10 seconds of Data at max speed. So if you can ingest Data through the Network at 20 Gbps (2.5GB/s) you only need approximately 25 GB of space of the SLOG. 50 GB if you want to be more than safe.

Also you can use partitions instead of complete devices for the SLOG (like for the ZFS pool, where you can add complete drives, or partitions).

If you write locally, and you have 4 IOC’s capable of delivering 8 GB/s each, and you write to a Dataset to the pool, and not to a ZVOL which are slow by nature, you can get astonishing combined speed writing to the drives. If you are migrating a Server to another new, where you can resume if power goes down, then it’s safe to disable sync (set async) while this process runs, and turn sync on when going live to production. If you use async you don’t need to use a SLOG.

4 IOC’s able to deliver 8 GB/s are enough to provide sustained speed to 90 HDD SAS drives. 90x200MB/s=18GB/s required at max speed or 90x250MB/s=22.5GB/s.

The HDD drives provide different speeds in the inner and in the outer areas of the drive, so normally those drives up to 8TB perform between 100 and 200 MB/s, and the drives from 10TB SAS to 14TB SAS perform between 145 and 250 MB/s. I cannot tell about the 16 TB as I’ve not tested them.

The instructions to set a Ramdrive and to assign to a pool are like this:

#!/usr/bin/env bash
RAM_GB=1
RAM_DRIVE_SIZE_IN_BYTES=$((RAM_GB*1048576))

if [[ $(id -u) -ne 0 ]] ; then
    echo "Please run as root"
    exit 1
fi

modprobe brd rd_nr=1 rd_size=${RAM_DRIVE_SIZE_IN_BYTES} max_part=0

echo "Use it like: zpool add carlespool log ram0"

If you created more than one Ramdisk you can add a mirror for the slog to the pool with:

zpool add carlespool log mirror /dev/ram0 /dev/ram1

You can partition the Ramdrive and add a partition but we want to add the whole ram device.

Obviously you cannot put other things to that Ramdisk (like the Metadata) as you need persistence for that.

In any case, please, avoid JBODs loaded of big HDD drives with low bandwidth micro SATA like 3Gbps per channel to the Server, and RAID. The bandwidth is too low. Your rebuilds will take forever.

With ZFS you’ll resilver (rebuild) only the actual data, not the whole drive.

The Ethernet standards group announces a new 800 GbE specification

Here is the link to the new: https://www.pcgamer.com/amp/the-ethernet-standards-group-developed-a-new-speed-so-fast-it-had-to-change-its-name/

This is a great new for scaling performance in the Data Centers. For routers, switches…

And this makes me think about all the Architects that are using Memcached and Redis in different Servers, in Networks of 1Gbps and makes me want to share with you what a nonsense, is often, that.

So the idea of having Memcache or Redis is just to cache the queries and unload the Database from those queries.

But 1Gbps is equivalent to 125MB (Megabytes) per second.

Local RAM Memory in Servers can perform at 24GB and more (24,000,000 Megabytes) per second, even more.

A PCIE NVMe drive at 3.5GB per second.

A local SSD drive without RAID 550 MB/s.

A SSD in the Cloud, varies a lot on the provider, number of drives, etc… but I’ve seen between 200 MB/s and 2.5GB/s aggregated in RAID.

In fact I have worked with Servers equipped with several IO Controllers, that were delivering 24GB/s of throughput writing or reading to HDD spinning drives.

If you’re in the Cloud. Instead of having 2 Load Balancers, 100 Front Web servers, with a cluster of 5 Redis with huge amount of RAM, and 1 MySQL Master and 1 Slave, all communicating at 1Gbps, probably you’ll get a better performance having the 2 LBs, and 11 Front Web with some more memory and having the Redis instance in the same machine and saving the money of that many small Front and from the 5 huge dedicated Redis.

The same applies if you’re using Docker or K8s.

Even if you just cache the queries to drive, speed will be better than sending everything through 1 Gbps.

This will matter for you if your site is really under heavy load. Most of the sites just query the MySQL Server using 1 Gbps lines, or 2 Gbps in bonding, and that’s enough.

Upgrading my new HP 14-bp060sa

As the company I was working for, Sanmina, has decided to move all the Software Development to Colorado, US, and closing the offices in Bishopstown, Cork, Ireland I found myself with the need to get a new laptop. At work I was using two Dell laptops, one very powerful and heavy equipped with an Intel Xeon processor and 32 GB of RAM. The other a lightweight one that I updated to 32 GB of RAM.

I had an accident around 8 months ago, that got my spine damaged, and so I cannot carry much weight.

My personal laptops at home, in Ireland are a 15″ with 16 GB of RAM, too heavy, and an Acer 11,6″ with 8GB of RAM and SSD (I upgraded it), but unfortunately the screen crashed. I still use it through the HDMI port. My main computer is a tower with a Core i7, 64GB of RAM and a Samsung NVMe SSD drive. And few Raspberrys Pi 4 and 3 :)

I was thinking about what ultra-lightweight laptop to buy, but I wanted to buy it in Barcelona, as I wanted a Catalan keyboard (the layout with the broken ç and accents). I tried by Amazon.es but I have problems to have shipped the Catalan keyboard layout laptops to my address in Ireland.

I was trying to find the best laptop for me.

While I was investigating I found out that none of the laptops in the market were convincing me.

The ones in around 1Kg, which was my initial target, were too big, and lack a proper full size HDMI port and Gigabit Ethernet. Honestly, some models get the HDMI or the Ethernet from an USB 3.1, through an adapter, or have mini-HDMI, many lack the Gigabit port, which is very annoying. Also most of the models come with 8GB of RAM only and were impossible to upgrade. I enrolled my best friend in my quest, in the research, and had the same conclusions.

I don’t want to have to carry adapters with me to just plug to a monitor or projector. I don’t even want to carry the power charger. I want a laptop that can work with me for a complete day, a full work session, without needing to recharge.

So while this investigation was going on, I decided to buy a cheap laptop with a good trade off of weight and cost, in order to be able to work on the coffee. I needed it for writing documents in Google Docs, creating microservices architectures, programming in Java and PHP, and writing articles in my blog. I also decided that this would be my only laptop with Windows, as honestly I missed playing Star Craft 2, and my attempts with Wine and Linux did not success.

Not also, for playing games :) , there are tools that are only available for Windows or for Mac Os X and Windows, like: POSTMAN, Kitematic for managing dockers visually, vSphere…

(Please note, as I reviewed the article I realized that POSTMAN is available for Linux too)

Please note: although I use mainly Linux everywhere (Ubuntu, CentOS, and RedHat mainly) and I contribute to Open Source projects, I do have Windows machines.

I created my Start up in 2004, and I still have Windows Servers, physical machines in a Data Center in Barcelona, and I still have VMs and Instances in Public Clouds with Windows Servers. Also I programmed some tools using Visual Studio and Visual Basic .NET, ASP.NET and C#, but when I needed to do this I found more convenient spawn an instance in Amazon or Azure and pay for its use.

When I created my Start up I offered my infrastructure as a way to get funding too, and I offered VMs with VMWare. I found that having my Mail Servers in VMs was much more convenient for Backups, cloning, to scale up, to avoid disruption and for Disaster and Recovery.

I wanted a cheap laptop that will not make feel bad if transporting it in a daily basis gets a hit and breaks, or that if it rains (and this happens more than often in Ireland) and it breaks is not super-hurtful, or even if it gets stolen. Yes, I’m from a big city, like is Barcelona, Catalonia, and thieves are a real problem. I travel, so I want a laptop decent enough that I can take to travel, and for going for a coffee, coding anything, and I feel comfortable enough that if something happens to it is not the end of the world.

Cork is not a big city, so the options were reduced. I found a laptop that meets my needs.

I got a HP s14-bp060sa for 439€.

It is equipped with a Intel® Core™ i3-6006U (2 GHz, 3 MB cache, 2 cores) , a 500GB SATA HDD, and 4 GB of DDR4 RAM.

The information on HP webpage is really scarce, but checking other pages I was able to see that the motherboard has 2 memory banks, accepting a max of 16GB of RAM.

I saw that there was an slot, unclear if supporting NVMe SSD drives, but supporting M.2 SSD for sure.

So I bought in Amazon 2x8GB and a M.2 500GB drive.

Since I was 5 years old I’ve been upgrading and assembling by myself all the computers. And this is something that I want to keep doing. It keeps me sharp, knowing the new ports, CPUs, and motherboard architectures, and keeps me in contact with the Hardware. All my life I’ve thought that specializing Software Engineers and Systems Engineers, like if computers were something separate, is a mistake, so I push myself to stay up to date of the news in all the fields.

I removed the spinning 500 GB SATA HDD, cause it’s slow and it consumes a lot of energy. With the M.2 SSD the battery last forever.

The interesting part is how I cloned the drive from the Spinning HDD to the new M.2.

I did:

  • Open the computer (see pics below) and Insert the new drive M.2
  • Boot with an USB Linux Rescue distribution (to do that I had to enable Legacy Boot on BIOS and boot with the USB)
  • Use lsblk command to identify the HDD drive, it was easy as it was the one with partitions
  • dd from if=/dev/sda to of=/dev/sdb with status=progress to see live status and speed (around 70MB/s) and estimated time to complete.
  • Please note that the new drive should be bigger or at least have the same number of bytes to avoid problems with the last partition.
  • I removed the HDD drive, this reduces the weight of my laptop by 100 grams
  • Disable Legacy Boot, and boot the computer. Windows started perfectly :)

I found so few information about this model, that I wanted to share the pictures with the Community. Here are the pictures of the upgrade process.

Here you can see the Crucial M.2 SSD installed and the Spinning HDD removed. Yes, I did in a coffee :)
Final step, installing the 2x8GB RAM memory modules