Category Archives: Storage

News of the blog 2021-08-16

  • I completed my ZFS on Ubuntu 20.04 LTS book.
    One of my actual hard drives had an error, so I added a Troubleshooting section explaining how I fixed it.
  • I paused the progress of my book Python: basic exercises for beginners for a while, as my colleague Michela is translating it into Italian. She is a great Engineer and I could not be happier to have her help.
  • I added a new article about how to create a simple web Star Wars game using Flask.
    As always, I use Docker and a Dockerfile to automate the deployment, so you can test it without messing with your local system.
    The code is very simple and easy to understand.
mysql> UPDATE wp_options set option_value='blog.carlesmateo.local' WHERE option_name='siteurl';
Query OK, 1 row affected (0.02 sec)
Rows matched: 1 Changed: 1 Warnings: 0

This way, with an entry in /etc/hosts pointing blog.carlesmateo.local to the machine where the test copy runs, I can do all the tests I want.
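For example, assuming the test copy runs on my own workstation (adjust the IP if it lives somewhere else), the entry can be added like this:

echo "127.0.0.1    blog.carlesmateo.local" | sudo tee -a /etc/hosts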

  • I added a new section to the blog: a link where you can see all the published articles, ordered by number of views.
    /posts_and_views.php

It is on the main page, just after the recommended articles.
Here you can see the source code.

  • I removed the Categories:
    • Storage
      • ZFS
  • In favor of:
    • Hardware
      • Storage
        • ZFS
  • So the articles with Categories in the deleted group were reassigned to the Categories in the second group.
  • Visually:
    • I removed some annoying lines from the Quick Selection access.
      They came from CSS properties inherited from my long-customized WordPress theme, so I created new styles for this section.
    • I adjusted the line-height so the separation between lines is not too big.
  • I added a link, in the section of Other Engineering Blogs that I like, to the great https://github.com/lesterchan site, by the author of many super cool WordPress plugins.

cmemgzip Python tool to compress files in memory when there is no free space on the disk

Rationale

Every Operations Engineer and SRE that works with systems has faced the situation of having a Server with the disk full of logs, needing to keep those logs, and at the same time needing the system to keep running.

This is an uncomfortable situation.

I remember when I was being interviewed at Facebook, in Menlo Park, for an SDM (Software Development Manager) position in SRE, back in 2013-2014. They asked me about a situation where the Server disk was full, and they deleted a big log file from Apache, but the space didn't come back. They told me that nobody had ever been able to solve this.

I explained what happened: Apache still had the fd (file descriptor) open and kept writing to the end of that file, so even though they removed the huge log file with the rm command, the system would not get back any free space. I explained that the easiest solution was to stop Apache. They agreed and asked me how we could do the same without restarting the Webserver, and I said by manipulating the file descriptors under /proc. They told me that I was the first person to solve this.
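
As a minimal sketch of that /proc trick on a modern Linux (the PID 1234 and the fd number 7 are hypothetical examples; lsof and truncate are standard tools):

# List open files that have already been deleted (link count 0)
sudo lsof +L1 | grep apache2
# Or inspect the file descriptors of a concrete process
sudo ls -l /proc/1234/fd | grep deleted
# Truncate the deleted-but-still-open log through its fd, freeing the space
sudo truncate --size=0 /proc/1234/fd/7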

How it works

Basically cmemgzip will read a file, as binary, and will load it completely into Memory.

Then it will compress it, also in Memory. Then it will release the memory used to keep the original, validate write permissions on the folder, check that the compressed file is smaller than the original, delete the original and, using the space now freed on disk, write the compressed and smaller version of the file in gzip format.
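
The tool itself is written in Python, but the idea can be illustrated with a rough shell sketch (the log path is a hypothetical example, and base64 is only there so the binary gzip stream can live in a shell variable; the real cmemgzip also checks permissions and sizes before deleting anything):

# Compress into RAM (a shell variable), delete the original to free disk
# space, then write the compressed copy from RAM.
s_compressed=$(gzip -9 -c /var/log/huge.log | base64 -w0)
rm /var/log/huge.log
printf '%s' "${s_compressed}" | base64 -d > /var/log/huge.log.gz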

Since version 0.3 you can specify the amount of memory to use for the blocks of data read from the file, so you can greatly limit the memory usage and compress files much bigger than the available memory.

If for whatever reason the gz version cannot be written to disk, you'll be asked for another path.

I mentioned before about File Descriptors, and programs that may keep those files open.

So my advice here is: if you have to compress Apache logs or logs from a multi-threaded program, and the disk is full, and several instances may be trying to write to the log file, stop the Apache service if you can, and then run cmemgzip. I want to add in the future the ability to auto-release open fds, but this is delicate and requires a lot of time to make sure it will be reliable in all circumstances and will obey the exact wishes of the SRE performing the operation, without unexpected undesired side effects. It could be implemented with a new parameter, so the SysAdmin will know what they are requesting.
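
For example, for Apache logs the sequence could be as simple as this (the service name and log path are the typical Ubuntu ones, adjust them to your system):

sudo systemctl stop apache2           # httpd on RHEL/CentOS
sudo cmemgzip /var/log/apache2/access.log
sudo systemctl start apache2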

Get the source code

The compressed files it produces can be decompressed later with gzip/gunzip.

So about cmemgzip you can git clone the project from here:

https://gitlab.com/carles.mateo/cmemgzip

git clone https://gitlab.com/carles.mateo/cmemgzip

The README.md is very clear:

https://gitlab.com/carles.mateo/cmemgzip/-/blob/master/README.md

The program is written in Python 3, and I released it under the MIT License, so you can use it, and the Open Source code, with real Freedom.

Do you want to test in other platforms?

This is a version 0.3.

I have only tested it in:

  • Ubuntu 20.04 LTS Linux for x64
  • Ubuntu 20.04 LTS 64 bits under Raspberry Pi 4 (ARM Processors)
  • Windows 10 Professional x64
  • Mac OS X
  • CentOS

It should work in all the platforms supporting Python, but if you want to contribute testing for other platforms, like Windows 32 bit, Solaris or BSD, let me know.

Alternative solutions

You can create a ramdisk and compress the file to there. Then delete the original, move the compressed file from the ramdisk to the hard drive, and unload the ramdisk Kernel Module. However, we very often find problems with this in Docker containers or in instances that don't have the Kernel module installed. It is much easier to run cmemgzip.

Another strategy for the future is to have a folder on ZFS with compression enabled. Again, ZFS has to be installed on the system, and this doesn't happen with Docker containers.

cmemgzip is designed to work when there is no free space; if there is free space, you should use the gzip command.

In a real emergency, when you don't have enough RAM, nor disk space, nor the possibility to send the log files to another server to be compressed there, you could stop using the swap, change the swap partition with fdisk to a Linux ext4 type, format it, mount it, and use that space to compress the files. After moving the compressed files back to the original folder, change the old swap partition type back to Swap with fdisk and enable the swap again (swapon).
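
As a sketch, assuming /dev/sda2 is the swap partition (double-check with swapon --show before touching anything; strictly speaking, changing the partition type with fdisk is optional for this to work):

sudo swapon --show                    # identify the swap partition first
sudo swapoff /dev/sda2
sudo mkfs.ext4 /dev/sda2              # destroys the swap contents, which is fine
sudo mount /dev/sda2 /mnt
# ... use /mnt as scratch space to compress and move the log files ...
sudo umount /mnt
sudo mkswap /dev/sda2
sudo swapon /dev/sda2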

Memory requirements

As you can imagine, the weak point of cmemgzip is that, if the file is completely loaded into memory and then compressed, the free memory required on the Server/Instance/VM is at least the size of the file plus the size of the compressed file. You guessed right. That's true.

If there is not enough memory for loading the file in memory, the program is interrupted gracefully.

I decided to keep it simple, but this can be an option for the future.

So if your VM has 2GB of Available Memory, you will be able to use cmemgzip on uncompressed log files of around 1.7GB.

In version 0.3 I implemented the ability to load chunks of the original file and compress them into memory, so I am able to use less memory. But then the compression is less efficient, and initial tests point to having to keep a separate file for each compressed chunk. So I would need to create an uncompress tool as well, while right now the output is completely compatible with gzip/gunzip, zcat, the file extractor from Ubuntu, etc…

For a big Server with a logfile of 40TB, around 300GB of RAM should be sufficient (the Servers I use have 768 GB of RAM normally).

Honestly, nowadays we find ourselves more frequently with VMs or Instances in the Cloud with small drives (10 to 15GB) and enough Available RAM, rather than Servers with huge mount points. This kind of instance, which means scaling horizontally, makes it more difficult to have NFS Servers where we can move those logs, for security.

So cmemgzip covers some specific cases very well, while it is not useful for all scenarios.

I think it’s safe to say it covers 95% of the scenarios I’ve found in the past 7 years.

cmemgzip will not help you if you run out of inodes.
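
A quick way to tell whether you are out of space or out of inodes (using /var/log as an example mount point):

df -h /var/log     # free space
df -i /var/log     # free inodes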

Usage

Usage is very simple, and I kept the output very verbose, as the nature of the work is Operations: Engineers need to know what is going on.

I return error level/exit code 0 if everything goes well or 1 on errors.

./cmemgzip.py /home/carles/test_extract/SherlockHolmes.txt
 
 cmemgzip.py v.0.1

 Verifying access to: /home/carles/test_extract/SherlockHolmes.txt
 Size of file: /home/carles/test_extract/SherlockHolmes.txt is 553KB (567,291 bytes)
 Reading file: /home/carles/test_extract/SherlockHolmes.txt (567,291 bytes) to memory.
 567,291 bytes loaded.
 Compressing to Memory with maximum compression level…
 Size compressed: 204KB (209,733 bytes). 36.97% of the original file
 Attempting to create the gzip file empty to ensure write permissions
 Deleting the original file to get free space
 Writing compressed file /home/carles/test_extract/SherlockHolmes.txt.gz
 Verifying space written match size of compressed file in Memory
 Write verification completed.

You can also simulate, without actually deleting or writing to disk, just in order to know what the resulting compressed size would be.

Installation

There are no third party libraries to install. I only use the standard ones: os, sys, gzip

So clone it with git in your preferred folder and just create a symbolic link with your favorite name:

sudo ln --symbolic /home/carles/code/cmemgzip/cmemgzip.py /usr/bin/cmemgzip

I like to create the link without the .py extension.

This way you can invoke the program from anywhere by just typing: cmemgzip

Extend existing Single ZFS disk with a mirror without losing the Data on the existing HDD

This is an answer I gave to a question on askubuntu.

https://askubuntu.com/questions/1301828/extend-existing-single-disk-zfs-with-a-mirror-without-formating-the-existing-hdd/

Question:

I have one HDD formatted as a single-disk ZFS system on my server. It looks like the following:

Now I want to convert this to a zfs mirror without formatting the original disk. Any ideas?

Result should be something like:

hdd0
   mirror0
       ata-........................
       ata-........................

Answer:

I reproduced your case in a VM and paste it here step by step. :)

Note: First of all, please do a backup of your data. I added an empty new disk, so ZFS had no doubt which was the master drive. Although you should have no problem, as the first drive already forms part of the pool, a backup is recommended.

Quick answer: You need the zpool attach command.

Basically:

sudo zpool attach hdd0 existinghdd blankhdd

After, do:

zpool status

And you will see that a mirror has been created. Your data on the already existing drive will be kept, and will be replicated to the new one (Resilvered).

As ZFS only copies the actual information, this process will take more or less time depending on the amount of Data.

In my VM, 300 MB were replicated in 3 seconds, while in my experience with SAS and SATA drives I was Resilvering 10 TB in less than 24 hours (for that I was using SAS drives from 10TB to 14TB).

Now the long answer with everything I did in my Virtual Box VM:

lsblk --scsi

Identify the two empty drives by:

ls /dev/disk/by-id/

Select one of them and create a pool like yours:

sudo zpool create hdd0 id_of_mydrive

See that pool /hdd0 has been created and mounted on root.

sudo zpool status
sudo zpool list
sudo ls -al /hdd0

Fill it with some random data (or better, copy files there) to generate a drive with data like yours. I generated it from random:

sudo dd if=/dev/urandom of=/hdd0/file.000 bs=1M count=100 status=progress
sudo dd if=/dev/urandom of=/hdd0/file.001 bs=1M count=100 status=progress
sudo dd if=/dev/urandom of=/hdd0/file.002 bs=1M count=100 status=progress

Then I got the checksums and saved them to verify later.

sudo su
# Please note I continue as root
sha512sum file.000 > file.000.sha512
sha512sum file.001 > file.001.sha512
sha512sum file.002 > file.002.sha512

zpool list shows nearly 100GB of space.

zpool attach hdd0 id_of_mydrive id_of_the_drive_to_add

zpool status will show:

pool: hdd0
state: ONLINE
scan: resilvered 301M in 0 days 00:00:03 with 0 errors…

   NAME                            STATE   READ WRITE CKSUM   
   hdd0
     mirror-0
       ata-VBOX_HARDDISK_VBa8...   ONLINE     0     0     0
       ata-VBOX_HARDDISK_VB8c...   ONLINE     0     0     0

errors: No known data errors

I verified the checksums.

zpool list will also show around 99GB of space available, as the two 100GB drives are being used in mirror.

So, as kaulex mentioned, the format is: zpool attach <pool> <existing device> <new device>

Where device is your previous vdev with data (the single hard drive with Data in the ZFS pool named ‘hdd0’).

As I did, you want to use the Id of the device and not the name, so you will use the identifier in /dev/disk/by-id/ and not sdb, sdc… (Please note, adding /dev/ is not necessary). The reason not to use device names like sdb, sdc, sdea, etc. is that those names may change while the system is live or between reboots. The id never changes. In real systems, not Virtual Box, they may start with wwn or ata.

News from the blog 2020-12-16

  • Happy Christmas!

This is the 3D tree that I bought, which is programmable in Python :)

  • If you’re into ZFS I recommend this video:

https://klarasystems.com/learning/webinars/best-practices-for-optimizing-zfs1/

It is a video from Klara Systems about best practices for ZFS.

  • Amazing Apache Kafka resources can be found here:

https://developer.confluent.io/

https://developer.confluent.io/learn-kafka/

  • I decided to lower the price of my book to the LeanPub minimum, $5 USD, while covid is going on, in order to help people with their lives.
    https://leanpub.com/u/carlesmateo

I read with surprise that Comcast is capping Internet use at 1.2TB per month, and that they will be charging for the excess.

So… if I contract a Backup with Carbonite or BackBlaze or DropBox or another company and I back up my 10TB of files, Comcast will ruin me charging for excesses…
Or if I work from home, or the family watches a lot of Netflix…
I can only think of their Cast Strategy of CastNumberOfClientsToBankruptcy.

A joke to indicate that I think they will lose clients.

Imagine: yesterday I downloaded two Ubuntu images, 5 GB; installed Call of Duty on one computer, 180 GB; installed a few Xbox games, 400 GB; listened to Spotify, 10 GB; watched YouTube, 3 GB; watched Netflix, 4 GB; that makes 602 GB in one day.

Not counting the bandwidth WFH (Working from Home).

Not counting Windows Updates, TV updates, consoles updates, Android Updates, Ubuntu updates…

And this is done in the middle of the covid-19 pandemic, with so many people locked down at home, playing video games, watching movies, and desperately needing distractions.

<irony>Well done Comcast!</irony>

News from the Blog 2020-11-11

  • The latest drive enclosures have been a bit of a fiasco.

The small one (6 drives) fits perfectly in one ATX Bay; however, the SAS SSDs are too tall to fit.

I fit 1 SATA3 SSD 1TB and 4 SATA3 HDD 2TB.

The other one, the 3 Bay 5.25″ SAS/SATA enclosure for 12 drives, did not fit in the Corsair Obsidian Series 750D case, and I had to install it outside, doing a DIY, as I explain in my book about assembling, fixing and upgrading your own PCs and laptops.

However, the 12 Gbps SAS SSDs were returning Checksum errors in ZFS when I copied information or ran a scrub. I'm afraid the enclosure can only provide 6 Gbps at most, or has a poor connection. Cables or expanders tend to be the reason. I ordered new cables to make a direct connection to the HBA Controller, without the enclosure, to validate my theory, and the drives stopped showing errors.

There is something good in everything bad: I have been able to document and explain in my book how to troubleshoot actual errors in ZFS, and to talk about the problems with the cables, and the advantages of using a SAS controller even if you use SATA drives.

  • I got my first Excellent in an Assignment at an Irish university, which makes me especially happy. And I keep studying in Linux Academy; the last course I did was GCP and Terraform. Even if I knew both, it helps me to keep my skills sharp.
  • I share with you some offers and charity bundles that I enrolled in and enjoyed a lot:

There is an offer with Microsoft Pass whereby we can use Disney+ for free for 30 days.

  1. I started watching The Mandalorian, Season 2, and it is wonderfully displayed in 4K.
    The quality of the video surprised me. Not that much content on Netflix is 4K, and I really enjoyed the great quality of the image.
  2. Humble Bundle offers a pack of 8 VR games for €13.45.
    If you like Virtual Reality and have your headset, this pack is amazing, and the benefits go to charity: Movember. The games are downloaded from Steam and the pack will last for 14 days.
  3. Humble Bundle offers a pack of Java and Go books for €12.65, with a minimum of €0.84 for 3 books. Benefits go to charity: Code for America.

Don’t forget to balance how much of your contribution goes to every player.

Unfortunately, by default most of the money goes to O'Reilly and the Humble Tip, and little to the Charity cause. You can change that on the web when you go to the payment.

News from the blog 2020-11-03

Nice articles recommended

This article talks about how at Riot Games they use Slack. Slack is really a powerful tool, and also makes the communication more human in companies with their approach and the funny icons and /giphy. I’m very serious when it comes to work but I recognize the friendly, warm, human and lovely touch these kind of animated icons bring to the conversations.

Remember that the lifespan of SSDs is different from that of spinning drives. I recommend keeping your backups on external spinning drives, disconnected most of the time.

Operating at Scale – An Inside Look at Facebook’s Production Engineering Team

CMIPS

I’ve been working on testing performance of more configurations on Azure and GCP.

I'm also looking forward to testing the AMD Ryzen™ 7 3700X, AM4, Zen 2, 8 Core, 16 Thread, 3.6GHz, 4.4GHz Turbo that is arriving this week.

CTOP.py v. 0.7.8 released

I closed ticket #21 (Thanks Jian!), ensuring CTOP.py is compatible with Python 3.5 versions.

Feature requests and bugs are listed using gitlab: https://gitlab.com/carles.mateo/ctop/-/issues

My Python Combat Guide Book

I updated it on Nov-01, as I normally do, bringing more content.

I've been paid the royalties for the past two months and I reinvested everything (and more from my pocket) in Hardware for working with ZFS.

I was offered, by a publisher in the States, the chance to publish Python Combat Guide and others of my books worldwide. I thought about it for a while. It was very good money, translation to multiple languages and platforms, and a lot of marketing and promotion, but I would have lost the rights and the Freedom I have now, like the possibility to offer discount coupons to whoever I want and to update the contents often. So to celebrate my decision, for you, readers of the blog, during September I provide a discounted price of $5 USD for the first 100 sales instead of the $25 USD suggested price. Use the following link:

https://leanpub.com/pythoncombatguide/c/blog-carles-nov2020

ZFS progress

As part of my effort to contribute nice Open Source products to the Community, I have made some investments to keep contributing to:

  • OpenZFS
  • My old tool for managing ZFS and Network shares easily

I'm also writing a new book about managing ZFS for Small Business, where I show how to operate this hardware, its good points and its downsides.

I'm assembling a new PC with ZFS and plenty of Disk Storage, with a mix of:

  • SAS Enterprise grade SSD 2.5″
  • SATA 12Gb Enterprise grade SSD 2.5″
  • SATA SSD 2.5″
  • SATA HDD 2TB 2.5″
  • SATA HDD 2TB 3.5″

I'm a big fan of Intel, but this time I have chosen AMD. Specifically an AMD Ryzen 7 3700X AM4, 8 Cores / 16 Threads, 3.6 GHz to 4.4 GHz with Turbo. The reason I chose this CPU is that it only uses 65W but still has 8 Cores / 16 Threads.

Also, I want to see the performance of this AMD Ryzen with CMIPS, and another important reason is that AMD motherboards support PCIe 4.0. I have bought an NVMe SSD Samsung 980 PRO PCIe 4.0 (x4) able to read at 6,400 MB/s. I will use this AMD box for running VMs as well, basically VirtualBox and Docker.

I’ve been surprised that for 169.99 GBP I can have a very good Asus Motherboard with a 2.5 Gb Ethernet: ASUS ROG STRIX B550-F GAMING, AMD B550, AM4, DDR4, PCIe 4.0, SATA3, Dual M.2, CrossFire, 2.5GbE, USB 3.2 Gen2 A+C, ATX.

In order to have an Asus motherboard with 2.5 Gb Ethernet for Intel, I had to jump to a 254 GBP motherboard, and Intel is still PCIe 3.0. Actually there are PCIe 10Gb NICs at 80 GBP, so at some point I'll upgrade my home network from Gigabit to 10 Gb. That will come slowly, but if the new equipment I assemble has 2.5 Gb, when I upgrade the main switches to 10 Gb, at least I'll be able to communicate at 2.5 Gb without any additional change.

Also, memory at 3200, the speed that the AMD motherboard can provide, is more than affordable.

This new server will have 64 GB of RAM (Corsair DDR4 Vengeance PC4-25600 (3200)), as I plan to run VMs and use Volumes mounted via iSCSI and locally as block devices to improve my Software. I've bought a new UPS to keep it running in case the power goes down. That's something that doesn't happen often in my city in Ireland, honestly, but I never forget that this happens in Barcelona two or three times per year, and that a high-tension spike can burn your motherboard, drives, or electronics like the TV or the fridge. I've also bought a new KVM Switch, an HDMI 4K and USB one, so I don't have to have so many keyboards. My Logitech M720 allows me to use it with 3 computers, but I still want something more operational. The KVM I bought allows me to switch with a button or with a hotkey on the keyboard.

I bought a new Icy Box for handling 6 2.5″ drives in just one bay of the tower, and an 850 Watt Corsair PSU that will be able to power the many drives I want at the same time.

More books coming

I started two new books:

Those can be purchased while I'm still working on them; you get the updates that I publish and can keep in communication with me about doubts or improvements.

Halloween Software Offers

I saw some Halloween offers and I purchased Software licenses for Software I use.

Backup Guard is one of the products I registered:

https://backup-guard.com/

I contribute a lot to Open Source, and many years ago, before Open Source existed, I was creating Freeware Software. But I think that good commercial Software deserves to be supported. Like everything in life, if they are doing good work that is useful to me, why not give them support? It is also a way to make sure they will continue producing amazing Software. And on the other hand, I myself create Software, sometimes commercial Software, and I like to be paid, so I apply the same principle.

News from the blog 2020-10-16

  • I’ve been testing and adding more instances to CMIPS. I’m planning on testing the Azure instance with 120 cores.
  • News: Microsoft gives its employees the option to work remotely permanently

https://www.bbc.com/news/business-54482245

  • One of my colleagues showed me dstat, a very nice tool for system monitoring and for monitoring the bandwidth of a drive. Also ifstat, as a complement to iftop, is very cool for the Network too. This functionality is also available in CTOP.py.
  • As I shared in the past news of the blog, I'm resuming my contributions to the ZFS Community.

A long time ago I created some ZFS tools that I want to share soon as Open Source.

I equipped myself with the proper Hardware to test on SAS and SATA:

  • 12G Internal PCI-E SAS/SATA HBA RAID Controller Card, Broadcom’s SAS 3008, compatible for SAS 9300-8I.
    This is just an HBA (Host Bus Adapter); it doesn't support RAID. It only connects up to 8 drives, or 1024 through expanders, to my computer.
    It has a bandwidth of 9,600 MB/s, which guarantees that I'll be able to add 12 Enterprise grade SAS SSDs at almost the max speed of the drives. Those drives perform at 900 MB/s, so if I'm using all of them at the same time, like if I have a pool of 8 + 3 and I rebuild a broken drive or I just push Data, I would be using 12×900 = 10,800 MB/s. Close. Fair enough.
  • VANDESAIL Mini-SAS Cables, 1m Internal Mini-SAS to 4x SAS SATA Forward Breakout Cable Hard Drive Data Transfer Cable (SAS Cable).
  • SilverStone SST-FS212B – Aluminium Trayless Hot Swap Mobile Rack Backplane / Internal Hard Drive Enclosure for 12x 2.5 Inch SAS/SATA HDD or SSD, fit in any 3x 5.25 Inch Drive Bay, with Fan and Lock, black
  • Terminator is here.
    I ordered this T-800 head a while ago and it finally arrived.

Finally I will have my empty USB keys located and protected. ;)

Remember to be always nice to robots. :)

Adding a RAMDISK as SLOG ZIL to ZFS

If you use ZFS with spinning drives and you share iSCSI, you will need to use a SLOG device for the ZIL, otherwise you'll see your iSCSI connections interrupted.

What is a ZIL?

  • ZIL: Acronym for ZFS Intent Log. It logs synchronous operations to disk.
  • SLOG: Acronym for (S)eparate (LOG) Device.

In ZFS, Data is first written and stored in memory, then it's flushed to the drives. This normally takes around 10 seconds, a bit more on certain occasions.

So without a SLOG it can happen that, if a power loss occurs, you may lose the last 10 seconds of Data submitted.

The SLOG device brings the security that, if there is a power loss, after remounting the pool the information in the SLOG that was acknowledged to the iSCSI clients is not lost, and is flushed to the Hard drives forming the pool. Basically this device keeps the writes that come from the network, flushes them to the Hard drives, and then clears this data from the SLOG.

The SLOG also allows ZFS to sort how the transactions will be written, to do it in a more efficient way.

Normally I describe configurations with a fast device for the SLOG ZIL, like one or a pair of NVMe drives or SAS SSDs, most commonly in mirror, for a pool of 12 or more HDD drives, preferentially SAS, maybe SATA, with 14TB or more each.

As the SLOG device will persist your Data if there is a power off, and submit the accepted transactions to the pool, it is clear that you cannot spare yourself from having a SLOG ZIL device (or better, a mirror). It is needed to bring security when writing remotely.

But what happens if we have a kind of business where we don't care if the last 10 seconds of writes may be lost? (ZFS will never get corrupted thanks to its journal-like design.) For example because we are filling a Server as fast as possible, migrating from another one, or because we are running workloads that can be restarted if some data is lost… do we really need the speed constraint of an SSD? Examples are a Hadoop node, or a SETI@Home client. Tasks will be resumed if something failed.

Or maybe you fill your servers with sync=always, so writing is safe, and then you use them only for reads, or for static Internet Caches (CDNs like Akamai, Cloudflare…), or you use them for storing Backups, write once read many. You don't really need the speed constraint of a ZIL running at 800 MB/s.

Let me put it another way: we have 2 NICs at 100Gbps, in bonding, so 200Gbps (equivalent to 25GB/s, Gigabytes per second), 90 HDD drives that can work in parallel at up to 250 MB/s each (22.5GB/s), and our Server has a pair of SAS SSDs for the ZIL in mirror, which write at 900 MB/s (Megabytes per second, so 0.9 GB/s), so our bottleneck or constraint is the SLOG ZIL.

Adding one RAMDISK, or better two RAMDISKs in mirror, we can get to much higher speeds. I cannot tell you how much, but in my tests with regular configurations (8D+3P) I was achieving more than 2 GB (Gigabytes) per second of sustained Data to the pool. Take into account that the speed of writing to the pool does not only depend on the speed of the ZIL and the speed of the HDD spinning drives (slow, between 100 and 250 MB/s), but also on the config of the pool (number of vdevs, distribution of data and parity drives) and the throughput of your IOCs (Input Output Controllers), and the number of them.

Real live scenarios tend to be more in the line of having 2x10GbE cards, combined in bonding making 20Gbps, so being able to transmit 2.5GB/s. So to reach the max speed of our Network, this Ramdrive will do. NVMe devices used as ZIL will do it too.

The problem with NVMe drives is that they are connected to the PCI Express bus, and so they are not hot-swappable. If one dies, you cannot replace it without stopping the Server.

The problem with SSDs is that they are not made for constant writing; they will die, so you need at least a mirror, and for heavy IO I strongly recommend you go with Enterprise grade SAS SSD drives. Those are made to last.

Enterprise grade SSDs are double the price of a common SSD, but that peace of mind and extra endurance are worth it. And you don't need a very big device; it only has to hold 10 seconds of Data at max speed. So if you can ingest Data through the Network at 20 Gbps (2.5GB/s), you only need approximately 25 GB of space for the SLOG. 50 GB if you want to be more than safe.

Also you can use partitions instead of complete devices for the SLOG (like for the ZFS pool, where you can add complete drives, or partitions).
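
For example, with hypothetical device names, adding a mirrored SLOG on two NVMe partitions would look like this (the pool name mypool and the by-id names are just placeholders):

sudo zpool add mypool log mirror nvme-DRIVE_A-part4 nvme-DRIVE_B-part4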

If you write locally, and you have 4 IOCs capable of delivering 8 GB/s each, and you write to a Dataset in the pool, and not to a ZVOL, which are slow by nature, you can get an astonishing combined speed writing to the drives. If you are migrating a Server to a new one, where you can resume if the power goes down, then it's safe to disable sync (set async) while this process runs, and turn sync on when going live to production. If you use async you don't need a SLOG.
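
The sync behaviour is a per-dataset ZFS property, so, as a sketch with a hypothetical dataset name:

sudo zfs set sync=disabled mypool/migration    # while the bulk copy runs
sudo zfs set sync=standard mypool/migration    # back to the default before going live
# or sync=always if you want every single write to go through the ZIL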

4 IOCs able to deliver 8 GB/s each are enough to provide sustained speed to 90 HDD SAS drives: 90x200MB/s=18GB/s required at max speed, or 90x250MB/s=22.5GB/s.

The HDD drives provide different speeds in the inner and in the outer areas of the drive, so normally those drives up to 8TB perform between 100 and 200 MB/s, and the drives from 10TB SAS to 14TB SAS perform between 145 and 250 MB/s. I cannot tell about the 16 TB as I’ve not tested them.

The instructions to set up a Ramdrive and assign it to a pool are like this:

#!/usr/bin/env bash
RAM_GB=1
# Note: the brd module parameter rd_size expects the size in KiB (1 GB = 1,048,576 KiB)
RAM_DRIVE_SIZE_IN_KB=$((RAM_GB*1048576))

if [[ $(id -u) -ne 0 ]] ; then
    echo "Please run as root"
    exit 1
fi

modprobe brd rd_nr=1 rd_size=${RAM_DRIVE_SIZE_IN_KB} max_part=0

echo "Use it like: zpool add carlespool log ram0"

If you created more than one Ramdisk you can add a mirror for the slog to the pool with:

zpool add carlespool log mirror /dev/ram0 /dev/ram1

You can partition the Ramdrive and add a partition but we want to add the whole ram device.

Obviously you cannot put other things on that Ramdisk (like the Metadata), as you need persistence for those.

In any case, please avoid JBODs loaded with big HDD drives connected to the Server through low-bandwidth links, like 3Gbps SATA per channel, and RAID. The bandwidth is too low. Your rebuilds will take forever.

With ZFS you’ll resilver (rebuild) only the actual data, not the whole drive.

iostat_bandwidth.sh – Utility to calculate the bandwidth used by all your drives

This is a shell script I made a long time ago, and I use it to monitor in real time the total or individual bandwidth and the maximum bandwidth achieved, for READ and WRITE, of Hard drives and NVMe devices.

It uses iostat to capture the metrics, and then processes the maximum values and the combined speed of all the drives… It also has an interesting feature to leave out the boot device. That's very handy for Rack Servers where you boot from an SSD card or an SD, and you want to monitor the speed of the other (probably SAS) devices.

I used it to monitor the total bandwidth achieved by our 4U60 and 4U90 Servers, the 2U All-Flash-Arrays and the 1U NVMe units in Sanmina, and the real throughput of the IOCs (Input Output Controllers).

I also used it to compare the real data written to ZFS and mdraid RAID systems, and to disks, and the combined speed with different pool configurations, as well as the efficiency of iSCSI and NFS from clients to the Servers.

You can specify how many times the information will be printed, whether you want to keep the max speed of each device separately, and a drive to exclude. Normally that will be the boot drive.
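For example, assuming you saved it as iostat_bandwidth.sh, and using the parameter names defined in the script below, this excludes the boot drive sda, prints 60 iterations and keeps per-drive maximums:

./iostat_bandwidth.sh -b=sda -l=60 -a=1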

If you want to test performance metrics you should make sure that other programs are not running or using the swap, to prevent bias. You should exclude the boot drive if it doesn't form part of your tests (like in the 4U60 with an SSD boot drive on a card, and 60 SAS or SATA hard drive bays).

You may find useful tools like iotop.

You can find the code here, and in my gitlab repo:

https://gitlab.com/carles.mateo/blog.carlesmateo.com-source-code/-/blob/master/iostat_bandwidth.sh

#!/usr/bin/env bash

AUTHOR="Carles Mateo"
VERSION="1.4"

# Changelog
# 1.4
# Added support for NVMe drives
# 1.3
# Fixed Decimals in KB count that were causing errors
# 1.2
# Added new parameter to output per drive stats
# Counting is performed in KB

# Leave boot device empty if you want to add its activity to the results
# Specially thinking about booting SD card or SSD devices versus SAS drives bandwidth calculation.
# Otherwise use i.e.: s_BOOT_DEVICE="sdcv"
s_BOOT_DEVICE=""
# If this value is positive the loop will be kept n times
# If is negative ie: -1 it will loop forever
i_LOOP_TIMES=-1
# Display all drives separatedly
i_ALL_SEPARATEDLY=0
# Display in KB or MB
s_DISPLAY_UNIT="M"

# Init variables
i_READ_MAX=0
i_WRITE_MAX=0
s_READ_MAX_DATE=""
s_WRITE_MAX_DATE=""
i_IOSTAT_READ_KB=0
i_IOSTAT_WRITE_KB=0

# Internal variables
i_NUMBER_OF_DRIVES=0
s_LIST_OF_DRIVES=""
i_UNKNOWN_OPTION=0

# So if you run in screen you see colors :)
export TERM=xterm

# ANSI colors
s_COLOR_RED='\033[0;31m'
s_COLOR_BLUE='\033[0;34m'
s_COLOR_NONE='\033[0m'

for i in "$@"
do
    case $i in
        -b=*|--boot_device=*)
        s_BOOT_DEVICE="${i#*=}"
        shift # past argument=value
        ;;
        -l=*|--loop_times=*)
        i_LOOP_TIMES="${i#*=}"
        shift # past argument=value
        ;;
        -a=*|--all_separatedly=*)
        i_ALL_SEPARATEDLY="${i#*=}"
        shift # past argument=value
        ;;
        *)
        # unknown option
        i_UNKNOWN_OPTION=1
        ;;
    esac
done

if [[ "${i_UNKNOWN_OPTION}" -eq 1 ]]; then
    echo -e "${s_COLOR_RED}Unknown option${s_COLOR_NONE}"
    echo "Use: [-b|--boot_device=sda -l|--loop_times=-1 -a|--all-separatedly=1]"
    exit 1
fi

if [ -z "${s_BOOT_DEVICE}" ]; then
    i_NUMBER_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | wc --lines`
    s_LIST_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | awk '{printf $1" ";}'`
else
    echo -e "${s_COLOR_BLUE}Excluding Boot Device:${s_COLOR_NONE} ${s_BOOT_DEVICE}"
    # Add an space after the name of the device to prevent something like booting with sda leaving out drives like sdaa sdab sdac...
    i_NUMBER_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | grep -v "${s_BOOT_DEVICE} " | wc --lines`
    s_LIST_OF_DRIVES=`iostat -d -m | grep "sd\|nvm" | grep -v "${s_BOOT_DEVICE} " | awk '{printf $1" ";}'`
fi

AR_DRIVES=(${s_LIST_OF_DRIVES})
i_COUNTER_LOOP=0
for s_DRIVE in ${AR_DRIVES};
do
    AR_DRIVES_VALUES_AVG[i_COUNTER_LOOP]=0
    AR_DRIVES_VALUES_READ_MAX[i_COUNTER_LOOP]=0
    AR_DRIVES_VALUES_WRITE_MAX[i_COUNTER_LOOP]=0
    i_COUNTER_LOOP=$((i_COUNTER_LOOP+1))
done


echo -e "${s_COLOR_BLUE}Bandwidth for drives:${s_COLOR_NONE} ${i_NUMBER_OF_DRIVES}"
echo -e "${s_COLOR_BLUE}Devices:${s_COLOR_NONE} ${s_LIST_OF_DRIVES}"
echo ""

while [ "${i_LOOP_TIMES}" -lt 0 ] || [ "${i_LOOP_TIMES}" -gt 0 ] ;
do
    s_READ_PRE_COLOR=""
    s_READ_POS_COLOR=""
    s_WRITE_PRE_COLOR=""
    s_WRITE_POS_COLOR=""
    # In MB
    # s_IOSTAT_OUTPUT_ALL_DRIVES=`iostat -d -m -y 1 1 | grep "sd\|nvm"`
    # In KB
    s_IOSTAT_OUTPUT_ALL_DRIVES=`iostat -d -y 1 1 | grep "sd\|nvm"`
    if [ -z "${s_BOOT_DEVICE}" ]; then
        s_IOSTAT_OUTPUT=`printf "${s_IOSTAT_OUTPUT_ALL_DRIVES}" | awk '{sum_read += $3} {sum_write += $4} END {printf sum_read"|"sum_write"\n"}'`
    else
        # Add an space after the name of the device to prevent something like booting with sda leaving out drives like sdaa sdab sdac...
        s_IOSTAT_OUTPUT=`printf "${s_IOSTAT_OUTPUT_ALL_DRIVES}" | grep -v "${s_BOOT_DEVICE} " | awk '{sum_read += $3} {sum_write += $4} END {printf sum_read"|"sum_write"\n"}'`
    fi

    if [ "${i_ALL_SEPARATEDLY}" -eq 1 ]; then
        i_COUNTER_LOOP=0
        for s_DRIVE in ${AR_DRIVES};
        do
            s_IOSTAT_DRIVE=`printf "${s_IOSTAT_OUTPUT_ALL_DRIVES}" | grep $s_DRIVE | head --lines=1 | awk '{sum_read += $3} {sum_write += $4} END {printf sum_read"|"sum_write"\n"}'`
            i_IOSTAT_READ_KB=`printf "%s" "${s_IOSTAT_DRIVE}" | awk -F '|' '{print $1;}'`
            i_IOSTAT_WRITE_KB=`printf "%s" "${s_IOSTAT_DRIVE}" | awk -F '|' '{print $2;}'`
            if [ "${i_IOSTAT_READ_KB%.*}" -gt ${AR_DRIVES_VALUES_READ_MAX[i_COUNTER_LOOP]%.*} ]; then
                AR_DRIVES_VALUES_READ_MAX[i_COUNTER_LOOP]=${i_IOSTAT_READ_KB}
                echo -e "New Max Speed Reading for ${s_COLOR_BLUE}$s_DRIVE${s_COLOR_NONE} at ${s_COLOR_RED}${i_IOSTAT_READ_KB} KB/s${s_COLOR_NONE}"
            echo
            fi
            if [ "${i_IOSTAT_WRITE_KB%.*}" -gt ${AR_DRIVES_VALUES_WRITE_MAX[i_COUNTER_LOOP]%.*} ]; then
                AR_DRIVES_VALUES_WRITE_MAX[i_COUNTER_LOOP]=${i_IOSTAT_WRITE_KB}
                echo -e "New Max Speed Writing for ${s_COLOR_BLUE}$s_DRIVE${s_COLOR_NONE} at ${s_COLOR_RED}${i_IOSTAT_WRITE_KB} KB/s${s_COLOR_NONE}"
            fi

            i_COUNTER_LOOP=$((i_COUNTER_LOOP+1))
        done
    fi

    i_IOSTAT_READ_KB=`printf "%s" "${s_IOSTAT_OUTPUT}" | awk -F '|' '{print $1;}'`
    i_IOSTAT_WRITE_KB=`printf "%s" "${s_IOSTAT_OUTPUT}" | awk -F '|' '{print $2;}'`

    # CAST to Integer
    if [ "${i_IOSTAT_READ_KB%.*}" -gt ${i_READ_MAX%.*} ]; then
        i_READ_MAX=${i_IOSTAT_READ_KB%.*}
        s_READ_PRE_COLOR="${s_COLOR_RED}"
        s_READ_POS_COLOR="${s_COLOR_NONE}"
        s_READ_MAX_DATE=`date`
        i_READ_MAX_MB=$((i_READ_MAX/1024))
    fi
    # CAST to Integer
    if [ "${i_IOSTAT_WRITE_KB%.*}" -gt ${i_WRITE_MAX%.*} ]; then
        i_WRITE_MAX=${i_IOSTAT_WRITE_KB%.*}
        s_WRITE_PRE_COLOR="${s_COLOR_RED}"
        s_WRITE_POS_COLOR="${s_COLOR_NONE}"
        s_WRITE_MAX_DATE=`date`
        i_WRITE_MAX_MB=$((i_WRITE_MAX/1024))
    fi

    if [ "${s_DISPLAY_UNIT}" == "M" ]; then
        # Get MB
        i_IOSTAT_READ_UNIT=${i_IOSTAT_READ_KB%.*}
        i_IOSTAT_WRITE_UNIT=${i_IOSTAT_WRITE_KB%.*}
        i_IOSTAT_READ_UNIT=$((i_IOSTAT_READ_UNIT/1024))
        i_IOSTAT_WRITE_UNIT=$((i_IOSTAT_WRITE_UNIT/1024))
    fi

    # When a MAX is detected it will be displayed in RED
    echo -e "READ  ${s_READ_PRE_COLOR}${i_IOSTAT_READ_UNIT} MB/s ${s_READ_POS_COLOR} (${i_IOSTAT_READ_KB} KB/s) Max: ${i_READ_MAX_MB} MB/s (${i_READ_MAX} KB/s) (${s_READ_MAX_DATE})"
    echo -e "WRITE ${s_WRITE_PRE_COLOR}${i_IOSTAT_WRITE_UNIT} MB/s ${s_WRITE_POS_COLOR} (${i_IOSTAT_WRITE_KB} KB/s) Max: ${i_WRITE_MAX_MB} MB/s (${i_WRITE_MAX} KB/s) (${s_WRITE_MAX_DATE})"
    if [ "$i_LOOP_TIMES" -gt 0 ]; then
        i_LOOP_TIMES=$((i_LOOP_TIMES-1))
    fi
done

A script to backup your partition compressed and a game to learn about dd and pipes

This article is more of an exercise, like a game, so you get to know certain things about Linux and follow my mental process to uncover them. It is nothing mysterious for Senior Engineers, but Junior Sys Admins may enjoy this reading. :)

Ok, so the first thing is that I wrote a script in order to completely back up my NVMe hard drive to a gzipped file, and then I will use this as a motivation to go deep into investigating how it works.

Ok, so the first script would be like this:

#!/bin/bash
SOURCE_DRIVE="/dev/nvme0n1"
TARGET_PATH="/media/carles/Seagate\ Backup\ Plus\ Drive/BCK/"
TARGET_FILE="nvme.img"
sudo bash -c "dd if=${SOURCE_DRIVE} | gzip > ${TARGET_PATH}${TARGET_FILE}.gz"

So basically, we are going to restart the computer, boot with a Linux Live USB Key, mount the Seagate Hard Drive, and run the script.

We are booting with a Live Linux Cd in order to have our partition unmounted and unmodified while we do the backup. This is in order to avoid corruption or data loss as a live Filesystem is getting modifications as we read it.

The problem with this first script is that it will generate a big gzip file.

By big I mean much bigger than 2GB. Not all storage media and filesystems support files bigger than 2GB or 4GB, but even if they do, it's a pain to transfer them over the Network or on USB drives, so we are going to make a slight modification.

#!/bin/bash
SOURCE_DRIVE="/dev/nvme0n1"
TARGET_PATH="/media/carles/Seagate\ Backup\ Plus\ Drive/BCK/"
TARGET_FILE="nvme.img"
sudo bash -c "dd if=${SOURCE_DRIVE} | gzip | split -b 1024MiB - ${TARGET_PATH}${TARGET_FILE}-split.gz_"

Ok, so we will use pipes and split in order to generate many files, each as big as 1GB.

If we ls we will get:

-rwxrwxrwx 1 carles carles 1073741824 May 24 14:57 nvme.img-split.gz_aa
-rwxrwxrwx 1 carles carles 1073741824 May 24 14:58 nvme.img-split.gz_ab
-rwxrwxrwx 1 carles carles 1073741824 May 24 14:59 nvme.img-split.gz_ac

Then one may say: Ok, this is working, but how do I know the progress?

For old versions of dd you can use pv, which stands for Pipe Viewer and allows you to see the transfer rate between processes using pipes.
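
For example, inserting pv in the middle of the pipe of the previous script would look like this (pv reports the throughput of the stream on stderr):

sudo bash -c "dd if=${SOURCE_DRIVE} | pv | gzip | split -b 1024MiB - ${TARGET_PATH}${TARGET_FILE}-split.gz_"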

For more recent versions of dd you can use status=progress.

So the script updated with status=progress is:

#!/bin/bash
SOURCE_DRIVE="/dev/nvme0n1"
TARGET_PATH="/media/carles/Seagate\ Backup\ Plus\ Drive/BCK/"
TARGET_FILE="nvme.img"
sudo bash -c "dd if=${SOURCE_DRIVE} status=progress | gzip | split -b 1024MiB - ${TARGET_PATH}${TARGET_FILE}-split.gz_"

You can also download the code from:

https://gitlab.com/carles.mateo/blog.carlesmateo.com-source-code/-/blob/master/backup_partition_in_files.sh

Then one may ask: wait, if pipes use STDOUT and STDIN and dd is displaying progress on the screen, will our gz file get corrupted?

I like when people question things, and investigate, so let’s answer this question.

If it was a young member of my Team I would ask:

  • Ok, try it. Check the output file to see if it is corrupted.

So they can use zcat or zless to inspect the file and see if it has errors, and to make sure:

gzip -v -t nvme.img.gz
nvme.img.gz:        OK

Ok, so what happened? Because we were seeing output on the screen.

Assuming the young Engineer does not know the answer, I would have told them:

  • Ok, so you know that if dd printed to STDOUT you wouldn't see it, because it would be sent to the pipe, so there is something more you're missing. Let's check the source code of dd to see what status=progress does.

And then look for “progress”.

Soon you'll find things like this everywhere:

  if (progress_time)
    fputc ('\r', stderr);

Ok, pay attention to where the data is written: stderr. So basically the answer is: dd status=progress does not corrupt STDOUT and prints to the screen because it uses STDERR.

Other funny ways to get the progress would be to use:

watch -n10 "ls -alh /BCK/ | grep nvme | wc --lines"

So you would see in real time what the progress was, and finally the 512GB were compressed to around 336GB, in 336 files of 1 GB each (except the last one).

Another funny way would have been sending the USR1 signal to the dd process:
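
For example (GNU dd prints its I/O statistics to stderr when it receives SIGUSR1, and pgrep --exact matches the exact process name):

sudo kill -USR1 $(pgrep --exact dd)

# Or ask for the statistics every 10 seconds while the backup runs
watch -n10 'sudo kill -USR1 $(pgrep --exact dd)'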

Hope you enjoyed this little exercise about the importance of going deep, to the end, to understand what's going on in the system. :)

Instead of gzip you can use bzip2 or pixz. pixz is very handy if you want to just compress a file, as it uses multiple processors in parallel for the tasks.

xz or lrzip are other compressors. lrzip aims to compress very large files, especially source code.