Category Archives: Troubleshoot

Dropping caches in Linux, to check if memory is actually being used

I encountered that Server, Xeon, 128 GB of RAM, with those 58 Spinning drives 10 TB and 2 SSD of 2 TB each, where I was testing the latest version of my Software.

Monitoring long term tests, data validation, checking for memory leaks…
I notice the Server is using 70 GB of RAM. Only 5.5 GB are used for buffers according to the usual tools (top, htop, free, cat /proc/meminfo, ps aux…) and no programs are eating that amount, so where is the RAM?.
The rest of the Servers are working well, including models: same mode, 4U60 with 64 GB of RAM, 4U90 with 128 GB and All-Flash-Array with 256 GB of RAM, only using around 8 GB of RAM even under load.
iSCSI sharings being used, with I/O, iSCSI initiators trying to connect and getting rejected, several requests for second, disk pulling, and that usual stuff. And this is the only unit using so many memory, so what?.
I checked some modules to see memory consumption, but nothing clear.
Ok, after a bit of investigation one member of the Team said “Oh, while you was on holidays we created a Ramdisk and filled it for some validations, we deleted that already but never rebooted the Server”.
Ok. The easy solution would be to reboot, but that would had hidden a memory leak it that was the cause.
No, I had to find the real cause.

I requested assistance of one my colleagues, specialist, Kernel Engineer.
He confirmed that processes were not taking that memory, and ask me to try to drop the cache.

So I did:

echo 3 > /proc/sys/vm/drop_caches

Then the memory usage drop to 11.4 GB and kept like that while I maintain sustained the load.

That’s more normal taking in count that we have 16 Volumes shared and one host is attempting to connect to Volumes that do not exist any more like crazy, Services and Cronjobs run in background and we conduct tests degrading the pool, removing drives, etc..

After tests concluded memory dropped to 2 GB, which is what we use when we’re not under load.

Note: In order to know about the memory being used by Kernel slab cache in real time you can use command:


You can also check:

sudo vmstat -m

Solving an infinite loop in CentOS after inducing a Kernel Panic in a Server

This trick may be useful for you.
Almost surely if you power cycle, completely powering down your Server you’ll fix booting too.
Unfortunately we do not always have access to the Data Center or Remote Hands service available, so this trick may be useful for you.

Just reset your BMC card with this:
ipmitool -H -U admin -P thepassword bmc reset cold

After this use the remote control tool to request a reboot and it will do and power on normally.

This may not work in all the Servers, it depends on a lot of aspects (firmware, bmc manufacturer, etc…) but can do the trick for you maybe.

Create a small partition on the drives for tests

Ok, as you know I work with ZFS, DRAID, Erasure Coding… and Cold Storage.
I work with big disks, SAS, SSD, and NVMe.
Sometimes I need to conduct some tests that involve filling completely to 100% the pool.
That’s very slow having to fill 14TB drives, with Servers with 60, 90 and 104 drives, for obvious reasons. So here is a handy script for partitioning those drives with a small partition, then you use the small partition for creating a pool that will fill faster.

1. Get the list of drives in the system
For example this script can help

DRIVES=`ls -al /dev/disk/by-id/ | grep "sd" | grep -v "part" | grep "wwn" | tr "./" "  " | awk '{ print $11; }'`

If your drives had a previous partition this script will detect them, and will use only the drives with wwn identifier.
Warning: some M.2 booting drives have wwn where others don’t. Use with caution.

2. Identify the boot device and remove from the list
3. Do the loop with for DRIVE in $DRIVES or manually:

for DRIVE in sdar sdcd sdi sdj sdbp sdbd sdy sdab sdbo sdk sdz sdbb sdl sdcq sdbl sdbe sdan sdv sdp sdbf sdao sdm sdg sdbw sdaf sdac sdag sdco sds sdah sdbh sdby sdbn sdcl sdcf sdbz sdbi sdcr sdbj sdd sdcn sdr sdbk sdaq sde sdak sdbx sdbm sday sdbv sdbg sdcg sdce sdca sdax sdam sdaz sdci sdt sdcp sdav sdc sdae sdf sdw sdu sdal sdo sdx sdh sdcj sdch sdaw sdba sdap sdck sdn sdas sdai sdaa sdcs sdcm sdcb sdaj sdcc sdad sdbc sdb sdq
(echo g; echo n; echo; echo; echo 41984000; echo w;) | fdisk /dev/$DRIVE

Curiosity Python string.strip() removes just more than white spaces

Another Python curiosity.

If you see the Official Python3 documentation for strip(), it says that strip without parameters will return the string without the leading and trailing white spaces.
Optionally you can pass a string with the characters you want to eliminate.

The official documentation for Python 2 says:

string.strip(s[, chars])

Return a copy of the string with leading and trailing characters removed. If chars is omitted or None, whitespace characters are removed. If given and not None, chars must be a string; the characters in the string will be stripped from the both ends of the string this method is called on.

Changed in version 2.2.3: The chars parameter was added. The chars parameter cannot be passed in earlier 2.2 versions.

A white space is a white space. Is not an Enter.
But strip() without parameters will remove white spaces (space), and Enter \n and Tabs \t.

Probably you will not realize that unless you read from a file that has empty lines at the end for a reason, and you use strip().

You can see a demonstration following this small program, that runs the same for Python2 and Python3.

And the corresponding output for python2 and python3:

The [ ] characters where added to show that there are no hidden tabs or similar after.

Here I paste the code so you can try yourself:

import sys

def print_bar():

def print_between_brackets(s_test):
    print("[" + s_test + "]")

s_string_with_enters = "  Testing strip not only removing white spaces, but Enter and Tabs s well\n\t\n\n"

print("Testing .strip()")
print("You are running Python " + sys.version)
print("This is the original string")
print("Now after strip()...")
print("As you can see the Enters and the Tabs have been removed, not just the spaced")

I think this should be disambiguate so I decided to take action. Is very easy to blame and never contribute. Not me. I went to Python to fix that and I located a bug reporting this issue:

The issue was registed and made specially interesting contributions by Dimitri Papadopoulos Orfanos.

The thread is really interesting to read. I recommend it.

At a glance:

“Python heavily relies on isspace() to detect “whitespace” characters.”

* Lib/ near line 23:
  whitespace = ' \t\n\r\v\f'

So all those characters will be stripped in Python2.7 if you use just string.strip()

The ticket was opened the 2015-10-18 12:15. So it’s a shame the documentation has not been updated yet, more than 3 years later. Those are the kind of things, lack of care, that I can’t understand. Not looking for the excellence.

Please, do note that Python3 supports Unicode natively and things are always a bit different than with Python2 and AscII.

Solving a persistent MD Array problem in RHEL7.4

Ok, so I lend one of my Servers to two of my colleagues in The States, that required to prepare some test for a customer. I always try to be nice and to stimulate sales.

I work with Declustered RAID, DRAID, and ZFS.

The Server was a 4U90, so a 4U Server with 90 SAS3 drives and 4 SSD. Drives are Dual Ported, and two Controllers (motherboard + CPU) have access simultaneously to the drives for HA.

After their tests my colleagues, returned me the Server, and I needed to use it and my surprise was when I tried to provision with ZFS and I encountered problems. Not much in the logs.

I checked:

cat /proc/mdstat

And that was the thing 8 MD Arrays where there.

[root@4u90-B ~]# cat /proc/mdstat 
Personalities : 
md2 : inactive sdba1[9](S) sdag1[7](S) sdaf1[3](S)
11720629248 blocks super 1.2

md1 : inactive sdax1[7](S) sdad1[5](S) sdac1[1](S) sdae1[9](S)
12056071168 blocks super 1.2

md0 : inactive sdat1[1](S) sdav1[9](S) sdau1[5](S) sdab1[7](S) sdaa1[3](S)
19534382080 blocks super 1.2

md4 : inactive sdbf1[9](S) sdbe1[5](S) sdbd1[1](S) sdal1[7](S) sdak1[3](S)
19534382080 blocks super 1.2

md5 : inactive sdam1[1](S) sdan1[5](S) sdao1[9](S)
11720629248 blocks super 1.2

md8 : inactive sdcq1[7](S) sdz1[2](S)
7813752832 blocks super 1.2

md7 : inactive sdbm1[7](S) sdar1[1](S) sdy1[9](S) sdx1[5](S)
15627505664 blocks super 1.2

md3 : inactive sdaj1[9](S) sdai1[5](S) sdah1[1](S)
11720629248 blocks super 1.2

md6 : inactive sdaq1[7](S) sdap1[3](S) sdr1[8](S) sdp1[0](S)
15627505664 blocks super 1.2

Ok. So I stop the Arrays

mdadm --stop /dev/md127

And then I zero the superblock:

mdadm --zero-superblock /dev/sdb1

After doing this for all I try to provision and… surprise! does not work. /dev/md127 has respawned like in the old times from Doom video game.

I check the mdmonitor service and even disable it.

systemctl disable mdmonitor

I repeat the process.

And /dev/md127 appears again, using another device.

At this point, just in case, I check the other controller, which should be powered off.

Ok, it was on. I launch the poweroff command, and repeat, same!.

I see that the poweroff command on the second Controller is doing a reboot. So I launch the halt command that makes it not respond to the ping anymore.

I repeat the process, and still the ghost md array appears there, and blocks me from doing my zpool create.

The /etc/mdadm.conf file did not exist (by default is not created).

I try a more aggressive approach:

DRIVES=`cat /proc/partitions | grep 3907018584 | awk '{ print $4; }'`

for DRIVE in $DRIVES; do echo "Trying /dev/${DRIVE}1"; mdadm --examine /dev/${DRIVE}1; done

Ok. And destruction time:

for DRIVE in $DRIVES; do echo "Trying /dev/${DRIVE}"; wipefs -a -f /dev/${DRIVE}; done

for DRIVE in $DRIVES; do echo "Trying /dev/${DRIVE}1"; mdadm --zero-superblock /dev/${DRIVE}1; done

Apparently the system is clean, but still I cannot provision, and /dev/md127 respaws and reappears all the time.

After googling and not finding anything about this problem, and my colleagues no having clue about what is causing this, I just proceed with a simple solution, as I need the Server for my company completing the tests in the next 24 hours.

So I create the file /etc/mdadm.conf with this content:

[root@draid-08 ~]# cat /etc/mdadm.conf 
AUTO -all

After that I rebooted the Server and I saw the infamous /dev/md127 is not there and I’m able to provision.

I share the solution as it may help other people.

Troubleshooting upgrading and loading a ZFS module in RHEL7.4

I illustrate this troubleshooting as it will be useful for some of you.

I requested to one of the members of my Team to compile and to install ZFS 7.9 to some of the Servers loaded with drives, that were running ZFS 7.4 older version.

Those systems were running RHEL7.4.

The compilation and install was fine, however the module was not able to load.

My Team member reported that: when trying to run “modprobe zfs”. It was giving the error:

modprobe: ERROR: could not insert 'zfs': Invalid argument

Also when trying to use a zpool command it gives the error:

Failed to initialize the libzfs library

That was only failing in one of the Servers, but not in the others.

My Engineer ran dmesg and found:

zfs: `' invalid for parameter `metaslab_debug_unload

He though it was a compilation error, but I knew that metaslab_debug_unload is an option parameter that you can set in /etc/zfs.conf

So I ran:

 modprobe -v zfs

And that confirmed my suspicious, so I edited /etc/zfs.conf and commented the parameter and tried again. And it failed.

As I run modprobe -v zfs (verbose) it was returning me the verbose info, and so I saw that it was still trying to load those parameters so I knew it was reading those parameters from some file.
I could have grep all the files in the filesystem looking for the parameter failing in the verbose or find all the files in the system named zfs.conf. To me it looked inefficient as it would be slow and may not bring any result (as I didn’t know how exactly my team member had compiled the code), however I expected to get the result. But what if I found 5 or 7 zfs.conf files?. Slow.
I used strace. It was not installed but the RHEL license was active so I simple did:

 yum install strace

strace is for System Trace and so it records all the System Calls that the programs do.
That’s a pro trick that will accompany you all your career.

So I did strace modprobe zfs

I did not use -v in here cause all the verbose would had been logged as a System Call and made more difficult my search.
I got the output of all the System Calls and I just had to look for which files were being read.

Then I found that zfs.conf under /etc/modprobe.d/zfs.conf
That was the one being read. So I commented the line and tried modprobe zfs and it worked perfectly. :)


An Epic fail that are committing all the universities

Article created on: 1528997557 | 2018-06-14 18:32:15 IST

Recently a mentor of the UCC university came to visit me to my office, in order to do the following of one of the members of my Team, an intern.
Conversation was well, and then at some point he asked what courses could do the university teach to their students in order to be more prepared for working with us.
The Head of Business Development, that was in the meeting with me, mentioned something interesting:
– Make the publish their best code in github, bitbucket or similar git repository, and maintain it. It is like a CV.
He pointed that some of the students sent me their repository page, and they have not committed a thing for more than a year. And usually the code that I find there is less than a tic-tac-toe exercise.
– Obviously, to have git experience.
– Having contributed to an Open Source project

I exposed some things that would be helpful to have in the interns and grads that I hire:
– git experience
– Python programming
– C programming
– Unit Testing experience
– Networking experience, in particular iSCSI exports, tcpdump
– Programming Best practices, PEP-8 at least for Python
– Usage of Professional Tools like PyCharm, JetBrains IntelliJ, PHPStorm, Code Lion, Netbeans, Eclipse
– Linux experience. Many of them use Windows at home cause they also play video games. Really few programmers in real life use windows. So at least guys install Virtual Box or VMWare and run Linux in an Virtual Machine.
– Cloud experience. Using instances, CDNs, APIs, tools…

And as the talking advanced I gave him a hint of the Epic fail that all the universities are committing.
They teach git for a semester. They teach Python for one or two semester, the first year usually one, the second year another. And that’s it. Is gone.
When they exit the university they have not programmed in Python for 2 or 3 years, they have not used git, they have not used SQL for the same amount of time, etc…

My boss pointed that the best candidates do side projects in their spare time, and have that bright in their eyes. That sparkling in the eyes is what I call the eye of the tiger, the desire to improve, to learn. That spark.

I told the mentor of my intern that the big mistake is doing things in small parcels, isolated, one block and is gone. That the best way to proceed would be to:
Make the student start a project from the very beginning, from the first semester. Then keep making it bigger and better over time.
Let them improve it over time. Screw it in all the ways possible. Make them reach the limits of their initial architecture. Allow them to face having to redo the thing from the scratch. Allow them to do screw it, to break things, and to learn from their mistakes. Over and over.

Nobody becomes a great programmer coding average things for two semesters.
But let them realize where the problems are. Let them come back to their code of two or three months ago, before holidays, and realize how important is to make comments, to give proper names to the files and to the variables. Let them run that project over so many time, that at some point they have to change computer and they realize that what worked with windows Uppercases and Lowercase mixed files, does not work with Unix (case sensitive).
Let them grow.
Let them see their mistakes over the time.

Let them run the project for so long so they switch several times from Cloud provider, and discover the pros and the cons and the not-to-do, and things like run for your life before using sharing hostings that limit your CPU quota even that kills your MySql instances when they look at the email (true history, connecting to POP3 was raising the CPU and the provider was killing the MySQL instances, and so the queries) or that limits your queries per second, and then ask them to install a drupal and they will learn the hard way why Quality is always better than price and will make the right decisions when they work for somebody else or for their own Startup.

Even many of the supposedly Senior guys never learned from their mistakes, for example the Outsourcing guys, cause they work 6 months to a year in a project and then jump to another. Nobody explains the hell in maintenance and incidental they have left there. Nobody teach them.

Programming an small project for 6 months doesn’t make a master. Doing it for 5 years, growing it, learning from your mistakes and learning the YES and DO-NOT the hard way, the real way that works, cause makes you understand why something is better than other things, is the path.

That also remembers me why I love the MT Notation and many of the guys in Barcelona that saw it criticized the method, while my colleagues at Facebook and Dropbox actually told me that they use it, specially for Python and C/C++.

Allow them to thing about how to solve sorting a list of 1000 items by themselves. Let them think. The lazy will copy, but they will not grow.

Then let them implement a Bubble sort. Let them improve it, if they can. Allow them a week to try to improve that. Then make them sort 1,000,000 items so they see that is bloody slow. How can I improve that?. May I read the data from the drive at once, reading line by line was slow… let them think. Like if they were learning Martial Arts, and so discovering their strengths, that they have fast reflexes, allow them to grow.

Universities have to create good professional, not just machines of passing the exams. Real world demands talent, problem solving abilities, passion, ability to learn, and will to do the things well and to improve, and discipline.

After 5-6 years of programming on a daily basis, with an IDE, git, deploying to the Cloud as the basic, and growing a program and seeing the downsides of the solutions chosen, observing that the caveats where for a reason, learning that the Hardware is important, that is not the same to write to memory that to disk or to network, detecting the problems, redoing things, ending in a cul-de-sac, fixing, improving, learning, growing the project, growing himself/herself as a mind, as a programmer, as a thinker, as an expert, daily, even if it’s 30 minutes per day, then that person is prepared for some serious business.

Like piano, guitar, painting, writing… and any other activity, one require continue training in order to improve.

Students have to follow a journey in order to improve.

Let them start with Command Line, i.e. in C and files. Let’s do add later database support.

Deal with buffer overflow, file descriptor, locks and conversion types. Let them migrate to another language the entire project, using Git from the beginning.

Let them migrate again when they need to add Web support. Allow them to discover that instead of reloading all the page they can use Ajax/JSON. Let them deal with click-click that many common users do on the page buttons (so they submit twice the information). To discover SQL Injections. To use a Web Framework. To add Unit Testing. Add some improvement via Javascript Frameworks like responsive for mobiles.

Allow them to use a new Database, new Webserver or technology that is fashion and everybody on Twitter talks about, so it crashes in their face. And so they discover that they will not play or discover new technologies in actual project time in the Company of their future employers, cause shit happens, and impacts the Schedule, and the Company loses money. Universities: Teach them, let the students learn this for themselves, rather than screwing it up in several companies after university.

Linux command-line tools I usually install

Some additional command-line tools that I use to install and use on my client Systems

Apache benchmarks

To stress a Web Server



Reports file access events from all running processes in real time.

With flock several processes can have a shared lock at the same time, or be waiting to acquire a write lock. With lslocks from util-linux package you can get a list of these processes.


discard unused blocks on a mounted filesystem (local or remote). Is useful for freeing blocks no longer used in ZFS zvols. That can also be achieved by mount -o discard

Show which processes use the named files, sockets, or filesystems.


Get/set SATA/IDE device parameters


An improved top


To set the metrics of all IPV4 routes attached to a given network interface


To watch metrics for a network interface (or wireless)



CPU and IO devices stats. I modified some collectors for telegraf and influxdb consumed by grafana for fetching the Write KB/s, Read KB/s, Bandwidth of the Magnetic Spinning drives and SSD during declustered rebuild.



Perform network throughput tests




java (jre Oracle and OpenJDK)

According to manpages, the opposite of more. :)
What it does is display a file, and you can scroll up/down, you can search for patterns…
cat /etc/passwd | less
less /etc/passwd
# -n doesn’t count the lines, to save time
# For a specific Offset
less -n +500000000P /var/log/apache2/giant.log
# For 50% point
less -n +50p /var/log/apache2/giant.log


To trace library calls.


Midnight Commander





To see in real time queries and slow queries to mysql


Show the space used by any directory and subdirectory



nginx (fpm-php) and apache

The webservers

nfs client


Offers monitoring of different aspects: Network, Disk, Processes…




Partition manipulation


Performance profiler.

perf top
perf stat ls

PHP + curl + mysql (hhvm)

python-pip and pypy

Pipe Viewer – is a terminal-based tool for monitoring the progress of data through a pipeline. It can be inserted into any normal pipeline between two processes to give a visual indication of how quickly data is passing through, how long it has taken, how near to completion it is, and an estimate of how long it will be until completion.

dd if=/dev/urandom | pv | dd of=/dev/null


1,74MB 0:00:09 [ 198kB/s] [      <=>                               ]


Access SCSI modes pages; read VPD pages; send simple SCSI commands

Displays Kernel slab cache information in real time.


Utility for dealing with the S.M.A.R.T. features of the disks, knowing errors…

Split a file into several, based by text lines, or binary: number of bytes per file.

SSH without typing the password. -f for reading it from a file.
sshpass -p “mypassword” ssh -o StrictHostKeyChecking=no root@

In this sample passing the command ls, so this will be executed, and logout.
sshpass -p “mypassword” ssh -o StrictHostKeyChecking=no root@ ls


A poor’s man VPN through SSH that is available for Linux and Mac OS X.

E.g.: sudo sshuttle -r carles@

Sshuttle forwards TCP and DNS but does not forward UDP or ICMP. So ping or ipmi protocol won’t work. But it does work for http, https, ssh…

Nice article on tunneling only certain things here.


To trace the system calls and signals.
To redirect the output to another process use:
strace zpool status 2>&1 | less

subversion svn

systemd-cgtop and systemd-analyze


To see the traffic to your NIC


Reads from sdtin and sends to a file and outputs to stdout as well.
For example:

find . -mtime -1 -print | grep -v "/logs/" 2>&1 | tee /var/log/results.log

Kills the process after the indicated timeout.

timeout 1000 dd if=/dev/urandom of=/dev/zvol/carles-N51-C5-D8-P2-S1/gb1000


Make a hexdump or do the reverse

sudo xxd /dev/nvme0n1p1 | less

Just like cat, but for compressed filed.

zcat logs.tar.gz | grep "Error"
zcat logs.tar.gz | less


Sergey Davidoff stumbled upon a project called compcache that creates a RAM based block device which acts as a swap disk, but is compressed and stored in memory instead of swap disk (which is slow), allowing very fast I/O and increasing the amount of memory available before the system starts swapping to disk. compcache was later re-written under the name zRam and is now integrated into the Linux kernel.

Using Windows 10 Appliance in Ubuntu Virtual Box 4.3.10

blog-carlesmateo-com-microsoft-edgeMicrosoft has released Windows 10, and with it the possibility to Download a Windows 10 Appliance to run under Virtual Box, VMWare player, HyperV (for windows), Parallels (Mac). Their idea is to allow you to test Microsoft Edge new browser in addition of being able to test the older browsers in older VM images.

I wanted to use Windows 10 to check compatibility with my messenger c-client.

Also I wanted to know how Java behaves.

The Windows 10 VM image will work for 90 days. You can download it from here (

Instructions are very precarious and they didn’t specify a minimum version, however if you use Virtual Box under Ubuntu 14.04, so Virtual Box 4.3.10, you’ll not be able to import the Appliance as you’ll get an error.

Update: Thanks to Razvan and Eric!, readers that reported that this also works for Mac OS 10.9.5. + Virtual Box 4.3.12 and VirtualBox 4.3.20 running under Windows 7 respectively.

‘Windows10_64’ is not a valid Guest OS type.

Result Code: NS_ERROR_INVALID_ARG (0x80070057)
Component: VirtualBox
Interface: IVirtualBox {fafa4e17-1ee2-4905-a10e-fe7c18bf5554}
Callee: IAppliance {3059cf9e-25c7-4f0b-9fa5-3c42e441670b}


I was looking to find a solution and found no solution on the Internet, so I decided to give a chance and try to fix it by myself.

The error is: ‘Windows10_64’ is not a valid Guest OS type. so obviously, the Windows10_64 is not on the list of the VirtualBox yet, it is a pretty new release. Microsoft could had shipped it with OS Type Windows 64 Other, or Windows 8 64 bits, but they did’t. I wondered if I could edit the image to trick it to appear as a recognized image.

I edited the file (MSEdge – Win10.ova) with Bless Hex Editor, an hexadecimal editor.

I looked for the String “Windows10_64” and found two occurrences.

blog-carlesmateo-com-bless-hex-editor-searchingI had to replace the string and leave it with exact number of bytes it has, so the same length (do not insert additional bytes). I searched for the list of supported OSes and found that “WindowsXP_64” would be a perfect match. I replaced that 10 for XP twice.

blog-carlesmateo-com-bless-hex-editor-windows10_64-to-windowsXP_64Then tried to import the Appliance and it worked.

blog-carlesmateo-com-virtual-box-importing-windows10-appliance-ova-cutblog-carlesmateo-com-bless-applicance-settingsI tried to run it like that, but it froze on the boot, with the new blue logo of windows.

I figured out that Windows XP would probably not be the best similar architecture, so I edited the config and I set Windows 8.1 (64 bit). I also increased the RAM to 4096 MB and set a 32 MB memory for the video card.


Then I just started the VM and everything worked.


Ok, a funny note: Just started, it installed me an update without asking ;)