Category Archives: DevOps

Post-Mortem: The mystery of the duplicated Transactions into an e-Commerce

Me, with 4 more Senior BackEnd Engineers wrote the new e-Commerce for a multinational.

The old legacy Software evolved into a different code for every country, making it impossible to be maintained.

The new Software we created used inheritance to use the same base code for each country and overloaded only the specific different behavior of every country, like for the payment methods, for example Brazil supporting “parcelados” or Germany with specific payment players.

We rewrote the old procedural PHP BackEnd into modern PHP, with OOP and our own Framework but we had to keep the transactional code in existing MySQL Procedures, so the logic was split. There was a Front End Team consuming our JSONs. Basically all the Front End code was cached in Akamai and pages were rendered accordingly to the JSONs served from out BackEnd.

It was a huge success.

This e-Commerce site had Campaigns that started at a certain time, so the amount of traffic that would come at the same time would be challenging.

The project was working very well, and after some time the original Team was split into different projects in the company and a Team for maintenance and evolutives was hired.

At certain point they started to encounter duplicate transactions, and nobody was able to solve the mystery.

I’m specialized into fixing impossible problems. They used to send me to Impossible Missions, and I am famous for solving impossible problems easily.

So I started the task with a SRE approach.

The System had many components and layers. The problem could be in many places.

I had in my arsenal of tools, Software like mysqldebugger with which I found an unnoticed bug in decimals calculation in the past surprising everybody.

Previous Engineers involved believed the problem was in the Database side. They were having difficulties to identify the issue by the random nature of the repetitions.

Some times the order lines were duplicated, and other times were the payments, which means charging twice to the customer.

Redis Cluster could also play a part on this, as storing the session information and the basket.

But I had to follow the logic sequence of steps.

If transactions from customer were duplicated that mean that in first term those requests have arrived to the System. So that was a good point of start.

With a list of duplicated operations, I checked the Webservers logs.

That was a bit tricky as the Webserver was recording the Ip of the Load Balancer, not the ip of the customer. But we were tracking the sessionid so with that I could track and user request history. A good thing was also that we were using cookies to stick the user to the same Webserver node. That has pros and cons, but in this case I didn’t have to worry about the logs combined of all the Webservers, I could just identify a transaction in one node, and stick into that node’s log.

I was working with SSH and Bash, no log aggregators existing today were available at that time.

So when I started to catch web logs and grep a bit an smile was drawn into my face. :)

There were no transactions repeated by a bad behavior on MySQL Masters, or by BackEnd problems. Actually the HTTP requests were performed twice.

And the explanation to that was much more simple.

Many Windows and Mac User are used to double click in the Desktop to open programs, so when they started to use Internet, they did the same. They double clicked on the Submit button on the forms. Causing two JavaScript requests in parallel.

When I explained it they were really surprised, but then they started to worry about how they could fix that.

Well, there are many ways, like using an UUID in each request and do not accepting two concurrents, but I came with something that we could deploy super fast.

I explained how to change the JavaScript code so the buttons will have no default submit action, and they will trigger a JavaScript method instead, that will set a boolean to True, and also would disable the button so it can not be clicked anymore. Only if the variable was False the submit would be performed. It was almost impossible to get a double click as the JavaScript was so fast disabling the button, that the second click will not trigger anything. But even if that could be possible, only one request would be made, as the variable was set to True on the first click event.

That case was very funny for me, because it was not necessary to go crazy inspecting the different layers of the system. The problem was detected simply with HTTP logs. :)

People often forget to follow the logic steps while many problems are much more simple.

As a curious note, I still see people double clicking on links and buttons on the Web, and some Software not handling it. :)

The Ethernet standards group announces a new 800 GbE specification

Here is the link to the new:

And this makes me think about all the Architects that are using Memcached and Redis in different Servers, in Networks of 1Gbps and makes me want to share with you what a nonsense, is often, that.

So the idea of having Memcache or Redis is just to cache the queries and unload the Database from those queries.

But 1Gbps is equivalent to 125MB (Megabytes) per second.

Local RAM Memory in Servers can perform at 24GB and more (24,000,000 Megabytes) per second, even more.

A PCIE NVMe drive at 3.5GB per second.

A local SSD drive without RAID 550 MB/s.

A SSD in the Cloud, varies a lot on the provider, number of drives, etc… but I’ve seen between 200 MB/s and 2.5GB/s aggregated in RAID.

In fact I have worked with Servers equipped with several IO Controllers, that were delivering 24GB/s of throughput writing or reading to HDD spinning drives.

If you’re in the Cloud. Instead of having 2 Load Balancers, 100 Front Web servers, with a cluster of 5 Redis with huge amount of RAM, and 1 MySQL Master and 1 Slave, all communicating at 1Gbps, probably you’ll get a better performance having the 2 LBs, and 11 Front Web with some more memory and having the Redis instance in the same machine and saving the money of that many small Front and from the 5 huge dedicated Redis.

The same applies if you’re using Docker or K8s.

Even if you just cache the queries to drive, speed will be better than sending everything through 1 Gbps.

This will matter for you if your site is really under heavy load. Most of the sites just query the MySQL Server using 1 Gbps lines, or 2 Gbps in bonding, and that’s enough.

Troubleshooting a shell prompt irresponsible that locks/hangs intermittently

You do df -h or ls / and the terminal freezes and not even CTRL + C works, you have a lock.

Normally this is due to a lock of the system trying to perform an IO.

Could be a physical spinning disk failing, but the most probably nowadays is that you have a network mount point and it is timing out.

If you execute mount and you get a timeout, and when you finally see the list you see a NFS, iSCSI or another kind of Network mount (you will see an Ip Address), check for errors.

To do this in CentOS/RHEL you can do as root:

dmesg | grep -i "timed"

or depending on the System

cat /var/log/messages | grep -i "timed"

You’ll get something like this:

[root@compute01 carles]# dmesg -T | grep timed | head -n5
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:44 2020] nfs: server storage07 not responding, timed out
[Fri Mar 20 02:27:45 2020] nfs: server storage07 not responding, timed out

Please note I use dmesg -T in order to have human readable date instead of Unix Epoch.

You can count the errors today:

[root@compute01 carles]# dmesg -T | grep time | grep "Mon Apr 6" | wc --lines

Installing Red Hat Linux in a M.2 that crashes the installer

Few months ago I encountered with a problem with RHEL installer and some of the M.2 drives.

I’ve productized my Product, to be released with M.2 booting SATA drives of 128GB.

The procedure for preparing the Servers (90 and 60 drives, Cold Storage) was based on the installation of RHEL in the M.2 128GB drive. Then the drives are cloned.

Few days before mass delivery the company request to change the booting M.2 drives for others of our own, 512 GB drives.

I’ve tested many different M.2 drives and all of them were slightly different.

Those 512 GB M.2 drives had one problem… Red Hat installer was failing with a python error.

We were running out of time, so I decided to clone directly from the 128GB M.2 working card, with everything installed, to the 512 GB card. Doing that is so easy as booting with a Rescue Linux USB disk, and then doing a dd from the 128GB drive to the 512GB drive.

Booting with a live USB system is important, as Filesystem should not be mounted to prevent corruption when cloning.

Then, the next operation would be booting the 512 GB drive and instructing Linux to claim the additional space.

Here is the procedure for doing it (note, the OS installed in the M.2 was CentOS in this case):

Determine the device that needs to be operated on (this will usually be the boot drive); in this example it is /dev/sdae

# df -h 
Filesystem                             Size  Used Avail Use% Mounted on
/dev/mapper/centos_4602c-root           50G  2.4G   47G   1% /
devtmpfs                                16G     0   16G   0% /dev
tmpfs                                   16G     0   16G   0% /dev/shm
tmpfs                                   16G  395M   16G   3% /run
tmpfs                                   16G     0   16G   0% /sys/fs/cgroup
/dev/sdae1                            1014M  146M  869M  15% /boot
/dev/mapper/centos_4602c-home           57G   33M   57G   1% /home
tmpfs                                  3.2G     0  3.2G   0% /run/user/0
logs                                    68G  7.4M   68G   1% /logs
mysql                                  481G  128K  481G   1% /mysql
N58-C3-D16-P3-S1                       491T  334G  490T   1% /N58-C3-D16-P3-S1

Extend the OS partition using Parted

# parted /dev/sdae
resizepart PART_NUMBER END


  • PART_NUMBER: Is the partition number obtained from the “print” command
  • END: This is the end of the drive; for example, for a 50GB drive, enter 50000

Examining the LVM Partitions

The centos_4602c-root LVM partition is the one we want to extend.

# lsblk /dev/sdae
NAME                          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdae                           65:224  0   477G  0 disk 
├─sdae1                        65:225  0     1G  0 part /boot
└─sdae2                        65:226  0 475.9G  0 part 
  ├─centos_4602c-root         253:0    0    50G  0 lvm  /
  ├─centos_4602c-swap         253:1    0  11.9G  0 lvm  [SWAP]
  └─centos_4602c-home         253:2    0  56.3G  0 lvm  /home

Using LVM Commands

The following commands will:

  • Display the LVM volumes on the system
  • Resize a volume (device)
  • Re-display the updated LVM volumes
  • Extend the desired LVM partition (lvextend command)
# pvdisplay
  /dev/sdbm: open failed: No medium found
  /dev/sdbn: open failed: No medium found
  /dev/sdbj: open failed: No medium found
  /dev/sdbk: open failed: No medium found
  /dev/sdbl: open failed: No medium found
  --- Physical volume ---
  PV Name               /dev/sdae2
  VG Name               centos_4602c
  PV Size               118.24 GiB / not usable 3.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              30269
  Free PE               0
  Allocated PE          30269
  PV UUID               yvHO6t-cYHM-CCCm-2hOO-mJWf-6NUI-zgxzwc
# pvresize /dev/sdae2
  /dev/sdbm: open failed: No medium found
  /dev/sdbn: open failed: No medium found
  /dev/sdbj: open failed: No medium found
  /dev/sdbk: open failed: No medium found
  /dev/sdbl: open failed: No medium found
  Physical volume "/dev/sdae2" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized
# pvdisplay
  /dev/sdbm: open failed: No medium found
  /dev/sdbn: open failed: No medium found
  /dev/sdbj: open failed: No medium found
  /dev/sdbk: open failed: No medium found
  /dev/sdbl: open failed: No medium found
  --- Physical volume ---
  PV Name               /dev/sdae2
  VG Name               centos_4602c
  PV Size               <475.84 GiB / not usable 3.25 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              121813
  Free PE               91544
  Allocated PE          30269
  PV UUID               yvHO6t-cYHM-CCCm-2hOO-mJWf-6NUI-zgxzwc
# vgdisplay
  --- Volume group ---
  VG Name               centos_4602c
  System ID             
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  6
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               <475.93 GiB
  PE Size               4.00 MiB
  Total PE              121838
  Alloc PE / Size       30269 / <118.24 GiB
  Free  PE / Size       91569 / 357.69 GiB
  VG UUID               ORcp2t-ntwQ-CNSX-NeXL-Udd9-htt9-kLfvRc
# lvextend -l +91569 /dev/centos_4602c/root 
  Size of logical volume centos_4602c/root changed from 50.00 GiB (12800 extents) to <407.69 GiB (104369 extents).
  Logical volume centos_4602c/root successfully resized.

Extend the xfs file system to use the extended space

The xfs file system for the root partition will need to be extended to use the extra space; this is done using the xfs_grow command as shown below.

# xfs_growfs /dev/centos_4602c/root  
meta-data=/dev/mapper/centos_4602c-root isize=512    agcount=4, agsize=3276800 blks
         =                       sectsz=512   attr=2, projid32bit=1          =                       crc=1        finobt=0 spinodes=0 data     =                       bsize=4096   blocks=13107200, imaxpct=25 
         =                       sunit=0      swidth=0 blks 
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1 log      =internal               bsize=4096   blocks=6400, version=2          =                       sectsz=512   sunit=0 blks, lazy-count=1 
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 13107200 to 106873856 

Verify the results

Note that the c-root LVM partition is now 408GB.

# df -h 
Filesystem                             Size  Used Avail Use% Mounted on
/dev/mapper/centos_4602c-root          408G  2.4G  406G   1% /
devtmpfs                                16G     0   16G   0% /dev
tmpfs                                   16G     0   16G   0% /dev/shm
tmpfs                                   16G  395M   16G   3% /run
tmpfs                                   16G     0   16G   0% /sys/fs/cgroup
/dev/sdae1                            1014M  146M  869M  15% /boot
/dev/mapper/centos_4602c-home           57G   33M   57G   1% /home
tmpfs                                  3.2G     0  3.2G   0% /run/user/0
logs                                    68G  7.4M   68G   1% /logs
mysql                                  481G  128K  481G   1% /mysql
N58-C3-D16-P3-S1                       491T  334G  490T   1% /N58-C3-D16-P3-S1

So now we are able to clone directly from one 512GB to another.

You may be interested to take a look to the commands:

xfs_growfs (from xfsprogs package)

If you want to do this in an instance in Amazon, here is a very good documentation.

A handy trick command line to get the usages of our Python Methods in the code

We all use powerful code analysis tool, but sometimes you’re presented with a problem and you have just… the terminal.

This Bash code is handy.

grep "def " /home/carles/code/gitlab/cloud/terraform/src/scale/lib/ | tr "()" "  " | awk '{ print $2; }' |  grep -v "__init" | sort > ./function_names_iscsi.txt

So this basically will get all the methods (“def ” whatever), strip the parenthesis with tr, and get the second column with awk, so basically the method name, sort it and write it to the file.

Then I will cd to the src directory and execute the seconds part:

cd /home/carles/code/gitlab/cloud/terraform/src/
for fname in $(cat ~/function_names_iscsi.txt); do printf "%s: %s\n" "$fname" "$(grep -r $fname *|grep -v 'def ' -c)"; done > ~/functions_being_used.txt

That will produce a nice list with the number of times of the method being called, in the form of:

method_name: occurrences

That’s the equivalent to doing Find Usages is PyCharm.

It’s easy to identify dead code then, with method_name: 0.

You can also run this to your Jenkins to warn when there is Dead Code in your repository.

Adding my Server as Docker, with PHP Catalonia Framework, explained

The previous day I explained how I migrated my old Server (Amazon Instance) to a more powerful model, with more recent OS, WebServer, etc…

This was interesting under the point of view of dealing with elastic Ip’s, Amazon AWS Volumes, etc… but was a process basically manual. I could have generated an immutable image to start from next time, but this is another discussion, specially because that Server Instance has different base Software, including a MySql Database.

This time I want to explain, step by step, how to conainerize my Server, so I can port to different platforms, and I can be independent on what the Server Operating System is. It will work always, as we defined the Operating System for the Docker Container.

So we start to use IaC (Infrastructure as Code).

So first you need to install docker.

So basically if your laptop is an Ubuntu 18.04 LTS you have to:

sudo apt install

Start and Automate Docker

The Docker service needs to be setup to run at startup. To do so, type in each command followed by enter:

sudo systemctl start docker
sudo systemctl enable docker

Create the Dockerfile

For doing this you can use any text editor, but as we are working with IaC why not use a Code Editor?.

You can use the versatile PyCharm, that has modules for understanding Docker and so you can use Control Version like git too.

This is the Dockerfile

FROM ubuntu:19.04


ARG DEBIAN_FRONTEND=noninteractive

#RUN echo "nameserver" > /etc/resolv.conf

RUN echo "Europe/Ireland" | tee /etc/timezone

# Note: You should install everything in a single line concatenated with
#       && and finalising with apt autoremove && apt clean
#       In order to use the less space possible, as every command is a layer
RUN apt-get update && apt-get install -y apache2 ntpdate libapache2-mod-php7.2 \
mysql-server php7.2-mysql php-dev libmcrypt-dev php-pear git && \
apt autoremove && apt clean

RUN a2enmod rewrite

RUN mkdir -p /www

# In order to activate Debug
# RUN sed -i "s/display_errors = Off/display_errors = On/" /etc/php/7.2/apache2/php.ini 
# RUN sed -i "s/error_reporting = E_ALL & ~E_DEPRECATED & ~E_STRICT/error_reporting = E_ALL/" /etc/php/7.2/apache2/php.ini 
# RUN sed -i "s/display_startup_errors = Off/display_startup_errors = On/" /etc/php/7.2/apache2/php.ini 
# To Debug remember to change:
# config/{production.php|preproduction.php|devel.php|docker.php} 
# in order to avoid Error Reporting being set to 0.


ENV APACHE_LOG_DIR   /var/log/apache2
ENV APACHE_PID_FILE  /var/run/apache2/
ENV APACHE_RUN_DIR   /var/run/apache2
ENV APACHE_LOCK_DIR  /var/lock/apache2
ENV APACHE_LOG_DIR   /var/log/apache2


# Remove the default Server
RUN sed -i '/<Directory \/var\/www\/>/,/<\/Directory>/{/<\/Directory>/ s/.*/# var-www commented/; t; d}' /etc/apache2/apache2.conf 

RUN rm /etc/apache2/sites-enabled/000-default.conf

COPY /etc/apache2/sites-available/


RUN ln -s /etc/apache2/sites-available/ /etc/apache2/sites-enabled/

# Note: You should clone locally and COPY to the Docker Image
#       Also you should add the .git directory to your .dockerignore file
#       I made this way to show you and for simplicity, having everything
#       in a single file
RUN git clone /www/
RUN git checkout tags/v.1.16-web-1.0
# In order to change profile to Production
# RUN sed -i "s/define('ENVIRONMENT', DOCKER)/define('ENVIRONMENT', PRODUCTION)/" /var/www/ 

# for debugging
#RUN apt-get install -y vim

RUN service apache2 restart


CMD ["/usr/sbin/apache2", "-D", "FOREGROUND"]

The file

As you saw in the Dockerfile you have the line:

COPY /etc/apache2/sites-available/

This will copy the file that must be in the same directory that the Dockerfile file, to the /etc/apache2/sites-available/ folder in the conainer.

<VirtualHost *:80>
    # Uncomment to use a DNS name in a multiple VirtualHost Environment
    DocumentRoot /www/
    <Directory /www/>
            Options -Indexes +FollowSymLinks +MultiViews
            AllowOverride All
            Order allow,deny
            allow from all
            Require all granted
    ErrorLog ${APACHE_LOG_DIR}/www-cataloniaframework-com-error.log
    # Possible values include: debug, info, notice, warn, error, crit,
    # alert, emerg.
    LogLevel warn
    CustomLog ${APACHE_LOG_DIR}/www-cataloniaframework-com-access.log combined

Stoping, starting the docker Service and creating the Catalonia image

service docker stop && service docker start

To build the Docker Image we will do:

docker build -t catalonia . --no-cache

I use the –no-cache so git is pulled and everything is reworked, not kept from cache.

Now we can run the Catalonia Docker, mapping the 80 port.

docker run -d -p 80:80 catalonia

If you want to check what’s going on inside the Docker, you’ll do:

docker ps

And so in this case, we will do:

docker exec -i -t distracted_wing /bin/bash

Finally I would like to check that the web page works, and I’ll use my preferred browser. In this case I will use lynx, the text browser, cause I don’t want Firefox to save things in the cache.

Upgrading the Blog after 5 years, AWS Amazon Web Services, under DoS and Spam attacks

Few days ago I was under a heavy DoS attack.

Nothing new, zombie computers, hackers, pirates, networks of computers… trying to abuse the system and to hack into it. Why? There could be many reasons, from storing pirate movies, trying to use your Server for sending Spam, try to phishing or to host Ransomware pages…

Most of those guys doesn’t know that is almost impossible to Spam from Amazon. Few emails per hour can come out from the Server unless you explicitly requests that update and configure everything.

But I thought it was a great opportunity to force myself to update the Operating System, core tools, versions of PHP and MySql.

Forensics / Postmortem of the incident

The task was divided in two parts:

  • Understanding the origin of the attack
  • Blocking the offending Ip addresses or disabling XMLRPC
  • Making the VM boot again (problems with Amazon AWS)
    • I didn’t know why it was not booting so.
  • Upgrading the OS

I disabled the access to the site while I was working using Amazon Web Services Firewall. Basically I turned access to my ip only. Example:

I changed so the world wide mask to my_Ip/3

That way the logs were reflecting only what I was doing from my Ip.

Dealing with Snapshots and Volumes in AWS

Well the first thing was doing an Snapshot.

After, I tried to boot the original Blog Server (so I don’t stop offering service) but no way, the Server appeared to be dead.

So then I attached the Volume to a new Server with the same base OS, in order to extract (dump) the database. Later I would attach the same Volume to a new Server with the most recent OS and base Software.

Something that is a bit annoying is that the new Instances, the new generation instances, run only in VPC, not in Amazon EC2 Classic. But my static Ip addresses are created for Amazon EC2 Classic, so I could not use them in new generation instances.

I choose the option to see all the All the generations.

Upgrading the system base Software had its own challenges too.

Upgrading the OS / Base Software

My approach was to install an Ubuntu 18.04 LTS, and install the base Software clean, and add any modification I may need.

I wanted to have all the supported packages and a recent version of PHP 7 and the latest Software pieces link Apache or MySQL.

sudo apt update

sudo apt install apache2

sudo apt install mysql-server

sudo apt install php libapache2-mod-php php-mysql


Config files that before were working stopped working as the new Apache version requires the files or symlinks under /etc/apache2/sites-enabled/ to end with .conf extension.

Also some directives changed, so some websites will not able to work properly.

Those projects using my Catalonia Framework were affected, although I have this very well documented to make it easy to work with both versions of Apache Http Server, so it was a very straightforward change.

From the previous version I had to change my file and enable:

    <Directory /www/>
Options Indexes FollowSymLinks MultiViews
AllowOverride All
Order allow,deny
allow from all

Then Open the ports for the Web Server (443 and 80).

sudo ufw allow in "Apache Full"

Then service apache restart

Catalonia Framework Web Site, which is also created with Catalonia Framework itself once restored


The problem was to use the most updated version of the Database. I could use one of the backups I keep, from last week, but I wanted more fresh data.

I had the .db files and it should had been very straightforward to copy to /var/lib/mysql/ … if they were the same version. But they weren’t. So I launched an instance with the same base Software as the old previous machine had, installed mysql-server, stopped it, copied the .db files, started it, and then I made a dump with mysqldump –all-databases > 2019-04-29-all-databases.sql

Note, I copied the .db files using the mythical mc, which is a clone from Norton Commander.

Then I stopped that instance and I detached that volume and attached it to the new Blog Instance.

I did a Backup of my original /var/lib/mysql/ files for the purpose of faster restoring if something went wrong.

I mounted it under /mnt/blog_old and did mysql -u root -p < /mnt/blog_old/home/ubuntu/2019-04-29-all-databases.sql

That worked well I had restored the blog. But as I was watching the /var/log/mysql/error.log I noticed some columns were not where they should be. That’s because inadvertently I overwritten the MySql table as well, which in MySQL 5.7 has different structure than in MySQL 5.5. So I screwed. As I previewed this possibility I restored from the backup in seconds.

So basically then I edited my .sql files and removed all that was for the mysql database.

I started MySql, and run the mysql import procedure again. It worked, but I had to recreate the users for all the Databases and Grant them permissions.

GRANT ALL PRIVILEGES ON db_mysqlproxycache.* TO 'wp_dbuser_mysqlproxy'@'localhost' IDENTIFIED BY 'XWy$&{yS@qlC|<¡!?;:-ç';


Some modules in my blogs where returning errors in /var/log/apache2/mysite-error.log so I checked that it was due to lack of support of latest PHP versions, and so I patched manually the code or I just disabled the offending plugin.


As seen checking the /var/log/apache2/ some URLs where not located by WordPress.

For example:

The requested URL /wordpress/wp-json/ was not found on this server

I had to activate modrewrite and then restart Apache.

a2enmod rewrite; service apache2 restart

Making the site more secure

Checking at the logs of Apache, /var/log/apache2/ I checked for Ip’s accessing Admin areas, I looked for 404 Errors pointing to intents to exploit a unsafe WP Plugin, I checked for POST protocol as well.

I added to the Ubuntu Uncomplicated Firewall (UFW) the offending Ip’s and patched the xmlrpc.php file to exit always.

Dropping caches in Linux, to check if memory is actually being used

I encountered that Server, Xeon, 128 GB of RAM, with those 58 Spinning drives 10 TB and 2 SSD of 2 TB each, where I was testing the latest version of my Software.

Monitoring long term tests, data validation, checking for memory leaks…
I notice the Server is using 70 GB of RAM. Only 5.5 GB are used for buffers according to the usual tools (top, htop, free, cat /proc/meminfo, ps aux…) and no programs are eating that amount, so where is the RAM?.
The rest of the Servers are working well, including models: same mode, 4U60 with 64 GB of RAM, 4U90 with 128 GB and All-Flash-Array with 256 GB of RAM, only using around 8 GB of RAM even under load.
iSCSI sharings being used, with I/O, iSCSI initiators trying to connect and getting rejected, several requests for second, disk pulling, and that usual stuff. And this is the only unit using so many memory, so what?.
I checked some modules to see memory consumption, but nothing clear.
Ok, after a bit of investigation one member of the Team said “Oh, while you was on holidays we created a Ramdisk and filled it for some validations, we deleted that already but never rebooted the Server”.
Ok. The easy solution would be to reboot, but that would had hidden a memory leak it that was the cause.
No, I had to find the real cause.

I requested assistance of one my colleagues, specialist, Kernel Engineer.
He confirmed that processes were not taking that memory, and ask me to try to drop the cache.

So I did:

echo 3 > /proc/sys/vm/drop_caches

Then the memory usage drop to 11.4 GB and kept like that while I maintain sustained the load.

That’s more normal taking in count that we have 16 Volumes shared and one host is attempting to connect to Volumes that do not exist any more like crazy, Services and Cronjobs run in background and we conduct tests degrading the pool, removing drives, etc..

After tests concluded memory dropped to 2 GB, which is what we use when we’re not under load.

Note: In order to know about the memory being used by Kernel slab cache in real time you can use command:


You can also check:

sudo vmstat -m

Solving a persistent MD Array problem in RHEL7.4 (Dual Port SAS drives)

Ok, so I lend one of my Servers to two of my colleagues in The States, that required to prepare some test for a customer. I always try to be nice and to stimulate sales across the organizations I help, so if they need a Server for a PoC and demo to a customer, they know they can count on me.

It is important to remark that the Servers I was using had two motherboards, with their CPU and RAM, and Dual Port SAS drives. We had those Servers so we can implement High Availability. The Dual Port SAS allow two different computers or IO controllers to access the same drive at the same time.

I work with Declustered RAID, DRAID, and ZFS.

The Server was a 4U90, so a 4U Server with 90 SAS3 spinning drives and 4 SSD. Drives are Dual Ported, and two Controllers (motherboard + CPU + RAM) have access simultaneously to the drives for HA.

After their tests my colleagues, returned me the Server, and I needed to use it and my surprise was when I tried to provision with ZFS and I encountered problems. Not much in the logs. Please note I was using only one node (or controller), and the other was not in use but they ask me to keep the OS and the data (in 2xMD drive). I shutdown the node A after the Engineers in San Jose powered the Server off, so only my node was working.

I checked:

cat /proc/mdstat

And that was the thing 8 MD Arrays where there.

[root@4u90-B ~]# cat /proc/mdstat 
Personalities : 
md2 : inactive sdba1[9](S) sdag1[7](S) sdaf1[3](S)
11720629248 blocks super 1.2

md1 : inactive sdax1[7](S) sdad1[5](S) sdac1[1](S) sdae1[9](S)
12056071168 blocks super 1.2

md0 : inactive sdat1[1](S) sdav1[9](S) sdau1[5](S) sdab1[7](S) sdaa1[3](S)
19534382080 blocks super 1.2

md4 : inactive sdbf1[9](S) sdbe1[5](S) sdbd1[1](S) sdal1[7](S) sdak1[3](S)
19534382080 blocks super 1.2

md5 : inactive sdam1[1](S) sdan1[5](S) sdao1[9](S)
11720629248 blocks super 1.2

md8 : inactive sdcq1[7](S) sdz1[2](S)
7813752832 blocks super 1.2

md7 : inactive sdbm1[7](S) sdar1[1](S) sdy1[9](S) sdx1[5](S)
15627505664 blocks super 1.2

md3 : inactive sdaj1[9](S) sdai1[5](S) sdah1[1](S)
11720629248 blocks super 1.2

md6 : inactive sdaq1[7](S) sdap1[3](S) sdr1[8](S) sdp1[0](S)
15627505664 blocks super 1.2

Ok. So I stop the Arrays

mdadm --stop /dev/md127

And then I zero the superblock:

mdadm --zero-superblock /dev/sdb1

After doing this for all I try to provision and… surprise! does not work. /dev/md127 has respawned like in the old times from Doom video game.

I check the mdmonitor service and even disable it.

systemctl disable mdmonitor

I repeat the process.

And /dev/md127 appears again, using another device.

At this point, just in case, I check the other controller, which should be powered off.

Ok, it was on. With different Ip, so it was not answering to ping, but I still had access to BMC//IPMI. After confirming with my colleagues that I can shutdown that node (they did not turn it on apparently) I launch the poweroff command, and repeat, same!.

I see that the poweroff command on the second Controller is doing a reboot, not poweroff. Is a Firmware issue I find. So I access to the Linux from the management tool and I launch the halt command that makes it not respond to the ping anymore.

I repeat the process, and still the ghost md array appears there, and blocks me from doing my zpool create.

The /etc/mdadm.conf file did not exist (by default is not created).

I try a more aggressive approach:

DRIVES=`cat /proc/partitions | grep 3907018584 | awk '{ print $4; }'`

for DRIVE in $DRIVES; do echo "Trying /dev/${DRIVE}1"; mdadm --examine /dev/${DRIVE}1; done

Ok. And destruction time:

for DRIVE in $DRIVES; do echo "Trying /dev/${DRIVE}"; wipefs -a -f /dev/${DRIVE}; done

for DRIVE in $DRIVES; do echo "Trying /dev/${DRIVE}1"; mdadm --zero-superblock /dev/${DRIVE}1; done

Apparently the system is clean, but still I cannot provision, and /dev/md127 respawns and reappears all the time.

After googling and not finding anything about this problem, and my colleagues no having clue about what is causing this, I just proceed with a simple solution, as I need the Server for my company completing the tests in the next 24 hours.

So I create the file /etc/mdadm.conf with this content:

[root@draid-08 ~]# cat /etc/mdadm.conf 
AUTO -all

After that I rebooted the Server and I saw the infamous /dev/md127 is not there and I’m able to provision.

I share the solution as it may help other people.

The most straightforward procedure would had been reinstalling clean the OS, but this operation is very slow when simulating a Virtual CD remotely, so it was worth fixing that as OS level, as I save one day delaying my work.