Category Archives: Operations

Showing the exploitation of Copy Fail (CVE-2026-31431) on a freshly launched Ubuntu 24.04 in Google Cloud, and how to fix it

Here I show how I launched a fresh Ubuntu 24.04 instance in Google Cloud on 2026-05-04 and demonstrate the Copy Fail privilege escalation exploit (CVE-2026-31431), which allows a regular user account to become root on almost any Linux released since 2017.

The exploit consists of running a Python 3 script that is only 732 bytes long.

I show how I fixed it by upgrading the kernel and rebooting.

Here you can see the original tweet I saw: https://x.com/DarkWebInformer/status/2049579219190165658?s=20

And access the code: https://github.com/theori-io/copy-fail-CVE-2026-31431

I also tried it on a freshly deployed Ubuntu 26.04 LTS, and it was not affected by the exploit.

Resizing the disk of your Ubuntu Server in Google Cloud GCP without rebooting

If you are running your instances in Google Cloud Compute Engine and you want to increase the size of a disk without rebooting, this video explains step by step how to do it.

Go to Disks in GCP, select the disk of the instance you want to increase, then press Edit.

After you increase the disk size in the Google Cloud Dashboard, ssh to your instance.

There type:

lsblk

in order to list the devices.

In my case the device is sda and I want to grow partition 1.

So I proceed with:

sudo growpart /dev/sda 1

Growing from 30GB to 40GB, it produces this output:

CHANGED: partition=1 start=2324480 old: size=60590047 end=62914526 new: size=81561567 end=83886046
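The sizes in that output are expressed in sectors, not bytes. As a quick sanity check, here is a minimal Python sketch that converts them to GiB. It assumes 512-byte logical sectors (an assumption, you can check yours with lsblk -o NAME,LOG-SEC), and the function name sectors_to_gib is just a name I made up for the example.

```python
# Assumption: 512-byte logical sectors, which is common but not guaranteed.
SECTOR_BYTES = 512


def sectors_to_gib(i_sectors, i_sector_bytes=SECTOR_BYTES):
    """Convert a sector count into GiB (1 GiB = 2**30 bytes)."""
    return i_sectors * i_sector_bytes / 2**30


if __name__ == "__main__":
    i_old = 60590047  # "old: size" from the growpart output
    i_new = 81561567  # "new: size" from the growpart output

    print(f"Old partition size: {sectors_to_gib(i_old):.2f} GiB")
    print(f"New partition size: {sectors_to_gib(i_new):.2f} GiB")
    print(f"Space added: {sectors_to_gib(i_new - i_old):.2f} GiB")
```

The difference between both sizes is 20971520 sectors, which under that assumption is exactly 10 GiB, matching the growth from 30GB to 40GB. The partition itself is a bit smaller than the disk because it starts at sector 2324480.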

If you type lsblk again you’ll see the new size.

But if you type df -h you’ll see that Linux still doesn’t see the space.

To finalize and claim the additional space, execute (in my case the partition is sda1):

sudo resize2fs /dev/sda1

Validate IP Addresses and Networks with CIDR in Python

Python has a built-in package named ipaddress.

You don’t need to install anything to use it.

This simple code shows how to use it:

import ipaddress


def check_ip(s_ip_or_net):
    """Return True if the string is a valid IP Address or a valid Network in CIDR notation."""
    b_valid = True
    try:
        # ip_address() rejects anything containing "/", even a /32,
        # so strings in CIDR notation are checked as a Network instead.
        if "/" in s_ip_or_net:
            ipaddress.ip_network(s_ip_or_net)
        else:
            ipaddress.ip_address(s_ip_or_net)

    except ValueError:
        b_valid = False

    return b_valid


if __name__ == "__main__":
    a_ips = ["127.0.0.2.4",
             "127.0.0.0",
             "192.168.0.0",
             "192.168.0.1",
             "192.168.0.1 ",
             "192.168.0. 1",
             "192.168.0.1/32",
             "192.168.0.1 /32",
             "192.168.0.0/32",
             "192.0.2.0/255.255.255.0",
             "0.0.0.0/31",
             "0.0.0.0/32",
             "0.0.0.0/33",
             "1.2.3.4",
             "1.2.3.4/24",
             "1.2.3.0/24"]

    for s_ip in a_ips:
        b_success = check_ip(s_ip)
        if b_success is True:
            print(f"The IP Address or Network {s_ip} is valid")
        else:
            print(f"The IP Address or Network {s_ip} is not valid")

And the output is like this:

The IP Address or Network 127.0.0.2.4 is not valid
The IP Address or Network 127.0.0.0 is valid
The IP Address or Network 192.168.0.0 is valid
The IP Address or Network 192.168.0.1 is valid
The IP Address or Network 192.168.0.1  is not valid
The IP Address or Network 192.168.0. 1 is not valid
The IP Address or Network 192.168.0.1/32 is valid
The IP Address or Network 192.168.0.1 /32 is not valid
The IP Address or Network 192.168.0.0/32 is valid
The IP Address or Network 192.0.2.0/255.255.255.0 is valid
The IP Address or Network 0.0.0.0/31 is valid
The IP Address or Network 0.0.0.0/32 is valid
The IP Address or Network 0.0.0.0/33 is not valid
The IP Address or Network 1.2.3.4 is valid
The IP Address or Network 1.2.3.4/24 is not valid
The IP Address or Network 1.2.3.0/24 is valid

As you can read in the code comments, ipaddress.ip_address() will not validate an IP Address with the CIDR notation, even if it’s /32.

You should strip the /32 or use ipaddress.ip_network() instead.

As you can see, 1.2.3.4/24 is returned as not valid, because the address has host bits set for a /24 network.

You can pass the parameter strict=False and it will be accepted as valid:

ipaddress.ip_network(s_ip_or_net, strict=False)
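As a small illustration of what strict=False does, this sketch shows that the host bits are masked out and you get the containing network back. The helper name normalize_network is made up for the example. If you need to keep both the host address and its network, the same package also offers ipaddress.ip_interface().

```python
import ipaddress


def normalize_network(s_net):
    """Return the network that contains s_net, masking out any host bits."""
    return ipaddress.ip_network(s_net, strict=False)


if __name__ == "__main__":
    # With strict=False the host bits are dropped instead of raising ValueError
    o_net = normalize_network("1.2.3.4/24")
    print(f"Network: {o_net}")

    # ip_interface() keeps the host address together with its network
    o_interface = ipaddress.ip_interface("1.2.3.4/24")
    print(f"Address: {o_interface.ip} Network: {o_interface.network}")
```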

Creating a RabbitMQ Docker Container accessed with Python and pika

In this video, which I streamed on Twitch, I demonstrate the code shown here.

I launch the Docker Container and operate it a bit, so you can learn a few tricks.

I created the RabbitMQ Docker installation based on the official RabbitMQ installation instructions for Ubuntu/Debian:

https://www.rabbitmq.com/install-debian.html#apt-cloudsmith

One interesting aspect is that I cover how the messages are delivered as a byte sequence. I show this by sending Unicode characters.

Files in the project

Dockerfile

FROM ubuntu:20.04

LABEL maintainer="Carles Mateo"

ARG DEBIAN_FRONTEND=noninteractive

# This makes sure output is printed to the screen when running in detached mode
ENV PYTHONUNBUFFERED=1

ARG PATH_RABBIT_INSTALL=/tmp/rabbit_install/

ARG PATH_RABBIT_APP_PYTHON=/opt/rabbit_python/

RUN mkdir $PATH_RABBIT_INSTALL

COPY cloudsmith.sh $PATH_RABBIT_INSTALL

RUN chmod +x ${PATH_RABBIT_INSTALL}cloudsmith.sh

RUN apt-get update -y && apt-get install -y sudo python3 python3-pip mc htop less strace zip gzip lynx && apt-get clean

RUN ${PATH_RABBIT_INSTALL}cloudsmith.sh

RUN service rabbitmq-server start

RUN mkdir $PATH_RABBIT_APP_PYTHON

COPY requirements.txt $PATH_RABBIT_APP_PYTHON

WORKDIR $PATH_RABBIT_APP_PYTHON

RUN pwd

RUN pip install -r requirements.txt

COPY *.py $PATH_RABBIT_APP_PYTHON

COPY loop_send_get_messages.sh $PATH_RABBIT_APP_PYTHON

RUN chmod +x loop_send_get_messages.sh

CMD ./loop_send_get_messages.sh

cloudsmith.sh

#!/usr/bin/sh
# From: https://www.rabbitmq.com/install-debian.html#apt-cloudsmith

sudo apt-get update -y && sudo apt-get install curl gnupg apt-transport-https -y

## Team RabbitMQ's main signing key
curl -1sLf "https://keys.openpgp.org/vks/v1/by-fingerprint/0A9AF2115F4687BD29803A206B73A36E6026DFCA" | sudo gpg --dearmor | sudo tee /usr/share/keyrings/com.rabbitmq.team.gpg > /dev/null
## Cloudsmith: modern Erlang repository
curl -1sLf https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-erlang/gpg.E495BB49CC4BBE5B.key | sudo gpg --dearmor | sudo tee /usr/share/keyrings/io.cloudsmith.rabbitmq.E495BB49CC4BBE5B.gpg > /dev/null
## Cloudsmith: RabbitMQ repository
curl -1sLf https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-server/gpg.9F4587F226208342.key | sudo gpg --dearmor | sudo tee /usr/share/keyrings/io.cloudsmith.rabbitmq.9F4587F226208342.gpg > /dev/null

## Add apt repositories maintained by Team RabbitMQ
sudo tee /etc/apt/sources.list.d/rabbitmq.list <<EOF
## Provides modern Erlang/OTP releases
##
deb [signed-by=/usr/share/keyrings/io.cloudsmith.rabbitmq.E495BB49CC4BBE5B.gpg] https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-erlang/deb/ubuntu bionic main
deb-src [signed-by=/usr/share/keyrings/io.cloudsmith.rabbitmq.E495BB49CC4BBE5B.gpg] https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-erlang/deb/ubuntu bionic main

## Provides RabbitMQ
##
deb [signed-by=/usr/share/keyrings/io.cloudsmith.rabbitmq.9F4587F226208342.gpg] https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-server/deb/ubuntu bionic main
deb-src [signed-by=/usr/share/keyrings/io.cloudsmith.rabbitmq.9F4587F226208342.gpg] https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-server/deb/ubuntu bionic main
EOF

## Update package indices
sudo apt-get update -y

## Install Erlang packages
sudo apt-get install -y erlang-base \
                        erlang-asn1 erlang-crypto erlang-eldap erlang-ftp erlang-inets \
                        erlang-mnesia erlang-os-mon erlang-parsetools erlang-public-key \
                        erlang-runtime-tools erlang-snmp erlang-ssl \
                        erlang-syntax-tools erlang-tftp erlang-tools erlang-xmerl

## Install rabbitmq-server and its dependencies
sudo apt-get install rabbitmq-server -y --fix-missing

build_docker.sh

#!/bin/bash

s_DOCKER_IMAGE_NAME="rabbitmq"

echo "We will build the Docker Image and name it: ${s_DOCKER_IMAGE_NAME}"
echo "After, we will be able to run a Docker Container based on it."

# docker rm removes a Container (not an Image), so the message says so
printf "Removing old container %s\n" "${s_DOCKER_IMAGE_NAME}"
sudo docker rm "${s_DOCKER_IMAGE_NAME}"

printf "Creating Docker Image %s\n" "${s_DOCKER_IMAGE_NAME}"
sudo docker build -t "${s_DOCKER_IMAGE_NAME}" . --no-cache

i_EXIT_CODE=$?
if [ "${i_EXIT_CODE}" -ne 0 ]; then
    printf "Error. Exit code %s\n" "${i_EXIT_CODE}"
    # Propagate the failure to the caller instead of exiting with 0
    exit "${i_EXIT_CODE}"
fi

echo "Ready to run ${s_DOCKER_IMAGE_NAME} Docker Container"
echo "To run it type: sudo docker run -it --name ${s_DOCKER_IMAGE_NAME} ${s_DOCKER_IMAGE_NAME}"
echo "or just use run_in_docker.sh"

requirements.txt

pika

loop_send_get_messages.sh

#!/bin/bash

echo "Starting RabbitMQ"
service rabbitmq-server start

echo "Launching consumer in background which will be listening and executing the callback function"
python3 rabbitmq_getfrom.py &

while true; do

    i_MESSAGES=$(( RANDOM % 10 ))

    echo "Sending $i_MESSAGES messages"
    for i_MESSAGE in $(seq 1 $i_MESSAGES); do
        python3 rabbitmq_sendto.py
    done

    echo "Sleeping 5 seconds"
    sleep 5

done

echo "Exiting loop"

rabbitmq_sendto.py

#!/usr/bin/env python3
import pika
import time

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))

channel = connection.channel()

channel.queue_declare(queue="hello")

s_now = str(time.time())

s_message = "Hello World! " + s_now + " Testing Unicode: çÇ àá😀"
channel.basic_publish(exchange="", routing_key="hello", body=s_message)
print(" [x] Sent '" + s_message + "'")
connection.close()

rabbitmq_getfrom.py

#!/usr/bin/env python3
import pika


def callback(ch, method, properties, body):
    # print(f" [x] Received in channel: {ch} method: {method} properties: {properties} body: {body}")
    print(f" [x] Received body: {body}")


connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))

channel = connection.channel()

channel.queue_declare(queue="hello")

print(" [*] Waiting for messages. To exit press Ctrl+C")

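# Register the callback. auto_ack=True acknowledges each message on delivery,
# so messages are not left unacknowledged in the queue.
channel.basic_consume(queue="hello", on_message_callback=callback, auto_ack=True)
# This will loop until interrupted
channel.start_consuming()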

print("Finishing consumer")
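The consumer above prints body exactly as it is received, which is a byte sequence and not a string. The following minimal sketch, which does not need RabbitMQ at all, illustrates the round trip under the assumption that the text travels encoded as UTF-8. The helper names to_wire and from_wire are made up for the example.

```python
def to_wire(s_message):
    """Encode text as the UTF-8 bytes that would travel in the message body."""
    return s_message.encode("utf-8")


def from_wire(b_body):
    """Decode a UTF-8 message body back into text."""
    return b_body.decode("utf-8")


if __name__ == "__main__":
    s_message = "Testing Unicode: çÇ àá😀"
    b_body = to_wire(s_message)

    # Non-ASCII characters take more than one byte each in UTF-8
    print(f"Characters: {len(s_message)} Bytes: {len(b_body)}")
    print(f"Raw body: {b_body}")
    print(f"Decoded body: {from_wire(b_body)}")
```

In the callback you could call body.decode("utf-8") in the same way if you prefer to print readable text instead of the raw bytes.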

News from the Blog 2022-06-22

For the first part of June I’ve been quiet on Social Media, as I was on holiday and undergoing some scheduled health tests at the hospital.

Carles in the Media/Press/Streaming

Twitch

I started streaming live Python coding sessions on Twitch. I’m giving it a try to see if coders engage with it.

The Software I use to broadcast from Linux is OBS.

I started with my Open Source project ctop.

I had a very long and interesting session on 2022-06-06 about OpenZFS, Data Centers, NVMe, iSCSI, Hard Drives, Storage and performance.

More funny things happened, like when I was installing a VirtualBox VM live and the ZFS pool became unresponsive due to hardware errors in one SATA spinning drive.

Things from broadcasting live…

Some of the feedback I got from talented Engineers is that, even if the original topic was interesting, seeing everything fall apart live due to unexpected hardware problems, and me troubleshooting it live, was the best part of the show… which I found very amusing.

RAB Radio the new digital world

I keep doing my radio space for Radio America Barcelona, once per week, addressed to the Catalan Community across the world and expats.

This radio program, streamed also via Twitch, is available in Catalan language only. RAB.

Open Source

carleslibs

I’ve been working on the version 1.0.8 branch, and after a refactoring session on Twitch where I found a bug in the MenuUtils class, I fixed it and released v. 1.0.8. You can see the video on the link.

Now I’m working on the branch v. 1.0.9.

ctop

I’ve been working on the branch 0.8.9.

My first Twitch broadcast was about adding Unit Testing to the MemUtils class.

You can see all my videos:

http://www.youtube.com/channel/UCYzY-2wJ9W_ooR64-QzEdJg

Infrastructure

OpenStack

I recommend the videos on this page about Operating OpenStack at Scale.

Some of my Blizzard colleagues speak in them.

https://superuser.openstack.org/articles/upgrades-in-large-scale-openstack-infrastructure-openinfra-live-episode-6/

https://www.openstack.org/videos/summits/denver-2019/how-blizzard-entertainment-uses-autoscaling-with-overwatch

My last physical server in a Data Center

This week I decommissioned my last physical server in a Data Center.

It has been a long journey since I created my company to launch my own projects and started having my own infrastructure, back in 2000.

I was offering VPS at that time, with VMware as Hypervisor.

This last Rack Server served me well for 21 years.

Now everything is Cloud, and it is not viable to host and maintain servers unless this is your main occupation. Server motherboards die, hard drives die, and they need to be replaced. Maintaining infrastructure is a full-time job and you need somebody to do it. Also, using only fixed servers locks up a lot of money and prevents you from moving fast and from spawning more compute capacity.

If you are curious, this Rack Server is a Supermicro with an Intel Xeon processor and SCSI drives.

Security

Firewall

I keep blocking thousands of IP Addresses every day.

When I see a pattern of an IP attempting attacks against the Server, I look up the IP, and if it belongs to a hosting provider I just block the entire range.

I keep blocking any IP Address coming from Russia or Belarus since they invaded Ukraine.

My Health

I visited the hospital for a scheduled follow-up on my health.

The test results are super good, and it’s super clear that I’ve improved radically. My discipline with the diet, taking the medicines and exercising regularly has been crucial.

My Doctor is confident that I’ll make a full recovery, but to do so I need to lose a lot of weight in a year or two.

So, I need to focus on my health and on exercising, being happy and avoiding any kind of negative stress.

The cost of the travel and the medicines has put some stress on my finances, but I’m fortunate that I can handle it.

Entertainment / Life / Reflections

Star Wars and racism

I’m really enjoying the new Star Wars series Obi-Wan Kenobi, and I’ve been profoundly shocked to read that there are fans being racist against the black characters.

https://www.theverge.com/2022/5/31/23148468/star-wars-obi-wan-moses-ingram-third-sister

So I’m just writing here to show my support for human beings of all races, genders (including transgender), LGBT+, conditions and preferences.

Twitch Stream about ZFS, zpool scrubbing, Hard drives, Data Centers, NVMe, Rack Servers…

Twitch stream on 2022-06-06 10:50 IST

In this very long session we went through actual errors in a ZFS pool, checked the Kernel, removed and reinserted the drive, and ran a zpool scrub… In the meantime I talked about Racks, Rack Servers, PSUs, redundant components, ECC RAM…

Renewing an SSL Certificate for Apache2 in Ubuntu 20.04

First you have to generate new CSR and key files.

It is not recommended to reuse your old CSR file.

openssl req -new -newkey rsa:2048 -nodes -keyout blog_carlesmateo_com_2022.key -out blog_carlesmateo_com_2022.csr

As you can see I used the name of the domain and the year for the new files to be generated to easily distinguish them.

When you’re asked for the challenge password, in the additional fields, keep that password safe in case you need the Cert to be reissued to you.

You’ll need to submit the CSR file to your SSL provider. They will send you back the CRT and CA-BUNDLE files.

Edit your Apache config file for the SSL site.

For example:

/etc/apache2/sites-enabled/11-https-blog-carlesmateo-com.conf

Your conf file will look similar to this:

<VirtualHost *:443>
    ServerAdmin webmaster@yourdomain.cat

    DocumentRoot /opt/sites/www/blog.carlesmateo.com
    ServerName blog.carlesmateo.com
    SSLEngine on
    SSLCertificateFile /opt/sites/certs/2022/blog_carlesmateo_com_2022.crt
    SSLCertificateKeyFile /opt/sites/certs/2022/blog_carlesmateo_com_2022.key
    SSLCertificateChainFile /opt/sites/certs/2022/blog_carlesmateo_com_2022.ca-bundle
...

Before restarting Apache2, test the configuration for syntax errors with:

apache2ctl -t

If all is good, restart your Web Server with:

service apache2 restart

With a browser, verify that the certificate information for the domain is right. I recommend checking in at least Firefox and Chrome.
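In addition to the browser check, the expiry date of the certificate being served can be read from a script. This is only a minimal sketch using the Python standard library (ssl and socket). The function names, the port and the timeout are choices I made for the example, and it assumes the site is reachable and presents a certificate that validates against the system CA store.

```python
import socket
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until(s_not_after, f_now):
    """Whole days between f_now (epoch seconds) and a certificate 'notAfter' string."""
    i_expiry = ssl.cert_time_to_seconds(s_not_after)
    return int((i_expiry - f_now) // SECONDS_PER_DAY)


def fetch_not_after(s_host, i_port=443, f_timeout=10.0):
    """Connect over TLS and return the 'notAfter' field of the served certificate."""
    o_context = ssl.create_default_context()
    with socket.create_connection((s_host, i_port), timeout=f_timeout) as o_sock:
        with o_context.wrap_socket(o_sock, server_hostname=s_host) as o_tls:
            return o_tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    s_host = "blog.carlesmateo.com"
    s_not_after = fetch_not_after(s_host)
    i_days = days_until(s_not_after, time.time())
    print(f"The certificate for {s_host} expires on {s_not_after} ({i_days} days left)")
```

If the renewed certificate is in place, the date printed should be the new expiry date and not the old one.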

News from the Blog 2021-11-11

New Articles

How to communicate with your Python program running inside a Docker Container, using Linux Signals

Hope you’ll have fun reading this article:

Communicating with Docker Containers via Linux Signals and Python

I migrated my last services from Amazon and the blog to Google Compute Engine (GCE / GCP)

I wrote a Postmortem analysis about the process of migrating my last services from my 11-year-old Amazon account.

Updates

Updates to articles

I updated the article about weird Python things that you may not know, adding the Ellipsis (...).

I’ve been working on some Cassandra examples. I may publish an article soon about using it from Python and Docker.

Updates to My Books

I updated my Python and Docker books.

I’m currently writing a book about using Amazon AWS Python SDK (boto3).

Updates to Open Source projects

I have updated ctop, fixed two bugs and increased Code Coverage.

I made a new tag and released the last Stable Version:

https://gitlab.com/carles.mateo/ctop/-/tags/0.8.7

On top of my local Unit Testing, I have Jenkins checking that I don’t commit anything that breaks the Tests.

Some time ago I wrote some articles about how you can set up Jenkins in a Docker Container.

Miscellaneous

Charity

I’ve donated to Wikipedia.

Only 2% of the viewers donate, so I answered the call every time it was made.

This is my 5th donation to Wikimedia.

I consider that Freedom is very important.

I bought these new books

One of my secrets to staying on top is that I’m always studying.

I study all the time, at work and in my free time.

I use Linux Academy and I buy paper books. I don’t connect with reading on tablets. I think information is retained better when read on paper. I also use a marker and page markers to keep direct access to the most interesting points in the books.

And I study all kinds of topics. Obviously I know a lot about Web Scraping, but there is always room for learning more. And whatever new thing I learn helps me to be better with my students and clearer when writing my books.

I’ve never been a Front End developer, but I’ve been able to fix bugs in the Front End engines of the companies I worked for, like Privalia. I was handed a bug that prevented Internet Explorer users from buying, just one hour before we launched a massive campaign. I debugged it and found an input named “value”, so the html looked like <input name="value" value="">. In less than 30 minutes I proved to the incredulous Head of Development and the CTO that a bug in Internet Explorer was causing a conflict when fetching the value from the input named value. We deployed the update to Production and the campaign was a total success. So I consider knowing JavaScript and Front End also a need, even if I don’t work directly with them. I want to be able to understand all the requirements, possibilities and weaknesses, so I can fix bugs and save the day. That allowed me to fix scalability problems in Node.js and PhantomJS projects too. (They are event-driven, Server Side JavaScript projects.)

It seems that Amazon.co.uk works well again for Ireland. My last two orders arrived on time and I had no problems with border taxes, apparently.

Nice Python article

I enjoyed this article a lot, because it explains part of what I did with my student and friend Albert, in a project that analyzes the Apache access logs for patterns of exploit attempts, then feeds a database, and then blocks the offending Ip Addresses in the Firewall.

The article only covers the Pandas part, reading the access.log file and working with it, but it is a very well written article:

https://mmas.github.io/read-apache-access-log-pandas
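To give an idea of the first step of a project like that, here is a minimal sketch that uses only the Python standard library, not Pandas and not the code of the real project. It assumes the Apache common/combined log format, and the list of suspicious paths, the threshold and the function names are hypothetical values chosen for the example.

```python
import re
from collections import Counter

# Assumption: Apache common/combined log format
o_LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" (?P<status>\d{3})')

# Hypothetical patterns, adjust them to what you see in your own logs
a_SUSPICIOUS = ("/wp-login.php", "/.env", "/phpmyadmin")


def count_suspicious(a_lines):
    """Count, per Ip Address, the requests that contain a suspicious pattern."""
    o_counter = Counter()
    for s_line in a_lines:
        o_match = o_LOG_LINE.match(s_line)
        if o_match is None:
            # Skip lines that do not follow the expected format
            continue
        s_request = o_match.group("request").lower()
        if any(s_pattern in s_request for s_pattern in a_SUSPICIOUS):
            o_counter[o_match.group("ip")] += 1
    return o_counter


def get_offenders(o_counter, i_threshold=2):
    """Return the sorted Ip Addresses with at least i_threshold suspicious requests."""
    return sorted(s_ip for s_ip, i_hits in o_counter.items() if i_hits >= i_threshold)


if __name__ == "__main__":
    with open("access.log", encoding="utf-8", errors="replace") as o_file:
        for s_ip in get_offenders(count_suspicious(o_file)):
            print(f"Candidate to block: {s_ip}")
```

The resulting list is only a list of candidates. Feeding the database and blocking in the Firewall are separate steps that are not covered here.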

Nice Virtual Volumes article from VMware

I prefer Open Source, but there are very good commercial products too.

I liked this article about Virtual Volumes from VMWare:

Understanding Virtual Volumes (vVols) in VMware vSphere 6.7/7.0 (2113013)

https://kb.vmware.com/s/article/2113013

Thanks Blizzard (again)

There is a very nice initiative where we can nominate 4 colleagues a year who we think deserve recognition.

My colleagues voted for me, so I received a gift voucher that I can spend in Irish stores like Ikea, Pc World, Argos, Adidas, App Store & iTunes…

So thanks a million buds. :)

Migrating the services from my 11-year-old Amazon AWS account (Postmortem Analysis)

I previously explained that I was migrating some services from Amazon, that some of my sites were under Maintenance, and that I would provide more information.

Here is the complete story of why I migrated all the services from my 11-year-old Amazon account to other CSPs.

Some lessons can be learned from my adventure.

I migrated my last services from Amazon to GCP

Amazon sent me an email on October 6th, 2021, telling me that they would disable EC2-Classic by August 2022. I thought I would not be able to keep my Static Ips, as in the past VPC Ips and EC2-Classic Ips were not transferable, so, considering that I would lose my Static Ips anyway, I started to migrate some services to other providers like Digital Ocean.

It is not cool to lose Static Ip Addresses (Elastic Ip in AWS), as this is bad for SEO, so given that I thought I would lose the Static Ips that had been with me for years, I started to migrate certain services to much cheaper providers.

Amazon is terrible at communicating, and I talked with some product managers about that in the past, when they lost one of my Volumes and the email was so cold and terrible that it actually hurt more than Amazon losing my Data. I believed it was a poorly made Scam, and when I realized it was true I reached out to one of my friends, who is a manager there, as I know they care about doing things right, and he organized a meeting with two PMs so I could pass on my feedback.

The Cloud providers are changing things very fast, and nobody is able to be up to date with the changes, unless their work position allows plenty of time to get updated. Even if pages of documentation are provided, you have to react to an event that they externally generated forcing you to action. Action to read all the documentation about EC2-Classic migrations, action to prepare to have migrated by August 2022.

So August 2022… I was counting on having plenty of time, but I’m writing a new book about using the Amazon SDK for Python, boto3, and some of the API calls I was making started to fail in a very unusual way: Exceptions with timeouts, but only for the one region where I had EC2-Classic.

urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPSConnection object at 0x7f0347d545e0>: Failed to establish a new connection: [Errno -2] Name or service not known

My config was:

from botocore.config import Config

o_config = Config(
    region_name="us-east-1a",
    signature_version="v4",
    retries={
        "max_attempts": 10,
        "mode": "standard"
    }
)

But if I switched to another region name, it would work:

            region_name='us-west-2',

I made a mistake here: the region name is “us-east-1” and not “us-east-1a”. “us-east-1a” is the availability zone. The SDK uses the region name as part of the hostname of the endpoint it connects to, so the calls were failing because that endpoint doesn’t exist.

I never understood why a company like Amazon is unable to ship the SDK with one or more sample projects that work 100%, with the source code, so people have a working base to build on.

Every API that I have created, I have provided with documentation and also with examples of how to use it in several languages.

In 2013 I was CTO of an online travel agency; we had meta-searchers consuming our API and we were handling several hundred thousand requests per second. Everything was perfectly documented, examples were provided for several languages, the document and the SDK had version numbers…

Everybody forgets about Developers, and companies throw terrible, cold products that are so difficult to use at the poor Developers. How many Developers would like to say: Listen, Mr. President of the big Cloud Company XXXX, I only want to spawn a VM that works, and fast, with easy wizards. I don’t want to spend 50 hours learning before being able to use your overpriced platform, doing 20 things first because your APIs are a reflection of your infrastructure and are based on Microservices. Modern JavaScript frameworks can create nice, gentle wizards even if you have supercold APIs.

Honestly, I didn’t realize my typo in the region, so I connected to the Amazon Console to investigate and I saw this.

Honestly, when I read it I understood that they were going to end my EC2 Networking on the 30th of October. It was the 29th. I misunderstood.

It was my fault for not reading it well to the end. I was shocked by the first part about the shutdown and I didn’t fully understand that they were going to shut down EC2-Classic only for the zones where I didn’t have anything running.

From the long errors (3 exceptions chained) I didn’t realize that the endpoint is built with the region name. (And I was passing the availability zone)

botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://ec2.us-east-1a.amazonaws.com/"

Here is where I say that a good SDM would have thought about and cared for the Developers more, and would have made the SDK check whether that region exists. How difficult is it to create an SDK a bit more clever that detects an invalid region id? It is not difficult.
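As an illustration of how cheap such a check can be, here is a minimal sketch in plain Python. It is not part of boto3; the function name and the regular expressions are assumptions I made, based only on the usual shape of the names (a region like us-east-1, an availability zone like us-east-1a). A more complete check would compare against the real list of regions, which I believe the SDK can provide through get_available_regions() on a boto3 Session.

```python
import re

# Assumption: regions look like "us-east-1" and availability zones add one letter
o_REGION = re.compile(r"^[a-z]{2}(?:-[a-z]+)+-\d+$")
o_ZONE = re.compile(r"^([a-z]{2}(?:-[a-z]+)+-\d+)[a-z]$")


def check_region_name(s_region):
    """Return s_region if it looks like a region name, otherwise raise ValueError."""
    if o_REGION.match(s_region):
        return s_region

    o_match = o_ZONE.match(s_region)
    if o_match:
        raise ValueError(f"'{s_region}' looks like an availability zone. "
                         f"Did you mean the region '{o_match.group(1)}'?")

    raise ValueError(f"'{s_region}' does not look like a region name")


if __name__ == "__main__":
    for s_name in ("us-west-2", "us-east-1a"):
        try:
            print(f"Region {check_region_name(s_name)} looks fine")
        except ValueError as e:
            print(f"Error: {e}")
```

Calling a check like this before building the Config would have pointed at the typo immediately, instead of three chained exceptions about the endpoint.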

It is true that it was late in the evening and I was tired after the whole day. Two days a week, between work and Zoom university classes, I work 15 hours and 13 hours respectively, not counting the assignments, so by the end of the week I am very tired. But that’s why it is very important to follow a methodology and to read carefully. I think Amazon has 50% of the fault for the way they do things: how they created the SDK, how they communicate, and the errors that the console returned when I tried to create a VPC instance from an EC2-Classic AMI (they seem related to the fact that I had old VPC Network objects with a shorter hash than the one they currently use). The other 50% was my fault, for not identifying the source of the error and for not reading the message on their website well.

But the fact that they were having those errors and timeouts in the APIs made me believe they were going to cut the EC2-Classic Networking the next day.

All the mistakes fell together in a perfect storm.

I checked for documentation and I saw it was possible to migrate my Static Ip’s to VPC Static Ip’s.

It was Friday evening, and I cancelled my plans, in order to migrate the Blog to VPC in an attempt to keep running it with Amazon.

As Cloud Architect, I like to have running instances in several CSP as it allows me to stay up to date with the changes they do.

I checked the documentation for the migration. Disassociating the Static Ip (Elastic Ip in AWS jargon) was easy. Converting it to VPC was easy as well.

As I progressed, what should have been easy turned into a nightmare, as I was getting many errors from the Amazon API, without any information, and my Instances were not created.

I figured out that their API could have problems with old VPC objects I had created long ago, so I had to create new objects for several things.

I managed to spawn my instances, but they were being launched and terminated instantly without any information. Frustrating.

When launching a new instance from the AMI (a Snapshot of the blog), I was being shown options to add more volumes that made no sense. My Instance was using 16GB of a 20GB total Space, and I was shown different volume configs depending on the instance: in some cases an additional 20GB volume, in others a small ephemeral SSD and 10 GB for the AMI (which requires at least 16GB).

After some fighting I managed to make it work by deleting the volumes that made no sense and keeping only one of 20GB, the same size as my AMI.

But then my nightmare started: making the VPC Instance have Internet access and be reachable from outside. I had to create a new Internet Gateway, NAT, Network, etc…

As mentioned, the old objects I was trying to reuse were making the process fail.

I was running out of time, and I thought they were going to shut down the EC2-Classic network shortly (as I had not read the message correctly), so I decided to download everything and migrate to another provider. To do that, I first blocked all the traffic, except for my Ip.

I worked in parallel, creating the new config in Google Cloud, just in case I had forgotten something. I had created a document for the migration and it was accurate.

I managed to do everything fast enough. The slowest part was downloading all the Data, as I hold entire VMs for projects like Cassandra Universal Driver.

Then I powered off my Amazon Instance for the Blog forever.

In GCP I blocked all the traffic in the firewall, except for my Ip, so I could work calmly.

When everything was ready, I had to redirect the DNS to the new static Ip from Google.

The DNS provider I used had implemented some changes in their API, so I was getting errors when replacing my old entry ‘.’ (their JSON calls returned Internal Server Error). Finally I figured out how to work around it and I was able to confirm that the first service was up and running.

I did some tests to make sure there were no unexpected permission problems, entries in the logs, etc…

Only then did I open the Google Firewall. I have a second firewall in each instance, where I block or open what I want at iptables level. Basically I block the IPs of abusive bots trying to find exploits or trying to brute-force passwords with dictionaries.

I checked with my phone, without Wi-Fi, that the Firewall was all good. (It is always a good idea to use another external Ip, different from the management one, to check.)
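The same kind of check can be scripted and run from any machine that uses a different external Ip than the management one. This is only a minimal sketch with the Python standard library. The host and the list of ports are hypothetical values for the example, and it only tells you whether a TCP connection can be established, nothing more.

```python
import socket


def is_port_open(s_host, i_port, f_timeout=3.0):
    """Return True if a TCP connection to s_host:i_port can be established."""
    try:
        with socket.create_connection((s_host, i_port), timeout=f_timeout):
            return True
    except OSError:
        # Refused, filtered (timeout) or unreachable
        return False


if __name__ == "__main__":
    # Hypothetical values: the ports you expect open and closed from outside
    s_host = "blog.carlesmateo.com"
    for i_port in (22, 80, 443, 3306):
        s_state = "open" if is_port_open(s_host, i_port) else "closed or filtered"
        print(f"Port {i_port} on {s_host} is {s_state}")
```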

I added a post explaining that I was migrating some of my Services and that they were under maintenance.

I mentioned in the blog that some of my services were being migrated from Amazon to Digital Ocean.

For some reason, one user was lost in the Backup of the Database, so I created it in MySQL with the typical commands:

CREATE USER 'username'@'localhost' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON mydatabase.* TO 'username'@'localhost';
FLUSH PRIVILEGES;

My Sites are under Maintenance

2021-11-08 Update: There is a Postmortem analysis of what happened with Amazon here.

TL;DR: I’m performing Maintenance on all my sites.

The main reason was that I was getting unexpected API Exceptions on the AWS SDK for Python (boto3), so I connected to the AWS Console to get more information.

Then I saw a message indicating that they would stop EC2-Classic today, the 30th of October. (Please read the Update on the Postmortem analysis, as I misunderstood that banner message.)

I had already started migrating my Services: some I moved to other providers like Digital Ocean; others I had planned to keep in Amazon.

EOL (End of Life) was scheduled for August 2022, so when I saw the message from Amazon on the evening of the 29th, I decided to migrate my EC2-Classic Public Ips and Compute to VPC. When I tried to deploy from an AMI, the Amazon APIs returned many internal errors, and only as I figured out where their failures were was I able to get instances launched without them being Terminated immediately without an explanation. I still had many problems with the Internet Gateway, VPC NAT, etc… After hours fighting with their errors and with their console, which is more a bunch of pages to manage Infrastructure than a user/developer-friendly Cloud Tool, I decided that I had had enough.

After 11 years using Amazon AWS, including a trip to Dublin for a hiring process as Manager for CloudWatch, where I gave them the idea to add AutoScaling (I was told the project was too easy for me and that I would get bored in a year or two, so I was not hired), I decided to move my Services to Google Cloud and to Digital Ocean.

I’m very polite, and I saw that when I told one Manager that the User Interface was terrible he didn’t like it, but I have to speak up and say that tools for developers cannot be as cold as your evil girlfriend. They cannot be API-like, standalone pages to manage infinite parts of the Architecture. Websites providing services for developers cannot be created in a cold SysAdmin style. If the infrastructure is hard to manage and internally you use APIs, build nice Wizards in JavaScript. I was leading a Team of Developers with infinitely fewer resources than Amazon or Google, and we wrote a Multi-Cloud product with nice, clever and easy-to-use Wizards that were infinitely better than those of the giant CSPs. We won a prize at European level at that time. But it was 2013.

I’ve migrated everything, moved all the data, statics, VMs… but I’m completing the adjustments for certain services like Cassandra nodes, web sites, bootstrapping some of my sites based on my PHP Catalonia Framework, adding Firewall rules to GCP, making changes for Ansible provisioning, deploying the Server scripts from IaC, Docker, etc…

I’ll be posting updates on Twitter.