In this very long session we went through actual errors in a ZFS pool, we check the Kernel, we remove and reinsert the drive, conduct zpool scrub… in the meantime I talked about Rack, Rack Servers, PSU, redundant components, ECC RAM…
Equitas Health helps thousands of HIV-positive in Ohio, Dayton and Columbus.
Thousands more are reached with our prevention, testing, and other services. We are excited about embracing our expanded mission as a strategic step to further that legacy and its reach by providing care for all – with a focus on a safe and open space and highest quality healthcare for the LGBTQ community and others who are medically underserved.
I did my donation following a post by Terra Field, a former colleague at Blizzard and later leading Netflix’s Trans *ERG, but I didn’t see that she organized a gofund campaign, so I donated again :)
As I saw that there is a lack of clarity in the articles about this theme.
I also provided two alternatives ways, one pure Python3 and the other Bash based (grep awk tr)
Books
The books I publish in LeanPub have two prices, the suggested price, which is the price I consider the right price for the book, and the minimum price, which is the minimum price I authorized a reader can pay to have it.
You can buy it for the minimum price. You know better than anyone your economy.
So when a reader buys one of my books for the suggested price, instead of the minimum price, it’s really showing how they appreciate may work.
So thanks for all the support and appreciation you show!. :)
One of the motives I chose Leanpub platform is because I think is fair. No DRM, no BS. And the reader can ask for a refund within 45 days if they don’t like the book. It also makes very happy seeing that I don’t have any refunds. I appreciate it as a token of the usefulness of my work. Thanks. :)
Updates to Docker Combat File book (v.16 2021-11-24)
I added a nice trick to reverse engineering the original Dockerfile from a running Image.
I also added another typical copy and paste error into the Troubleshoot section.
Automating and Provisioning Amazon AWS (EC2, EBS, S3, CloudWatch) with boto3 (Amazon’s SDK for Python 3) and Python 3 book
I’m writing a book about how to automate your Amazon AWS tasks using Amazon’s AWS Python 3 SDK boto3, provisioning new instances, stopping, starting, creating volumes, creating/deleting buckets in S3, uploading/downloading files from S3…
It is currently 20% completed. With 43 pages it shows EC2 section already.
I’ve working in carleslibs v.1.0.3. I added MenuUtils class, which allows to assemble menus super quickly, that execute the code referenced in the menu array. Ideal for building CLI applications very fast.
I also added KeyboardUtils class, which allows to ask the user for String within certain lengths allowing or not spaces and/or underscores, and ask user for Integer values within a certain min and max, having 0 for go back.
The plan is to release the new version of carleslibs as soon as I’ve tested it properly.
Only 2% of the viewers donate, so I answered the call every time it was made.
This is my 5th donation to Wikimedia.
I consider that Freedom is very important.
I bought these new books
One of my secrets to be on top is that I’m always studying.
I study all the time, at work and in my free time.
I use Linux Academy and I buy books in paper. I don’t connect with reading in tablets. I think information is stored better when read in paper. I use also a marker and pointers to keep a direct access to the most interesting points on the books.
And I study all kind of themes. Obviously I know a lot of Web Scraping, but there is always room for learning more. And whatever new I learn helps me to be better with my students and more clear writing my books.
I’ve never been a Front End, but I’ve been able to fix bugs in the Front End engines from the companies I worked for, like Privalia. I was passed a bug that prevented the Internet Explorer users to buy just one hour before we launching a massive campaign. I debugged and I found a variable named “value” so the html looked like
<inputname="value"value="">
<input name="value" value="">. In less than 30 minutes I proved to the incredulous Head of Development and the CTO that a bug in Internet Explored was causing a conflict when fetching the value from the input named value. We deployed to Production the update and the campaign was a total success. So I consider knowing Javascript and Front also a need, even if I don’t work directly with it. I want to be able to understand all the requirements and possibilities, and weaknesses, so I can fix bugs and save the day. That allowed me to fix scalability problems in Nodejs and Phantomjs projects too. (They are Javascript Server Side, event driven, projects)
It seems that Amazon.co.uk works well again for Ireland. My two last orders arrived on time and I had no problems of border taxes apparently.
Nice Python article
I enjoyed a lot this article, cause explains part of what I did with my student and friend Albert, in a project that analyzes the access logs from Apache for patterns of attempts of exploits, then feeds a database, and then blocks those offender Ip Addresses in the Firewall.
The article only covers the part of Pandas, of reading the access.log file and working with it, but is a very well redacted article:
Here is the complete history of why I migrated all the services from my 11 years old Amazon account to other CSP.
Some lessons can be learned from my adventure.
I migrated my last services from Amazon to GCP
Amazon sent me an email on October 6th, this year 2021, telling me that they will disable EC2-Classic by August 2022. I thought I would not be able to keep my Static Ip’s as in the past VPC Ip’s and EC2-Classic Ip’s were not transferable, so considering that I would loss my Static Ip’s anyway I started to migrate to some to other providers like Digital Ocean.
Is not cool losing Static Ip (Elastic Ip in AWS) Addresses as this is bad for SEO, so given that I though I would lose my Static Ips that have been with me for years, I started to migrate certain services to providers much more economic.
Amazon is terrible communicating, and I talked with some product managers in the past about that, when they lost one of my Volumes, and the email was so cold and terrible that actually that hurt more than Amazon losing my Data. I believed that it was a poorly made Scam and when I realized it was true I reached one of my friends, that is manager there, as I know they care for doing things right, and he organized a meeting with two PM so I can pass my feedback.
The Cloud providers are changing things very fast, and nobody is able to be up to date with the changes, unless their work position allows plenty of time to get updated. Even if pages of documentation are provided, you have to react to an event that they externally generated forcing you to action. Action to read all the documentation about EC2-Classic migrations, action to prepare to have migrated by August 2022.
So August 2022… I was counting that I had plenty of time but I’m writing a new book about using the Amazon SDK for Python, boto3, and I was doing some API calls and they started to fail in a very unusual way, Exceptions with timeout, but only for the only region where I had EC2-Classic.
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPSConnection object at 0x7f0347d545e0>: Failed to establish a new connection: [Errno -2] Name or service not known
But if I switched to another region name, it would work:
Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
region_name='us-west-2',
region_name='us-west-2',
region_name='us-west-2',
I made a mistake in here, the region name is “us-east-1” and not “us-east-1a“. “us-east-1a” is the availability zone. So the SDK was giving a timeout because in order to connect to the endpoint it uses the region name as part of the hostname. So it doesn’t find that endpoint because it doesn’t exist.
I never understood why a company like Amazon is unable to provide the SDK with a sample project or projects 100% working, with the source code so people has a base that works to build up.
Every API that I have created, I have provided it with documentation but also with example for several languages for how to use it.
In 2013 I was CTO of an online travel agency, and we had meta-searchers consuming our API and we were having several hundreds of thousands requests per second. Everything was perfectly documented, examples were provided for several languages, the document and the SDK had version numbers…
Everybody forgets about Developers and companies throw terrible and cold products to the poor Developers, so difficult to use. How many Developers would like to say: Listen Mr. President of the big Cloud Company XXXX, I only want to spawn a VM that works, and fast, with easy wizards. I don’t want to learn 50 hours before being able to use your overpriced platform, by doing 20 things before your Ip’s are reflexes of your infrastructure and based in Microservices. Modern JavaScript frameworks can create nice gently wizards even if you have supercold APIs.
Honestly, I didn’t realize my typo in the region and I connected to the Amazon Console to investigate and I saw this.
Honestly, when I read it I understood that they were going to end my EC2 Networking the 30th of October. It was 29th. I misunderstood.
It was my fault not reading it well to the end, I got shocked by the first part telling about shutdown and I didn’t fully understood as they were going to shutdown EC2-Classic for the zones I didn’t had anything running only.
From the long errors (3 exceptions chained) I didn’t realize that the endpoint is built with the region name. (And I was passing the availability zone)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://ec2.us-east-1a.amazonaws.com/"
Here is when I say that a good SDM would had thought and cared for the Developers more, and would had made the SDK to check if that region exists. How difficult is to create a SDK a bit more clever that detects a invalid region id?. It is not difficult.
It is true that it was late in the evening and I was tired of all the day, and two days of the week between work and zoom university classes I work 15 hours and 13 hours respectively, not counting the assignments, so by the end of the week I am very tired. But that’s why it is very important to follow methodology and to read well. I think Amazon has 50% of the fault by the way they do things: how the created the SDK, how they communicate, and by the errors that the console returned me when I tried to create a VPC instance of an EC2-Classic AMI (they seem related to the fact I had old VPC Network objects with shorter hash than the current they use) and the other 50% was my fault for not identifying the source of the error, and not reading the message in their website well.
But the fact that there were having those errors in the API’s and timeouts made me believe they were going to cut the EC2-Classic Networking the next day.
All the mistakes fall together in a perfect storm.
I checked for documentation and I saw it was possible to migrate my Static Ip’s to VPC Static Ip’s.
It was Friday evening, and I cancelled my plans, in order to migrate the Blog to VPC in an attempt to keep running it with Amazon.
As Cloud Architect, I like to have running instances in several CSP as it allows me to stay up to date with the changes they do.
I checked the documentation for the migration. Disassociating the Static Ip (Elastic Ip in AWS jargon) was easy. Turning into VPC as well.
As I progressed, what had to be easy turned into a nightmare, as I was getting many errors from the Amazon API, without any information, and my Instances were not created.
I figured out that their API could have problems with old VPC objects I created time ago, so I had to create new objects for several things.
I managed to spawn my instances but they were being launch and terminated instantly without information. Frustrating.
When launching a new instance from the AMI (a Snapshot of the blog), I was giving shown options to add more volumes without any sense. My Instance was using 16GB from a 20GB total Space, and I was shown different volume configs, depending on the instance, in some case an additional 20GB volume, in other small SSD, ephemeral and 10 GB for the AMI (which requires at least 16GB).
After some fight I manage to make it work after deleting the volumes that made no sense, and keeping only one of 20GB, the same size of my AMI.
But then my nightmare started to make the VPC Instance to have Internet access and to be seen from outside. I had to create a new Internet Gateway, NAT, Network, etc…
As mentioned the old objects I was trying to reusing were making the process to fail.
I was running out of time, and I thought in few time they were going to shutdown EC2-Classic network (as I did not read correctly), so I decided to download everything and to migrate to another provider. For doing that first I blocked all the traffic, except for my Ip.
I worked in parallel, creating the new config in Google Cloud, just in case I had forgot something. I had created a document for the migration and it was accurate.
I managed to do everything fast enough. The slower part was to download all the Data, as I hold entire VM’s for projects like Cassandra Universal Driver.
Then I powered off my Amazon Instance for the Blog forever.
In GCP I blocked all the traffic in the firewall, except for my Ip, so I could work calmly.
When everything was ready, I had to redirect the DNS to the new static Ip from Google.
The DNS provider I used had implemented some changes in their API so I was getting errors replacing my old entry ‘.’ (their JSON calls returned Internal Server Error). Finally I figured it out how to workaround it and I was able to confirm that the first service was up and running.
I did some tests to make sure there were not unexpected permission problems, entries in the logs, etc…
Only then I opened the Google Firewall. I have a second firewall in each instance where I block or open at Ip tables level what I want. Basically abusive bot’s IPs trying to find exploits or brute force by dictionary passwords.
I checked with my phone, without Wifi that the Firewall was all good. (It is always a good idea to use another external Ip, different from the management one, to check)
TLTR: I’m undergoing a Maintenance on all my sites.
The main reason was that I was getting unexpected API Exceptions on the AWS SDK for Python (boto3), so I connected to the AWS Console to get more information.
Then I saw a message indicating that they will stop EC2-Classic today 30th of October. (Please read the Update on the Postmortem analysis as I understood incorrectly that banner message)
I already started migrating my Services, some I move to other providers like Digital Ocean. Other I had plant to keep in Amazon.
EOL (End of Life) was scheduled for 2022 August, so when I saw the message from Amazon the evening of the 29th, I decided to migrate my EC2-Classic Public Ip’s and Compute to VPC. Trying to deploy from an AMI, Amazon APIs were returning many internal errors, and as I figured out where their failures would be I was able get instances being launch without being Terminated immediately without an explanation. Still I had many problems with the Internet Gateway, VPC NAT, etc… after hours fighting with their errors, and their console, that is more a bunch of pages to manage Infrastructure rather than a user/developer friendly Cloud Tool I decided that I had enough.
After 11 years using Amazon AWS, including a trip to Dublin to be hired as Manager for Cloud Watch, and giving them the idea to add AutoScaling (I was told the project was too easy for me and that I would get bored in a year or too so I was not hired), I decided to move my Services to Google Cloud and to Digital Ocean.
I’m very polite and I saw that when I told to one Manager that the User Interface was terrible he didn’t like, but I have to speak up and say that tools for developers cannot be cold as your evil girlfriend. Cannot be API alike, stand alone pages to manage infinite parts of Architecture. Web providing services for developers cannot be created like in cold SysAdmin style. If the infrastructure is hard to manage and internally you use APIs, build nice Wizards in Javascript. I was leading a Team of Developers with infinite less resources than Amazon or Google and we wrote a Multi-Cloud product, with nice, and clever, and easy to use Wizards, and they were infinitely more better that those giant CSPs. We won a prize at European level at that time. But it was 2013.
I’ve migrated everything, moved all the data, statics, VMs… but I’m completing the adjustments for certain services like Cassandra nodes, web sites, bootstrapping some of my sites based of my PHP Catalonia Framework, adding Firewall rules to GCP, doing changes for Ansible provisioning, deploying the Server scripts from IaC, Docker, etc…
I’ve raised back the price for my books to normal levels. I’ve been keeping the price to the minimum to help people that wanted to learn during covid-19. I consider that who wanted to learn has already done it.
I still have bundles with a somewhat reduced price, and I authorized LeanPub platform to do discounts up to 50% at their discretion.
My dump takes 88MB, not much, but I compress it with gzip.
gzip wp_mysiteZ.sql
It takes only 15MB compressed.
Do not forget the parameter –databases even if only one database is exported, otherwise the CREATE DATABASE and USE `wp_mysiteZ`; will not be added to your dump.
I will need to take some data form the mysql database, referring to the user used for accessing the blog’s database.
I always keep the CREATE USER and the GRANT permissions, if you don’t check the wp-config.php file. Note that the SQL format to create users and grant permissions may be different from a SQL version to another.
I create a file named mysql.sql with this part and I compress with gzip.
Checking PHP version
php -v
PHP 7.3.23 (cli) (built: Oct 21 2020 20:24:49) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.3.23, Copyright (c) 1998-2018 Zend Technologies
WordPress is updated, and PHP is not that old.
The new Ubuntu 20.04 LTS comes with PHP 7.4. It will work:
php -v
PHP 7.4.3 (cli) (built: Jul 5 2021 15:13:35) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
with Zend OPcache v7.4.3, Copyright (c), by Zend Technologies
The Dockerfile
FROM ubuntu:20.04
MAINTAINER Carles Mateo
ARG DEBIAN_FRONTEND=noninteractive
# RUN echo "nameserver 8.8.8.8" > /etc/resolv.conf
RUN echo "Europe/Ireland" | tee /etc/timezone
# Note: You should install everything in a single line concatenated with
# && and finalizing with
# apt autoremove && apt clean
# In order to use the less space possible, as every command
# is a layer
RUN apt update && apt install -y apache2 ntpdate libapache2-mod-php7.4 mysql-server php7.4-mysql php-dev libmcrypt-dev php-pear git mysql-server less zip vim mc && apt autoremove && apt clean
RUN a2enmod rewrite
RUN mkdir -p /www
# If you want to activate Debug
# RUN sed -i "s/display_errors = Off/display_errors = On/" /etc/php/7.2/apache2/php.ini
# RUN sed -i "s/error_reporting = E_ALL & ~E_DEPRECATED & ~E_STRICT/error_reporting = E_ALL/" /etc/php/7.2/apache2/php.ini
# RUN sed -i "s/display_startup_errors = Off/display_startup_errors = On/" /etc/php/7.2/apache2/php.ini
# To Debug remember to change:
# config/{production.php|preproduction.php|devel.php|docker.php}
# in order to avoid Error Reporting being set to 0.
ENV PATH_WP_MYSITEZ /var/www/wordpress/wp_mysitez/
ENV PATH_WORDPRESS_SITES /var/www/wordpress/
ENV APACHE_RUN_USER www-data
ENV APACHE_RUN_GROUP www-data
ENV APACHE_LOG_DIR /var/log/apache2
ENV APACHE_PID_FILE /var/run/apache2/apache2.pid
ENV APACHE_RUN_DIR /var/run/apache2
ENV APACHE_LOCK_DIR /var/lock/apache2
ENV APACHE_LOG_DIR /var/log/apache2
RUN mkdir -p $APACHE_RUN_DIR
RUN mkdir -p $APACHE_LOCK_DIR
RUN mkdir -p $APACHE_LOG_DIR
RUN mkdir -p $PATH_WP_MYSITEZ
# Remove the default Server
RUN sed -i '/<Directory \/var\/www\/>/,/<\/Directory>/{/<\/Directory>/ s/.*/# var-www commented/; t; d}' /etc/apache2/apache2.conf
RUN rm /etc/apache2/sites-enabled/000-default.conf
COPY wp_mysitez.conf /etc/apache2/sites-available/
RUN chown --recursive $APACHE_RUN_USER.$APACHE_RUN_GROUP $PATH_WP_MYSITEZ
RUN ln -s /etc/apache2/sites-available/wp_mysitez.conf /etc/apache2/sites-enabled/
# Please note: It would be better to git clone from another location and
# gunzip and delete temporary files in the same line,
# to save space in the layer.
COPY *.sql.gz /tmp/
RUN gunzip /tmp/*.sql.gz; echo "Starting MySQL"; service mysql start && mysql -u root < /tmp/wp_mysitez.sql && mysql -u root < /tmp/mysql.sql; rm -f /tmp/*.sql; rm -f /tmp/*.gz
# After this root will have password assigned
COPY *.zip /tmp/
COPY services_up.sh $PATH_WORDPRESS_SITES
RUN echo "Unzipping..."; cd /var/www/wordpress/; unzip /tmp/*.zip; rm /tmp/*.zip
RUN chown --recursive $APACHE_RUN_USER.$APACHE_RUN_GROUP $PATH_WP_MYSITEZ
EXPOSE 80
CMD ["/var/www/wordpress/services_up.sh"]
Services up
For starting MySQL and Apache I relay in services_up.sh script.
#!/bin/bash
echo "Starting MySql"
service mysql start
echo "Starting Apache"
service apache2 start
# /usr/sbin/apache2 -D FOREGROUND
while [ true ];
do
ps ax | grep mysql | grep -v "grep "
if [ $? -gt 0 ];
then
service mysql start
fi
sleep 10
done
You see that instead of launching apache2 as FOREGROUND, what keeps the loop, not exiting from my Container is a while [ true ]; that will keep looping and checking if MySQL is up, and restarting otherwise.
MySQL shutting down
Some of my sites receive DoS attacks. More than trying to shutdown my sites, are spammers trying to publish comment announcing fake glasses, or medicines for impotence, etc… also some try to hack into the Server to gain control of it with dictionary attacks or trying to explode vulnerabilities.
The downside of those attacks is that some times the Database is under pressure, and uses more and more memory until it crashes.
More memory alleviate the problem and buys time, but I decided not to invest more than $6 USD per month on this old site. I’m just keeping the contents alive and even this site still receives many visits. A restart of the MySQL if it dies is enough for me.
As you have seen in my Dockerfile I only have one Docker Container that runs both Apache and MySQL. One of the advantages of doing like that is that if MySQL dies, the container does not exit. However I could have had two containers with both scripts with the while [ true ];
When planning I decided to have just one single Container, all-in-one, as when I export the image for a Backup, I’ll be dealing only with a single image, not two.
Building and Running the Container
I created a Bash script named build_docker.sh that does the build for me, stopping and cleaning previous Containers:
#!/bin/bash
# Execute with sudo
s_DOCKER_IMAGE_NAME="wp_sitez"
printf "Stopping old image %s\n" "${s_DOCKER_IMAGE_NAME}"
sudo docker stop "${s_DOCKER_IMAGE_NAME}"
printf "Removing old image %s\n" "${s_DOCKER_IMAGE_NAME}"
sudo docker rm "${s_DOCKER_IMAGE_NAME}"
printf "Creating Docker Image %s\n" "${s_DOCKER_IMAGE_NAME}"
# sudo docker build -t ${s_DOCKER_IMAGE_NAME} . --no-cache
sudo docker build -t ${s_DOCKER_IMAGE_NAME} .
i_EXIT_CODE=$?
if [ $i_EXIT_CODE -ne 0 ]; then
printf "Error. Exit code %s\n" ${i_EXIT_CODE}
exit
fi
echo "Ready to run ${s_DOCKER_IMAGE_NAME} Docker Container"
echo "To run type: sudo docker run -d -p 80:80 --name ${s_DOCKER_IMAGE_NAME} ${s_DOCKER_IMAGE_NAME}"
echo "or just use run_in_docker.sh"
echo
echo "Debug running Docker:"
echo "docker exec -it ${s_DOCKER_IMAGE_NAME} /bin/bash"
echo
I assign to the image and the Running Container the same name.
Running in Production
Once it works in local, I set the Firewall rules and I deploy the Droplet (VM) with Digital Ocean, I upload the files via SFTP, and then I just run my script build_docker.sh
And assuming everything went well, I run it:
sudo docker run -d -p 80:80 --name wp_mysitez wp_mysitez
I check that the page works, and here we go.
Some improvements
This could also have been put in a private Git repository. You only have to care about not storing the passwords in it. (Like the MySQL grants)
It may be interesting for you to disable directory browsing.
The build from the Git repository can be validated with a Jenkins. Here you have an article about setup a Jenkins for yourself.
I have benchmarked three different CPUs and two Compute optimized Amazon AWS instances with CMIPS 1.0.5 64bit. The two Intel Xeon baremetals equip 2 x Intel Xeon Processor and the third baremetal equips a single Intel Core i7-7800X:
If you’re surprised by the number of cores reported by the Amazon instance m5d.24xlarge, and even more for the baremetal c5n.metal, you’re guessing well that this comes from having Servers with 4 CPUs for Compute Optimized series.
When I can choose I use Linux, but in many companies I work with Windows workstations. I’ve published a list of useful Software I use in all my Windows workstations.
WFH I currently use two external monitors attached to the laptop. I planned to add a new one using a Display Port connected to the Dell USB-C dongle that provides me Ethernet and one additional HDMI as well. I got the cable from Amazon but unfortunately something is not working. In order to make myself comfortable and see some the graphs of the systems worldwide as I have on the office’s displays, I created a small HTML page, that joins several monitor pages in one single web page using frames. This way I only have one page loaded on the browser, maximized, and this monitor is dedicated to those graphs of the stats of the Systems. Something very simple, but very useful. You can extend the number of columns and rows it to have more graphics in the same screen.
If you don’t have the space or the resources for more monitors you can use the ingenious.
I have a cheap HDMI switch that allows me to do PinP (Picture in Picture) with one main source on the monitor, and two using a fraction of their original space. It may allow you to see variants in graphics.
And in you have only a single monitor, you can use a chrome extension that rotates tabs, which is also very useful.
Be careful if you use the reload features with software like Jira or Confluence. If they are slow normally, imagine if you mess it by reloading every 30 seconds… I discourage you to use auto refresh on these kind of Softwares.
This past week I have connected the XBOX One X Controller to the Windows laptop for the first time. Normally I use the Pc only for strategy games, but I wanted to play other games like Lost Planet 3, or Fall Guys in a console mode way. I figured that would be very easy and it was. You turn on the controller, press the connect button like you did to pair with the console, and in Windows indicate pair to a Xbox One controller. That’s it.
I’ve also updated my Python 3 Combat Guide, to add the explanation, step by step, about how to refactor and make resilient, and add Unit Testing to a spaghetti code, and turn it into a modern OOP. Is currently 255 DIN-A4 pages.
This is something I wanted to share with you for a while. One of the most funny things in my career is what I call: Squirrel Strikes Back
I named this as the first incident where a provider told that the reason of a fiber failure was a squirrel chewing the cable.
I popularized this with my friends in Systems Administration and SRE and when they suffer a Squirrel Attack incident, they forward it to me, for great joy.
I’m used to construction or gas, water, electricity, highways repair operations on the cities accidentally cutting fiber cables, thunders or truck accidents on the highway breaking the floor and cutting tubes and issues like that. I’ve been seeing that for around 25 years.
So the first time I saw a provider referring to a squirrel cutting the cables it was pretty hilarious. :)
In my funny mental picture: I could visually imagine a cable thrown in the middle of the forest, over trees, and a squirrel chewing it as it tastes like peanuts. :) or a shark cutting a Google’s or Facebook’s intercontinental cable thrown without any protection. ;)
The sense of humor and the good vibes, are two of the most important things in life.
This article covers the desperate situation where you had generated one or more instances, instructed Amazon to use a SSH Key Pair certs where only you have the Private Key, your instances are running, for example, an eCommerce site, running for months, and then you loss your Private Key (.pem file), and with it the SSH access to your instances’ Data.
Actually I’ve seen this situation happening several times, in actual companies. Mainly Start ups. And I solved it for them.
Assuming that you didn’t have a secondary method to access, which is another combination of username/password or other user/KeyPairs, and so you completely lost the access to the Database, the Webservers, etc… I’m going to show you how to recover the data.
For this article I will consider an scenario where there is only one Instance, which contains everything for your eCommerce: Webserver, code, and Database… and is a simple config, with a single persistent drive.
Warning: be very careful as if you use ephemeral drives, contents will be lost is you power off the instance.
Method 1: Quicker, launching a new instance from the previous
Step1: The first step you will take is to close the access from outside, using the Firewall, to avoid any new changes going to the disk. You can allow access to the instance only from your static Ip in the office/home.
Step 2: You’ll wait for 5 minutes to allow any transaction going on to conclude, and pending writes to be flushed to disk.
Step 3: From Amazon AWS Console, EC2, you’ll request an Snapshot. That step is to try to get extra security. Taking an Snapshot from a live, mounted, filesystem, is not the best of ideas, specially of a Database, but we are facing a desperate situation so we’re increasing the numbers of leaving this situation without Data loss. This is just for extra security and if everything goes well at the end you will not need this snapshot.
Make sure you select No reboot.
Step 4: Be very careful if you have extra drives and ephemeral drives.
Step 5: Wait till the Snapshot completes.
Step 6: Then request a graceful poweroff. Amazon will try to poweroff the Server in a gentle way. This may take two minutes.
Step 7: When the instance is powered off, request a new Snapshot. This is the one we really want. The other was just to be more safe. If you feel confident you can just unclick No Reboot on the previous Step and do only one Snapshot.
Step 8: Wait till the Snapshot completes.
Step 9: Generate and upload the new key you will use to AWS Console, or ask Amazon to generate a key pair for you. You can do it while creating the new instance through the wizard.
Step 10: Launch a new instance, based on your snapshot AMI. This will generate a copy of your previous instance (using the Snapshot) for the new one. Select the new Key pair. Finish assigning the Security groups, the elastic ip…
Step 11: Start the new instance. You can select a different flavor, like a more powerful instance, if you prefer. (scale vertically)
Step 12: Test your access by login via SSH with the new pair keys and from your static Ip which has access in the Firewall.
Step 13: Check that the web Starts correctly, check the Database logs to see if there is any corruption. Should not have any if graceful shutdown went well.
Step 14: Reopen the access from the Firewall, so the world can connect to your instance.
Method 2: Slower, access the Data and rebuild whatever you need
The second method is exactly the same until Step 6 included.
Step 7: After this, you will create a new instance based on your favorite OS, with a new pair of Keys.
Step 8: You’ll detach the Volume from the eCommerce previous instance (the one you lost access).
Step 9: You’ll attach the Volume to the new instance.
Step 10: You’ll have access to the Data from the previous instance in the new volume. type cat /proc/partitions or df -h to see the mountpoints available. You can then download or backup, or install the Software again and import the Database…
Step 11: Check that everything works, and enable the access worldwide to the Web in the Firewall (Security Group Inbound Rules).
If you are confident enough, you can use this method to upgrade the OS or base Software of your instance, making it part of your maintenance window. For example, to get the last version of Ubuntu or CentOS, MySQL, Python or PHP, etc…