Category Archives: Troubleshoot

Curiosity: Python string.strip() removes more than just white spaces

Another Python curiosity.

If you look at the official Python 3 documentation for strip(), it says that strip() without parameters will return the string without the leading and trailing white spaces.
Optionally you can pass a string with the characters you want to eliminate.
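
For example, a quick check from the shell (a minimal sketch, assuming a python3 interpreter is available):

python3 -c 'print("xxhello worldxx".strip("x"))'

This prints hello world, with the leading and trailing x characters removed.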

The official documentation for Python 2 says:

string.strip(s[, chars])

Return a copy of the string with leading and trailing characters removed. If chars is omitted or None, whitespace characters are removed. If given and not None, chars must be a string; the characters in the string will be stripped from the both ends of the string this method is called on.

Changed in version 2.2.3: The chars parameter was added. The chars parameter cannot be passed in earlier 2.2 versions.

https://docs.python.org/2/library/string.html

A white space is a white space. It is not an Enter (new line).
But strip() without parameters will remove white spaces (space), and also Enters \n and Tabs \t.

Probably you will not realize this unless you read from a file that has empty lines at the end for a reason, and you use strip().

You can see a demonstration in the following small program, which runs the same for Python 2 and Python 3.

And the corresponding output for python2 and python3:

The [ ] characters were added to show that there are no hidden tabs or similar characters after the text.

Here I paste the code so you can try yourself:

import sys


def print_bar():
    print("-----------------------------------------------------")


def print_between_brackets(s_test):
    print("[" + s_test + "]")


s_string_with_enters = "  Testing strip not only removing white spaces, but Enter and Tabs as well\n\t\n\n"

print("Testing .strip()")
print("You are running Python " + sys.version)
print("This is the original string")
print_bar()
print_between_brackets(s_string_with_enters)
print_bar()
print("Now after strip()...")
print_bar()
print_between_brackets(s_string_with_enters.strip())
print_bar()
print("As you can see the Enters and the Tabs have been removed, not just the spaced")

I think this should be disambiguated, so I decided to take action. It is very easy to blame and never contribute. Not me. I went to the Python project to fix that and I located a bug reporting this issue:

https://bugs.python.org/issue25433

The issue was registered, and Dimitri Papadopoulos Orfanos made especially interesting contributions to it.

The thread is really interesting to read. I recommend it.

At a glance:

“Python heavily relies on isspace() to detect “whitespace” characters.”

* Lib/string.py near line 23:
  whitespace = ' \t\n\r\v\f'

So all those characters will be stripped in Python 2.7 if you just use string.strip().

The ticket was opened on 2015-10-18 at 12:15. So it’s a shame the documentation has not been updated yet, more than 3 years later. That kind of thing, the lack of care, is what I can’t understand. Not striving for excellence.

Please do note that Python 3 supports Unicode natively and things are always a bit different than with Python 2 and ASCII.

Solving a persistent MD Array problem in RHEL7.4 (Dual Port SAS drives)

Ok, so I lent one of my Servers to two of my colleagues in the States, who needed to prepare some tests for a customer. I always try to be nice and to stimulate sales across the organizations I help, so if they need a Server for a PoC and a demo to a customer, they know they can count on me.

It is important to remark that the Servers I was using had two motherboards, each with its own CPU and RAM, and Dual Port SAS drives. We had those Servers so we could implement High Availability. Dual Port SAS allows two different computers or IO controllers to access the same drive at the same time.

I work with Declustered RAID, DRAID, and ZFS.

The Server was a 4U90: a 4U Server with 90 SAS3 spinning drives and 4 SSDs. The drives are Dual Ported, and the two Controllers (motherboard + CPU + RAM) have simultaneous access to the drives for HA.

After their tests my colleagues returned the Server to me, and when I needed to use it, to my surprise I encountered problems as soon as I tried to provision with ZFS. There was not much in the logs. Please note I was using only one node (or controller), and the other was not in use, but they asked me to keep the OS and the data (on a 2xMD drive). I shut down node A after the Engineers in San Jose powered the Server off, so only my node was working.

I checked:

cat /proc/mdstat

And that was the thing: 8 MD Arrays were there.

[root@4u90-B ~]# cat /proc/mdstat 
Personalities : 
md2 : inactive sdba1[9](S) sdag1[7](S) sdaf1[3](S)
11720629248 blocks super 1.2

md1 : inactive sdax1[7](S) sdad1[5](S) sdac1[1](S) sdae1[9](S)
12056071168 blocks super 1.2

md0 : inactive sdat1[1](S) sdav1[9](S) sdau1[5](S) sdab1[7](S) sdaa1[3](S)
19534382080 blocks super 1.2

md4 : inactive sdbf1[9](S) sdbe1[5](S) sdbd1[1](S) sdal1[7](S) sdak1[3](S)
19534382080 blocks super 1.2

md5 : inactive sdam1[1](S) sdan1[5](S) sdao1[9](S)
11720629248 blocks super 1.2

md8 : inactive sdcq1[7](S) sdz1[2](S)
7813752832 blocks super 1.2

md7 : inactive sdbm1[7](S) sdar1[1](S) sdy1[9](S) sdx1[5](S)
15627505664 blocks super 1.2

md3 : inactive sdaj1[9](S) sdai1[5](S) sdah1[1](S)
11720629248 blocks super 1.2

md6 : inactive sdaq1[7](S) sdap1[3](S) sdr1[8](S) sdp1[0](S)
15627505664 blocks super 1.2

Ok. So I stop the Arrays:

mdadm --stop /dev/md127

And then I zero the superblock:

mdadm --zero-superblock /dev/sdb1

After doing this for all of them, I try to provision and… surprise! It does not work. /dev/md127 has respawned like in the old times of the Doom video game.

I check the mdmonitor service and even disable it.

systemctl disable mdmonitor
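
Note that disable only prevents the service from starting at the next boot; to make sure it is also not running right now, a small additional check (a sketch):

systemctl stop mdmonitor
systemctl status mdmonitor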

I repeat the process.

And /dev/md127 appears again, using another device.

At this point, just in case, I check the other controller, which should be powered off.

Ok, it was on. With a different IP, so it was not answering pings, but I still had access to the BMC/IPMI. After confirming with my colleagues that I could shut down that node (they did not turn it on, apparently), I launch the poweroff command, repeat the process, and get the same result!

I see that the poweroff command on the second Controller is doing a reboot, not a poweroff. It is a Firmware issue, I find. So I access the Linux from the management tool and I launch the halt command, which makes it stop responding to pings.

I repeat the process, and still the ghost md array appears there, and blocks me from doing my zpool create.

The /etc/mdadm.conf file did not exist (by default it is not created).

I try a more aggressive approach:

DRIVES=`cat /proc/partitions | grep 3907018584 | awk '{ print $4; }'`

for DRIVE in $DRIVES; do echo "Trying /dev/${DRIVE}1"; mdadm --examine /dev/${DRIVE}1; done

Ok. And destruction time:

for DRIVE in $DRIVES; do echo "Trying /dev/${DRIVE}"; wipefs -a -f /dev/${DRIVE}; done

for DRIVE in $DRIVES; do echo "Trying /dev/${DRIVE}1"; mdadm --zero-superblock /dev/${DRIVE}1; done

Apparently the system is clean, but still I cannot provision, and /dev/md127 respawns and reappears all the time.

After googling and not finding anything about this problem, and my colleagues not having a clue about what was causing it, I just proceed with a simple solution, as I need the Server for my company to complete the tests in the next 24 hours.

So I create the file /etc/mdadm.conf with this content:

[root@draid-08 ~]# cat /etc/mdadm.conf 
AUTO -all

After that I rebooted the Server, saw that the infamous /dev/md127 was not there, and I was able to provision.
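
A quick way to double-check after the reboot that no ghost Array was auto-assembled (a small sketch):

cat /proc/mdstat
ls /dev/md* 2>/dev/null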

I share the solution as it may help other people.

The most straightforward procedure would have been a clean reinstall of the OS, but this operation is very slow when simulating a Virtual CD remotely, so it was worth fixing it at the OS level, as I saved one day of delay in my work.

Troubleshooting upgrading and loading a ZFS module in RHEL7.4

I illustrate this troubleshooting as it will be useful for some of you.

I asked one of the members of my Team to compile and install ZFS 7.9 on some of the Servers loaded with drives, which were running the older ZFS 7.4 version.

Those systems were running RHEL7.4.

The compilation and install went fine, however the module was not able to load.

My Team member reported that when trying to run “modprobe zfs“ it was giving the error:

modprobe: ERROR: could not insert 'zfs': Invalid argument

Also when trying to use a zpool command it gives the error:

Failed to initialize the libzfs library

That was only failing in one of the Servers, but not in the others.

My Engineer ran dmesg and found:

zfs: `' invalid for parameter `metaslab_debug_unload

He thought it was a compilation error, but I knew that metaslab_debug_unload is an optional parameter that you can set in /etc/zfs.conf.

So I ran:

 modprobe -v zfs

And that confirmed my suspicion, so I edited /etc/zfs.conf, commented out the parameter and tried again. And it failed.

As I ran modprobe -v zfs (verbose) it was returning the verbose info, and I saw that it was still trying to load those parameters, so I knew it was reading them from some file.
I could have grepped all the files in the filesystem looking for the failing parameter from the verbose output, or found all the files in the system named zfs.conf. To me that looked inefficient, as it would be slow and might not bring any result (as I didn’t know exactly how my team member had compiled the code), although I expected to get a result. But what if I found 5 or 7 zfs.conf files? Slow.
I used strace. It was not installed, but the RHEL license was active so I simply did:

yum install strace

strace stands for System Trace and so it records all the System Calls that the programs do.
That’s a pro trick that will accompany you all your career.

So I did:

strace modprobe zfs

I did not use -v here because all the verbose output would have been logged as System Calls and made my search more difficult.
I got the output of all the System Calls and I just had to look for which files were being read.
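
A narrower variant of the same trick (a sketch; the exact -e syntax may vary between strace versions) is to trace only the open-type System Calls and filter for config files:

strace -f -e trace=open,openat modprobe zfs 2>&1 | grep '\.conf'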

Then I found the zfs.conf under /etc/modprobe.d/zfs.conf.
That was the one being read. So I commented out the line, tried modprobe zfs again, and it worked perfectly. :)

An Epic fail that all the universities are committing

Article created on: 1528997557 | 2018-06-14 18:32:15 IST

Recently a mentor from UCC university came to visit me at my office, in order to follow up on one of the members of my Team, an intern.
The conversation went well, and at some point he asked what courses the university could teach their students in order for them to be more prepared for working with us.
The Head of Business Development, who was in the meeting with me, mentioned something interesting:
– Make them publish their best code in github, bitbucket or a similar git repository, and maintain it. It is like a CV.
He pointed out that some of the students sent me their repository page, and they had not committed a thing for more than a year. And usually the code that I find there is less than a tic-tac-toe exercise.
– Obviously, to have git experience.
– Having contributed to an Open Source project.

I listed some things that would be helpful to have in the interns and grads that I hire:
– git experience
– Python programming
– C programming
– Unit Testing experience
– Networking experience, in particular iSCSI exports, tcpdump
– Programming Best practices, PEP-8 at least for Python
– Usage of Professional Tools like PyCharm, JetBrains IntelliJ, PHPStorm, Code Lion, Netbeans, Eclipse
– Linux experience. Many of them use Windows at home because they also play video games. Really few programmers in real life use Windows. So at least, guys, install Virtual Box or VMware and run Linux in a Virtual Machine.
– Cloud experience. Using instances, CDNs, APIs, tools…

And as the talk advanced I gave him a hint of the Epic fail that all the universities are committing.
They teach git for a semester. They teach Python for one or two semesters, usually one in the first year and another in the second year. And that’s it. It’s gone.
When they leave the university they have not programmed in Python for 2 or 3 years, they have not used git, they have not used SQL for the same amount of time, etc…

My boss pointed out that the best candidates do side projects in their spare time, and have that brightness in their eyes. That sparkle in the eyes is what I call the eye of the tiger, the desire to improve, to learn. That spark.

I told the mentor of my intern that the big mistake is doing things in small parcels, isolated, one block and then it’s gone. That the best way to proceed would be to:
Make the student start a project from the very beginning, from the first semester. Then keep making it bigger and better over time.
Let them improve it over time. Screw it up in all the ways possible. Make them reach the limits of their initial architecture. Allow them to face having to redo the thing from scratch. Allow them to screw up, to break things, and to learn from their mistakes. Over and over.

Nobody becomes a great programmer coding average things for two semesters.
But let them realize where the problems are. Let them come back to their code from two or three months ago, from before the holidays, and realize how important it is to write comments, to give proper names to the files and to the variables. Let them run that project for so long that at some point they have to change computer, and they realize that what worked on Windows with mixed Uppercase and Lowercase file names does not work on Unix (case sensitive).
Let them grow.
Let them see their mistakes over the time.

Let them run the project for so long that they switch Cloud providers several times, and discover the pros and the cons and the things not to do, like running for your life before using shared hostings that limit your CPU quota or even kill your MySQL instances when you check the email (true story: connecting to POP3 was raising the CPU and the provider was killing the MySQL instances, and so the queries), or that limit your queries per second. Then ask them to install a Drupal and they will learn the hard way why Quality is always better than price, and they will make the right decisions when they work for somebody else or for their own Startup.

Even many of the supposedly Senior guys never learned from their mistakes, for example the Outsourcing guys, because they work 6 months to a year on a project and then jump to another. Nobody explains the hell in maintenance and incidents they have left behind. Nobody teaches them.

Programming a small project for 6 months doesn’t make a master. Doing it for 5 years, growing it, learning from your mistakes and learning the DOs and DO-NOTs the hard way, the real way that works, because it makes you understand why something is better than other things, is the path.

That also reminds me why I love the MT Notation: many of the guys in Barcelona who saw it criticized the method, while my colleagues at Facebook and Dropbox actually told me that they use it, especially for Python and C/C++.

Allow them to think about how to solve sorting a list of 1,000 items by themselves. Let them think. The lazy will copy, but they will not grow.

Then let them implement a Bubble sort. Let them improve it, if they can. Allow them a week to try to improve it. Then make them sort 1,000,000 items so they see that it is bloody slow. How can I improve that? Should I read the data from the drive all at once, since reading line by line was slow?… let them think. As if they were learning Martial Arts, discovering their strengths, that they have fast reflexes, allow them to grow.

Universities have to create good professionals, not just machines for passing exams. The real world demands talent, problem-solving abilities, passion, the ability to learn, the will to do things well and to improve, and discipline.

After 5-6 years of programming on a daily basis, with an IDE, git, deploying to the Cloud as the basics, growing a program and seeing the downsides of the solutions chosen, observing that the caveats were there for a reason, learning that the Hardware is important, that writing to memory is not the same as writing to disk or to the network, detecting the problems, redoing things, ending up in a cul-de-sac, fixing, improving, learning, growing the project, growing himself/herself as a mind, as a programmer, as a thinker, as an expert, daily, even if it’s 30 minutes per day, then that person is prepared for some serious business.

Like piano, guitar, painting, writing… and any other activity, one requires continuous training in order to improve.

Students have to follow a journey in order to improve.

Let them start with the Command Line, e.g. in C and files. Later add database support.

Deal with buffer overflows, file descriptors, locks and type conversions. Let them migrate the entire project to another language, using Git from the beginning.

Let them migrate again when they need to add Web support. Allow them to discover that instead of reloading the whole page they can use Ajax/JSON. Let them deal with the double-clicks that many common users do on the page buttons (so they submit the information twice). To discover SQL Injections. To use a Web Framework. To add Unit Testing. To add some improvements via Javascript Frameworks, like responsive design for mobiles.

Allow them to use a new Database, a new Webserver or a technology that is fashionable and everybody on Twitter talks about, so it crashes in their face. And so they discover that they should not play with or discover new technologies during actual project time in the Company of their future employers, because shit happens, and it impacts the Schedule, and the Company loses money. Universities: teach them, let the students learn this for themselves, rather than screwing it up in several companies after university.

Linux command-line tools I usually install

Some additional command-line tools that I usually install and use on my text client Systems. Initially I did not list commands that are shipped with every Linux, but the additional tools I install on every Workstation or Server.

Apache benchmarks (ab)

To stress a Web Server

atop

A good complement to htop, iftop… monitoring tools

bzip2

Cool compressor, better than gzip, that also accepts streams.
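
A small example of using it over a stream (a sketch; the database name is hypothetical):

mysqldump mydatabase | bzip2 > mydatabase.sql.bz2
bzcat mydatabase.sql.bz2 | head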

cfdisk

Nice tool to work with partitions through modern menus.

ctop

Command line / text based Linux Containers monitoring tool.

If you use docker stats, ctop is much better.

ctop.py

My own monitoring tool.

dmidecode

It is the DMI table decoder.

For example: dmidecode -t memory

dstat

Very nice System Stats tool. You can specify individual stats, like a drive.
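
For example, to watch only the disk stats of a single drive, refreshing every 5 seconds (a sketch; sda is a hypothetical device):

dstat -d -D sda 5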

edac-util

Error reporting utility

edac-util --verbose
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#0_MC#1_Chan#2_DIMM#0: 0 Corrected Errors
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc2: csrow0: 0 Uncorrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#1_DIMM#0: 0 Corrected Errors
mc2: csrow0: CPU_SrcID#1_MC#0_Chan#2_DIMM#0: 0 Corrected Errors
mc3: 0 Uncorrected Errors with no DIMM info
mc3: 0 Corrected Errors with no DIMM info
mc3: csrow0: 0 Uncorrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#0_DIMM#0: 0 Corrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#1_DIMM#0: 0 Corrected Errors
mc3: csrow0: CPU_SrcID#1_MC#1_Chan#2_DIMM#0: 0 Corrected Errors

ethtool


fatrace
Reports file access events from all running processes in real time.

flock
With flock several processes can have a shared lock at the same time, or be waiting to acquire a write lock. With lslocks from util-linux package you can get a list of these processes.
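
A small sketch of serializing a job with a hypothetical lock file and command:

# Run the job only if no other instance holds the lock
flock -n /tmp/backup.lock -c "rsync -a /data/ /backup/"
# From another terminal, list the processes holding locks
lslocks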

fstrim

Discard unused blocks on a mounted filesystem (local or remote). It is useful for freeing blocks no longer used in ZFS zvols. That can also be achieved with mount -o discard.
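
For example (a sketch; the mount point is hypothetical):

sudo fstrim -v /mnt/mypool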

fuser
Show which processes use the named files, sockets, or filesystems.

git

glances

A tool for monitoring similar to ctop and htop.

hdparm
Get/set SATA/IDE device parameters

htop

An improved top

httpie

An HTTP client.

id (configure to query OpenLDAP)

ifmetrics

To set the metrics of all IPV4 routes attached to a given network interface

ifstat

Pretty network interfaces stats.

iftop

To watch metrics for a network interface (or wireless)


iostat

CPU and IO devices stats. I modified some collectors for telegraf and influxdb consumed by grafana for fetching the Write KB/s, Read KB/s, Bandwidth of the Magnetic Spinning drives and SSD during declustered rebuild.

iotop

iperf

Perform network throughput tests

ipmitool

iptables

iscsiadm

java (jre Oracle and OpenJDK)

journalctl

ldap-utils

ldapsearch and the other tools to work with LDAP.

less
According to manpages, the opposite of more. :)
What it does is display a file, and you can scroll up/down, you can search for patterns…
Examples:
cat /etc/passwd | less
less /etc/passwd
# -n doesn’t count the lines, to save time
# For a specific Offset
less -n +500000000P /var/log/apache2/giant.log
# For 50% point
less -n +50p /var/log/apache2/giant.log

lrzip /lrztar

Compressor that compresses big files very efficiently, especially GBs of source code.

lrzsz (Zmodem)

A utility to send files to the Server through a terminal.

Very useful when you don’t want to scp or sftp, for example because that requires MFA (Multi Factor Authentication) to be performed again and you already have a session open.

Moba xTerm for Windows is one of the Terminal clients that accepts Upload/Download of Z-modem.

apt install lrzsz

lsblk

List the block devices. blkid is also handy, but you can get that information from /dev/disk/by-id/.

lynx

Text browser. Really handy.

ltrace
To trace library calls.

mc

Midnight Commander


md5sum

memtester

Basically for testing Memory.

This will allocate 4GB of RAM and run the test 10 times.

sudo memtester 4096 10

mtr

Network tool mix between ping and traceroute.

mytop

To see in real time the queries and slow queries going to MySQL.

ncdu

Show the space used by any directory and subdirectory


nginx (fpm-php) and apache

The webservers

nfs client

nmon

Offers monitoring of different aspects: Network, Disk, Processes…

open-vpn

openssh-server

parted

Partition manipulation

perf

Performance profiler.

Ie:
perf top
perf stat ls

PHP + curl + mysql (hhvm)

pixz

A parallel, multiprocessor compressor (a variant of xz) that can leverage several processors to speed up the compression of files.

If the input looks like a tar archive, it also creates an index of all the files in the archive. This allows the extraction of only a small segment of the tarball, without needing to decompress the entire archive.

postcat

postcat -q ID shows the details of a message in the queue

python-pip and pypy

pv
Pipe Viewer – is a terminal-based tool for monitoring the progress of data through a pipeline. It can be inserted into any normal pipeline between two processes to give a visual indication of how quickly data is passing through, how long it has taken, how near to completion it is, and an estimate of how long it will be until completion.

dd if=/dev/urandom | pv | dd of=/dev/null

Output:

1,74MB 0:00:09 [ 198kB/s] [      <=>                               ]

Probably you’ll prefer to use dd with status=progress option, it’s just a sample.

qshape

It shows the composition of the mail queue.

screen

sdparm
Access SCSI modes pages; read VPD pages; send simple SCSI commands

sfdisk

Utility to work with partitions that can export and import configs through STDIN and STDOUT to automate partitions operations.

slabtop
Displays Kernel slab cache information in real time.

smartctl

Utility for dealing with the S.M.A.R.T. features of the disks, knowing errors…

split
Split a file into several, based on text lines, or in binary mode: a number of bytes per file.
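
Two small examples (a sketch; the file names are hypothetical):

# 1,000,000 lines per output file
split -l 1000000 giant.log giant_part_
# 100 MB per output file, binary mode
split -b 100M disk.img disk_part_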

sha512sum

sshfs

Mount a directory from a remote Server on a local mountpoint by using SSH.
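
For example (a sketch; the user, server and paths are hypothetical):

sshfs carles@server:/var/log /mnt/server-logs
# To unmount it later
fusermount -u /mnt/server-logs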

sshpass
SSH without typing the password. -f for reading it from a file.
sshpass -p "mypassword" ssh -o StrictHostKeyChecking=no root@10.251.35.251

In this sample we pass the command ls, so it will be executed and then it logs out.
sshpass -p "mypassword" ssh -o StrictHostKeyChecking=no root@10.251.35.251 ls

sshuttle

A poor man’s VPN through SSH that is available for Linux and Mac OS X.

E.g.: sudo sshuttle -r carles@8.8.1.234:8275 172.30.0.0/16

Sshuttle forwards TCP and DNS but does not forward UDP or ICMP. So ping or ipmi protocol won’t work. But it does work for http, https, ssh…

Nice article on tunneling only certain things here.

strace

To trace the system calls and signals.
To redirect the output to another process use:
strace zpool status 2>&1 | less

stress-ng

Basically to stress Servers.

strings

Shows only strings from a file (normally a Binary)

From package binutils.

subversion svn

systemd-cgtop and systemd-analyze

tcpdump

To see the traffic to your NIC
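
For example (a sketch; eth0 is a hypothetical interface):

sudo tcpdump -i eth0 -n port 443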

tcptraceroute

TCP/IP Traceroute. It targets ports, so you can also use it to check if a port is open.

tee

Reads from stdin, writes to a file and outputs to stdout as well.
For example:

find . -mtime -1 -print | grep -v "/logs/" 2>&1 | tee /var/log/results.log

timeout
Kills the process after the indicated timeout.

timeout 1000 dd if=/dev/urandom of=/dev/zvol/carles-N51-C5-D8-P2-S1/gb1000

tracepath

traceroute

tree

Tree simply shows the directory hierarchy in a graphical (text mode) way. Useful to see where files and subfolders are.

xxd
Make a hexdump or do the reverse

sudo xxd /dev/nvme0n1p1 | less

zcat
Just like cat, but for compressed files.

zcat logs.tar.gz | grep "Error"
zcat logs.tar.gz | less

zless

zram-config

Sergey Davidoff stumbled upon a project called compcache that creates a RAM based block device which acts as a swap disk, but is compressed and stored in memory instead of swap disk (which is slow), allowing very fast I/O and increasing the amount of memory available before the system starts swapping to disk. compcache was later re-written under the name zRam and is now integrated into the Linux kernel.

http://www.webupd8.org/2011/10/increased-performance-in-linux-with.html

watch

watch allows you to execute a command and watch it (refresh it) at a given interval, for example every two seconds.
For example, if normally I would do:

 while [ true ]; do zpool status | head -n10 ; sleep 10; done 
 while [ true ]; do df -h; sleep 60; done 
while [ true ]; do ls -al /tmp | head -n5 ; sleep 2; done 

Then I can do:

watch -n10 zpool status

Or:

watch -n60 df -h

Or

watch "ls -al /tmp | head -n5"

For my Dockers, and cloudinit with Ubuntu, my defaults are:

apt update; apt install htop mc ncdu strace git binutils

Using Windows 10 Appliance in Ubuntu Virtual Box 4.3.10

Microsoft has released Windows 10, and with it the possibility to Download a Windows 10 Appliance to run under Virtual Box, VMware Player, HyperV (for Windows), Parallels (Mac). Their idea is to allow you to test the new Microsoft Edge browser, in addition to being able to test the older browsers in older VM images.

I wanted to use Windows 10 to check compatibility with my messenger c-client.

Also I wanted to know how Java behaves.

The Windows 10 VM image will work for 90 days. You can download it from here (http://dev.modern.ie/tools/vms/linux/).

The instructions are very sparse and they don’t specify a minimum version; however, if you use Virtual Box under Ubuntu 14.04, which means Virtual Box 4.3.10, you’ll not be able to import the Appliance, as you’ll get an error.

Update: Thanks to Razvan and Eric, readers who reported that this also works for Mac OS 10.9.5 + Virtual Box 4.3.12 and for VirtualBox 4.3.20 running under Windows 7, respectively.

‘Windows10_64’ is not a valid Guest OS type.

Result Code: NS_ERROR_INVALID_ARG (0x80070057)
Component: VirtualBox
Interface: IVirtualBox {fafa4e17-1ee2-4905-a10e-fe7c18bf5554}
Callee: IAppliance {3059cf9e-25c7-4f0b-9fa5-3c42e441670b}


I was looking to find a solution and found no solution on the Internet, so I decided to give a chance and try to fix it by myself.

The error is: ‘Windows10_64’ is not a valid Guest OS type. So obviously Windows10_64 is not on VirtualBox’s list yet, as it is a pretty new release. Microsoft could have shipped it with OS Type Windows 64 Other, or Windows 8 64 bits, but they didn’t. I wondered if I could edit the image to trick it into appearing as a recognized image.

I edited the file (MSEdge – Win10.ova) with Bless Hex Editor, a hexadecimal editor.

I looked for the String “Windows10_64” and found two occurrences.

I had to replace the string and leave it with the exact number of bytes it had, so the same length (do not insert additional bytes). I searched the list of supported OSes and found that “WindowsXP_64” would be a perfect match. I replaced that 10 with XP twice.
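
If you prefer the command line to a hex editor, a same-length replacement can also be attempted with GNU sed (a sketch; keep a backup copy first, and the new string must have exactly the same number of bytes):

# Keep a backup first (adjust the file name to match your download)
cp "MSEdge - Win10.ova" "MSEdge - Win10.ova.bak"
# Same-length, in-place replacement of the OS type string
sed -i 's/Windows10_64/WindowsXP_64/g' "MSEdge - Win10.ova"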

Then I tried to import the Appliance and it worked.

I tried to run it like that, but it froze on boot, with the new blue logo of Windows.

I figured that Windows XP would probably not be the most similar architecture, so I edited the config and set Windows 8.1 (64 bit). I also increased the RAM to 4096 MB and set 32 MB of memory for the video card.


Then I just started the VM and everything worked.


Ok, a funny note: Just started, it installed me an update without asking ;)

Scaling phantomjs with PHP

One of my clients had a problem with a Phantomjs Software.

I was asked to help in their project, that was relying on one of its features.

Phantomjs is an interesting project, but unfortunately it has not had enough maintenance and it suffers a terrible lack of documentation. The last contributions to the repo are from mid May, and infrequent. (The latest releases are from Feb 2015, see the Phantomjs releases on github.)

My client’s Software ran well for certain requests, but not for others, and after a random time, seconds or minutes, it became unresponsive.

My client wanted to fix that, or to use nodejs to scale their phantom code, or in the worst case to rewrite the code in nodejs. And it was urgent, because they were losing a lot of money due to their programs malfunctioning.

I began to investigate. This is the story of how I fixed it…

Connections becoming unresponsive

My client was using the Phantomjs webserver.

The problem with Phantom’s webserver is that it has a hard limit of 10 concurrent connections. After that, all subsequent http connections are queued until one slot becomes free.

So if you do a telnet to that port, the connection is accepted, but nothing happens. Even sending malformed GET requests.

My guess was that something in the process of parsing the requests was wrong, and then some of those 10 connections became frozen. I started to debug.

I implemented a timeout that would quit the worker after some time.

mTimerExit = setTimeout(forceExitByTimeout, DEFAULT_TIME_TO_EXIT);

Before exiting, it is important to clear the timers:

clearTimeout(mTimerExit);

I also implemented a debug mode to see what was going on, with a method consoleDebug that basically did console.log depending on whether a debug parameter was set to true.

My quick-win system was working, but many urls were still not being parsed by the phantomjs Engine.

Connecting with nodejs

My client had the bad experience of previous versions of Phantomjs crashing a lot.

So they had the idea of running nodejs as the main webserver, for scaling, and invoking Phantomjs from it.

I did several pieces of work along this line.

I tried to link it with nodejs using products like:

1) https://github.com/alexscheelmeyer/node-phantom

Unfortunately those packages are no longer maintained; the last update is from 2013.

It didn’t work. I found no documentation, and no traces of the errors.

I also got errors like:

XMLHttpRequest cannot load http://localhost:8888/start Origin file:// is not allowed by Access-Control-Allow-Origin

And I had to figure out what parameters to tune. I did it by starting phantomjs with the param:

 --web-security=false 

In the js scene, products and packages change very fast and sadly often break backwards compatibility.

So you’d better have a very well defined package.json that installs exactly the software versions that you need, or soon, when you deploy to another server, it will be a disaster.

2) https://www.npmjs.com/package/ghost-town

Ghost Town is a product that allows running phantomjs from inside nodejs.

It is a company maintained product, by a contributor, Teddy.

He was very nice replying to my questions, but it didn’t help.

The process was failing with no debug, no info.

The package really lacks documentation, and has only the same sample across all the web.

I provide this ghost-town code sample, in case it is useful for people looking for more:

var phantomClusterOptions = require("./phantomClusterOptions");
var town = require("ghost-town")(phantomClusterOptions);
var quality = require("./qualitynodephantom"); // Do not add .js
var PORT = 8080;

if (town.isMaster) {
    var express = require('express');
    var app = express();
    app.get('/', function(req, res) {
        // Every request comes here
        var data = {url:req.query.url,device:req.query.quality};

        town.queue(data, function(err,result) {
            res.set('Content-Type', 'text/plain');
            if (!err) {
                res.send(result);
            } else {
                res.send(err);
            }
        }, phantomClusterOptions.pageTries);

    });
    app.listen(PORT);
    console.log('App running');
} else {
    town.on("queue", function(page, data, callback) {
        town.phantom.set('onError', function(msg,trace){});
        // quality is the exported method, you pass the useful page object as parameter
        quality(page, data, function(str){
            callback(null, str);
        });
    });
    town.on("error", function(err) {console.log("error");});
}

And the file phantomClusterOptions has:

//Options here https://github.com/buzzvil/ghost-town
phantomClusterOptions = {
  //phantomBinary:'./phantomjs', //if you want to use a different phantomjs version
  //phantomBinary:'/usr/bin/phantomjs', 
  workerDeath: 3, //number of times that instance of phantom will be reused
  pageTries:5, //tries to the page before rejecting
  pageCount: 1, //number of pages analysed concurrently by the same phantom instance (1 is recommended)
  // This is for versions 1.9 and older of ghost-town
  //phantomFlags:['--load-images=no', '--local-to-remote-url-access=yes', '--ignore-ssl-errors=true', '--web-security=false', '--debug=true'] //flags (http://phantomjs.org/api/command-line.html)
// For v.2 and newer versions
  phantomFlags: {"load-images" : false, "local-to-remote-url-access" : true, "ignore-ssl-errors" : true, "web-security" : false, "debug" : true}
}
module.exports = phantomClusterOptions;

3) Other products

https://www.npmjs.com/package/node-phantom-simple

https://github.com/sgentle/phantomjs-node

I tried to debug with node debugger from command-line:

 

node debug myapp.js


And with node-debug (very nice integration with Chrome):

node-debug myapp.js


But I was unable to see what was failing. The nodejs App was up, and the ghost-town queue was increasing, but apparently the worker processing the queue was not working or was unable to execute phantomjs. But I saw no errors. When I switched the params for ghost-town to v.2, I got some exceptions, and it really looked like it was unable to execute Phantom, or perhaps phantomjs could not exec the .js due to some dependency problem.

(throw err and error spawn EACCES)

Also:

Error: /mypath/node_modules/ghost-town/node_modules/phantom/node_modules/dnode/node_modules/weak/build/Release/weakref.node: undefined symbol: node_module_register
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at bindings (/mypath/node_modules/ghost-town/node_modules/phantom/node_modules/dnode/node_modules/weak/node_modules/bindings/bindings.js:76:44)
    at Object.<anonymous> (/mypath/node_modules/ghost-town/node_modules/phantom/node_modules/dnode/node_modules/weak/lib/weak.js:7:35)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)

/mypath/node_modules/ghost-town/node_modules/phantom/node_modules/dnode/node_modules/weak/node_modules/bindings/bindings.js:83
        throw e

But I was unable to find more info on the net, I tried to install additional modules and I even straced the processes but I didn’t find the origin of the problem.

I was using:

npm install browserify express ghost-town phantom socket.io URIjs
async dnode forever node-phantom request underscore.string waitfor

About CentOs and Ubuntu

Some SysAdmins love CentOs. I’m in love with Ubuntu.

Basically, it is because of the package system. The packages are really well maintained.

Ubuntu has LTS (Long Term Support) versions, which last for 5 years.

And on the other hand, they release a new version every 6 months, so if you install a modern server, you have the latest stable packages of Software.

Working with Open Source, this is a really important point. As I have access to modern versions of PHP, Apache, Tomcat, etc…

To use phantomjs with CentOS you have to download the sources and compile it; it took about an hour on a commodity Cloud Virtual Server, and there were dependency problems. Also, a phantomjs compiled on one CentOS system didn’t work on a Server with a different CentOS version. So it was a bit painful to distribute across heterogeneous machines.

With an Ubuntu 14.04 LTS, just:

sudo apt-get install phantomjs

did the trick installing phantomjs (1.9.0-1)

Scaling with PHP

So we had the decision to make between:

  • rewriting completely the application to nodejs, that certainly would take time
  • to invest more time trying to determine why workers freeze under phantomjs

Phantomjs is a scriptable headless WebKit, so it was very convenient.

Nodejs is built on Chrome’s Javascript runtime, so it would do what we want to.

We had a time constraint, and for my client it was very important to have the system working asap.

So I decided to debug a bit more.

I found that urls were stopping loading at the event page.onNavigationRequested.

So I could keep the url and, after a timeout, force a page.open(url) inside the event if it had stopped (timed out):

mPage.onNavigationRequested = function(url, type, willNavigate, main) {

That was working, finally, but was not my favourite solution. I wanted to understand why it was failing initially.

The lack of documentation was frustrating, but debugging the problematic urls, I found that they were doing several redirections, and after some of them I was getting an SSL certificate error on one of the destination urls.

The issue had to do with badly configured chain certificates.

As nowadays there are many cheap SSL certificate providers, based on chain certificates, and many sites configure them wrong, phantomjs was sensitive to that and stopped following the urls.

I already had the param:

--ignore-ssl-errors=true

But investigating I found a very interesting contribution on stackoverflow from user Micah:

http://stackoverflow.com/questions/12021578/phantomjs-failing-to-open-https-site

Note that as of 2014-10-16, PhantomJS defaults to using SSLv3 to open HTTPS connections. With the POODLE vulnerability recently announced, many servers are disabling SSLv3 support.

To get around that, you should be able to run PhantomJS with:

phantomjs --ssl-protocol=tlsv1

Hopefully, PhantomJS will be updated soon to make TLSv1 the default instead of SSLv3.

I decided to give forcing the SSL version to TLSv1 a try:

--ssl-protocol=tlsv1

And it worked. It did the trick. All the urls were now being parsed right and following the redirects to the end (or to my timeout).

The problem and the solution have been there since October 2015, and the use of tlsv1 has still not been made the default in Phantomjs. I found that lack of maintenance disappointing.

That is why, when a multinational recently interviewed me and asked me about technologies like nodejs, I told them that I’m conservative until it is clear that the version has been proven stable. And I told them that, in any case, a member of the company should be a core member of the contributors to the technology. They were surprised, but they shouldn’t have been! They should have known what I told them! I explained to them that if you use a new technology in production, at least you should have a member of your staff in the core of that product. So you pay a guy to build an Open Source technology, basically. This guarantees that if a heavy bug or security flaw appears, you’ll not be stuck waiting for the next release. Your guy can fix it immediately and share the solution with the community.

Companies like google, Facebook or Amazon do that.

That conservativeness is what I showed in an interview with Facebook Operations, where I was asked about a scenario in which I would be requested by some Developers and DevOps to upgrade the Load Balancers Software. They were more for the action, and I said that LBs are critical, and I was told that everything in FB was critical. I argued that if a chat component fails, only the chat fails, but if the Load Balancers fail, everything fails, as they are the entry point. I had the confirmation that I was right when some months ago they had an outage lasting hours.

Sometimes you have to stay strong and defend your point, because you know you’re right. Even if you are in front of a person who doesn’t see things like you do and will take a decision that will leave you out. Being honest is priceless.

Scaling Phantomjs with PHP

So cool, the system was working fine.

But there was something that could be improved.

As Phantomjs had the limit of 10 connections in its webserver, that was the maximum number of concurrent connections it could serve at the same time, and so it was a bottleneck.

// Sample code to create a webserver from PhantomJS
mWebserver = require('webserver');
mServer = mWebserver.create();
console.log("Server created");
//consoleDebug('Debug enabled');
mService = mServer.listen(8080,{'keepAlive': true}, function(request, response) {
    //consoleDebug('URL:' + request.url);
    s_params = request.url;
    doRender(s_params, function(res) {
        //consoleDebug('Response from URL:' + request.url + ' (processed)');
        writeStringResponse(response,res);
    });
    //consoleDebug('URL:' + request.url + ' ready for processing');
});

I decided to propose to the company to use one of my tricks.

To launch phantomjs from PHP.

This means writing a wrapper to launch Phantomjs from the command line and get the response. I did the same in my CQLSÍ Cassandra wrapper around cqlsh, before Cassandra drivers for PHP were available. I also did this to connect the payment gateway of a bank, written in C, with the Java libraries from Ticketing Solutions in 1999.

That way the server would be able to process as many concurrent Phantomjs instances as we want, as each one would be running in its own process.

I modified the js code to remove the webserver functionality and to get parameters from command line.

var system = require('system');
var args = system.args;
var b_debug_write = false;

if (args.length < 2) {
    console.log("Minim 2 parameters");
    console.log("call with: phantomjs program.js http://myurl.com quality");
    console.log("Parameter debug is optional");
    args.forEach(function(arg, i) {
            console.log(i + ': ' + arg);
        });
    // Exit with error level 1
    phantom.exit(1);
}

var s_url = args[1];
var s_quality = args[2];

if (args.length > 3) {
    // Enable debug
    b_debug_write = true;
}

consoleDebug("Starting with url:" + s_url + " and quality:" + s_quality);

Then the PHP code:

<?php
/**
 * Creator: Carles Mateo
 * Date: 2015-05-11 11:56
 */

// Report all PHP errors
error_reporting(E_ALL);

$b_debug = false;

if (!isset($_GET['url']) || !isset($_GET['quality'])) {
    echo 'Invalid parameters';
    exit();
}

if (isset($_GET['debug'])) {
    $b_debug = true;
}

$s_url = $_GET['url'];
$s_quality = $_GET['quality'];

// Just in case is not decoded by the PHP installed
$s_url = urldecode($s_url);
// reencode url
$s_url = urlencode($s_url);

$s_script = '/mypath/myapp_commandline.sh';

$s_script_with_params = $s_script.' '.$s_url.' '.$s_quality;

if ($b_debug == true) {
    $s_script_with_params .= ' debug';
    echo 'Executing '.$s_script_with_params."<br />\n";
}

//$message=shell_exec("/var/www/scripts/testscript 2>&1");
$s_message = shell_exec($s_script_with_params);

header("Content-Type: text/plain");
echo $s_message;

And finally the bash script myapp_commandline.sh:

#!/bin/bash

PATH_QUALITY=/mypath/
#tlsv1 is recommended to avoid problems with certificates
PARAMETERS="--local-to-remote-url-access=yes --ignore-ssl-errors=true --web-security=false --ssl-protocol=tlsv1"
cd $PATH_QUALITY

#echo "Debug param1=$1 param2=$2 param3=$3"

if [ -z "$3" ]
  then
    phantomjs $PARAMETERS quality.js $1 $2
  else
    echo "Launching phantomjs with debug. url=$1 quality=$2"
    phantomjs $PARAMETERS quality.js $1 $2 $3
fi

If you don’t need to load the images you can speed up the thing with parameter:

--load-images=false

So finally we were able to use only 285 MB of RAM to handle more than 20 concurrent phantomjs processes.


Stopping definitively the massive Distributed DoS attack

I explain some final information and tricks and definitive guidance so you can totally stop the DDoS attacks affecting so many sites.

After effectively mitigating the previous DDoS Torrent attack to the point that it was no longer harmful, even with thousands of requests, I had another surprise.

It was morning when I had a brutal increase in the attacks and a savage spike of traffic and requests of a different nature, in addition to the BitTorrent attack, which never stopped but was no longer harmful. I was en route to work when the alarms went off on my smartphone. I stopped at a gas station and started to fix it from the car, in the parking area.

The Service was unresponsive, so the first thing I did was to close the Firewall (from the Cloud provider panel) for HTTP and HTTPS to the Front Web Servers.

Immediately after that I was able to log in to the Servers via SSH. I tried to reopen HTTPS for our customers and I was surprised to see that most of the traffic was coming through HTTPS. So I closed both again, called the CEO of the company to give a status update, and added a rule to the Firewall so the staff of the company would be able to work against the Backoffice Servers and browse the web while I was fixing all the mess. I also instructed my crew and gave instructions to the support team to help the business users and to deal with customer issues while I was stopping the attack.

I saw a lot of new attacks in the access logs.

For example, requests that looked like they were coming from the Android SDK. Weird.

I blocked those by modifying the index.php (code in the previous article)

// Patch urgency Carles to stop an attack based on Torrent and exploiting sdk's
// http://blog.carlesmateo.com
if (isset($_GET['info_hash']) || (isset($_GET['format']) && isset($_GET['sdk'])) || (isset($_GET['format']) && isset($_GET['id']))) {

(Note: If your setup allows it and the requested HOST is a different one (attacks to the ip), you can just check the requested HOST and block whatever is different, or otherwise implement an Apache rule to divert all the traffic to that route (like /announce for the Bittorrent attack) to another file.)

Then, with the cron job blocking the ip addresses, everything improved.

The problem was that I was receiving thousands of requests per second, and so adding those thousands of IPs to IPTABLES was not efficient; in fact it was so slow that the process was taking more than an hour. The server remained overwhelmed by thousands of new IPs per second while it only managed to block a few per second (and the performance of adding IPs degraded after around IP number 5,000). It was not the right strategy to fight back this attack.

Some examples of weird traffic over the http logs:

119.63.196.31 - - [01/Feb/2015:06:54:12 +0100] "GET /video/iiiOqybRvsM/images/av
atar/0097.jpg HTTP/1.1" 404 499 "-" "Baiduspider-image+(+http://www.baidu.com/se
arch/spider.htm)\\nReferer: http://image.baidu.com/i?ct=503316480&z=0&tn=baiduim
agedetail"
119.63.196.126 - - [01/Feb/2015:07:15:48 +0100] "GET /video/Ig3ebdqswQI/images/avatar/0114.jpg HTTP/1.1" 404 499 "-" "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)\\nReferer: http://image.baidu.com/i?ct=503316480&z=0&tn=baiduimagedetail"
180.153.206.20 - - [30/Jan/2015:06:34:34 +0100] "GET /contents/all/tokusyu/skin_detail.html HTTP/1.1" 404 492 "-" "Mozilla/4.0"
180.153.206.20 - - [30/Jan/2015:06:34:35 +0100] "GET /contents/gift/ban_gift/skin_gift.js HTTP/1.1" 404 490 "http://top.dhc.co.jp/contents/gift/ban_gift/skin_gift.js" "Mozilla/4.0"
180.153.206.20 - - [30/Jan/2015:06:34:36 +0100] "GET /contents/all/content/skin.js HTTP/1.1" 404 483 "http://top.dhc.co.jp/contents/all/content/skin.js" "Mozilla/4.0
220.181.108.105 - - [30/Jan/2015:06:35:22 +0100] "GET /image/DbLiteGraphic/201405/thumb_14773360.jpg?1414567871 HTTP/1.1" 404 505 "http://image.baidu.com/i?ct=503316480&z=0&tn=baiduimagedetail" "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)"
118.249.207.5 - - [30/Jan/2015:06:39:26 +0100] "GET /widgets.js HTTP/1.1" 404 0 "http://www.javlibrary.com/cn/vl_star.php?&mode=&s=azgcg&page=2" "Mozilla/5.0 (MSIE 9.0; qdesk 2.4.1266.203; Windows NT 6.1; WOW64; Trident/7.0; rv:11.0; QQBrowser/8.0.3197.400) like Gecko"
120.197.109.101 - - [30/Jan/2015:06:57:20 +0100] "GET /ads/2015/20minutes/publishing/stval/3001.html?pos=6 HTTP/1.1" 404 542 "-" "20minv3/6 CFNetwork/609.1.4 Darwin/13.0.0"
101.226.33.221 - - [30/Jan/2015:07:00:04 +0100] "GET /?LR_PUBLISHER_ID=29877&LR_SCHEMA=vast2-vpaid&LR_PARTNERS=758858&LR_CONTENT=1&LR_AUTOPLAY=1&LR_URL=http://play.retroogames.com/penguin-hockey-lite/?pc1oibq9f5&utm_source=4207933&entId=7a6c6f60cd27700031f2802842eb2457b46e15c0 HTTP/1.1" 200 11783 "http://play.retroogames.com/penguin-hockey-lite/?pc1oibq9f5&utm_source=4207933&entId=7a6c6f60cd27700031f2802842eb2457b46e15c0" "Mozilla/4.0"
180.153.206.20 - - [30/Jan/2015:07:03:10 +0100] "GET /?metric=csync&p=3030&s=6109924323866574861 HTTP/1.1" 200 11783 "http://v.youku.com/v_show/id_XODYzNzA0NTk2.html" "Mozilla/4.0"
180.153.114.199 - - [30/Jan/2015:07:05:18 +0100] "GET /?LR_PUBLISHER_ID=78817&LR_SCHEMA=vast2-vpaid&LR_PARTNERS=763110&LR_CONTENT=1&LR_AUTOPLAY=1&LR_URL=http://www.dnd.com.pk/free-download-malala-malala-yousafzai HTTP/1.1" 200 11783 "http://www.dnd.com.pk/free-download-malala-malala-yousafzai" "Mozilla/4.0"
60.185.249.230 - - [30/Jan/2015:07:08:20 +0100] "GET /scrape.php?info_hash=8%e3Q%9a%9c%85%bc%e3%1d%14%213wNi3%28%e9%f0. HTTP/1.1" 404 459 "-" "Transmission/2.77"
180.153.236.129 - - [30/Jan/2015:07:11:10 +0100] "GET /c/lf-centennial-services-hong-kong-limited/hk041664/ HTTP/1.1" 404 545 "http://hk.kompass.com/c/lf-centennial-services-hong-kong-limited/hk041664/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider(compatible; HaosouSpider; http://www.haosou.com/help/help_3_2.html)"
36.106.38.13 - - [30/Jan/2015:07:18:28 +0100] "GET /op/icon?id=com.wsw.en.gm.ArrowDefense HTTP/1.1" 404 0 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
180.153.114.199 - - [30/Jan/2015:07:19:12 +0100] "GET /?metric=csync&p=3030&s=6109985239380459533 HTTP/1.1" 200 11783 "http://vox-static.liverail.com/swf/v4/admanager.swf" "Mozilla/4.0"
180.153.206.20 - - [30/Jan/2015:07:39:56 +0100] "GET /?metric=csync HTTP/1.1" 20
0 11783 "http://3294027.fls.doubleclick.net/activityi;src=3294027;type=krde;cat=
krde_060;ord=94170148950.07014?" "Mozilla/4.0"
220.181.108.139 - - [30/Jan/2015:07:47:08 +0100] "GET /asset/assetId/5994450/size/large/ts/1408882797/type/library/client/WD-KJLAK/5994450_large.jpg?token=4e703a62ec08791e2b91ec1731be0d13&category=pres&action=thumb HTTP/1.1" 404 550 "http://www.passion-nyc.com/" "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)"

And over HTTPS:

120.197.26.43 - - [29/Jan/2015:00:04:25 +0100] "GET /c/5356/cc.js?ns=_cc5356 HTTP/1.1" 404 11676 "http://m.accuweather.com/zh/cn/guangzhou/102255/current-weather/102255?p=huawei2" "HUAWEI Y325-T00_TD/V1 Linux/3.4.5 Android/2.3.6 Release/03.26.2013 Browser/AppleWebKit533.1 Mobile Safari/533.1;"
120.197.26.43 - - [29/Jan/2015:00:05:32 +0100] "GET /c/5356/cc.js?ns=_cc5356 HTTP/1.1" 404 11674 "http://m.accuweather.com/zh/cn/guangzhou/102255/weather-forecast/102255" "HUAWEI Y325-T00_TD/V1 Linux/3.4.5 Android/2.3.6 Release/03.26.2013 Browser/AppleWebKit533.1 Mobile Safari/533.1;"
117.136.1.106 - - [29/Jan/2015:06:35:05 +0100] "GET /v2.2/237613769760602?format=json&sdk=android&fields=supports_attribution%2Csupports_implicit_sdk_logging%2Cgdpv4_nux_content%2Cgdpv4_nux_enabled%2Candroid_dialog_configs HTTP/1.1" 404 6304 "-" "FBAndroidSDK.3.20.0"
117.136.1.106 - - [29/Jan/2015:06:35:11 +0100] "GET /v2.2/237613769760602?format=json&sdk=android&fields=supports_attribution%2Csupports_implicit_sdk_logging%2Cgdpv4_nux_content%2Cgdpv4_nux_enabled%2Candroid_dialog_configs HTTP/1.1" 404 6308 "-" "FBAndroidSDK.3.20.0"
117.136.1.107 - - [29/Jan/2015:06:35:36 +0100] "GET /v2.2/1493038024241481?fields=name,supports_implicit_sdk_logging,gdpv4_nux_enabled,gdpv4_nux_content,ios_dialog_configs,app_events_feature_bitmask&format=json&sdk=ios HTTP/1.1" 404 6731 "-" "FBiOSSDK.3.22.0"
180.175.55.177 - public [29/Jan/2015:07:40:23 +0100] "GET /public/checksum?a=8&t=p&v=2.5.0&c=d62c8d04f74057d51872d5dc5cdab098 HTTP/1.0" 404 39668 "https://rc-regkeytool.appspot.com/public/checksum?a=8&t=p&v=2.5.0&c=d62c8d04f74057d51872d5dc5cdab098" "Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1C28 Safari/419.3"

So, I was seeing requests typically made by applications like Facebook, Android smartphones going to the Google store, and a lot of other applications’ and sites’ traffic, that was being diverted to my IP.

So I saw that this attack was associated with the IP address. In an extreme case I could change the IP and gain some hours to react.

So if those requests were coming from legitimate applications I saw two possibilities:

1. A zombie network controlled by pirates through malware/viruses, that changes /etc/hosts so those devices go to my IP address. An unlikely scenario, as there was a lot of mobile traffic and that is not the easiest target for those attacks.

2. A DNS Spoofing attack targeting our IPs

That is an attack that cheats DNS servers to make them believe that a name resolves to a different IP. (That’s why SSL certificates are so important, as are the warnings that the browser raises if the server doesn’t send the right certificate: they allow you to detect that you’re connecting to a fake/evil server and avoid logging in, etc… if you have been directed to a bad IP by a poisoned DNS.)

I added some traces to get more information; basically I dumped the $_SERVER to a file:

$s_server_dump = var_export($_SERVER, true);

file_put_contents($s_debug_log_file, $s_server_dump."\n", FILE_APPEND | LOCK_EX);

array (
  'REDIRECT_SCRIPT_URL' => '/announce',
  'REDIRECT_SCRIPT_URI' => 'http://jonnyfavorite3.appspot.com/announce',
  'REDIRECT_STATUS' => '200',
  'SCRIPT_URL' => '/announce',
  'SCRIPT_URI' => 'http://jonnyfavorite3.appspot.com/announce',
  'HTTP_HOST' => 'jonnyfavorite3.appspot.com',
  'HTTP_USER_AGENT' => 'Bittorrent',
  'HTTP_ACCEPT' => '*/*',
  'HTTP_CONNECTION' => 'closed',
  'PATH' => '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin',
  'SERVER_SIGNATURE' => '<address>Apache/2.4.7 (Ubuntu) Server at jonnyfavorite3.appspot.com Port 80</address>',
  'SERVER_SOFTWARE' => 'Apache/2.4.7 (Ubuntu)',
  'SERVER_NAME' => 'jonnyfavorite3.appspot.com',
  'SERVER_ADDR' => 'deleted',
  'SERVER_PORT' => '80',
  'REMOTE_ADDR' => '60.219.115.77',
  'DOCUMENT_ROOT' => 'deleted',
  'REQUEST_SCHEME' => 'http',
  'CONTEXT_PREFIX' => '',
  'CONTEXT_DOCUMENT_ROOT' => 'deleted',
  'SERVER_ADMIN' => 'deleted',
  'SCRIPT_FILENAME' => 'deleted',
  'REMOTE_PORT' => '27361',
  'REDIRECT_QUERY_STRING' => 'info_hash=%26%27%0F%C2%EB%03J%E2%1F%B0%28%2B%29d%7C%8C%FE%C8l%E9&peer_id=%2DSD0100%2Dj%7E%C3%1C%14%FFsj%DA9%0B%27&ip=10.155.31.47&port=9978&uploaded=2244978664&downloaded=2244978664&left=2570039513&numwant=200&key=511&compact=1',
  'REDIRECT_URL' => '/announce',
  'GATEWAY_INTERFACE' => 'CGI/1.1',
  'SERVER_PROTOCOL' => 'HTTP/1.0',
  'REQUEST_METHOD' => 'GET',
  'QUERY_STRING' => 'info_hash=%26%27%0F%C2%EB%03J%E2%1F%B0%28%2B%29d%7C%8C%FE%C8l%E9&peer_id=%2DSD0100%2Dj%7E%C3%1C%14%FFsj%DA9%0B%27&ip=10.155.31.47&port=9978&uploaded=2244978664&downloaded=2244978664&left=2570039513&numwant=200&key=511&compact=1',
  'REQUEST_URI' => '/announce?info_hash=%26%27%0F%C2%EB%03J%E2%1F%B0%28%2B%29d%7C%8C%FE%C8l%E9&peer_id=%2DSD0100%2Dj%7E%C3%1C%14%FFsj%DA9%0B%27&ip=10.155.31.47&port=9978&uploaded=2244978664&downloaded=2244978664&left=2570039513&numwant=200&key=511&compact=1',
  'SCRIPT_NAME' => '/index.php',
  'PHP_SELF' => '/index.php',
  'REQUEST_TIME_FLOAT' => 1422528394.036,
  'REQUEST_TIME' => 1422528394,
)

Some fields were replaced with ‘deleted’ for security.

OK, so the client was sending a Host header:

jonnyfavorite3.appspot.com

This address resolves as:

ping jonnyfavorite3.appspot.com
PING appspot.l.google.com (74.125.24.141) 56(84) bytes of data.
64 bytes from de-in-f141.1e100.net (74.125.24.141): icmp_seq=1 ttl=52 time=1.53 ms

So it resolves to Google's appspot.

Next one:

array (
  'REDIRECT_SCRIPT_URL' => '/v2.2/1493038024241481',
  'REDIRECT_SCRIPT_URI' => 'https://graph.facebook.com/v2.2/1493038024241481',
  'REDIRECT_HTTPS' => 'on',
  'REDIRECT_SSL_TLS_SNI' => 'graph.facebook.com',
  'REDIRECT_STATUS' => '200',
  'SCRIPT_URL' => '/v2.2/1493038024241481',
  'SCRIPT_URI' => 'https://graph.facebook.com/v2.2/1493038024241481',
  'HTTPS' => 'on',
  'SSL_TLS_SNI' => 'graph.facebook.com',
  'HTTP_HOST' => 'graph.facebook.com',
  'HTTP_ACCEPT_ENCODING' => 'gzip, deflate',
  'CONTENT_TYPE' => 'multipart/form-data; boundary=3i2ndDfv2rTHiSisAbouNdArYfORhtTPEefj3q2f',
  'HTTP_ACCEPT_LANGUAGE' => 'zh-cn',
  'HTTP_CONNECTION' => 'keep-alive',
  'HTTP_ACCEPT' => '*/*',
  'HTTP_USER_AGENT' => 'FBiOSSDK.3.22.0',
  'PATH' => '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin',
  'SERVER_SIGNATURE' => '<address>Apache/2.4.7 (Ubuntu) Server at graph.facebook.com Port 443</address>',
  'SERVER_SOFTWARE' => 'Apache/2.4.7 (Ubuntu)',
  'SERVER_NAME' => 'graph.facebook.com',
  'SERVER_ADDR' => 'deleted',
  'SERVER_PORT' => '443',
  'REMOTE_ADDR' => '223.100.159.96',
  'DOCUMENT_ROOT' => 'deleted',
  'REQUEST_SCHEME' => 'https',
  'CONTEXT_PREFIX' => '',
  'CONTEXT_DOCUMENT_ROOT' => 'deleted',
  'SERVER_ADMIN' => 'deleted',
  'SCRIPT_FILENAME' => 'deleted',
  'REMOTE_PORT' => '24797',
  'REDIRECT_QUERY_STRING' => 'fields=name,supports_implicit_sdk_logging,gdpv4_nux_enabled,gdpv4_nux_content,ios_dialog_configs,app_events_feature_bitmask&format=json&sdk=ios',
  'REDIRECT_URL' => '/v2.2/1493038024241481',
  'GATEWAY_INTERFACE' => 'CGI/1.1',
  'SERVER_PROTOCOL' => 'HTTP/1.1',
  'REQUEST_METHOD' => 'GET',
  'QUERY_STRING' => 'fields=name,supports_implicit_sdk_logging,gdpv4_nux_enabled,gdpv4_nux_content,ios_dialog_configs,app_events_feature_bitmask&format=json&sdk=ios',
  'REQUEST_URI' => '/v2.2/1493038024241481?fields=name,supports_implicit_sdk_logging,gdpv4_nux_enabled,gdpv4_nux_content,ios_dialog_configs,app_events_feature_bitmask&format=json&sdk=ios',
  'SCRIPT_NAME' => '/index.php',
  'PHP_SELF' => '/index.php',
  'REQUEST_TIME_FLOAT' => 1422528399.4159999,
  'REQUEST_TIME' => 1422528399,
)

So that was it: connections were arriving at the Server/Load Balancer requesting addresses belonging to Google, Facebook, The Pirate Bay's main tracker, etc… tens of thousands of users simultaneously.

Clearly there was some kind of DNS spoofing attack going on. And looking at the ACCEPT_LANGUAGE I saw that the clients were using the Chinese language (zh-cn).

Changing the IP address would take some time, as some DNS records have a TTL of one hour or several hours (to improve customers' page speed), and both IPs would have to coexist for a while until all the clients' DNS servers had refreshed, browsers had refreshed their DNS caches, and the DNS of partners consuming our APIs had refreshed.

You can have a blank default virtual host, but for some projects your virtual host needs to listen for all requests: for example, imagine that you serve contents like barcelona.yourdomain.com, newyork.yourdomain.com and whatever-changing.yourdomain.com (carlesmateo.yourdomain.com, anotheruser.yourdomain.com, anyuserwillhaveitsownurl.yourdomain.com…), and your code processes the host to get the information and render the pages; in that case it is not always possible to have a default one.
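
As a sketch of that idea (yourdomain.com is just a placeholder, not a real configuration of mine), such a wildcard setup can still reject foreign hosts by validating the domain suffix instead of an exact name:

<?php
// Sketch: accept yourdomain.com and any single-level subdomain of it, reject the rest.
$s_host = isset($_SERVER['HTTP_HOST']) ? strtolower($_SERVER['HTTP_HOST']) : '';

if (!preg_match('/^([a-z0-9-]+\.)?yourdomain\.com$/', $s_host)) {
    // Not one of our hosts: answer with a minimal response and stop
    http_response_code(406);
    exit();
}

// The optional subdomain (e.g. "barcelona") selects what content to render
$s_subdomain = rtrim(str_replace('yourdomain.com', '', $s_host), '.');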

The first thing I had to do, as the attack was coming to the IP, was to not allow any host requests that did not belong to the hosts served. That way my framework would not process any request that was not for the exact domains I expected, and I would save the 404 processing time.

Also, if the servers return a blank page with just a few bytes (headers), this hurts the bandwidth much less when under a DDoS. The available bandwidth is one of the weak points when suffering a DDoS.

To block hosts that are not expected there are several things that can be done, for example:

  • Nginx can easily block those requests not matching the hosts.
  • Modify my script to look at
    HTTP_HOST

    So:
    if (isset($_SERVER['HTTP_HOST']) && ($_SERVER['HTTP_HOST'] != 'www.mydomain.com' && $_SERVER['HTTP_HOST'] != 'mydomain.com')) {
        exit();
    }

    That way any request that is not for the exact domain will be halted quickly.
    You can also check the HTTP_HOST variable with strpos($_SERVER['HTTP_HOST'], 'mydomain.com') instead.

  • Create an Apache default virtual host and a default SSL one. With a blank page the load was so small that even tens of thousands of IPs could not meaningfully affect the server. (Max load was around 2.5%)

Optimization: If you want to save IOPS and network bandwidth to the storage you can create a ramdisk, and have the index.html of the default virtual host in there. This will increase server speed and reduce overhead.

If you’re in a hurry to stop the attack you can also:

  • Change the IP: get a new IP address, point the DNS to the new one, and hope it is not attacked too
  • An emergency solution, drastic but effective, would be to block China's entire IP ranges

Some of the sites I helped reported attacks from outside China too, but nothing comparable: the gross of the attacks, 99%, came from China.

Sadly Amazon's Cloud Firewall does not allow denying specific IP addresses; it only allows services through a default-deny policy.

I reopened the firewall on ports 80 and 443, so traffic was allowed for everyone again, checked that the Service was OK and continued my way to the office.

Later, when I spoke with some friends, one of them told me about the Chinese government firewall causing this sort of problem worldwide! Look at this interesting article:

http://blog.sucuri.net/2015/01/ddos-from-china-facebook-wordpress-and-twitter-users-receiving-sucuri-error-pages.html

It is not clear if those DDoS attacks are caused by the Chinese Firewall, by a bug, or by pirates exploiting it to attack sites for money. But it is clear that all those attacks and traffic come from Chinese IP addresses. Obviously these DNS issues also prevent people browsing from China from reaching Facebook, WordPress, Twitter, etc…

Here you can see live the origin of the world DDoS attacks:

http://www.digitalattackmap.com/#anim=1&color=0&country=ALL&list=0&time=16465&view=map

And China being the source of most of the DDoS attacks.

[Screenshot: Digital Attack Map showing China as the main source of DDoS attacks]

Stopping a BitTorrent DDoS attack

After the success of the article on stopping an XML-RPC attack against a WordPress site, and the thank-you messages (I actually helped a company that was being taken down every day and had asked me for help), it's time to explain how to stop an attack of a much more evil kind.

The first sign I saw was that the server was getting slower and slower, which is nearly impossible, as I set up a very good server and used a lot of good development techniques to avoid bottlenecks.

I looked at the server and saw about 3,000 SYN_SENT packets. Apparently we were under a SYN Flood attack.

[Screenshot: high load during the attack]

netstat revealed more than 6,000 different IP addresses connecting to the Server.

The Server had only 30 GB of RAM and it started to fill up: with more and more connections, and so more Apache processes spawned to keep responding fast to the real users, it was clear that it was going to struggle.

I improved the configuration of Apache so the Server would be able to handle many more connections with less memory consumption and overhead, added some enhancements for blocking SYN Flood attacks, and restarted the Apache Server.

I greatly reduced the impact of the attacks, but I knew it would only get worse. I was buying time while not disrupting the functioning of the website.

Over the next hours the attacks increased to around 7,500 simultaneous concurrent connections. The memory was reaching its limits, so I decided it was time to upgrade the instance. I doubled the memory and added many more cores, up to 36, by using one of the newest Amazon c4.8xlarge instances.

[Screenshot: heavy load on the c4.8xlarge instance]

The good thing about the Cloud is that you pay for the time you use the resources. So when the waters calm down again, I'm able to reduce the size of the instance and save the company some hundreds.

I knew it was a matter of time. The server stabilized at using 40 GB out of the 60 GB, but I knew the pirates would keep trying to shut down the service.

Once the SYN Flood was stopped and I was sure that the service was safe for a while, I checked the logs to see if I could detect a pattern among the attacks. I did.

[Screenshot: access log showing the BitTorrent requests]

Most of the requests we were receiving were for a file called announce.php, which obviously did not exist on the server, and so it was returning a 404 error.

The user agent reported in many cases BitTorrent, or a Torrent-compatible product, and the URL was sending a hash, uploaded, downloaded, left… so I realized that somehow my Server had become the target of a Torrent attack, in which someone indicated that the Server was a Torrent tracker.

As the .htaccess in frameworks like Laravel, Catalonia Framework… and CMSs like WordPress, Joomla, eZ Publish… tries to read the file from the filesystem and, if it doesn't exist, serves index.php, as a first action I created a file /announce.php that simply did an exit();

Sample .htaccess from Laravel:

<IfModule mod_rewrite.c>
    <IfModule mod_negotiation.c>
        Options -MultiViews
    </IfModule>

    RewriteEngine On

    # Redirect Trailing Slashes...
    RewriteRule ^(.*)/$ /$1 [L,R=301]

    # Handle Front Controller...
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteRule ^ index.php [L]
</IfModule>

Sample code for announce.php would be like:

<?php
/**
 * Creator: Carles Mateo
 * Date: 2015-01-21 Time: 09:39
 */

// A cheap way to stop an attack based on requesting this file
http_response_code(406);
exit();

The response code 406 was an attempt to see if the BitTorrent clients were sensitive to headers and would stop. But they didn't.

With this simple addition of announce.php, with exit(), I reduced the load on the Server from 90% to 40% in just one second.

The reason a not-found page was causing so much damage is that the Server's 404 error page is personalized and offers alternative results (assuming the product you were looking for is no longer available), and before displaying it the whole Framework is loaded and the routes are checked to see if the URL matches, so there is some work to be done on the PHP side (it takes 100 ms to reply, which is not much, but it was CPU that did not need to be wasted). Even being very optimized, every single not-found URL was causing some processing and CPU waste. Since the attack had more than 7,000 different IPs coming to the Server simultaneously, it would become a problem at some point and start returning 500 errors to the customers.

The logs were also showing other patterns, for example:

announce?info_hash…

So without the .php extension. Those kinds of requests would not go through my wall file announce.php but through index.php (as the .htaccess directs anything not found there).

I could change the .htaccess to send those requests to hell, but I wanted a more definitive solution, something that would prevent the Server from wasting CPU and would allow the Servers to resist an attack a thousand times harder.

In the end, the common pattern was that the BitTorrent clients were sending via GET a parameter called info_hash, so I blocked all the requests through that.

I wrote this small program and added it to index.php:

// Patch urgency Carles to stop an attack based on Torrent
// http://blog.carlesmateo.com
if (isset($_GET['info_hash'])) {

    // In case you use CDN, proxy, or load balancer
    $s_ip_proxy = '';

    $s_ip_address = $_SERVER['REMOTE_ADDR'];


    // Warning: if you use a CDN, a proxy server or a load balancer, do not add its ip to the blacklist
    if ($s_ip_proxy == '' || ($s_ip_proxy != '' && $s_ip_address != $s_ip_proxy)) {
        $s_date = date('Y-m-d');

        $s_ip_log_file = '/tmp/ip-to-blacklist-'.$s_date.'.log';
        file_put_contents($s_ip_log_file, $s_ip_address."\n", FILE_APPEND | LOCK_EX);
    }


    // 406 means 'Not Acceptable'
    http_response_code(406);
    exit();
}

 

Please note: this code can be added to any Software like Zend Framework, Symfony, Catalonia Framework, Joomla, WordPress, Drupal, ezpublish, Magento… just add those lines at the beginning of public/index.php, just before the action of the Framework starts. Only be careful that after a core update you'll have to reapply it.

After that I deleted the no-longer-needed announce.php

What the program does, if you have not defined a proxy/CDN IP, is to write the IP connecting with the Torrent request pattern to a log file called, for example:

/tmp/ip-to-blacklist-2015-01-23.log

And it also calls exit(), so stopping the execution and saving many CPU cycles.

The idea of the date suffix is to blacklist the IPs only for 24 hours, as we will see later.

With this I reduced the CPU consumption to around 5-15%.

Then there is the other part of stopping the attack: a bash program that can be run from the command line or added to cron to be launched, depending on the volume of the attacks, every 5 minutes or every hour.

[Screenshot: blocking the traffic]

ip_blacklist.sh

#!/bin/bash
# Ip blacklister by Carles Mateo
s_DATE=$(date +%Y-%m-%d)
s_FILE=/tmp/ip-to-blacklist-$s_DATE.log
s_FILE_UNIQUE=/tmp/ip-to-blacklist-$s_DATE-unique.log
cat $s_FILE | sort | uniq > $s_FILE_UNIQUE

echo "Counting the ip addresses to block in $s_FILE_UNIQUE"
cat $s_FILE_UNIQUE | wc -l

sleep 3
# We clear the iptables rules
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT

# To list the rules sudo iptables -L
# /sbin/iptables -L INPUT -v -n
# Enable ssh for all (you can add a Firewall at Cloud provider level or restrict the rule to your ip)
sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT

for s_ip_address in `cat $s_FILE_UNIQUE`
do
    echo "Blocking traffic from $s_ip_address"
    sudo iptables -A INPUT -s $s_ip_address -p tcp --destination-port 80 -j DROP
    sudo iptables -A INPUT -s $s_ip_address -p tcp --destination-port 443 -j DROP
done

# Ensure Accept traffic on Port 80 (HTTP) and 443 (HTTPS)
sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 443 -j ACCEPT

# To block the rest
# sudo iptables -A INPUT -j DROP

# Use iptables-save and iptables-restore to make these changes permanent
# sudo sh -c "iptables-save > /etc/iptables.rules"
# sudo pre-up iptables-restore < /etc/iptables.rules
# https://help.ubuntu.com/community/IptablesHowTo

This script reads the list of IP addresses, writes the list of unique IPs into another file, and then loops over them, adding all of them to iptables, the Linux Firewall, and blocking them from accessing the web on port 80 (http) or 443 (https, ssl). You can also block all the ports for those IPs if you want.

With this CPU use went to 0%.

Note: one of my colleagues, a wonderful SysAdmin at the ISP Ackstorm, points out that some of you may prefer using REJECT instead of DROP. There is an interesting conversation on Server Fault about this.

After fixing the problem I searched the Internet for people reporting attacks like the one I suffered. The most interesting thing I found was this article: BotTorrent: Misusing BitTorrent to Launch DDoS Attacks, from the University of California, Irvine. (local copy on this website BotTorrent)

Basically any site on the Internet can be attacked at large scale, as every user downloading the Torrent will try to connect to the innocent Server to report the progress of the down/upload. If this attack is performed with hundreds of files, it means hundreds of thousands of IPs connecting to the Server… the server will run out of connections, or memory, or the bandwidth will be saturated by the bad traffic.

I saw that the attackers were using porn files that were heavily downloaded, and apparently telling the Torrent network that our Server was a Torrent tracker. Corroborating my hypothesis, all the people downloading those Torrents were sending updates to our Server, believing that our Server was a tracker. A trick from the sad pirates.

Some people, business users, asked me who could be interested in injuring others' servers or disrupting others' businesses without any immediate gain (like controlling your Servers to send Spam).

I told:

  • Competitors that hate you because you're successful and want to disrupt your business (they pay the pirates to do the attacks; I've helped companies that were taken down by those pirates)
  • Investors that may want to buy you at a cheaper price (after badly trolling you for a week or two)
  • False “security” companies that will offer their services “casually” when you most need them and charge a high bill
  • Pirates that want to extort you

So, bad people that, instead of using their talent to create, just destroy and do evil to others.

In other cases it could simply be bad luck: being assigned an IP that previously hosted a Torrent tracker. That doesn't make much sense in the Cloud, as it is expensive, but it could be that a Server with that IP was hacked and used as a tracker for a while.

Also, governments wanting to disrupt services (like Torrent) could clumsily redirect DNS names to random IPs, or entertainment companies trying to shut down Torrent trackers could try to poison DNS to stop users from using BitTorrent.

 

See the definitive solution in the next article.

Stopping and investigating a WordPress xmlrpc.php attack

One of my Servers got heavily attacked for several days. I describe here the steps I took to stop this.

The attack consisted in several connections per second to the Server, to path /xmlrpc.php.

This is a WordPress file to control the pingback, when someone links to you.

My Server is a small Amazon instance, an m1.small with only one core and 1.6 GB of RAM, magnetic disks, and it scores a modest 203 CMIPS (my slow laptop scores 460 CMIPS).

Those massive connections caused the server to use more and more RAM, and as the xmlrpc requests were taking many seconds to reply, more and more Apache processes were spawned. That led to more memory consumption, to using all the available RAM and starting to swap, with a heavy performance impact, until all the memory was exhausted and the MySQL processes stopped.

I realized I was suffering an attack after the shutdown of MySQL. I checked the CloudWatch Statistics from Amazon AWS and it was clear that I was receiving many more requests than normal. The I/O was really high too.

These statistics cover from three days ago until today; look at the spikes when the attack was hitting hard and how relaxed the Server is now (flat line).

[Screenshot: usage statistics for the last 3 days]

First I decided to simply rename the xmlrpc.php file as a quick solution to stop the attack, but the number of http connections kept growing, and then I saw very suspicious queries to the database.

[Screenshot: suspicious queries]

Those queries, in addition to what I had seen in Apache's error log, suggested that maybe the Server had been hacked through a WordPress/plugin bug and that now they were trying to hide from the database's logs. (Especially the DELETE FROM wp_useronline WHERE user_ip = the IP of the attacker)

[Tue Aug 26 11:47:08 2014] [error] [client 94.102.49.179] Error in WordPress Database Lost connection to MySQL server during query a la consulta SELECT option_value FROM wp_options WHERE option_name = 'uninstall_plugins' LIMIT 1 feta per include('wp-load.php'), require_once('wp-config.php'), require_once('wp-settings.php'), include_once('/plugins/captcha/captcha.php'), register_uninstall_hook, get_option
[Tue Aug 26 11:47:09 2014] [error] [client 94.102.49.179] Error in WordPress Database Lost connection to MySQL server during query a la consulta SELECT option_value FROM wp_options WHERE option_name = 'uninstall_plugins' LIMIT 1 feta per include('wp-load.php'), require_once('wp-config.php'), require_once('wp-settings.php'), include_once('/plugins/captcha/captcha.php'), register_uninstall_hook, get_option
[Tue Aug 26 11:47:10 2014] [error] [client 94.102.49.179] Error in WordPress Database Lost connection to MySQL server during query a la consulta SELECT option_value FROM wp_options WHERE option_name = 'widget_wppp' LIMIT 1 feta per include('wp-load.php'), require_once('wp-config.php'), require_once('wp-settings.php'), do_action('plugins_loaded'), call_user_func_array, wppp_check_upgrade, get_option

The error log was very ugly.

The access log was not reassuring either, as it showed many attacks like these:

94.102.49.179 - - [26/Aug/2014:10:34:58 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
94.102.49.179 - - [26/Aug/2014:10:34:59 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
127.0.0.1 - - [26/Aug/2014:10:35:09 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.2.22 (Ubuntu) (internal dummy connection)"
94.102.49.179 - - [26/Aug/2014:10:34:59 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
94.102.49.179 - - [26/Aug/2014:10:34:59 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
94.102.49.179 - - [26/Aug/2014:10:35:00 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
94.102.49.179 - - [26/Aug/2014:10:34:59 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"

It was difficult to determine whether the Server was receiving SQL injections, so I wanted to be sure.

Note: the connection from 127.0.0.1 with OPTIONS is created by Apache when it spawns another Apache process.

As I had super fresh backups in another Server I was not afraid of the attack dropping the database.

I was also a bit suspicious because the /readme.html file said that the WordPress version was 3.6. In other installations it correctly says that the version is 3.9.2, and that file is kept current by the auto-update. I was thinking about a possible, very sophisticated, trojan attack able to modify wp-includes/version.php and set a fake $wp_version = '3.9.2';
Later I realized that this blog had WordPress in Catalan, my native language, and discovered that the guys doing the translations had forgotten to update this file (in new installations it comes outdated, and so shows 3.6). I have alerted them.

In fact, later I did a diff of all the files of my WordPress installation against the official WordPress 3.9.2-ca, and then a diff between WordPress 3.9.2-ca and WordPress 3.9.2 (English, the default), and found no differences. My Server was OK. But at this point, at the beginning of the investigation, I didn't know that yet.

With the info I had (queries, times, the attack, readme telling v. 3.6…) I weighed the possibility that I was in front of something, and I decided that I had a unique opportunity to discover how they injected that SQL, or to discover whether my Server was compromised and how. The bad point is that it was the same Amazon Server where this blog resides, and I wanted the attack to continue so I could get more information, so for two days I was recording logs and doing some investigation. Sorry if you visited my blog and the database was down, or the Server was going extremely slow. I needed that info. It was worth it.

First I changed the Apache config so the massive connections impacted the Server a bit less and I could work on it while the attack was going on.

I informed my group of Senior friends about what was going on, and two SysAdmins gave me good suggestions on other logs to watch and on how to stop the attack; later a Developer joined me to look at the logs and pointed out possible solutions. Basically all of them suggested blocking the incoming connections with iptables and doing things like reinstalling WordPress, disabling xmlrpc.php in .htaccess, changing passwords or moving wp-admin/ to another place, but the point is that I wanted to understand exactly what was going on and how.

I checked the logs, certificates, etc… and no one other than me was accessing the Server. I also double-checked the Amazon Firewall to be sure that no unnecessary ports were left open. Everything was OK.

I took a look at the Apache logs for the site and all the attacks were coming from the same Ip:

94.102.49.179

It is an IP from a dedicated Server company called ecatel.net. I reported the abuse to the abuse address indicated in the ripe.net database for that range.

I found that many people have complaints about this provider, and reports of them ignoring requests to stop the spam coming from their servers, so I decided that after my tests I would block their entire network from accessing my sites.

All the requests shown in the access.log were for /xmlrpc.php. It was the only path requested by the attacker, so apparently that IP did nothing else.

I added some logging to the WordPress xmlrpc.php file:

if ($_SERVER['REMOTE_ADDR'] == '94.102.49.179') {
    error_log('XML POST: '.serialize($_POST));
    error_log('XML GET: '.serialize($_GET));
    error_log('XML REQUEST: '.serialize($_REQUEST));
    error_log('XML SERVER: '.serialize($_SERVER));
    error_log('XML FILES: '.serialize($_FILES));
    error_log('XML ENV: '.serialize($_ENV));
    error_log('XML RAW: '.$HTTP_RAW_POST_DATA);
    error_log('XML ALL_HEADERS: '.serialize(getallheaders()));
}

This was the result, it is always the same:

[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML POST: a:0:{}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML GET: a:0:{}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML REQUEST: a:0:{}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML SERVER: a:24:{s:9:"HTTP_HOST";s:24:"barcelona.afterstart.com";s:12:"CONTENT_TYPE";s:8:"text/xml";s:14:"CONTENT_LENGTH";s:3:"287";s:15:"HTTP_USER_AGENT";s:50:"Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)";s:15:"HTTP_CONNECTION";s:5:"close";s:4:"PATH";s:28:"/usr/local/bin:/usr/bin:/bin";s:16:"SERVER_SIGNATURE";s:85:"<address>Apache/2.2.22 (Ubuntu) Server at barcelona.afterstart.com Port 80</address>\n";s:15:"SERVER_SOFTWARE";s:22:"Apache/2.2.22 (Ubuntu)";s:11:"SERVER_NAME";s:24:"barcelona.afterstart.com";s:11:"SERVER_ADDR";s:14:"[this-is-removed]";s:11:"SERVER_PORT";s:2:"80";s:11:"REMOTE_ADDR";s:13:"94.102.49.179";s:13:"DOCUMENT_ROOT";s:29:"/var/www/barcelona.afterstart.com";s:12:"SERVER_ADMIN";s:19:"webmaster@localhost";s:15:"SCRIPT_FILENAME";s:40:"/var/www/barcelona.afterstart.com/xmlrpc.php";s:11:"REMOTE_PORT";s:5:"40225";s:17:"GATEWAY_INTERFACE";s:7:"CGI/1.1";s:15:"SERVER_PROTOCOL";s:8:"HTTP/1.0";s:14:"REQUEST_METHOD";s:4:"POST";s:12:"QUERY_STRING";s:0:"";s:11:"REQUEST_URI";s:11:"/xmlrpc.php";s:11:"SCRIPT_NAME";s:11:"/xmlrpc.php";s:8:"PHP_SELF";s:11:"/xmlrpc.php";s:12:"REQUEST_TIME";i:1409338974;}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML FILES: a:0:{}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML ENV: a:0:{}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML RAW: <?xmlversion="1.0"?><methodCall><methodName>pingback.ping</methodName><params><param><value><string>http://seretil.me/</string></value></param><param><value><string>http://barcelona.afterstart.com/2013/09/27/afterstart-barcelona-2013-09-26/</string></value></param></params></methodCall>
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML ALL_HEADERS: a:5:{s:4:"Host";s:24:"barcelona.afterstart.com";s:12:"Content-type";s:8:"text/xml";s:14:"Content-length";s:3:"287";s:10:"User-agent";s:50:"Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)";s:10:"Connection";s:5:"close";}

So nothing in $_POST, nothing in $_GET, nothing in $_REQUEST, nothing special in $_SERVER, no files submitted, but a text/xml payload POSTed (which was logged by storing $HTTP_RAW_POST_DATA):

<?xmlversion="1.0"?><methodCall><methodName>pingback.ping</methodName><params><param><value><string>http://seretil.me/</string></value></param><param><value><string>http://barcelona.afterstart.com/2013/09/27/afterstart-barcelona-2013-09-26/</string></value></param></params></methodCall>

Here it is in a nicer format:

[Screenshot: the XML-RPC request, formatted]

So basically they were trying to register a link to seretil dot me.

I tried it, and this page, hosted behind CloudFlare, is not working.

[Screenshot: seretil.me not loading]

The problem is that responding to this spam xmlrpc request took the Server around 16 seconds. And I was receiving several per second.

I granted access only to my IP on port 80 in the Firewall, restarted Apache, restarted MySQL and submitted the same malicious request to the Server, and it still took 16 seconds in all my tests:

cat http_post.txt | nc barcelona.afterstart.com 80
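
If you prefer not to craft the raw HTTP by hand, a rough PHP equivalent (only for testing your own Server; the payload and URLs are the ones from the logs above) could be this sketch:

<?php
// Sketch: send the same pingback.ping XML-RPC payload and measure the response time.
$s_xml = '<?xml version="1.0"?><methodCall><methodName>pingback.ping</methodName>'
       . '<params><param><value><string>http://seretil.me/</string></value></param>'
       . '<param><value><string>http://barcelona.afterstart.com/2013/09/27/afterstart-barcelona-2013-09-26/</string></value></param>'
       . '</params></methodCall>';

$o_curl = curl_init('http://barcelona.afterstart.com/xmlrpc.php');
curl_setopt($o_curl, CURLOPT_POST, true);
curl_setopt($o_curl, CURLOPT_POSTFIELDS, $s_xml);
curl_setopt($o_curl, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
curl_setopt($o_curl, CURLOPT_RETURNTRANSFER, true);

$f_start = microtime(true);
$s_response = curl_exec($o_curl);
printf("Took %.2f seconds\n", microtime(true) - $f_start);
curl_close($o_curl);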

[Screenshot: response from the Server to the xmlrpc attack]

I checked and confirmed that the logs from the attacker were showing the same Content-Length and http code.

Other guys tried xml requests as well, but they did it once or twice and left.

The problem was that this robot had been, and still was, sending many requests per second for days.

Maybe the idea was to knock down my Server, but I doubted it, as the address selected is the blog of a Social Event for Senior Internet Talents that I organize: afterstart.com. It has no special interest; I do not see a political, hateful or other motivation to attack the blog of this project.

OK, at this point it was clear that the IP address belonged to a robot, probably running from an infected or hacked Server, and it was trying to publish a Spam link to a site (that was down). I still had to clarify those strange queries in the logs.

I reviewed the WPUsersOnline plugin and saw that the strange (and inefficient) queries I had spotted belonged to it.

[Screenshot: grep -r showing the DELETE FROM wp_useronline queries coming from the plugin]

The thing was that when I renamed xmlrpc.php, the spam robot was still posting to that file. According to the WordPress .htaccess file, any file that is not found on the filesystem is redirected to index.php.

So what was happening is that all the massive requests sent to xmlrpc.php were being handled by index.php, which then showed a page-not-found error message, while the WPUsersOnline plugin was deleting those connections. And it was doing it many times, also overloading the Database.

I was also able to reproduce the behaviour myself, firewalling the WebServer off from any IP other than mine and doing the same POST many times per second.

I checked against a friend's blog, but on his Server xmlrpc.php responds in 1.5 seconds. My friend's Server is a Digital Ocean Virtual Server with 2 cores and SSD Disks. My magnetic disks on Amazon only deliver around 40 MB/second. I had to check in detail why my friend's Server responded so much faster.

I checked the integrity of my databases, just in case, and they were perfect. Nothing strange with collations, and the only errors in /var/log/mysql/error.log were due to MySQL crashing when the Server ran out of memory.

I rechecked on my Server; now it took 12 seconds.

I disabled 80% of the plugins but the times were the same. The Statistics show how things changed (see the spikes to the left, before I definitively patched the Server to block requests from that spam-robot IP).

I checked against another WordPress that I have on the same Server and it only takes 1.5 seconds to reply. So I decided to keep investigating why this WordPress took so long to reply.

[Screenshot: usage statistics for the last 24 hours]

As I said before, I checked that the files from my WordPress installation were the same as the original distribution, and they were. Having discarded differing files, the cause had to be in the database.

Even though MySQL told me that all the tables were OK, having seen that WPUsersOnline deletes all the records older than 5 minutes, I guessed that this could lead to fragmentation, so I decided to do OPTIMIZE TABLE on all the tables of the database of the failing WordPress; with InnoDB this basically recreates the Tables and the Indexes.
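
As a minimal sketch of how that can be automated over every table (the credentials and database name here are placeholders, not the real ones of this Server):

<?php
// Sketch: run OPTIMIZE TABLE on each table of a WordPress database.
// 'localhost', 'wp_user', 'secret' and 'wordpress_db' are placeholder credentials.
$o_mysqli = new mysqli('localhost', 'wp_user', 'secret', 'wordpress_db');
if ($o_mysqli->connect_errno) {
    die('Connection failed: ' . $o_mysqli->connect_error . "\n");
}

$o_result = $o_mysqli->query('SHOW TABLES');
while ($a_row = $o_result->fetch_row()) {
    $s_table = $a_row[0];
    echo 'Optimizing ' . $s_table . "...\n";
    // With InnoDB, OPTIMIZE TABLE maps to recreating the table and analyzing it
    $o_mysqli->query('OPTIMIZE TABLE `' . $s_table . '`');
}
$o_mysqli->close();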

I then tried the call via RPC and my Server replied in three seconds. Much better.

Looking with htop, when I call xmlrpc.php the CPU usage goes between 50% and 100%.

I checked the logs and the robot was gone. It left, or the provider finally blocked the Server. I don't know.

Everything became clear; it was nothing more than a series of coincidences coming together. Deactivating the plugin made the DELETE queries disappear, even under heavy Server load.

It only remained to clarify why, when I send an xmlrpc call to this blog, it replies in 1.5 seconds, while a request to barcelona.afterstart.com takes 3 seconds.

I activated the query log in MySQL. To do that, edit /etc/mysql/my.cnf and uncomment:

general_log_file        = /var/log/mysql/mysql.log
general_log             = 1

Then I checked the queries; in the case of my blog it performs far fewer queries, as I was requesting a pingback to a URL that did not exist, and WordPress does this query:

SELECT   wp_posts.* FROM wp_posts  WHERE 1=1  AND ( ( YEAR( post_date ) = 2013 AND MONTH( post_date ) = 9 AND DAYOFMONTH( post_date ) = 27 ) ) AND wp_posts.post_name = 'afterstart-barcelona-2013-09-26-meet' AND wp_posts.post_type = 'post'  ORDER BY wp_posts.post_date DESC

As the URL afterstart-barcelona-2013-09-26-meet with the dates indicated does not exist on my other blog, the execution ends there and does not perform the rest of the queries, which in the case of the Afterstart blog were:

40 Query     SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (81) ORDER BY meta_id ASC
40 Query     SELECT ID, post_name, post_parent, post_type
FROM wp_posts
WHERE post_name IN ('http%3a','','seretil-me')
AND post_type IN ('page','attachment')
40 Query     SELECT   wp_posts.* FROM wp_posts  WHERE 1=1  AND (wp_posts.ID = '0') AND wp_posts.post_type = 'page'  ORDER BY wp_posts.post_date DESC
40 Query     SELECT * FROM wp_comments WHERE comment_post_ID = 81 AND comment_author_url = 'http://seretil.me/'

To confirm my theory I tried the request against my blog with a valid URL, and it took 3-4 seconds, the same as Afterstart's blog. Finally I double-checked with my friend's blog and it was slower than before: I got between 1.5 and 6 seconds, with a lot of 2-second responses. (He has PHP 5.5 and OpCache, which improves things a bit, but the problem is in the queries to the database.)

Honestly, the guys creating WordPress should cache these queries instead of performing 20 live queries, always the same ones, before returning the error message. Using Cache Lite or Stash, creating an in-memory table to use as a Cache, or of course allowing the use of Memcached, would eradicate the DoS component of this kind of attack, as the xmlrpc pingback feature hits the database with a lot of queries only to end up not allowing the publication.
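
As a minimal sketch of the idea (not a WordPress patch, just an illustration using the Transients API; the function name is made up), the URL-to-post lookup that the pingback handler repeats could be cached, including the negative results:

<?php
// Sketch: cache the pingback target lookup so repeated identical misses
// don't hit the database every time. Could live in a small (mu-)plugin.
function cached_pingback_post_id( $s_target_url ) {
    $s_cache_key = 'pb_' . md5( $s_target_url );

    $m_cached = get_transient( $s_cache_key );
    if ( false !== $m_cached ) {
        return (int) $m_cached;   // cache hit; 0 means "known not to exist"
    }

    // url_to_postid() resolves a URL to a post ID (0 if none)
    $i_post_id = url_to_postid( $s_target_url );

    // Remember the answer for 5 minutes, including 0 for non-existing URLs
    set_transient( $s_cache_key, $i_post_id, 5 * MINUTE_IN_SECONDS );

    return $i_post_id;
}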

While I was finishing those tests (remember that the attacker's IP was gone), another attacker from the same network tried, but I had patched the Server to ignore it:

94.102.52.157 - - [31/Aug/2014:02:06:16 +0000] "POST /xmlrpc.php HTTP/1.0" 200 189 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"

This one was trying to get a link published to a domain called socksland dot net, a domain registered in Russia whose page is not working.

As I had all the information I wanted, I finally blocked the provider's network from ever accessing my Server again.

Unfortunately Amazon's Firewall does not allow blocking a certain IP or range.
So you can block at the iptables level, in the .htaccess file, or in the code.
I do not recommend blocking at the code level, because sadly WordPress has many files accessible from outside, so you would have to add your code at the beginning of all of them, and because when there is a WordPress version update you'll lose all your customizations.
But I do recommend patching your code to block certain IPs if you use a CDN, as the POSTs will be relayed to your Server and the IPs it sees are the CDN's IPs, which you can't block. You have to look at the X-Forwarded-For header, which indicates the IPs of the proxies the request has passed through, and also the Client's IP.
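
A minimal sketch of that (the helper name is made up; only trust this header when the request really comes through your own CDN or load balancer, since any client can forge it):

<?php
// Sketch: resolve the real client IP when requests arrive through a trusted proxy/CDN.
function get_client_ip() {
    if (!empty($_SERVER['HTTP_X_FORWARDED_FOR'])) {
        // X-Forwarded-For: client, proxy1, proxy2... -> the first entry is the client
        $a_ips = explode(',', $_SERVER['HTTP_X_FORWARDED_FOR']);
        return trim($a_ips[0]);
    }
    return $_SERVER['REMOTE_ADDR'];
}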

I designed a program that is able to patch any PHP project to check for blacklisted IPs (even behind a proxy) with minimal performance impact. It works with WordPress, Drupal, Joomla, eZ Publish and Frameworks like Zend, Symfony, Catalonia… and I patched my code to block those unwanted robot requests.

A solution that will probably work for you is to disable the pingback functionality; there are several plugins that do that. Disabling xmlrpc completely is not recommended, as WordPress uses it for several things (JetPack, mobile, validation…)

The same effect as the plugin that disables the xmlrpc pingback can be achieved by editing your Theme's functions.php and adding:

add_filter( 'xmlrpc_methods', 'remove_xmlrpc_pingback_ping' );
function remove_xmlrpc_pingback_ping( $methods ) {
    unset( $methods['pingback.ping'] );
    
    return $methods;
}

Update: 2016-02-24 14:40 CEST
I also got a heavy dictionary attack against wp-login.php. Despite having a Captcha plugin, which makes it hard to crack, it was generating some load on the system.
What I did was to rename wp-login.php to another name, like wp-login-carles.php, and leave in wp-login.php simply an exit();

<?php
exit();

Take into account that this will work only until WordPress is updated to the next version. Then you have to reapply the renaming trick.