Scaling phantomjs with PHP

One of my clients had a problem with a Phantomjs Software.

I was asked to help in their project, that was relying on one of its features.

Phantomjs is an interesting project, but unfortunately it has not had enough maintenance and a terrible lack of sufficient documentation. The last contributions to repo are from mid May, with small frequency. (Latest releases are from Feb 2015, see the Phantomjs releases on github)

The Software from my client ran well for certain requests, but not for others and after a random time, seconds, or minutes, it became irresponsible.

My client wanted to fix that or to use nodejs to scale their phantom code or in the worst case to rewrite the code in nodejs. And it was urgent, because they were losing a lot of money because of their programs malfunctioning.

I began to investigate. That’s the history of how I fixed…

Connections being irresponsible

My client was using the Phantomjs webserver.

The problem with Phantom’s webserver is that it has a hard limit of 10 concurrent connections. After that all the next http connections are queried until one becomes free.

So if you do a telnet to that port, the connection is accepted, but nothing happens. Even sending malformed GET requests.

My guess was that something in the process of parsing the requests was wrong, and then some of those 10 connections became frozen. I started to debug.

I implemented a timedout that will quit the worker after some time.

mTimerExit = setTimeout(forceExitByTimeout, DEFAULT_TIME_TO_EXIT);

Before exiting is important to clear the timers

clearTimeout(mTimerExit);

I also implemented a debug mode to see what was going on with a method consoleDebug that basically did console.log according to if a parameter debug was set to true.

My quickwin system was working, but many urls still were not being parsed by the phantomjs Engine.

Connecting with nodejs

My client had the bad experience of previous versions of Phantomjs crashing a lot.

So it has the idea of running nodejs as the main webserver, for scaling, and invoking Phantomjs from it.

I did several work in this line.

I tried to link with nodejs with products like:

1) https://github.com/alexscheelmeyer/node-phantom

Unfortunately those packets are no longer maintained, having seen the last update from 2013.

It doesn’t work. I found no documentation, and no traces on errors.

I also got errors like:

XMLHttpRequest cannot load http://localhost:8888/start Origin file:// is not allowed by Access-Control-Allow-Origin

And had to figure out what parameters to tune. I did by starting phantomjs with the param:

 --web-security=false

In the js scene products and packages are changing very fast and sadly often breaking retrocompatibility.

So you better have a very well defined package.json that installs exactly the software version that you need, or soon, when you deploy to another server it will be a disaster.

2) https://www.npmjs.com/package/ghost-town

Ghost Town is a product that allows to run phantomjs from inside nodejs.

It is a company maintained product, by a contributor, Teddy.

He was very nice replying my questions, but it didn’t help.

The process was failing with no debug, no info.

The package really lacks documentation, and has only the same sample across all the web.

I provide this ghost-town code sample, in case it is useful for people looking for more:

var phantomClusterOptions = require("./phantomClusterOptions");
var town = require("ghost-town")(phantomClusterOptions);
var alerts = require("./qualitynodephantom"); // Do not ad .js
var PORT = 8080;

if (town.isMaster) {
    var express = require('express');
    var app = express();
    app.get('/', function(req, res) {
        // Every request comes here
        var data = {url:req.query.url,device:req.query.quality};

        town.queue(data, function(err,result) {
            res.set('Content-Type', 'text/plain');
            if (!err) {
                res.send(result);
            } else {
                res.send(err);
            }
        }, phantomClusterOptions.pageTries);

    });
    app.listen(PORT);
    console.log('App running');
} else {
    town.on("queue", function(page, data, callback) {
        town.phantom.set('onError', function(msg,trace){});
        // quality is the exported method, you pass the useful page object as parameter
        quality(page, data, function(str){
            callback(null, str);
        });
    });
    town.on("error", function(err) {console.log("error");});
}

And the file phantomClusterOptions has:

//Options here https://github.com/buzzvil/ghost-town
phantomClusterOptions = {
  //phantomBinary:'./phantomjs', //if you want to use a different phantomjs version
  //phantomBinary:'/usr/bin/phantomjs', 
  workerDeath: 3, //number of times that instance of phantom will be reused
  pageTries:5, //tries to the page before rejecting
  pageCount: 1, //number of pages analysed concurrently by the same phantom instance (1 is recommended)
  // This is for versions 1.9 and older of ghost-town
  //phantomFlags:['--load-images=no', '--local-to-remote-url-access=yes', '--ignore-ssl-errors=true', '--web-security=false', '--debug=true'] //flags (http://phantomjs.org/api/command-line.html)
// For v.2 and newer versions
  phantomFlags: {"load-images" : false, "local-to-remote-url-access" : true, "ignore-ssl-errors" : true, "web-security" : false, "debug" : true}
}
module.exports = phantomClusterOptions;

3) Other products

https://www.npmjs.com/package/node-phantom-simple

https://github.com/sgentle/phantomjs-node

I tried to debug with node debugger from command-line:

node debug myapp.js

And with node-debug (very nice integration with Chrome):

node-debug myapp.js

But I was unable to see what was failing. The nodejs App was up, and the ghost-town queue was increased, but apparently the worker processing the queue was not working or unable to execute phantomjs. But I saw no errors. When I switched the params for ghost-town to v.2, I got some exception, and it really looks like is unable to execute Phantom, or perhaps phantomjs could not exec the .js due to some dependencies problem.

(throw err and error spawn EACCES)

Also:

Error: /mypath/node_modules/ghost-town/node_modules/phantom/node_modules/dnode/node_modules/weak/build/Release/weakref.node: undefined symbol: node_module_register
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at bindings (/mypath/node_modules/ghost-town/node_modules/phantom/node_modules/dnode/node_modules/weak/node_modules/bindings/bindings.js:76:44)
    at Object.<anonymous> (/mypath/node_modules/ghost-town/node_modules/phantom/node_modules/dnode/node_modules/weak/lib/weak.js:7:35)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)

/mypath/node_modules/ghost-town/node_modules/phantom/node_modules/dnode/node_modules/weak/node_modules/bindings/bindings.js:83
        throw e

But I was unable to find more info on the net, I tried to install additional modules and I even straced the processes but I didn’t find the origin of the problem.

I was using:

npm install browserify express ghost-town phantom socket.io URIjs
async dnode forever node-phantom request underscore.string waitfor

About CentOs and Ubuntu

Some SysAdmins love CentOs. I’m in love with Ubuntu.

Basically, is per the packages system. They are really well maintained.

Ubuntu has LTS Long Time Support versions, that last for 5 years.

And in the other hand, they release a new version every 6 months, and if you install a modern server, you have the latest stable packages of Software.

Working with Open Source, this is a really important point. As I have access to modern versions of PHP, Apache, Tomcat, etc…

To use phantomjs with CentOS you have to download the sources and compile it, it took like an hour in a Cloud commodity Virtual Server, and there were problems of dependencies. Also using a phantomjs compiled with a CentOS system didn’t worked with a Server with a different CentOS version. So it was a bit painful to distribute across heterogeneous machines.

With an Ubuntu 14.04 LTS, just:

sudo apt-get install phantomjs

did the trick installing phantomjs (1.9.0-1)

Scaling with PHP

So we had the decision to make between:

rewriting completely the application to nodejs, that certainly would take time
to invest more time trying to determine why workers freeze under phantomjs

Phantomjs is a headless WebKit scriptable so it was very convenient.

Nodejs is built on Chrome’s Javascript runtime, so it would do what we want to.

As we had a time-constraint and for my client was very important to have the system working asap.

So I decided to debug a bit more.

I found that url’s were being stop loading at the event page.onNavigationRequested

So I could keep all the url and after a timedout could force a page.open(url) inside the event if it stopped (timedout)

mPage.onNavigationRequested = function(url, type, willNavigate, main) {

That was working, finally, but was not my favourite solution. I wanted to understand why it was failing initially.

The lack of documentation was frustrating, but debugging the problematic urls, I found that they were doing several redirections, and after some I was getting SSL certificate error on one of the destination urls.

The thing had to be with chain certificates bad configured.

As nowadays there many cheap SSL certificates providers, based on chain certificates, and many sites are configuring them wrong, phantomjs was sensible to that and stopping following urls.

I already had the param:

--ignore-ssl-errors=true

But investigating I found a very interesting contribution on stackoverflow from user Micah:

http://stackoverflow.com/questions/12021578/phantomjs-failing-to-open-https-site

Note that as of 2014-10-16, PhantomJS defaults to using SSLv3 to open HTTPS connections. With the POODLE vulnerability recently announced, many servers are disabling SSLv3 support.

To get around that, you should be able to run PhantomJS with:
phantomjs --ssl-protocol=tlsv1
Hopefully, PhantomJS will be updated soon to make TLSv1 the default instead of SSLv3.

I decided to give a try to forcing the version of SSL to TLSV1:

--ssl-protocol=tlsv1

And it worked. It did the trick. All the urls were now being parsed right and following the redirects to the end (or to my timedout).

The problem and the solution has been there since 2015 October, and the default use of tlsv1 has not been implemented as default in Phantomjs. That lack of maintenance I found disappointing.

That is why, when recently a multinational interviewed me, and asked me about technologies like nodejs I told them that I’m conservative until it is clear that the version has been proved as stable. And I told that, in any case, a member of the company should me a core member of the contributors to the technology. They were surprised but they shouldn’t! they should have known what I told!. I explained them that if you use a new technology in production, at least you should have a member of your staff in the core of that product. So you pay a guy to build an Open Source technology, basically. This warranties you that if a heavy bug or security flaw appears, you’ll not be screwed until the release. You guy can fix it immediately and share the solution with the community.

Companies like google, Facebook or Amazon do that.

That conservativeness is what I drawn in an interview with Facebook Operations, where I was asked about an scenario where I would be requested by some Developers and DevOps to upgrade the Load Balancers Software. They were more for the action, and I told that LB are critical and I was replied that everything in FB was critical. I argued that if a chat component fails, only the chat fails, but if the Load Balancers fail, everything will fail as they are the entrance point. I had the confirmation that I was right when some months ago they had an outage for hours.

Sometimes you have to keep strong, defend your point, because you know you’re right. Even if you are in front of a person that doesn’t see the things like you and will take a decision that will let you out. Being honest is priceless.

Scaling Phantomjs with PHP

So cool, the system was working fine.

But there was something that could be improved.

As Phantomjs had the limit of 10 connections in their webserver, that was the maximum concurrent connections that it can serve at the same time, and so it was a bottleneck.

// Sample code to create a webserver from PhantomJS
mWebserver = require('webserver');
mServer = mWebserver.create();
console.log("Server created");
//consoleDebug('Debug enabled');
mService = mServer.listen(8080,{'keepAlive': true}, function(request, response) {
    //consoleDebug('URL:' + request.url);
    s_params = request.url;
    doRender(s_params, function(res) {
        //consoleDebug('Response from URL:' + request.url + ' (processed)');
        writeStringResponse(response,res);
    });
    //consoleDebug('URL:' + request.url + ' ready for processing');
});

I decided to do propose to the company to use one of my tricks.

To launch phantomjs from PHP.

This is doing a wrapper to launch Phantomjs from commandline, and getting the response. I did the same in my CQLSÍ Cassandra wrapper around cqlsh before Cassandra drivers for PHP were available. I did also this to connect the payment gateway of a bank, written in C, with the Java libraries from Ticketing Solutions in 1999.

That way the server would be able to process as many concurrent Phantomjs instances as we want, as each one would be running in its own process.

I modified the js code to remove the webserver functionality and to get parameters from command line.

var system = require('system');
var args = system.args;
var b_debug_write = false;

if (args.length < 2) {
    console.log("Minim 2 parameters");
    console.log("call with: phantomjs program.js http://myurl.com quality");
    console.log("Parameter debug is optional");
    args.forEach(function(arg, i) {
            console.log(i + ': ' + arg);
        });
    // Exit with error level 1
    phantom.exit(1);
}

var s_url = args[1];
var s_quality = args[2];

if (args.length > 3) {
    // Enable debug
    b_debug_write = true;
}

consoleDebug("Starting with url:" + s_url + " and quality:" + s_quality);

Then the PHP code:

<?php
/**
 * Creator: Carles Mateo
 * Date: 2015-05-11 11:56
 */

// Report all PHP errors
error_reporting(E_ALL);

$b_debug = false;

if (!isset($_GET['url']) || !isset($_GET['quality'])) {
    echo 'Invalid parameters';
    exit();
}

if (isset($_GET['debug'])) {
    $b_debug = true;
}

$s_url = $_GET['url'];
$s_quality = $_GET['quality'];

// Just in case is not decoded by the PHP installed
$s_url = urldecode($s_url);
// reencode url
$s_url = urlencode($s_url);

$s_script = '/mypath/myapp_commandline.sh';

$s_script_with_params = $s_script.' '.$s_url.' '.$s_quality;

if ($b_debug == true) {
    $s_script_with_params .= ' debug';
    echo 'Executing '.$s_script_with_params."<br />\n";
}

//$message=shell_exec("/var/www/scripts/testscript 2>&1");
$s_message = shell_exec($s_script_with_params);

header("Content-Type: text/plain");
echo $s_message;

And finally the bash script myapp_comandline.sh:

#!/bin/bash

PATH_QUALITY=/mypath/
#tlsv1 is recommended to avoid problems with certificates
PARAMETERS="--local-to-remote-url-access=yes --ignore-ssl-errors=true --web-security=false --ssl-protocol=tlsv1"
cd $PATH_QUALITY

#echo "Debug param1=$1 param2=$2 param3=$3"

if [ -z "$3" ]
  then
    phantomjs $PARAMETERS quality.js $1 $2
  else
    echo "Launching phantomjs with debug. url=$1 quality=$2"
    phantomjs $PARAMETERS quality.js $1 $2 $3
fi

If you don’t need to load the images you can speed up the thing with parameter:

--load-images=false

So finally we were able to use only 285 MB of RAM to handle more than 20 concurrent phantomjs processes.

csort, my algorithm that heavily beats quicksort

Here is my proposal for sorting much, much faster than quicksort.

Time and memory consumption in C, for my algorithm csort, versus quicksort, with an universe of 50 Million registers, with values duplicated from 0 to 1,000,000. Less is better

All the algorithms must be checked with different universes of data and their utility is limited to some situations. I explain here the algorithm, show the logic behind code in C, in PHP and Java, the metricas, push the boundaries of the computer… and test against different sets of data looking for the big-O. I prove how much more perfomant it is and explain everything.

This article is complex, is intended for very Senior Engineers at huge companies that need to handle very big data.

The code is written in C to be sure that we compare everything in the same order of magnitude and to avoid optimizations from the compilers / precompilers / Just in Time engines that could bias the conclusions, and so knowing that we are playing fair, but there is a porting to PHP so everyone can understand the algorithm and there are some utilities written in PHP as well. I’ll also explain certain interesting things about PHP internal functions and how it handles memory, and other goodies about C. Finally the code is also available in Java, and is possible to observe how clearly my csort beats quicksort and how fast que JIT compiler performs. Perhaps you can be interested into reading my previous article performance of several languages.

We need a huge amount of data in order to be able to demonstrate the time differences clearly.

Some of the PHP tests and proves require you to have a decent computer with at least 16 GB of RAM. You can skip those or reduce the magnitude of the set of data, or simply use C.

Take this article as an intellectual entertainment, although I think you will discover this useful algorithm / technique of mine will allow you to sort much faster than quicksort in most of the cases I can imagine, and so you will have another tool in your set of resources for sorting fast and dealing with heavy data structures.

Problem

An initial intellectual problem description to enter into matter:

Having a set of 50,000,000 numbers, with values possible between 0 and 1,000,000:

Sort them and eliminate duplicates as fast as possible

To be fair and trustworthy the numbers will be generated and saved to disk, and the same set of numbers will be used for comparing between all the algorithms and tests and when checking with different languages.

Generating the numbers

For this test I created an array of numeric items.

I created a .txt file filled with the numbers, and then I load the same set of numbers for all the tests so all the numbers are always the same and results cannot be biased by a random set very lucky for the algorithm to sort.

I’ve generated several universes of tests, that you can download.

50 Million results, with values from 0 to 1M: carles_sort-50M.txt.zip [156 MB] (344 MB uncompressed)

Code in PHP to generate the random numbers and save to a file

If you prefer to generate a new set of numbers, you can do it by yourselves.

Please note: my computers have a lot of memory, but if you have few memory you can easily modify this program to save the values one by one and so saving memory by not having all the array in memory. Although I recommend you to generate big arrays and pass to a function while measuring the memory consumption to really understand how PHP manages passing Arrays to functions and memory. I found many Senior devs wrong about it.

<?php 
/**
 * Created by   : carles.mateo
 * Creation Date: 2015-05-22 18:36
 */
error_reporting(E_ALL);
// To avoid raising warnings from HHVM
date_default_timezone_set('Europe/Andorra');

// This freezes the server, using more than 35 GB RAM
//define('i_NUM_VALUES', 500000000);
// 50M uses ~ 10 GB RAM
define('i_NUM_VALUES', 50000000);
define('i_MAX_NUMBER', 500000000);

function get_random_values() {

    $st_values = null;

    $s_values_filename = 'carles_sort.txt';

    if (file_exists($s_values_filename)) {
        print_with_datetime("Reading values...");
        $st_values = file($s_values_filename);
        print_with_datetime("Ensuring numeric values...");
        foreach($st_values as $i_key=>$s_value) {
            $i_value = intval($s_value);
            $st_values[$i_key] = $i_value;
        }
    } else {
        print_with_datetime("Generating values...");
        $st_values = generate_random_values();
        print_with_datetime("Writing values to disk...");
        save_values_to_disk($s_values_filename, $st_values);
        print_with_datetime("Finished writing values to disk");
    }

    return $st_values;
}

function generate_random_values() {

    $i_num_values = i_NUM_VALUES;
    $i_max_number = i_MAX_NUMBER;

    $st_values = array();

    for($i_counter = 0; $i_counter<$i_num_values; $i_counter++) {
        $i_random = rand(0, $i_max_number);
        $st_values[] = $i_random;
    }

    return $st_values;

}

function save_values_to_disk($s_values_filename, $st_values) {
    $i_counter = 0;

    $o_file = fopen($s_values_filename, 'w');
    foreach ($st_values as $i_key => $s_value) {
        $i_counter++;
        fwrite($o_file, $s_value.PHP_EOL);
        if (($i_counter % 100000) == 0) {
            // every 100K numbers we show a status message
            $f_advance_percent = ($i_counter / $i_max_values)*100;
            print_with_datetime("Item num: $i_counter ($f_advance_percent%) value: $s_value written to disk");
        }
    }
    fclose($o_file);
}

function get_datetime() {
    return date('Y-m-d H:i:s');
}

function print_with_datetime($s_string) {
    echo get_datetime().' '.$s_string."\n";
}
print_with_datetime('Starting number generator...');

$st_test_array = get_random_values();

First comments about this code:

Obviously to generate an array from 50M items in PHP you need enough of RAM. In PHP this takes ~9GB, in C you need much less: just 50M * 4 bytes per item = 200 MB, although as my program is multipurpose and use several arrays you will realistic be needing >2 GB RAM.
Playing with memory_get_usage() ayou will discover that returning a big array from the function to the main code does not work by copying it to the stack, and that by passing the array to the function by value (not specifying by reference), does not copy the values neither. A pointer alike is passed unless the data passed is modified. It used a technique called copy-on-write. Other modern languages use similar techniques as well to save memory. So it is not true that “A copy of the original $array is created within the function scope“. As PHP.net section PHP at the Core: A Hacker’s guide defines “PHP is a dynamic, loosely typed language, that uses copy-on-write and reference counting.“
Please note that when we read from disk we do:
```
        foreach($st_values as $i_key=>$s_value) {
            $i_value = intval($s_value);
            $st_values[$i_key] = $i_value;
        }
```
This forced conversion to int is because when PHP reads the values from the disk it reads them as String (and it puts a jump of line at the end as well). We convert to make sure we use PHP integer values.

To know the max value of an Integer in your platform do:
```
echo PHP_INT_MAX;
```
For a 64 bits is: 9223372036854775807

I limited the number of registers sorted in this tests to 50,000,000 because it makes PHP use 12.6GB of RAM, and my computer has 32 GB. I tried with 500,000,000 but the Linux computer started to swap like crazy and had to kill the process by ssh (yes, the keyboard and mouse were irresponsible, I had to login via ssh from another computer and kill the process). Although 500M is a reasonably number for the code in C or Java.
So the fourth conclusion is that for working with big Data Structures PHP is not recommended because it uses a lot of RAM. You can use it if its flexibility brings you more value than performance and memory consumption.
Everything that you do can do in PHP will be much much much faster if done in C.

Sample status of the advance of the process of writing to disk from PHP (that is slow) 50M values from ranges 0 to 500M

My solution 1: simple csort

I’ve developed several implementations, with several utility. This is the first case.

Please take a look at this table of the cost of sorting an Array by using different algorithms from http://bigocheatsheet.com/ :

It is interesting to see the complexity of each algorithm in all the cases.

So Quicksort has a complexity of O(n log(n)) for the best case, and O(n^2) for the worst.

For a 50 Million Array, so n=50M, then 50M * log2(50M) = 1,278,771,237.95 ~ 1.28 Billion

My algorithm approach is this: having the unsorted array loaded into the array st_values, I do a main loop like this (remember that we want to sort and remove duplicates):

for (l_counter=0; l_counter < l_max_values; l_counter++) {
    st_values_sorted[st_values[l_counter]] = st_values[l_counter];
}

So, I copy to the destination array st_values_sorted the values, assigning the value as the index of the Array st_values_sorted.

With this:

We are sure we will have no duplicates, as if the value is duplicated n times it will be simply reassigned to the same position n times, and so having only it one time as st_values_sorted[n] in the final array.
We have an array, that has as index, the same value. So even, if in a foreach loop you would get the values assorted, they’re sorted by their key (hash index numeric value).

Finally, if you want to generate a sorted sequential array or a linked list, we do:

        // Resort for a creating a ordered list 
        l_max_range = l_max_values_sorted + 1;
        for (l_counter=0; l_counter < l_max_range; l_counter++) {
            if (st_values_sorted[l_counter] > -1 ) {
                // If the value existed
                st_values_sorted_renum[l_counter_hit] = st_values_sorted[l_counter];
                l_counter_hit++;
            }
        }

The magic of this:

In the first loop, we read each item of the array just one time. So O(n)
In the second loop, if we really want to generate a sorted list, we simply loop the sorted and deduplicated Array by index, as seen in the code before, from the initial possible value (0) to the maximum value we can have (in this sample m=1M)
So we loop, for 1,000,001 items
So in total we do n+m cycles
The cost in the Best and in the Worst case is O(n)
This is fucking awesome faster than quicksort
For quicksort time complexity is: Best O(n log n). Worst O(n^2)
After quicksort you would have to delete duplicates also, adding some cost (in the case of PHP unique() function is really costly)
An optimization is that we can also associate the index, in the process of loading the values from disk, so we save the first loop (step 1) and so we save O(n).
```
    while( b_exit == 0 ) {
        i_fitems_read = fscanf(o_fp, "%li", &l_number);
        if (i_fitems_read < 1) {
            printf("Unable to read item %li from file.\n", l_counter);
            exit(EXIT_FAILURE);            
        }
        st_values[l_number] = l_number;
                       
        l_counter++;
        if (l_counter > l_max_values -1) break;
    }
```
And we will have the Array sorted (by key) without any sort. That’s why I say…
The fastest sort… is no sorting at all!.

This is the recommended way of using the algorithm because it saves a lot of memory, so, instead of having and Array of 50M for loading the data to sort, and another of 1M to hold the csorted data, we will have only the 1M Array. We will be adding the data using csort and eliminating duplicates and sorting on the way.
If in the process of loading the have all the values sorted, we will not do the first loop O(n), so our cost will be O(m) where m<n
Finally, if our array is sorted (by index) there is no need to do the second loop to produce a sorted list, as the array is sorted by key. But if we need to do it, the cost is m, otherwise we have no cost at all, it was sorted and deduplicated when loaded from disk!
Even doing both loops, this is awesome faster than quicksort (even without deduplicating), for the universes tested:
1. 50M items, possible values 0-1M
2. 50M items, possible values 0-50M
3. 50M items, possible values 0-500M
4. The same proportions with less items

Comparing the performance for the first case

Notes:

Please note the code has been compiled with -O0 to disable optimizations for what I think is a bug I found in gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2. that manifests in the csort with compression. With the parameter -O2 the times for the sorting are half and better (0.1 seconds for csort basic, around 4.7 for quicksort).
To use arrays of 500M, so using more than 2GB of ram you should pass the parameter -mcmodel=medium to gcc or the compiler will return an error “relocation truncated to fit: R_X86_64_32S against .bss”
Computer used is Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz, as computers nowadays use turbo, times vary a bit from test to test, so tests are done many times and an average value is provided
Please understand how your language pass the parameters in functions before trying to port this code to another language (passing a pure by value string of 50M items to a function would not be a good idea, even worst for the case of recursive functions)

As you can see in the C code below, I have used two quicksort algorithms, both quicksort take between 8.7 and 10 seconds. (the second algorithm is a bit slower)

This time does not change, no matter if the random numbers were for ranges 0-1M, 0-50M or 0-500M.

My implementation of the quicksort with removing duplicates take just a bit more as I use the same base of the csort for removing the duplicates.

With the csort basic, doing the two loops (so no optimizing on load) the time for sorting with removing duplicates and generating a final array like a list, depends on the maximum possible value:

Universe of 50M items, random values between 0-1M, time: 0.5 to 0.7 seconds
Here csort basic invoked from command line. Firefox was opened and heavily full of tabs, so it was a bit slower but it takes 785 ms

Launching csort basic from the netbeans IDE, Release config but -O0, and firewox removed, takes 554 milliseconds / 0.554 seconds
Universe of 50M items, random values between 0-50M, 1.7-2 seconds
Universe of 50M items, random values between 0-500M, 3.7 seconds

Advantages of csort (for its purpose, so for the universes compatible, so if the range of values m is smaller or not many times bigger than the number of items n):

Is super-fast
It saves a lot of memory
Can be used to remove duplicates or to keep duplicates
It is possible to use it with hybrid universes (see additional optimizations) and have significant performance gains respect using a single sorting algorithm
It is very easy to understand and to implement

Disadvantages of csort basic:

It is intended for the universe of max(m) < n. If max(m) (maximum value of the number of random values) is much much higher than n (the number of items), then the sorted by key array will use more memory (bigger array), even if it has few values. (* see csort with compression bellow and Additional optimizations sections)
In certain border edge cases where n is many more times smaller than the difference between min(m) and max(m) the algorithm could be slower than quicksort. (this is if data does not accomplish max(m) < n)
The sorted array uses memory. Other algorithms, starting from an initial array, do not need additional arrays in order to sort. Although we can generate a final linked list and remove the arrays from memory during the process we are using that memory. But this was supposing we have an original array not sorted, but actually we don’t need to have an starting array, just csort the Data as you get it to the destination Array, and then you save memory because an Array of 1M takes less memory than an Array of 50M. (See My solution 2: optimized when reading)

My solution 2: optimized when reading

This is basically csort, but using the read from disk process to assign the right index, not only save CPU but a lot of memory.

That way the complexity is totally reduced as data is sorted (by index) and deduplicated when loading, with no cost.

In the case of the optimized read, with an universe of 50M items, ranges 0-1M, it takes the astonishing time of.. 3 ms! So 0.003 seconds.

Probably this represents better the idea of beating quicksort: csort 0.003 seconds vs ~10 seconds quicksort.

My solution 3: supporting linked data

Image we have a numeric array with unique values, for example id, and another String array with the names.

For example:

st_values[0] = 1500000; st_names[0] = “Larry Page”;

st_values[1] = 731207; st_names[1] = “Sergey Brin”;

st_values[2] = 23788889; st_names[2] = “Steve Wozniak”;

st_values[3] = 5127890; st_names[3] = “Bill Gates”;

st_values[1000] = 5127890; st_names[1000] = “Alan Turing”;

st_values[10000] = 73215; st_names[10000] = “Grace Hopper”;

st_values[100000] = 218919; st_names[100000] = “Jean E. Sammet”;

Instead of doing:

st_sorted[1500000] = 1500000;

st_values[731207] = 731207;

…

We will do:

st_sorted[1500000] = 0;

st_values[731207] = 1;

st_values[23788889] = 2;

…

So, we will have csorted for the id, but the value will be pointed to the linked data in st_names[].

This can be used too with csort optimized reading and saving the use in memory of the array st_values[]. Obviously st_names[] must be kept.

My solution 4: csort with compression and supporting duplicates

csort is very quick for sorting and removing duplicates. But what happens if we want to sort and keep the duplicates?

The algorithm csort with compression what does is:

    if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
        for (l_counter=0; l_counter < l_max_values; l_counter++) {
            l_value = st_values[l_counter];
            st_values_sorted[l_value] = st_values_sorted[l_value] + 1;
        }
        l_counter_hit = l_max_values_sorted + 1;
    }

So, it assigns the index, as the other versions of csort, but instead of assigned the same value as the index, it is assigned the number of times that this value is found.

So, assuming we have a list like this:

{0,10,15,0,3,5,21,5,100000,0,7,5}

It will produce:

st_values[0] = 3;
st_values[1] = 0;
st_values[2] = 0;
st_values[3] = 1;
st_values[4] = 0;
st_values[5] = 3;
st_values[6] = 0;
st_values[7] = 1;…
st_values[100000] = 1;

From this Array it is possible to get the original linked list sorted easily:
{0,0,0,3,5,5,5,7,10,15,21,100000}

Now, take the first universe of data 50M items, with ranges 0-1M, and you get that you have compressed an originally unsorted array of 50M items into a sorted array of 1M items (1,000,001 to be exact) and you can still access by index. So a compression of 50 to 1, for this universe of numbers. (As you are thinking that won’t compress, but use more space, if the range of numbers is bigger than the number of items.)

If your values repeat a lot, and you have a huge set of items to sort, like in Big Data, this algorithm is wonderful and have huge memory savings and amazing speedy random access to data.

The idea, as mentioned before, is to do not load the data into an Array, and then sort, but to assigning using csort when reading data from disk. That way instead of having an Array of 50M and another of 1M, we would have only a single Array of 1M.

This is the relation of times for the universe 50M, 0-1M:

csort basic: 0.553 seconds
Please note: this sample including the time to generate a array like a list: st_values_sorted_renum, so the time just to create the csorted array is ~3ms less
csort opt read: 0.003 seconds
csort with compression and duplicates support: 0.883 seconds

But awesome news are that the csort with compression can be implemented on the load time, like csort opt read.

Additional improvements

1) Using overlay to save Memory and run even faster

Ideally min(m) and max(m) should be calculated.

Imagine that you have a universe of data composed by id driver licenses. Guess that those numbers range from 140,000,000 to 190,000,000.

There is no need have create an array of 190M items. Just create an array of 50M, (max(m) – min(m)) and save an overlay to a variable.

l_overlay = 140000000;

Then simply do your loops from 0 to max(m) – l_overlay and when you’re building the final linked list or the sequentially ordered final array add l_overlay to each value.

You will save memory and increase speed.

2) Hybrid universes / spaces

Guess you have a hybrid universe, composed by let’s say 75% of your values in the range of 0-1M and 25% of your values being frontier values, f.e. between 35M and 900M.

    // Last value covered by the csort
    long l_last_frontier = 1000000;

    long l_counter_unsorted_array = 0;

    while( b_exit == 0 ) {
        i_fitems_read = fscanf(o_fp, "%li", &l_number);
        if (i_fitems_read < 1) {
            printf("Unable to read item %li from file.\n", l_counter);
            exit(EXIT_FAILURE);            
        }

        if (l_number > l_last_frontier) {
            st_values_unsorted[l_counter_unsorted_array] = l_number;
            l_counter_unsorted_array++;
        } else {
            st_values[l_number] = l_number;
        }
                       
        l_counter++;
        if (l_counter > l_max_values -1) break;
    }

You can then sort the 25% with another algorithm and merge the csorted array (it goes first) and the 25% (bigger values, go later), now sorted Array, and you’ll have benefit from a 75% super-fast sorting.

3) Hybrid sort big-O

For my tests, as long as you have memory, you can csort up to n * 10, and have significant time saving respect Quicksort. So if you have a really huge universe you can implement a hybrid algorithm that sorts up to m = n * 10 (l_last_frontier = n * 10) and use another algorithm to sort the rest >m. Finally just add the sorted second block to the sorted from first block results.

4) Hybrid sorting, with undetermined n and m

If you don’t know how long will be your space, like in a mapreduce, or the max value, you can also use the hybrid approach and define an array big enough, like 10M. If your arrays are based on long int, so using 4 bytes per position, it will take only 38 Mbytes of RAM. Peanuts. (Use the l_last_frontier seen in the point before just in case you have bigger values)

When you’re loading the data, and csorting using csort opt read, keep a variable with the max and min values found, so you’ll not have to parse all the 10M, just from min-max in order to generate a final sorted, deduplicated, list or array.

It’s up to you to fine tune and adjust the size for the minimum memory consumption and maximum efficiency according to the nature of your data.

5) Using less memory

If you have stored the data to sort in a Collection, you can reduce an item one by one, so you will save memory for huge amounts of Data. So, you can add that item to csort resulting array, and eliminate from the Collection/LinkedList source. One by one. That way you’ll save memory. With the sample of 50M registers with 1M different random values, it passes from 400MB* of RAM to 4MB of RAM in C. (* I’m calculating here a Linked List structure of 2 Long Data-type, one for the value and another for next item, in C. In other languages gain can be even more spectacular)

csort and Big Data

With the progression of Big Data we are finding ourselves with seas of data. Those universe of data tend to have a much bigger number of registers (n) than the range of values (m).

For example, in the case of devices monitoring temperatures every second, we will have 3,600 rows per hour, 86,400 rows per day, 31,536,000 per year while the range of values read will be a very stable subset, like from 19.0 to 24.0 C. In this example, multiplying the float value per 10 to work always with Integers, we would be working with a subset of 190-240, so m would be 51 if optimized, or 0-240 if not. If we want to get the unique temperatures for the past year, csort will generate just an Array of 51 items, crushing any ordering algorithm in space and time.

The bigger the Data is, the faster csort will perform compared to other sorting algorithms.

The complete code in C

This is the complete source code. It supports all the cases and you can invoke it from command line with a parameter for the case number you want to test.

/* 
 * File:   main.c
 * Author: Carles Mateo
 *
 * Created on 2015-05-14 16:45
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <string.h>

/*
 * Change does values for your tests, and recompile it
 */
#define l_MAX_ITEMS 50000000
#define l_MAX_VALUE 1000000


static int const i_ALGORITHM_QUICKSORT        = 1;
static int const i_ALGORITHM_QUICKSORT_UNIQUE = 2;
static int const i_ALGORITHM_QUICKSORT2       = 3;
static int const i_ALGORITHM_CSORT_BASIC      = 7;
static int const i_ALGORITHM_CSORT_OPT1       = 8;
static int const i_ALGORITHM_CSORT_COMP       = 9;


#define s_HELP  "CSort by Carles Mateo\n"\
                "=====================\n"\
                "\n"\
                "This demonstrates a sort algorithm faster than quicksort,\n"\
                "as proposed in the article.\n"\
                "Please read the complete article at:\n"\
                "http://blog.carlesmateo.com\n"\
                "\n"\
                "Please invoke the program with one of these parameters:\n"\
                "\n"\
                "--help or without parameters for this help message\n"\
                "1 To test using quicksort\n"\
                "  Only sorts, does not deletes duplicates\n"\
                "2 To test using quicksort and removing duplicates\n"\
                "3 To test using another implementation of quicksort\n"\
                "  Only sorts, does not deletes duplicates\n"\
                "7 To test using csort (basic example, sorts and deletes duplicates)\n"\
                "8 To test using csort, improvement with indexes assigned when reading\n"\
                "  Use to watch the time saving from basic csort\n"\
                "9 To test using csort, with indexes assigned when reading and counting\n"\
                "  the number of coincidences (compressed output array)\n"\
                "\n"\
                "Please, make sure that the file carles_sort.txt is present in the\n"\
                "current directory.\n"\
                "\n"


// Allocate in the BSS segment, which is a part of the heap
static long st_values[l_MAX_ITEMS];
// Only max value is required
static long st_values_sorted[l_MAX_VALUE];
static long st_values_sorted_renum[l_MAX_VALUE];

int init_values(long l_max_values);
void quicksort(long x[l_MAX_ITEMS], long first, long last);
void quicksort2(long arr[], long beg, long end);
void swap(long *a, long *b);
void printwdt(char s_string[]);

int init_values(long l_max_values) {
    long l_counter = 0;
    for (l_counter = 0; l_counter < l_max_values; l_counter++) {
        // Init to -1
        st_values[l_counter] = -1;
    }
}

int init_values_sorted(long l_max_values, long l_num) {
    long l_counter = 0;
    for (l_counter = 0; l_counter < l_max_values; l_counter++) {
        // Init to -1
        st_values_sorted[l_counter] = l_num;
    }
}

// Quicksort modified from original sample from http://www.cquestions.com/2008/01/c-program-for-quick-sort.html
 void quicksort(long x[l_MAX_ITEMS], long first, long last){
    long pivot,j,temp,i;

     if (first<last) {
         pivot=first;
         i=first;
         j=last;

         while(i<j){
             while(x[i]<=x[pivot]&&i<last)
                 i++;
             while(x[j]>x[pivot])
                 j--;
             if(i<j){
                 temp=x[i];
                  x[i]=x[j];
                  x[j]=temp;
             }
         }

         temp=x[pivot];
         x[pivot]=x[j];
         x[j]=temp;
         quicksort(x,first,j-1);
         quicksort(x,j+1,last);

    }
}

// Modified by Carles Mateo to use long based on http://alienryderflex.com/quicksort/
void quicksort2(long arr[], long beg, long end)
{
  if (end > beg + 1)
  {
    long piv = arr[beg], l = beg + 1, r = end;
    while (l < r)
    {
      if (arr[l] <= piv)
        l++;
      else
        swap(&arr[l], &arr[--r]);
    }
    swap(&arr[--l], &arr[beg]);
    quicksort2(arr, beg, l);
    quicksort2(arr, r, end);
  }
} 

void swap(long *a, long *b)
{
  long t=*a; *a=*b; *b=t;
}

 /* print with date and time
  */
 void printwdt(char s_string[]) {
    time_t timer;
    char buffer[26];
    struct tm* tm_info;

    time(&timer);
    tm_info = localtime(&timer);

    strftime(buffer, 26, "%Y-%m-%d %H:%M:%S", tm_info);
    
    printf("%s %s\n", buffer, s_string);
  
 }
 
int main(int argc, char** argv) {
    
    char s_filename[] = "carles_sort.txt";
    long l_max_values = l_MAX_ITEMS;
    long l_max_values_sorted = l_MAX_VALUE;
    char s_ch;
    FILE *o_fp;
    long l_counter=0;
    long l_counter_hit=0;
    long l_number;
    int b_exit = 0;
    int i_fitems_read = 0;
    int i_algorithm = 0;
    // Small counter for printing some final results
    int i_counter=0;
    // temp value for the final print
    long l_value=0;
    // Used for counting the number of results on optimized array
    long l_total_items_sorted = 0;
    long l_max_range;
    long l_items_found = 0;
    
    if (argc < 2) {
        printf(s_HELP);
        exit(EXIT_SUCCESS);
    }
        
    printwdt("Starting program...");
    if (strcmp(argv[1], "1") == 0) {
        i_algorithm = i_ALGORITHM_QUICKSORT;
        printwdt("Algorithm: Quicksort with duplicates");
    } else if (strcmp(argv[1], "2") == 0) {
        i_algorithm = i_ALGORITHM_QUICKSORT_UNIQUE;
        printwdt("Algorithm: Quicksort removing duplicates");
    } else if (strcmp(argv[1], "3") == 0) {
        i_algorithm = i_ALGORITHM_QUICKSORT2;
        printwdt("Algorithm: Quicksort alternative");
    } else if (strcmp(argv[1], "7") == 0) {
        i_algorithm = i_ALGORITHM_CSORT_BASIC;
        printwdt("Algorithm: csort basic");
    } else if (strcmp(argv[1], "8") == 0) {
        i_algorithm = i_ALGORITHM_CSORT_OPT1;
        printwdt("Algorithm: csort optimized read");
    } else if (strcmp(argv[1], "9") == 0) {
        i_algorithm = i_ALGORITHM_CSORT_COMP;
        printwdt("Algorithm: csort compress");
    }
    
    if (i_algorithm == 0) {
        printf(s_HELP);
        exit(EXIT_SUCCESS);        
    }
    
    if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT2) {
        // Use this if you want to make tests with partial data to find the end of the values
        //printf("Initializing %li values to -1...\n", l_max_values);
        //init_values(l_max_values);        
    }
    
    if (i_algorithm == i_ALGORITHM_CSORT_BASIC || i_algorithm == i_ALGORITHM_CSORT_OPT1 || i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE) {
        // We init to -1 as it could happen that some indexes have not a value,
        // and we want to know what's the case.
        printf("Initializing %li values to -1...\n", l_max_values_sorted + 1);
        init_values_sorted(l_max_values_sorted + 1, -1);        
    } 
    
    if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
        printf("Initializing %li values_sorted to 0...\n", l_max_values_sorted + 1);
        init_values_sorted(l_max_values_sorted + 1, 0);        
    }

    
    o_fp = fopen(s_filename,"r"); // read mode
    
    if( o_fp == NULL ) {
        printf("File %s not found or I/O error.\n", s_filename);  
        perror("Error while opening the file.\n");
        exit(EXIT_FAILURE);
    }
    
    printf("Reading file %s :\n", s_filename);
    while( b_exit == 0 ) {
        i_fitems_read = fscanf(o_fp, "%li", &l_number);
        if (i_fitems_read < 1) {
            printf("Unable to read item %li from file.\n", l_counter);
            perror("File ended before reading all the items.\n");
            exit(EXIT_FAILURE);            
        }
        if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE ||
            i_algorithm == i_ALGORITHM_QUICKSORT2 ||
            i_algorithm == i_ALGORITHM_CSORT_BASIC || i_algorithm == i_ALGORITHM_CSORT_COMP) {
            st_values[l_counter] = l_number;
        }
        if (i_algorithm == i_ALGORITHM_CSORT_OPT1) {
            st_values[l_number] = l_number;
            l_total_items_sorted++;
        }
        
        if (l_counter < 10) {
            // Echo the reads from the first 10 items, for debugging
            printf("Item %li: read %li\n", l_counter, l_number);            
        }

        l_counter++;
        if (l_counter > l_max_values -1) break;
    }
    
    printf("Total read values: %li\n", l_counter);
    printf("Sample values read from file: First item: %li Last item: %li\n", st_values[0], st_values[l_max_values -1]);
 
    fclose(o_fp);        
    
    printwdt("Sorting...");
    
    struct timeval time;
    gettimeofday(&time, NULL); //This actually returns a struct that has microsecond precision.
    long l_microsec_ini = ((unsigned long long)time.tv_sec * 1000000) + time.tv_usec;

    if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE) {
        quicksort(st_values, 0, l_MAX_ITEMS-1);
        l_counter_hit = l_MAX_ITEMS;
    }
    
    if (i_algorithm == i_ALGORITHM_QUICKSORT2) {
        quicksort2(st_values, 0, l_MAX_ITEMS-1);
        l_counter_hit = l_MAX_ITEMS;
    }
    
    if (i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE) {        
        printwdt("Removing duplicates...");
        l_counter_hit = 0;
        long l_last_known_value = -1;
        for (l_counter=0; l_counter < l_max_values; l_counter++) {
            if (l_last_known_value != st_values[l_counter]) {
                l_last_known_value = st_values[l_counter];
                st_values_sorted_renum[l_counter_hit] = st_values[l_counter];
                l_counter_hit++;
            }
        }            
    }
    
    if (i_algorithm == i_ALGORITHM_CSORT_BASIC) {
        // Shrink and set the index
        for (l_counter=0; l_counter < l_max_values; l_counter++) {
            st_values_sorted[st_values[l_counter]] = st_values[l_counter];
        }

        // Resort for forearch
        l_max_range = l_max_values_sorted + 1;
        for (l_counter=0; l_counter < l_max_range; l_counter++) {
            if (st_values_sorted[l_counter] > -1 ) {
                st_values_sorted_renum[l_counter_hit] = st_values_sorted[l_counter];
                l_counter_hit++;
            }
        }        
    }
    
    if (i_algorithm == i_ALGORITHM_CSORT_OPT1) {
        l_max_range = l_max_values_sorted + 1;
        for (l_counter=0; l_counter < l_max_range; l_counter++) {
            if (st_values[l_counter] > -1 ) {
                st_values_sorted_renum[l_counter_hit] = st_values[l_counter];
                l_counter_hit++;
            }
        }        
    }
    
    if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
        for (l_counter=0; l_counter < l_max_values; l_counter++) {
            l_value = st_values[l_counter];
            st_values_sorted[l_value] = st_values_sorted[l_value] + 1;
        }
        l_counter_hit = l_max_values_sorted + 1;
    }   
    
    
    struct timeval time_end;
    gettimeofday(&time_end, NULL); //This actually returns a struct that has microsecond precision.
    long l_microsec_end = ((unsigned long long)time_end.tv_sec * 1000000) + time_end.tv_usec;
    
    long l_microsec_total = l_microsec_end - l_microsec_ini;
    long l_milliseconds_total = l_microsec_total / 1000;
    double f_seconds = (double)l_milliseconds_total / 1000;
    printf("Time used: %li microseconds = %li milliseconds = %f seconds \n", l_microsec_total, l_milliseconds_total, f_seconds);
    
    printf("Total items in the final array %li\n", l_counter_hit);
    printf("Sample items in the final array:\n");

    if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
        l_max_range = l_max_values_sorted + 1;        
        l_counter_hit = 0;

        for (l_counter=0; l_counter < l_max_range; l_counter++) { 
            if (st_values_sorted[l_counter] > 0) {
                l_counter_hit++;
                l_items_found = l_items_found + st_values_sorted[l_counter];                
            }            
        }
        printf("A final array of %li significant items, contains sorted lossless %li\n", l_counter_hit, l_items_found);
        
    } else {
        for (i_counter=0; i_counter<10000; i_counter=i_counter+1000) {
            /* if (st_values_sorted[i_counter] > -1) {
                // If you enjoy looking at values being printed or want to debug :)
                printf("item at position %i : %li\n", i_counter, st_values_sorted[i_counter]);
            }*/
            if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT2) {
                l_value = st_values[i_counter];
            } else {
                l_value = st_values_sorted_renum[i_counter];
            }
            printf("item at position %i : %li\n", i_counter, l_value);

        }
        
    }
    
    return (EXIT_SUCCESS);
}

My solution in PHP

As told before I have implemented my algorithm in PHP, and it outperforms by much the PHP native functions (PHP sort() is a quicksort algorithm written in C, and unique() ). I explain everything below for the PHP lovers, is the same than in C, but with special considerations for PHP language. Even in PHP the efficiency of the algorithm is clearly demonstrated.

Warning, do not attempt to run the PHP code as is in your machine unless you have at least 16 GB of RAM. If you have 8 GB RAM you should reduce the scope to avoid your computer to hang up or swap very badly.

Printing of datetime is done by those functions:

function get_datetime() {
    return date('Y-m-d H:i:s');
}

function print_with_datetime($s_string) {
    echo get_datetime().' '.$s_string."\n";
}

So once we have our numbers generated we do a sort($st_test_values);

And we calculate the time elapsed with microtime for profiling:

$f_microtime_ini = microtime(true);

print_algorithm_type('PHP sort() method');
sort($st_test_array);

$f_microtime_end = microtime(true);

$f_microtime_resulting = $f_microtime_end - $f_microtime_ini;

echo "Microtime elapsed: $f_microtime_resulting\n";

In this case we did a PHP sort(), and we didn’t even remove the duplicates and it takes 88.5 seconds.

If we do a sort() and array_unique() to get the unique results sorted it takes: 741.27 seconds

We do the opposite array_unique() and sort() and the result is: 402.45 seconds

It takes a lot.

Please note that sorting and doing array_unique takes more time than doing in the opposite order.

This is curious as one may thing that removing duplicates and sorting could be much faster than sorting and removing duplicates, but just sorting takes 88 seconds. I found the explanation looking at the source Source Code of PHP array_unique. Internally it sorts the array (zend_qsort line 2813) for removing the duplicates.

So well, we were sorting twice, and sorting a sorted array. This is the worst case for quicksort, so that explains the slowlyness.

Then it enters what I engined. PHP works with arrays, that can have a numeric index. So my point is: the fastest way to sort an array is… not sorting it at all.

We know that the range of values we have is much smaller than the number of registers:

We have values between 0 and 1,000,000
We have 50,000,000 of registers

PHP sort() function is written in C, so it should be fastest than any thing we program in PHP. Should be…. really?.

Knowing that the values are 0-1M and the registers are 50M, and knowing the number of operations that a quick sort will do, I figured another way to improve this.

Look at this code:

function csort($st_test_array) {

    $s_algorithm = 'Carles_Sort';
    echo "Testing $s_algorithm\n";

    $i_loop_counter = 0;
    $st_result_array = Array();

    $i_ending_pos = count($st_test_array);
    for($i_pos = 0; $i_pos < $i_ending_pos; $i_pos++) {
        $i_loop_counter++;
        $i_value = $st_test_array[$i_pos];
        $st_result_array[$i_value] = $i_value;
    }

    return $st_result_array;

}

So, the only think I’m doing is looping the 50M array and in a new array $st_result_array setting the key to the same value, than the integer value. So if the original $st_test_array[0] is equal to 574145, then I set st_result_array[574145] = 574145. Easy.

Think about it.

With this we achieve three objectives:

You eliminate duplicates. With 50M items, with values ranged from 0 to 1M you’ll set a value many times. No problem, it will use it’s index.
We have the key assigned, matching the value, so you can reference every single value
It’s amazingly faster. Much more than any sorting even in C

If you do a var_export or var_dump or print_r or a foreach you will notice that the first item of the array has key 574145 and value 574145, the second 262240, the third 153341, the fourth 656736 and the fifth 879329. They are not ordered in the way that one will think about an ordered array parsed with foreach(), but the thing is: do we need to sort it?. Not really. Every key is a hash that PHP will access very quickly. So we can do something like this to get the results ordered:

    $i_max_value = max($st_result_array);

    for($i_pos = 0; $i_pos <= $i_max_value; $i_pos++) {
        if (isset($st_result_array[$i_pos])) {
            $st_result_array2[] = $i_pos;
        }
    }

In this code we create a new array, with the values sorted in the sense that a foreach will return them sorted. But instead of recurring 50M registers, we only go to the max value (as values and indexes are the same, the max value also shows the max index), because we now don’t have 50M registers as we eliminated the duplicates, but 1M. And in an universe of 50M values from a range 0-1M, many are duplicates.

Even doing this second step still makes this sorting algorithm written in PHP, much more efficient than the quicksort written in C.

I ported the code to C, with the same data file, and it takes 0.674 seconds (674 milliseconds) and it uses only ~1 GB of RAM.

Take in consideration that if the data is directly loaded assigning the right index, so doing like in the next code, it will be blazing fast:

    o_fp = fopen(s_filename,"r"); // read mode
    
    if( o_fp == NULL ) {
        printf("File %s not found or I/O error.\n", s_filename);  
        perror("Error while opening the file.\n");
        exit(EXIT_FAILURE);
    }
    
    printf("Reading file %s :\n", s_filename);
    while( 1 ) {
        fscanf(o_fp, "%li", &l_number);
        // The key is here ;)
        st_values[l_number] = l_number;
        l_counter++;
        if (l_counter > l_max_values -1) break;
    }
    
    printf("First item: %li Last item: %li\n", st_values[0], st_values[l_counter-1]);
 
    fclose(o_fp);

So if we only resort in another array to have the values sorted eliminating the non existing values (half the tasks we do in the previous program), it takes and order of milliseconds.

blog-carlesmateo-com-csort-second-part-fast-0_1seconds

Using csort opt read with a busy computer (firefox using 16 GB RAM and many tabs) it takes 108 ms, with a idle computer 3 ms

Now imagine you don’t want to eliminate the duplicates.

You can create an memory optimized array where the index is the number, and the value represents the number of apparitions in the original file:

$st_example = Array( 0 => 1,
                     3 => 5,
                     4 => 1);

That would mean that the original sorted array, in form of list would be:

{0, 3, 3, 3, 3, 3, 4}

For the case of C, where the arrays have static size, a value of 0 would indicate there is no presence of that number in the orderer list so:

st_example[0] = 1;
st_example[1] = 0;
st_example[2] = 0;
st_example[3] = 5;
st_example[4] = 1;

Even in the case of the resulting C array, it will be much smaller than the original 50M Array, as will have only the size corresponding to the max value. In our sample 1M.

In PHP you can get an Array alike the csort compressed by using a function called array_count_values(). It will not have the non existing items = 0, so it would not generate st_example[1] = 0 neither st_example[2] = 0, but it would generate the others.

So conditions for the algorithm to work blazing fast:

The max number (max(m)) must be ideally lower than the number of registers (this is true for languages like C or Java. Languages like PHP grow dynamically the Array size and work as Array Hash, so it’s not necessary as we can have st_sorted[0] and st_sorted[1000000] and the array size will be only 2 items. In this case we can use less memory than the csort algorithm and be faster using foreach() instead as for() from 0 to 1M)
The most numbers that are repeated, the faster that will work the algorithm

Key conclusions and Advantages:

If you directly load the values form disk assigning the value as a key as well, you’ll save one loop per item (n=50M in this universe) and make it faster (having only to parse count(m), in PHP it doesn’t have to be 0-m).
The fastest way to sort it, is no sorting at all. You can have the values stored per key, even if a foreach doesn’t start by 0,1,2… simply accessing the value index you need will be super-fast.

Source code in PHP

<?php
/**
 * User: Carles Mateo
 * Date: 2015-04-05 14:19
 */

error_reporting(E_ALL);
// To avoid raising warnings from HHVM
date_default_timezone_set('Europe/Andorra');

// This freezes the server, using more than 35 GB RAM
//define('i_NUM_VALUES', 500000000);
// 50M uses ~ 10 GB RAM
define('i_NUM_VALUES', 50000000);
define('i_MAX_NUMBER', 1000000);

define('i_EXIT_SUCCESS', 0);
define('i_EXIT_ERROR', 1);

abstract class CsortDemo {

    const i_ALGORITHM_QUICKSORT        = 1;
    const i_ALGORITHM_PHP_QSORT        = 4;
    const i_ALGORITHM_PHP_ARRAY_FLIP   = 5;
    const i_ALGORITHM_PHP_ARRAY_UNIQUE = 6;
    const i_ALGORITHM_CSORT_BASIC      = 7;
    const i_ALGORITHM_CSORT_OPT1       = 8;


    public static function displayHelp() {
        $s_help = <<<EOT
CSort.php by Carles Mateo
=========================

This demonstrates a sort algorithm faster than quicksort,
as proposed in the article.
Please read the complete article at:
http://blog.carlesmateo.com

Please invoke the program with one of these parameters:

--help or without parameters for this help message
1 To test using a quicksort algorithm written in PHP
4 To test using PHP sort() (is a quicksort algorithm)
5 To test using PHP array_flip()
6 To test using PHP unique() (is a quicksort algorithm)
7 To test using csort (basic example, sorts and deletes duplicates)
8 To test using csort, improvement with indexes assigned when reading
  Use to watch the time saving from basic csort


9 To test using csort, with indexes assigned when reading and counting
  the number of coincidences (compressed output array)

Please, make sure that the file carles_sort.txt is present in the
current directory.

EOT;

        echo $s_help;

    }

    public static function get_random_values($i_algorithm) {

        $st_values = Array();
        $i_value = 0;

        $s_values_filename = 'carles_sort.txt';

        if (file_exists($s_values_filename)) {
            self::print_with_datetime("Reading values...");
            if ($i_algorithm == CsortDemo::i_ALGORITHM_CSORT_OPT1) {
                $o_fp = fopen($s_values_filename, 'r');
                if ($o_fp) {
                    while($s_line = fgets($o_fp) !== false) {
                        $i_value = intval($s_line);
                        $st_values[$i_value] = $i_value;
                    }
                }
            } else {
                $st_values = file($s_values_filename);
                self::print_with_datetime("Ensuring numeric values...");
                foreach($st_values as $i_key=>$s_value) {
                    $i_value = intval($s_value);
                    $st_values[$i_key] = $i_value;
                }

            }
        } else {
            self::print_with_datetime("Generating values...");
            $st_values = self::generate_random_values();
            self::print_with_datetime("Writing values to disk...");
            self::save_values_to_disk($s_values_filename, $st_values);
            self::print_with_datetime("Finished writing values to disk");
        }

        return $st_values;
    }

    public static function save_values_to_disk($s_values_filename, $st_values) {
        $i_counter = 0;
        $f_advance_percent = 0;
        $i_max_values = 0;

        $i_max_values = count($st_values);

        $o_file = fopen($s_values_filename, 'w');
        foreach ($st_values as $i_key => $s_value) {
            $i_counter++;
            fwrite($o_file, $s_value.PHP_EOL);
            if (($i_counter % 100000) == 0) {
                // every 100K numbers we show a status message
                $f_advance_percent = ($i_counter / $i_max_values)*100;
                self::print_with_datetime("Item num: $i_counter ($f_advance_percent%) value: $s_value written to disk");
            }
        }
        fclose($o_file);
    }

    public static function generate_random_values() {

        $i_num_values = i_NUM_VALUES;
        $i_max_number = i_MAX_NUMBER;

        $st_values = array();

        for($i_counter = 0; $i_counter<$i_num_values; $i_counter++) {
            $i_random = rand(0, $i_max_number);
            $st_values[] = $i_random;
        }

        return $st_values;

    }

    public static function get_datetime() {
        return date('Y-m-d H:i:s');
    }

    public static function print_with_datetime($s_string) {
        echo self::get_datetime().' '.$s_string."\n";
    }

    public static function print_algorithm_type($s_algorithm) {
        echo "Testing $s_algorithm\n";
    }

    public static function print_results($st_values, $i_initial_pos_to_display, $i_number_of_results_to_display) {

        self::print_with_datetime("Showing results from $i_initial_pos_to_display, $i_number_of_results_to_display items :");

        $i_counter = 0;
        $i_counter_found = 0;
        for($i_counter=0; ($i_counter_found<$i_number_of_results_to_display && (($i_counter + $i_initial_pos_to_display) < i_MAX_NUMBER)); $i_counter++) {
            $i_pos = $i_initial_pos_to_display + $i_counter;
            if (isset($st_values[$i_pos])) {
                echo '['.$i_pos.'] '.$st_values[$i_pos]."\n";
                $i_counter_found++;
            }
        }

    }

    // Found on the web
    public static function quicksort( $array ) {
        if( count( $array ) < 2 ) {
            return $array;
        }
        $left = $right = array( );
        reset( $array );
        $pivot_key  = key( $array );
        $pivot  = array_shift( $array );
        foreach( $array as $k => $v ) {
            if( $v < $pivot )
                $left[$k] = $v;
            else
                $right[$k] = $v;
        }
        return array_merge(self::quicksort($left), array($pivot_key => $pivot), self::quicksort($right));
    }

    public static function csort_basic($st_test_array) {
        // Note that even if we do not explicit to PHP to pass array by reference it does it,
        // and we do not use more memory unless we try to write to original array


        $s_algorithm = 'Carles_sort_basic';
        echo "Testing $s_algorithm\n";

        $st_result_array = Array();
        $st_result_array2 = Array();

        $i_min_value_pos = null;
        $i_max_value     = null;
        $i_max_value_pos = null;

        $i_temp_value = null;
        $i_loop_counter = 0;

        $i_num_items = count($st_test_array);

        $i_initial_pos = 0;
        $i_ending_pos  = $i_num_items;

        for($i_pos = $i_initial_pos; $i_pos < $i_ending_pos; $i_pos++) {
            $i_loop_counter++;

            $i_value = $st_test_array[$i_pos];
            $st_result_array[$i_value] = $i_value;

            // If you want to debug a bit
            //var_export($st_result_array); sleep(3);

            // If you want to debug...
            //if ($i_loop_counter % 1000 == 0) {
                //echo "Loop: $i_loop_counter i_pos: $i_pos Value: $i_value\n";
                //print_results($st_test_array, 0, 10);
                //print_results($st_test_array, $i_num_items - 2, 2);
            //}

        }

        // Return in the way of a foreach sorted Array
        $i_max_value = max($st_result_array);

        for($i_pos = 0; $i_pos <= $i_max_value; $i_pos++) {
            if (isset($st_result_array[$i_pos])) {
                $st_result_array2[] = $i_pos;
            }
        }

        return $st_result_array2;

    }

}

    // Checking parameters
    if (count($argv) < 2) {
        // Display help and exit
        CsortDemo::displayHelp();
        exit(i_EXIT_SUCCESS);
    }

    $i_algorithm = intval($argv[1]);

    CsortDemo::print_with_datetime('Starting program...');

    $st_test_array = CsortDemo::get_random_values($i_algorithm);

    $i_num_items = count($st_test_array);
    CsortDemo::print_with_datetime("Starting sorting with $i_num_items items, i_MAX_NUMBER was ".i_MAX_NUMBER);

    $i_loop_counter = 0;

    $i_initial_pos = 0;
    $i_ending_pos = $i_num_items - 1;

    $i_min_value     = i_MAX_NUMBER * 2; // A high number for the tests
    $i_min_value_pos = null;
    $i_max_value     = null;
    $i_max_value_pos = null;

    $i_temp_value = null;

    $b_changes_done = false;
    $b_exit = false;


    // TODO Test array_count_values
    $f_microtime_ini = microtime(true);

    if ($i_algorithm == CsortDemo::i_ALGORITHM_QUICKSORT) {
        $s_algorithm = 'Quicksort written in PHP';
        CsortDemo::print_algorithm_type($s_algorithm);
        CsortDemo::quicksort($st_test_array);
    }

    if ($i_algorithm == CsortDemo::i_ALGORITHM_CSORT_BASIC) {
        $st_test_array = CsortDemo::csort_basic($st_test_array);
    }

    if ($i_algorithm == CsortDemo::i_ALGORITHM_PHP_QSORT) {
        CsortDemo::print_algorithm_type('PHP sort() method');
        sort($st_test_array);
    }

    if ($i_algorithm == CsortDemo::i_ALGORITHM_PHP_ARRAY_FLIP) {
        CsortDemo::print_algorithm_type('PHP array_flip() method');
        $st_test_array = array_flip($st_test_array);
    }

    if ($i_algorithm == CsortDemo::i_ALGORITHM_PHP_ARRAY_UNIQUE) {
        CsortDemo::print_algorithm_type('PHP array_unique() method');
        $st_test_array = array_unique($st_test_array);
    }

    $f_microtime_end = microtime(true);

    $f_microtime_resulting = $f_microtime_end - $f_microtime_ini;

    echo "Microtime elapsed: $f_microtime_resulting\n";

    CsortDemo::print_with_datetime('Results:');
    //print_r($st_test_array);
    $i_total_items = count($st_test_array);
    echo 'Total items:'.$i_total_items."\n";
    CsortDemo::print_results($st_test_array, 0, 10);

    $i_last_number  = array_pop($st_test_array);
    $i_last_number1 = array_pop($st_test_array);
    CSortDemo::print_with_datetime('Last values:');
    echo $i_last_number1."\n";
    echo $i_last_number."\n";

    CsortDemo::print_with_datetime('Finished program');

PHP under HHVM

Additionally I executed the csort.php in the Facebook’s Hip Hop Virtual Machine, version 3.4.0-dev, and as expected everything is much faster than with PHP standard.

For example:

Using PHP sort() 88-104 seconds, using HHVM PHP sort() 15.8 seconds
Using csort basic from PHP 34 to 40 seconds, using csort basic from HHVM 3.4 to 5.13 seconds

The version of PHP used for these tests was:

PHP 5.5.9-1ubuntu4.9 (cli) (built: Apr 17 2015 11:44:57)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies

The version of hhvm used for the test was:

HipHop VM 3.4.0-dev (rel)
Compiler: heads/master-0-g96dec459bb84f606df2fe15c8c82df684b20ed83
Repo schema: 0100f1aaeeacd7b9d9a369b34ae5160acb0d1163
Extension API: 20140829

Must also mention that some operations like initializing the arrays, are super-fast with HHVM.

The solution in Java

The code has been executed with Oracle’s JDK 1.7.

I love Just In Time compilers. They are amazingly fast.

For this case, the results with Java, for the universe of 50M items, range 0-1M of values:

quicksort in Java: 3555 ms =3.555 seconds versus ~10 seconds in C
Important: In C I had to disable -O2 optimization for what it looks like a bug in gcc (only detected in csort with compression). With -O2 times were around 4.7 seconds
csort.java basic: 26 ms to 120 ms = 0.026 seconds versus 0.5 seconds csort in C
csort.java opt read: 2-3 ms = 0.002-0.003 seconds versus 3 ms = 0.003 seconds in C
csort.java with compression and duplicates support: 160 ms = 0.16 seconds versus 0.272-0.277 segons in C
The execution on those languages (Java and C) is so fast that it’s hard to measure accurately as any random execution of Ubuntu in background will increase a bit the time.

blog-carlesmateo-com-java-quicksort-3555ms

Quicksort in Java, 3555 ms = 3.555 seconds

blog-carlesmateo-com-java-csort-basic-26ms

Csort basic with Java, 26 milliseconds

blog-carlesmateo-com-java-csort-read-opt-5ms

Csort read opt with Java, 5 milliseconds

The code in Java:

/*
 * This code, except quicksort implementation, is written by Carles Mateo, 
 * and published as Open Source.
 */
package csort;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

/**
 *
 * @author carles
 */
public class Csort {

    public static final int i_MAX_REGISTERS = 50000000;
    public static final int i_MAX_VALUE     = 1000000;
    public static final String s_FILE = "/home/carles/Desktop/codi/php/tests/carles_sort.txt";
    
    public static final int i_ERROR_SUCCESS = 0;
    public static final int i_ERROR_FAILURE = 1;
    
    public static final int i_ALGORITHM_QUICKSORT        = 1;
    public static final int i_ALGORITHM_CSORT_BASIC      = 7;
    public static final int i_ALGORITHM_CSORT_OPT1       = 8;
    public static final int i_ALGORITHM_CSORT_COMP       = 9;

    // Original values read
    public int[] st_values; 
    // Our Csorted array
    public int[] st_values_sorted;
    // To resort as a alike list array
    public int[] st_values_sorted_renum;
    
    // For quicksort
    private int[] numbers;
    private int number;    
    
    public void systemExit(String s_reason, int i_error_code) {
        printWithDatetime("Exiting to system, cause: " + s_reason);
        System.exit(i_error_code);
    }
    
    public static String getDateTime() {
        DateFormat o_dateformat = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss");
        Date o_date = new Date();
        
        return o_dateformat.format(o_date);
    }
    
    public static void printWithDatetime(String s_message) {
        System.out.println(getDateTime() + " " + s_message);
    }
    
    public void printAlgorith(String s_algorithm) {
        Csort.printWithDatetime("Using algorithm: " + s_algorithm);
    }
    
    public static void showHelp() {
        
        String s_help = "" + 
            "CSort.java by Carles Mateo\n" +
            "==========================\n" +
            "\n" +
            "This demonstrates a sort algorithm faster than quicksort,\n" +
            "as proposed in the article.\n" +
            "Please read the complete article at:\n" +
            "http://blog.carlesmateo.com\n" +
            "\n" +
            "Please invoke the program with one of these parameters:\n" +
            "\n" +
            "--help or without parameters for this help message\n" +
            "1 To test using quicksort\n" +
            "  Only sorts, does not deletes duplicates\n" +
            "7 To test using csort (basic example, sorts and deletes duplicates)\n" +
            "8 To test using csort, improvement with indexes assigned when reading\n" +
            "  Use to watch the time saving from basic csort\n" +
            "\n" +
            "Please, make sure that the file carles_sort.txt is present in the\n" +
            "current directory.\n" +
            "\n";
        
        System.out.println(s_help);
        
    }
    
    public int readFile(int i_algorithm) {
        BufferedReader br = null;

        int i_value = 0;
        int i_counter = 0;        
        
        try {

                String sCurrentLine;

                br = new BufferedReader(new FileReader(Csort.s_FILE));

                while ((sCurrentLine = br.readLine()) != null) {
                    i_value = Integer.parseInt(sCurrentLine);
                    if (i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1) {
                        this.st_values[i_value] = i_value;
                    } else {
                        this.st_values[i_counter] = i_value;
                    }
                    
                    if (i_counter % 500000 == 0) {
                        float f_percent = ((float) i_counter / Csort.i_MAX_REGISTERS) * 100;
                        String s_percent_msg = "Read register " + i_counter + " value " + i_value + " " + f_percent + "% read";
                        System.out.println(s_percent_msg);
                    }
                    i_counter++;
                }

        } catch (IOException e) {
                e.printStackTrace();
        } finally {
                try {
                    if (br != null) br.close();
                } catch (IOException ex) {
                    ex.printStackTrace();
                    this.systemExit("Error processing the file: " + Csort.s_FILE, Csort.i_ERROR_FAILURE);
                }
        }
        
        return i_counter;
    }
    
    // http://www.vogella.com/tutorials/JavaAlgorithmsQuicksort/article.html
    public void qsort(int[] values) {
        // check for empty or null array
        if (values == null || values.length == 0){
            return;
        }
        this.numbers = values;
        number = values.length;
        quicksort(0, number - 1);
    }    
    
    private void quicksort(int low, int high) {
        int i = low, j = high;
        // Get the pivot element from the middle of the list
        int pivot = numbers[low + (high-low)/2];

        // Divide into two lists
        while (i <= j) {
          // If the current value from the left list is smaller then the pivot
          // element then get the next element from the left list
          while (numbers[i] < pivot) {
            i++;
          }
          // If the current value from the right list is larger then the pivot
          // element then get the next element from the right list
          while (numbers[j] > pivot) {
            j--;
          }

          // If we have found a values in the left list which is larger then
          // the pivot element and if we have found a value in the right list
          // which is smaller then the pivot element then we exchange the
          // values.
          // As we are done we can increase i and j
          if (i <= j) {
            exchange(i, j);
            i++;
            j--;
          }
        }
        // Recursion
        if (low < j)
          quicksort(low, j);
        if (i < high)
          quicksort(i, high);
    }

    private void exchange(int i, int j) {
        int temp = numbers[i];
        numbers[i] = numbers[j];
        numbers[j] = temp;
    }
    // End of http://www.vogella.com/tutorials/JavaAlgorithmsQuicksort/article.html
  
    public void csortBasic() {
        int i_counter = 0;
        int i_value = 0;
        int i_max_range = 0;
        int i_counter_hit = 0;
        
        try {
            for (i_counter=0; i_counter < i_MAX_REGISTERS; i_counter++) {
                i_value = this.st_values[i_counter];
                this.st_values_sorted[i_value] = i_value;
            }

            // Resort for forearch
            i_max_range = i_MAX_VALUE;
            for (i_counter=0; i_counter < i_max_range; i_counter++) {
                if (this.st_values_sorted[i_counter] > -1 ) {
                    this.st_values_sorted_renum[i_counter_hit] = this.st_values_sorted[i_counter];
                    i_counter_hit++;
                }
            }
        } catch (Exception e) {
            printWithDatetime("Exception captured at i_counter: " + i_counter);
            this.systemExit("Exception, so ending", i_ERROR_FAILURE);
        }        
        
    }
    
    public void csortOptimized() {
        int i_counter = 0;
        int i_value = 0;
        int i_max_range = 0;
        int i_counter_hit = 0;
        
        try {
            // It was sorted while reading :)

            // Resort for forearch/list-style lovers
            i_max_range = i_MAX_VALUE;
            for (i_counter=0; i_counter < i_max_range; i_counter++) {
                if (this.st_values[i_counter] > -1 ) {
                    this.st_values_sorted_renum[i_counter_hit] = this.st_values[i_counter];
                    i_counter_hit++;
                }
            }
        } catch (Exception e) {
            printWithDatetime("Exception captured at i_counter: " + i_counter);
            this.systemExit("Exception, so ending", i_ERROR_FAILURE);
        }        
        
    }    
    
    public void csortCompress() {
        int i_counter = 0;
        int i_value = 0;
        
        for (i_counter=0; i_counter < i_MAX_REGISTERS; i_counter++) {
            i_value = this.st_values[i_counter];
            this.st_values_sorted[i_value] = this.st_values_sorted[i_value] + 1;
        }

    }
    
    public void printItems(int i_num_items, int[] st_array) {
        int i_counter = 0;
        int i_value   = 0;
        
        for(i_counter=0;i_counter<i_num_items;i_counter++) {
            i_value = st_array[i_counter];
            System.out.println("Item[" + i_counter + "] value: " + i_value);
        }
    }
        
    public Csort(int i_registers, int i_max_value) {
        
        // int range is -2,147,483,648 to 2,147,483,647
        this.st_values = new int[i_registers];
        // If max value is 10 then we could have values from 0 to 10, so a 11 items Array
        this.st_values_sorted = new int[i_max_value + 1];
        this.st_values_sorted_renum = new int[i_max_value + 1];
        
    }
    
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
                
        long l_milliseconds_ini   = 0;
        long l_milliseconds_end   = 0;
        long l_milliseconds_total = 0;
        
        int i_items_read = 0;
        
        int i_algorithm = 0;
        String s_algorithm = "";
        
        int i_counter = 0;
        
        /* Check parameter consistency */
        if (args.length == 0 || args.length > 1 || (args.length == 1 && Integer.valueOf(args[0]) == 0)) {            
            showHelp();
            System.exit(i_ERROR_SUCCESS);
        }
        
        i_algorithm = Integer.valueOf(args[0]);
        
        Csort o_csort = new Csort(Csort.i_MAX_REGISTERS, Csort.i_MAX_VALUE);
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_BASIC || i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1 || i_algorithm == Csort.i_ALGORITHM_QUICKSORT) {
            // We init empty position to -1 to differenciate from empty items and from value 0 (that is valid)
            Csort.printWithDatetime("Initializing values to -1...");
            for(i_counter=0; i_counter < Csort.i_MAX_VALUE; i_counter++) {
                o_csort.st_values_sorted[i_counter] = -1;
                o_csort.st_values_sorted_renum[i_counter] = -1;
            }            
        } else if (i_algorithm == Csort.i_ALGORITHM_CSORT_COMP) {
            // We init empty position to -1 to differenciate from empty items and from value 0 (that is valid)
            Csort.printWithDatetime("Initializing values to 0...");
            for(i_counter=0; i_counter < Csort.i_MAX_VALUE; i_counter++) {
                o_csort.st_values_sorted[i_counter] = 0;
            }            
        } else {
            showHelp();
            System.exit(i_ERROR_SUCCESS);            
        }        
        
        Csort.printWithDatetime("Starting csort.java...");
        
        Csort.printWithDatetime("Reading file...");
        i_items_read = o_csort.readFile(i_algorithm);
        Csort.printWithDatetime("Registers read:" + i_items_read);
        
        l_milliseconds_ini = System.currentTimeMillis();
        
        if (i_algorithm == Csort.i_ALGORITHM_QUICKSORT) {            
            s_algorithm = "Quicksort";
            o_csort.printAlgorith(s_algorithm);
            
            o_csort.qsort(o_csort.st_values);
        }
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_BASIC) {
            s_algorithm = "Csort basic";
            o_csort.printAlgorith(s_algorithm);
            
            o_csort.csortBasic();
        }
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1) {
            s_algorithm = "Csort Read optimized";
            o_csort.printAlgorith(s_algorithm);
            
            o_csort.csortOptimized();            
        }
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_COMP) {
            s_algorithm = "Csort compress with duplicates";
            o_csort.printAlgorith(s_algorithm);
            
            o_csort.csortCompress();
        }        
                
        l_milliseconds_end = System.currentTimeMillis();
        
        l_milliseconds_total = l_milliseconds_end - l_milliseconds_ini;
        
        Csort.printWithDatetime("Algorithm execution time: " + l_milliseconds_total + " milliseconds");
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_BASIC || i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1) {
            o_csort.printItems(10, o_csort.st_values_sorted_renum);
        }
        
        if (i_algorithm == Csort.i_ALGORITHM_QUICKSORT) {
            o_csort.printItems(10, o_csort.numbers);
        }
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_COMP) {
            o_csort.printItems(10, o_csort.st_values_sorted);
        }
                        
    }
    
}

Note: This article is growing. I’ll be expanding the article, putting more samples, detailing timmings from different universes I’ve tested and screenshots, specially in the PHP section, to add more tricks and whys, but I wanted to put the base for being able to discuss with some colleagues.

What is PrototypeC and how you can DIY

It all started with my best friend, and the best Engineer I know, E. H. We often meet for doing hackatons and we explore node, go, patterns, technologies… Some times we just meet and improvise a hackaton.

I’ve been carrying with me a laptop, almost every single day of my life, since many years ago (that’s why I created some jokes about gym and IT and have fun with my friends, like: we IT guys go to the gym to be so strong so we can carry a laptop everywhere, all the time, without effort).

But some of my friends suffer from the back, and they can’t carry weight, so they don’t bring with them the laptop unless necessary.

There are some other friends, Sysadmins in the Operations field, that sometimes have alerts, so they have to carry a laptop with them when they’re on guard, but sometimes they have an alert out of guard, and nobody else can fix it and they don’t have a computer with them.

I also I’m concerned about privacy and sadly most smartphone builders install crappy Software that spy the users, and send keystrokes to third party companies, etc… So having a portable device, small as a credit card, where you can type your business ideas or your thoughts with some privacy, or you can use to fix remote servers when they broke, etc… looked appealing. This also fits with my private messenger c-client.

So, when I configured my raspberry 2 with OSMC and started to install Software, program crons, etc… I realized that is a tiny computer, capable enough. And then an idea came to me.

What if I can make this lightweight motherboard a wearable computer?.

And I started to play with this.

OS

I first needed an interesting enough OS, that I or my DevOps friends would like to use.

I choose Debian Jessie for Arm (armhf). Raspberry Pi 2 uses ARM v7 that is a very important improvement respect previous Raspberry models, not only in performance, but also in the floating point. That simplifies many things and much more Software packages are available (like Snappy Ubuntu Core or Windows 10).

Using Debian Jessie provided me with a very basic system that really uses few RAM memory. It uses only 65 MB RAM, with all the Wifi supported firmwares.

Then I installed X-Windows, with LXDE as Desktop Manager. Everything uses only 120 MB RAM.

I installed several packages, to highlight the epiphany web browser, Open Java Runtime Environment, and PHPStorm IDE.

I also use Ubuntu 15.04 Desktop for ARM.

Battery

Unless Raspberry pi first generation, Raspberry Pi 2 uses micro-USB standard 5V input, like most of the Android phones. So I decided to do some tests about sustained energy input with batteries.

First I calculated the energy consumption of the Pi 2, plus Ethernet and plus some USB devices I would like to attach.

I found that common power banks, those you use to charge your smartphone when battery is depleted and you’re on the run, bring a continuous energy signal that works perfectly well. I tested it by running the Raspberry Pi 2 for hours while reproducing video, doing live updates, flawlessly.

Then the first version of the Prototype C was born.

Costs:

$40 Raspbery Pi 2
$10 Powerbank 2000 mAh
$15 USB Wifi-N card

It weights 160 grams/0.3527 pounds/5.64 ounces.

Some of the powerbanks are so cool that they allow you to charge them while at the same time they are providing energy to the output. So you can have the Prototype C continuously running, and if the battery is going to deplete you can just plug it to the energy plug of a bar, and the C will continue running non-stop, and the battery will charge, so in a while you can continue your walk without stopping the C.

Later I bought a Levin solar powered battery, that is lightweight, protected to impacts and that has two outputs (1, 2 A) and charges while exposed to sun, so I can have the battery in the outside of my bag, and while I walk in the city it is being charged at a max rate of 200 mA. This is ideal for people hiking in the mountain or traveling in the world, so it can save lives if a problem comes and phone’s battery is depleted.

I think also in boys in Africa or poor countries where they children have to walk a long to the school. This walking could help charging the battery so they can study at the school and continue studying at home.

With the C a solar powered battery also allow to blog from everywhere. :)

You don’t even need a cable, just an small USB-micro-USB connector

Note: This Levin solar battery doesn’t accept to be charged and to bring energy to the device at the same time.

Note: you can also aliment it from the car lighter connector with and adapter.

Monitor

That part is not easy.

The monitors I was interested in were not sold to Catalonia. Finally I was able to find some of my interest in Amazon.es

Most of those monitors work for the Raspberry Pi, first gen, or for Arduino, Banana Pi, etc… but not for Raspberry Pi 2. So this is the first thing to be careful.

Other of those monitors get the signal from the Raspberry Pi GPIO, the General Purpose Input Output, from Raspberry. So they need some kernel patching, binaries, not Open Source code usually, and that also prevents from certain kind direct access to drawing in X not being displayed in the monitor attached via GPIO.

Other monitors require external power source, often 12 V, and so are very voluminous. So they are not a fit for our means.

Normally those monitors are touch screens, some take the power from the GPIO and also control the touching from there, but others require two USB: one for the energy and other for the touch controlling part.

You have also to consider if they have external button to power on, power off.

Also those monitors have extremely low resolution, 320×480 and like this. Only found a model supporting 800×600 and is a 5″.

And finally you have to take in consideration the power consumption.

I solved this by buying power banks that have a 2 or 2.5 A output.

Simple power banks provide a 1 A output, better ones provide two outputs: 1 A output for smartphones, and 2 A output, normally for tablets.

Part of the cool thing is the possibility to have the monitor in a place, and the Pi and the battery in another place, like for creating wearables and other cool stuff.

As I wanted to provide DIY for people, and an elegant cheap lightweight solution, the difficulty to find those monitors was a gap, so I explored several ways to use cheaper screens until I realized what is what every engineer (and common people) carries with him every day?. The smartphone.

USB Wifi

As I wanted to have Wifi I needed USB Wifi that can work with Raspberry Pi 2 and Debian Jessie.

I had some problems at the beginning (caused by a bug in wcid), but I managed to make everything work. I have made it work with two different Wifi card models flawlessly: TP-Link Nano, Asus Nano USB-N10.

Nothing stops you from adding two Wifi USB to the device. In fact is my preferred option, as I use one for being connected/controlled/display to the phone, and another for connecting to the Wifi of the place I am. That way I don’t use my mobile Data plan that much when I’m outside. In this case I recommend to use different cards (different supported chipsets) to save you headaches configuring and setting what does what (unless you like to fight with configurations until you master everything :).

Some USB Wifi devices also equip Bluetooth transmission, that is very interesting for future prototypes for controlling other devices (car radio, headsets, commands..).

Controlling the C from the smartphone

I started to test certain things to overcome the display difficulties and finally I had cool an idea and I found a very nice option.

Here configuring the Vnc client to access to :1

Sharing the Internet from my iPhone creates an internal network of the type 172.20.10.x.

Sharing the Internet connection from the phone, and connecting the C automatically it creates a private network, with full network visibility between the devices.

So with a ssh app for iPhone like Server Auditor I was able to connect through ssh to the C, and with a VNC Client, like Real Vnc Client, I was able to connect to the Desktop of the Prototype C, control it, send keystrokes, zoom in, zoom out. Both apps are free to download and to use.

For Android it works the same, for sharing the Internet connection it creates an internal network with another range and there are also free apps for the purpose.

The ip assigned is normally the same, 172.20.10.9 in my case.

The only required thing was to setup the C to auto-connect via Wifi to my phone’s Wifi shared connection, and then connect to the C device’s ip.

This is very easily done with Wcid from LXDE.

Mark Automatically connect to this Network

I configured vnc server so I can access to the same display than the HDMI, so I can switch from working with the smartphone to HDMI easily, and I also configured vnc server through other display at a 1024×768 resolutions. That was more comfortable to work only with the small screen of my iPhone 4. Although setting a lower resolution, like 800×600, makes it easier to work with tiny screens.

With the Vnc client you can zoom in, zoom out, with the classical gestures and use keyboard.

Not only this, I allowed several of my friends to have their own independent sessions to the same C. So with only one C several Engineers can Develop or do DevOps stuff.

Must say that with the Nexus and other smartphones with bigger screens, or tablets, the experience is amazing, much much better than with my tiny iPhone’s 4 screen.

And of course you can control the C also from a laptop.

Nailing it

I added to the set an external bluetooth keyboard+touchpad.

Prototype C with the bluetooth keyboard. The motherboard is hidden under the battery

If you use VNC on display :0 it’s perfect, obviously external keyboard won’t work if you use different displays (as key strokes are sent only to the device and not to the different X VNC sessions, of course).

With an external keyboard, the worries about spyware in the iPhone were mitigated, as all the keystrokes are on the physical external bluetooth keyboard and not on the iPhone’s screen.

The C with a smaller battery

I used a Rii mini i8, and a Logitech.

Some photos of real use examples

Sometimes I just connect the raw motherboard to the solar charger in the bag and only use the smartphone over the table.

Sometimes I just connect the raw motherboard to the solar charger in the bag, and just use the smartphone over the table.

We were dinning and to the cinema with some IT friends when suddenly one of the SysAdmins got an alert and needed a Linux box

We were dinning and to the cinema with some IT Engineer friends when suddenly one of the SysAdmins got an alert and needed a Linux box

No space? Do you need a sand box and have no memory for a virtual one?

Opening PHPStorm at the bar. Under LXDE

blog-carlesmateo-com-web-browsing-at-the-bar

The C at the bar displaying epiphany web browser with the address http://www.prototypec.com

C over a jacket in a restaurant

Waiting time? Get C!

Mounting a USB mini webcam in a glass, while at a bar. The battery is charging the iPhone and powering the Prototype C

Honestly, I love to carry with me the raw Raspberry Pi 2 motherboard (in a cartoon box) and to connect it as is, but you can use a plastic box. It is also very useful to hang to walls, furniture, or the car.

Raspberry Pi and osmc

There is something that fascinates me from the new Raspberry Pi, and using it as a media center.
It is the fact that is a really small board.
That is powered by a micro USB 1000 mhA.
That is powered with Linux.

I had other media centers before but they were magnetic hard disk, closed in a proprietary system.
The media center I installed, with RaspBerry Pi+, is osmc, that is Open Source Media Center.

So I have full access via ssh to the RaspBerry, and as it used so few energy I have it all the day up.
Then, as it is a Linux box, and I have full access, and I’ve around 546 MB RAM free, I can run as many background process as I want.
Do I want to be a jump point for my VPN? Let’s go.
Do I want to have some monitoring processes over few websites? Let’s do it!.

I’m really happy about having a so tiny, so few energy consuming, full Linux, being my media center and whatever I want to it to do.

I must say that is wonderful having SSH and a network interface. Ok, it’s 10/100 Mbps, not Gigabit, but it is enough to allow me to copy new files in background to the USB stick via SFTP while reproducing at FullHD Blueray MKV, files right. Also allows to mount network folders via NFS or SMB amd play from them. Copying via SFTP to the USB device is generally very slow -don’t be surprised to upload at 30 KB/s- so I recommend to set a NFS folder in the computer, with read access to the ip of the Raspberry. It’s very cool and plays totally smooth using the 100 Mbit ethernet connection. You can also configure a FTP in the Pi, that will be much faster than the SFTP.

The RaspBerry micro SD card has a performance of ~22 MB, that is enough to boot very quickly and to load programs quite fast. I have other microSD cards with Debian Jessie, and I load PHPStorm (Java based PHP IDE) quite fast.

It boots really fast, in case you stop and start it frequently.

It accepts my wired Mouse and Keyboard, and also wireless bluetooth.

I’m really in love with this small motherboard. :)

This tiny RaspBerry 2, has 4 cores at 900 Mhz.

The CPU announces (cat /proc/cpuinfo):

processor    : 3
model name    : ARMv7 Processor rev 5 (v7l)
BogoMIPS    : 38.40
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer    : 0x41
CPU architecture: 7
CPU variant    : 0x0
CPU part    : 0xc07
CPU revision    : 5

As you see, it scores only 38.40 bogomips, compared to my tower desktop 6384.59, and my old laptop 2593.45, but it’s still beautiful.

Note: you cannot trust bogomips as a performance measurement, and in addition my computers are Intel based -so CISC architecture- while RasbBerry uses ARM processors that are RISC, that is a completely different architecture. I notice a very fluid speed, only I sense a bit slowliness in the process when I install new packages. When unpacking it feels slow, although it can perfectly be caused by the SSD card IO as well, so I installed iotop and monitorized the I/O while I was installing PHP5 :) . I got small writings up to 1,000 KB/sec, so 1 MB/s, with average of ~30-50KB writing operations, no iowait, while I was seeing with htop that the core unpacking was at 100 % of CPU, the other 3 were free, so my initial conclusion is that the bottleneck was on the CPU. Still happy about my little gadget. :)

The osmc image I installed comes with python 2.7.9 and Linux kernel 3.18.9 as uname -a shows:

Linux osmc 3.18.9-5-osmc #1 SMP PREEMPT Wed Mar 11 18:59:35 UTC 2015 armv7l GNU/Linux

It also comes with wget 1.16 and curl 7.38.0.

In fact the OSMC is based on the Debian Jessie distro.

The OSMC software also have upgrades, and Debian upgrades, that keep the Linux box up to date.

So that brings a lot of possibilities.

After a sudo apt-get update I was able to install htop, mc and apache2.

sudo apt-get install htop
sudo apt-get install iotop
sudo apt-get install iftop
sudo apt-get install mc
sudo apt-get install apache2
sudo apt-get install php5
sudo apt-get instlal ncdu

So it’s a lot of fun. :)

Note: Although a 1000mhA is enough (Raspberry Pi 2 needs around 700mhA) if you plan to plug a cheap case 2.5 hard disk without external power -just USB- it will not be enough. In this case I recommend buying a 2000mhA transformer for the Pi, or a external USB hub energy powered (2000mhA otherwise you risk energy from Raspbery + USB hub being to sufficient). If the disk has external power, then you’ll have no probem. Personaly I use USB sticks.

When I had my incubator of Start ups some years ago, one of my Start up project was embedding motherboards within screens, and offering the ability to play videos, images, even flash games and animations, and manage and update everything and update contents for a groups of players from the Internet, or based on time triggers. I was finalist for selling my product to a enormous multinational, it was close, but finally a Korean company with a cheaper (and less powerful solution) won. At that time, it was 2004, motherboards were huge comparing to this tiny piece of hardware and I had to deal with different voltage, power consumption, heat dissipation, safety, etc…. so I’m really in love with this tiny piece hardware that doesn’t need even a ventilator or a big dissipation mechanism.

A quick note about symmetric encryption and entropy

I had the discussion with some friends about symmetric encryption, and about the problem of security that it supposes that the encryption resulting of a block of data, for the same password, will be always the same. I write it here the explanations I tell them and what I do, so I can just send this link instead of explaining the same every time. :)

I am using different kinds of encryption for my new 2014’s Messenger C-Client.

The main advantage of symmetric encryption for me, is that is cheap in terms of CPU usage.

When I send files thought my messenger all the data blocks have to be encrypted, and so, if I send the Ubuntu 14.10 ISO image, this is a lot of data to encrypt, and in order to get a good throughput and performance I needed to find an outstanding optimal way to do it, that is at the same time secure.

Many encryption techniques and algorithms can be used, and all at the same time over the same packets to make them more secure. In this article I will describe only an improvement to the symmetric encryption to make it no predictable and very fast, so cheap to use.

So I introduce the concept here of noise, or entropy, in symmetric data packets.

Imagine that you want to send a chunk of data that is:

SGVsbG8sIEkgc2VlIHRoYXQgeW91IHRyeSB0aGluZ3MuIDopIENvbmdyYXRz

For clarity, only base64 are used in this sample, no binary data.

The problem is that every time that you send this chunk of data, you’ll get the same encrypted string. This is predictable and can lead to someone to reverse engineer your password.

So taking in consideration this sample Java 1.7 source code (note that variables use MT notation) that just encrypts for a given key:

import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import javax.xml.bind.DatatypeConverter;
import java.util.UUID;
import javax.crypto.Cipher;
import javax.crypto.IllegalBlockSizeException;
import javax.crypto.KeyGenerator;
import javax.crypto.NoSuchPaddingException;
import javax.crypto.SecretKey;

/**
 * @author Carles Mateo
 */
public abstract class Security {

    public static String encrypt(String s_key, String s_str) {
        try {
            
            /** encryption cypher */
            Cipher o_ecipher; 
          
            SecureRandom seed = SecureRandom.getInstance("SHA1PRNG");

            seed.setSeed(s_key.getBytes());

            KeyGenerator keygen = KeyGenerator.getInstance("DES");

            keygen.init(seed);

            SecretKey key = keygen.generateKey();

            o_ecipher = Cipher.getInstance("DES");
            o_ecipher.init(Cipher.ENCRYPT_MODE, key);
            
            // Encode the string into bytes using utf-8
            byte[] utf8 = s_str.getBytes("UTF8");

            // Encrypt
            byte[] enc = o_ecipher.doFinal(utf8);
            
            // Encode bytes to base64 to get a string
            String s_encrypted_encoded = encodeBase64(enc);
            
            return s_encrypted_encoded;
        } catch (javax.crypto.BadPaddingException e) {
        } catch (NoSuchAlgorithmException e) { 
        } catch (NoSuchPaddingException e) { 
        } catch (IllegalBlockSizeException e) {
        } catch (java.io.IOException e) {
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }

        return null;
    }
}

Then here is the cheap way that consist in making the original data always be different, that I came with.

Instead of sending just the data, add a bit of entropy or noise, like an UUID, so our packet would be:

f47ac10b-58cc-4372-a567-0e02b2c3d479|SGVsbG8sIEkgc2VlIHRoYXQgeW91IHRyeSB0aGluZ3MuIDopIENvbmdyYXRz

If you’re reading this article you already know, but just in case, the f47ac10b-58cc-4372-a567-0e02b2c3d479 is an UUID.

The pipe | is for clarity, as the UUID has a length fix it is not really needed to manipulate the string.

Having this UUID in here is also cool, as it will allow us to do things like using this UUID as the password for the next packet. So if our messenger starts with the user’s password, from there the next packets could use this UUID as password (and the next one, the previously random generated UUID and so on…) so every packet has a new password, that must be known from the previous history in order to get all the packets description.

So if a man-in-the-middle is capturing all the data sent, hoping to being able to break the encryption in a near future, it will have to be sure to record all the packets to decode the sequence easily, otherwise will have to fight to break every single packet.

So UUID is random, but is not noise at all. It is useful entropy.

Another thing that we can do to is set a timestamp on the packet, that way, even if a man in the middle is able to clone and later send the initial packet to the Server, it will be discarded.

For example:

f47ac10b-58cc-4372-a567-0e02b2c3d479|1428002755|SGVsbG8sIEkgc2VlIHRoYXQgeW91IHRyeSB0aGluZ3MuIDopIENvbmdyYXRz

The second field is the unix timestamp, so our Server that has tracked the time or the Client’s computer and knows its rid, knows that if he gets a timestamp in the packet that has more than 2 seconds of difference, it will discard the packet and end the connection. So no one in-the-middle will be able to success cloning a login packet, and injecting it to the Server.

So having a basic encapsulation, that is cheap, for transport, the data inside can be also in addition be encrypted with asymmetric and other mechanisms that guarantee that only Client1 and Client2 can decrypt it, and not the Server, while the Server ensures packets are compliant and no noise/attacks are sent though the Server to Clients.

Stopping definitively the massive Distributed DoS attack

I explain some final information and tricks and definitive guidance so you can totally stop the DDoS attacks affecting so many sites.

After effectively mitigating the previous DDoS Torrent attack to a point that was no longer harmful, even with thousands of requests, I had another surprise.

It was the morning when I had a brutal increase of the attacks and a savage spike of traffic and requests of different nature, so in addition to the BitTorent attack that never stopped but was harmful. I was on route to the work when the alarms raised into my smartphone. I stopped in a gas station, and started to fix it from the car, in the parking area.

Service was irresponsible, so first thing I did was to close the Firewall (from the Cloud provider panel) to HTTP and HTTPS to the Front Web Servers.

Immediately after that I was able to log in via SSH to the Servers. I tried to open HTTPS for our customers and I was surprised that most of the traffic was going through HTTPS. So I closed both, I called the CEO of the company to give status, and added a rule to the Firewall so the staff on the company would be able to work against the Backoffice Servers and browse the web while I was fixing all the mess. I also instructed my crew and gave instructions to support team to help the business users and to deal with customers issues while I was stopping the attack.

I saw a lot of new attacks in the access logs.

For example, requests like coming from Android SDK. Weird.

I blocked those by modifying the index.php (code in the previous article)

// Patch urgency Carles to stop an attack based on Torrent and exploiting sdk's // http://blog.carlesmateo.com if (isset($_GET['info_hash']) || (isset($_GET['format']) && isset($_GET['sdk'])) || (isset($_GET['format']) && isset($_GET['id']))) {

(Note: If your setup allows it and the HOST requested is another (attacks to ip), you can just check requestes HOST and block what’s different, or if not also implement an Apache rule to divert all the traffic to that route (like /announce for Bittorrent attack) to another file.)

Then, with the cron blocking the ip addresses all improved.

Problem was that I was receiving thousands of requests per second, and so, adding those thousands of Ip’s to IPTABLES was not efficient, in fact was so slow, that the process was taking more than an hour. The server continued overwhelmed with thousands of new ip’s per second while it was able to manage to block few per second (and with a performance adding ip’s degraded since the ip number 5,000). It was not the right strategy to fight back this attack.

Some examples of weird traffic over the http logs:

119.63.196.31 - - [01/Feb/2015:06:54:12 +0100] "GET /video/iiiOqybRvsM/images/av
atar/0097.jpg HTTP/1.1" 404 499 "-" "Baiduspider-image+(+http://www.baidu.com/se
arch/spider.htm)\\nReferer: http://image.baidu.com/i?ct=503316480&z=0&tn=baiduim
agedetail"
119.63.196.126 - - [01/Feb/2015:07:15:48 +0100] "GET /video/Ig3ebdqswQI/images/avatar/0114.jpg HTTP/1.1" 404 499 "-" "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)\\nReferer: http://image.baidu.com/i?ct=503316480&z=0&tn=baiduimagedetail"
180.153.206.20 - - [30/Jan/2015:06:34:34 +0100] "GET /contents/all/tokusyu/skin_detail.html HTTP/1.1" 404 492 "-" "Mozilla/4.0"
180.153.206.20 - - [30/Jan/2015:06:34:35 +0100] "GET /contents/gift/ban_gift/skin_gift.js HTTP/1.1" 404 490 "http://top.dhc.co.jp/contents/gift/ban_gift/skin_gift.js" "Mozilla/4.0"
180.153.206.20 - - [30/Jan/2015:06:34:36 +0100] "GET /contents/all/content/skin.js HTTP/1.1" 404 483 "http://top.dhc.co.jp/contents/all/content/skin.js" "Mozilla/4.0
220.181.108.105 - - [30/Jan/2015:06:35:22 +0100] "GET /image/DbLiteGraphic/201405/thumb_14773360.jpg?1414567871 HTTP/1.1" 404 505 "http://image.baidu.com/i?ct=503316480&z=0&tn=baiduimagedetail" "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)"
118.249.207.5 - - [30/Jan/2015:06:39:26 +0100] "GET /widgets.js HTTP/1.1" 404 0 "http://www.javlibrary.com/cn/vl_star.php?&mode=&s=azgcg&page=2" "Mozilla/5.0 (MSIE 9.0; qdesk 2.4.1266.203; Windows NT 6.1; WOW64; Trident/7.0; rv:11.0; QQBrowser/8.0.3197.400) like Gecko"
120.197.109.101 - - [30/Jan/2015:06:57:20 +0100] "GET /ads/2015/20minutes/publishing/stval/3001.html?pos=6 HTTP/1.1" 404 542 "-" "20minv3/6 CFNetwork/609.1.4 Darwin/13.0.0"
101.226.33.221 - - [30/Jan/2015:07:00:04 +0100] "GET /?LR_PUBLISHER_ID=29877&LR_SCHEMA=vast2-vpaid&LR_PARTNERS=758858&LR_CONTENT=1&LR_AUTOPLAY=1&LR_URL=http://play.retroogames.com/penguin-hockey-lite/?pc1oibq9f5&utm_source=4207933&entId=7a6c6f60cd27700031f2802842eb2457b46e15c0 HTTP/1.1" 200 11783 "http://play.retroogames.com/penguin-hockey-lite/?pc1oibq9f5&utm_source=4207933&entId=7a6c6f60cd27700031f2802842eb2457b46e15c0" "Mozilla/4.0"
180.153.206.20 - - [30/Jan/2015:07:03:10 +0100] "GET /?metric=csync&p=3030&s=6109924323866574861 HTTP/1.1" 200 11783 "http://v.youku.com/v_show/id_XODYzNzA0NTk2.html" "Mozilla/4.0"
180.153.114.199 - - [30/Jan/2015:07:05:18 +0100] "GET /?LR_PUBLISHER_ID=78817&LR_SCHEMA=vast2-vpaid&LR_PARTNERS=763110&LR_CONTENT=1&LR_AUTOPLAY=1&LR_URL=http://www.dnd.com.pk/free-download-malala-malala-yousafzai HTTP/1.1" 200 11783 "http://www.dnd.com.pk/free-download-malala-malala-yousafzai" "Mozilla/4.0"
60.185.249.230 - - [30/Jan/2015:07:08:20 +0100] "GET /scrape.php?info_hash=8%e3Q%9a%9c%85%bc%e3%1d%14%213wNi3%28%e9%f0. HTTP/1.1" 404 459 "-" "Transmission/2.77"
180.153.236.129 - - [30/Jan/2015:07:11:10 +0100] "GET /c/lf-centennial-services-hong-kong-limited/hk041664/ HTTP/1.1" 404 545 "http://hk.kompass.com/c/lf-centennial-services-hong-kong-limited/hk041664/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0); 360Spider(compatible; HaosouSpider; http://www.haosou.com/help/help_3_2.html)"
36.106.38.13 - - [30/Jan/2015:07:18:28 +0100] "GET /op/icon?id=com.wsw.en.gm.ArrowDefense HTTP/1.1" 404 0 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
180.153.114.199 - - [30/Jan/2015:07:19:12 +0100] "GET /?metric=csync&p=3030&s=6109985239380459533 HTTP/1.1" 200 11783 "http://vox-static.liverail.com/swf/v4/admanager.swf" "Mozilla/4.0"
180.153.206.20 - - [30/Jan/2015:07:39:56 +0100] "GET /?metric=csync HTTP/1.1" 20
0 11783 "http://3294027.fls.doubleclick.net/activityi;src=3294027;type=krde;cat=
krde_060;ord=94170148950.07014?" "Mozilla/4.0"
220.181.108.139 - - [30/Jan/2015:07:47:08 +0100] "GET /asset/assetId/5994450/size/large/ts/1408882797/type/library/client/WD-KJLAK/5994450_large.jpg?token=4e703a62ec08791e2b91ec1731be0d13&category=pres&action=thumb HTTP/1.1" 404 550 "http://www.passion-nyc.com/" "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)"

And over HTTPS:

120.197.26.43 - - [29/Jan/2015:00:04:25 +0100] "GET /c/5356/cc.js?ns=_cc5356 HTTP/1.1" 404 11676 "http://m.accuweather.com/zh/cn/guangzhou/102255/current-weather/102255?p=huawei2" "HUAWEI Y325-T00_TD/V1 Linux/3.4.5 Android/2.3.6 Release/03.26.2013 Browser/AppleWebKit533.1 Mobile Safari/533.1;"
120.197.26.43 - - [29/Jan/2015:00:05:32 +0100] "GET /c/5356/cc.js?ns=_cc5356 HTTP/1.1" 404 11674 "http://m.accuweather.com/zh/cn/guangzhou/102255/weather-forecast/102255" "HUAWEI Y325-T00_TD/V1 Linux/3.4.5 Android/2.3.6 Release/03.26.2013 Browser/AppleWebKit533.1 Mobile Safari/533.1;"
117.136.1.106 - - [29/Jan/2015:06:35:05 +0100] "GET /v2.2/237613769760602?format=json&sdk=android&fields=supports_attribution%2Csupports_implicit_sdk_logging%2Cgdpv4_nux_content%2Cgdpv4_nux_enabled%2Candroid_dialog_configs HTTP/1.1" 404 6304 "-" "FBAndroidSDK.3.20.0"
117.136.1.106 - - [29/Jan/2015:06:35:11 +0100] "GET /v2.2/237613769760602?format=json&sdk=android&fields=supports_attribution%2Csupports_implicit_sdk_logging%2Cgdpv4_nux_content%2Cgdpv4_nux_enabled%2Candroid_dialog_configs HTTP/1.1" 404 6308 "-" "FBAndroidSDK.3.20.0"
117.136.1.107 - - [29/Jan/2015:06:35:36 +0100] "GET /v2.2/1493038024241481?fields=name,supports_implicit_sdk_logging,gdpv4_nux_enabled,gdpv4_nux_content,ios_dialog_configs,app_events_feature_bitmask&format=json&sdk=ios HTTP/1.1" 404 6731 "-" "FBiOSSDK.3.22.0"
180.175.55.177 - public [29/Jan/2015:07:40:23 +0100] "GET /public/checksum?a=8&t=p&v=2.5.0&c=d62c8d04f74057d51872d5dc5cdab098 HTTP/1.0" 404 39668 "https://rc-regkeytool.appspot.com/public/checksum?a=8&t=p&v=2.5.0&c=d62c8d04f74057d51872d5dc5cdab098" "Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1C28 Safari/419.3"

So, I was seeing requests that typically do applications like Facebook, smartphones with Android going to google store, and a lot of other applications and sites traffics, that was going diverted to my Ip.

So I saw that this attack was an attack associated to the ip address. I could change the ip in an extreme case and gain some hours to react.

So if those requests were coming from legitimate applications I saw two possibilities:

1. A zombie network controlled by pirates through malware/virus, that changes the /etc/hosts so those devices go to my ip address. Unlikely scenario, as there was a lot of mobile traffic and is not the easiest target of those attacks

2. A Dns Spoofing attack targeting our ip’s

That is an attack that cheats to Dns servers to make them believe that a name resolves to a different ip. (that’s why ssl certificates are so important, and the warnings that the browser raises if the server doesn’t send the right certificate, it allows you to detect that you’re connecting to a fake/evil server and avoid logging etc… if you have been directed a bad ip by a dns poisoned)

I added some traces to get more information, basically I dump the $_SERVER to a file:

$s_server_dump = var_export($_SERVER, true);

file_put_contents($s_debug_log_file, $s_server_dump.”\n”, FILE_APPEND | LOCK_EX);

array (
  'REDIRECT_SCRIPT_URL' => '/announce',
  'REDIRECT_SCRIPT_URI' => 'http://jonnyfavorite3.appspot.com/announce',
  'REDIRECT_STATUS' => '200',
  'SCRIPT_URL' => '/announce',
  'SCRIPT_URI' => 'http://jonnyfavorite3.appspot.com/announce',
  'HTTP_HOST' => 'jonnyfavorite3.appspot.com',
  'HTTP_USER_AGENT' => 'Bittorrent',
  'HTTP_ACCEPT' => '*/*',
  'HTTP_CONNECTION' => 'closed',
  'PATH' => '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin',
  'SERVER_SIGNATURE' => '<address>Apache/2.4.7 (Ubuntu) Server at jonnyfavorite3.appspot.com Port 80</address>',
  'SERVER_SOFTWARE' => 'Apache/2.4.7 (Ubuntu)',
  'SERVER_NAME' => 'jonnyfavorite3.appspot.com',
  'SERVER_ADDR' => 'deleted',
  'SERVER_PORT' => '80',
  'REMOTE_ADDR' => '60.219.115.77',
  'DOCUMENT_ROOT' => 'deleted',
  'REQUEST_SCHEME' => 'http',
  'CONTEXT_PREFIX' => '',
  'CONTEXT_DOCUMENT_ROOT' => 'deleted',
  'SERVER_ADMIN' => 'deleted',
  'SCRIPT_FILENAME' => 'deleted',
  'REMOTE_PORT' => '27361',
  'REDIRECT_QUERY_STRING' => 'info_hash=%26%27%0F%C2%EB%03J%E2%1F%B0%28%2B%29d%7C%8C%FE%C8l%E9&peer_id=%2DSD0100%2Dj%7E%C3%1C%14%FFsj%DA9%0B%27&ip=10.155.31.47&port=9978&uploaded=2244978664&downloaded=2244978664&left=2570039513&numwant=200&key=511&compact=1',
  'REDIRECT_URL' => '/announce',
  'GATEWAY_INTERFACE' => 'CGI/1.1',
  'SERVER_PROTOCOL' => 'HTTP/1.0',
  'REQUEST_METHOD' => 'GET',
  'QUERY_STRING' => 'info_hash=%26%27%0F%C2%EB%03J%E2%1F%B0%28%2B%29d%7C%8C%FE%C8l%E9&peer_id=%2DSD0100%2Dj%7E%C3%1C%14%FFsj%DA9%0B%27&ip=10.155.31.47&port=9978&uploaded=2244978664&downloaded=2244978664&left=2570039513&numwant=200&key=511&compact=1',
  'REQUEST_URI' => '/announce?info_hash=%26%27%0F%C2%EB%03J%E2%1F%B0%28%2B%29d%7C%8C%FE%C8l%E9&peer_id=%2DSD0100%2Dj%7E%C3%1C%14%FFsj%DA9%0B%27&ip=10.155.31.47&port=9978&uploaded=2244978664&downloaded=2244978664&left=2570039513&numwant=200&key=511&compact=1',
  'SCRIPT_NAME' => '/index.php',
  'PHP_SELF' => '/index.php',
  'REQUEST_TIME_FLOAT' => 1422528394.036,
  'REQUEST_TIME' => 1422528394,
)

Some fields were replaced with ‘deleted’ for security.

Ok, so the client was sending a HOST header

jonnyfavorite3.appspot.com

This address resolves as:

ping jonnyfavorite3.appspot.com
PING appspot.l.google.com (74.125.24.141) 56(84) bytes of data.
64 bytes from de-in-f141.1e100.net (74.125.24.141): icmp_seq=1 ttl=52 time=1.53 ms

So to google appspot.

Next one:

array (
  'REDIRECT_SCRIPT_URL' => '/v2.2/1493038024241481',
  'REDIRECT_SCRIPT_URI' => 'https://graph.facebook.com/v2.2/1493038024241481',
  'REDIRECT_HTTPS' => 'on',
  'REDIRECT_SSL_TLS_SNI' => 'graph.facebook.com',
  'REDIRECT_STATUS' => '200',
  'SCRIPT_URL' => '/v2.2/1493038024241481',
  'SCRIPT_URI' => 'https://graph.facebook.com/v2.2/1493038024241481',
  'HTTPS' => 'on',
  'SSL_TLS_SNI' => 'graph.facebook.com',
  'HTTP_HOST' => 'graph.facebook.com',
  'HTTP_ACCEPT_ENCODING' => 'gzip, deflate',
  'CONTENT_TYPE' => 'multipart/form-data; boundary=3i2ndDfv2rTHiSisAbouNdArYfORhtTPEefj3q2f',
  'HTTP_ACCEPT_LANGUAGE' => 'zh-cn',
  'HTTP_CONNECTION' => 'keep-alive',
  'HTTP_ACCEPT' => '*/*',
  'HTTP_USER_AGENT' => 'FBiOSSDK.3.22.0',
  'PATH' => '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin',
  'SERVER_SIGNATURE' => '<address>Apache/2.4.7 (Ubuntu) Server at graph.facebook.com Port 443</address>',
  'SERVER_SOFTWARE' => 'Apache/2.4.7 (Ubuntu)',
  'SERVER_NAME' => 'graph.facebook.com',
  'SERVER_ADDR' => 'deleted',
  'SERVER_PORT' => '443',
  'REMOTE_ADDR' => '223.100.159.96',
  'DOCUMENT_ROOT' => 'deletd',
  'REQUEST_SCHEME' => 'https',
  'CONTEXT_PREFIX' => '',
  'CONTEXT_DOCUMENT_ROOT' => 'deleted',
  'SERVER_ADMIN' => 'deleted',
  'SCRIPT_FILENAME' => 'deleted',
  'REMOTE_PORT' => '24797',
  'REDIRECT_QUERY_STRING' => 'fields=name,supports_implicit_sdk_logging,gdpv4_nux_enabled,gdpv4_nux_content,ios_dialog_configs,app_events_feature_bitmask&format=json&sdk=ios',
  'REDIRECT_URL' => '/v2.2/1493038024241481',
  'GATEWAY_INTERFACE' => 'CGI/1.1',
  'SERVER_PROTOCOL' => 'HTTP/1.1',
  'REQUEST_METHOD' => 'GET',
  'QUERY_STRING' => 'fields=name,supports_implicit_sdk_logging,gdpv4_nux_enabled,gdpv4_nux_content,ios_dialog_configs,app_events_feature_bitmask&format=json&sdk=ios',
  'REQUEST_URI' => '/v2.2/1493038024241481?fields=name,supports_implicit_sdk_logging,gdpv4_nux_enabled,gdpv4_nux_content,ios_dialog_configs,app_events_feature_bitmask&format=json&sdk=ios',
  'SCRIPT_NAME' => '/index.php',
  'PHP_SELF' => '/index.php',
  'REQUEST_TIME_FLOAT' => 1422528399.4159999,
  'REQUEST_TIME' => 1422528399,
)

So that was it, connections were arriving to the Server/Load Balancer requesting addresses from Google, Facebook, thepiratebay main tracker, etc… tens of thousands of users simultaneously.

Clearly there was a kind of DNS spoofing attack. And looking at the ACCEPT_LANGUAGE I saw that clients were using Chineses language (zh-cn).

Changing the ip address will take some time, as some dns have 1 hour or several hours of TTL to improve page speed of customers, and both ip’s have to coexist for a while until all the client’s dns have refreshed, browsers have refreshed their cache of dns names, and dns of partners consuming our APIs have refreshed.

You can have a virtual host just blank, but for some projects your virtual hosts needs to listen for all the request, for example, imagine that you serve contents like barcelona.yourdomain.com and for newyork.yourdomain.com and whatever-changing.yourdomain.com like carlesmateo.yourdomaincom, anotheruser.yourdomain.com, anyuserwilhaveownurl.yourdomain.com, that your code process to get the information and render the pages, so it is not always possible to have a default one.

First thing I had to do was, as the attack came from the ip, was to do not allow any host requests that did not belong to the hosts served. That way my framework will not process any request that was not for the exact domains that I expect, and I will save the 404 process time.

Also, if the servers return a white page with few bytes (header), this will harm less the bandwidth if under a DDoS. The bandwidth available is one of the weak points when suffering a DDoS.

To block hosts that are not expected there are several things that can be done, for example:

Nginx can easily block those requests not matching the hosts.
Modify my script to look at
```
HTTP_HOST
```
So:
if (isset($_SERVER['HTTP_HOST']) && ($_SERVER['HTTP_HOST'] != 'www.mydomain.com' && $_SERVER['HTTP_HOST'] != 'mydomain.com') { exit(); }
That way any request that is not for the exact domain will be halted quickly.
You can also check instead strpost($_SERVER['HTTP_HOST'], ‘mydomain.com’) for variable HTTP_HOST
Create an Apache default virtual host and default ssl. With a white page the load was so small that even tens of thousands ip’s cannot affect enough the server. (Max load was around 2.5%)

Optimization: If you want to save IOPS and network bandwidth to the storage you can create a ramdisk, and have the index.html of the default virtual host in there. This will increase server speed and reduce overhead.

If you’re in a hurry to stop the attack you can also:

Change the Ip, get a new ip address, and change the dns to the new ip, and hope this one is not attacked
An emergency solution, drastic but effective, would be to block entire China’s ip ranges.

Some of the webs I helped reported having attacks from outside China, but nothing related, the gross attacks, 99%, come from China.

Sadly Amazon’s Cloud Firewall does not allow to deny certain ip addresses, only allow services through a default deny policy.

I reopened the firewall in 80 and 443, so traffic to everyone, checked that the Service was Ok and continued my way to the office.

When later I spoke with some friends, one of them told me about the Chinese government firewall causing this sort of problems worldwide!. Look at this interesting article.

http://blog.sucuri.net/2015/01/ddos-from-china-facebook-wordpress-and-twitter-users-receiving-sucuri-error-pages.html

It is not clear if those DDoS attacks are caused by the Chinese Firewall, by a bug, or by pirates exploiting it to attack sites for money. But is clear that all those attacks and traffic comes from China ip addresses. Obviously this DNS issues are causing the guys browsing on China to not being able of browsing Facebook, WordPres, twitter, etc…

Here you can see live the origin of the world DDoS attacks:

http://www.digitalattackmap.com/#anim=1&color=0&country=ALL&list=0&time=16465&view=map

And China being the source of most of the DDoS attacks.

Stopping a BitTorrent DDoS attack

After all the success about the article stopping an XMLRPC to WordPress site attack and thanks messages (I actually helped a company that was being thrown down every day and asked me for help) it’s the moment to explain how to stop an attack much more heavily in evilness.

The first sign I saw was that the server was more and more slower, what is nearly impossible as I setup a very good server, and it has a lot of good development techniques to not having bottlenecks.

I looked at the server and I saw like 3,000 SYN_SENT packets. Apparently we were under a SYN Flood attack.

netstat revealed more than 6k different ip addresses connecting to the Server.

Server had only 30 GB of RAM so, and started to be full, with more and more connections, and so more Apache processes to respond to the real users fast it was clear that it was going to struggle.

I improved the configuration of the Apache so the Server would be able to handle much more connections with less memory consumption and overhead, added some enhancements for blocking SYN Flood attacks, and restarted the Apache Server.

I reduced greatly the scope of the attacks but I knew that it would only be being worst. I was buying time while not disrupting the functioning of the website.

The next hours the attacks increased to having around 7,500 concurrent connections simultaneously. The memory was reaching its limits, so I decided it was time to upgrade the instance. I doubled the memory and added much more cores, to 36, by using one of the newest Amazon c4.8xlarge.

The good thing about Cloud is that you pay for the time you use the resources. So when the waters calm down again, I’m able to reduce the size of the instance and save some hundreds to the company.

I knew it was a matter of time. The server was stabilized at using 40 GB out of the 60 GB but I knew the pirates will keep trying to shutdown the service.

Once the SYN Flood was stopped and I was sure that the service was safe for a while, I was checking the logs to see if I can detect a pattern among the attacks. I did.

Most request that we were receiving where to a file called announce.php that obviously does not exist in the server, and so it was returning 404 error.

The user agent reported in many cases BitTorrent, or Torrent compatible product, and the url sending a hash, uploaded, downloaded, left… so I realized that somehow my Server was targeted by a Torrent attack, where they indicated that the Server was a Torrent tracker.

As the .htacess in frameworks like Laravel, Catalonia Framework… and CMS like WordPress, Joomla, ezpublish… try to read the file from filesystem and if it doesn’t exist index.php is served, then as first action I created a file /announce.php that simple did an exit();

Sample .htaccess from Laravel:

<IfModule mod_rewrite.c>
    <IfModule mod_negotiation.c>
        Options -MultiViews
    </IfModule>

    RewriteEngine On

    # Redirect Trailing Slashes...
    RewriteRule ^(.*)/$ /$1 [L,R=301]

    # Handle Front Controller...
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteRule ^ index.php [L]
</IfModule>

Sample code for announce.php would be like:

<?php
/**
 * Creator: Carles Mateo
 * Date: 2015-01-21 Time: 09:39
 */

// A cheap way to stop an attack based on requesting this file
http_response_code(406);
exit();

The response_code 406 was an attemp to see if the BitTorrent clients were sensible to headers and stop. But they didn’t.

With with simple addition of announce.php , with exit(), I achieved reducing the load on the Server from 90% to 40% in just one second.

The reason why a not found page was causing so many damage was that as the 404 error page from the Server is personalized, and offers alternative results (assuming the product you was looking for is no longer available), and before displaying all the Framework is loaded and the routes are checked to see if the url fits and so has some process to be done in the PHP side (it takes 100 ms to reply, is not much, but it was not necessary to waste so much CPU), even being very optimized, every single not found url was causing certain process and CPU waste. Since the attack had more than 7,000 different ip’s simultaneously coming to the Server it would be somewhat a problem at certain point and start returning 500 errors to the customers.

The logs were also showing other patterns, for example:

announce?info_hash…

So without the PHP extension. Those kind of requests would not go through my wall file announce.php but though index.php (as .htaccess tells what is not found is directed there).

I could change the .htaccess to send those requests to hell, but I wanted a more definitive solution, something that would prevent the Server from wasting CPU and the Servers to being able to resist an attack x1000 times harder.

At the end the common pattern was that the BitTorrent clients were requesting via GET a parameter called info_hash, so I blocked through there all the request.

I wrote this small program, and added it to index.php

// Patch urgency Carles to stop an attack based on Torrent
// http://blog.carlesmateo.com
if (isset($_GET['info_hash'])) {

    // In case you use CDN, proxy, or load balancer
    $s_ip_proxy = '';

    $s_ip_address = $_SERVER['REMOTE_ADDR'];


    // Warning if you use a CDN, a proxy server or a load balancer do not add the ip to the blacklisted
    if ($s_ip_proxy == '' || ($s_ip_proxy != '' && $s_ip_address != $s_ip_proxy)) {
        $s_date = date('Y-m-d');

        $s_ip_log_file = '/tmp/ip-to-blacklist-'.$s_date.'.log';
        file_put_contents($s_ip_log_file, $s_ip_address."\n", FILE_APPEND | LOCK_EX);
    }


    // 406 means 'Not Acceptable'
    http_response_code(406);
    exit();
}

Please note, this code can be added to any Software like Zend Framework, Symfony, Catalonia Framework, Joomla, WordPress, Drupal, ezpublish, Magento… just add those lines at the beginning of the public/index.php just before the action of the Framework starts. Only be careful that after a core update, you’ll have to reapply it.

After that I deleted the no-longer-needed announced.php

What the program does is, if you don’t have defined a proxy/CDN ip, to write the ip connecting with the Torrent request pattern to a log file called for example:

/tmp/ip-to-blacklist-2015-01-23.log

And also exit(), so stopping the execution and saving many CPU cycles.

The idea of the final date is to blacklist the ip’s only for 24 hours as we later will see.

With this I achieved reducing the CPU consumption to around 5-15% of CPU.

Then, there is the other part of stopping the attack, that is a bash program, that can be run from command line or added to cron to be launched, depending on the volume of the attacks, every 5 minutes, or every hour.

ip_blacklist.sh

#!/bin/bash
# Ip blacklister by Carles Mateo
s_DATE=$(date +%Y-%m-%d)
s_FILE=/tmp/ip-to-blacklist-$s_DATE.log
s_FILE_UNIQUE=/tmp/ip-to-blacklist-$s_DATE-unique.log
cat $s_FILE | sort | uniq > $s_FILE_UNIQUE

echo "Counting the ip addresses to block in $s_FILE_UNIQUE"
cat $s_FILE_UNIQUE | wc -l

sleep 3
# We clear the iptables rules
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT

# To list the rules sudo iptables -L
# /sbin/iptables -L INPUT -v -n
# Enable ssh for all (you can add a Firewall at Cloud provider level or enstrict the rule to your ip)
sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT

for s_ip_address in `cat $s_FILE_UNIQUE`
do
    echo "Blocking traffic from $s_ip_address"
    sudo iptables -A INPUT -s $s_ip_address -p tcp --destination-port 80 -j DROP
    sudo iptables -A INPUT -s $s_ip_address -p tcp --destination-port 443 -j DROP
done

# Ensure Accept traffic on Port 80 (HTTP) and 443 (HTTPS)
sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 443 -j ACCEPT

# To block the rest
# sudo iptables -A INPUT -j DROP

# User iptables -save and iptables -restore to make this changes permanent
# sudo sh -c "iptables-save > /etc/iptables.rules"
# sudo pre-up iptables-restore < /etc/iptables.rules
# https://help.ubuntu.com/community/IptablesHowTo

This scripts gets the list of ip’s addresses, gets the list of unique ip’s into another file, and then makes a loop and adds all of them to the iptables, the Firewall from Linux, and blocks them for accessing the web at port 80 (http) or 443 (https, ssl). You can block all the ports also if you want for those ip’s.

With this CPU use went to 0%.

Note: One of my colleagues, a wonderful SysAdmin at Ackstorm ISP, points that some of you may prefer using REJECT instead of DROP. An interesting conversation on serverfault about this.

After fixing the problem I looked over the Internet to locate any people reporting attacks like what I suffered. The most interesting I found was this article: BotTorrent: Misusing BitTorrent to Launch DDoS Attacks, from University of California, Irvine. (local copy on this website BotTorrent)

Basically any site on the Internet can be attacked at a large scale, as every user downloading Torrent will try to connect to the innocent Server to inform of the progress of the down/upload. If this attack is performed with hundreds of files, the attack means hundreds of thousands of ip’s connecting to the Server… the server will run out of connections, or memory, or bandwidth will be full from the bad traffic.

I saw that the attackers were using porno files that were highly downloaded and apparently telling the Torrent network that our Server was a Torrent tracker, so corroborating my hypothesis all the people downloading Torrents were sending updates to our Server, believing that our Server was a tracker. A trick from the sad pirates.

Some people, business users, asked me who could be interested in injuring other’s servers or disrupting other’s businesses without any immediate gain (like controlling your Servers to send Spam).

I told:

Competitors that hate you because you’re successful and want to disrupt your business (they pay to the pirates for doing attacks. I’ve helped companies that were let down by those pirates)
Investors that may want to buy you at a cheaper price (after badly trolling you for a week or two)
False “security” companies that will offer their services “casually” when you most need them and charge a high bill
Pirates that want to extort you

So bad people that instead that using their talent to create, just destroy and act bad being evil to others.

In other cases could be bad luck to have been assigned an Ip that previously had a Torrent tracker, it has not much sense for the Cloud as it is expensive, but it has that a Server with that ip was hacked and used as tracked for a while.

Also governments could be so wanting to disrupt services (like torrent) by clumsy redirecting dns to random ip’s, or entertainment companies trying to shutdown Torrent trackers could try to poison dns to stop users from using Bittorrent.

See the definitive solution in the next article.

Performance of several languages

Notes on 2017-03-26 18:57 CEST – Unix time: 1490547518 :

As several of you have noted, it would be much better to use a random value, for example, read by disk. This will be an improvement done in the next benchmark. Good suggestion thanks.
Due to my lack of time it took more than expected updating the article. I was in a long process with google, and now I’m looking for a new job.
I note that most of people doesn’t read the article and comment about things that are well indicated on it. Please before posting, read, otherwise don’t be surprise if the comment is not published. I’ve to keep the blog clean of trash.
I’ve left out few comments cause there were disrespectful. Mediocrity is present in the society, so simply avoid publishing comments that lack the basis of respect and good education. If a comment brings a point, under the point of view of Engineering, it is always published.

Thanks.

(This article was last updated on 2015-08-26 15:45 CEST – Unix time: 1440596711. See changelog at bottom)

One may think that Assembler is always the fastest, but is that true?.

If I write a code in Assembler in 32 bit instead of 64 bit, so it can run in 32 and 64 bit, will it be faster than the code that a dynamic compiler is optimizing in execution time to benefit from the architecture of my computer?.

What if a future JIT compiler is able to use all the cores to execute a single thread developed program?.

Are PHP, Python, or Ruby fast comparing to C++?. Does Facebook Hip Hop Virtual machine really speeds PHP execution?.

This article shows some results and shares my conclusions. It is as a base to discuss with my colleagues. Is not an end, we are always doing tests, looking for the edge, and looking at the root of the things in detail. And often things change from one version to the other. This article shows not an absolute truth, but brings some light into interesting aspects.

It could show the performance for the certain case used in the test, although generic core instructions have been selected. Many more tests are necessary, and some functions differ in the performance. But this article is a necessary starting for the discussion with my IT-extreme-lover friends and a necessary step for the next upcoming tests.

It brings very important data for Managers and Decision Makers, as choosing the adequate performance language can save millions in hardware (specially when you use the Cloud and pay per hour of use) or thousand hours in Map Reduce processes.

Acknowledgements and thanks

Credit for the great Eduard Heredia, for porting my C source code to:

Go
Ruby
Node.js

And for the nice discussions of the results, an on the optimizations and dynamic vs static compilers.

Thanks to Juan Carlos Moreno, CTO of ECManaged Cloud Software for suggesting adding Python and Ruby to the languages tested when we discussed my initial results.

Thanks to Joel Molins for the interesting discussions on Java performance and garbage collection.

Thanks to Cliff Click for his wonderful article on Java vs C performance that I found when I wanted to confirm some of my results and findings.

I was inspired to do my own comparisons by the benchmarks comparing different framework by techempower. It is amazing to see the results of the tests, like how C++ can serialize JSon 1,057,793 times per second and raw PHP only 180,147 (17%).

For the impatients

I present the results of the tests, and the conclusions, for those that doesn’t want to read about the details. For those that want to examine the code, and the versions of every compiler, and more in deep conclusions, this information is provided below.

Results

This image shows the results of the tests with every language and compiler.

All the tests are invoked from command line. All the tests use only one core. No tests for the web or frameworks have been made, are another scenarios worth an own article.

More seconds means a worst result. The worst is Bash, that I deleted from the graphics, as the bar was crazily high comparing to others.

* As later is discussed my initial Assembler code was outperformed by C binary because the final Assembler code that the compiler generated was better than mine.

After knowing why (later in this article is explained in detail) I could have reduced it to the same time than the C version as I understood the improvements made by the compiler.

Table of times:

Seconds executing	Language	Compiler used	Version
6 s.	Java	Oracle Java	Java JDK 8
6 s.	Java	Oracle Java	Java JDK 7
6 s.	Java	Open JDK	OpenJDK 7
6 s.	Java	Open JDK	OpenJDK 6
7 s.	Go	Go	Go v.1.3.1 linux/amd64
7 s.	Go	Go	Go v.1.3.3 linux/amd64
8 s.	Lua	LuaJit	Luajit 2.0.2
10 s.	C++	g++	g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
10 s.	C	gcc	gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
10 s. (* first version was 13 s. and then was optimized)	Assembler	nasm	NASM version 2.10.09 compiled on Dec 29 2013
10 s.	Nodejs	nodejs	Nodejs v0.12.4
14 s.	Nodejs	nodejs	Nodejs v0.10.25
18 s.	Go	Go	go version xgcc (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157] linux/amd64
20 s.	Phantomjs	Phantomjs	phantomjs 1.9.0
21 s.	Phantomjs	Phantomjs	phantomjs 2.0.1-development
38 s.	PHP	Facebook HHVM	HipHop VM 3.4.0-dev (rel)
44 s.	Python	Pypy	Pypy 2.2.1 (Python 2.7.3 (2.2.1+dfsg-1, Nov 28 2013, 05:13:10))
52 s.	PHP	Facebook HHVM	HipHop VM 3.9.0-dev (rel)
52 s.	PHP	Facebook HHVM	HipHop VM 3.7.3 (rel)
128 s.	PHP	PHP	PHP 7.0.0alpha2 (cli) (built: Jul 3 2015 15:30:23)
278 s.	Lua	Lua	Lua 2.5.3
294 s.	Gambas3	Gambas3	3.7.0
316 s.	PHP	PHP	PHP 5.5.9-1ubuntu4.3 (cli) (built: Jul 7 2014 16:36:58)
317 s.	PHP	PHP	PHP 5.6.10 (cli) (built: Jul 3 2015 16:13:11)
323 s.	PHP	PHP	PHP 5.4.42 (cli) (built: Jul 3 2015 16:24:16)
436 s.	Perl	Perl	Perl 5.18.2
523 s.	Ruby	Ruby	ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]
694 s.	Python	Python	Python 2.7.6
807 s.	Python	Python	Python 3.4.0
47630 s.	Bash		GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)

Conclusions and Lessons Learnt

There are languages that will execute faster than a native Assembler program, thanks to the JIT Compiler and to the ability to optimize the program at runtime for the architecture of the computer running the program (even if there is a small initial penalty of around two seconds from JIT when running the program, as it is being analysed, is it more than worth in our example)
Modern Java can be really fast in certain operations, it is the fastest in this test, thanks to the use of JIT Compiler technology and a very good implementation in it
Oracle’s Java and OpenJDK shows no difference in performance in this test
Script languages really sucks in performance. Python, Perl and Ruby are terribly slow. That costs a lot of money if you Scale as you need more Server in the Cloud
JIT compilers for Python: Pypy, and for Lua: LuaJit, make them really fly. The difference is truly amazing
The same language can offer a very different performance using one version or another, for example the go that comes from Ubuntu packets and the last version from official page that is faster, or Python 3.4 is much slower than Python 2.7 in this test
Bash is the worst language for doing the loop and inc operations in the test, lasting for more than 13 hours for the test
From command line PHP is much faster than Python, Perl and Ruby
Facebook Hip Hop Virtual Machine (HHVM) improves a lot PHP’s speed
It looks like the future of compilers is JIT.
Assembler is not always the fastest when executed. If you write a generic Assembler program with the purpose of being able to run in many platforms you’ll not use the most powerful instructions specific of an architecture, and so a JIT compiler can outperform your code. An static compiler can also outperform your code with very clever optimizations. People that write the compilers are really good. Unless you’re really brilliant with Assembler probably a C/C++ code beats the performance of your code. Even if you’re fantastic with Assembler it could happen that a JIT compiler notices that some executions can be avoided (like code not really used) and bring magnificent runtime optimizations. (for example a near JMP is much more less costly than a far JMP Assembler instruction. Avoiding dead code could result in a far JMP being executed as near JMP, saving many cycles per loop)
Optimizations really needs people dedicated to just optimizations and checking the speed of the newly added code for the running platforms
Node.js was a big surprise. It really performed well. It is promising. New version performs even faster
go is promising. Similar to C, but performance is much better thanks to deciding at runtime if the architecture of the computer is 32 or 64 bit, a very quick compilation at launch time, and it compiling to very good assembler (that uses the 64 bit instructions efficiently, for example)
Gambas 3 performed surprisingly fast. Better than PHP
You should be careful when using C/C++ optimization -O3 (and -O2) as sometimes it doesn’t work well (bugs) or as you may expect, for example by completely removing blocks of code if the compiler believes that has no utility (like loops)
Perl performance really change from using a for style or another. (See Perl section below)
Modern CPUs change the frequency to save energy. To run the tests is strictly recommended to use a dedicated machine, disabling the CPU governor and setting a frequency for all the cores, booting with a text only live system, without background services, not mounting disks, no swap, no network

(Please, before commenting read completely the article )

Explanations in details

Obviously an statically compiled language binary should be faster than an interpreted language.

C or C++ are much faster than PHP. And good code machine is much faster of course.

But there are also other languages that are not compiled as binary and have really fast execution.

For example, good Web Java Application Servers generate compiled code after the first request. Then it really flies.

For web C# or .NET in general, does the same, the IIS Application Server creates a native DLL after the first call to the script. And after this, as is compiled, the page is really fast.

With C statically linked you could generate binary code for a particular processor, but then it won’t work in other processors, so normally we write code that will work in all the processors at the cost of not using all the performance of the different CPUs or use another approach and we provide a set of different binaries for the different architectures. A set of directives doing one thing or other depending on the platform detected can also be done, but is hard, long and tedious job with a lot of special cases treatment. There is another approach that is dynamic linking, where certain things will be decided at run time and optimized for the computer that is running the program by the JIT (Just-in-time) Compiler.

Java, with JIT is able to offer optimizations for the CPU that is running the code with awesome results. And it is able to optimize loops and mathematics operations and outperform C/C++ and Assembler code in some cases (like in our tests) or to be really near in others. It sounds crazy but nowadays the JIT is able to know the result of several times executed blocks of code and to optimize that with several strategies, speeding the things incredible and to outperform a code written in Assembler. Demonstrations with code is provided later.

A new generation has grown knowing only how to program for the Web. Many of them never saw Assembler, neither or barely programmed in C++.

None of my Senior friends would assert that a technology is better than another without doing many investigations before. We are serious. There is so much to take in count, so much to learn always, that one has to be sure that is not missing things before affirming such things categorically. If you want to be taken seriously, you have to take many things in count.

Environment for the tests

Hardware and OS

Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz with 32 GB RAM and SSD Disk.

Ubuntu Desktop 14.04 LTS 64 bit

Software base and compilers

PHP versions

Shipped with my Ubuntu distribution:

php -v

PHP 5.5.9-1ubuntu4.3 (cli) (built: Jul  7 2014 16:36:58)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies

Compiled from sources:

PHP 5.6.10 (cli) (built: Jul  3 2015 16:13:11)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies

PHP 5.4.42 (cli) (built: Jul  3 2015 16:24:16)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2014 Zend Technologies

Java 8 version

java -showversion

java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

C++ version

g++ -v

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.2-19ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)

Gambas 3

gbr3 --version

3.7.0

Go (downloaded from google)

go version

go version go1.3.1 linux/amd64

Go (Ubuntu packages)

go version

go version xgcc (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157] linux/amd64

Nasm

nasm -v

NASM version 2.10.09 compiled on Dec 29 2013

Lua

lua -v

Lua 5.2.3  Copyright (C) 1994-2013 Lua.org, PUC-Rio

Luajit

luajit -v

LuaJIT 2.0.2 -- Copyright (C) 2005-2013 Mike Pall. http://luajit.org/

Nodejs

Installed with apt-get install nodejs:

nodejs --version

v0.10.25

Installed by compiling the sources:

node --version

v0.12.4

Phantomjs

Installed with apt-get install phantomjs:

phantomjs --version

1.9.0

Compiled from sources:

/path/phantomjs --version

2.0.1-development

Python 2.7

python --version

Python 2.7.6

Python 3

python3 --version

Python 3.4.0

Perl

perl -version

This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
(with 41 registered patches, see perl -V for more detail)

Bash

bash --version

GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Test: Time required for nested loops

This is the first sample. It is an easy-one.

The main idea is to generate a set of nested loops, with a simple counter inside.

When the counter reaches 51 it is set to 0.

This is done for:

Preventing overflow of the integer if growing without control
Preventing the compiler from optimizing the code (clever compilers like Java or gcc with -O3 flag for optimization, if it sees that the var is never used, it will see that the whole block is unnecessary and simply never execute it)

Doing only loops, the increment of a variable and an if, provides us with basic structures of the language that are easily transformed to Assembler. We want to avoid System calls also.

This is the base for the metrics on my Cloud Analysis of Performance cmips.net project.

Here I present the times for each language, later I analyze the details and the code.

Take in count that this code only executes in one thread / core.

C++

C++ result, it takes 10 seconds.

Code for the C++:

/*
* File:   main.cpp
* Author: Carles Mateo
*
* Created on August 27, 2014, 1:53 PM
*/

#include <cstdlib>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <ctime>

using namespace std;

typedef unsigned long long timestamp_t;

static timestamp_t get_timestamp()
{
    struct timeval now;
    gettimeofday (&now, NULL);
    return  now.tv_usec + (timestamp_t)now.tv_sec * 1000000;
}

int main(int argc, char** argv) {

    timestamp_t t0 = get_timestamp();

    // current date/time based on current system
    time_t now = time(0);

    // convert now to string form
    char* dt_now = ctime(&now);

    printf("Starting at %s\n", dt_now);

    int i_loop1 = 0;
    int i_loop2 = 0;
    int i_loop3 = 0;

    

    for (i_loop1 = 0; i_loop1 < 10; i_loop1++) {
        for (i_loop2 = 0; i_loop2 < 32000; i_loop2++) {
            for (i_loop3 = 0; i_loop3 < 32000; i_loop3++) {
                i_counter++;

                if (i_counter > 50) {
                    i_counter = 0;
                }
            }
            // If you want to test how the compiler optimizes that, remove the comment
            //i_counter = 0;
         }
     }

    // This is another trick to avoid compiler's optimization. To use the var somewhere
    printf("Counter: %i\n", i_counter);

    timestamp_t t1 = get_timestamp();
    double secs = (t1 - t0) / 1000000.0L;
    time_t now_end = time(0);

    // convert now to string form
    char* dt_now_end = ctime(&now_end);

    printf("End time: %s\n", dt_now_end);

    return 0;
}

You can try to remove the part of code that makes the checks:

                /* if (i_counter > 50) {
                    i_counter = 0;
                }*/

And the use of the var, later:

    //printf("Counter: %i\n", i_counter);

Note: And adding a i_counter = 0; at the beginning of the loop to make sure that the counter doesn’t overflows. Then the C or C++ compiler will notice that this result is never used and so it will eliminate the code from the program, having as result and execution time of 0.0 seconds.

Java

The code in Java:

package cpu;

/**
 *
 * @author carles.mateo
 */
public class Cpu {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        
        int i_loop1 = 0;
        //int i_loop_main = 0;
        int i_loop2 = 0;
        int i_loop3 = 0;
        int i_counter = 0;
        
        String s_version = System.getProperty("java.version");
        
        System.out.println("Java Version: " + s_version);

        System.out.println("Starting cpu.java...");
        
        for (i_loop1 = 0; i_loop1 < 10; i_loop1++) {            
                for (i_loop2 = 0; i_loop2 < 32000; i_loop2++) {
                    for (i_loop3 = 0; i_loop3 < 32000; i_loop3++) {
                        i_counter++;
                        
                        if (i_counter > 50) { 
                            i_counter = 0;
                        }
                    }
                }
        }
        
        System.out.println(i_counter);
        System.out.println("End");
    }
    
}

It is really interesting how Java, with JIT outperforms C++ and Assembler.

It takes only 6 seconds.

Go

The case of Go is interesting because I saw a big difference from the go shipped with Ubuntu, and the the go I downloaded from http://golang.org/dl/. I downloaded 1.3.1 and 1.3.3 offering the same performance. 7 seconds.

Source code for nested_loops.go

package main

import ("fmt"
        "time")

func main() {
   fmt.Printf("Starting: %s", time.Now().Local())
   var i_counter = 0;
   for i_loop1 := 0; i_loop1 < 10; i_loop1++ {
       for i_loop2 := 0; i_loop2 < 32000; i_loop2++ {
           for i_loop3 := 0; i_loop3 < 32000; i_loop3++ {
               i_counter++;
               if i_counter > 50 {
                   i_counter = 0;
               }
           }
       }
    }

   fmt.Printf("\nCounter: %#v", i_counter)
   fmt.Printf("\nEnd: %s\n", time.Now().Local())
}

Assembler

Here is the Assembler for Linux code, with SASM, that I created initially (bellow is optimized).

%include "io.inc"

section .text
global CMAIN
CMAIN:
    ;mov rbp, rsp; for correct debugging
    ; Set to 0, the faster way
    xor     esi, esi

DO_LOOP1:
    mov ecx, 10
LOOP1:
    mov ebx, ecx
    jmp DO_LOOP2
LOOP1_CONTINUE:
    mov ecx, ebx
    
    loop LOOP1
    jmp QUIT

DO_LOOP2:
    mov ecx, 32000
LOOP2:
    mov eax, ecx
    ;call DO_LOOP3
    jmp DO_LOOP3
LOOP2_CONTINUE:
    mov ecx, eax
        
    loop LOOP2
    jmp LOOP1_CONTINUE

DO_LOOP3:
    ; Set to 32000 loops    
    MOV ecx, 32000 
LOOP3:
    inc     esi
    cmp     esi, 50
    jg      COUNTER_TO_0
LOOP3_CONTINUE:

    loop LOOP3
    ;ret
    jmp LOOP2_CONTINUE
    
COUNTER_TO_0:
    ; Set to 0
    xor     esi, esi
    
    jmp LOOP3_CONTINUE
    
;    jmp QUIT

QUIT:
    xor eax, eax
    ret

It took 13 seconds to complete.

One interesting explanation on why binary C or C++ code is faster than Assembler, is because the C compiler generates better Assembler/binary code at the end. For example, the use of JMP is expensive in terms of CPU cycles and the compiler can apply other optimizations and tricks that I’m not aware of, like using faster registers, while in my code I use ebx, ecx, esi, etc… (for example, imagine that using cx is cheaper than using ecx or rcx and I’m not aware but the guys that created the Gnu C compiler are)

To be sure of what’s going on I switched in the LOOP3 the JE and the JMP of the code, for groups of 50 instructions, INC ESI, one after the other and the time was reduced to 1 second.

(In C also was reduced even a bit more when doing the same)

To know what’s the translation of the C code into Assembler when compiled, you can do:

objdump --disassemble nested_loops

Look for the section main and you’ll get something like:

0000000000400470 <main>:
400470:    bf 0a 00 00 00           mov    $0xa,%edi
400475:    31 c9                    xor    %ecx,%ecx
400477:    be 00 7d 00 00           mov    $0x7d00,%esi
40047c:    0f 1f 40 00              nopl   0x0(%rax)
400480:    b8 00 7d 00 00           mov    $0x7d00,%eax
400485:    0f 1f 00                 nopl   (%rax)
400488:    83 c2 01                 add    $0x1,%edx
40048b:    83 fa 33                 cmp    $0x33,%edx
40048e:    0f 4d d1                 cmovge %ecx,%edx
400491:    83 e8 01                 sub    $0x1,%eax
400494:    75 f2                    jne    400488 <main+0x18>
400496:    83 ee 01                 sub    $0x1,%esi
400499:    75 e5                    jne    400480 <main+0x10>
40049b:    83 ef 01                 sub    $0x1,%edi
40049e:    75 d7                    jne    400477 <main+0x7>
4004a0:    48 83 ec 08              sub    $0x8,%rsp
4004a4:    be 34 06 40 00           mov    $0x400634,%esi
4004a9:    bf 01 00 00 00           mov    $0x1,%edi
4004ae:    31 c0                    xor    %eax,%eax
4004b0:    e8 ab ff ff ff           callq  400460 <__printf_chk@plt>
4004b5:    31 c0                    xor    %eax,%eax
4004b7:    48 83 c4 08              add    $0x8,%rsp
4004bb:    c3                       retq

Note: this is in the AT&T syntax and not in the Intel. That means that add $0x1,%edx is adding 1 to EDX registerg (origin, destination).

As you can see the C compiler has created a very differed Assembler version respect what I created.
For example at 400470 it uses EDI register to store 10, so to control the number of the outer loop.
It uses ESI to store 32000 (Hexadecimal 0x7D00), so the second loop.
And EAX for the inner loop, at 400480.
It uses EDX for the counter, and compares to 50 (Hexa 0x33) at 40048B.
In 40048E it uses the CMOVGE (Mov if Greater or Equal), that is an instruction that was introduced with the P6 family processors, to move the contents of ECX to EDX if it was (in the CMP) greater or equal to 50. As in 400475 a XOR ECX, ECX was performed, EXC contained 0.
And it cleverly used SUB and JNE (JNE means Jump if not equal and it jumps if ZF = 0, it is equivalent to JNZ Jump if not Zero).
It uses between 4 and 16 clocks, and the jump must be -128 to +127 bytes of the next instruction. As you see Jump is very costly.

Looks like the biggest improvement comes from the use of CMOVGE, so it saves two jumps that my original Assembler code was performing.
Those two jumps multiplied per 32000 x 32000 x 10 times, are a lot of Cpu clocks.

So, with this in mind, as this Assembler code takes 10 seconds, I updated the graph from 13 seconds to 10 seconds.

Lua

This is the initial code:

local i_counter = 0

local i_time_start = os.clock()

for i_loop1=0,9 do
    for i_loop2=0,31999 do
        for i_loop3=0,31999 do
            i_counter = i_counter + 1
            if i_counter > 50 then
                i_counter = 0
            end
        end
    end
end

local i_time_end = os.clock()
print(string.format("Counter: %i\n", i_counter))
print(string.format("Total seconds: %.2f\n", i_time_end - i_time_start))

In the case of Lua theoretically one could take great advantage of the use of local inside a loop, so I tried the benchmark with modifications to the loop:

for i_loop1=0,9 do
    for i_loop2=0,31999 do
        local l_i_counter = i_counter
        for i_loop3=0,31999 do
             l_i_counter = l_i_counter + 1
             if l_i_counter > 50 then
                 l_i_counter = 0
             end
        end
        i_counter = l_i_counter
    end
end

I ran it with LuaJit and saw no improvements on the performance.

Node.js

var s_date_time = new Date();
console.log('Starting: ' + s_date_time);

var i_counter = 0;

for (var $i_loop1 = 0; $i_loop1 < 10; $i_loop1++) {
   for (var $i_loop2 = 0; $i_loop2 < 32000; $i_loop2++) {
       for (var $i_loop3 = 0; $i_loop3 < 32000; $i_loop3++) {
           i_counter++;
           if (i_counter > 50) {
               i_counter = 0;
           }
       }
   } 
}

var s_date_time_end = new Date();

console.log('Counter: ' + i_counter + '\n');

console.log('End: ' + s_date_time_end + '\n');

Execute with:

nodejs nested_loops.js

Phantomjs

The same code as nodejs adding to the end:

phantom.exit(0);

In the case of Phantom it performs the same in both versions 1.9.0 and 2.0.1-development compiled from sources.

PHP

The interesting thing on PHP is that you can write your own extensions in C, so you can have the easy of use of PHP and create functions that really brings fast performance in C, and invoke them from PHP.

<?php

$s_date_time = date('Y-m-d H:i:s');

echo 'Starting: '.$s_date_time."\n";

$i_counter = 0;

for ($i_loop1 = 0; $i_loop1 < 10; $i_loop1++) {
   for ($i_loop2 = 0; $i_loop2 < 32000; $i_loop2++) {
       for ($i_loop3 = 0; $i_loop3 < 32000; $i_loop3++) {
           $i_counter++;
           if ($i_counter > 50) {
               $i_counter = 0;
           }
       }
   } 
}

$s_date_time_end = date('Y-m-d H:i:s');

echo 'End: '.$s_date_time_end."\n";

Facebook’s Hip Hop Virtual Machine is a very powerful alternative, that is JIT powered.

Downloading the code and compiling it is just easy, just:

git clone https://github.com/facebook/hhvm.git
cd hhvm
rm -r third-party
git submodule update --init --recursive
./configure
make

Or just grab precompiled packages from https://github.com/facebook/hhvm/wiki/Prebuilt%20Packages%20for%20HHVM

Python

from datetime import datetime
import time

print ("Starting at: " + str(datetime.now()))
s_unixtime_start = str(time.time())

i_counter = 0

# From 0 to 31999
for i_loop1 in range(0, 10):
    for i_loop2 in range(0,32000):
         for i_loop3 in range(0,32000):
             i_counter += 1
             if ( i_counter > 50 ) :
                 i_counter = 0

print ("Ending at: " + str(datetime.now()))
s_unixtime_end = str(time.time())

i_seconds = long(s_unixtime_end) - long(s_unixtime_start)
s_seconds = str(i_seconds)

print ("Total seconds:" + s_seconds)

Ruby

#!/usr/bin/ruby -w

time1 = Time.new

puts "Starting : " + time1.inspect

i_counter = 0;

for i_loop1 in 0..9
    for i_loop2 in 0..31999
        for i_loop3 in 0..31999
            i_counter = i_counter + 1
            if i_counter > 50
                i_counter = 0
            end
        end
    end
end

time1 = Time.new

puts "End : " + time1.inspect

Perl

The case of Perl was very interesting one.

This is the current code:

#!/usr/bin/env perl

print "$s_datetime Starting calculations...\n";
$i_counter=0;

$i_unixtime_start=time();

for my $i_loop1 (0 .. 9) {
    for my $i_loop2 (0 .. 31999) {
        for my $i_loop3 (0 .. 31999) {
            $i_counter++;
            if ($i_counter > 50) {
                $i_counter = 0;
            }
        }
    }
}

$i_unixtime_end=time();

$i_seconds=$i_unixtime_end-$i_unixtime_start;

print "Counter: $i_counter\n";
print "Total seconds: $i_seconds";

But before I created one, slightly different, with the for loops like in the C style:

#!/usr/bin/env perl

$i_counter=0;

$i_unixtime_start=time();

for (my $i_loop1=0; $i_loop1 < 10; $i_loop1++) {
    for (my $i_loop2=0; $i_loop2 < 32000; $i_loop2++) {
        for (my $i_loop3=0; $i_loop3 < 32000; $i_loop3++) {
            $i_counter++;
            if ($i_counter > 50) {
                $i_counter = 0;
            }
        }
    }
}

$i_unixtime_end=time();

$i_seconds=$i_unixtime_end-$i_unixtime_start;

print "Total seconds: $i_seconds";

I repeated this test, with the same version of Perl, due to the comment of a reader (thanks mpapec) that told:

“In this particular case perl style loops are about 45% faster than original code (v5.20)”

And effectively and surprisingly the time passed from 796 seconds to 436 seconds.

So graphics are updated to reflect the result of 436 seconds.

Bash

#!/bin/bash
echo "Bash version ${BASH_VERSION}..."
date

let "s_time_start=$(date +%s)"
let "i_counter=0"

for i_loop1 in {0..9}
do
     echo "."
     date
     for i_loop2 in {0..31999}
     do
         for i_loop3 in {0..31999}
         do
             ((i_counter++))
             if [[ $i_counter > 50 ]]
             then
                 let "i_counter=0"
             fi
         done
#((var+=1))
#((var=var+1))
#((var++))
#let "var=var+1"
#let "var+=1"
#let "var++"
     done
done

let "s_time_end=$(date +%2)"

let "s_seconds = s_time_end - s_time_start"
echo "Total seconds: $s_seconds"

# Just in case it overflows
date

Gambas 3

Gambas is a language and an IDE to create GUI applications for Linux.
It is very similar to Visual Basic, but better, and it is not a clone.

I created a command line application and it performed better than PHP. There has been done an excellent job with the compiler.

Note: in the screenshot the first test ran for few seconds more than in the second. This was because I deliberately put the machine under some load and I/O during the tests. The valid value for the test, confirmed with more iterations is the second one, done under the same conditions (no load) than the previous tests.

' Gambas module file MMain.module

Public Sub Main()

    ' @author Carles Mateo http://blog.carlesmateo.com
    
    Dim i_loop1 As Integer
    Dim i_loop2 As Integer
    Dim i_loop3 As Integer
    Dim i_counter As Integer
    Dim s_version As String
    
    i_loop1 = 0
    i_loop2 = 0
    i_loop3 = 0
    i_counter = 0
    
    s_version = System.Version
    
    Print "Performance Test by Carles Mateo blog.carlesmateo.com"    
    Print "Gambas Version: " & s_version

    Print "Starting..." & Now()
    
    For i_loop1 = 0 To 9
        For i_loop2 = 0 To 31999
            For i_loop3 = 0 To 31999
                i_counter = i_counter + 1
                
                If (i_counter > 50) Then
                    i_counter = 0
                Endif
            Next
        Next
    Next
    
    Print i_counter
    Print "End " & Now()

End

Changelog

2015-08-26 15:45

Thanks to the comment of a reader, thanks Daniel, pointing a mistake. The phrase I mentioned was on conclusions, point 14, and was inaccurate. The original phrase told “go is promising. Similar to C, but performance is much better thanks to the use of JIT“. The allusion to JIT is incorrect and has been replaced by this: “thanks to deciding at runtime if the architecture of the computer is 32 or 64 bit, a very quick compilation at launch time, and it compiling to very good assembler (that uses the 64 bit instructions efficiently, for example)”

2015-07-17 17:46

Benchmarked Facebook HHVM 3.9 (dev., the release date is August 3 2015) and HHVM 3.7.3, they take 52 seconds.

Re-benchmarked Facebook HHVM 3.4, before it was 72 seconds, it takes now 38 seconds. I checked the screen captures from 2014 to discard an human error. Looks like a turbo frequency issue on the tests computer, with the CPU governor making it work bellow the optimal speed or a CPU-hungry/IO process that triggered during the tests and I didn’t detect it. Thinking about forcing a fixed CPU speed for all the cores for the tests, like 2.4 Ghz and booting a live only text system without disk access and network to prevent Ubuntu launching processes in the background.

2015-07-05 13:16

Added performance of Phantomjs 1.9.0 installed via apt-get install phantomjs in Ubuntu, and Phantomjs 2.0.1-development.

Added performance of nodejs 0.12.04 (compiled).

Added bash to the graphic. It has so bad performance that I had to edit the graphic to fit in (color pink) in order prevent breaking the scale.

2015-07-03 18:32

Added benchmarks for PHP 7 alpha 2, PHP 5.6.10 and PHP 5.4.42.

2015-07-03 15:13
Thanks to the contribution of a reader (thanks mpapec!) I tried with Perl for style, resulting in passing from 796 seconds to 436 seconds.
(I used the same Perl version: Perl 5.18.2)
Updated test value for Perl.
Added new graphics showing the updated value.

Thanks to the contribution of a reader (thanks junk0xc0de!) added some additional warnings and explanations about the dangers of using -O3 (and -O2) if C/C++.

Updated the Lua code, to print i_counter and do the if i_counter > 50
This makes it take a bit longer, few cents, but passing from 7.8 to 8.2 seconds.
Updated graphics.

Stopping and investigating a WordPress xmlrpc.php attack

One of my Servers got heavily attacked for several days. I describe here the steps I took to stop this.

The attack consisted in several connections per second to the Server, to path /xmlrpc.php.

This is a WordPress file to control the pingback, when someone links to you.

My Server it is a small Amazon instance, a m1.small with only one core and 1,6 GB RAM, magnetic disks and that scores a discrete 203 CMIPS (my slow laptop scores 460 CMIPS).

Those massive connections caused the server to use more and more RAM, and while the xmlrpc requests were taking many seconds to reply, so more and more processes of Apache were spawned. That lead to more memory consumption, and to use all the available RAM and start using swap, with a heavy performance impact until all the memory was exhausted and the mysql processes stopped.

I saw that I was suffering an attack after the shutdown of MySql. I checked the CloudWatch Statistics from Amazon AWS and it was clear that I was receiving many -out of normal- requests. The I/O was really high too.

This statistics are from today to three days ago, look at the spikes when the attack was hitting hard and how relaxed the Server is now (plain line).

First I decided to simply rename the xmlrpc.php file as a quick solution to stop the attack but the number of http connections kept growing and then I saw very suspicious queries to the database.

Those queries, in addition to what I’ve seen in the Apache’s error log suggested me that may be the Server was hacked by a WordPress/plugin bug and that now they were trying to hide from the database’s logs. (Specially the DELETE FROM wp_useronline WHERE user_ip = the Ip of the attacker)

[Tue Aug 26 11:47:08 2014] [error] [client 94.102.49.179] Error in WordPress Database Lost connection to MySQL server during query a la consulta SELECT option_value FROM wp_options WHERE option_name = 'uninstall_plugins' LIMIT 1 feta per include('wp-load.php'), require_once('wp-config.php'), require_once('wp-settings.php'), include_once('/plugins/captcha/captcha.php'), register_uninstall_hook, get_option
[Tue Aug 26 11:47:09 2014] [error] [client 94.102.49.179] Error in WordPress Database Lost connection to MySQL server during query a la consulta SELECT option_value FROM wp_options WHERE option_name = 'uninstall_plugins' LIMIT 1 feta per include('wp-load.php'), require_once('wp-config.php'), require_once('wp-settings.php'), include_once('/plugins/captcha/captcha.php'), register_uninstall_hook, get_option
[Tue Aug 26 11:47:10 2014] [error] [client 94.102.49.179] Error in WordPress Database Lost connection to MySQL server during query a la consulta SELECT option_value FROM wp_options WHERE option_name = 'widget_wppp' LIMIT 1 feta per include('wp-load.php'), require_once('wp-config.php'), require_once('wp-settings.php'), do_action('plugins_loaded'), call_user_func_array, wppp_check_upgrade, get_option

The error log was very ugly.

The access log was not reassuring, as it shown many attacks like that:

94.102.49.179 - - [26/Aug/2014:10:34:58 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
94.102.49.179 - - [26/Aug/2014:10:34:59 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
127.0.0.1 - - [26/Aug/2014:10:35:09 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.2.22 (Ubuntu) (internal dummy connection)"
94.102.49.179 - - [26/Aug/2014:10:34:59 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
94.102.49.179 - - [26/Aug/2014:10:34:59 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
94.102.49.179 - - [26/Aug/2014:10:35:00 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
94.102.49.179 - - [26/Aug/2014:10:34:59 +0000] "POST /xmlrpc.php HTTP/1.0" 200 598 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"

Was difficult to determine if the Server was receiving SQL injections so I wanted to be sure.

Note: The connection from 127.0.0.1 with OPTIONS is created by Apache when spawns another Apache.

As I had super fresh backups in another Server I was not afraid of the attack dropping the database.

I was a bit suspicious also because the /readme.html file mentioned that the version of WordPress is 3.6. In other installations it tells correctly that the version is the 3.9.2 and this file is updated with the auto-update. I was thinking about a possible very sophisticated trojan attack able to modify wp-includes/version.php and set fake $wp_version = ‘3.9.2’;
Later I realized that this blog had WordPress in Catalan, my native language, and discovered that the guys that do the translations forgot to update this file (in new installations it comes not updated, and so showing 3.6). I have alerted them.

In fact later I did a diff of all the files of my WordPress installation against the official WordPress 3.9.2-ca and later a did a diff between the WordPress 3.9.2-ca and the WordPress 3.9.2 (English – default), and found no differences. My Server was Ok. But at this point, at the beginning of the investigation I didn’t know that yet.

With the info I had (queries, times, attack, readme telling v. 3.6…) I balanced the possibility to be in front of something and I decided that I had an unique opportunity to discover how they do to inject those Sql, or discover if my Server was compromised and how. The bad point is that it was the same Amazon’s Server where this blog resides, and I wanted the attack to continue so I could get more information, so during two days I was recording logs and doing some investigations, so sorry if you visited my blog and database was down, or the Server was going extremely slow. I needed that info. It was worth it.

First I changed the Apache config so the massive connections impacted a bit less the Server and so I could work on it while the attack was going on.

I informed my group of Senior friends on what’s going on and two SysAdmins gave me some good suggestions on other logs to watch and on how to stop the attack, and later a Developer joined me to look at the logs and pointed possible solutions to stop the attack. But basically all of them suggested on how to block the incoming connections with iptables and to do things like reinstalling WordPress, disabling xmlrpc.php in .htaccess, changing passwords or moving wp-admin/ to another place, but the point is that I wanted to understand exactly what was going on and how.

I checked the logs, certificates, etc… and no one other than me was accessing the Server. I also double-checked the Amazon’s Firewall to be sure that no unnecessary ports were left open. Everything was Ok.

I took a look at the Apache logs for the site and all the attacks were coming from the same Ip:

94.102.49.179

It is an Ip from a dedicated Servers company called ecatel.net. I reported them the abuse to the abuse address indicated in the ripe.net database for the range.

I found that many people have complains about this provider and reports of them ignoring the requests to stop the spam use from their servers, so I decided that after my tests I will block their entire network from being able to access my sites.

All the requests shown in the access.log pointed to requests to /xmlrpc.php. It was the only path requested by the attacker so that Ip did nothing more apparently.

I added some logging to WordPress xmlrpc.php file:

if ($_SERVER['REMOTE_ADDR'] == '94.102.49.179') {
    error_log('XML POST: '.serialize($_POST));
    error_log('XML GET: '.serialize($_GET));
    error_log('XML REQUEST: '.serialize($_REQUEST));
    error_log('XML SERVER: '.serialize($_SERVER));
    error_log('XML FILES: '.serialize($_FILES));
    error_log('XML ENV: '.serialize($_ENV));
    error_log('XML RAW: '.$HTTP_RAW_POST_DATA);
    error_log('XML ALL_HEADERS: '.serialize(getallheaders()));
}

This was the result, it is always the same:

[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML POST: a:0:{}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML GET: a:0:{}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML REQUEST: a:0:{}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML SERVER: a:24:{s:9:"HTTP_HOST";s:24:"barcelona.afterstart.com";s:12:"CONTENT_TYPE";s:8:"text/xml";s:14:"CONTENT_LENGTH";s:3:"287";s:15:"HTTP_USER_AGENT";s:50:"Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)";s:15:"HTTP_CONNECTION";s:5:"close";s:4:"PATH";s:28:"/usr/local/bin:/usr/bin:/bin";s:16:"SERVER_SIGNATURE";s:85:"<address>Apache/2.2.22 (Ubuntu) Server at barcelona.afterstart.com Port 80</address>\n";s:15:"SERVER_SOFTWARE";s:22:"Apache/2.2.22 (Ubuntu)";s:11:"SERVER_NAME";s:24:"barcelona.afterstart.com";s:11:"SERVER_ADDR";s:14:"[this-is-removed]";s:11:"SERVER_PORT";s:2:"80";s:11:"REMOTE_ADDR";s:13:"94.102.49.179";s:13:"DOCUMENT_ROOT";s:29:"/var/www/barcelona.afterstart.com";s:12:"SERVER_ADMIN";s:19:"webmaster@localhost";s:15:"SCRIPT_FILENAME";s:40:"/var/www/barcelona.afterstart.com/xmlrpc.php";s:11:"REMOTE_PORT";s:5:"40225";s:17:"GATEWAY_INTERFACE";s:7:"CGI/1.1";s:15:"SERVER_PROTOCOL";s:8:"HTTP/1.0";s:14:"REQUEST_METHOD";s:4:"POST";s:12:"QUERY_STRING";s:0:"";s:11:"REQUEST_URI";s:11:"/xmlrpc.php";s:11:"SCRIPT_NAME";s:11:"/xmlrpc.php";s:8:"PHP_SELF";s:11:"/xmlrpc.php";s:12:"REQUEST_TIME";i:1409338974;}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML FILES: a:0:{}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML ENV: a:0:{}
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML RAW: <?xmlversion="1.0"?><methodCall><methodName>pingback.ping</methodName><params><param><value><string>http://seretil.me/</string></value></param><param><value><string>http://barcelona.afterstart.com/2013/09/27/afterstart-barcelona-2013-09-26/</string></value></param></params></methodCall>
[Fri Aug 29 19:02:54 2014] [error] [client 94.102.49.179] XML ALL_HEADERS: a:5:{s:4:"Host";s:24:"barcelona.afterstart.com";s:12:"Content-type";s:8:"text/xml";s:14:"Content-length";s:3:"287";s:10:"User-agent";s:50:"Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)";s:10:"Connection";s:5:"close";}

So nothing in $_POST, nothing in $_GET, nothing in $_REQUEST, nothing in $_SERVER, no files submitted, but a text/xml Posted (that was logged by storing: $HTTP_RAW_POST_DATA):

<?xmlversion="1.0"?><methodCall><methodName>pingback.ping</methodName><params><param><value><string>http://seretil.me/</string></value></param><param><value><string>http://barcelona.afterstart.com/2013/09/27/afterstart-barcelona-2013-09-26/</string></value></param></params></methodCall>

I show you in a nicer formatted aspect:So basically they were trying to register a link to seretil dot me.

I tried and this page, hosted in CloudFare, is not working.

The problem is that responding to this spam xmlrpc request took around 16 seconds to the Server. And I was receiving several each second.

I granted access to my Ip only on the port 80 in the Firewall, restarted Apache, restarted MySql and submitted the same malicious request to the Server, and it even took 16 seconds in all my tests:

cat http_post.txt | nc barcelona.afterstart.com 80

I checked and confirmed that the logs from the attacker were showing the same Content-Length and http code.

Other guys tried xml request as well but did one time or two and leaved.

The problem was that this robot was, and still sending many requests per second for days.

May be the idea was to knock down my Server, but I doubted it as the address selected is the blog of one Social Event for Senior Internet Talents that I organize: afterstart.com. It has not special interest, I do not see a political, hateful or other motivation to attack the blog from this project.

Ok, at this point it was clear that the Ip address was a robot, probably running from an infected or hacked Server, and was trying to publish a Spam link to a site (that was down). I had to clarify those strange queries in the logs.

I reviewed the WPUsersOnline plugin and I saw that the strange queries (and inefficient) that I saw belonged to WPUsersOnline plugin.

The thing was that when I renamed the xmlrpc.php the spamrobot was still posting to that file. According to WordPress .htaccess file any file that is not found on the filesystem is redirected to index.php.

So what was happening is that all the massive requests sent to xmlrpc.php were being attended by index.php, then showing an error message that page not found, but the WPUsersOnline plugin was deleting those connections. And was doing it many times, overloading also the Database.

Also I was able to reproduce the behaviour by myself, isolating by firewalling the WebServer from other Ips other than mine and doing the same post by myself many times per second.

I checked against a friend’s blog but in his Server xmlrpc.php responds in 1,5 seconds. My friend’s Server is a Digital Ocean Virtual Server with 2 cores and SSD Disks. My magnetic disks on Amazon only bring around 40 MB/second. I’ve to check in detail why my friend’s Server responds so much faster.

Checked the integrity of my databases, just in case, and were perfect. Nothing estrange with collations and the only errors in the /var/log/mysql/error.log was due to MySql crashing when the Server ran out of memory.

Rechecked in my Server, now it takes 12 seconds.

I disabled 80% of the plugins but the times were the same. The Statistics show how the things changed -see the spikes before I definitively patched the Server to block request from that Spam-robot ip, to the left-.

I checked against another WordPress that I have in the same Server and it only takes 1,5 seconds to reply. So I decided to continue investigating why this WordPress took so long to reply.

As I said before I checked that the files from my WordPress installation were the same as the original distribution, and they were. Having discarded different files the thing had to be in the database.

Even when I checked the MySql it told me that all the tables were OK, having seen that the WPUserOnline deletes all the registers older than 5 minutes, I guessed that this could lead to fragmentation, so I decided to do OPTIMIZE TABLE on all the tables of the database for the WordPress failing, with InnoDb it is basically recreating the Tables and the Indexes.

I tried then the call via RPC and my Server replied in three seconds. Much better.

Looking with htop, when I call the xmlrpc.php the CPU uses between 50% and 100%.

I checked the logs and the robot was gone. He leaved or the provider finally blocked the Server. I don’t know.

Everything became clear, it was nothing more than a sort of coincidences together. Deactivating the plugin the DELETE queries disappeared, even under heavy load of the Server.

It only was remain to clarify why when I send a call to xmlrpc to this blog, it replies in 1,5 seconds, and when I request to the Barcelona.afterstart.com it takes 3 seconds.

I activated the log of queries in mysql. To do that edit /etc/mysql/my.cnf and uncomment:

general_log_file        = /var/log/mysql/mysql.log
general_log             = 1

Then I checked the queries, and in the case of my blog it performs many less queries, as I was requesting to pingback to an url that was not existing, and WordPress does this query:

SELECT   wp_posts.* FROM wp_posts  WHERE 1=1  AND ( ( YEAR( post_date ) = 2013 AND MONTH( post_date ) = 9 AND DAYOFMONTH( post_date ) = 27 ) ) AND wp_posts.post_name = 'afterstart-barcelona-2013-09-26-meet' AND wp_posts.post_type = 'post'  ORDER BY wp_posts.post_date DESC

As the url afterstart-barcelona-2013-09-26-meet with the dates indicated does not exist in my other blog, the execution ends there and does not perform the rest of the queries, that in the case of Afterstart blog were:

40 Query     SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (81) ORDER BY meta_id ASC
40 Query     SELECT ID, post_name, post_parent, post_type
FROM wp_posts
WHERE post_name IN ('http%3a','','seretil-me')
AND post_type IN ('page','attachment')
40 Query     SELECT   wp_posts.* FROM wp_posts  WHERE 1=1  AND (wp_posts.ID = '0') AND wp_posts.post_type = 'page'  ORDER BY wp_posts.post_date DESC
40 Query     SELECT * FROM wp_comments WHERE comment_post_ID = 81 AND comment_author_url = 'http://seretil.me/'

To confirm my theory I tried the request to my blog, with a valid url, and it lasted for 3-4 seconds, the same than Afterstart’s blog. Finally I double-checked with the blog of my friend and was slower than before. I got between 1,5 and 6 seconds, with a lot of 2 seconds response. (he has PHP 5.5 and OpCache that improves a bit, but the problem is in the queries to the database)

Honestly, the guys creating WordPress should cache this queries instead of performing 20 live queries, that are always the same, before returning the error message. Using Cache Lite or Stash, or creating an InMemory table for using as Cache, or of course allowing the use of Memcached would eradicate the DoS component of this kind of attacks. As the xmlrpc pingback feature hits the database with a lot of queries to end not allowing the publishing.

While I was finishing those tests (remember that the attacker ip has gone) another attacker from the same network tried, but I had patched the Server to ignore it:

94.102.52.157 - - [31/Aug/2014:02:06:16 +0000] "POST /xmlrpc.php HTTP/1.0" 200 189 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"

This was trying to get a link published to a domain called socksland dot net that is a domain registered in Russia and which page is not working.

As I had all the information I wanted I finally blocked the network from the provider to access my Server ever again.

Unfortunatelly Amazon’s Firewall does not allow to block a certain Ip or range.
So you can block at Iptables level or in .htaccess file or in the code.
I do not recommend blocking at code level because sadly WordPress has many files accessible from outside so you would have to add your code at the beginning of all the files and because when there is a WordPress version update you’ll loss all your customizations.
But I recommend proceeding to patch your code to avoid certain Ip’s if you use a CDN. As the POST will be sent directly to your Server, and the Ip’s are Ip’s from the CDN -and you can’t block them-. You have to look at the Header: X-Forwarded-For that indicates the Ip’s the proxies have passed by, and also the Client’s Ip.

I designed a program that is able to patch any PHP project to check for blacklisted Ip’s (even though a proxy) with minimal performance impact. It works with WordPress, drupal, joomla, ezpublish and Framework like Zend, Symfony, Catalonia… and I patched my code to block those unwanted robot’s requests.

A solution that will work for you probably is to disable the pingback functionality, there are several plugins that do that. Disabling completely xmlrpc is not recommended as WordPress uses it for several things (JetPack, mobile, validation…)

The same effect as adding the plugin that disables the xmlrpc pingback can be achieved by editing the functions.php from your Theme and adding:

add_filter( 'xmlrpc_methods', 'remove_xmlrpc_pingback_ping' );
function remove_xmlrpc_pingback_ping( $methods ) {
    unset( $methods['pingback.ping'] );
    
    return $methods;
}

Update: 2016-02-24 14:40 CEST
I got also a heavy dictionary attack against wp-login.php .Despite having a Captcha plugin, that makes it hard to hack, it was generating some load on the system.
What I did was to rename the wp-login.php to another name, like wp-login-carles.php and in wp-login.php having a simply exit();

<?php
exit();

Take in count that this will work only until WordPress is updated to the next version. Then you have to reapply the renaming trick.

Improving performance in PHP

Article created: 2014-08-11 Last Update: 2014-08-11

This year I was invited to speak at the PHP Conference at Berlin 2014.

It was really nice, but I had to decline as I was working hard in a Start up, and I hadn’t the required time in order to prepare the nice conference I wanted and that people deserves.

However, having time, I decided to write an article about what I would had speak at the conference.

I will cover improving performance in a single server, and Scaling out multi-Server architecture, focusing on the needs of growing and Start up projects. Many of those techniques can be used to improve performance with other languages, not just with PHP.

Many of my friends are very good Developing, but know nothing about Architecture and Scaling. Hope this approach the two worlds, Development ad Operatings, into a DevOps bridge.

Improving performance on a single server

Hosting

Choose a good hosting. And if you can afford it choose a dedicated server.

Shared hostings are really bad. Some of them kill your http and mysql instances if you reach certain CPU use (really few), while others share the same hardware between 100+ users serving your pages sloooooow. Others cap the amount of queries that your MySql will handle per hour at so ridiculous few amount that even Drupal or WordPress are unable to complete a request in development.

Other ISP (Internet Service Providers) have poor Internet bandwidth, and so you web will load slow to users.

Some companies invest hundreds of thousands in developing a web, and then spend 20 € a year in the hosting. Less than the cost of a dinner.

You can use a decent dedicated server from 50 to 99 €/month and you will celebrate this decision every day.

Take in count that virtualization wastes between 20% and 30% of the CPU power. And if there are several virtual machines the loss will be more because you loss the benefits of the CPU caching for optimizing parallel instructions execution and prediction. Also if the hypervisor host allows to allocate more RAM than physically available and at some point it swaps, the performance of all the VM’s will be much worst.

If you have a VM and it swaps, in most providers the swap goes over the network so there is an additional bottleneck and performance penalty.

To compare the performance of dedicated servers and instances from different Cloud Providers you can take a look at my project cmips.net

Improve your Server

If your Sever has few RAM, add more. And if your project is running slow and you can afford a better Server, do it.

Using SSD disk will incredibly improve the performance on I/O operations and on swap operations. (but please, do backups and keep them in another place)

If you use a CMS like ezpublish with http_cache enabled probably you will prefer to have a Server with faster cores, rather tan a Server with one or more CPU’s plenty of cores, but slower cores, and that last for a longer time to render the page to the http cache.

That may seem obvious but often companies invest 320 hours in optimizing the code 2%, at a cost of let’s say 50 €/h * 320 hours = 16.000 €, while hiring a better Server would had bring between a 20% to 1000% improvement at a cost of additional 50€/month only or at the cost of 100 € of increasing the RAM memory.

The point here is that the hardware is cheap, while the time of the Engineers is expensive. And good Engineers are really hard to find.

And you probably, as a CEO or PO, prefer to use the talent to warranty a nice time to market for your project, or adding more features, rather than wasting this time in refactorizing.

Even with the most optimal code in the universe, if your project is successful at certain point you’ll have to scale. So adding more Servers. To save a Server now at the cost of slowing the business has not any sense.

Upgrade you PHP version

Many projects still use PHP 5.3, and 5.4.

Latest versions of PHP bring more and more performance. If you use old versions of PHP you can have a Quick Win by just upgrading to the last PHP version.

Use OpCache (or other cache accelerator)

OpCache is shipped with PHP 5.5 by default now, so it is the recommended option. It is though to substitute APC.

To activate OpCache edit php.ini and add:

Linux/Unix:

zend_extension=/path/to/opcache.so

Windows:

zend_extension=C:\path\to\php_opcache.dll

It will greatly improve your PHP performance.

Ensure that OpCache in Production has the optimal config for Production, that will be different from Development Environment.

Note: If you plan to use it with XDebug in Development environments, load OpCache before XDebug.

Disable Profiling and xdebug in Production

In Production disable the profiling, xdebug, and if you use a Framework ensure the Development/Debug features are disabled in Production.

Ensure your logs are not full of warnings

Check that Production logs are not full of warnings.

I’ve seen systems were every seconds 200 warnings were written to logs, the same all the time, and that obviously was slowing down the system.

Typical warnings like this can be easily fixed:

Message: date() [function.date]: It is not safe to rely on the system’s timezone settings. You are required to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected ‘UTC’ for ‘8.0/no DST’ instead

Profile in Development

To detect where your slow code is, profile it in Development to see where it is spent the most CPU/time.

Check the slow-queries if you use MySql.

Cache html to disk

Imagine you have a sort of craigslist and you are displaying all the categories, and the number of new messages in this landing page. To do that you are performing many queries to the database, SELECT COUNTs, etc… every time a user visits your page. That certainly will overload your database with actually few concurrent visitors.

Instead of querying the Database all the time, do cache the generated page for a while.

This can be achieved by checking if the cache html file exists, and checking the TTL, and generating a new page if needed.

A simple sample would be:

<?php
    // Cache pages for 5 minutes
    $i_cache_TTL = 300;

    $b_generate_cache = false;

    
    $s_cache_file = '/tmp/index.cache.html';


    if (file_exists($s_cache_file)) {
        // Get creation date
        $i_file_timestamp = filemtime($s_cache_file);
        $i_time_now = microtime(true);
        if ($i_time_now > ($i_file_timestamp + $i_cache_TTL)) {
            $b_generate_cache = true;
        } else {
            // Up to date, get from the disk
            $o_fh = fopen($s_cache_file, "rb");
            $s_html = stream_get_contents($o_fh);
            fclose($o_fh);
            
            // If the file was empty something went wrong (disk full?), so don't use it
            if (strlen($s_html) == 0) {
                $b_generate_cache = true;
            } else {
                // Print the page and exit
                echo $s_html;
                exit();
            }
        }
    } else {
        $b_generate_cache = true;
    }

    ob_start();

    // Render your page normally here
    // ....

    $s_html = ob_get_clean();

    if ($b_generate_cache == true) {
        // Create the file with fresh contents
        $o_fp = fopen($s_cache_file, 'w');
        if (fwrite($o_fp, $s_html) === false) {
            // Error. Impossible to write to disk
            // throw new Exception('CacheCantWrite');
        }
        fclose($o_fp);
    }

    // Send the page to the browser
    echo $s_html;

This sample is simple, and works for many cases, but presents problems.

Imagine for example that the page takes 5 seconds to be generated with a single request, and you have high traffic in that page, let’s say 500 requests per second.

What will happen when the cache expires is that the first user will trigger the cache generation, and the second, and the third…. so all of the 500 requests * 5 seconds will be hitting the database to generate the cache, but… if creating the page per one requests takes 5 seconds, doing this 2,500 times will not last 5 seconds… so your process will enter in a vicious state where the first queries have not ended after minutes, and more and more queries are being added to the queue until:

a) Apache runs out of childs/processes, per configuration

b) Mysql runs out of connections, per configuration

c) Linux runs out of memory, and processes crashes/are killed

Not to mention the users or the API client, waiting infinitely for the http request to complete, and other processes reading a partial file (size bigger than 0 but incomplete).

Different strategies can be used to prevent that, like:

a) using semaphores to lock access to the cache generation (only one process at time)

b) using a .lock file to indicate that the file is being generated, and so next requests serving from the cache until the cache generation process ends the task, also writing to a buffer like acachefile.buffer (to prevent incomplete content being read) and finally when is complete renaming to the final name and removing the .lock

c) using memcached, or similar, to keep an index in memory of what pages are being generated now, and why not, keeping the cached files there instead of a filesystem

d) using crons to generate the cache files, so they run hourly and you ensure only one process generates the cache files

If you use crons, a cheap way to generate the .html content is that the crons curls/wget your webpage. I don’t recommend this as has some problems, like if that web request fails for any reason, you’ll have cached an error instead of content.

I prefer preparing my projects to being able of rendering the content being invoked from HTTP/S or from command line. But if you use curl because is cheap and easy and time to market is important for your project, then be sure that you check that your backend code writes an Status OK in the HTML that the cron can check to ensure that the content has been properly generated. (some crons only check for http status, like 200, but if your database or a xml gateway you use fails you will likely get a 200 and won’t detect that you’re caching pages with “error I can’t connect to the database” instead)

Many Frameworks have their own cache implementation that prevent corruption that could come by several processes writing to the same file at the same time, or from PHP dying in the middle of the render.

You can see a more complex MVC implementation, with Views, from my Framework Catalonia here:

By serving .html files instead of executing PHP with logic and performing queries to the database you will be able to serve hundreds of thousands requests per day with a single machine and really fast -that’s important for SEO also-.

I’ve done this in several Start ups with wonderful results, and my Framework Catalonia also incorporates this functionality very easily to use.

Note: This is only one of the techniques to save the load of the Database Servers. Many more come later.

Cache languages to disk

If you have an application that is multi-language, or if your point for the Strings (sections, pages, campaigns..) to be edited by Marketing is the Database, there is no need to query it all the time.

Simply provide a tool to “generate language files”.

Your languages files can be Javascript files loaded by the page, or can be PHP files generated.

For example, the file common_footer_en.php could be generated reading from Database and be like that:

<php
/* Autogenerated English translations file common_footer_en.php
   on 2014-08-10 02:22 from the database */
$st_translations['seconds']                = 'seconds';
$st_translations['Time']                   = 'Time';
$st_translations['Vars used']              = 'Vars used in these templates';
$st_translations['Total Var replacements'] = 'Total replaced';
$st_translations['Exec time']              = 'Execution time';
$st_translations['Cached controller']      = 'Cached controller';

So the PHP file is going to be generated when someone at your organization updates the languages, and your code is including it normally like with any other PHP file.

Use the Crons

You can set cron jobs to do many operations, like map reduce, counting in the database or effectively deleting the data that the user selected to delete.

Imagine that you have classified portal, and you want to display the number of announces for that category. You can have a table NUM_ANNOUNCES to store the number of announces, and update it hourly. Then your database will only do the counting once per hour, and your application will be reading the number from the table NUM_ANNOUNCES.

The Cron can also be used to make expire old announces. That way you can avoid a user having to wait for that clean up taking process when you have a http request to PHP.

A cron file can be invoked by:

php -f cron.php

By:

./cron.php

If you give permissions of execution with chmod +x and set the first line in cron.php as:

#!/usr/bin/env php

Or you can do a trick, that is emulate a http request from bash, by invoking a url with curl or with wget. Set the .htaccess so the folder for the cron tasks can only be executed from localhost for adding security.

This last trick has the inconvenient that the calling has the same problems as any http requests: restarting Apache will kill the process, the connection can be closed by timed out (e.g. if process is taking more seconds than the max. execution time, etc…)

Use Ramdisk for PHP files

With Linux is very easy to setup a RamDisk.

You can setup a RamDisk and rsync all your web .PHP files at system boot time, and when deploying changes, and config Apache to use the Ramdisk folder for the website.

That way for every request to the web, PHP files will be served from RAM directly, saving the slow disk access. Even with OpCache active, is a great improvement.

At these times were one Gigabyte of memory is really cheap there is a huge difference from reading files from disk, and getting them from memory. (Reading and writing to RAM memory is many many many times faster than magnetic disks, and many times faster than SSD disks)

Also .js, .css, images… can be served from a Ram disk folder, depending on how big your web is.

Ramdisk for /tmp

If your project does operations on disk, like resizing images, compressing files, reading/writing large CSV files, etcetera you can greatly improve the performance by setting the /tmp folder to a Ramdisk.

If your PHP project receives file uploads they will also benefit (a bit) from storing the temporal files to RAM instead to the disk.

Use Cache Lite

Cache Lite is a Pear extension that allows you to keep data in a local cache of the Web Server.

You can cache .html pages, or you can cache Queries and their result.

Example from http://pear.php.net/manual/en/package.caching.cache-lite.cache-lite.save.php:

<?php
require_once "Cache/Lite.php";

$options = array(
    'cacheDir' => '/tmp/',
    'lifeTime' => 7200,
    'pearErrorMode' => CACHE_LITE_ERROR_DIE
);
$cache = new Cache_Lite($options);

if ($data = $cache->get('id_of_the_page')) {

    // Cache hit !
    // Content is in $data
    echo $data;

} else { 
    
    // No valid cache found (you have to make and save the page)
    $data = '<html><head><title>test</title></head><body><p>this is a test</p></body></html>';
    echo $data;
    $cache->save($data);
    
}

It is nice that Cache Lite handles the TTL and keeps the info stored in different sub-directories in order to keep a decent performance. (As you may know many files in the same directory slows the access much).

Use HHVM (HipHop Virtual Machine) from Facebook

Facebook Engineers are always trying to optimize what is run on the Servers.

Faster code means, less machines. Even 1% of CPU use improvement means a lot of Servers less. Less Servers to maintain, less money wasted, less space on the Data Centers…

So they created the HHVM HipHop Virtual Machine that is able to run PHP code, much much faster than PHP. And is compatible with most of the Frameworks and Open Source projects.

They also created the Hack language that is an improved PHP, with type hinting.

So you can use HHVM to make your code run faster with the same Server and without investing a single penny.

Use C extensions

You can create and use your own C extensions.

C extensions will bring really fast execution. Just to get the idea:
I built a PHP extension to compare the performance from calculating the Bernoulli number with PHP and with the .so extension created in C.
In my Core i7 times were:
PHP:
Computed in 13.872583150864 s
PHP calling the C compiled extension:
Computed in 0.038495063781738 s

That’s 360.37 times faster using the C extension. Not bad.

Use Zephir

Zephir is a an Open Source language, very similar to PHP, that allows to create and maintain easily extensions for PHP.

Use Phalcon

Phalcon is a Web MVC Framework implemented as C extension, so it offers a high performance.

The views syntax are very very similar to Twig.

Tutorial – Creating a Poll application in 15 minutes with Phalcon from Phalcon Framework on Vimeo.

Check if you’re using the correct Engine for MySql

Many Developers create the tables and never worry about that. And many are using MyIsam by default. It was the by default Engine prior to MySql 5.5.

While MyIsam can bring good performance in some certain cases, my recommendation is to use InnoDb.

Normally you’ll have a gain in performance with MyIsam if you’ve a table were you only write or only read, but in all the other cases InnoDb is expected to be much more performant and safe.

MyIsam tables also get corruption from time to time and need manually fixing and writing to disks are not so reliable than InnoDb.

As MyIsam uses table-locking for updates and deletes to any existing row, it is easy to see that if you’re in a web environment with multiple users, blocking the table -so the other operations have to wait- will make things be slow.

If you have to use Joins clearly you will benefit from using InnoDb also.

Use InMemory Engine from MySql

MySql has a very powerful Engine called InMemory.

The InMemory Engine will store things in RAM and loss the data when MySql is restarted.

However is very fast and very easy to use.

Imagine that you have a travel application that constantly looks at which country belongs the city specified by user. A Quickwin would be to INSERT all this data in the InMemory Engine of MySql when it is started, and do just one change in your code: to use that Table.

Really easy. Quick improvement.

Use curl asynchronously

If your PHP has to communicate with other systems using curl, you can do the http/s call, and instead of waiting for a response let your PHP do more things in the meantime, and then check the results.

You can also call to multiple curl calls in parallel, and so avoid doing one by one in serial.

Here you have a sample.

Serialize

Guess that you have a query that returns 1000 results. Then you add one by one to an array.

Probably you’re going to have substantial gain if you keep in the database a single row, with the array serialized.

So an array like:

$st_places = Array(‘Barcelona’, ‘Dublin’, ‘Edinburgh’, ‘San Francisco’, ‘London’, ‘Berlin’, ‘Andorra la Vella’, ‘Prats de Lluçanès’);

Would be serialized to an string like:

a:8:{i:0;s:9:”Barcelona”;i:1;s:6:”Dublin”;i:2;s:9:”Edinburgh”;i:3;s:13:”San Francisco”;i:4;s:6:”London”;i:5;s:6:”Berlin”;i:6;s:16:”Andorra la Vella”;i:7;s:19:”Prats de Lluçanès”;}

This can be easily stored as String and unserialized later back to an array.

Note: In Internet we have a lot of encodings, Hebrew, Japanese… languages. Be careful with encodings when serializing, using JSon, XML, storing in databases without UTF support, etc…

Use Memcached to store common things

Memcached is a NoSql database in memory that can run in cluster.

The idea is to keep things there, in order to offload the load of the database. And as everything is in RAM it really runs fast.

You can use Memcached to cache Queries and their results also.

For example:

You have query SELECT * FROM translations WHERE section=’MAIN’.

Then you look if that String exists as key in the Memcached, and if it exists you fetch the results (that are serialized) and you avoid the query. If it doesn’t exist, you do normally the query to the database, serialize the array and store it in the Memcached with a TTL (Time to Live) using the Query (String) as primary key. For security you may prefer to hash the query with MD5 or SHA-1 and use the hash as key instead of using it plain.

When the TTL is reached the validity of the data would have expired and so it’s time to reinsert the contents in the next query.

Be careful, I’ve seen projects that were caching private data from users without isolating the key properly, so other users were getting the info from other users.

For example, if the key used was ‘Name’ and the value ‘Carles Mateo’ obviously the next user that fetch the key ‘Name’ would get my name and not theirs.

If you store private data of users in Memcache, it is a nice idea to append the owner of that info to the hash. E.g. using key: 10701577-FFADCEDBCCDFFFA10C

Where ‘10701577’ would be the user_id of the owner of the info, and ‘FFADCEDBCCDFFFA10C’ a hash of the query.

Before I suggested that you can keep a table of counting for the announces in a classified portal. This number can be stored in the Memcached instead.

You can store also common things, like translations, or cities like in the example before, rate of change for a currency exchanging website…

The most common way to store things there is serialized or Json encoded.

Be aware of the memory limits of Memcached and contrl the cache hitting ratio to avoid inserting data, and losing it constantly because is used few and Memcached has few memory.

You can also use Redis.

Use jQuery for Production (small file) and minimized files for js

Use the Production jQuery library in Production, I mean do not use the bigger file Development jQuery library for Production.

There are product that eliminate all the necessary spaces in .js and .css files, and so are served much faster. These process is called minify.

It is important to know that in many emerging markets in the world, like Brazil, they have slow DSL lines. Many 512 Kbit/secons, and even modem connections!.

Activate compression in the Server

If you send large text files, or Jsons, you’ll benefit from activating the compression at the Server.

It consumes some CPU, but many times it brings an important improvement in speed serving the pages to the users.

Use a CDN

You can use a Content Delivery Network to offload your Servers from sending plain texts, html, images, videos, js, css…

You can delegate this to the CDN, they have very speedy Internet lines and Servers, so your Servers can concentrate into doing only BackEnd operations.

The most well known are Akamai and Amazon Cloud Front.

Please take attention to the documentation, a common mistake is to send Cache Headers to the CDN servers, while they’ll use this headers to set the cache TTL and ignore their web configuration parameters. (For example s-maxage, like: Cache-Control: public, s-maxage=600)

This sample header:

HTTP/1.1 200 OK
Server: nginx
Date: Wed, 20 Aug 2014 10:50:21 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Vary: Accept-Encoding
Cache-Control: max-age=0, public, s-maxage=10800
Vary: X-User-Hash,Accept-Encoding
X-Location-Id: 2
X-Content-Digest: ezlocation/2/end5139244ced4b25606ef0a39235982b1662d01cc
Content-Length: 68250
Age: 3

You can take a look at any website by telneting to the port 80 and doing the request manually or easily by using lynx:

lynx -mime_header http://blog.carlesmateo.com | less

Do you need a Framework?

If you’re processing only BackEnd petitions, like in the video games industry, serving API’s, RESTful, etc… you probably don’t need a Framework.

The Frameworks are generic and use much more resources than you’re really need for a fast reply.

Many times using a heavy Framework has a cost of factor times, compared to use simply PHP.

Save database connections until really needed

Many Frameworks create a connection to the Database Server by default. But certain parts of your code application do not require to connect to the database.

For example, validating the data from a form. If there are missing fields, the PHP will not operate with the Database, just return an error via JSon or refreshing the page, informing that the required field is missing.

If a not logged user is requesting the dashboard page, there is no need to open a connection to the database (unless you want to write the access try to an error log in the database).

In fact opening connections by default makes easier for attackers to do DoS attacks.

With a Singleton pattern you can easily implement a Db class that handles this transparently for you.

Scaling out / Multi Server Environment

Memcached session

When you have several Web Servers you’ll need something more flexible than the default PHP handler (that stores to a file in the Web Server).

The most common is to store the Session, serialized, in a Memcached Cluster.

Use Cassandra

Apache Cassandra is a NoSql database that allows to Scale out very easily.

The main advantage is that scales linearly. If you have 4 nodes and add 4 more, your performance will be doubled. It has no single point of failure, is also resilient to node failures, it replicates the data among the nodes, splits the load over the nodes automatically and support distributed datacenter architectures.

To know more abiut NoSql and Cassandra, read my article: Upgrade your scalability with NoSql. And to start developing with Cassandra in PHP, python or Java read my contributed article: Begin developing with Cassandra.

Use MySql primary and secondaries

A easy way to split the load is to have a MySql primary Server, that handles the writes, and MySql secondary (or Slave) Servers handling the reads.

Every write sent to the Master is replicated into the Slaves. Then your application reads from the slaves.

You have to tell your code to do the writes to database to the primary Server, and the reads to the secondaries. You can have a Load Balancer so your code always ask the Load Balancer for the reads and it makes the connection to the less used server.

Do Database sharding

To shard the data consist into splitting the data according to a criteria.

For example, imagine we have 8 MySql Servers, named mysql0 to mysql7. If we want to insert or read data for user 1714, then the Server will be chosen from dividing the user_id, so 1714, between the number of Servers, and getting the MOD.

So 1714 % 8 gives 2. This means that the MySql Server to use is the mysql2.

For the user_id 16: 16 & 8 gives 0, so we would use mysql0. And so.

You can shard according to the email, or other fields as well. And you can have the same master and secondaries for the shards also.

When doing sharding in MySql you cannot do joins to data in other Servers. (but you can replicate all the data from the several shards in one big server in house, in your offices, and so query it and join if you need that for marketing purposes).

I always use my own sharding, but there is a very nice product from CodeFutures called dbshards. It handles the traffics transparently. I used it when in a video games Start up with very satisfying result.

Use Cassandra assync queries

Cassandra support asynchronous queries. That means you can send the query to the Server, and instead of waiting, do other jobs. And check for the result later, when is finished.

Consider using Hadoop + HBASE

A Cluster alternative to Cassandra.

Use a Load Balancer

You can put a Load Balancer or a Reverse Proxy in front of your Web Servers. The Load Balancer knows the state of the Web Servers, so it will remove a Web Server from the Array if it stops responding and everything will continue being served to the users transparently.

There are many ways to do Load Balancing: Round Robin, based on the load on the Web Servers, on the number of connections to each Web Server, by cookie…

To use a Cookie based Load Balancer is a very easy way to split the load for WordPress and Drupal Servers.

Imagine you have 10 Web Servers. In the .htaccess they set a rule to set a Cookie like:

SERVER_ID=WEB01

That was in the case of the first Web Server.

Second Server would have in the .htaccess to set a Cookie like:

SERVER_ID=WEB02

Etcetera

When for first time an user connects to the Load Balancer it sends the user to one of the 10 Web Servers. Then the Web Server sends its cookie to the browser of the Client. E.g. WEB07

After that, in the next requests from the client it will be redirected to the server by the Load Balancer to the Server that set the Cookie, so in this example WEB07.

The nice thing of this way of splitting the traffic is that you don’t have to change your code, nor handling the Sessions different.

If you use two Load Balancers you can have a heartbeat process in them and a Virtual Ip, and so in case your main Load Balancer become irresponsible the Virtual Ip will be mapping to the second Load Balancer in milliseconds. That provides HA.

Use http accelerators

Nginx, varnish, squid… to serve static content and offload the PHP Web Servers.

Auto-Scale in the Cloud

If you use the Cloud you can easily set Auto-Scaling for different parts of your core.

A quick win is to Scale the Web Servers.

As in the Cloud you pay per hour using a computer, you will benefit from cost reduction in you stop using the servers when you don’t need them, and you add more Servers when more users are coming to your sites.

Video game companies are a good example of hours of plenty use and valleys with few users, although as users come from all the planet it is most and most diluted.

Some cool tools to Auto-Scaling are: ECManaged, RightScale, Amazon CloudWatch.

Use Google Cloud

Actually the Performance of the Google Cloud to Scale without any precedent is great.

Opposite to other Clouds that are based on instances, Google Cloud offers the platform, that will spawn your code across so many servers as needed, transparently to you. It’s a black box.

Schedule operations with RabbitMQ

Or other Queue Manager.

The idea is to send the jobs to the Queue Manager, the PHP will continue working, and the jobs will be performed asynchronously and notify the end.

RabbitMQ is cool also because it can work in cluster and HA.

Use GlusterFs for NAS

GlusterFs (and other products) allow you to have a Distributed File System, that splits the load and the data across the Servers, and resist node failures.

If you have to have a shared folder for the user’s uploads, for example for the profile pictures, to have the PHP and general files locally in the Servers and the Shared folder in a GlusterFs is a nice option.

Avoid NFS for PHP files and config files

As told before try to have the PHP files in a RAM disk, or in the local disk (Linux caches well and also OpCache), and try to not write code that reads files from disk for determining config setup.

I remember a Start up incubator that had a very nice Server, but the PHP files were read from a mounted NFS folder.

That meant that on every request, the Server had to go over the network to fetch the files.

Sadly for the project’s performance the PHP was reading a file called ENVIRONMENT that contained “PROD” or “DEVEL”. And this was done in every single request.

Even worst, I discovered that the switch connecting the Web Server and the NFS Server was a cheap 10 Mbit one. So all the traffic was going at 10 Mbit/s. Nice bottleneck.

Improve your network architecture

You can use 10 GbE (10 Gigabit Ethernet) to connect the Servers. The Web Servers to the Databases, Memcached Cluster, Load Balancers, Storage, etc…

You will need 10 GbE cards and 10 GbE switchs supporting bonding.

Use bonding to aggregate 10 + 10 so having 20 Gigabit.

You can also use Fibre Channel, for example 10 Gb and aggregate them, like 10 + 10 so 20 Gbit for the connection between the Servers and the Storage.

The performance improvements that your infrastructure will experiment are amazing.