
Stopping a BitTorrent DDoS attack

After the success of my article on stopping an XML-RPC attack against a WordPress site, and the thank-you messages it brought (I actually helped a company that was being taken down every day and asked me for help), it’s time to explain how to stop a far more evil attack.

The first sign I saw was that the server was getting slower and slower, which should be nearly impossible: I set up a very good server, and the application uses a lot of good development techniques to avoid bottlenecks.

I looked at the server and saw around 3,000 SYN_SENT packets. Apparently we were under a SYN Flood attack.

netstat revealed more than 6,000 different IP addresses connecting to the Server.
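Numbers like these can be obtained by summarizing the netstat output per connection state and per remote address. The following is only a minimal sketch in Python, assuming the usual six-column layout printed by `netstat -ant` on Linux; the sample lines and the function name `summarize_connections` are hypothetical and do not come from the attacked Server.

```python
from collections import Counter

# Hypothetical sample in the layout printed by `netstat -ant` on Linux:
# Proto Recv-Q Send-Q Local-Address Foreign-Address State
SAMPLE_NETSTAT = """\
tcp        0      0 10.0.0.5:80      203.0.113.10:51234   SYN_RECV
tcp        0      0 10.0.0.5:80      203.0.113.11:40022   SYN_RECV
tcp        0      0 10.0.0.5:80      203.0.113.10:51240   ESTABLISHED
tcp        0      0 10.0.0.5:443     198.51.100.7:39811   TIME_WAIT
tcp        0      0 0.0.0.0:80       0.0.0.0:*            LISTEN
"""


def summarize_connections(netstat_output):
    """Return (connections per state, set of remote IPv4 addresses)."""
    states = Counter()
    remote_ips = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only consider TCP lines that have all six columns
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        state = fields[5]
        states[state] += 1
        if state == "LISTEN":
            continue  # listening sockets have no real remote peer
        # The foreign address is ip:port; split on the last colon
        remote_ips.add(fields[4].rsplit(":", 1)[0])
    return states, remote_ips


if __name__ == "__main__":
    states, remote_ips = summarize_connections(SAMPLE_NETSTAT)
    for state, count in states.most_common():
        print(state, count)
    print("Unique remote IPs:", len(remote_ips))
```

On a real Server the text would come from running netstat instead of the sample string; a sudden jump in half-open connections or in unique remote addresses is the kind of signal described here.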

The Server had only 30 GB of RAM, and it was filling up: more and more connections meant more and more Apache processes to respond quickly to the real users, so it was clear that it was going to struggle.

I improved the Apache configuration so the Server could handle many more connections with less memory consumption and overhead, added some enhancements for blocking SYN Flood attacks, and restarted Apache.

I greatly reduced the impact of the attacks, but I knew it would only get worse. I was buying time without disrupting the functioning of the website.

Over the next hours the attacks grew to around 7,500 concurrent connections. Memory was reaching its limits, so I decided it was time to upgrade the instance. I doubled the memory and added many more cores, up to 36, by moving to one of the newest Amazon instances, a c4.8xlarge.


The good thing about the Cloud is that you pay for the time you use the resources. So when the waters calm down again, I can reduce the size of the instance and save the company a few hundred.

I knew it was a matter of time. The server stabilized at 40 GB used out of the 60 GB, but I knew the pirates would keep trying to shut down the service.

Once the SYN Flood was stopped and I was sure that the service was safe for a while, I checked the logs to see if I could detect a pattern among the attacks. I did.


Most of the requests we were receiving were for a file called announce.php, which obviously does not exist on the server, so it was returning a 404 error.

The user agent reported was in many cases BitTorrent, or a Torrent-compatible product, and the URL was sending a hash, uploaded, downloaded, left… so I realized that somehow my Server was targeted by a Torrent attack, in which the Server had been announced as a Torrent tracker.
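To make the pattern concrete, here is a small sketch in Python that parses a request of the kind described and pulls out the tracker parameters. The sample path, the hash bytes and the function names are hypothetical, made up for illustration; only the parameter names (info_hash, uploaded, downloaded, left) come from the logs.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical request path; the 20-byte hash is made up for illustration
SAMPLE_PATH = (
    "/announce.php?info_hash=%12%34%56%78%9A%BC%DE%F0%12%34"
    "%56%78%9A%BC%DE%F0%12%34%56%78"
    "&uploaded=0&downloaded=0&left=1048576"
)


def looks_like_tracker_announce(path):
    """Return True when the query string carries a BitTorrent info_hash."""
    query = urlsplit(path).query
    # latin-1 keeps the raw hash bytes; blank values still count as present
    params = parse_qs(query, keep_blank_values=True, encoding="latin-1")
    return "info_hash" in params


def announce_counters(path):
    """Return the numeric progress counters sent by the client, if any."""
    params = parse_qs(urlsplit(path).query, encoding="latin-1")
    names = ("uploaded", "downloaded", "left")
    return {name: int(params[name][0]) for name in names if name in params}


if __name__ == "__main__":
    print(looks_like_tracker_announce(SAMPLE_PATH))
    print(announce_counters(SAMPLE_PATH))
    print(looks_like_tracker_announce("/products?id=3"))
```

A normal page request has no info_hash, so a check this simple is enough to tell the unwanted traffic apart from real visitors.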

The .htaccess in frameworks like Laravel, Catalonia Framework… and in CMSs like WordPress, Joomla, ezpublish… tries to read the requested file from the filesystem and, if it doesn’t exist, serves index.php. So as a first action I created a file /announce.php that simply did an exit();

Sample .htaccess from Laravel:

<IfModule mod_rewrite.c>
    <IfModule mod_negotiation.c>
        Options -MultiViews
    </IfModule>

    RewriteEngine On

    # Redirect Trailing Slashes...
    RewriteRule ^(.*)/$ /$1 [L,R=301]

    # Handle Front Controller...
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteRule ^ index.php [L]
</IfModule>

Sample code for announce.php would be like:

<?php
/**
 * Creator: Carles Mateo
 * Date: 2015-01-21 Time: 09:39
 */

// A cheap way to stop an attack based on requesting this file
http_response_code(406);
exit();

The 406 response code was an attempt to see whether the BitTorrent clients were sensitive to headers and would stop. They didn’t.

With the simple addition of announce.php, with exit(), I reduced the load on the Server from 90% to 40% in just one second.

The reason a not-found page was causing so much damage is that the Server’s 404 error page is customized and offers alternative results (assuming the product you were looking for is no longer available). Before displaying it, the whole Framework is loaded and the routes are checked to see whether the URL matches, so there is some processing to be done on the PHP side. It takes 100 ms to reply, which is not much, but even being very optimized, every single not-found URL was wasting CPU unnecessarily. Since the attack came from more than 7,000 different IPs simultaneously, at a certain point it would become a problem and the Server would start returning 500 errors to the customers.

The logs were also showing other patterns, for example:


That is, without the PHP extension. Those requests would not hit my wall file announce.php but would go through index.php (as the .htaccess sends whatever is not found there).

I could change the .htaccess to send those requests to hell, but I wanted a more definitive solution, something that would stop the Server from wasting CPU and make it able to resist an attack 1,000 times harder.

In the end the common pattern was that the BitTorrent clients were sending via GET a parameter called info_hash, so I blocked all the requests on that basis.

I wrote this small program, and added it to index.php

// Patch urgency Carles to stop an attack based on Torrent
// http://blog.carlesmateo.com
if (isset($_GET['info_hash'])) {

    // In case you use CDN, proxy, or load balancer
    $s_ip_proxy = '';

    $s_ip_address = $_SERVER['REMOTE_ADDR'];

    // Warning if you use a CDN, a proxy server or a load balancer do not add the ip to the blacklisted
    if ($s_ip_proxy == '' || ($s_ip_proxy != '' && $s_ip_address != $s_ip_proxy)) {
        $s_date = date('Y-m-d');

        $s_ip_log_file = '/tmp/ip-to-blacklist-'.$s_date.'.log';
        file_put_contents($s_ip_log_file, $s_ip_address."\n", FILE_APPEND | LOCK_EX);
    }

    // 406 means 'Not Acceptable'
    http_response_code(406);
    exit();
}

Please note, this code can be added to any software like Zend Framework, Symfony, Catalonia Framework, Joomla, WordPress, Drupal, ezpublish, Magento… Just add those lines at the beginning of public/index.php, right before the Framework starts its work. Just be careful: after a core update, you’ll have to reapply it.

After that I deleted the no-longer-needed announce.php

What the program does is write the IP connecting with the Torrent request pattern (unless it is the proxy/CDN IP you have defined) to a log file named after the current date, following the pattern /tmp/ip-to-blacklist-YYYY-MM-DD.log

It also calls exit(), stopping the execution and saving many CPU cycles.

The idea of the date at the end of the file name is to blacklist the IPs for only 24 hours, as we will see later.

With this I reduced CPU consumption to around 5-15%.
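The same idea can be sketched outside PHP. The following Python version is only an illustration of the logic, not the code running on the Server: the function names, the dictionary of query parameters and the temporary directory are assumptions made for the example, and the date is the one from the announce.php header.

```python
import datetime
import os
import tempfile


def blacklist_log_path(log_dir, day):
    """Build the per-day log name, e.g. ip-to-blacklist-2015-01-21.log."""
    return os.path.join(log_dir, "ip-to-blacklist-" + day.isoformat() + ".log")


def handle_request(query_params, remote_addr, log_dir, day, proxy_ip=""):
    """Return an HTTP status: 406 for tracker-style requests, 200 otherwise."""
    if "info_hash" not in query_params:
        return 200  # normal request: let the framework handle it
    # Never blacklist our own CDN, proxy or load balancer
    if proxy_ip == "" or remote_addr != proxy_ip:
        with open(blacklist_log_path(log_dir, day), "a") as log_file:
            log_file.write(remote_addr + "\n")
    return 406  # 'Not Acceptable', and nothing else is processed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        day = datetime.date(2015, 1, 21)
        print(handle_request({"info_hash": "x"}, "203.0.113.10", log_dir, day))
        print(handle_request({"page": "1"}, "203.0.113.11", log_dir, day))
        with open(blacklist_log_path(log_dir, day)) as log_file:
            print(log_file.read().split())
```

Because the file name changes every day, a blocker that only reads today's file stops renewing yesterday's addresses, which is what limits the blacklisting to 24 hours.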

Then there is the other part of stopping the attack: a bash program that can be run from the command line or added to cron to be launched, depending on the volume of the attacks, every 5 minutes or every hour.



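#!/bin/bash

# Ip blacklister by Carles Mateo
s_DATE=$(date +%Y-%m-%d)
# Log written by the PHP patch; the name of the unique list is an assumed example
s_FILE="/tmp/ip-to-blacklist-$s_DATE.log"
s_FILE_UNIQUE="/tmp/ip-to-blacklist-$s_DATE-unique.log"
cat $s_FILE | sort | uniq > $s_FILE_UNIQUE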

echo "Counting the ip addresses to block in $s_FILE_UNIQUE"
cat $s_FILE_UNIQUE | wc -l

sleep 3
# We clear the iptables rules
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT

# To list the rules sudo iptables -L
# /sbin/iptables -L INPUT -v -n
# Enable ssh for all (you can add a Firewall at Cloud provider level or restrict the rule to your ip)
sudo iptables -A INPUT -p tcp --dport ssh -j ACCEPT

for s_ip_address in `cat $s_FILE_UNIQUE`
do
    echo "Blocking traffic from $s_ip_address"
    sudo iptables -A INPUT -s $s_ip_address -p tcp --destination-port 80 -j DROP
    sudo iptables -A INPUT -s $s_ip_address -p tcp --destination-port 443 -j DROP
done

# Ensure Accept traffic on Port 80 (HTTP) and 443 (HTTPS)
sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 443 -j ACCEPT

# To block the rest
# sudo iptables -A INPUT -j DROP

# Use iptables-save and iptables-restore to make these changes permanent
# sudo sh -c "iptables-save > /etc/iptables.rules"
# and in /etc/network/interfaces: pre-up iptables-restore < /etc/iptables.rules
# https://help.ubuntu.com/community/IptablesHowTo

This script takes the list of IP addresses, writes the unique ones to another file, and then loops over them, adding each one to iptables, the Linux Firewall, to block it from accessing the web on port 80 (http) or 443 (https, ssl). You can also block all the ports for those IPs if you want.
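Since the log is fed by requests coming from the Internet, it may be worth validating each line before it reaches the Firewall. Here is a minimal sketch of the same steps in Python, assuming one address per line; the function names are hypothetical, and the commands are only built as text, never executed.

```python
import ipaddress


def unique_valid_ips(lines):
    """Return the sorted unique IPv4 addresses found in the log lines."""
    found = set()
    for line in lines:
        candidate = line.strip()
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # skip blank or corrupted lines
        if address.version == 4:  # iptables handles IPv4 only
            found.add(address)
    return [str(address) for address in sorted(found)]


def drop_rules(ip_addresses, ports=(80, 443)):
    """Build (not run) one iptables DROP command per address and port."""
    rules = []
    for ip in ip_addresses:
        for port in ports:
            rules.append(
                "iptables -A INPUT -s %s -p tcp --destination-port %d -j DROP"
                % (ip, port)
            )
    return rules


if __name__ == "__main__":
    sample_log = ["203.0.113.10\n", "203.0.113.10\n", "garbage\n", "198.51.100.7\n"]
    for rule in drop_rules(unique_valid_ips(sample_log)):
        print(rule)
```

The de-duplication plays the role of sort and uniq in the bash version, and the validation ensures that a corrupted line never becomes part of a Firewall command.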

With this CPU use went to 0%.

Note: One of my colleagues, a wonderful SysAdmin at Ackstorm ISP, points out that some of you may prefer using REJECT instead of DROP. There is an interesting conversation on serverfault about this.

After fixing the problem I searched the Internet for other people reporting attacks like the one I suffered. The most interesting thing I found was this article: BotTorrent: Misusing BitTorrent to Launch DDoS Attacks, from University of California, Irvine. (local copy on this website BotTorrent)

Basically any site on the Internet can be attacked at a large scale, as every user downloading the Torrent will try to connect to the innocent Server to report the progress of the download/upload. If this attack is performed with hundreds of files, it means hundreds of thousands of IPs connecting to the Server… the Server will run out of connections or memory, or the bandwidth will be saturated by the bad traffic.

I saw that the attackers were using porn files that were highly downloaded and apparently telling the Torrent network that our Server was a Torrent tracker. So, corroborating my hypothesis, all the people downloading those Torrents were sending updates to our Server, believing it was a tracker. A trick from the sad pirates.

Some people, business users, asked me who could be interested in harming others’ servers or disrupting others’ businesses without any immediate gain (like controlling your Servers to send Spam).

I told them:

  • Competitors that hate you because you’re successful and want to disrupt your business (they pay the pirates to do the attacks. I’ve helped companies that were taken down by those pirates)
  • Investors that may want to buy you at a cheaper price (after badly trolling you for a week or two)
  • False “security” companies that will offer their services “casually” when you most need them and charge a high bill
  • Pirates that want to extort you

So, bad people who, instead of using their talent to create, just destroy and do evil to others.

In other cases it could be bad luck: having been assigned an IP that previously hosted a Torrent tracker. That doesn’t make much sense in the Cloud, as it is expensive, but it could be that a Server with that IP was hacked and used as a tracker for a while.

Governments wanting to disrupt services (like torrent) could also clumsily redirect DNS to random IPs, or entertainment companies trying to shut down Torrent trackers could try to poison DNS to stop users from using BitTorrent.


See the definitive solution in the next article.

Performance of several languages

Notes on 2017-03-26 18:57 CEST – Unix time: 1490547518 :

  1. As several of you have noted, it would be much better to use a random value, for example one read from disk. This improvement will be made in the next benchmark. Good suggestion, thanks.
  2. Due to my lack of time, updating the article took longer than expected. I was in a long process with google, and now I’m looking for a new job.
  3. I note that most people don’t read the article and comment about things that are clearly indicated in it. Please read before posting; otherwise don’t be surprised if your comment is not published. I have to keep the blog clean of trash.
  4. I’ve left out a few comments because they were disrespectful. Mediocrity is present in society, so simply avoid posting comments that lack basic respect and good manners. If a comment makes a point from an Engineering point of view, it is always published.


(This article was last updated on 2015-08-26 15:45 CEST – Unix time: 1440596711. See changelog at bottom)

One may think that Assembler is always the fastest, but is that true?

If I write code in 32-bit Assembler instead of 64-bit, so that it can run on both 32 and 64 bit, will it be faster than code that a dynamic compiler optimizes at execution time to benefit from the architecture of my computer?

What if a future JIT compiler is able to use all the cores to execute a program developed as a single thread?

Are PHP, Python, or Ruby fast compared to C++? Does the Facebook Hip Hop Virtual Machine really speed up PHP execution?

This article shows some results and shares my conclusions. It is a basis for discussion with my colleagues. It is not an end: we are always running tests, looking for the edge, and looking at the root of things in detail. And often things change from one version to another. This article does not show an absolute truth, but it sheds some light on interesting aspects.

It shows the performance only for the particular case used in the test, although generic core instructions have been selected. Many more tests are necessary, and some functions differ in performance. But this article is a necessary starting point for the discussion with my IT-extreme-lover friends and a necessary step towards the upcoming tests.

It brings very important data for Managers and Decision Makers, as choosing a language with adequate performance can save millions in hardware (especially when you use the Cloud and pay per hour of use) or thousands of hours in Map Reduce processes.

Acknowledgements and thanks

Credit for the great Eduard Heredia, for porting my C source code to:

  • Go
  • Ruby
  • Node.js

And for the nice discussions of the results, and on the optimizations and dynamic vs static compilers.

Thanks to Juan Carlos Moreno, CTO of ECManaged Cloud Software for suggesting adding Python and Ruby to the languages tested when we discussed my initial results.

Thanks to Joel Molins for the interesting discussions on Java performance and garbage collection.

Thanks to Cliff Click for his wonderful article on Java vs C performance that I found when I wanted to confirm some of my results and findings.

I was inspired to do my own comparisons by the benchmarks comparing different frameworks by techempower. It is amazing to see the results of the tests, like how C++ can serialize JSON 1,057,793 times per second and raw PHP only 180,147 (17%).

For the impatient

I present the results of the tests, and the conclusions, for those who don’t want to read about the details. For those who want to examine the code, the versions of every compiler, and more in-depth conclusions, this information is provided below.


This image shows the results of the tests with every language and compiler.

All the tests are invoked from the command line. All the tests use only one core. No tests for the web or frameworks have been made; those are other scenarios worth an article of their own.

More seconds means a worse result. The worst is Bash, which I removed from the graphics, as its bar was crazily high compared to the others.

* As discussed later, my initial Assembler code was outperformed by the C binary because the final Assembler code that the compiler generated was better than mine.

After learning why (it is explained in detail later in this article) I was able to reduce it to the same time as the C version, as I understood the improvements made by the compiler.


Table of times:

| Seconds executing | Language | Compiler used | Version |
|---|---|---|---|
| 6 s. | Java | Oracle Java | Java JDK 8 |
| 6 s. | Java | Oracle Java | Java JDK 7 |
| 6 s. | Java | Open JDK | OpenJDK 7 |
| 6 s. | Java | Open JDK | OpenJDK 6 |
| 7 s. | Go | Go | Go v.1.3.1 linux/amd64 |
| 7 s. | Go | Go | Go v.1.3.3 linux/amd64 |
| 8 s. | Lua | LuaJit | Luajit 2.0.2 |
| 10 s. | C++ | g++ | g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2 |
| 10 s. | C | gcc | gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2 |
| 10 s. (* first version was 13 s. and then was optimized) | Assembler | nasm | NASM version 2.10.09 compiled on Dec 29 2013 |
| 10 s. | Nodejs | nodejs | Nodejs v0.12.4 |
| 14 s. | Nodejs | nodejs | Nodejs v0.10.25 |
| 18 s. | Go | Go | go version xgcc (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157] linux/amd64 |
| 20 s. | Phantomjs | Phantomjs | phantomjs 1.9.0 |
| 21 s. | Phantomjs | Phantomjs | phantomjs 2.0.1-development |
| 38 s. | PHP | Facebook HHVM | HipHop VM 3.4.0-dev (rel) |
| 44 s. | Python | Pypy | Pypy 2.2.1 (Python 2.7.3 (2.2.1+dfsg-1, Nov 28 2013, 05:13:10)) |
| 52 s. | PHP | Facebook HHVM | HipHop VM 3.9.0-dev (rel) |
| 52 s. | PHP | Facebook HHVM | HipHop VM 3.7.3 (rel) |
| 128 s. | PHP | PHP | PHP 7.0.0alpha2 (cli) (built: Jul 3 2015 15:30:23) |
| 278 s. | Lua | Lua | Lua 5.2.3 |
| 294 s. | Gambas3 | Gambas3 | 3.7.0 |
| 316 s. | PHP | PHP | PHP 5.5.9-1ubuntu4.3 (cli) (built: Jul 7 2014 16:36:58) |
| 317 s. | PHP | PHP | PHP 5.6.10 (cli) (built: Jul 3 2015 16:13:11) |
| 323 s. | PHP | PHP | PHP 5.4.42 (cli) (built: Jul 3 2015 16:24:16) |
| 436 s. | Perl | Perl | Perl 5.18.2 |
| 523 s. | Ruby | Ruby | ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux] |
| 694 s. | Python | Python | Python 2.7.6 |
| 807 s. | Python | Python | Python 3.4.0 |
| 47630 s. | Bash | | GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu) |


Conclusions and Lessons Learnt

  1. There are languages that will execute faster than a native Assembler program, thanks to the JIT Compiler and its ability to optimize the program at runtime for the architecture of the computer running it (even if there is a small initial penalty of around two seconds from the JIT while the program is being analysed, it is more than worth it in our example)
  2. Modern Java can be really fast in certain operations. It is the fastest in this test, thanks to the use of JIT Compiler technology and a very good implementation of it
  3. Oracle’s Java and OpenJDK show no difference in performance in this test
  4. Scripting languages perform really badly. Python, Perl and Ruby are terribly slow. That costs a lot of money when you scale, as you need more Servers in the Cloud
  5. JIT compilers for Python: Pypy, and for Lua: LuaJit, make them really fly. The difference is truly amazing
  6. The same language can offer very different performance from one version to another: for example, the go that comes with the Ubuntu packages is slower than the latest version from the official page, and Python 3.4 is much slower than Python 2.7 in this test
  7. Bash is the worst language for doing the loop and inc operations in the test, lasting for more than 13 hours for the test
  8. From command line PHP is much faster than Python, Perl and Ruby
  9. Facebook Hip Hop Virtual Machine (HHVM) improves PHP’s speed a lot
  10. It looks like the future of compilers is JIT.
  11. Assembler is not always the fastest when executed. If you write a generic Assembler program meant to run on many platforms, you will not use the most powerful instructions specific to an architecture, and so a JIT compiler can outperform your code. A static compiler can also outperform your code with very clever optimizations. The people who write compilers are really good. Unless you’re really brilliant with Assembler, C/C++ code will probably beat the performance of your code. Even if you’re fantastic with Assembler, a JIT compiler may notice that some executions can be avoided (like code not really used) and bring magnificent runtime optimizations (for example, a near JMP is much less costly than a far JMP Assembler instruction; avoiding dead code could result in a far JMP being executed as a near JMP, saving many cycles per loop)
  12. Optimization really needs people dedicated just to optimizing and to checking the speed of newly added code on the running platforms
  13. Node.js was a big surprise. It really performed well. It is promising. New version performs even faster
  14. go is promising. It is similar to C, but performance is much better thanks to deciding at runtime whether the architecture of the computer is 32 or 64 bit, a very quick compilation at launch time, and compiling to very good assembler (that uses the 64 bit instructions efficiently, for example)
  15. Gambas 3 performed surprisingly fast. Better than PHP
  16. You should be careful when using the C/C++ optimization flags -O3 (and -O2), as sometimes they don’t work well (bugs) or as you may expect, for example by completely removing blocks of code that the compiler believes have no use (like loops)
  17. Perl performance really changes depending on which style of for loop you use. (See Perl section below)
  18. Modern CPUs change their frequency to save energy. To run the tests it is strongly recommended to use a dedicated machine, disabling the CPU governor and setting a fixed frequency for all the cores, booting a text-only live system, without background services, not mounting disks, with no swap and no network
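To put some of these conclusions into numbers, the times from the table can be turned into ratios. This is just a small sketch in Python; it uses a handful of rows copied from the table of times, and the dictionary labels and the helper name `relative_to` are made up for the example.

```python
# Times in seconds copied from the table of times in this article
TIMES = {
    "Java (Oracle JDK 8)": 6,
    "Go 1.3.3": 7,
    "C (gcc 4.8.2)": 10,
    "PHP (HHVM 3.4.0-dev)": 38,
    "PHP 5.5.9": 316,
    "Python 2.7.6": 694,
    "Bash 4.3.11": 47630,
}


def relative_to(times, baseline):
    """Return how many times slower each entry is than the baseline."""
    base = times[baseline]
    return {name: seconds / base for name, seconds in times.items()}


if __name__ == "__main__":
    ratios = relative_to(TIMES, "C (gcc 4.8.2)")
    for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
        print("%-22s %8.1fx the time of C" % (name, ratio))
    # The Bash run expressed in hours
    print("Bash: %.1f hours" % (TIMES["Bash 4.3.11"] / 3600))
```

Seen this way, Python 2.7.6 needs about 69 times as long as C for this test, and HHVM 3.4.0-dev cuts the PHP 5.5.9 time by a factor of roughly 8.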

(Please read the whole article before commenting)

Explanations in detail

Obviously a binary from a statically compiled language should be faster than an interpreted language.

C or C++ are much faster than PHP. And good machine code is, of course, much faster still.

But there are also other languages that are not compiled as binary and have really fast execution.

For example, good Web Java Application Servers generate compiled code after the first request. Then it really flies.

For the web, C#, or .NET in general, does the same: the IIS Application Server creates a native DLL after the first call to the script. After this, as it is compiled, the page is really fast.

With statically linked C you could generate binary code for a particular processor, but then it won’t work on other processors. So normally we write code that will work on all the processors, at the cost of not using the full performance of the different CPUs, or we take another approach and provide a set of different binaries for the different architectures. A set of directives doing one thing or another depending on the platform detected is also possible, but it is a hard, long and tedious job with a lot of special cases to handle. There is another approach, dynamic compilation, where certain things are decided at run time and optimized for the computer that is running the program by the JIT (Just-in-time) Compiler.

Java, with its JIT, is able to offer optimizations for the CPU that is running the code, with awesome results. It is able to optimize loops and mathematical operations and outperform C/C++ and Assembler code in some cases (like in our tests), or to come really close in others. It sounds crazy, but nowadays the JIT is able to learn the result of blocks of code executed several times and optimize them with several strategies, speeding things up incredibly and outperforming code written in Assembler. A demonstration with code is provided later.

A new generation has grown up knowing only how to program for the Web. Many of them have never seen Assembler, and have never or barely programmed in C++.

None of my Senior friends would assert that one technology is better than another without investigating a lot first. We are serious. There is so much to take into account, and always so much to learn, that one has to be sure of not missing anything before affirming such things categorically. If you want to be taken seriously, you have to take many things into account.

Environment for the tests

Hardware and OS

Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz with 32 GB RAM and SSD Disk.

Ubuntu Desktop 14.04 LTS 64 bit

Software base and compilers

PHP versions

Shipped with my Ubuntu distribution:

php -v
PHP 5.5.9-1ubuntu4.3 (cli) (built: Jul  7 2014 16:36:58)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies

Compiled from sources:

PHP 5.6.10 (cli) (built: Jul  3 2015 16:13:11)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies
PHP 5.4.42 (cli) (built: Jul  3 2015 16:24:16)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2014 Zend Technologies


Java 8 version

java -showversion
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

C++ version

g++ -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.2-19ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)

Gambas 3

gbr3 --version

Go (downloaded from google)

go version
go version go1.3.1 linux/amd64

Go (Ubuntu packages)

go version
go version xgcc (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157] linux/amd64


Nasm

nasm -v
NASM version 2.10.09 compiled on Dec 29 2013


Lua

lua -v
Lua 5.2.3  Copyright (C) 1994-2013 Lua.org, PUC-Rio


LuaJit

luajit -v
LuaJIT 2.0.2 -- Copyright (C) 2005-2013 Mike Pall. http://luajit.org/


Nodejs

Installed with apt-get install nodejs:

nodejs --version

Installed by compiling the sources:

node --version


Phantomjs

Installed with apt-get install phantomjs:

phantomjs --version

Compiled from sources:

/path/phantomjs --version

Python 2.7

python --version
Python 2.7.6

Python 3

python3 --version
Python 3.4.0


Perl

perl -version
This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
(with 41 registered patches, see perl -V for more detail)


Bash

bash --version
GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Test: Time required for nested loops

This is the first sample. It is an easy one.

The main idea is to generate a set of nested loops, with a simple counter inside.

When the counter reaches 51 it is set to 0.

This is done for:

  1. Preventing overflow of the integer if growing without control
  2. Preventing the compiler from optimizing the code away (clever compilers like Java or gcc with the -O3 optimization flag, if they see that the var is never used, will conclude that the whole block is unnecessary and simply never execute it)

Doing only loops, the increment of a variable and an if, provides us with basic structures of the language that are easily transformed to Assembler. We want to avoid System calls also.

This is the base for the metrics on my Cloud Analysis of Performance cmips.net project.

Here I present the times for each language, later I analyze the details and the code.

Take into account that this code executes in only one thread / core.
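Before looking at each language, the logic of the test itself can be checked with a small sketch in Python. It is parameterized so that it can run with reduced sizes; the function names are made up for the example. Because the counter goes 1, 2, … 50 and then back to 0, its final value is simply the total number of iterations modulo 51, which gives a quick way to verify that any port of the test did all the work.

```python
def nested_loops(outer, middle, inner):
    """Run the test loops and return the final counter value."""
    counter = 0
    for _ in range(outer):
        for _ in range(middle):
            for _ in range(inner):
                counter += 1
                if counter > 50:
                    counter = 0
    return counter


def expected_counter(outer, middle, inner):
    """The counter cycles 1, 2, ..., 50, 0, so it equals iterations mod 51."""
    return (outer * middle * inner) % 51


if __name__ == "__main__":
    # Reduced sizes so the check finishes instantly
    print(nested_loops(2, 30, 40), expected_counter(2, 30, 40))
    # Full-size test: the iteration count does not fit in a signed 32-bit int
    total = 10 * 32000 * 32000
    print(total, total > 2**31 - 1, expected_counter(10, 32000, 32000))
```

With the full sizes there are 10,240,000,000 iterations, far more than a signed 32-bit integer can hold, which is why the reset is needed. Every implementation should end with the counter at 37.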


C++ result, it takes 10 seconds.

Code for the C++:

/*
 * File:   main.cpp
 * Author: Carles Mateo
 *
 * Created on August 27, 2014, 1:53 PM
 */

#include <cstdlib>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <ctime>

using namespace std;

typedef unsigned long long timestamp_t;

static timestamp_t get_timestamp()
{
    struct timeval now;
    gettimeofday (&now, NULL);
    return  now.tv_usec + (timestamp_t)now.tv_sec * 1000000;
}

int main(int argc, char** argv) {

    timestamp_t t0 = get_timestamp();

    // current date/time based on current system
    time_t now = time(0);

    // convert now to string form
    char* dt_now = ctime(&now);

    printf("Starting at %s\n", dt_now);

    int i_loop1 = 0;
    int i_loop2 = 0;
    int i_loop3 = 0;
    int i_counter = 0;

    for (i_loop1 = 0; i_loop1 < 10; i_loop1++) {
        for (i_loop2 = 0; i_loop2 < 32000; i_loop2++) {
            for (i_loop3 = 0; i_loop3 < 32000; i_loop3++) {
                i_counter++;

                if (i_counter > 50) {
                    i_counter = 0;
                }
            }
            // If you want to test how the compiler optimizes that, remove the comment
            //i_counter = 0;
        }
    }

    // This is another trick to avoid compiler's optimization. To use the var somewhere
    printf("Counter: %i\n", i_counter);

    timestamp_t t1 = get_timestamp();
    double secs = (t1 - t0) / 1000000.0L;
    time_t now_end = time(0);

    // convert now to string form
    char* dt_now_end = ctime(&now_end);

    printf("End time: %s\n", dt_now_end);

    return 0;
}


You can try to remove the part of code that makes the checks:

                /* if (i_counter > 50) {
                    i_counter = 0;
                } */

And the use of the var, later:

    //printf("Counter: %i\n", i_counter);

Note: you also need to add an i_counter = 0; at the beginning of the loop to make sure that the counter doesn’t overflow. Then the C or C++ compiler will notice that the result is never used and will eliminate the code from the program, resulting in an execution time of 0.0 seconds.


The code in Java:

package cpu;

/**
 * @author carles.mateo
 */
public class Cpu {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        int i_loop1 = 0;
        //int i_loop_main = 0;
        int i_loop2 = 0;
        int i_loop3 = 0;
        int i_counter = 0;
        String s_version = System.getProperty("java.version");
        System.out.println("Java Version: " + s_version);

        System.out.println("Starting cpu.java...");
        for (i_loop1 = 0; i_loop1 < 10; i_loop1++) {
                for (i_loop2 = 0; i_loop2 < 32000; i_loop2++) {
                    for (i_loop3 = 0; i_loop3 < 32000; i_loop3++) {
                        i_counter++;
                        if (i_counter > 50) {
                            i_counter = 0;
                        }
                    }
                }
        }
        // Use the counter so the JIT cannot discard the loops
        System.out.println("Counter: " + i_counter);
    }
}

It is really interesting how Java, with JIT outperforms C++ and Assembler.

It takes only 6 seconds.

Netbeans with Java IDE executing with OpenJDK 1.6 in 6 seconds


The case of Go is interesting because I saw a big difference between the go shipped with Ubuntu and the go I downloaded from http://golang.org/dl/. I downloaded 1.3.1 and 1.3.3, which offer the same performance: 7 seconds.

Source code for nested_loops.go

package main

import (
   "fmt"
   "time"
)

func main() {
   fmt.Printf("Starting: %s", time.Now().Local())
   var i_counter = 0;
   for i_loop1 := 0; i_loop1 < 10; i_loop1++ {
       for i_loop2 := 0; i_loop2 < 32000; i_loop2++ {
           for i_loop3 := 0; i_loop3 < 32000; i_loop3++ {
               i_counter++;
               if i_counter > 50 {
                   i_counter = 0;
               }
           }
       }
   }

   fmt.Printf("\nCounter: %#v", i_counter)
   fmt.Printf("\nEnd: %s\n", time.Now().Local())
}


Here is the Assembler code for Linux, written with SASM, that I created initially (the optimized version is below).

%include "io.inc"

section .text
global CMAIN
    ;mov rbp, rsp; for correct debugging
    ; Set to 0, the faster way
    xor     esi, esi

    mov ecx, 10
    mov ebx, ecx
    jmp DO_LOOP2
    mov ecx, ebx
    loop LOOP1
    jmp QUIT

    mov ecx, 32000
    mov eax, ecx
    ;call DO_LOOP3
    jmp DO_LOOP3
    mov ecx, eax
    loop LOOP2

    ; Set to 32000 loops    
    MOV ecx, 32000 
    inc     esi
    cmp     esi, 50
    jg      COUNTER_TO_0

    loop LOOP3
    ; Set to 0
    xor     esi, esi
;    jmp QUIT

    xor eax, eax

It took 13 seconds to complete.

One interesting explanation on why binary C or C++ code is faster than Assembler, is because the C compiler generates better Assembler/binary code at the end. For example, the use of JMP is expensive in terms of CPU cycles and the compiler can apply other optimizations and tricks that I’m not aware of, like using faster registers, while in my code I use ebx, ecx, esi, etc… (for example, imagine that using cx is cheaper than using ecx or rcx and I’m not aware but the guys that created the Gnu C compiler are)

To be sure of what’s going on, in LOOP3 I replaced the JE and the JMP of the code with groups of 50 INC ESI instructions, one after the other, and the time was reduced to 1 second.

(In C the time was reduced even a bit more when doing the same.)

To see how the C code is translated into Assembler when compiled, you can run:

objdump --disassemble nested_loops

Look for the section main and you’ll get something like:

0000000000400470 <main>:
400470:    bf 0a 00 00 00           mov    $0xa,%edi
400475:    31 c9                    xor    %ecx,%ecx
400477:    be 00 7d 00 00           mov    $0x7d00,%esi
40047c:    0f 1f 40 00              nopl   0x0(%rax)
400480:    b8 00 7d 00 00           mov    $0x7d00,%eax
400485:    0f 1f 00                 nopl   (%rax)
400488:    83 c2 01                 add    $0x1,%edx
40048b:    83 fa 33                 cmp    $0x33,%edx
40048e:    0f 4d d1                 cmovge %ecx,%edx
400491:    83 e8 01                 sub    $0x1,%eax
400494:    75 f2                    jne    400488 <main+0x18>
400496:    83 ee 01                 sub    $0x1,%esi
400499:    75 e5                    jne    400480 <main+0x10>
40049b:    83 ef 01                 sub    $0x1,%edi
40049e:    75 d7                    jne    400477 <main+0x7>
4004a0:    48 83 ec 08              sub    $0x8,%rsp
4004a4:    be 34 06 40 00           mov    $0x400634,%esi
4004a9:    bf 01 00 00 00           mov    $0x1,%edi
4004ae:    31 c0                    xor    %eax,%eax
4004b0:    e8 ab ff ff ff           callq  400460 <__printf_chk@plt>
4004b5:    31 c0                    xor    %eax,%eax
4004b7:    48 83 c4 08              add    $0x8,%rsp
4004bb:    c3                       retq

Note: this is in the AT&T syntax and not in the Intel syntax. That means that add $0x1,%edx adds 1 to the EDX register (source, destination).
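To make the operand order concrete, here is a small Python sketch that rewrites a simple AT&T instruction in Intel order. It is only an illustration: the function name att_to_intel is my own, and it assumes two operands that are plain registers (%reg) or immediates ($value), so memory operands such as 0x0(%rax) are not handled.

```python
def att_to_intel(instruction):
    """Rewrite a simple two-operand AT&T instruction in Intel operand order.

    Assumption: both operands are registers (%reg) or immediates ($value).
    Memory operands such as 0x0(%rax) are out of scope for this sketch.
    """
    mnemonic, operands = instruction.split(None, 1)
    source, destination = [op.strip() for op in operands.split(",")]

    def strip_prefix(operand):
        # AT&T marks registers with % and immediates with $; Intel uses neither.
        return operand.lstrip("%$")

    # AT&T is "op source,destination"; Intel is "op destination, source".
    return "%s %s, %s" % (mnemonic, strip_prefix(destination), strip_prefix(source))


if __name__ == "__main__":
    for line in ("add    $0x1,%edx", "cmovge %ecx,%edx", "mov    $0xa,%edi"):
        print(line, "->", att_to_intel(line))
```

Reading the listing with this order in mind, cmovge %ecx,%edx copies ECX into EDX, not the other way around.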

As you can see, the C compiler has created a very different Assembler version from the one I wrote.
For example, at 400470 it uses the EDI register to store 10, to control the outer loop.
It uses ESI to store 32000 (hexadecimal 0x7D00) for the second loop.
And EAX for the inner loop, at 400480.
It uses EDX for the counter, and compares it to 51 (hexadecimal 0x33) at 40048B, which for integers is the same test as being greater than 50.
At 40048E it uses CMOVGE (Move if Greater or Equal), an instruction that was introduced with the P6 family of processors, to move the contents of ECX to EDX if EDX was (in the CMP) greater than or equal to 51. As a XOR ECX, ECX was performed at 400475, ECX contains 0.
And it cleverly uses SUB and JNE (JNE means Jump if Not Equal; it jumps if ZF = 0 and is equivalent to JNZ, Jump if Not Zero).
A jump like this uses between 4 and 16 clocks, and the target must be within -128 to +127 bytes of the next instruction. As you see, a jump is very costly.

It looks like the biggest improvement comes from the use of CMOVGE, which saves the two jumps that my original Assembler code was performing.
Those two jumps, multiplied by 32000 x 32000 x 10 iterations, add up to a lot of CPU clocks.
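To put a number on that, here is a small Python sketch (not part of the original benchmark; the function names are my own). It computes how many times the inner body runs and, assuming the counter is incremented once per inner iteration and reset to 0 when it exceeds 50, predicts the final counter without running the loops.

```python
# Loop bounds used by every version of the benchmark in this article.
OUTER, MIDDLE, INNER = 10, 32000, 32000


def total_iterations(outer=OUTER, middle=MIDDLE, inner=INNER):
    """Number of times the innermost body executes."""
    return outer * middle * inner


def final_counter(iterations, limit=50):
    """Counter after `iterations` steps of: increment, reset to 0 if > limit.

    The counter visits 1..limit and then 0, so it repeats every limit + 1 steps.
    """
    return iterations % (limit + 1)


def simulate(iterations, limit=50):
    """Brute-force version of the same loop body, for checking small inputs."""
    counter = 0
    for _ in range(iterations):
        counter += 1
        if counter > limit:
            counter = 0
    return counter


if __name__ == "__main__":
    n = total_iterations()
    print("Inner iterations:", n)
    print("Jumps avoided (2 per iteration):", 2 * n)
    print("Predicted final counter:", final_counter(n))
```

The modulo shortcut is only valid because the loop body does nothing else; it is a way to sanity-check the value each implementation prints, not a replacement for the benchmark.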

So, with this in mind, as this Assembler code takes 10 seconds, I updated the graph from 13 seconds to 10 seconds.


This is the initial code:

local i_counter = 0

local i_time_start = os.clock()

for i_loop1=0,9 do
    for i_loop2=0,31999 do
        for i_loop3=0,31999 do
            i_counter = i_counter + 1
            if i_counter > 50 then
                i_counter = 0
            end
        end
    end
end

local i_time_end = os.clock()
print(string.format("Counter: %i\n", i_counter))
print(string.format("Total seconds: %.2f\n", i_time_end - i_time_start))

In the case of Lua theoretically one could take great advantage of the use of local inside a loop, so I tried the benchmark with modifications to the loop:

for i_loop1=0,9 do
    for i_loop2=0,31999 do
        -- Work on a local copy of the counter inside the innermost loop
        local l_i_counter = i_counter
        for i_loop3=0,31999 do
             l_i_counter = l_i_counter + 1
             if l_i_counter > 50 then
                 l_i_counter = 0
             end
        end
        i_counter = l_i_counter
    end
end

I ran it with LuaJIT and saw no improvement in performance.


var s_date_time = new Date();
console.log('Starting: ' + s_date_time);

var i_counter = 0;

for (var $i_loop1 = 0; $i_loop1 < 10; $i_loop1++) {
   for (var $i_loop2 = 0; $i_loop2 < 32000; $i_loop2++) {
       for (var $i_loop3 = 0; $i_loop3 < 32000; $i_loop3++) {
           // Increment the counter and reset it once it exceeds 50
           i_counter++;
           if (i_counter > 50) {
               i_counter = 0;
           }
       }
   }
}

var s_date_time_end = new Date();

console.log('Counter: ' + i_counter + '\n');

console.log('End: ' + s_date_time_end + '\n');

Execute with:

nodejs nested_loops.js


The same code as nodejs adding to the end:


PhantomJS performs the same in both versions, 1.9.0 and 2.0.1-development compiled from source.


The interesting thing about PHP is that you can write your own extensions in C, so you can have the ease of use of PHP and create functions that deliver really fast performance in C, and invoke them from PHP.


<?php

$s_date_time = date('Y-m-d H:i:s');

echo 'Starting: '.$s_date_time."\n";

$i_counter = 0;

for ($i_loop1 = 0; $i_loop1 < 10; $i_loop1++) {
   for ($i_loop2 = 0; $i_loop2 < 32000; $i_loop2++) {
       for ($i_loop3 = 0; $i_loop3 < 32000; $i_loop3++) {
           // Increment the counter and reset it once it exceeds 50
           $i_counter++;
           if ($i_counter > 50) {
               $i_counter = 0;
           }
       }
   }
}

$s_date_time_end = date('Y-m-d H:i:s');

echo 'End: '.$s_date_time_end."\n";

Facebook’s HipHop Virtual Machine (HHVM) is a very powerful, JIT-powered alternative.

Downloading the code and compiling it is easy. Just:

git clone https://github.com/facebook/hhvm.git
cd hhvm
rm -r third-party
git submodule update --init --recursive

Or just grab precompiled packages from https://github.com/facebook/hhvm/wiki/Prebuilt%20Packages%20for%20HHVM


from datetime import datetime
import time

print ("Starting at: " + str(datetime.now()))
f_unixtime_start = time.time()

i_counter = 0

# Each inner range goes from 0 to 31999
for i_loop1 in range(0, 10):
    for i_loop2 in range(0, 32000):
        for i_loop3 in range(0, 32000):
            i_counter += 1
            if i_counter > 50:
                i_counter = 0

print ("Ending at: " + str(datetime.now()))
f_unixtime_end = time.time()

# Elapsed whole seconds between the two timestamps
i_seconds = int(f_unixtime_end - f_unixtime_start)
s_seconds = str(i_seconds)

print ("Total seconds:" + s_seconds)


#!/usr/bin/ruby -w

time1 = Time.new

puts "Starting : " + time1.inspect

i_counter = 0;

for i_loop1 in 0..9
    for i_loop2 in 0..31999
        for i_loop3 in 0..31999
            i_counter = i_counter + 1
            if i_counter > 50
                i_counter = 0
            end
        end
    end
end

time1 = Time.new

puts "End : " + time1.inspect


The case of Perl was a very interesting one.

This is the current code:

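#!/usr/bin/env perl

use strict;
use warnings;

my $s_datetime = localtime();
print "$s_datetime Starting calculations...\n";

my $i_counter = 0;
# Wall-clock start time in seconds (variable name is illustrative)
my $i_unixtime_start = time();

for my $i_loop1 (0 .. 9) {
    for my $i_loop2 (0 .. 31999) {
        for my $i_loop3 (0 .. 31999) {
            $i_counter++;
            if ($i_counter > 50) {
                $i_counter = 0;
            }
        }
    }
}

my $i_seconds = time() - $i_unixtime_start;

print "Counter: $i_counter\n";
print "Total seconds: $i_seconds";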

But before that I had created a slightly different one, with C-style for loops:

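#!/usr/bin/env perl

use strict;
use warnings;

my $i_counter = 0;
# Wall-clock start time in seconds (variable name is illustrative)
my $i_unixtime_start = time();

for (my $i_loop1=0; $i_loop1 < 10; $i_loop1++) {
    for (my $i_loop2=0; $i_loop2 < 32000; $i_loop2++) {
        for (my $i_loop3=0; $i_loop3 < 32000; $i_loop3++) {
            $i_counter++;
            if ($i_counter > 50) {
                $i_counter = 0;
            }
        }
    }
}

my $i_seconds = time() - $i_unixtime_start;

print "Total seconds: $i_seconds";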

I repeated this test, with the same version of Perl, after a comment from a reader (thanks mpapec) who said:

In this particular case perl style loops are about 45% faster than original code (v5.20)

And indeed, surprisingly, the time went from 796 seconds to 436 seconds.

So the graphs are updated to reflect the result of 436 seconds.
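As a quick check of how the measured times relate to the reader's "about 45%", here is a small Python sketch. The helper names are my own, and it simply assumes the two timings above, 796 and 436 seconds.

```python
def time_reduction_percent(before_s, after_s):
    """Percentage of run time saved when going from before_s to after_s."""
    return (before_s - after_s) / float(before_s) * 100.0


def speedup_factor(before_s, after_s):
    """How many times faster the new run is compared with the old one."""
    return before_s / float(after_s)


if __name__ == "__main__":
    before_s, after_s = 796, 436  # seconds, from the two Perl runs above
    print("Time saved: %.1f%%" % time_reduction_percent(before_s, after_s))
    print("Speedup: %.2fx" % speedup_factor(before_s, after_s))
```

Measured as time saved, the change is about 45%, in line with the reader's figure; expressed as a speedup factor the same two numbers give roughly 1.8x.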


echo "Bash version ${BASH_VERSION}..."

let "s_time_start=$(date +%s)"
let "i_counter=0"

for i_loop1 in {0..9}
do
     echo "."
     for i_loop2 in {0..31999}
     do
         for i_loop3 in {0..31999}
         do
             # Increment the counter; -gt compares numerically
             let "i_counter++"
             if [[ $i_counter -gt 50 ]]
             then
                 let "i_counter=0"
             fi
#let "var=var+1"
#let "var+=1"
#let "var++"
         done
     done
done

let "s_time_end=$(date +%s)"

let "s_seconds = s_time_end - s_time_start"
echo "Total seconds: $s_seconds"

# Just in case it overflows

Gambas 3

Gambas is a language and an IDE to create GUI applications for Linux.
It is very similar to Visual Basic, but better, and it is not a clone.

I created a command line application and it performed better than PHP. An excellent job has been done with the compiler.

Note: in the screenshot the first test ran for a few seconds more than the second. This was because I deliberately put the machine under some load and I/O during the tests. The valid value for the test, confirmed with more iterations, is the second one, done under the same conditions (no load) as the previous tests.

' Gambas module file MMain.module

Public Sub Main()

    ' @author Carles Mateo http://blog.carlesmateo.com
    Dim i_loop1 As Integer
    Dim i_loop2 As Integer
    Dim i_loop3 As Integer
    Dim i_counter As Integer
    Dim s_version As String
    i_loop1 = 0
    i_loop2 = 0
    i_loop3 = 0
    i_counter = 0
    s_version = System.Version
    Print "Performance Test by Carles Mateo blog.carlesmateo.com"    
    Print "Gambas Version: " & s_version

    Print "Starting..." & Now()
    For i_loop1 = 0 To 9
        For i_loop2 = 0 To 31999
            For i_loop3 = 0 To 31999
                i_counter = i_counter + 1
                If (i_counter > 50) Then
                    i_counter = 0
                Endif
            Next
        Next
    Next
    Print i_counter
    Print "End " & Now()

End



2015-08-26 15:45

Thanks to a reader, Daniel, for pointing out a mistake in his comment. The phrase in question was in the conclusions, point 14, and was inaccurate. The original phrase said “go is promising. Similar to C, but performance is much better thanks to the use of JIT“. The allusion to JIT is incorrect and has been replaced by this: “thanks to deciding at runtime if the architecture of the computer is 32 or 64 bit, a very quick compilation at launch time, and it compiling to very good assembler (that uses the 64 bit instructions efficiently, for example)”

2015-07-17 17:46

Benchmarked Facebook HHVM 3.9 (dev.; the release date is August 3, 2015) and HHVM 3.7.3; both take 52 seconds.

Re-benchmarked Facebook HHVM 3.4: before it took 72 seconds, now it takes 38 seconds. I checked the screen captures from 2014 to rule out human error. It looks like a turbo frequency issue on the test computer, with the CPU governor making it work below the optimal speed, or a CPU-hungry or I/O-heavy process that was triggered during the tests without my noticing. I'm thinking about forcing a fixed CPU speed for all the cores during the tests, such as 2.4 GHz, and booting a text-only live system without disk and network access to prevent Ubuntu from launching processes in the background.
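Since a single run can be distorted by background load, one complementary option is to repeat each measurement and look at the best time and the spread. A minimal Python sketch follows; it is not how the numbers in this article were obtained, and workload is a hypothetical, scaled-down stand-in for the nested loops so that it finishes quickly.

```python
import time


def workload(outer=1, middle=100, inner=100):
    """Scaled-down stand-in for the nested-loops benchmark (hypothetical sizes)."""
    counter = 0
    for _ in range(outer):
        for _ in range(middle):
            for _ in range(inner):
                counter += 1
                if counter > 50:
                    counter = 0
    return counter


def time_runs(func, repeats=5):
    """Run func `repeats` times and return the list of elapsed seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (best, worst, spread), where spread = (worst - best) / best."""
    best, worst = min(timings), max(timings)
    spread = (worst - best) / best if best > 0 else 0.0
    return best, worst, spread


if __name__ == "__main__":
    best, worst, spread = summarize(time_runs(workload))
    print("best %.4fs, worst %.4fs, spread %.0f%%" % (best, worst, spread * 100))
```

A large spread between the best and worst run is a hint that something else was competing for the CPU, which is the kind of discrepancy described above.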

2015-07-05 13:16

Added performance of PhantomJS 1.9.0, installed via apt-get install phantomjs in Ubuntu, and PhantomJS 2.0.1-development.

Added performance of nodejs 0.12.04 (compiled).

Added bash to the graph. Its performance is so bad that I had to edit the graph to make it fit (pink bar) in order to prevent breaking the scale.

2015-07-03 18:32

Added benchmarks for PHP 7 alpha 2, PHP 5.6.10 and PHP 5.4.42.

2015-07-03 15:13
Thanks to the contribution of a reader (thanks mpapec!) I tried Perl-style for loops, going from 796 seconds to 436 seconds.
(I used the same Perl version: Perl 5.18.2)
Updated test value for Perl.
Added new graphics showing the updated value.

Thanks to the contribution of a reader (thanks junk0xc0de!) I added some additional warnings and explanations about the dangers of using -O3 (and -O2) in C/C++.

Updated the Lua code to print i_counter and do the if i_counter > 50 check.
This makes it take a bit longer, going from 7.8 to 8.2 seconds.
Updated graphics.