Performance of several languages

Notes on 2017-03-26 18:57 CEST – Unix time: 1490547518 :

  1. As several of you have noted, it would be much better to use a random value, for example one read from disk. This improvement will be made in the next benchmark. Good suggestion, thanks.
  2. Due to lack of time, updating the article took longer than expected. I was in a long process with Google, and now I’m looking for a new job.
  3. I note that most people don’t read the article and comment about things that are clearly indicated in it. Please read before posting; otherwise don’t be surprised if your comment is not published. I have to keep the blog clean of trash.
  4. I’ve left out a few comments because they were disrespectful. Mediocrity is present in society, so please simply avoid posting comments that lack basic respect and good manners. If a comment makes a point from an engineering point of view, it is always published.

Thanks.

(This article was last updated on 2015-08-26 15:45 CEST – Unix time: 1440596711. See changelog at bottom)

One may think that Assembler is always the fastest, but is that true?

If I write code in 32-bit Assembler instead of 64-bit, so that it can run on both 32- and 64-bit machines, will it be faster than code that a dynamic compiler optimizes at execution time to benefit from the architecture of my computer?

What if a future JIT compiler is able to use all the cores to execute a program written as a single thread?

Are PHP, Python, or Ruby fast compared to C++? Does Facebook’s HipHop Virtual Machine really speed up PHP execution?

This article shows some results and shares my conclusions. It is a basis for discussion with my colleagues. It is not an end: we are always running tests, looking for the edge, and looking at the root of things in detail. And often things change from one version to the next. This article does not show an absolute truth, but sheds some light on interesting aspects.

It shows the performance for the specific case used in the test, although generic core instructions were selected. Many more tests are necessary, and some functions differ in performance. But this article is a necessary starting point for the discussion with my IT-extreme-lover friends and a necessary step for the upcoming tests.

It brings very important data for managers and decision makers, as choosing a language with adequate performance can save millions in hardware (especially when you use the Cloud and pay per hour of use) or thousands of hours in MapReduce processes.

Acknowledgements and thanks

Credit to the great Eduard Heredia, for porting my C source code to:

  • Go
  • Ruby
  • Node.js

And for the nice discussions of the results, and on the optimizations and dynamic vs static compilers.

Thanks to Juan Carlos Moreno, CTO of ECManaged Cloud Software for suggesting adding Python and Ruby to the languages tested when we discussed my initial results.

Thanks to Joel Molins for the interesting discussions on Java performance and garbage collection.

Thanks to Cliff Click for his wonderful article on Java vs C performance that I found when I wanted to confirm some of my results and findings.

I was inspired to do my own comparisons by the benchmarks comparing different frameworks by TechEmpower. It is amazing to see the results of the tests, like how C++ can serialize JSON 1,057,793 times per second and raw PHP only 180,147 (17%).

For the impatient

I present the results of the tests and the conclusions first, for those who don’t want to read about the details. For those who want to examine the code, the versions of every compiler, and more in-depth conclusions, this information is provided below.

Results

This image shows the results of the tests with every language and compiler.

All the tests are invoked from the command line. All the tests use only one core. No tests for the web or frameworks have been made; those are other scenarios worth an article of their own.

More seconds means a worse result. The worst is Bash, which I removed from the chart, as its bar was crazily high compared to the others.

* As discussed later, my initial Assembler code was outperformed by the C binary because the Assembler code that the compiler generated was better than mine.

After learning why (explained in detail later in this article) I could have reduced it to the same time as the C version, as I understood the improvements made by the compiler.

[Image: chart of the test results for every language and compiler]

Table of times:

Seconds executing | Language | Compiler used | Version
6 s. | Java | Oracle Java | Java JDK 8
6 s. | Java | Oracle Java | Java JDK 7
6 s. | Java | Open JDK | OpenJDK 7
6 s. | Java | Open JDK | OpenJDK 6
7 s. | Go | Go | Go v.1.3.1 linux/amd64
7 s. | Go | Go | Go v.1.3.3 linux/amd64
8 s. | Lua | LuaJit | Luajit 2.0.2
10 s. | C++ | g++ | g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
10 s. | C | gcc | gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
10 s. (*) | Assembler | nasm | NASM version 2.10.09 compiled on Dec 29 2013
10 s. | Nodejs | nodejs | Nodejs v0.12.4
14 s. | Nodejs | nodejs | Nodejs v0.10.25
18 s. | Go | Go | go version xgcc (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157] linux/amd64
20 s. | Phantomjs | Phantomjs | phantomjs 1.9.0
21 s. | Phantomjs | Phantomjs | phantomjs 2.0.1-development
38 s. | PHP | Facebook HHVM | HipHop VM 3.4.0-dev (rel)
44 s. | Python | Pypy | Pypy 2.2.1 (Python 2.7.3 (2.2.1+dfsg-1, Nov 28 2013, 05:13:10))
52 s. | PHP | Facebook HHVM | HipHop VM 3.9.0-dev (rel)
52 s. | PHP | Facebook HHVM | HipHop VM 3.7.3 (rel)
128 s. | PHP | PHP | PHP 7.0.0alpha2 (cli) (built: Jul 3 2015 15:30:23)
278 s. | Lua | Lua | Lua 5.2.3
294 s. | Gambas3 | Gambas3 | 3.7.0
316 s. | PHP | PHP | PHP 5.5.9-1ubuntu4.3 (cli) (built: Jul 7 2014 16:36:58)
317 s. | PHP | PHP | PHP 5.6.10 (cli) (built: Jul 3 2015 16:13:11)
323 s. | PHP | PHP | PHP 5.4.42 (cli) (built: Jul 3 2015 16:24:16)
436 s. | Perl | Perl | Perl 5.18.2
523 s. | Ruby | Ruby | ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]
694 s. | Python | Python | Python 2.7.6
807 s. | Python | Python | Python 3.4.0
47630 s. | Bash | Bash | GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)

(* first version was 13 s. and then was optimized)
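To put the times in perspective, here is a small sketch in Python that turns a few rows of the table into slowdown factors relative to the fastest entry, and converts the Bash time to hours. The dictionary is a hand-copied subset of the table, and the helper name slowdown_vs_fastest is only illustrative.

```python
# Times in seconds copied by hand from the table above (a subset, for brevity).
times_s = {
    "Java (Oracle JDK 8)": 6,
    "Go 1.3.3": 7,
    "LuaJIT 2.0.2": 8,
    "C (gcc 4.8.2)": 10,
    "PHP (HHVM 3.4.0-dev)": 38,
    "Python (PyPy 2.2.1)": 44,
    "PHP 5.5.9": 316,
    "Perl 5.18.2": 436,
    "Ruby 1.9.3": 523,
    "Python 2.7.6": 694,
    "Bash 4.3.11": 47630,
}


def slowdown_vs_fastest(times):
    """Return {name: time / fastest_time}; the fastest entry gets 1.0."""
    fastest = min(times.values())
    return {name: t / float(fastest) for name, t in times.items()}


ratios = slowdown_vs_fastest(times_s)
bash_hours = times_s["Bash 4.3.11"] / 3600.0

# Print the entries from fastest to slowest.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print("%-22s %8.1fx" % (name, ratio))
print("Bash in hours: %.2f" % bash_hours)
```

These factors only describe this particular nested-loop test on this machine; they are not a general ranking of the languages.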

 

Conclusions and Lessons Learnt

  1. There are languages that will execute faster than a native Assembler program, thanks to the JIT compiler and its ability to optimize the program at runtime for the architecture of the computer running it (even if there is a small initial penalty of around two seconds from the JIT while the program is being analysed, it is more than worth it in our example)
  2. Modern Java can be really fast in certain operations. It is the fastest in this test, thanks to JIT compiler technology and a very good implementation of it
  3. Oracle’s Java and OpenJDK show no difference in performance in this test
  4. Scripting languages really suck in performance. Python, Perl and Ruby are terribly slow. That costs a lot of money if you scale, as you need more servers in the Cloud
  5. JIT compilers for Python (PyPy) and for Lua (LuaJIT) make them really fly. The difference is truly amazing
  6. The same language can offer very different performance from one version to another: for example, the Go from the Ubuntu packages is slower than the latest version from the official page, and Python 3.4 is much slower than Python 2.7 in this test
  7. Bash is the worst language for doing the loop and inc operations in the test, taking more than 13 hours
  8. From the command line, PHP is much faster than Python, Perl and Ruby
  9. Facebook’s HipHop Virtual Machine (HHVM) improves PHP’s speed a lot
  10. It looks like the future of compilers is JIT.
  11. Assembler is not always the fastest when executed. If you write a generic Assembler program so that it can run on many platforms, you will not use the most powerful instructions specific to an architecture, and so a JIT compiler can outperform your code. A static compiler can also outperform your code with very clever optimizations. The people who write compilers are really good. Unless you are really brilliant with Assembler, C/C++ code will probably beat the performance of your code. Even if you are fantastic with Assembler, a JIT compiler may notice that some executions can be avoided (like code that is never really used) and bring magnificent runtime optimizations (for example, a near JMP is much less costly than a far JMP Assembler instruction; avoiding dead code could result in a far JMP being executed as a near JMP, saving many cycles per loop)
  12. Optimization really needs people dedicated just to optimizing and to checking the speed of newly added code on the running platforms
  13. Node.js was a big surprise. It performed really well. It is promising. The new version performs even faster
  14. go is promising. Similar to C, but performance is much better thanks to deciding at runtime if the architecture of the computer is 32 or 64 bit, a very quick compilation at launch time, and it compiling to very good assembler (that uses the 64 bit instructions efficiently, for example)
  15. Gambas 3 performed surprisingly fast. Better than PHP
  16. You should be careful when using the C/C++ optimization flags -O3 (and -O2), as sometimes they don’t work well (bugs) or as you may expect, for example by completely removing blocks of code (like loops) if the compiler believes they have no use
  17. Perl performance really changes depending on which style of for loop you use (see the Perl section below)
  18. Modern CPUs change their frequency to save energy. To run the tests it is strongly recommended to use a dedicated machine, disabling the CPU governor and setting a fixed frequency for all the cores, booting a text-only live system, without background services, not mounting disks, with no swap and no network
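As a rough aid for the last point, the following Python sketch reads the frequency governor of every core from the standard Linux cpufreq sysfs files before a benchmark run. It is only a sketch under assumptions: the files may not exist (virtual machines, containers, other operating systems), the function names are illustrative, and treating the "performance" governor as "stable enough" is my simplification here, not the fixed-frequency setup described in the list.

```python
import glob
import os

# Standard Linux cpufreq location; it may be absent on some systems.
SYSFS_PATTERN = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"


def read_governors(pattern=SYSFS_PATTERN):
    """Return {cpu_name: governor} for every core exposing a scaling governor."""
    governors = {}
    for path in sorted(glob.glob(pattern)):
        cpu_name = path.split(os.sep)[-3]  # e.g. "cpu0"
        with open(path) as f:
            governors[cpu_name] = f.read().strip()
    return governors


def looks_stable(governors):
    """True only if at least one core was found and all report 'performance'."""
    return bool(governors) and all(g == "performance" for g in governors.values())


governors = read_governors()
print(governors if governors else "no cpufreq information found")
print("stable for benchmarking: %s" % looks_stable(governors))
```

If any core reports a governor such as "ondemand" or "powersave", the frequency can change during the run and the times may not be comparable.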

(Please read the whole article before commenting.)

Explanations in detail

Obviously a statically compiled language binary should be faster than an interpreted language.

C or C++ are much faster than PHP. And good machine code is much faster, of course.

But there are also other languages that are not compiled to a binary and have really fast execution.

For example, good Web Java Application Servers generate compiled code after the first request. Then it really flies.

For the web, C#, or .NET in general, does the same: the IIS Application Server creates a native DLL after the first call to the script. After this, as it is compiled, the page is really fast.

With statically compiled C you could generate binary code for a particular processor, but then it won’t work on other processors. So normally we write code that will work on all processors, at the cost of not using the full performance of each CPU, or we take another approach and provide a set of different binaries for the different architectures. A set of directives doing one thing or another depending on the platform detected is also possible, but it is a hard, long and tedious job with a lot of special cases to handle. There is another approach, dynamic compilation, where certain things are decided at run time and optimized for the computer that is running the program by the JIT (Just-in-time) compiler.

Java, with its JIT, is able to offer optimizations for the CPU that is running the code, with awesome results. It is able to optimize loops and mathematical operations and outperform C/C++ and Assembler code in some cases (like in our tests), or to come really close in others. It sounds crazy, but nowadays the JIT is able to know the result of blocks of code that are executed several times and to optimize them with several strategies, speeding things up incredibly and outperforming code written in Assembler. A demonstration with code is provided later.

A new generation has grown up knowing only how to program for the Web. Many of them have never seen Assembler, and have never or barely programmed in C++.

None of my senior friends would assert that one technology is better than another without doing a lot of investigation first. We are serious. There is so much to take into account, and always so much to learn, that one has to be sure of not missing things before affirming such things categorically. If you want to be taken seriously, you have to take many things into account.

Environment for the tests

Hardware and OS

Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz with 32 GB RAM and SSD Disk.

Ubuntu Desktop 14.04 LTS 64 bit

Software base and compilers

PHP versions

Shipped with my Ubuntu distribution:

php -v
PHP 5.5.9-1ubuntu4.3 (cli) (built: Jul  7 2014 16:36:58)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies

Compiled from sources:

PHP 5.6.10 (cli) (built: Jul  3 2015 16:13:11)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies
PHP 5.4.42 (cli) (built: Jul  3 2015 16:24:16)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2014 Zend Technologies

 

Java 8 version

java -showversion
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

C++ version

g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.2-19ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)

Gambas 3

gbr3 --version
3.7.0

Go (downloaded from google)

go version
go version go1.3.1 linux/amd64

Go (Ubuntu packages)

go version
go version xgcc (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157] linux/amd64

Nasm

nasm -v
NASM version 2.10.09 compiled on Dec 29 2013

Lua

lua -v
Lua 5.2.3  Copyright (C) 1994-2013 Lua.org, PUC-Rio

Luajit

luajit -v
LuaJIT 2.0.2 -- Copyright (C) 2005-2013 Mike Pall. http://luajit.org/

Nodejs

Installed with apt-get install nodejs:

nodejs --version
v0.10.25

Installed by compiling the sources:

node --version
v0.12.4

Phantomjs

Installed with apt-get install phantomjs:

phantomjs --version
1.9.0

Compiled from sources:

/path/phantomjs --version
2.0.1-development

Python 2.7

python --version
Python 2.7.6

Python 3

python3 --version
Python 3.4.0

Perl

perl -version
This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
(with 41 registered patches, see perl -V for more detail)

Bash

bash --version
GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Test: Time required for nested loops

This is the first sample. It is an easy one.

The main idea is to generate a set of nested loops, with a simple counter inside.

When the counter reaches 51 it is set to 0.

This is done for:

  1. Preventing overflow of the integer if growing without control
  2. Preventing the compiler from optimizing the code away (clever compilers like Java or gcc with the -O3 optimization flag, if they see that the variable is never used, will conclude that the whole block is unnecessary and simply never execute it)

Doing only loops, the increment of a variable and an if provides us with basic structures of the language that are easily translated to Assembler. We also want to avoid system calls.

This is the base for the metrics on my Cloud Analysis of Performance cmips.net project.

Here I present the times for each language; later I analyze the details and the code.

Take into account that this code executes in only one thread / core.
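The test described above can be sketched in Python with scaled-down loop sizes so that it finishes quickly. Since the counter simply cycles through 0 to 50 (a period of 51 increments), the final value can also be obtained with a modulo, which is handy for checking that a port to another language prints the right counter. The function names and the reduced sizes are assumptions for illustration only.

```python
def run_nested_loops(outer, middle, inner, limit=50):
    """Run the three nested loops of the test and return the final counter."""
    counter = 0
    for _ in range(outer):
        for _ in range(middle):
            for _ in range(inner):
                counter += 1
                if counter > limit:
                    counter = 0
    return counter


def expected_counter(outer, middle, inner, limit=50):
    """Closed form: the counter cycles through 0..limit, a period of limit + 1."""
    return (outer * middle * inner) % (limit + 1)


# Scaled-down run (10 x 320 x 320 iterations) so that it finishes quickly.
small = run_nested_loops(10, 320, 320)
print("small run: %d expected: %d" % (small, expected_counter(10, 320, 320)))

# Full-size parameters from the article, evaluated only with the closed form.
print("full-size iterations: %d" % (10 * 32000 * 32000))
print("full-size final counter: %d" % expected_counter(10, 32000, 32000))
```

With the sizes used in the article (10 x 32000 x 32000, that is 10,240,000,000 iterations), the closed form gives a final counter of 37.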

C++

The C++ version takes 10 seconds.

Code for C++:

/*
* File:   main.cpp
* Author: Carles Mateo
*
* Created on August 27, 2014, 1:53 PM
*/

#include <cstdlib>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <ctime>

using namespace std;

typedef unsigned long long timestamp_t;

static timestamp_t get_timestamp()
{
    struct timeval now;
    gettimeofday (&now, NULL);
    return  now.tv_usec + (timestamp_t)now.tv_sec * 1000000;
}

int main(int argc, char** argv) {

    timestamp_t t0 = get_timestamp();

    // current date/time based on current system
    time_t now = time(0);

    // convert now to string form
    char* dt_now = ctime(&now);

    printf("Starting at %s\n", dt_now);

    int i_loop1 = 0;
    int i_loop2 = 0;
    int i_loop3 = 0;

    // Counter incremented inside the innermost loop
    int i_counter = 0;

    for (i_loop1 = 0; i_loop1 < 10; i_loop1++) {
        for (i_loop2 = 0; i_loop2 < 32000; i_loop2++) {
            for (i_loop3 = 0; i_loop3 < 32000; i_loop3++) {
                i_counter++;

                if (i_counter > 50) {
                    i_counter = 0;
                }
            }
            // If you want to test how the compiler optimizes that, remove the comment
            //i_counter = 0;
         }
     }

    // This is another trick to avoid compiler's optimization. To use the var somewhere
    printf("Counter: %i\n", i_counter);

    timestamp_t t1 = get_timestamp();
    double secs = (t1 - t0) / 1000000.0L;
    time_t now_end = time(0);

    // convert now to string form
    char* dt_now_end = ctime(&now_end);

    printf("End time: %s\n", dt_now_end);

    return 0;
}

[Image: C++ nested loops test in NetBeans, 10 seconds]

You can try to remove the part of code that makes the checks:

                /* if (i_counter > 50) {
                    i_counter = 0;
                }*/

And the use of the var, later:

    //printf("Counter: %i\n", i_counter);

Note: also add an i_counter = 0; at the beginning of the loop to make sure that the counter doesn’t overflow. Then the C or C++ compiler will notice that the result is never used and will eliminate the code from the program, resulting in an execution time of 0.0 seconds.

Java

The code in Java:

package cpu;

/**
 *
 * @author carles.mateo
 */
public class Cpu {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        
        int i_loop1 = 0;
        //int i_loop_main = 0;
        int i_loop2 = 0;
        int i_loop3 = 0;
        int i_counter = 0;
        
        String s_version = System.getProperty("java.version");
        
        System.out.println("Java Version: " + s_version);

        System.out.println("Starting cpu.java...");
        
        for (i_loop1 = 0; i_loop1 < 10; i_loop1++) {            
                for (i_loop2 = 0; i_loop2 < 32000; i_loop2++) {
                    for (i_loop3 = 0; i_loop3 < 32000; i_loop3++) {
                        i_counter++;
                        
                        if (i_counter > 50) { 
                            i_counter = 0;
                        }
                    }
                }
        }
        
        System.out.println(i_counter);
        System.out.println("End");
    }
    
}

It is really interesting how Java, with its JIT, outperforms C++ and Assembler.

It takes only 6 seconds.

Netbeans with Java IDE executing with OpenJDK 1.6 in 6 seconds

Go

The case of Go is interesting because I saw a big difference between the Go shipped with Ubuntu and the Go I downloaded from http://golang.org/dl/. I downloaded 1.3.1 and 1.3.3, and both offer the same performance: 7 seconds.

[Image: Go 1.3.3 linux/amd64 performance]

Source code for nested_loops.go

package main

import ("fmt"
        "time")

func main() {
   fmt.Printf("Starting: %s", time.Now().Local())
   var i_counter = 0;
   for i_loop1 := 0; i_loop1 < 10; i_loop1++ {
       for i_loop2 := 0; i_loop2 < 32000; i_loop2++ {
           for i_loop3 := 0; i_loop3 < 32000; i_loop3++ {
               i_counter++;
               if i_counter > 50 {
                   i_counter = 0;
               }
           }
       }
    }

   fmt.Printf("\nCounter: %#v", i_counter)
   fmt.Printf("\nEnd: %s\n", time.Now().Local())
}

Assembler

Here is the Assembler code for Linux, written with SASM, that I created initially (the optimized version is discussed below).

%include "io.inc"

section .text
global CMAIN
CMAIN:
    ;mov rbp, rsp; for correct debugging
    ; Set to 0, the faster way
    xor     esi, esi

DO_LOOP1:
    mov ecx, 10
LOOP1:
    mov ebx, ecx
    jmp DO_LOOP2
LOOP1_CONTINUE:
    mov ecx, ebx
    
    loop LOOP1
    jmp QUIT

DO_LOOP2:
    mov ecx, 32000
LOOP2:
    mov eax, ecx
    ;call DO_LOOP3
    jmp DO_LOOP3
LOOP2_CONTINUE:
    mov ecx, eax
        
    loop LOOP2
    jmp LOOP1_CONTINUE

DO_LOOP3:
    ; Set to 32000 loops    
    MOV ecx, 32000 
LOOP3:
    inc     esi
    cmp     esi, 50
    jg      COUNTER_TO_0
LOOP3_CONTINUE:

    loop LOOP3
    ;ret
    jmp LOOP2_CONTINUE
    
COUNTER_TO_0:
    ; Set to 0
    xor     esi, esi
    
    jmp LOOP3_CONTINUE
    
;    jmp QUIT

QUIT:
    xor eax, eax
    ret

It took 13 seconds to complete.

One interesting explanation of why the binary C or C++ code is faster than my Assembler is that the C compiler generates better Assembler/binary code in the end. For example, the use of JMP is expensive in terms of CPU cycles, and the compiler can apply other optimizations and tricks that I’m not aware of, like using faster registers, while in my code I use ebx, ecx, esi, etc. (for example, imagine that using cx were cheaper than using ecx or rcx, and that I were not aware of it but the guys who created the GNU C compiler were)

[Image: SASM, Assembler code for Linux 64 bits, 12-13 seconds]

To be sure of what’s going on, I replaced the conditional jump (JG) and the JMP in LOOP3 with groups of 50 INC ESI instructions, one after the other, and the time was reduced to 1 second.

(In C the time was also reduced, even a bit more, when doing the same)

To see how the C code is translated into Assembler when compiled, you can do:

objdump --disassemble nested_loops

Look for the section main and you’ll get something like:

0000000000400470 <main>:
400470:    bf 0a 00 00 00           mov    $0xa,%edi
400475:    31 c9                    xor    %ecx,%ecx
400477:    be 00 7d 00 00           mov    $0x7d00,%esi
40047c:    0f 1f 40 00              nopl   0x0(%rax)
400480:    b8 00 7d 00 00           mov    $0x7d00,%eax
400485:    0f 1f 00                 nopl   (%rax)
400488:    83 c2 01                 add    $0x1,%edx
40048b:    83 fa 33                 cmp    $0x33,%edx
40048e:    0f 4d d1                 cmovge %ecx,%edx
400491:    83 e8 01                 sub    $0x1,%eax
400494:    75 f2                    jne    400488 <main+0x18>
400496:    83 ee 01                 sub    $0x1,%esi
400499:    75 e5                    jne    400480 <main+0x10>
40049b:    83 ef 01                 sub    $0x1,%edi
40049e:    75 d7                    jne    400477 <main+0x7>
4004a0:    48 83 ec 08              sub    $0x8,%rsp
4004a4:    be 34 06 40 00           mov    $0x400634,%esi
4004a9:    bf 01 00 00 00           mov    $0x1,%edi
4004ae:    31 c0                    xor    %eax,%eax
4004b0:    e8 ab ff ff ff           callq  400460 <__printf_chk@plt>
4004b5:    31 c0                    xor    %eax,%eax
4004b7:    48 83 c4 08              add    $0x8,%rsp
4004bb:    c3                       retq

Note: this is in AT&T syntax, not Intel. That means that add $0x1,%edx adds 1 to the EDX register (source, destination).

As you can see, the C compiler has created a very different Assembler version from the one I created.
For example, at 400470 it uses the EDI register to store 10, to control the outer loop.
It uses ESI to store 32000 (hexadecimal 0x7D00), for the second loop.
And EAX for the inner loop, at 400480.
It uses EDX for the counter, and compares it to 51 (hexadecimal 0x33) at 40048B.
At 40048E it uses CMOVGE (move if greater or equal), an instruction that was introduced with the P6 family processors, to move the contents of ECX to EDX if EDX was (in the CMP) greater than or equal to 51, which is the same condition as the "greater than 50" of the source code. As a XOR ECX, ECX was performed at 400475, ECX contains 0.
And it cleverly uses SUB and JNE (JNE means Jump if Not Equal; it jumps if ZF = 0 and is equivalent to JNZ, Jump if Not Zero).
A jump uses between 4 and 16 clocks, and a short jump must be within -128 to +127 bytes of the next instruction. As you see, a jump is very costly.

It looks like the biggest improvement comes from the use of CMOVGE, as it saves two jumps that my original Assembler code was performing.
Those two jumps, multiplied by 32000 x 32000 x 10 iterations, are a lot of CPU clocks.

So, with this in mind, as this Assembler code takes 10 seconds, I updated the graph from 13 seconds to 10 seconds.
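To make the difference between the two inner loops concrete, here is a small Python sketch that mimics both shapes: the compare-and-jump of my hand-written code and the compare-and-conditional-move that the compiler produced. Python cannot show the speed difference, so this only illustrates that both forms compute the same counter; the function names are illustrative, and the multiplication by a boolean is just a stand-in for the CMOVGE select.

```python
# Constants read from the disassembly above.
OUTER = 0xa        # mov $0xa,%edi
INNER = 0x7d00     # mov $0x7d00,%esi and mov $0x7d00,%eax
THRESHOLD = 0x33   # cmp $0x33,%edx


def step_with_branch(counter):
    """Shape of the hand-written loop: increment, compare, jump to a reset."""
    counter += 1
    if counter > 50:      # cmp esi, 50 / jg COUNTER_TO_0
        counter = 0       # xor esi, esi, then jmp LOOP3_CONTINUE
    return counter


def step_with_select(counter):
    """Shape of the compiler's loop: increment, compare, conditional move."""
    counter += 1                    # add $0x1,%edx
    keep = counter < THRESHOLD      # cmp $0x33,%edx
    return counter * keep           # cmovge %ecx,%edx, with %ecx holding 0


a = b = 0
for _ in range(1000):
    a = step_with_branch(a)
    b = step_with_select(b)
    assert a == b  # both forms must agree at every step

print("%d %d %d" % (OUTER, INNER, THRESHOLD))
print("counter after 1000 steps: %d" % a)
```

The hexadecimal constants decode to 10, 32000 and 51, which matches the loop bounds and the "greater than 50" check of the C source.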

Lua

This is the initial code:

local i_counter = 0

local i_time_start = os.clock()

for i_loop1=0,9 do
    for i_loop2=0,31999 do
        for i_loop3=0,31999 do
            i_counter = i_counter + 1
            if i_counter > 50 then
                i_counter = 0
            end
        end
    end
end

local i_time_end = os.clock()
print(string.format("Counter: %i\n", i_counter))
print(string.format("Total seconds: %.2f\n", i_time_end - i_time_start))

In the case of Lua theoretically one could take great advantage of the use of local inside a loop, so I tried the benchmark with modifications to the loop:

for i_loop1=0,9 do
    for i_loop2=0,31999 do
        local l_i_counter = i_counter
        for i_loop3=0,31999 do
             l_i_counter = l_i_counter + 1
             if l_i_counter > 50 then
                 l_i_counter = 0
             end
        end
        i_counter = l_i_counter
    end
end

I ran it with LuaJIT and saw no improvement in performance.

Node.js

var s_date_time = new Date();
console.log('Starting: ' + s_date_time);

var i_counter = 0;

for (var $i_loop1 = 0; $i_loop1 < 10; $i_loop1++) {
   for (var $i_loop2 = 0; $i_loop2 < 32000; $i_loop2++) {
       for (var $i_loop3 = 0; $i_loop3 < 32000; $i_loop3++) {
           i_counter++;
           if (i_counter > 50) {
               i_counter = 0;
           }
       }
   } 
}

var s_date_time_end = new Date();

console.log('Counter: ' + i_counter + '\n');

console.log('End: ' + s_date_time_end + '\n');

Execute with:

nodejs nested_loops.js

Phantomjs

The same code as for Node.js, adding at the end:

phantom.exit(0);

In the case of PhantomJS, it performs almost the same in both versions, 1.9.0 and 2.0.1-development compiled from sources.

PHP

The interesting thing about PHP is that you can write your own extensions in C, so you can have the ease of use of PHP and create functions in C that bring really fast performance, and invoke them from PHP.

<?php

$s_date_time = date('Y-m-d H:i:s');

echo 'Starting: '.$s_date_time."\n";

$i_counter = 0;

for ($i_loop1 = 0; $i_loop1 < 10; $i_loop1++) {
   for ($i_loop2 = 0; $i_loop2 < 32000; $i_loop2++) {
       for ($i_loop3 = 0; $i_loop3 < 32000; $i_loop3++) {
           $i_counter++;
           if ($i_counter > 50) {
               $i_counter = 0;
           }
       }
   } 
}

$s_date_time_end = date('Y-m-d H:i:s');

echo 'End: '.$s_date_time_end."\n";

Facebook’s HipHop Virtual Machine is a very powerful alternative that is JIT powered.

Downloading the code and compiling it is easy, just:

git clone https://github.com/facebook/hhvm.git
cd hhvm
rm -r third-party
git submodule update --init --recursive
./configure
make

Or just grab precompiled packages from https://github.com/facebook/hhvm/wiki/Prebuilt%20Packages%20for%20HHVM

Python

from datetime import datetime
import time

print ("Starting at: " + str(datetime.now()))
s_unixtime_start = str(time.time())

i_counter = 0

# From 0 to 31999
for i_loop1 in range(0, 10):
    for i_loop2 in range(0,32000):
         for i_loop3 in range(0,32000):
             i_counter += 1
             if ( i_counter > 50 ) :
                 i_counter = 0

print ("Ending at: " + str(datetime.now()))
s_unixtime_end = str(time.time())

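# The timestamps are strings such as "1490547518.12", so convert via float();
# int() works in both Python 2 and Python 3 (long() fails on these strings)
i_seconds = int(float(s_unixtime_end) - float(s_unixtime_start))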
s_seconds = str(i_seconds)

print ("Total seconds:" + s_seconds)

Ruby

#!/usr/bin/ruby -w

time1 = Time.new

puts "Starting : " + time1.inspect

i_counter = 0;

for i_loop1 in 0..9
    for i_loop2 in 0..31999
        for i_loop3 in 0..31999
            i_counter = i_counter + 1
            if i_counter > 50
                i_counter = 0
            end
        end
    end
end

time1 = Time.new

puts "End : " + time1.inspect

Perl

The case of Perl was a very interesting one.

This is the current code:

#!/usr/bin/env perl

# localtime() in scalar context returns a human-readable date string
$s_datetime = localtime();
print "$s_datetime Starting calculations...\n";
$i_counter=0;

$i_unixtime_start=time();

for my $i_loop1 (0 .. 9) {
    for my $i_loop2 (0 .. 31999) {
        for my $i_loop3 (0 .. 31999) {
            $i_counter++;
            if ($i_counter > 50) {
                $i_counter = 0;
            }
        }
    }
}

$i_unixtime_end=time();

$i_seconds=$i_unixtime_end-$i_unixtime_start;

print "Counter: $i_counter\n";
print "Total seconds: $i_seconds";

But before that I had created one, slightly different, with C-style for loops:

#!/usr/bin/env perl

$i_counter=0;

$i_unixtime_start=time();

for (my $i_loop1=0; $i_loop1 < 10; $i_loop1++) {
    for (my $i_loop2=0; $i_loop2 < 32000; $i_loop2++) {
        for (my $i_loop3=0; $i_loop3 < 32000; $i_loop3++) {
            $i_counter++;
            if ($i_counter > 50) {
                $i_counter = 0;
            }
        }
    }
}

$i_unixtime_end=time();

$i_seconds=$i_unixtime_end-$i_unixtime_start;

print "Total seconds: $i_seconds";

I repeated this test, with the same version of Perl, because of a comment from a reader (thanks mpapec) who said:

In this particular case perl style loops are about 45% faster than original code (v5.20)

And indeed, surprisingly, the time went from 796 seconds to 436 seconds.

So the charts are updated to reflect the result of 436 seconds.

Bash

#!/bin/bash
echo "Bash version ${BASH_VERSION}..."
date

let "s_time_start=$(date +%s)"
let "i_counter=0"

for i_loop1 in {0..9}
do
     echo "."
     date
     for i_loop2 in {0..31999}
     do
         for i_loop3 in {0..31999}
         do
             ((i_counter++))
             # (( )) compares numerically; inside [[ ]] the > operator compares strings
             if (( i_counter > 50 ))
             then
                 let "i_counter=0"
             fi
         done
#((var+=1))
#((var=var+1))
#((var++))
#let "var=var+1"
#let "var+=1"
#let "var++"
     done
done

let "s_time_end=$(date +%s)"

let "s_seconds = s_time_end - s_time_start"
echo "Total seconds: $s_seconds"

# Just in case it overflows
date

Gambas 3

Gambas is a language and an IDE to create GUI applications for Linux.
It is very similar to Visual Basic, but better, and it is not a clone.

I created a command line application and it performed better than PHP. An excellent job has been done with the compiler.

[Image: gbr3 Gambas performance]

Note: in the screenshot the first test ran for a few seconds more than the second. This was because I deliberately put the machine under some load and I/O during the tests. The valid value for the test, confirmed with more iterations, is the second one, done under the same conditions (no load) as the previous tests.

' Gambas module file MMain.module

Public Sub Main()

    ' @author Carles Mateo http://blog.carlesmateo.com
    
    Dim i_loop1 As Integer
    Dim i_loop2 As Integer
    Dim i_loop3 As Integer
    Dim i_counter As Integer
    Dim s_version As String
    
    i_loop1 = 0
    i_loop2 = 0
    i_loop3 = 0
    i_counter = 0
    
    s_version = System.Version
    
    Print "Performance Test by Carles Mateo blog.carlesmateo.com"    
    Print "Gambas Version: " & s_version

    Print "Starting..." & Now()
    
    For i_loop1 = 0 To 9
        For i_loop2 = 0 To 31999
            For i_loop3 = 0 To 31999
                i_counter = i_counter + 1
                
                If (i_counter > 50) Then
                    i_counter = 0
                Endif
            Next
        Next
    Next
    
    Print i_counter
    Print "End " & Now()

End

Changelog

2015-08-26 15:45

Thanks to a comment from a reader (thanks Daniel) pointing out a mistake. The phrase in question was in the conclusions, point 14, and was inaccurate. The original phrase said “go is promising. Similar to C, but performance is much better thanks to the use of JIT”. The allusion to JIT is incorrect and has been replaced by this: “thanks to deciding at runtime if the architecture of the computer is 32 or 64 bit, a very quick compilation at launch time, and it compiling to very good assembler (that uses the 64 bit instructions efficiently, for example)”

2015-07-17 17:46

Benchmarked Facebook HHVM 3.9 (dev.; the release date is August 3, 2015) and HHVM 3.7.3. They take 52 seconds.

Re-benchmarked Facebook HHVM 3.4: before it took 72 seconds, now it takes 38 seconds. I checked the screen captures from 2014 to rule out human error. It looks like a turbo frequency issue on the test computer, with the CPU governor making it work below the optimal speed, or a CPU-hungry/IO process that triggered during the tests without my detecting it. I am thinking about forcing a fixed CPU speed for all the cores for the tests, like 2.4 GHz, and booting a live text-only system without disk access and network to prevent Ubuntu from launching processes in the background.
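
A 72 versus 38 second discrepancy suggests that a single run is fragile. One complementary approach, sketched here in Python and not something that was done for this article, is to repeat the measurement several times and look at the spread. The function names and the reduced loop sizes are assumptions chosen for illustration.

```python
import time


def run_benchmark(outer=10, inner=1000):
    """Reduced-size version of the counter loop."""
    counter = 0
    for _ in range(outer):
        for _ in range(inner):
            counter += 1
            if counter > 50:
                counter = 0
    return counter


def time_runs(fn, repeats=5):
    """Return the wall-clock duration of each run, in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    durations = time_runs(run_benchmark)
    print(f"min {min(durations):.4f} s, max {max(durations):.4f} s")
```

A large gap between the minimum and the maximum is a hint that something else (frequency scaling, a background process) interfered with at least one of the runs.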

2015-07-05 13:16

Added performance of Phantomjs 1.9.0 installed via apt-get install phantomjs in Ubuntu, and Phantomjs 2.0.1-development.

Added performance of nodejs 0.12.04 (compiled).

Added bash to the graphic. Its performance is so bad that I had to edit the graphic to fit it in (in pink) in order to avoid breaking the scale.

2015-07-03 18:32

Added benchmarks for PHP 7 alpha 2, PHP 5.6.10 and PHP 5.4.42.

2015-07-03 15:13
Thanks to the contribution of a reader (thanks mpapec!) I tried Perl-style for loops, which took the time from 796 seconds to 436 seconds.
(I used the same Perl version: Perl 5.18.2)
Updated test value for Perl.
Added new graphics showing the updated value.

Thanks to the contribution of a reader (thanks junk0xc0de!) added some additional warnings and explanations about the dangers of using -O3 (and -O2) in C/C++.

Updated the Lua code to print i_counter and do the if i_counter > 50 check.
This makes it take a bit longer, a few tenths of a second, going from 7.8 to 8.2 seconds.
Updated graphics.

49 thoughts on “Performance of several languages”

  1. Razvan Popovici

    Any decently intelligent compiler will write something like:

    let i_counter be 37
    print i_counter

    Your code does not depend on any external output or random events. Everything may be computed ahead, at compile time. Gcc / Visual C++ actually do this sometimes.

    Second, your code doesn’t use any data structure. So it is more or less a CPU/register thing. Access to memory wastes a lot of CPU cycles.

    Third, your code cannot be optimized with any MMX/AVX instruction set. Cherry picking?

    The reason Assembler scored bad is that your assembly is weak, compared with the output produced by a compiler, with -O3 option. Don’t take it as an offence, our code looks well written and logical from human perspective, like a -O0 output.
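
On the first point of this comment, the value 37 can be checked without running the full benchmark: the counter returns to 0 every 51 increments, so its final value is the total iteration count modulo 51. A minimal Python sketch, with hypothetical function names and a reduced-size loop for the cross-check:

```python
def loop_counter(n1, n2, n3):
    """Run the three nested loops literally (use small sizes only)."""
    counter = 0
    for _ in range(n1):
        for _ in range(n2):
            for _ in range(n3):
                counter += 1
                if counter > 50:
                    counter = 0
    return counter


def closed_form(n1, n2, n3):
    """The counter cycles with period 51, so only the remainder matters."""
    return (n1 * n2 * n3) % 51


if __name__ == "__main__":
    # Cross-check the shortcut against the real loop on a small case.
    print(loop_counter(3, 40, 50) == closed_form(3, 40, 50))
    # Full-size benchmark parameters, computed instantly.
    print(closed_form(10, 32000, 32000))  # prints 37
```

This is the kind of simplification an optimizing compiler is free to make when nothing inside the loops depends on runtime input.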

    Reply
  2. Saleem

    Very interesting and informative comparison. There is no doubt JIT boosts performance especially in case of Java. However, with some tweaks I’m able to pull C++ execution time to less than 1ms. I’m taking help of OpenMP and -O2 flag.

    #include <iostream>
    #include <chrono>
    #include <omp.h>

    using namespace std;
    using namespace std::chrono;

    int main(int argc, char** argv) {

    auto t1 = high_resolution_clock::now();

    auto f = []() -> unsigned long long {
    unsigned long long i_counter = 0;
    #pragma omp parallel for reduction(+:i_counter)
    for (auto i_loop1 = 0; i_loop1 < 10; i_loop1++) {
    for (auto i_loop2 = 0; i_loop2 < 32000; i_loop2++) {
    for (auto i_loop3 = 0; i_loop3 < 32000; i_loop3++) {
    ++i_counter;
    }
    }
    }

    return i_counter %= 51;
    };

    auto i_counter = f();
    auto t2 = high_resolution_clock::now();

    std::chrono::duration<double, std::milli> milis = t2 - t1;

    cout << "i_counter = " << i_counter << endl;
    cout << "f() took " << milis.count() << "ms " << endl;

    return 0;
    }

    Reply
  3. wtfuzz

    This is extremely misleading.

    Java eliminates useless outer loops. Try printing the loop variables i_loop1,2,3 after the loops, and you will see execution times double.

    $ time java cpu/cpu
    Java Version: 1.7.0_101
    Starting cpu.java…
    10
    32000
    32000
    37
    End

    real 0m17.059s
    user 0m17.048s
    sys 0m0.008s

    On the C compiler, add -march=native and -O3 (verified that it does NOT remove any of the loops in objdump, since the i_counter variable is referenced in the printf() afterwards), and it executes in half the time that Java does.

    A JIT can’t magically make a CPU run faster than it is. Period. Statements like “X is faster than C or assembly” irk me 🙂 That isn’t to say that language X is a better use case than Y for level of effort:performance tradeoffs, but for pure performance critical code of something actually useful other than a useless loop, tuned native asm/C/C++ will always outperform.

    All you’ve proven is that you can’t see what the java compiler, or any of the other compilers decided to discard during optimization.

    Reply
  4. kgcode

    A few questions:

    [1] Did you disable Intel SpeedStep during the running of the benchmarks?

    [2] Doesn’t printf, which you have inside the timed area of the C code, ultimately make a system call? It does in most environments I’ve worked with.

    [3] The C code calls to printf cause the format strings to be parsed character by character, while the Java call to println it’s designed this way, and the assembly language code doesn’t do any printing at all. Although the printing isn’t done inside the loops, it does have an effect on the fixed overhead of the timed code. Wouldn’t it make sense to eliminate all printing inside the timed area in all languages?

    [4] The gettimeofday function can be affected by discontinuous changes to the current time, so if the system happened to sync its time with a time server during the benchmark run, the measured time could be affected. Would it make more sense to use clock_gettime?
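
Point [4] applies to any language: a wall clock can jump when the system time is adjusted, while a monotonic clock cannot go backwards. A minimal Python sketch of timing with a monotonic clock follows; the helper name is hypothetical and the workload is a stand-in, not the benchmark from the article.

```python
import time


def elapsed_monotonic(fn):
    """Time fn() with a clock that is not affected by system time changes."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start


def workload():
    """Small stand-in for the benchmark loop."""
    counter = 0
    for _ in range(100_000):
        counter += 1
        if counter > 50:
            counter = 0
    return counter


if __name__ == "__main__":
    print(f"Total seconds: {elapsed_monotonic(workload):.4f}")
```

In C the corresponding choice would be `clock_gettime` with `CLOCK_MONOTONIC` rather than `gettimeofday`, which is what the comment proposes.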

    Reply
    1. kgcode

      Typo: In [3], I meant to say “the Java call to println isn’t designed this way.”

      Reply
  5. Ethan Madden

    I’ve edited your python example to be a bit more pythonic and added some performance improvements. It dropped the runtime on my local machine from 1407 seconds to a mere 411 in python 2, and had similar improvements in python 3. Feel free to grab some or all of it from the link below and try it on your box.

    https://gist.github.com/jetpacktuxedo/a2d4ef619b580eedc1d8

    Reply
  6. Danilo Dias

    Important:

    All the languages that use LLVM (example: LuaJit, Julia, Gambas+Jit) apparently have the same result.

    To speed up the Gambas code, just put the word “FAST” at the top of the source file.

    Reply
  7. Danilo Dias

    Example for Julia Language – http://www.julialang.org
    The same benchmark result of LuaJit

    ———————————–

    function doTest()
    i_counter = 0

    i_time_start = now()

    for i_loop1 = 0:9
    for i_loop2 = 0:31999
    for i_loop3 = 0:31999
    i_counter = i_counter + 1
    if i_counter > 50
    i_counter = 0
    end
    end
    end
    end

    i_time_end = now()
    println("Counter: ", i_counter)
    println("Total seconds: ", (i_time_end - i_time_start)/1000)
    end

    doTest()

    Reply
  8. Danilo Dias

    Example for R Language:
    http://www.r-project.org

    ———————————————–

    library(compiler)
    enableJIT(3)

    i_counter <- 0

    i_time_start <- Sys.time()

    for(i_loop1 in 0:9)
    {
    for(i_loop2 in 0:31999)
    {
    for(i_loop3 in 0:31999)
    {
    i_counter <- i_counter + 1
    if(i_counter > 50)
    {
    i_counter <- 0
    }
    }
    }
    }

    i_time_end <- Sys.time()
    m <- paste("Counter:", i_counter)
    print(m)
    m <- paste("Total seconds:", (i_time_end - i_time_start))
    print(m)

    Reply
  9. b0blee

    I was wondering about the effect of using unnecessary variables in your Ruby script. You didn’t write loops the way most Ruby programmers would. The syntax was very Java-ish. Would the program run faster if you had written, for example:

    10.times do
    32000.times do
    32000.times do
    i_counter += 1
    if i_counter > 50
    i_counter = 0
    end
    end
    end
    end

    I tried it and, of course, it is still slow, but does using a language’s best practices make a significant difference?

    Reply
  10. Asu

    COME ON. Don’t compare language performances with a useless nested loop benchmark.
    Disabling optimizations for gcc is absolutely unfair and stupid, because 1) the code performance relies A LOT on optimization and 2) you let Java and most of the other JIT VMs do optimizations you’re not letting C++ compilers do.

    Sure, -O3 will rule useless for loops out
    http://goo.gl/xRUDpw

    But see a Fibonacci function (of course making sure the variable is not known compile time) :
    http://goo.gl/dgJGJx
    http://goo.gl/sPozjv

    No optimizations results in far more costly instructions.

    Now let’s compile my bytecode VM (no wish to make ad here, but because it’s a real case situation without compile-time known stuff) in -O3 and compare it with -O0 with a program counting from -10 000 000 to 0 (it is GUARANTEED that the program is not known compile-time) :
    http://i.imgur.com/VpzUdCc.png

    That’s one hell of a difference.

    And about bugs caused by -O2 and -O3, it’s pretty simple: if you don’t rely on uninitialized variables’ value, you probably won’t have any issue (unless you’re doing it on purpose, and I don’t even know how you could).

    If you can’t use optimization for X compiler, then don’t put it in your results. Or look for a better benchmark.

    Reply
    1. Carles Mateo Post author

      Hi Asu,

      Thanks for your comments and for the links to http://gcc.godbolt.org that personally I didn’t know.

      I had previously replied to all your points in the article and in other comments.
      The beginning of the article says:
      “This article shows some results and shares my conclusions. It is as a base to discuss with my colleagues. Is not an end, we are always doing tests, looking for the edge, and looking at the root of the things in detail. And often things change from one version to the other. This article shows not an absolute truth, but brings some light into interesting aspects.

      It could show the performance for the certain case used in the test, although generic core instructions have been selected. Many more tests are necessary, and some functions differ in the performance. But this article is a necessary starting for the discussion”

      So that’s it, a necessary beginning for starting discussions.

      Implementing Fibonacci with a value not known by the compiler is a fair point for me and a good test to do. It could bring interesting results.
      I wanted specifically to leave functions out of the equation for a simple reason: to avoid bias by the stack management and the improvements that languages apply to functions.
      And I wanted to use basic operations that are easily translated to assembler to compare the assembly output and real performance from one language to another.
      Also, I used only one core.

      Whether specific improvements could be applied to the different languages was not the point of the test. If a language is not able to perform a basic loop, inc and if with good performance, one cannot expect much from it.
      This is the case for Bash, Python, Ruby or Perl. And it’s nice to know it.
      It is also important to note the improvements that strategies like JIT bring. Python with a JIT is a totally different thing, and fast.

      Best,
      Carles

      Reply
  11. Rodrigo

    Hi,
    1. A better code for python with a 50% performance increase
    from datetime import datetime
    import time

    print ("Starting at: " + str(datetime.now()))
    s_unixtime_start = str(time.time())

    i_counter = 0

    def f():
        i = 0
        # From 0 to 31999
        i_loop1=0
        i_loop2=0
        i_loop3=0
        while (i_loop1 < 10):
            while (i_loop2 < 32000):
                while (i_loop3 < 32000):
                    i+=1
                    if (i > 50 ):
                        i = 0
                    i_loop3+=1
                i_loop2+=1
            i_loop1+=1
        return i

    i_counter = f()
    print(i_counter)
    print ("Ending at: " + str(datetime.now()))
    s_unixtime_end = str(time.time())

    i_seconds = long(float(s_unixtime_end)) - long(float(s_unixtime_start))
    s_seconds = str(i_seconds)

    print ("Total seconds:" + s_seconds)

    2. You did a great job, but to me it only shows the improvement of JIT in a nested loop. The two outer loops are irrelevant, and the bytecode must optimize them. I suggest putting some work in loop1 and some work in loop2, with a string concatenation for example, and the result will be different.

    3. Take a look at the site http://benchmarksgame.alioth.debian.org/, a much better benchmark with algorithms implemented in many languages; there you will see results much different from the ones you found.

    Thanks

    Reply
    1. Rodrigo

      I sent the wrong Python code

      from datetime import datetime
      import time

      def f():
          i = 0
          # From 0 to 31999
          for i_loop1 in xrange(0,10):
              for i_loop2 in xrange(0,32000):
                  for i_loop3 in xrange(0,32000):
                      i=(0 if (i>50) else (i+1))
          return i

      def main():

          print ("Starting at: " + str(datetime.now()))
          s_unixtime_start = str(time.time())
          print(f())
          print ("Ending at: " + str(datetime.now()))
          s_unixtime_end = str(time.time())

          i_seconds = long(float(s_unixtime_end)) - long(float(s_unixtime_start))
          s_seconds = str(i_seconds)

          print ("Total seconds:" + s_seconds)

      main()

      Reply
  12. dc

    All this very impressive work to test just loops? This is a simple “toy” benchmark which, unfortunately, has no correspondence with real-world code. Java being faster than C/C++…? Let Java handle a “real” program, using more than 6 times the memory needed by any other language and hit the GC problem… While your work is very clean and serious, I would like to see some more complex code tested. Then you’ll certainly see much different results.

    Reply
    1. Raffaello

      so based on your benchmark C and C++, even assembler, are slower than a JIT/Java.
      so we should build our OS in Java..... but why does it sound so weird?

      notice that you benchmarked multiple different codes using 3 main for loops.
      for asm, if you don’t take advantage of at least MMX, CPU caching and memory alignment (as the JIT is probably doing internally), it makes no sense to write asm in the 8086 instruction set in 2014, so it is WRONG to compare your ASM CODE to JIT!

      Same thing for C and C++: you have to use optimization, at least O2, and there are other options like favouring speed over code size, loop unrolling and use of advanced CPU instructions (so make it compile for at least a Pentium CPU ???)

      so my point is: YOU CANNOT COMPARE JIT LANGUAGES WITH BINARY COMPILED LANGUAGES COMPILED IN DEBUG 8086 INSTRUCTION SET MODE!!!!
      please.

      Reply
  13. Daniel

    I stopped reading when you said golang is faster because of JIT.
    It does not use any JIT for the code you posted.

    Reply
    1. Daniel

      What you did is:

      1. Write a program that can benefit a lot from loop unrolling.
      2. Use compiler-specific tricks to stop the compiler from unrolling your loop.
      3. Pick a JIT (the Java JIT) that can recognize your tricks.
      4. Claim JIT is everything.

      golang does not use runtime code generation.
      Try golang 1.5, it is faster than your Java.

      Reply
      1. Carles Mateo Post author

        Hi Daniel, I’ve explained this in other replies, so perhaps you will find additional info in the comments, but I’ll try to reply to your points.

        1) No, I didn’t. I created a program that uses no system calls in its main calculation body, that uses basic instructions of the processor (loops, counters, ifs, MOV INC JNE) that can be ported to all the programming languages without distortion, and in addition I tried to write code that would not be affected by branch misprediction, to prevent biased results.

        This article is a starting point for discussions. It doesn’t pretend to tell what the truth is, especially now that there are so many variables (optimizations, architectures), but it brings interesting information and light for decision makers, and a predictable base.

        2) No. In C++, if you use -O3 the loops are removed. I point this out transparently in the article. (I have another article where I created a sorting algorithm and -O3 generated wrong optimizations, producing wrong final calculations. That’s why I mention the risk of bugs on optimization.)

        3) No, this was not the idea.
        My point for decision makers, Engineers and CTOs: is it worth it for your company to invest thousands of Dollars and months into producing optimized code in C++, when a JIT powered language like Java takes care of all the optimizations possible on the target computer?
        The person leading the technology in that company has to balance this and choose.
        For some companies C++ will make sense and for others Java will be the answer. (Replace C++ and Java with your preferred languages.)

        But what will be clear to everyone is that using Bash for processing those big MapReduce tasks will not be the most efficient. 😉
        Seriously, the point is that a startup running 50 Servers with a scripting language can save all this money and use only two. This can be the difference for that Company between being successful and having to close due to excessive operating costs. *

        In the case of PHP for Facebook it is clear that using HHVM (JIT powered) saves tens of thousands of Servers for doing the same job as PHP.

        4) JIT is not everything, but you can’t deny that it is great, and it is much more worry-free than producing binaries for 32 bit, for 64 bit, for arm 32, for arm 64, and specific assembler code for specific versions of an architecture… It is much cheaper. I love binary code, and it is very fast, but it has some problems too, and Engineers have to be aware of all that information.
        Some of my SysAdmin friends were running Python, and after reading the article they changed to PyPy, which is a JIT Python that brings much, much faster execution, saving many Servers for running the same tasks.
        Also, this is a starting point for discussing other more advanced topics. A base is necessary, and this article is a base.
        I don’t claim to have the absolute truth, just to bring relevant data, and I also link to some very nice articles about Java and JIT performance against C or C++.
        Everyone can get their own conclusions.

        * Today I read about a company that reduced the number of Servers from 30 using Ruby to 2 using Go (and 2 was for redundancy, 1 was enough): http://www.iron.io/blog/2013/03/how-we-went-from-30-servers-to-2-go.html

        Well, golang does things at runtime, and has a garbage collector, and checks the types… but yes, it is not a JIT.

        Will try golang 1.5 as soon as I can. Thanks.

        Reply
    2. Carles Mateo Post author

      Hi Daniel,

      Yes, you’re right. That was incorrect.
      Thanks for pointing it out.

      I’ve changed the point 14, and changed that text to “thanks to deciding at runtime if the architecture of the computer is 32 or 64 bit, a very quick compilation at launch time, and it compiling to very good assembler (that uses the 64 bit instructions efficiently, for example)”.

      I’ve also noted this error and change in the Changelog at the end of the article.

      I’ve been reading some more documentation about Go, and investigating by myself: compiling, disassembling and examining the produced assembler code. I want to spend some more time on it as soon as I can, but my impression is that the generated binary code is good, and makes things go faster than in my Assembler code, for example by using the 64 bit instruction set on my computer.
      Considerations about the advantages of static linking and the problems of distributing to different architectures go beyond the scope of this article, but with golang’s compilation speed it is now possible to just distribute the .go file across all the servers and just go run.

      Reply
  14. Yamashi

    Oh yea, let’s allow the JIT to optimize away the loop but not the C++ version, also let’s allow SSE extensions for JIT but not for C++, ASM and co.

    You say -O3 can cause bad code, it’s not true, if you write bad code (don’t care about the warnings) aggressive optimizations will eventually break your program, not because O3 changes the logic of the code but because the code was poorly written.

    All you are benching here is non optimized native code versus optimized JIT, needless to say, the fact that non optimized versions compete very well against machine optimized versions of other languages shows that optimized native code is superior.

    Last but not least, writing C code and calling it C++ does NOT make it C++.

    Reply
    1. Carles Mateo Post author

      Just read the article before commenting such things.

      Just compile the code with -O3 and tell me what result you got for the counter. Obviously you have not tried. There is a bug in gcc optimization for this case and it brings a wrong value.

      C++ is the compiler used; I’m not using objects in any language here. It is simple code that the compiler translates well to basic assembler instructions.

      Feel free to use SSE instructions in ASM and share the code here. JIT will still win. Even if ASM beat Java by 1 second, in a test with 10 times more registers the JIT version would crush the native one. Native does not scale as well, and you have to provide support for all the different CPU optimizations. This brings horrible time-to-market to the developments.

      Reply
      1. Yamashi

        I ran it with -O3 before commenting, the counter value was indeed 37, nothing wrong with the code generated.

        “time ./main
        Counter: 37
        End

        real 0m0.002s
        user 0m0.002s
        sys 0m0.000s”

        Now, why this code is not a correct test case :
        1) It does not test anything but loop unrolling.
        2) It starts from the assumption that the programmer can’t think of a better way to compute 10*32000*32000 % 51.
        3) There is no real world use of this code.

        You should take a look at the benchmarks here : http://benchmarksgame.alioth.debian.org/u32/compare.php?lang=gpp&lang2=java

        Reply
        1. Carles Mateo Post author

          Cool. Now that you have your binary disassemble it.
          Do you see any loop there?.

          The compiler removed the loops thinking of them as unnecessary. This is not speed, is just removing the code and returning 37. (is not a bug, is what the compiler does trying to optimize your code)

          Think about this:
          A 3 GHz processor can execute 3,000,000,000 clock cycles per second.
          Assume that the loop, increasing the counter, reading from memory and writing to memory take 5 cycles (I’m generous here. This number changes from architecture to architecture, and with the caches it is difficult to predict. Normally modern architectures use fewer cycles per instruction than the previous generation).
          The loops cause the execution of the inner code 32,000 * 32,000 * 10 = 10,240,000,000 times. If the code takes 5 cycles per iteration, it will take 51,200,000,000 cycles in total.
          51,200,000,000 cycles / 3,000,000,000 cycles per second ≈ 17 seconds theoretically.
          On my computer, a Core i7 at 3.9 GHz with turbo, it takes 10 seconds.
          The C compiler produces excellent Assembler code that reduced the time of my original Assembler code from 13 seconds to 10 seconds with a better choice of instructions, but that’s it. It will never take 0.002s or something similar if it is executing (32,000^2)*10 times.
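
The arithmetic in this reply can be written out as a short Python sketch. The function names are hypothetical, and the 5 cycles per iteration figure is the assumption stated in the reply, not a measured value.

```python
def estimated_seconds(iterations, cycles_per_iteration, clock_hz):
    """Theoretical run time if every iteration costs a fixed number of cycles."""
    return iterations * cycles_per_iteration / clock_hz


def implied_cycles_per_iteration(measured_seconds, clock_hz, iterations):
    """Work backwards from a measured time to the average cycles per iteration."""
    return measured_seconds * clock_hz / iterations


ITERATIONS = 10 * 32_000 * 32_000  # 10,240,000,000 inner-loop executions

if __name__ == "__main__":
    # Assumed: 5 cycles per iteration on a 3 GHz CPU.
    print(round(estimated_seconds(ITERATIONS, 5, 3_000_000_000), 2))
    # Measured in the reply: 10 seconds at 3.9 GHz.
    print(round(implied_cycles_per_iteration(10, 3_900_000_000, ITERATIONS), 2))
```

Under these figures the estimate is about 17.07 seconds, and the measured 10 seconds at 3.9 GHz corresponds to roughly 3.8 cycles per iteration. Either way, a result of 0.002 seconds cannot come from actually executing the loop body that many times.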

          You should read the article, not just the conclusions.

          Replying to your 3 points. This code is a perfect case because:
          1)
          A) It translates to assembler with basic CPU instructions, and no system calls
          B) All the programs do those operations: increasing counters, reading and writing variables from memory, looping.
          C) If a language is unable to do a simple loop, read and write a variable quickly enough, you can’t expect much from it in terms of performance

          2) No, it doesn’t assume a programmer can’t think. It tests basic assembler instructions (read memory, write memory, loop) enough times to see the difference. And many programs do that.

          3) Oh yes, many programs do that, all the algorithms loop log N, N^2, etc… times. Sorting algorithms, mathematics, signals, compression algorithms… Basically they loop, read from memory, write, increase counters… billions of times.

          Best,
          Carles

          Reply
  15. Vidar Skaugen

    I know bash is horribly slow for these kinds of tasks, but I was surprised at just *how* slow it was. I tried running it (with fewer loops), and got the time down from 91 seconds to 51 seconds by replacing the 5 lines in the innermost loop with:

    i_counter=0
    [[ $i_counter > 50 ]] && i_counter=0

    Also, it seems like you have a typo, the last date is “$(date +%2)” instead of the “$(date +%s)” you use earlier.

    Awesome article, thanks for the interesting read.

    Reply
  16. tobbik

    I think it is very important to point out what this benchmark is measuring. As a general assumption, in my opinion, this is measuring three things:

    1. for interpreted languages running in a : How efficiently can the interpreter convert its internal representation of a variable into a machine executable one.
    2. for JITed code execution: How well can the JIT compiler detect blocks and optimize them into machine code. Once JITed there shall be no significant difference for the calculation because the variables are in memory in machine code.

    Different JIT compilers take different types of clues for block boundaries. Some are smarter, some require the programmer’s help. Knowing the internals of a JIT compiler can yield significant gain. For example, HHVM prefers function boundaries for optimization. By changing the code and wrapping the three for-loops into a function about 40% can be gained. PHP 5.6, however, gains something but it is negligible.


    <?php
    date_default_timezone_set( "America/Vancouver" );
    $s_date_time = strtotime( date('Y-m-d H:i:s') );
    /* In JIT compilation, the detection of blocks is important. HHVM has a
    * documented and intended preference for function boundaries. Wrapping the
    * main business in a function drops execution time by 40ish percent
    */
    function runner( $i_counter )
    {
    for ($i_loop1 = 0; $i_loop1 < 10; $i_loop1++) {
    for ($i_loop2 = 0; $i_loop2 < 32000; $i_loop2++) {
    for ($i_loop3 = 0; $i_loop3 < 32000; $i_loop3++) {
    $i_counter++;
    if ($i_counter > 50) { $i_counter = 0; }
    }
    }
    }
    return $i_counter;
    }
    $data = runner( 0 );
    $s_date_time_end = strtotime( date('Y-m-d H:i:s') );
    echo ( ' Result: ' . $data . ' ' .
    ($s_date_time_end - $s_date_time) . " seconds\n" );
    ?>

    Normal PHP version:

    <?php
    date_default_timezone_set( "America/Vancouver" );
    $s_date_time = strtotime( date('Y-m-d H:i:s') );
    $i_counter = 0;
    for ($i_loop1 = 0; $i_loop1 < 10; $i_loop1++) {
    for ($i_loop2 = 0; $i_loop2 < 32000; $i_loop2++) {
    for ($i_loop3 = 0; $i_loop3 < 32000; $i_loop3++) {
    $i_counter++;
    if ($i_counter > 50) { $i_counter = 0; }
    }
    }
    }
    $s_date_time_end = strtotime( date('Y-m-d H:i:s') );
    echo ( ' Result: ' . $i_counter . ' ' .
    ($s_date_time_end - $s_date_time) . " seconds\n" );
    ?>

    Reply
    1. Carles Mateo Post author

      Hi tobbik,

      The idea of the article, is:
      1) to show how different languages+compilers perform with a very basic set of instructions that can be easily converted to Assembler by those compilers/interpreters, using code as close as possible from one language to another. So conclusions can be drawn from the loops, adding and comparing operations.
      We avoid system calls, access to disk, etc… during the performance test to avoid distorting the results.
      Basic operations that any program uses, the real work:
      – For loop
      – Inc ++
      – Comparison greater and equal
      – Read and write to variables
      2) to discover and show side effects, counter-intuitive behaviour, and curiosities, like how a particular language/engine optimizes or performs the operations, or bugs that may appear in C/C++ with O2 or O3 optimizations.
      3) to be a starting point for discussing more advanced topics.

      So all the comments and contributions are welcome!.

      When SysAdmins facing speed/scalability/performance problems look at the times of Python, and especially Bash, they see that they have to use another language or approach to solve that problem.
      In the case of Python, using PyPy or another JIT for Python results in a quick win.

      Your code appears cut, but your point is visible and it makes a very good contribution for the HHVM case!.

      Best,
      Carles

      Reply
  17. tobbik

    First of all, good job on the benchmark, it’s a lot of work!
    I found one particular issue with the Lua implementation: the general rule of thumb is to make every variable as local as possible. Even if it seems to make no difference whether the i_counter variable is declared local, in terms of the way the interpreter handles scope it is significant. A non-local variable is globally accessible, even from within another file loaded by require() or similar. The effect is dramatic on my computer, shortening execution duration to almost 1/4 of the time.
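
The same global-versus-local distinction exists in CPython, where module-level names are looked up in a dictionary and function locals are stored in indexed slots. The sketch below is a Python analogue, not the Lua code discussed here; the names are hypothetical and the size of the timing difference depends on the interpreter version.

```python
import timeit

counter_global = 0  # module-level name, looked up by name on every access


def run_with_global(iterations):
    """Counter kept in a module-level (global) variable."""
    global counter_global
    counter_global = 0
    for _ in range(iterations):
        counter_global += 1
        if counter_global > 50:
            counter_global = 0
    return counter_global


def run_with_local(iterations):
    """Counter kept in a function-local variable."""
    counter = 0
    for _ in range(iterations):
        counter += 1
        if counter > 50:
            counter = 0
    return counter


if __name__ == "__main__":
    for fn in (run_with_global, run_with_local):
        seconds = timeit.timeit(lambda: fn(200_000), number=5)
        print(f"{fn.__name__}: {seconds:.3f} s")
```

Both functions return the same value; only where the counter lives differs, which is the variable the comment above suggests changing.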

    Reply
    1. Carles Mateo Post author

      Many thanks for your kind words tobbik.

      Your point is very interesting, and true: it allows the use of better optimisations.
      By the way, I modified the Lua code with a soft approach following your suggestion (explicitly declaring i_counter as local), and in this case I saw no gain in performance. That looks natural to me, as this is a single file and the forced definition of local cannot benefit from using internal CPU registers, which I suppose are already used for improving the loops.
      (I just didn’t use local inside the loops as the variable has to be printed outside. I made a few arrangements to make the code like the other versions, as it was missing the final print of the value, which confirms that everything went right, and the i_counter reset to prevent JIT optimizations and overflows).

      That’s the code:
      local i_counter = 0
      local i_time_start = os.clock()

      for i_loop1=0,9 do
          for i_loop2=0,31999 do
              for i_loop3=0,31999 do
                  i_counter = i_counter + 1
                  if i_counter > 50 then
                      i_counter = 0
                  end
              end
          end
      end

      local i_time_end = os.clock()
      print(string.format("Counter: %i\n", i_counter))
      print(string.format("Total seconds: %.2f\n", i_time_end - i_time_start))

      But it looks like we get a very nice performance gain by doing:

      for i_loop1=0,9 do
          for i_loop2=0,31999 do
              local l_i_counter = i_counter
              for i_loop3=0,31999 do
                  l_i_counter = l_i_counter + 1
                  if l_i_counter > 50 then
                      l_i_counter = 0
                  end
              end
              i_counter = l_i_counter
          end
      end

      Can you confirm whether your performance gain came from defining i_counter as local inside the main loop? Can you share your modified version of the code?

      I see other possibilities, like using while instead of for and defining local variables for the loop counters. I did not try this, as I wanted to keep the same looping structure for all the languages, but it’s a test worth trying and documenting in the article.

      Thanks!

      1. Tobias Kieslich

        Hi Carles,

        In Lua it is very important to declare a variable local even within a single script, because it affects the way Lua accesses it. A global variable is looked up in a hash table, which allows access even across script files. A local variable, however, is accessed by index. Check the following code:
        file1.lua

        global_var = 10
        local local_var = 20

        -- require( 'test_local_include' ) would do something similar
        local c = loadfile( 'test_local_include.lua' )
        c( )

        file2.lua (saved as test_local_include.lua, the name loaded above)

        -- this will print the global_var which is a hashtable lookup in _ENV
        print( 'global variable:', global_var )
        -- this will print local_var as nil, because it is not in the global table
        print( 'local variable:', local_var )

        As you requested, I have created a GitHub repository here. Here are the results. It tests the local/global issue in Lua and also puts it in relation to some other languages. All tests were executed on a Dell XPS 15 with an i7-4712HQ running an up-to-date Arch Linux. (I know the Python code is off by one in the loops.)

        [tobias@zenit blog]$ make
        make all
        make[1]: Entering directory '/home/tobias/coding/lua/nested_loop/blog'
        gcc -Wall -Wextra -Werror -g -O0 -std=c99 -c c_double.c -o c_double.o
        gcc c_double.o -o c_double
        rm c_double.o
        gcc -Wall -Wextra -Werror -g -O0 -std=c99 -c c_float.c -o c_float.o
        gcc c_float.o -o c_float
        rm c_float.o
        gcc -Wall -Wextra -Werror -g -O0 -std=c99 -c c_int.c -o c_int.o
        gcc c_int.o -o c_int
        rm c_int.o
        make[1]: Leaving directory '/home/tobias/coding/lua/nested_loop/blog'
        [tobias@zenit blog]$ make versions
        gcc -dumpversion
        5.1.0
        node -v
        v0.12.5
        js24 -h | grep Version
        Version: JavaScript-C24.2.0
        luajit -v
        LuaJIT 2.0.4 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/
        lua -v
        Lua 5.3.1 Copyright (C) 1994-2015 Lua.org, PUC-Rio
        php -v | grep built
        PHP 5.6.10 (cli) (built: Jun 11 2015 19:50:24)
        hhvm --version | grep HipHop
        HipHop VM 3.7.3 (rel)
        python2 -V
        Python 2.7.10
        python3 -V
        Python 3.4.3
        pypy -V
        Python 2.7.9 (295ee98b6928, Jun 02 2015, 16:33:44)
        [PyPy 2.6.0 with GCC 5.1.0]
        pypy3 -V
        Python 3.2.5 (b2091e973da69152b3f928bfaabd5d2347e6df46, Nov 18 2014, 20:15:54)
        [PyPy 2.4.0 with GCC 4.9.2]
        ruby -v
        ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
        [tobias@zenit blog]$ make runall
        ./c_int
        Result: 37 21.392 seconds
        ./c_float
        Result: 37.0000 33.230 seconds
        ./c_double
        Result: 37.0000 33.156 seconds
        node js.js
        Result: 37 12.641 seconds
        js24 js.js
        Result: 37 20.692 seconds
        luajit l_local_01.lua
        Result: 37 9.814 seconds
        luajit l_local_02.lua
        Result: 37 9.776 seconds
        luajit l_global.lua
        Result: 37 15.575 seconds
        lua l_local_01.lua
        Result: 37 176.755 seconds
        lua l_local_02.lua
        Result: 37 176.825 seconds
        lua l_global.lua
        Result: 37 519.820 seconds
        ruby rb.rb
        Result: 37 574.644391461 seconds
        hhvm php.php
        Result: 37 51 seconds
        php php.php
        Result: 37 404 seconds
        pypy python.py
        Result: 46 23.433 seconds
        pypy3 python.py
        Result: 46 55.445 seconds
        python2 python.py
        Result: 46 1216.872 seconds
        python3 python.py
        Result: 46 1217.423 seconds
        [tobias@zenit blog]$

  18. junk0xc0de

    I’m just wondering how assembler (the lowest-level programming language) could be at rank #10?
    I think there’s something wrong! You know… maybe you’re just not good at assembly 🙂

    1. Carles Mateo Post author

      Hi 🙂 There’s nothing wrong.
      In the article everything is explained.
      The Just-In-Time compilers are doing a very good job. So good, in fact, that they evaluate and optimize the programs before launching them. This includes optimizing loops and math operations and using the best instruction set available.
      In fact, Assembler is not language number ten. The first 4 are different versions of Java that perform at the same speed. Then comes Go, which also uses a JIT compiler, then LuaJIT (without a JIT, Lua is very slow), then C and C++. And as the article shows by disassembling the C code, the Assembler code could run as fast as the C or C++ versions, but it cannot beat the modern JIT compilers. That is one of the conclusions.
      I recommend that you read the article carefully; its conclusions are very interesting. 🙂
      Cheers.

      1. junk0xc0de

        Hi 🙂 I just tested the code with C++ and LuaJIT and the result was amazing! The program compiled with C++ took about 28 sec to run, and the program run with LuaJIT took about 12 sec. The result really surprised me 🙂 I thought JIT compilers optimize the code, so I decided to recompile the C++ code with the -O3 flag, and I got surprised again 🙂
        The program ran in 1 msec or less! I can’t believe it! I never knew that -O1, -O2, -O3 could optimize the code so powerfully. The time difference between the -O3 optimized program (1 msec) and the unoptimized program (28 sec) scares me 🙂

        1. Carles Mateo Post author

          Hi!

          🙂 Glad you were amazed! 🙂
          This is really interesting. JIT compilers nowadays are really amazing. They do a lot of optimizations and use instructions available for the architecture running the program.

          The problem with -O3 is that it does not always suit a benchmark. In this case it optimizes “too much”: the compiler can work out that many of the loops are not necessary to get the final result and skip them, so the timing no longer measures the loops. That’s why in the sample code I did:

          // This is another trick to avoid compiler’s optimization. To use the var somewhere
          printf("Counter: %i\n", i_counter);

          If it ran correctly, your program should print 37.

          I got scared at the beginning as well 😉

          1. junk0xc0de

            Hello again 🙂 Yes, -O3 generates different machine code and skips the loops. I marked the counter variables as “volatile” and the optimization was no longer applied to the loops.
            Anyway, thanks man, this performance test really taught me new stuff 😉

      2. Ian

        FYI Go does not use a JIT. It is AOT compiled. Using int will cause the compiler to use the most efficient integer for the architecture it is being compiled for.

  19. Pingback: csort, my algorithm that heavily beats quicksort | Carles Mateo

  20. mpapec

    In this particular case, Perl-style loops are about 45% faster than the original code (v5.20).

    for my $i_loop1 (0 .. 9) { .. }

    1. Carles Mateo Post author

      Apologies for the late reply. I was busy and I needed to have a window of time with the computer unloaded from other programs and services to be able to perform the tests in the right conditions.

      You were right, mpapec.
      With the Perl-style loops, the code ran much faster, going from the original 796 seconds to just 436 seconds. It’s very curious, but it is what it is.

      I’m preparing an update to the document and the graphics, also I’ve been requested to benchmark PHP 7, and will be publishing it soon.

      Thanks!
