# csort, my algorithm that heavily beats quicksort

Here is my proposal for sorting much, much faster than quicksort. Time and memory consumption in C, for my algorithm csort, versus quicksort, with an universe of 50 Million registers, with values duplicated from 0 to 1,000,000. Less is better

All the algorithms must be checked with different universes of data and their utility is limited to some situations. I explain here the algorithm, show the logic behind code in C, in PHP and Java, the metricas, push the boundaries of the computer… and test against different sets of data looking for the big-O. I prove how much more perfomant it is and explain everything.

The code is written in C to be sure that we compare everything in the same order of magnitude and to avoid optimizations from the compilers / precompilers / Just in Time engines that could bias the conclusions, and so knowing that we are playing fair, but there is a porting to PHP so everyone can understand the algorithm and there are some utilities written in PHP as well. I’ll also explain certain interesting things about PHP internal functions and how it handles memory, and other goodies about C. Finally the code is also available in Java, and is possible to observe how clearly my csort beats quicksort and how fast que JIT compiler performs. Perhaps you can be interested into reading my previous article performance of several languages.

We need a huge amount of data in order to be able to demonstrate the time differences clearly.

Some of the PHP tests and proves require you to have a decent computer with at least 16 GB of RAM. You can skip those or reduce the magnitude of the set of data, or simply use C.

Take this article as an intellectual entertainment, although I think you will discover this useful algorithm / technique of mine will allow you to sort much faster than quicksort in most of the cases I can imagine, and so you will have another tool in your set of resources for sorting fast and dealing with heavy data structures.

Problem

An initial intellectual problem description to enter into matter:

Having a set of 50,000,000 numbers, with values possible between 0 and 1,000,000:

1. Sort them and eliminate duplicates as fast as possible

To be fair and trustworthy the numbers will be generated and saved to disk, and the same set of numbers will be used for comparing between all the algorithms and tests and when checking with different languages.

Generating the numbers

For this test I created an array of numeric items.

I created a .txt file filled with the numbers, and then I load the same set of numbers for all the tests so all the numbers are always the same and results cannot be biased by a random set very lucky for the algorithm to sort.

Code in PHP to generate the random numbers and save to a file

If you prefer to generate a new set of numbers, you can do it by yourselves.

Please note: my computers have a lot of memory, but if you have few memory you can easily modify this program to save the values one by one and so saving memory by not having all the array in memory. Although I recommend you to generate big arrays and pass to a function while measuring the memory consumption to really understand how PHP manages passing Arrays to functions and memory. I found many Senior devs wrong about it.

``````<?php
/**
* Created by   : carles.mateo
* Creation Date: 2015-05-22 18:36
*/
error_reporting(E_ALL);
// To avoid raising warnings from HHVM
date_default_timezone_set('Europe/Andorra');

// This freezes the server, using more than 35 GB RAM
//define('i_NUM_VALUES', 500000000);
// 50M uses ~ 10 GB RAM
define('i_NUM_VALUES', 50000000);
define('i_MAX_NUMBER', 500000000);

function get_random_values() {

\$st_values = null;

\$s_values_filename = 'carles_sort.txt';

if (file_exists(\$s_values_filename)) {
\$st_values = file(\$s_values_filename);
print_with_datetime("Ensuring numeric values...");
foreach(\$st_values as \$i_key=>\$s_value) {
\$i_value = intval(\$s_value);
\$st_values[\$i_key] = \$i_value;
}
} else {
print_with_datetime("Generating values...");
\$st_values = generate_random_values();
print_with_datetime("Writing values to disk...");
save_values_to_disk(\$s_values_filename, \$st_values);
print_with_datetime("Finished writing values to disk");
}

return \$st_values;
}

function generate_random_values() {

\$i_num_values = i_NUM_VALUES;
\$i_max_number = i_MAX_NUMBER;

\$st_values = array();

for(\$i_counter = 0; \$i_counter<\$i_num_values; \$i_counter++) {
\$i_random = rand(0, \$i_max_number);
\$st_values[] = \$i_random;
}

return \$st_values;

}

function save_values_to_disk(\$s_values_filename, \$st_values) {
\$i_counter = 0;

\$o_file = fopen(\$s_values_filename, 'w');
foreach (\$st_values as \$i_key => \$s_value) {
\$i_counter++;
fwrite(\$o_file, \$s_value.PHP_EOL);
if ((\$i_counter % 100000) == 0) {
// every 100K numbers we show a status message
print_with_datetime("Item num: \$i_counter (\$f_advance_percent%) value: \$s_value written to disk");
}
}
fclose(\$o_file);
}

function get_datetime() {
return date('Y-m-d H:i:s');
}

function print_with_datetime(\$s_string) {
echo get_datetime().' '.\$s_string."\n";
}
print_with_datetime('Starting number generator...');

\$st_test_array = get_random_values();

``````

• Obviously to generate an array from 50M items in PHP you need enough of RAM. In PHP this takes ~9GB, in C you need much less: just 50M * 4 bytes per item = 200 MB, although as my program is multipurpose and use several arrays you will realistic be needing >2 GB RAM.
• Playing with memory_get_usage() ayou will discover that returning a big array from the function to the main code does not work by copying it to the stack, and that by passing the array to the function by value (not specifying by reference), does not copy the values neither. A pointer alike is passed unless the data passed is modified. It used a technique called copy-on-write. Other modern languages use similar techniques as well to save memory. So it is not true that “`A copy of the original \$array is created within the function scope`“. As PHP.net section PHP at the Core: A Hacker’s guide defines “PHP is a dynamic, loosely typed language, that uses copy-on-write and reference counting.
``````        foreach(\$st_values as \$i_key=>\$s_value) {
\$i_value = intval(\$s_value);
\$st_values[\$i_key] = \$i_value;
}``````

This forced conversion to int is because when PHP reads the values from the disk it reads them as String (and it puts a jump of line at the end as well). We convert to make sure we use PHP integer values.

To know the max value of an Integer in your platform do:

``echo PHP_INT_MAX;``

For a 64 bits is: 9223372036854775807 I limited the number of registers sorted in this tests to 50,000,000 because it makes PHP use 12.6GB of RAM, and my computer has 32 GB. I tried with 500,000,000 but the Linux computer started to swap like crazy and had to kill the process by ssh (yes, the keyboard and mouse were irresponsible, I had to login via ssh from another computer and kill the process). Although 500M is a reasonably number for the code in C or Java.

• So the fourth conclusion is that for working with big Data Structures PHP is not recommended because it uses a lot of RAM. You can use it if its flexibility brings you more value than performance and memory consumption.
• Everything that you do can do in PHP will be much much much faster if done in C. Sample status of the advance of the process of writing to disk from PHP (that is slow) 50M values from ranges 0 to 500M

My solution 1: simple csort

I’ve developed several implementations, with several utility. This is the first case.

Please take a look at this table of the cost of sorting an Array by using different algorithms from http://bigocheatsheet.com/ : It is interesting to see the complexity of each algorithm in all the cases.

So Quicksort has a complexity of O(n log(n)) for the best case, and O(n^2) for the worst.

For a 50 Million Array, so n=50M, then 50M * log2(50M) = 1,278,771,237.95 ~ 1.28 Billion My algorithm approach is this: having the unsorted array loaded into the array st_values, I do a main loop like this (remember that we want to sort and remove duplicates):

``````for (l_counter=0; l_counter < l_max_values; l_counter++) {
st_values_sorted[st_values[l_counter]] = st_values[l_counter];
}``````

So, I copy to the destination array st_values_sorted the values, assigning the value as the index of the Array st_values_sorted.

With this:

1. We are sure we will have no duplicates, as if the value is duplicated n times it will be simply reassigned to the same position n times, and so having only it one time as st_values_sorted[n] in the final array.
2. We have an array, that has as index, the same value. So even, if in a foreach loop you would get the values assorted, they’re sorted by their key (hash index numeric value).

Finally, if you want to generate a sorted sequential array or a linked list, we do:

``````        // Resort for a creating a ordered list
l_max_range = l_max_values_sorted + 1;
for (l_counter=0; l_counter < l_max_range; l_counter++) {
if (st_values_sorted[l_counter] > -1 ) {
// If the value existed
st_values_sorted_renum[l_counter_hit] = st_values_sorted[l_counter];
l_counter_hit++;
}
}
``````

The magic of this:

1. In the first loop, we read each item of the array just one time. So  O(n)
2. In the second loop, if we really want to generate a sorted list, we simply loop the sorted and deduplicated Array by index, as seen in the code before, from the initial possible value (0) to the maximum value we can have (in this sample m=1M)
3. So we loop, for 1,000,001 items
4. So in total we do n+m cycles
The cost in the Best and in the Worst case is O(n)
5. This is fucking awesome faster than quicksort
For quicksort time complexity is: Best O(n log n). Worst O(n^2)
6. After quicksort you would have to delete duplicates also, adding some cost (in the case of PHP unique() function is really costly)
7. An optimization is that we can also associate the index, in the process of loading the values from disk, so we save the first loop (step 1) and so we save O(n).
``````    while( b_exit == 0 ) {
printf("Unable to read item %li from file.\n", l_counter);
exit(EXIT_FAILURE);
}
st_values[l_number] = l_number;

l_counter++;
if (l_counter > l_max_values -1) break;
}``````

And we will have the Array sorted (by key) without any sort. That’s why I say…
The fastest sort… is no sorting at all!.

This is the recommended way of using the algorithm because it saves a lot of memory, so, instead of having and Array of 50M for loading the data to sort, and another of 1M to hold the csorted data, we will have only the 1M Array. We will be adding the data using csort and eliminating duplicates and sorting on the way.

8. If in the process of loading the have all the values sorted, we will not do the first loop O(n), so our cost will be O(m) where m<n
9. Finally, if our array is sorted (by index) there is no need to do the second loop to produce a sorted list, as the array is sorted by key. But if we need to do it, the cost is m, otherwise we have no cost at all, it was sorted and deduplicated when loaded from disk!
10. Even doing both loops, this is awesome faster than quicksort (even without deduplicating), for the universes tested:
1. 50M items, possible values 0-1M
2. 50M items, possible values 0-50M
3. 50M items, possible values 0-500M
4. The same proportions with less items

Comparing the performance for the first case

Notes:

• Please note the code has been compiled with -O0 to disable optimizations for what I think is a bug I found in gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2. that manifests in the csort with compression. With the parameter -O2 the times for the sorting are half and better (0.1 seconds for csort basic, around 4.7 for quicksort).
• To use arrays of 500M, so using more than 2GB of ram you should pass the parameter  -mcmodel=medium to gcc or the compiler will return an error “relocation truncated to fit: R_X86_64_32S against .bss”
• Computer used is Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz, as computers nowadays use turbo, times vary a bit from test to test, so tests are done many times and an average value is provided
• Please understand how your language pass the parameters in functions before trying to port this code to another language (passing a pure by value string of 50M items to a function would not be a good idea, even worst for the case of recursive functions)

As you can see in the C code below, I have used two quicksort algorithms, both quicksort take between 8.7 and 10 seconds. (the second algorithm is a bit slower)  This time does not change, no matter if the random numbers were for ranges 0-1M, 0-50M or 0-500M.

My implementation of the quicksort with removing duplicates take just a bit more as I use the same base of the csort for removing the duplicates.

With the csort basic, doing the two loops (so no optimizing on load) the time for sorting with removing duplicates and generating a final array like a list, depends on the maximum possible value: Advantages of csort (for its purpose, so for the universes compatible, so if the range of values m is smaller or not many times bigger than the number of items n):

• Is super-fast
• It saves a lot of memory
• Can be used to remove duplicates or to keep duplicates
• It is possible to use it with hybrid universes (see additional optimizations) and have significant performance gains respect using a single sorting algorithm
• It is very easy to understand and to implement

• It is intended for the universe of max(m) < n. If max(m) (maximum value of the number of random values) is much much higher than n (the number of items), then the sorted by key array will use more memory (bigger array), even if it has few values. (* see csort with compression bellow and Additional optimizations sections)
• In certain border edge cases where n is many more times smaller than the difference between min(m) and max(m) the algorithm could be slower than quicksort. (this is if data does not accomplish max(m) < n)
• The sorted array uses memory. Other algorithms, starting from an initial array, do not need additional arrays in order to sort. Although we can generate a final linked list and remove the arrays from memory during the process we are using that memory. But this was supposing we have an original array not sorted, but actually we don’t need to have an starting array, just csort the Data as you get it to the destination Array, and then you save memory because an Array of 1M takes less memory than an Array of 50M. (See My solution 2: optimized when reading)

My solution 2: optimized when reading

This is basically csort, but using the read from disk process to assign the right index, not only save CPU but a lot of memory.

That way the complexity is totally reduced as data is sorted (by index) and deduplicated when loading, with no cost.

In the case of the optimized read, with an universe of 50M items, ranges 0-1M, it takes the astonishing time of.. 3 ms! So 0.003 seconds. Probably this represents better the idea of beating quicksort: csort 0.003 seconds vs ~10 seconds quicksort.

My solution 3: supporting linked data

Image we have a numeric array with unique values, for example id, and another String array with the names.

For example:

st_values = 1500000; st_names = “Larry Page”;

st_values = 731207; st_names = “Sergey Brin”;

st_values = 23788889; st_names = “Steve Wozniak”;

st_values = 5127890; st_names = “Bill Gates”;

st_values = 5127890; st_names = “Alan Turing”;

st_values = 73215; st_names = “Grace Hopper”;

st_values = 218919; st_names = “Jean E. Sammet”;

st_sorted = 1500000;

st_values = 731207;

We will do:

st_sorted = 0;

st_values = 1;

st_values = 2;

So, we will have csorted for the id, but the value will be pointed to the linked data in st_names[].

This can be used too with csort optimized reading and saving the use in memory of the array st_values[]. Obviously st_names[] must be kept.

My solution 4: csort with compression and supporting duplicates

csort is very quick for sorting and removing duplicates. But what happens if we want to sort and keep the duplicates?

The algorithm csort with compression what does is:

``````    if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
for (l_counter=0; l_counter < l_max_values; l_counter++) {
l_value = st_values[l_counter];
st_values_sorted[l_value] = st_values_sorted[l_value] + 1;
}
l_counter_hit = l_max_values_sorted + 1;
}   ``````

So, it assigns the index, as the other versions of csort, but instead of assigned the same value as the index, it is assigned the number of times that this value is found.

So, assuming we have a list like this:

{0,10,15,0,3,5,21,5,100000,0,7,5}

It will produce:

st_values = 3;
st_values = 0;
st_values = 0;
st_values = 1;
st_values = 0;
st_values = 3;
st_values = 0;
st_values = 1;…
st_values = 1;

From this Array it is possible to get the original linked list sorted easily:
{0,0,0,3,5,5,5,7,10,15,21,100000}

Now, take the first universe of data 50M items, with ranges 0-1M, and you get that you have compressed an originally unsorted array of 50M items into a sorted array of 1M items (1,000,001 to be exact) and you can still access by index. So a compression of 50 to 1, for this universe of numbers. (As you are thinking that won’t compress, but use more space, if the range of numbers is bigger than the number of items.)

If your values repeat a lot, and you have a huge set of items to sort, like in Big Data, this algorithm is wonderful and have huge memory savings and amazing speedy random access to data.

The idea, as mentioned before, is to do not load the data into an Array, and then sort, but to assigning using csort when reading data from disk. That way instead of having an Array of 50M and another of 1M, we would have only a single Array of 1M.

This is the relation of times for the universe 50M, 0-1M:

• csort basic: 0.553 seconds
Please note: this sample including the time to generate a array like a list: st_values_sorted_renum, so the time just to create the csorted array is ~3ms less
• csort opt read: 0.003 seconds
• csort with compression and duplicates support: 0.883 seconds But awesome news are that the csort with compression can be implemented on the load time, like csort opt read.

1) Using overlay to save Memory and run even faster

Ideally min(m) and max(m) should be calculated.

Imagine that you have a universe of data composed by id driver licenses. Guess that those numbers range from 140,000,000 to 190,000,000.

There is no need have create an array of 190M items. Just create an array of 50M, (max(m) – min(m)) and save an overlay to a variable.

l_overlay = 140000000;

Then simply do your loops from 0 to max(m) – l_overlay and when you’re building the final linked list or the sequentially ordered final array add l_overlay to each value.

You will save memory and increase speed.

2) Hybrid universes / spaces

Guess you have a hybrid universe, composed by let’s say 75% of your values in the range of 0-1M and 25% of your values being frontier values, f.e. between 35M and 900M.

``````    // Last value covered by the csort
long l_last_frontier = 1000000;

long l_counter_unsorted_array = 0;

while( b_exit == 0 ) {
printf("Unable to read item %li from file.\n", l_counter);
exit(EXIT_FAILURE);
}

if (l_number > l_last_frontier) {
st_values_unsorted[l_counter_unsorted_array] = l_number;
l_counter_unsorted_array++;
} else {
st_values[l_number] = l_number;
}

l_counter++;
if (l_counter > l_max_values -1) break;
}``````

You can then sort the 25% with another algorithm and merge the csorted array (it goes first) and the 25% (bigger values, go later), now sorted Array, and you’ll have benefit from a 75% super-fast sorting.

3) Hybrid sort big-O

For my tests, as long as you have memory, you can csort up to n * 10, and have significant time saving respect Quicksort. So if you have a really huge universe you can implement a hybrid algorithm that sorts up to m =  n * 10 (l_last_frontier = n * 10) and use another algorithm to sort the rest >m. Finally just add the sorted second block to the sorted from first block results.

4) Hybrid sorting, with undetermined n and m

If you don’t know how long will be your space, like in a mapreduce, or the max value, you can also use the hybrid approach and define an array big enough, like 10M. If your arrays are based on long int, so using 4 bytes per position, it will take only 38 Mbytes of RAM. Peanuts. (Use the l_last_frontier seen in the point before just in case you have bigger values)

When you’re loading the data, and csorting using csort opt read, keep a variable with the max and min values found, so you’ll not have to parse all the 10M, just from min-max in order to generate a final sorted, deduplicated, list or array.

It’s up to you to fine tune and adjust the size for the minimum memory consumption and maximum efficiency according to the nature of your data.

5) Using less memory

If you have stored the data to sort in a Collection, you can reduce an item one by one, so you will save memory for huge amounts of Data. So, you can add that item to csort resulting array, and eliminate from the Collection/LinkedList source. One by one. That way you’ll save memory. With the sample of 50M registers with 1M different random values, it passes from 400MB* of RAM to 4MB of RAM in C. (* I’m calculating here a Linked List structure of 2 Long Data-type, one for the value and another for next item, in C. In other languages gain can be even more spectacular)

csort and Big Data

With the progression of Big Data we are finding ourselves with seas of data. Those universe of data tend to have a much bigger number of registers (n) than the range of values (m).

For example, in the case of devices monitoring temperatures every second, we will have 3,600 rows per hour, 86,400 rows per day, 31,536,000 per year while the range of values read will be a very stable subset, like from 19.0 to 24.0 C. In this example, multiplying the float value per 10 to work always with Integers, we would be working with a subset of 190-240, so m would be 51 if optimized, or 0-240 if not. If we want to get the unique temperatures for the past year, csort will generate just an Array of 51 items, crushing any ordering algorithm in space and time.

The bigger the Data is, the faster csort will perform compared to other sorting algorithms.

The complete code in C This is the complete source code. It supports all the cases and you can invoke it from command line with a parameter for the case number you want to test.

``````/*
* File:   main.c
* Author: Carles Mateo
*
* Created on 2015-05-14 16:45
*/

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <string.h>

/*
* Change does values for your tests, and recompile it
*/
#define l_MAX_ITEMS 50000000
#define l_MAX_VALUE 1000000

static int const i_ALGORITHM_QUICKSORT        = 1;
static int const i_ALGORITHM_QUICKSORT_UNIQUE = 2;
static int const i_ALGORITHM_QUICKSORT2       = 3;
static int const i_ALGORITHM_CSORT_BASIC      = 7;
static int const i_ALGORITHM_CSORT_OPT1       = 8;
static int const i_ALGORITHM_CSORT_COMP       = 9;

#define s_HELP  "CSort by Carles Mateo\n"\
"=====================\n"\
"\n"\
"This demonstrates a sort algorithm faster than quicksort,\n"\
"as proposed in the article.\n"\
"http://blog.carlesmateo.com\n"\
"\n"\
"Please invoke the program with one of these parameters:\n"\
"\n"\
"--help or without parameters for this help message\n"\
"1 To test using quicksort\n"\
"  Only sorts, does not deletes duplicates\n"\
"2 To test using quicksort and removing duplicates\n"\
"3 To test using another implementation of quicksort\n"\
"  Only sorts, does not deletes duplicates\n"\
"7 To test using csort (basic example, sorts and deletes duplicates)\n"\
"8 To test using csort, improvement with indexes assigned when reading\n"\
"  Use to watch the time saving from basic csort\n"\
"9 To test using csort, with indexes assigned when reading and counting\n"\
"  the number of coincidences (compressed output array)\n"\
"\n"\
"Please, make sure that the file carles_sort.txt is present in the\n"\
"current directory.\n"\
"\n"

// Allocate in the BSS segment, which is a part of the heap
static long st_values[l_MAX_ITEMS];
// Only max value is required
static long st_values_sorted[l_MAX_VALUE];
static long st_values_sorted_renum[l_MAX_VALUE];

int init_values(long l_max_values);
void quicksort(long x[l_MAX_ITEMS], long first, long last);
void quicksort2(long arr[], long beg, long end);
void swap(long *a, long *b);
void printwdt(char s_string[]);

int init_values(long l_max_values) {
long l_counter = 0;
for (l_counter = 0; l_counter < l_max_values; l_counter++) {
// Init to -1
st_values[l_counter] = -1;
}
}

int init_values_sorted(long l_max_values, long l_num) {
long l_counter = 0;
for (l_counter = 0; l_counter < l_max_values; l_counter++) {
// Init to -1
st_values_sorted[l_counter] = l_num;
}
}

// Quicksort modified from original sample from http://www.cquestions.com/2008/01/c-program-for-quick-sort.html
void quicksort(long x[l_MAX_ITEMS], long first, long last){
long pivot,j,temp,i;

if (first<last) {
pivot=first;
i=first;
j=last;

while(i<j){
while(x[i]<=x[pivot]&&i<last)
i++;
while(x[j]>x[pivot])
j--;
if(i<j){
temp=x[i];
x[i]=x[j];
x[j]=temp;
}
}

temp=x[pivot];
x[pivot]=x[j];
x[j]=temp;
quicksort(x,first,j-1);
quicksort(x,j+1,last);

}
}

// Modified by Carles Mateo to use long based on http://alienryderflex.com/quicksort/
void quicksort2(long arr[], long beg, long end)
{
if (end > beg + 1)
{
long piv = arr[beg], l = beg + 1, r = end;
while (l < r)
{
if (arr[l] <= piv)
l++;
else
swap(&arr[l], &arr[--r]);
}
swap(&arr[--l], &arr[beg]);
quicksort2(arr, beg, l);
quicksort2(arr, r, end);
}
}

void swap(long *a, long *b)
{
long t=*a; *a=*b; *b=t;
}

/* print with date and time
*/
void printwdt(char s_string[]) {
time_t timer;
char buffer;
struct tm* tm_info;

time(&timer);
tm_info = localtime(&timer);

strftime(buffer, 26, "%Y-%m-%d %H:%M:%S", tm_info);

printf("%s %s\n", buffer, s_string);

}

int main(int argc, char** argv) {

char s_filename[] = "carles_sort.txt";
long l_max_values = l_MAX_ITEMS;
long l_max_values_sorted = l_MAX_VALUE;
char s_ch;
FILE *o_fp;
long l_counter=0;
long l_counter_hit=0;
long l_number;
int b_exit = 0;
int i_algorithm = 0;
// Small counter for printing some final results
int i_counter=0;
// temp value for the final print
long l_value=0;
// Used for counting the number of results on optimized array
long l_total_items_sorted = 0;
long l_max_range;
long l_items_found = 0;

if (argc < 2) {
printf(s_HELP);
exit(EXIT_SUCCESS);
}

printwdt("Starting program...");
if (strcmp(argv, "1") == 0) {
i_algorithm = i_ALGORITHM_QUICKSORT;
printwdt("Algorithm: Quicksort with duplicates");
} else if (strcmp(argv, "2") == 0) {
i_algorithm = i_ALGORITHM_QUICKSORT_UNIQUE;
printwdt("Algorithm: Quicksort removing duplicates");
} else if (strcmp(argv, "3") == 0) {
i_algorithm = i_ALGORITHM_QUICKSORT2;
printwdt("Algorithm: Quicksort alternative");
} else if (strcmp(argv, "7") == 0) {
i_algorithm = i_ALGORITHM_CSORT_BASIC;
printwdt("Algorithm: csort basic");
} else if (strcmp(argv, "8") == 0) {
i_algorithm = i_ALGORITHM_CSORT_OPT1;
} else if (strcmp(argv, "9") == 0) {
i_algorithm = i_ALGORITHM_CSORT_COMP;
printwdt("Algorithm: csort compress");
}

if (i_algorithm == 0) {
printf(s_HELP);
exit(EXIT_SUCCESS);
}

if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT2) {
// Use this if you want to make tests with partial data to find the end of the values
//printf("Initializing %li values to -1...\n", l_max_values);
//init_values(l_max_values);
}

if (i_algorithm == i_ALGORITHM_CSORT_BASIC || i_algorithm == i_ALGORITHM_CSORT_OPT1 || i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE) {
// We init to -1 as it could happen that some indexes have not a value,
// and we want to know what's the case.
printf("Initializing %li values to -1...\n", l_max_values_sorted + 1);
init_values_sorted(l_max_values_sorted + 1, -1);
}

if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
printf("Initializing %li values_sorted to 0...\n", l_max_values_sorted + 1);
init_values_sorted(l_max_values_sorted + 1, 0);
}

o_fp = fopen(s_filename,"r"); // read mode

if( o_fp == NULL ) {
perror("Error while opening the file.\n");
exit(EXIT_FAILURE);
}

while( b_exit == 0 ) {
printf("Unable to read item %li from file.\n", l_counter);
perror("File ended before reading all the items.\n");
exit(EXIT_FAILURE);
}
if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE ||
i_algorithm == i_ALGORITHM_QUICKSORT2 ||
i_algorithm == i_ALGORITHM_CSORT_BASIC || i_algorithm == i_ALGORITHM_CSORT_COMP) {
st_values[l_counter] = l_number;
}
if (i_algorithm == i_ALGORITHM_CSORT_OPT1) {
st_values[l_number] = l_number;
l_total_items_sorted++;
}

if (l_counter < 10) {
// Echo the reads from the first 10 items, for debugging
printf("Item %li: read %li\n", l_counter, l_number);
}

l_counter++;
if (l_counter > l_max_values -1) break;
}

printf("Sample values read from file: First item: %li Last item: %li\n", st_values, st_values[l_max_values -1]);

fclose(o_fp);

printwdt("Sorting...");

struct timeval time;
gettimeofday(&time, NULL); //This actually returns a struct that has microsecond precision.
long l_microsec_ini = ((unsigned long long)time.tv_sec * 1000000) + time.tv_usec;

if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE) {
quicksort(st_values, 0, l_MAX_ITEMS-1);
l_counter_hit = l_MAX_ITEMS;
}

if (i_algorithm == i_ALGORITHM_QUICKSORT2) {
quicksort2(st_values, 0, l_MAX_ITEMS-1);
l_counter_hit = l_MAX_ITEMS;
}

if (i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE) {
printwdt("Removing duplicates...");
l_counter_hit = 0;
long l_last_known_value = -1;
for (l_counter=0; l_counter < l_max_values; l_counter++) {
if (l_last_known_value != st_values[l_counter]) {
l_last_known_value = st_values[l_counter];
st_values_sorted_renum[l_counter_hit] = st_values[l_counter];
l_counter_hit++;
}
}
}

if (i_algorithm == i_ALGORITHM_CSORT_BASIC) {
// Shrink and set the index
for (l_counter=0; l_counter < l_max_values; l_counter++) {
st_values_sorted[st_values[l_counter]] = st_values[l_counter];
}

// Resort for forearch
l_max_range = l_max_values_sorted + 1;
for (l_counter=0; l_counter < l_max_range; l_counter++) {
if (st_values_sorted[l_counter] > -1 ) {
st_values_sorted_renum[l_counter_hit] = st_values_sorted[l_counter];
l_counter_hit++;
}
}
}

if (i_algorithm == i_ALGORITHM_CSORT_OPT1) {
l_max_range = l_max_values_sorted + 1;
for (l_counter=0; l_counter < l_max_range; l_counter++) {
if (st_values[l_counter] > -1 ) {
st_values_sorted_renum[l_counter_hit] = st_values[l_counter];
l_counter_hit++;
}
}
}

if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
for (l_counter=0; l_counter < l_max_values; l_counter++) {
l_value = st_values[l_counter];
st_values_sorted[l_value] = st_values_sorted[l_value] + 1;
}
l_counter_hit = l_max_values_sorted + 1;
}

struct timeval time_end;
gettimeofday(&time_end, NULL); //This actually returns a struct that has microsecond precision.
long l_microsec_end = ((unsigned long long)time_end.tv_sec * 1000000) + time_end.tv_usec;

long l_microsec_total = l_microsec_end - l_microsec_ini;
long l_milliseconds_total = l_microsec_total / 1000;
double f_seconds = (double)l_milliseconds_total / 1000;
printf("Time used: %li microseconds = %li milliseconds = %f seconds \n", l_microsec_total, l_milliseconds_total, f_seconds);

printf("Total items in the final array %li\n", l_counter_hit);
printf("Sample items in the final array:\n");

if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
l_max_range = l_max_values_sorted + 1;
l_counter_hit = 0;

for (l_counter=0; l_counter < l_max_range; l_counter++) {
if (st_values_sorted[l_counter] > 0) {
l_counter_hit++;
l_items_found = l_items_found + st_values_sorted[l_counter];
}
}
printf("A final array of %li significant items, contains sorted lossless %li\n", l_counter_hit, l_items_found);

} else {
for (i_counter=0; i_counter<10000; i_counter=i_counter+1000) {
/* if (st_values_sorted[i_counter] > -1) {
// If you enjoy looking at values being printed or want to debug :)
printf("item at position %i : %li\n", i_counter, st_values_sorted[i_counter]);
}*/
if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT2) {
l_value = st_values[i_counter];
} else {
l_value = st_values_sorted_renum[i_counter];
}
printf("item at position %i : %li\n", i_counter, l_value);

}

}

return (EXIT_SUCCESS);
}
``````

My solution in PHP

As told before I have implemented my algorithm in PHP, and it outperforms by much the PHP native functions (PHP sort() is a quicksort algorithm written in C, and unique() ). I explain everything below for the PHP lovers, is the same than in C, but with special considerations for PHP language. Even in PHP the efficiency of the algorithm is clearly demonstrated.

Warning, do not attempt to run the PHP code as is in your machine unless you have at least 16 GB of RAM. If you have 8 GB RAM you should reduce the scope to avoid your computer to hang up or swap very badly.

Printing of datetime is done by those functions:

``````function get_datetime() {
return date('Y-m-d H:i:s');
}

function print_with_datetime(\$s_string) {
echo get_datetime().' '.\$s_string."\n";
}``````

So once we have our numbers generated we do a sort(\$st_test_values);

And we calculate the time elapsed with microtime for profiling:

``````\$f_microtime_ini = microtime(true);

print_algorithm_type('PHP sort() method');
sort(\$st_test_array);

\$f_microtime_end = microtime(true);

\$f_microtime_resulting = \$f_microtime_end - \$f_microtime_ini;

echo "Microtime elapsed: \$f_microtime_resulting\n";`````` In this case we did a PHP sort(), and we didn’t even remove the duplicates and it takes 88.5 seconds.

If we do a sort() and array_unique() to get the unique results sorted it takes: 741.27 seconds We do the opposite array_unique() and sort() and the result is: 402.45 seconds

Please note that sorting and doing array_unique takes more time than doing in the opposite order.

This is curious as one may thing that removing duplicates and sorting could be much faster than sorting and removing duplicates, but just sorting takes 88 seconds. I found the explanation looking at the source Source Code of PHP array_unique. Internally it sorts the array (zend_qsort line 2813) for removing the duplicates. So well, we were sorting twice, and sorting a sorted array. This is the worst case for quicksort, so that explains the slowlyness.

Then it enters what I engined. PHP works with arrays, that can have a numeric index. So my point is: the fastest way to sort an array is… not sorting it at all.

We know that the range of values we have is much smaller than the number of registers:

• We have values between 0 and 1,000,000
• We have 50,000,000 of registers

PHP sort() function is written in C, so it should be fastest than any thing we program in PHP. Should be…. really?.

Knowing that the values are 0-1M and the registers are 50M, and knowing the number of operations that a quick sort will do, I figured another way to improve this.

Look at this code:

``````function csort(\$st_test_array) {

\$s_algorithm = 'Carles_Sort';
echo "Testing \$s_algorithm\n";

\$i_loop_counter = 0;
\$st_result_array = Array();

\$i_ending_pos = count(\$st_test_array);
for(\$i_pos = 0; \$i_pos < \$i_ending_pos; \$i_pos++) {
\$i_loop_counter++;
\$i_value = \$st_test_array[\$i_pos];
\$st_result_array[\$i_value] = \$i_value;
}

return \$st_result_array;

}

``````

So, the only think I’m doing is looping the 50M array and in a new array \$st_result_array setting the key to the same value, than the integer value. So if the original \$st_test_array is equal to 574145, then I set st_result_array = 574145. Easy.

With this we achieve three objectives:

1. You eliminate duplicates. With 50M items, with values ranged from 0 to 1M you’ll set a value many times. No problem, it will use it’s index.
2. We have the key assigned, matching the value, so you can reference every single value
3. It’s amazingly faster. Much more than any sorting even in C

If you do a var_export or var_dump or print_r or a foreach you will notice that the first item of the array has key 574145 and value 574145, the second 262240, the third 153341, the fourth 656736 and the fifth 879329. They are not ordered in the way that one will think about an ordered array parsed with foreach(), but the thing is: do we need to sort it?. Not really. Every key is a hash that PHP will access very quickly. So we can do something like this to get the results ordered:

``````    \$i_max_value = max(\$st_result_array);

for(\$i_pos = 0; \$i_pos <= \$i_max_value; \$i_pos++) {
if (isset(\$st_result_array[\$i_pos])) {
\$st_result_array2[] = \$i_pos;
}
}``````

In this code we create a new array, with the values sorted in the sense that a foreach will return them sorted. But instead of recurring 50M registers, we only go to the max value (as values and indexes are the same, the max value also shows the max index), because we now don’t have 50M registers as we eliminated the duplicates, but 1M. And in an universe of 50M values from a range 0-1M, many are duplicates.

Even doing this second step still makes this sorting algorithm written in PHP, much more efficient than the quicksort written in C.

I ported the code to C, with the same data file, and it takes 0.674 seconds (674 milliseconds) and it uses only ~1 GB of RAM. Take in consideration that if the data is directly loaded assigning the right index, so doing like in the next code, it will be blazing fast:

``````    o_fp = fopen(s_filename,"r"); // read mode

if( o_fp == NULL ) {
perror("Error while opening the file.\n");
exit(EXIT_FAILURE);
}

while( 1 ) {
fscanf(o_fp, "%li", &l_number);
// The key is here ;)
st_values[l_number] = l_number;
l_counter++;
if (l_counter > l_max_values -1) break;
}

printf("First item: %li Last item: %li\n", st_values, st_values[l_counter-1]);

fclose(o_fp);``````

So if we only resort in another array to have the values sorted eliminating the non existing values (half the tasks we do in the previous program), it takes and order of milliseconds. Using csort opt read with a busy computer (firefox using 16 GB RAM and many tabs) it takes 108 ms, with a idle computer 3 ms

Now imagine you don’t want to eliminate the duplicates.

You can create an memory optimized array where the index is the number, and the value represents the number of apparitions in the original file:

``````\$st_example = Array( 0 => 1,
3 => 5,
4 => 1);
``````

That would mean that the original sorted array, in form of list would be:

{0, 3, 3, 3, 3, 3, 4}

For the case of C, where the arrays have static size, a value of 0 would indicate there is no presence of that number in the orderer list so:

``````st_example = 1;
st_example = 0;
st_example = 0;
st_example = 5;
st_example = 1;
``````

Even in the case of the resulting C array, it will be much smaller than the original 50M Array, as will have only the size corresponding to the max value. In our sample 1M.

In PHP you can get an Array alike the csort compressed by using a function called array_count_values(). It will not have the non existing items = 0, so it would not generate st_example = 0 neither st_example = 0, but it would generate the others.

So conditions for the algorithm to work blazing fast:

• The max number (max(m)) must be ideally lower than the number of registers (this is true for languages like C or Java. Languages like PHP grow dynamically the Array size and work as Array Hash, so it’s not necessary as we can have st_sorted and st_sorted and the array size will be only 2 items. In this case we can use less memory than the csort algorithm and be faster using foreach() instead as for() from 0 to 1M)
• The most numbers that are repeated, the faster that will work the algorithm

• If you directly load the values form disk assigning the value as a key as well, you’ll save one loop per item (n=50M in this universe) and make it faster (having only to parse count(m), in PHP it doesn’t have to be 0-m).
• The fastest way to sort it, is no sorting at all. You can have the values stored per key, even if a foreach doesn’t start by 0,1,2… simply accessing the value index you need will be super-fast.

Source code in PHP

```<?php
/**
* User: Carles Mateo
* Date: 2015-04-05 14:19
*/

error_reporting(E_ALL);
// To avoid raising warnings from HHVM
date_default_timezone_set('Europe/Andorra');

// This freezes the server, using more than 35 GB RAM
//define('i_NUM_VALUES', 500000000);
// 50M uses ~ 10 GB RAM
define('i_NUM_VALUES', 50000000);
define('i_MAX_NUMBER', 1000000);

define('i_EXIT_SUCCESS', 0);
define('i_EXIT_ERROR', 1);

abstract class CsortDemo {

const i_ALGORITHM_QUICKSORT        = 1;
const i_ALGORITHM_PHP_QSORT        = 4;
const i_ALGORITHM_PHP_ARRAY_FLIP   = 5;
const i_ALGORITHM_PHP_ARRAY_UNIQUE = 6;
const i_ALGORITHM_CSORT_BASIC      = 7;
const i_ALGORITHM_CSORT_OPT1       = 8;

public static function displayHelp() {
\$s_help = <<<EOT
CSort.php by Carles Mateo
=========================

This demonstrates a sort algorithm faster than quicksort,
as proposed in the article.
http://blog.carlesmateo.com

Please invoke the program with one of these parameters:

--help or without parameters for this help message
1 To test using a quicksort algorithm written in PHP
4 To test using PHP sort() (is a quicksort algorithm)
5 To test using PHP array_flip()
6 To test using PHP unique() (is a quicksort algorithm)
7 To test using csort (basic example, sorts and deletes duplicates)
8 To test using csort, improvement with indexes assigned when reading
Use to watch the time saving from basic csort

9 To test using csort, with indexes assigned when reading and counting
the number of coincidences (compressed output array)

Please, make sure that the file carles_sort.txt is present in the
current directory.

EOT;

echo \$s_help;

}

public static function get_random_values(\$i_algorithm) {

\$st_values = Array();
\$i_value = 0;

\$s_values_filename = 'carles_sort.txt';

if (file_exists(\$s_values_filename)) {
if (\$i_algorithm == CsortDemo::i_ALGORITHM_CSORT_OPT1) {
\$o_fp = fopen(\$s_values_filename, 'r');
if (\$o_fp) {
while(\$s_line = fgets(\$o_fp) !== false) {
\$i_value = intval(\$s_line);
\$st_values[\$i_value] = \$i_value;
}
}
} else {
\$st_values = file(\$s_values_filename);
self::print_with_datetime("Ensuring numeric values...");
foreach(\$st_values as \$i_key=>\$s_value) {
\$i_value = intval(\$s_value);
\$st_values[\$i_key] = \$i_value;
}

}
} else {
self::print_with_datetime("Generating values...");
\$st_values = self::generate_random_values();
self::print_with_datetime("Writing values to disk...");
self::save_values_to_disk(\$s_values_filename, \$st_values);
self::print_with_datetime("Finished writing values to disk");
}

return \$st_values;
}

public static function save_values_to_disk(\$s_values_filename, \$st_values) {
\$i_counter = 0;
\$i_max_values = 0;

\$i_max_values = count(\$st_values);

\$o_file = fopen(\$s_values_filename, 'w');
foreach (\$st_values as \$i_key => \$s_value) {
\$i_counter++;
fwrite(\$o_file, \$s_value.PHP_EOL);
if ((\$i_counter % 100000) == 0) {
// every 100K numbers we show a status message
self::print_with_datetime("Item num: \$i_counter (\$f_advance_percent%) value: \$s_value written to disk");
}
}
fclose(\$o_file);
}

public static function generate_random_values() {

\$i_num_values = i_NUM_VALUES;
\$i_max_number = i_MAX_NUMBER;

\$st_values = array();

for(\$i_counter = 0; \$i_counter<\$i_num_values; \$i_counter++) {
\$i_random = rand(0, \$i_max_number);
\$st_values[] = \$i_random;
}

return \$st_values;

}

public static function get_datetime() {
return date('Y-m-d H:i:s');
}

public static function print_with_datetime(\$s_string) {
echo self::get_datetime().' '.\$s_string."\n";
}

public static function print_algorithm_type(\$s_algorithm) {
echo "Testing \$s_algorithm\n";
}

public static function print_results(\$st_values, \$i_initial_pos_to_display, \$i_number_of_results_to_display) {

self::print_with_datetime("Showing results from \$i_initial_pos_to_display, \$i_number_of_results_to_display items :");

\$i_counter = 0;
\$i_counter_found = 0;
for(\$i_counter=0; (\$i_counter_found<\$i_number_of_results_to_display && ((\$i_counter + \$i_initial_pos_to_display) < i_MAX_NUMBER)); \$i_counter++) {
\$i_pos = \$i_initial_pos_to_display + \$i_counter;
if (isset(\$st_values[\$i_pos])) {
echo '['.\$i_pos.'] '.\$st_values[\$i_pos]."\n";
\$i_counter_found++;
}
}

}

// Found on the web
public static function quicksort( \$array ) {
if( count( \$array ) < 2 ) {
return \$array;
}
\$left = \$right = array( );
reset( \$array );
\$pivot_key  = key( \$array );
\$pivot  = array_shift( \$array );
foreach( \$array as \$k => \$v ) {
if( \$v < \$pivot )
\$left[\$k] = \$v;
else
\$right[\$k] = \$v;
}
return array_merge(self::quicksort(\$left), array(\$pivot_key => \$pivot), self::quicksort(\$right));
}

public static function csort_basic(\$st_test_array) {
// Note that even if we do not explicit to PHP to pass array by reference it does it,
// and we do not use more memory unless we try to write to original array

\$s_algorithm = 'Carles_sort_basic';
echo "Testing \$s_algorithm\n";

\$st_result_array = Array();
\$st_result_array2 = Array();

\$i_min_value_pos = null;
\$i_max_value     = null;
\$i_max_value_pos = null;

\$i_temp_value = null;
\$i_loop_counter = 0;

\$i_num_items = count(\$st_test_array);

\$i_initial_pos = 0;
\$i_ending_pos  = \$i_num_items;

for(\$i_pos = \$i_initial_pos; \$i_pos < \$i_ending_pos; \$i_pos++) {
\$i_loop_counter++;

\$i_value = \$st_test_array[\$i_pos];
\$st_result_array[\$i_value] = \$i_value;

// If you want to debug a bit
//var_export(\$st_result_array); sleep(3);

// If you want to debug...
//if (\$i_loop_counter % 1000 == 0) {
//echo "Loop: \$i_loop_counter i_pos: \$i_pos Value: \$i_value\n";
//print_results(\$st_test_array, 0, 10);
//print_results(\$st_test_array, \$i_num_items - 2, 2);
//}

}

// Return in the way of a foreach sorted Array
\$i_max_value = max(\$st_result_array);

for(\$i_pos = 0; \$i_pos <= \$i_max_value; \$i_pos++) {
if (isset(\$st_result_array[\$i_pos])) {
\$st_result_array2[] = \$i_pos;
}
}

return \$st_result_array2;

}

}

// Checking parameters
if (count(\$argv) < 2) {
// Display help and exit
CsortDemo::displayHelp();
exit(i_EXIT_SUCCESS);
}

\$i_algorithm = intval(\$argv);

CsortDemo::print_with_datetime('Starting program...');

\$st_test_array = CsortDemo::get_random_values(\$i_algorithm);

\$i_num_items = count(\$st_test_array);
CsortDemo::print_with_datetime("Starting sorting with \$i_num_items items, i_MAX_NUMBER was ".i_MAX_NUMBER);

\$i_loop_counter = 0;

\$i_initial_pos = 0;
\$i_ending_pos = \$i_num_items - 1;

\$i_min_value     = i_MAX_NUMBER * 2; // A high number for the tests
\$i_min_value_pos = null;
\$i_max_value     = null;
\$i_max_value_pos = null;

\$i_temp_value = null;

\$b_changes_done = false;
\$b_exit = false;

// TODO Test array_count_values
\$f_microtime_ini = microtime(true);

if (\$i_algorithm == CsortDemo::i_ALGORITHM_QUICKSORT) {
\$s_algorithm = 'Quicksort written in PHP';
CsortDemo::print_algorithm_type(\$s_algorithm);
CsortDemo::quicksort(\$st_test_array);
}

if (\$i_algorithm == CsortDemo::i_ALGORITHM_CSORT_BASIC) {
\$st_test_array = CsortDemo::csort_basic(\$st_test_array);
}

if (\$i_algorithm == CsortDemo::i_ALGORITHM_PHP_QSORT) {
CsortDemo::print_algorithm_type('PHP sort() method');
sort(\$st_test_array);
}

if (\$i_algorithm == CsortDemo::i_ALGORITHM_PHP_ARRAY_FLIP) {
CsortDemo::print_algorithm_type('PHP array_flip() method');
\$st_test_array = array_flip(\$st_test_array);
}

if (\$i_algorithm == CsortDemo::i_ALGORITHM_PHP_ARRAY_UNIQUE) {
CsortDemo::print_algorithm_type('PHP array_unique() method');
\$st_test_array = array_unique(\$st_test_array);
}

\$f_microtime_end = microtime(true);

\$f_microtime_resulting = \$f_microtime_end - \$f_microtime_ini;

echo "Microtime elapsed: \$f_microtime_resulting\n";

CsortDemo::print_with_datetime('Results:');
//print_r(\$st_test_array);
\$i_total_items = count(\$st_test_array);
echo 'Total items:'.\$i_total_items."\n";
CsortDemo::print_results(\$st_test_array, 0, 10);

\$i_last_number  = array_pop(\$st_test_array);
\$i_last_number1 = array_pop(\$st_test_array);
CSortDemo::print_with_datetime('Last values:');
echo \$i_last_number1."\n";
echo \$i_last_number."\n";

CsortDemo::print_with_datetime('Finished program');

```

PHP under HHVM

Additionally I executed the csort.php in the Facebook’s Hip Hop Virtual Machine, version 3.4.0-dev, and as expected everything is much faster than with PHP standard.

For example:

• Using PHP sort() 88-104 seconds, using HHVM PHP sort() 15.8 seconds
• Using csort basic from PHP 34 to 40 seconds, using csort basic from HHVM 3.4 to 5.13 seconds

The version of PHP used for these tests was:

PHP 5.5.9-1ubuntu4.9 (cli) (built: Apr 17 2015 11:44:57)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies

The version of hhvm used for the test was:

HipHop VM 3.4.0-dev (rel)
Repo schema: 0100f1aaeeacd7b9d9a369b34ae5160acb0d1163
Extension API: 20140829

Must also mention that some operations like initializing the arrays, are super-fast with HHVM. The solution in Java

The code has been executed with Oracle’s JDK 1.7.

I love Just In Time compilers. They are amazingly fast.

For this case, the results with Java, for the universe of 50M items, range 0-1M of values:

• quicksort in Java: 3555 ms  =3.555 seconds versus ~10 seconds in C
Important: In C I had to disable -O2 optimization for what it looks like a bug in gcc (only detected in csort with compression). With -O2 times were around 4.7 seconds
• csort.java basic: 26 ms to 120 ms = 0.026 seconds versus 0.5 seconds csort in C
• csort.java opt read: 2-3 ms = 0.002-0.003 seconds versus 3 ms = 0.003 seconds in C
• csort.java with compression and duplicates support: 160 ms = 0.16 seconds versus 0.272-0.277 segons in C
• The execution on those languages (Java and C) is so fast that it’s hard to measure accurately as any random execution of Ubuntu in background will increase a bit the time.

The code in Java:

``````/*
* This code, except quicksort implementation, is written by Carles Mateo,
* and published as Open Source.
*/
package csort;

import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

/**
*
* @author carles
*/
public class Csort {

public static final int i_MAX_REGISTERS = 50000000;
public static final int i_MAX_VALUE     = 1000000;
public static final String s_FILE = "/home/carles/Desktop/codi/php/tests/carles_sort.txt";

public static final int i_ERROR_SUCCESS = 0;
public static final int i_ERROR_FAILURE = 1;

public static final int i_ALGORITHM_QUICKSORT        = 1;
public static final int i_ALGORITHM_CSORT_BASIC      = 7;
public static final int i_ALGORITHM_CSORT_OPT1       = 8;
public static final int i_ALGORITHM_CSORT_COMP       = 9;

public int[] st_values;
// Our Csorted array
public int[] st_values_sorted;
// To resort as a alike list array
public int[] st_values_sorted_renum;

// For quicksort
private int[] numbers;
private int number;

public void systemExit(String s_reason, int i_error_code) {
printWithDatetime("Exiting to system, cause: " + s_reason);
System.exit(i_error_code);
}

public static String getDateTime() {
DateFormat o_dateformat = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss");
Date o_date = new Date();

return o_dateformat.format(o_date);
}

public static void printWithDatetime(String s_message) {
System.out.println(getDateTime() + " " + s_message);
}

public void printAlgorith(String s_algorithm) {
Csort.printWithDatetime("Using algorithm: " + s_algorithm);
}

public static void showHelp() {

String s_help = "" +
"CSort.java by Carles Mateo\n" +
"==========================\n" +
"\n" +
"This demonstrates a sort algorithm faster than quicksort,\n" +
"as proposed in the article.\n" +
"http://blog.carlesmateo.com\n" +
"\n" +
"Please invoke the program with one of these parameters:\n" +
"\n" +
"--help or without parameters for this help message\n" +
"1 To test using quicksort\n" +
"  Only sorts, does not deletes duplicates\n" +
"7 To test using csort (basic example, sorts and deletes duplicates)\n" +
"8 To test using csort, improvement with indexes assigned when reading\n" +
"  Use to watch the time saving from basic csort\n" +
"\n" +
"Please, make sure that the file carles_sort.txt is present in the\n" +
"current directory.\n" +
"\n";

System.out.println(s_help);

}

int i_value = 0;
int i_counter = 0;

try {

String sCurrentLine;

while ((sCurrentLine = br.readLine()) != null) {
i_value = Integer.parseInt(sCurrentLine);
if (i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1) {
this.st_values[i_value] = i_value;
} else {
this.st_values[i_counter] = i_value;
}

if (i_counter % 500000 == 0) {
float f_percent = ((float) i_counter / Csort.i_MAX_REGISTERS) * 100;
String s_percent_msg = "Read register " + i_counter + " value " + i_value + " " + f_percent + "% read";
System.out.println(s_percent_msg);
}
i_counter++;
}

} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null) br.close();
} catch (IOException ex) {
ex.printStackTrace();
this.systemExit("Error processing the file: " + Csort.s_FILE, Csort.i_ERROR_FAILURE);
}
}

return i_counter;
}

// http://www.vogella.com/tutorials/JavaAlgorithmsQuicksort/article.html
public void qsort(int[] values) {
// check for empty or null array
if (values == null || values.length == 0){
return;
}
this.numbers = values;
number = values.length;
quicksort(0, number - 1);
}

private void quicksort(int low, int high) {
int i = low, j = high;
// Get the pivot element from the middle of the list
int pivot = numbers[low + (high-low)/2];

// Divide into two lists
while (i <= j) {
// If the current value from the left list is smaller then the pivot
// element then get the next element from the left list
while (numbers[i] < pivot) {
i++;
}
// If the current value from the right list is larger then the pivot
// element then get the next element from the right list
while (numbers[j] > pivot) {
j--;
}

// If we have found a values in the left list which is larger then
// the pivot element and if we have found a value in the right list
// which is smaller then the pivot element then we exchange the
// values.
// As we are done we can increase i and j
if (i <= j) {
exchange(i, j);
i++;
j--;
}
}
// Recursion
if (low < j)
quicksort(low, j);
if (i < high)
quicksort(i, high);
}

private void exchange(int i, int j) {
int temp = numbers[i];
numbers[i] = numbers[j];
numbers[j] = temp;
}
// End of http://www.vogella.com/tutorials/JavaAlgorithmsQuicksort/article.html

public void csortBasic() {
int i_counter = 0;
int i_value = 0;
int i_max_range = 0;
int i_counter_hit = 0;

try {
for (i_counter=0; i_counter < i_MAX_REGISTERS; i_counter++) {
i_value = this.st_values[i_counter];
this.st_values_sorted[i_value] = i_value;
}

// Resort for forearch
i_max_range = i_MAX_VALUE;
for (i_counter=0; i_counter < i_max_range; i_counter++) {
if (this.st_values_sorted[i_counter] > -1 ) {
this.st_values_sorted_renum[i_counter_hit] = this.st_values_sorted[i_counter];
i_counter_hit++;
}
}
} catch (Exception e) {
printWithDatetime("Exception captured at i_counter: " + i_counter);
this.systemExit("Exception, so ending", i_ERROR_FAILURE);
}

}

public void csortOptimized() {
int i_counter = 0;
int i_value = 0;
int i_max_range = 0;
int i_counter_hit = 0;

try {
// It was sorted while reading :)

// Resort for forearch/list-style lovers
i_max_range = i_MAX_VALUE;
for (i_counter=0; i_counter < i_max_range; i_counter++) {
if (this.st_values[i_counter] > -1 ) {
this.st_values_sorted_renum[i_counter_hit] = this.st_values[i_counter];
i_counter_hit++;
}
}
} catch (Exception e) {
printWithDatetime("Exception captured at i_counter: " + i_counter);
this.systemExit("Exception, so ending", i_ERROR_FAILURE);
}

}

public void csortCompress() {
int i_counter = 0;
int i_value = 0;

for (i_counter=0; i_counter < i_MAX_REGISTERS; i_counter++) {
i_value = this.st_values[i_counter];
this.st_values_sorted[i_value] = this.st_values_sorted[i_value] + 1;
}

}

public void printItems(int i_num_items, int[] st_array) {
int i_counter = 0;
int i_value   = 0;

for(i_counter=0;i_counter<i_num_items;i_counter++) {
i_value = st_array[i_counter];
System.out.println("Item[" + i_counter + "] value: " + i_value);
}
}

public Csort(int i_registers, int i_max_value) {

// int range is -2,147,483,648 to 2,147,483,647
this.st_values = new int[i_registers];
// If max value is 10 then we could have values from 0 to 10, so a 11 items Array
this.st_values_sorted = new int[i_max_value + 1];
this.st_values_sorted_renum = new int[i_max_value + 1];

}

/**
* @param args the command line arguments
*/
public static void main(String[] args) {

long l_milliseconds_ini   = 0;
long l_milliseconds_end   = 0;
long l_milliseconds_total = 0;

int i_algorithm = 0;
String s_algorithm = "";

int i_counter = 0;

/* Check parameter consistency */
if (args.length == 0 || args.length > 1 || (args.length == 1 && Integer.valueOf(args) == 0)) {
showHelp();
System.exit(i_ERROR_SUCCESS);
}

i_algorithm = Integer.valueOf(args);

Csort o_csort = new Csort(Csort.i_MAX_REGISTERS, Csort.i_MAX_VALUE);

if (i_algorithm == Csort.i_ALGORITHM_CSORT_BASIC || i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1 || i_algorithm == Csort.i_ALGORITHM_QUICKSORT) {
// We init empty position to -1 to differenciate from empty items and from value 0 (that is valid)
Csort.printWithDatetime("Initializing values to -1...");
for(i_counter=0; i_counter < Csort.i_MAX_VALUE; i_counter++) {
o_csort.st_values_sorted[i_counter] = -1;
o_csort.st_values_sorted_renum[i_counter] = -1;
}
} else if (i_algorithm == Csort.i_ALGORITHM_CSORT_COMP) {
// We init empty position to -1 to differenciate from empty items and from value 0 (that is valid)
Csort.printWithDatetime("Initializing values to 0...");
for(i_counter=0; i_counter < Csort.i_MAX_VALUE; i_counter++) {
o_csort.st_values_sorted[i_counter] = 0;
}
} else {
showHelp();
System.exit(i_ERROR_SUCCESS);
}

Csort.printWithDatetime("Starting csort.java...");

l_milliseconds_ini = System.currentTimeMillis();

if (i_algorithm == Csort.i_ALGORITHM_QUICKSORT) {
s_algorithm = "Quicksort";
o_csort.printAlgorith(s_algorithm);

o_csort.qsort(o_csort.st_values);
}

if (i_algorithm == Csort.i_ALGORITHM_CSORT_BASIC) {
s_algorithm = "Csort basic";
o_csort.printAlgorith(s_algorithm);

o_csort.csortBasic();
}

if (i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1) {
o_csort.printAlgorith(s_algorithm);

o_csort.csortOptimized();
}

if (i_algorithm == Csort.i_ALGORITHM_CSORT_COMP) {
s_algorithm = "Csort compress with duplicates";
o_csort.printAlgorith(s_algorithm);

o_csort.csortCompress();
}

l_milliseconds_end = System.currentTimeMillis();

l_milliseconds_total = l_milliseconds_end - l_milliseconds_ini;

Csort.printWithDatetime("Algorithm execution time: " + l_milliseconds_total + " milliseconds");

if (i_algorithm == Csort.i_ALGORITHM_CSORT_BASIC || i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1) {
o_csort.printItems(10, o_csort.st_values_sorted_renum);
}

if (i_algorithm == Csort.i_ALGORITHM_QUICKSORT) {
o_csort.printItems(10, o_csort.numbers);
}

if (i_algorithm == Csort.i_ALGORITHM_CSORT_COMP) {
o_csort.printItems(10, o_csort.st_values_sorted);
}

}

}
``````

Note: This article is growing. I’ll be expanding the article, putting more samples, detailing timmings from different universes I’ve tested and screenshots, specially in the PHP section, to add more tricks and whys, but I wanted to put the base for being able to discuss with some colleagues.

# Performance of several languages

Notes on 2017-03-26 18:57 CEST – Unix time: 1490547518 :

1. As several of you have noted, it would be much better to use a random value, for example, read by disk. This will be an improvement done in the next benchmark. Good suggestion thanks.
2. Due to my lack of time it took more than expected updating the article. I was in a long process with google, and now I’m looking for a new job.
3. I note that most of people doesn’t read the article and comment about things that are well indicated on it. Please before posting, read, otherwise don’t be surprise if the comment is not published. I’ve to keep the blog clean of trash.
4. I’ve left out few comments cause there were disrespectful. Mediocrity is present in the society, so simply avoid publishing comments that lack the basis of respect and good education. If a comment brings a point, under the point of view of Engineering, it is always published.

Thanks.

(This article was last updated on 2015-08-26 15:45 CEST – Unix time: 1440596711. See changelog at bottom)

One may think that Assembler is always the fastest, but is that true?.

If I write a code in Assembler in 32 bit instead of 64 bit, so it can run in 32 and 64 bit, will it be faster than the code that a dynamic compiler is optimizing in execution time to benefit from the architecture of my computer?.

What if a future JIT compiler is able to use all the cores to execute a single thread developed program?.

Are PHP, Python, or Ruby fast comparing to C++?. Does Facebook Hip Hop Virtual machine really speeds PHP execution?.

This article shows some results and shares my conclusions. It is as a base to discuss with my colleagues. Is not an end, we are always doing tests, looking for the edge, and looking at the root of the things in detail. And often things change from one version to the other. This article shows not an absolute truth, but brings some light into interesting aspects.

It could show the performance for the certain case used in the test, although generic core instructions have been selected. Many more tests are necessary, and some functions differ in the performance. But this article is a necessary starting for the discussion with my IT-extreme-lover friends and a necessary step for the next upcoming tests.

It brings very important data for Managers and Decision Makers, as choosing the adequate performance language can save millions in hardware (specially when you use the Cloud and pay per hour of use) or thousand hours in Map Reduce processes.

## Acknowledgements and thanks

Credit for the great Eduard Heredia, for porting my C source code to:

• Go
• Ruby
• Node.js

And for the nice discussions of the results, an on the optimizations and dynamic vs static compilers.

Thanks to Juan Carlos Moreno, CTO of ECManaged Cloud Software for suggesting adding Python and Ruby to the languages tested when we discussed my initial results.

Thanks to Joel Molins for the interesting discussions on Java performance and garbage collection.

Thanks to Cliff Click for his wonderful article on Java vs C performance that I found when I wanted to confirm some of my results and findings.

I was inspired to do my own comparisons by the benchmarks comparing different framework by techempower. It is amazing to see the results of the tests, like how C++ can serialize JSon 1,057,793 times per second and raw PHP only 180,147 (17%).

# For the impatients

I present the results of the tests, and the conclusions, for those that doesn’t want to read about the details. For those that want to examine the code, and the versions of every compiler, and more in deep conclusions, this information is provided below.

# Results

This image shows the results of the tests with every language and compiler.

All the tests are invoked from command line. All the tests use only one core. No tests for the web or frameworks have been made, are another scenarios worth an own article.

More seconds means a worst result. The worst is Bash, that I deleted from the graphics, as the bar was crazily high comparing to others.

* As later is discussed my initial Assembler code was outperformed by C binary because the final Assembler code that the compiler generated was better than mine. Table of times:

Seconds executing Language Compiler used Version
6 s. Java Oracle Java Java JDK 8
6 s. Java Oracle Java Java JDK 7
6 s. Java Open JDK OpenJDK 7
6 s. Java Open JDK OpenJDK 6
7 s. Go Go Go v.1.3.1 linux/amd64
7 s. Go Go Go v.1.3.3 linux/amd64
8 s. Lua LuaJit Luajit 2.0.2
10 s. C++ g++ g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
10 s. C gcc gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
10 s.
(* first version was 13 s. and then was optimized)
Assembler nasm NASM version 2.10.09 compiled on Dec 29 2013
10 s. Nodejs nodejs Nodejs v0.12.4
14 s. Nodejs nodejs Nodejs v0.10.25
18 s. Go Go go version xgcc (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157] linux/amd64
20 s. Phantomjs Phantomjs phantomjs 1.9.0
21 s. Phantomjs Phantomjs phantomjs 2.0.1-development
38 s. PHP Facebook HHVM HipHop VM 3.4.0-dev (rel)
44 s. Python Pypy Pypy 2.2.1 (Python 2.7.3 (2.2.1+dfsg-1, Nov 28 2013, 05:13:10))
52 s. PHP Facebook HHVM HipHop VM 3.9.0-dev (rel)
52 s. PHP Facebook HHVM HipHop VM 3.7.3 (rel)
128 s. PHP PHP PHP 7.0.0alpha2 (cli) (built: Jul 3 2015 15:30:23)
278 s. Lua Lua Lua 2.5.3
294 s. Gambas3 Gambas3 3.7.0
316 s. PHP PHP PHP 5.5.9-1ubuntu4.3 (cli) (built: Jul 7 2014 16:36:58)
317 s. PHP PHP PHP 5.6.10 (cli) (built: Jul 3 2015 16:13:11)
323 s. PHP PHP PHP 5.4.42 (cli) (built: Jul 3 2015 16:24:16)
436 s. Perl Perl Perl 5.18.2
523 s. Ruby Ruby ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]
694 s. Python Python Python 2.7.6
807 s. Python Python Python 3.4.0
47630 s. Bash GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)

## Conclusions and Lessons Learnt

1. There are languages that will execute faster than a native Assembler program, thanks to the JIT Compiler and to the ability to optimize the program at runtime for the architecture of the computer running the program (even if there is a small initial penalty of around two seconds from JIT when running the program, as it is being analysed, is it more than worth in our example)
2. Modern Java can be really fast in certain operations, it is the fastest in this test, thanks to the use of JIT Compiler technology and a very good implementation in it
3. Oracle’s Java and OpenJDK shows no difference in performance in this test
4. Script languages really sucks in performance. Python, Perl and Ruby are terribly slow. That costs a lot of money if you Scale as you need more Server in the Cloud
5. JIT compilers for Python: Pypy, and for Lua: LuaJit, make them really fly. The difference is truly amazing
6. The same language can offer a very different performance using one version or another, for example the go that comes from Ubuntu packets and the last version from official page that is faster, or Python 3.4 is much slower than Python 2.7 in this test
7. Bash is the worst language for doing the loop and inc operations in the test, lasting for more than 13 hours for the test
8. From command line PHP is much faster than Python, Perl and Ruby
9. Facebook Hip Hop Virtual Machine (HHVM) improves a lot PHP’s speed
10. It looks like the future of compilers is JIT.
11. Assembler is not always the fastest when executed. If you write a generic Assembler program with the purpose of being able to run in many platforms you’ll not use the most powerful instructions specific of an architecture, and so a JIT compiler can outperform your code. An static compiler can also outperform your code with very clever optimizations. People that write the compilers are really good. Unless you’re really brilliant with Assembler probably a C/C++ code beats the performance of your code. Even if you’re fantastic with Assembler it could happen that a JIT compiler notices that some executions can be avoided (like code not really used) and bring magnificent runtime optimizations. (for example a near JMP is much more less costly than a far JMP Assembler instruction. Avoiding dead code could result in a far JMP being executed as near JMP, saving many cycles per loop)
12. Optimizations really needs people dedicated to just optimizations and checking the speed of the newly added code for the running platforms
13. Node.js was a big surprise. It really performed well. It is promising. New version performs even faster
14. go is promising. Similar to C, but performance is much better thanks to deciding at runtime if the architecture of the computer is 32 or 64 bit, a very quick compilation at launch time, and it compiling to very good assembler (that uses the 64 bit instructions efficiently, for example)
15. Gambas 3 performed surprisingly fast. Better than PHP
16. You should be careful when using C/C++ optimization -O3 (and -O2) as sometimes it doesn’t work well (bugs) or as you may expect, for example by completely removing blocks of code if the compiler believes that has no utility (like loops)
17. Perl performance really change from using a for style or another. (See Perl section below)
18. Modern CPUs change the frequency to save energy. To run the tests is strictly recommended to use a dedicated machine, disabling the CPU governor and setting a frequency for all the cores, booting with a text only live system, without background services, not mounting disks, no swap, no network

# Explanations in details

Obviously an statically compiled language binary should be faster than an interpreted language.

C or C++ are much faster than PHP. And good code machine is much faster of course.

But there are also other languages that are not compiled as binary and have really fast execution.

For example, good Web Java Application Servers generate compiled code after the first request. Then it really flies.

For web C# or .NET in general, does the same, the IIS Application Server creates a native DLL after the first call to the script. And after this, as is compiled, the page is really fast.

With C statically linked you could generate binary code for a particular processor, but then it won’t work in other processors, so normally we write code that will work in all the processors at the cost of not using all the performance of the different CPUs or use another approach and we provide a set of different binaries for the different architectures. A set of directives doing one thing or other depending on the platform detected can also be done, but is hard, long and tedious job with a lot of special cases treatment. There is another approach that is dynamic linking, where certain things will be decided at run time and optimized for the computer that is running the program by the JIT (Just-in-time) Compiler.

Java, with JIT is able to offer optimizations for the CPU that is running the code with awesome results. And it is able to optimize loops and mathematics operations and outperform C/C++ and Assembler code in some cases (like in our tests) or to be really near in others. It sounds crazy but nowadays the JIT is able to know the result of several times executed blocks of code and to optimize that with several strategies, speeding the things incredible and to outperform a code written in Assembler. Demonstrations with code is provided later.

A new generation has grown knowing only how to program for the Web. Many of them never saw Assembler, neither or barely programmed in C++.

None of my Senior friends would assert that a technology is better than another without doing many investigations before. We are serious. There is so much to take in count, so much to learn always, that one has to be sure that is not missing things before affirming such things categorically. If you want to be taken seriously, you have to take many things in count.

# Environment for the tests

## Hardware and OS

Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz with 32 GB RAM and SSD Disk.

Ubuntu Desktop 14.04 LTS 64 bit

## Software base and compilers

### PHP versions

Shipped with my Ubuntu distribution:

`php -v`
```PHP 5.5.9-1ubuntu4.3 (cli) (built: Jul  7 2014 16:36:58)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies```

Compiled from sources:

```PHP 5.6.10 (cli) (built: Jul  3 2015 16:13:11)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies```
```PHP 5.4.42 (cli) (built: Jul  3 2015 16:24:16)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2014 Zend Technologies```

### Java 8 version

`java -showversion`
```java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)```

### C++ version

`g++ -v`
```Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.2-19ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)```

### Gambas 3

`gbr3 --version`
`3.7.0`

`go version`
`go version go1.3.1 linux/amd64`

### Go (Ubuntu packages)

`go version`
`go version xgcc (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157] linux/amd64`

### Nasm

`nasm -v`
`NASM version 2.10.09 compiled on Dec 29 2013`

### Lua

`lua -v`
`Lua 5.2.3  Copyright (C) 1994-2013 Lua.org, PUC-Rio`

### Luajit

`luajit -v`
`LuaJIT 2.0.2 -- Copyright (C) 2005-2013 Mike Pall. http://luajit.org/`

### Nodejs

Installed with apt-get install nodejs:

`nodejs --version`
`v0.10.25`

Installed by compiling the sources:

`node --version`
`v0.12.4`

## Phantomjs

Installed with apt-get install phantomjs:

`phantomjs --version`
`1.9.0`

Compiled from sources:

`/path/phantomjs --version`
`2.0.1-development`

### Python 2.7

`python --version`
`Python 2.7.6`

### Python 3

`python3 --version`
`Python 3.4.0`

## Perl

`perl -version`
```This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
(with 41 registered patches, see perl -V for more detail)```

## Bash

`bash --version`
```GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2013 Free Software Foundation, Inc.

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.```

## Test: Time required for nested loops

This is the first sample. It is an easy-one.

The main idea is to generate a set of nested loops, with a simple counter inside.

When the counter reaches 51 it is set to 0.

This is done for:

1. Preventing overflow of the integer if growing without control
2. Preventing the compiler from optimizing the code (clever compilers like Java or gcc with -O3 flag for optimization, if it sees that the var is never used, it will see that the whole block is unnecessary and simply never execute it)

Doing only loops, the increment of a variable and an if, provides us with basic structures of the language that are easily transformed to Assembler. We want to avoid System calls also.

This is the base for the metrics on my Cloud Analysis of Performance cmips.net project.

Here I present the times for each language, later I analyze the details and the code.

Take in count that this code only executes in one thread / core.

## C++

C++ result, it takes 10 seconds.

Code for the C++:

``````/*
* File:   main.cpp
* Author: Carles Mateo
*
* Created on August 27, 2014, 1:53 PM
*/

#include <cstdlib>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <ctime>

using namespace std;

typedef unsigned long long timestamp_t;

static timestamp_t get_timestamp()
{
struct timeval now;
gettimeofday (&now, NULL);
return  now.tv_usec + (timestamp_t)now.tv_sec * 1000000;
}

int main(int argc, char** argv) {

timestamp_t t0 = get_timestamp();

// current date/time based on current system
time_t now = time(0);

// convert now to string form
char* dt_now = ctime(&now);

printf("Starting at %s\n", dt_now);

int i_loop1 = 0;
int i_loop2 = 0;
int i_loop3 = 0;

for (i_loop1 = 0; i_loop1 < 10; i_loop1++) {
for (i_loop2 = 0; i_loop2 < 32000; i_loop2++) {
for (i_loop3 = 0; i_loop3 < 32000; i_loop3++) {
i_counter++;

if (i_counter > 50) {
i_counter = 0;
}
}
// If you want to test how the compiler optimizes that, remove the comment
//i_counter = 0;
}
}

// This is another trick to avoid compiler's optimization. To use the var somewhere
printf("Counter: %i\n", i_counter);

timestamp_t t1 = get_timestamp();
double secs = (t1 - t0) / 1000000.0L;
time_t now_end = time(0);

// convert now to string form
char* dt_now_end = ctime(&now_end);

printf("End time: %s\n", dt_now_end);

return 0;
}`````` You can try to remove the part of code that makes the checks:

``````                /* if (i_counter > 50) {
i_counter = 0;
}*/``````

And the use of the var, later:

``    //printf("Counter: %i\n", i_counter);``

Note: And adding a i_counter = 0; at the beginning of the loop to make sure that the counter doesn’t overflows. Then the C or C++ compiler will notice that this result is never used and so it will eliminate the code from the program, having as result and execution time of 0.0 seconds.

## Java

The code in Java:

``````package cpu;

/**
*
* @author carles.mateo
*/
public class Cpu {

/**
* @param args the command line arguments
*/
public static void main(String[] args) {

int i_loop1 = 0;
//int i_loop_main = 0;
int i_loop2 = 0;
int i_loop3 = 0;
int i_counter = 0;

String s_version = System.getProperty("java.version");

System.out.println("Java Version: " + s_version);

System.out.println("Starting cpu.java...");

for (i_loop1 = 0; i_loop1 < 10; i_loop1++) {
for (i_loop2 = 0; i_loop2 < 32000; i_loop2++) {
for (i_loop3 = 0; i_loop3 < 32000; i_loop3++) {
i_counter++;

if (i_counter > 50) {
i_counter = 0;
}
}
}
}

System.out.println(i_counter);
System.out.println("End");
}

}

``````

It is really interesting how Java, with JIT outperforms C++ and Assembler.

It takes only 6 seconds. ## Go

The case of Go is interesting because I saw a big difference from the go shipped with Ubuntu, and the the go I downloaded from http://golang.org/dl/. I downloaded 1.3.1 and 1.3.3 offering the same performance. 7 seconds.

``````package main

import ("fmt"
"time")

func main() {
fmt.Printf("Starting: %s", time.Now().Local())
var i_counter = 0;
for i_loop1 := 0; i_loop1 < 10; i_loop1++ {
for i_loop2 := 0; i_loop2 < 32000; i_loop2++ {
for i_loop3 := 0; i_loop3 < 32000; i_loop3++ {
i_counter++;
if i_counter > 50 {
i_counter = 0;
}
}
}
}

fmt.Printf("\nCounter: %#v", i_counter)
fmt.Printf("\nEnd: %s\n", time.Now().Local())
}``````

### Assembler

Here is the Assembler for Linux code, with SASM, that I created initially (bellow is optimized).

``````%include "io.inc"

section .text
global CMAIN
CMAIN:
;mov rbp, rsp; for correct debugging
; Set to 0, the faster way
xor     esi, esi

DO_LOOP1:
mov ecx, 10
LOOP1:
mov ebx, ecx
jmp DO_LOOP2
LOOP1_CONTINUE:
mov ecx, ebx

loop LOOP1
jmp QUIT

DO_LOOP2:
mov ecx, 32000
LOOP2:
mov eax, ecx
;call DO_LOOP3
jmp DO_LOOP3
LOOP2_CONTINUE:
mov ecx, eax

loop LOOP2
jmp LOOP1_CONTINUE

DO_LOOP3:
; Set to 32000 loops
MOV ecx, 32000
LOOP3:
inc     esi
cmp     esi, 50
jg      COUNTER_TO_0
LOOP3_CONTINUE:

loop LOOP3
;ret
jmp LOOP2_CONTINUE

COUNTER_TO_0:
; Set to 0
xor     esi, esi

jmp LOOP3_CONTINUE

;    jmp QUIT

QUIT:
xor eax, eax
ret``````

It took 13 seconds to complete.

One interesting explanation on why binary C or C++ code is faster than Assembler, is because the C compiler generates better Assembler/binary code at the end. For example, the use of JMP is expensive in terms of CPU cycles and the compiler can apply other optimizations and tricks that I’m not aware of, like using faster registers, while in my code I use ebx, ecx, esi, etc… (for example, imagine that using cx is cheaper than using ecx or rcx and I’m not aware but the guys that created the Gnu C compiler are) To be sure of what’s going on I switched in the LOOP3 the JE and the JMP of the code, for groups of 50 instructions, INC ESI, one after the other and the time was reduced to 1 second.

(In C also was reduced even a bit more when doing the same)

To know what’s the translation of the C code into Assembler when compiled, you can do:

`objdump --disassemble nested_loops`

Look for the section main and you’ll get something like:

``````0000000000400470 <main>:
400470:    bf 0a 00 00 00           mov    \$0xa,%edi
400475:    31 c9                    xor    %ecx,%ecx
400477:    be 00 7d 00 00           mov    \$0x7d00,%esi
40047c:    0f 1f 40 00              nopl   0x0(%rax)
400480:    b8 00 7d 00 00           mov    \$0x7d00,%eax
400485:    0f 1f 00                 nopl   (%rax)
400488:    83 c2 01                 add    \$0x1,%edx
40048b:    83 fa 33                 cmp    \$0x33,%edx
40048e:    0f 4d d1                 cmovge %ecx,%edx
400491:    83 e8 01                 sub    \$0x1,%eax
400494:    75 f2                    jne    400488 <main+0x18>
400496:    83 ee 01                 sub    \$0x1,%esi
400499:    75 e5                    jne    400480 <main+0x10>
40049b:    83 ef 01                 sub    \$0x1,%edi
40049e:    75 d7                    jne    400477 <main+0x7>
4004a0:    48 83 ec 08              sub    \$0x8,%rsp
4004a4:    be 34 06 40 00           mov    \$0x400634,%esi
4004a9:    bf 01 00 00 00           mov    \$0x1,%edi
4004ae:    31 c0                    xor    %eax,%eax
4004b0:    e8 ab ff ff ff           callq  400460 <__printf_chk@plt>
4004b5:    31 c0                    xor    %eax,%eax
4004b7:    48 83 c4 08              add    \$0x8,%rsp
4004bb:    c3                       retq``````

Note: this is in the AT&T syntax and not in the Intel. That means that add \$0x1,%edx is adding 1 to EDX registerg (origin, destination).

As you can see the C compiler has created a very differed Assembler version respect what I created.
For example at 400470 it uses EDI register to store 10, so to control the number of the outer loop.
It uses ESI to store 32000 (Hexadecimal 0x7D00), so the second loop.
And EAX for the inner loop, at 400480.
It uses EDX for the counter, and compares to 50 (Hexa 0x33) at 40048B.
In 40048E it uses the CMOVGE (Mov if Greater or Equal), that is an instruction that was introduced with the P6 family processors, to move the contents of ECX to EDX if it was (in the CMP) greater or equal to 50. As in 400475 a XOR ECX, ECX was performed, EXC contained 0.
And it cleverly used SUB and JNE (JNE means Jump if not equal and it jumps if ZF = 0, it is equivalent to JNZ Jump if not Zero).
It uses between 4 and 16 clocks, and the jump must be -128 to +127 bytes of the next instruction. As you see Jump is very costly.

Looks like the biggest improvement comes from the use of CMOVGE, so it saves two jumps that my original Assembler code was performing.
Those two jumps multiplied per 32000 x 32000 x 10 times, are a lot of Cpu clocks.

So, with this in mind, as this Assembler code takes 10 seconds, I updated the graph from 13 seconds to 10 seconds.

### Lua

This is the initial code:

``````local i_counter = 0

local i_time_start = os.clock()

for i_loop1=0,9 do
for i_loop2=0,31999 do
for i_loop3=0,31999 do
i_counter = i_counter + 1
if i_counter > 50 then
i_counter = 0
end
end
end
end

local i_time_end = os.clock()
print(string.format("Counter: %i\n", i_counter))
print(string.format("Total seconds: %.2f\n", i_time_end - i_time_start))``````

In the case of Lua theoretically one could take great advantage of the use of local inside a loop, so I tried the benchmark with modifications to the loop:

``````for i_loop1=0,9 do
for i_loop2=0,31999 do
local l_i_counter = i_counter
for i_loop3=0,31999 do
l_i_counter = l_i_counter + 1
if l_i_counter > 50 then
l_i_counter = 0
end
end
i_counter = l_i_counter
end
end``````

I ran it with LuaJit and saw no improvements on the performance.

### Node.js

``````var s_date_time = new Date();
console.log('Starting: ' + s_date_time);

var i_counter = 0;

for (var \$i_loop1 = 0; \$i_loop1 < 10; \$i_loop1++) {
for (var \$i_loop2 = 0; \$i_loop2 < 32000; \$i_loop2++) {
for (var \$i_loop3 = 0; \$i_loop3 < 32000; \$i_loop3++) {
i_counter++;
if (i_counter > 50) {
i_counter = 0;
}
}
}
}

var s_date_time_end = new Date();

console.log('Counter: ' + i_counter + '\n');

console.log('End: ' + s_date_time_end + '\n');``````

Execute with:

nodejs nested_loops.js

## Phantomjs

The same code as nodejs adding to the end:

`phantom.exit(0);`

In the case of Phantom it performs the same in both versions 1.9.0 and 2.0.1-development compiled from sources.

### PHP

The interesting thing on PHP is that you can write your own extensions in C, so you can have the easy of use of PHP and create functions that really brings fast performance in C, and invoke them from PHP.

``````<?php

\$s_date_time = date('Y-m-d H:i:s');

echo 'Starting: '.\$s_date_time."\n";

\$i_counter = 0;

for (\$i_loop1 = 0; \$i_loop1 < 10; \$i_loop1++) {
for (\$i_loop2 = 0; \$i_loop2 < 32000; \$i_loop2++) {
for (\$i_loop3 = 0; \$i_loop3 < 32000; \$i_loop3++) {
\$i_counter++;
if (\$i_counter > 50) {
\$i_counter = 0;
}
}
}
}

\$s_date_time_end = date('Y-m-d H:i:s');

echo 'End: '.\$s_date_time_end."\n";``````

Facebook’s Hip Hop Virtual Machine is a very powerful alternative, that is JIT powered.

```git clone https://github.com/facebook/hhvm.git
cd hhvm
rm -r third-party
git submodule update --init --recursive
./configure
make```

Or just grab precompiled packages from https://github.com/facebook/hhvm/wiki/Prebuilt%20Packages%20for%20HHVM

### Python

``````from datetime import datetime
import time

print ("Starting at: " + str(datetime.now()))
s_unixtime_start = str(time.time())

i_counter = 0

# From 0 to 31999
for i_loop1 in range(0, 10):
for i_loop2 in range(0,32000):
for i_loop3 in range(0,32000):
i_counter += 1
if ( i_counter > 50 ) :
i_counter = 0

print ("Ending at: " + str(datetime.now()))
s_unixtime_end = str(time.time())

i_seconds = long(s_unixtime_end) - long(s_unixtime_start)
s_seconds = str(i_seconds)

print ("Total seconds:" + s_seconds)``````

### Ruby

``````#!/usr/bin/ruby -w

time1 = Time.new

puts "Starting : " + time1.inspect

i_counter = 0;

for i_loop1 in 0..9
for i_loop2 in 0..31999
for i_loop3 in 0..31999
i_counter = i_counter + 1
if i_counter > 50
i_counter = 0
end
end
end
end

time1 = Time.new

puts "End : " + time1.inspect``````

### Perl

The case of Perl was very interesting one.

This is the current code:

``````#!/usr/bin/env perl

print "\$s_datetime Starting calculations...\n";
\$i_counter=0;

\$i_unixtime_start=time();

for my \$i_loop1 (0 .. 9) {
for my \$i_loop2 (0 .. 31999) {
for my \$i_loop3 (0 .. 31999) {
\$i_counter++;
if (\$i_counter > 50) {
\$i_counter = 0;
}
}
}
}

\$i_unixtime_end=time();

\$i_seconds=\$i_unixtime_end-\$i_unixtime_start;

print "Counter: \$i_counter\n";
print "Total seconds: \$i_seconds";``````

But before I created one, slightly different, with the for loops like in the C style:

``````#!/usr/bin/env perl

\$i_counter=0;

\$i_unixtime_start=time();

for (my \$i_loop1=0; \$i_loop1 < 10; \$i_loop1++) {
for (my \$i_loop2=0; \$i_loop2 < 32000; \$i_loop2++) {
for (my \$i_loop3=0; \$i_loop3 < 32000; \$i_loop3++) {
\$i_counter++;
if (\$i_counter > 50) {
\$i_counter = 0;
}
}
}
}

\$i_unixtime_end=time();

\$i_seconds=\$i_unixtime_end-\$i_unixtime_start;

print "Total seconds: \$i_seconds";``````

I repeated this test, with the same version of Perl, due to the comment of a reader (thanks mpapec) that told:

In this particular case perl style loops are about 45% faster than original code (v5.20)

And effectively and surprisingly the time passed from 796 seconds to 436 seconds.

So graphics are updated to reflect the result of 436 seconds.

### Bash

``````#!/bin/bash
echo "Bash version \${BASH_VERSION}..."
date

let "s_time_start=\$(date +%s)"
let "i_counter=0"

for i_loop1 in {0..9}
do
echo "."
date
for i_loop2 in {0..31999}
do
for i_loop3 in {0..31999}
do
((i_counter++))
if [[ \$i_counter > 50 ]]
then
let "i_counter=0"
fi
done
#((var+=1))
#((var=var+1))
#((var++))
#let "var=var+1"
#let "var+=1"
#let "var++"
done
done

let "s_time_end=\$(date +%2)"

let "s_seconds = s_time_end - s_time_start"
echo "Total seconds: \$s_seconds"

# Just in case it overflows
date``````

### Gambas 3

Gambas is a language and an IDE to create GUI applications for Linux.
It is very similar to Visual Basic, but better, and it is not a clone.

I created a command line application and it performed better than PHP. There has been done an excellent job with the compiler. Note: in the screenshot the first test ran for few seconds more than in the second. This was because I deliberately put the machine under some load and I/O during the tests. The valid value for the test, confirmed with more iterations is the second one, done under the same conditions (no load) than the previous tests.

```' Gambas module file MMain.module

Public Sub Main()

' @author Carles Mateo http://blog.carlesmateo.com

Dim i_loop1 As Integer
Dim i_loop2 As Integer
Dim i_loop3 As Integer
Dim i_counter As Integer
Dim s_version As String

i_loop1 = 0
i_loop2 = 0
i_loop3 = 0
i_counter = 0

s_version = System.Version

Print "Performance Test by Carles Mateo blog.carlesmateo.com"
Print "Gambas Version: " & s_version

Print "Starting..." & Now()

For i_loop1 = 0 To 9
For i_loop2 = 0 To 31999
For i_loop3 = 0 To 31999
i_counter = i_counter + 1

If (i_counter > 50) Then
i_counter = 0
Endif
Next
Next
Next

Print i_counter
Print "End " & Now()

End```

Changelog

2015-08-26 15:45

Thanks to the comment of a reader, thanks Daniel, pointing a mistake. The phrase I mentioned was on conclusions, point 14, and was inaccurate. The original phrase told “go is promising. Similar to C, but performance is much better thanks to the use of JIT“. The allusion to JIT is incorrect and has been replaced by this: “thanks to deciding at runtime if the architecture of the computer is 32 or 64 bit, a very quick compilation at launch time, and it compiling to very good assembler (that uses the 64 bit instructions efficiently, for example)”

2015-07-17 17:46

Benchmarked Facebook HHVM 3.9 (dev., the release date is August 3 2015) and HHVM 3.7.3, they take 52 seconds.

Re-benchmarked Facebook HHVM 3.4, before it was 72 seconds, it takes now 38 seconds. I checked the screen captures from 2014 to discard an human error. Looks like a turbo frequency issue on the tests computer, with the CPU governor making it work bellow the optimal speed or a CPU-hungry/IO process that triggered during the tests and I didn’t detect it. Thinking about forcing a fixed CPU speed for all the cores for the tests, like 2.4 Ghz and booting a live only text system without disk access and network to prevent Ubuntu launching processes in the background.

2015-07-05 13:16

Added performance of Phantomjs 1.9.0 installed via apt-get install phantomjs in Ubuntu, and Phantomjs 2.0.1-development.

Added performance of nodejs 0.12.04 (compiled).

Added bash to the graphic. It has so bad performance that I had to edit the graphic to fit in (color pink) in order prevent breaking the scale.

2015-07-03 18:32

Added benchmarks for PHP 7 alpha 2, PHP 5.6.10 and PHP 5.4.42.

2015-07-03 15:13
Thanks to the contribution of a reader (thanks mpapec!) I tried with Perl for style, resulting in passing from 796 seconds to 436 seconds.
(I used the same Perl version: Perl 5.18.2)
Updated test value for Perl.
Added new graphics showing the updated value.

Thanks to the contribution of a reader (thanks junk0xc0de!) added some additional warnings and explanations about the dangers of using -O3 (and -O2) if C/C++.

Updated the Lua code, to print i_counter and do the if i_counter > 50
This makes it take a bit longer, few cents, but passing from 7.8 to 8.2 seconds.
Updated graphics.