Tag Archives: PHP

Installing PHP environment for development in Windows

This article is for my students learning PHP, or for any student who wants to learn PHP using a Windows computer.

For this we will install:

XAMPP, which is available for Windows, Mac OS and Linux.

You can download it from: https://www.apachefriends.org/

XAMPP installs the following together:

  • Apache
  • MariaDB
  • PHP
  • Perl

Install WampServer instead of XAMPP (if you prefer WampServer)

Alternatively you can install WampServer, which installs:

  • Apache
  • MySQL
  • PHP
  • PHPMyAdmin

https://www.wampserver.com/en/

Development IDE

As the Development Environment we will use PhpStorm, from JetBrains.

https://www.jetbrains.com/phpstorm/

Testing the installation of XAMPP

The default directory for the PHP files is C:\xampp\htdocs

Create a file in C:\xampp\htdocs named hello.php with this content:

<?php
$s_today = date("Y-m-d");
echo "Hello! Today is ".$s_today;
?>

Now start Apache:

  1. Open the XAMPP Control Panel
  2. Start the Apache Server
You should see at least port 80 listed for Apache

Then test the new page in your browser by opening:

http://localhost/hello.php

News from the Blog 2022-01-22

News for the Blog

It has been 9 years since I created the blog, and some articles with old content still get many visitors. To make sure they get a clear picture and not obsolete information, all the articles of the taxonomy POST will include the Date and Time when the article was first published.

I also removed the annoying “Leave a comment” link at the top. I think it influences some people to leave comments before actually having read the article.

It is still possible to add comments, but they are at the bottom of the page. I believe it makes more sense this way. This is the way.

Technically that involved modifying the files of my template:

  • functions.php
  • content.php
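
As an illustration only (this is not my exact template code), the change in content.php can be as small as printing the publication date and time with standard WordPress template functions:

<?php
// Hypothetical sketch for content.php: show when the article was first published
printf(
    '<p class="entry-date">Published on %s at %s</p>',
    esc_html(get_the_date()),  // date of first publication
    esc_html(get_the_time())   // time of first publication
);
?>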

My Books

Automating and Provisioning to Amazon AWS using boto3 Amazon’s SDK for Python 3

I finished my book about Automating and Provisioning to Amazon AWS using boto3 Amazon’s SDK for Python 3.

It’s 128 pages in full-size DIN-A4, DRM-Free, and comes with a link to the code samples of a real CLI menu-based project.

Docker Combat Guide

I have updated my book Docker Combat Guide, adding a completely new section, including source code, to work with Docker’s Python SDK.

I show the Docker SDK through the code of an actual CLI program I wrote in Python 3.

Here you can see a video that demonstrates how I launched a project with three Docker Containers via Docker Compose. The Containers run Python, a Flask web server, and Redis as a bridge between the two Python Containers.

All the source code is downloadable with the book.

Watch at 1080p in Full Screen for a better experience

My Classes

In January I resumed the coding classes. I have new students, and a few spots free in my agenda, as some of my students graduated and others have been hired as Software Developers. I could not be more proud. :)

Free Training

Symfony is one of the most popular PHP Frameworks.

You can learn it with these free videos:

https://symfonycasts.com/screencast/symfony/

Software Licenses I Purchased

Before leaving 2021, I registered WinRAR for Windows.

WinRAR is compression Software that has been with us for 19 years.

I’m pretty sure I registered it in the past, but these holidays I was out with only two Windows laptops and I had to do some work for the university, and WinRAR came in handy, so I decided to register it again.

I create Software and Books, and I earn my living with this, so it makes a lot of sense to pay others for their good work crafting Software.

Books I Purchased

I bought this book and so far it is really good. I wanted to buy some updated books, as all my Linux books are a few years old already. Also I keep my skills sharp by reading, reading, reading.

Hardware I purchased

So I bought a cheap car power inverter.

The ones I saw on Amazon were €120+ and they were not very well rated, so I opted to buy a cheap one in the supermarket and keep it in the car just in case one day I need it. (My new Asus Zenbook laptop has 18 hours of autonomy and I don’t charge it for days, but you never know.)

For those that don’t know, a power inverter allows you to get a 220V (120V in the US) plug from your car’s lighter socket. You can also get the energy from an external car battery. This comes in really handy to charge the laptop, cameras, your drone… if you are out in nature and you don’t have any plug nearby.

I bought one years ago to power up Raspberry Pis when I was doing research for a project I was studying to launch.

Fun

Many friends are using Starlink as a substitute for fiber for their rural homes, and they are super happy with it.

One of them sent me a very fun article.

It is in Italian, but you can google translate it.

https://leganerd.com/2022/01/10/starlink-ha-un-piccolo-e-adorabile-problema-con-i-gatti/

Anyway, you can get the idea of what’s going on from the picture :)

So tell me… so your speed with Starlink drops 80% in winter uh… aha…

Random news about Software

I tried the voice recognition in Slack huddles, and it works pretty well. Zoom also has this feature and it is great, especially when you are in a group call or in a class.

My health

I was experiencing some problems, so I scheduled an appointment to get a blood analysis and to be checked. Just in case.

TL;DR: I could have died.

The doctors saw my analysis and sent me to the hospital urgently, where they found something that was going to be lethal. For hours they were checking me and doing several more analyses and tests to rule out false positives, etc… and they precisely identified the issue, provided urgent treatment, and confirmed that I could have died at any moment.

Basically I dodged a bullet.

I was doing certain healthy things that helped me in a situation that could have been deadly or extremely dangerous to my health.

With the treatment and my strict discipline, I reverted the situation really quickly, and now I have more health and more energy than before. I feel rejuvenated.

I’m feeling lucky that with my work, the classes, the books I wrote, etc… I didn’t have to worry about the expensive medicines, the transport, etc… It was a bad moment, during Christmas, with so many people on holidays and pharmacies and GPs closed, so I had to spend more time searching and traveling, and to pay more than would have been strictly necessary. Despite all the time I devoted to my health, I managed to finish my university duties on time, I didn’t miss my duties at work after the hospital, nor did I have to cancel any programming classes or mentoring sessions. Nobody out of my closest circle knew what was going on, with the exception of my boss, whom I kept informed in real time, just in case there was any problem, as I didn’t want to let down the company and leave my duties at work unattended if something major happened.

I was not afraid to die. Unfortunately I’ve lost very significant people since I was a child. Relatives, much appreciated bosses and colleagues whom I considered my friends, and great friends from different circles. Illnesses, accidents, and a friend of mine committed suicide years ago, and some of my partners attempted it (before we knew each other). When you see people that are so good leaving, it brings a sadness that cannot be explained with words. I have had a tough life.

We have a limited time, and we have the freedom to make choices. Some people choose to be miserable, to mistreat others, to lie, to cheat, to be unfaithful, to lack ethics and integrity. Those are their choices. Their wasted time will not come back.

Some of my friends are doctors. I admire them. They save lives and improve the quality of life of people with health problems.

I like being an Engineer because I can create things, I can build instead of destroy, I can help to improve the world, and I can help users to have a good time and to avoid the frustration of services being down. I chose to do positive work. So many times I’ve been offered much bigger salaries to do something I didn’t like, or by companies that I don’t admire, and I refused. Because I wanted to make a better world. I know many people don’t think like that, and they only take, take, take. They are even unable to understand my choices, even to believe that I’m like that. But it’s enough that I know what I’m doing, that it makes sense for me, and that I know that I’m doing well. And then, one day, you realize that by doing well, being fair and nice even if other people stabbed you in the back, you got to know fantastic people like you, and people that adore you and love to have you in their life, in their companies… So I’m really fortunate. To all the good-hearted people around, who give without expecting anything in return, who try to make the world a better place, thank you.

News of the blog 2021-08-16

  • I completed my ZFS on Ubuntu 20.04 LTS book.
    I had an error in an actual hard drive so I added a Troubleshooting section explaining how I fixed it.
  • I paused for a while the progress of my book Python: basic exercises for beginners, as my colleague Michela is translating it into Italian. She is a great Engineer and I could not be happier to have her help.
  • I added a new article about how to create a simple web Star Wars game using Flask.
    As always, I use Docker and a Dockerfile to automate the deployment, so you can test it without messing with your local system.
    The code is very simple and easy to understand.
mysql> UPDATE wp_options set option_value='blog.carlesmateo.local' WHERE option_name='siteurl';
Query OK, 1 row affected (0.02 sec)
Rows matched: 1 Changed: 1 Warnings: 0

This way I set an entry for blog.carlesmateo.local in /etc/hosts and I can do all the tests I want.

  • I added a new section to the blog: a link where you can see all the articles published, ordered by number of views.
    /posts_and_views.php

It is on the main page, just after the recommended articles.
Here you can see the source code.

  • I removed the Categories:
    • Storage
      • ZFS
  • In favor of:
    • Hardware
      • Storage
        • ZFS
  • So the articles with Categories in the deleted group were reassigned to the Categories in the second group.
  • Visually:
    • I removed some annoying lines from the Quick Selection access.
      They came from CSS properties inherited from my long-customized WordPress theme, and I created new styles for this section.
    • I adjusted the line-height so the separation between lines is not too large.
  • I added a link in the section of Other Engineering Blogs that I like, to the great https://github.com/lesterchan site, author of many super cool WordPress plugins.

My PHP Script to see WordPress Posts and Views ordered by Views

This article shows the code of my PHP program posts_and_views.php, which you can find at the bottom of the Quick Selection bar.

Page rendered of all my Published Posts sorted by Views

So I wrote this small PHP Code to display the number of Post Views recorded in my WordPress Database, sorted by Views in descending order.

In other words, sorted from Most Viewed.

Please note: In order to track the post views I use the excellent plugin WP-PostViews by Lester ‘GaMerZ’ Chan

Here is the code for posts_and_views.php

As you can see I use the wp-config.php file to get the Database settings (username, password, database), but I don’t use the WordPress engine; it is just a standalone PHP script. So I wrote my own function to get the slug based on the category name. I believe the slug is in the Database and I could have added it as a SubQuery with JOINs, which would be better, but I wanted to keep the Database workload lightweight and, especially, I did not want to invest more time investigating how to get the slug.

As I tested my function I didn’t find any Category failing, but I saw that I had the Category Storage repeated in two different trees of the Categories structure, and I would always link only to the first one (so I was not linking to the slug storage-2). I fixed that repeated Category, but to be honest, if this script were a commercial solution or a properly maintained Open Source solution rather than just a sample, I would update it to have the Categories’ and Tags’ slugs come from the Database.
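
As an orientation of what that would mean, here is a hypothetical helper (not part of the script below) that reads the real slug from the wp_terms table with a mysqli prepared statement, falling back to the string-based approach:

<?php
// Hypothetical helper: fetch the real slug from wp_terms instead of deriving it from the name
function get_category_slug_from_db($o_conn, $s_category_name) {
    $o_stmt = $o_conn->prepare("SELECT slug FROM wp_terms WHERE name = ? LIMIT 1");
    $o_stmt->bind_param("s", $s_category_name);
    $o_stmt->execute();
    $o_stmt->bind_result($s_slug);
    $b_found = $o_stmt->fetch();
    $o_stmt->close();
    if ($b_found) {
        return $s_slug;
    }
    // Fall back to building the slug from the name, as the sample script does
    return str_replace(" ", "-", strtolower($s_category_name));
}
?>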

I would probably make it work with a Cron that generates a cached page, updated every hour or every ten minutes. I did this in other projects I worked on, and in my PHP framework, Catalonia Framework.

But honestly, the load of the page does not justify a major effort in optimizing here, in this case.
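
If I ever needed it, a minimal sketch of that caching idea (hypothetical paths, and not part of the script) could be:

<?php
// Hypothetical sketch: serve a cached copy of the page, regenerated at most every 10 minutes
$s_cache_file = "/var/www/cache/posts_and_views.html";
$i_max_age_seconds = 600;

if (file_exists($s_cache_file) && (time() - filemtime($s_cache_file)) < $i_max_age_seconds) {
    readfile($s_cache_file);
    exit;
}

ob_start();
// ... render the page exactly as posts_and_views.php does ...
$s_html = ob_get_clean();
file_put_contents($s_cache_file, $s_html);
print($s_html);
?>

That said, here is the actual code of posts_and_views.php: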

<!DOCTYPE html>
<html>
<head>
    <style>
        .first {

            background-color: blue;
            padding: 12px;
        }
        .second {
            background-color: rgba(254, 253, 252, 0.7);
            text-align: center;
            padding:20px 0;
            font-size: 20px;
        }

        .centered {
            text-align: center;
        }

        /* unvisited link */
        a:link {
            color: blue;
            text-decoration: none;
        }

        /* visited link */
        a:visited {
            color: blue;
            text-decoration: none;
        }

        /* mouse over link */
        a:hover {
            color: #00A8EF;
            text-decoration: underline;
        }

        /* selected link */
        a:active {
            color: blue;
        }
    </style>
</head>
<body>
<?php

include "wp-config.php";

$s_site_name = "Carles Mateo's blog";
$s_site_link = "https://blog.carlesmateo.com";

function get_category_slug($s_text) {
    $s_output_text = strtolower($s_text);
    $s_output_text = str_replace(" ", "-", $s_output_text);

    return $s_output_text;
}

?>
<h1><a href="<?php print $s_site_link; ?>"><?php print($s_site_name); ?></a></h1>
<?php


$s_sort = "views DESC, post_date, post_title DESC";

if (array_key_exists("sort", $_GET)) {
    if ($_GET["sort"] == "date") {
        $s_sort = "post_date, views, post_title";
    }
}

$s_servername = "localhost";
$s_database = DB_NAME;
$s_username = DB_USER;
$s_password = DB_PASSWORD;


// Create connection
$o_conn = new mysqli($s_servername, $s_username, $s_password, $s_database);

// Check connection
if ($o_conn->connect_error) {
    die("Connection failed: " . $o_conn->connect_error);
}

/*
mysql> DESCRIBE wp_posts;
+-----------------------+-----------------+------+-----+---------------------+----------------+
| Field                 | Type            | Null | Key | Default             | Extra          |
+-----------------------+-----------------+------+-----+---------------------+----------------+
| ID                    | bigint unsigned | NO   | PRI | NULL                | auto_increment |
| post_author           | bigint unsigned | NO   | MUL | 0                   |                |
| post_date             | datetime        | NO   |     | 0000-00-00 00:00:00 |                |
| post_date_gmt         | datetime        | NO   |     | 0000-00-00 00:00:00 |                |
| post_content          | longtext        | NO   |     | NULL                |                |
| post_title            | text            | NO   |     | NULL                |                |
| post_excerpt          | text            | NO   |     | NULL                |                |
| post_status           | varchar(20)     | NO   |     | publish             |                |
| comment_status        | varchar(20)     | NO   |     | open                |                |
| ping_status           | varchar(20)     | NO   |     | open                |                |
| post_password         | varchar(255)    | NO   |     |                     |                |
| post_name             | varchar(200)    | NO   | MUL |                     |                |
| to_ping               | text            | NO   |     | NULL                |                |
| pinged                | text            | NO   |     | NULL                |                |
| post_modified         | datetime        | NO   |     | 0000-00-00 00:00:00 |                |
| post_modified_gmt     | datetime        | NO   |     | 0000-00-00 00:00:00 |                |
| post_content_filtered | longtext        | NO   |     | NULL                |                |
| post_parent           | bigint unsigned | NO   | MUL | 0                   |                |
| guid                  | varchar(255)    | NO   |     |                     |                |
| menu_order            | int             | NO   |     | 0                   |                |
| post_type             | varchar(20)     | NO   | MUL | post                |                |
| post_mime_type        | varchar(100)    | NO   |     |                     |                |
| comment_count         | bigint          | NO   |     | 0                   |                |
+-----------------------+-----------------+------+-----+---------------------+----------------+
 */

/*
mysql> describe wp_postmeta;
+------------+-----------------+------+-----+---------+----------------+
| Field      | Type            | Null | Key | Default | Extra          |
+------------+-----------------+------+-----+---------+----------------+
| meta_id    | bigint unsigned | NO   | PRI | NULL    | auto_increment |
| post_id    | bigint unsigned | NO   | MUL | 0       |                |
| meta_key   | varchar(255)    | YES  | MUL | NULL    |                |
| meta_value | longtext        | YES  |     | NULL    |                |
+------------+-----------------+------+-----+---------+----------------+
 */

$s_sql = "SELECT DISTINCT post_title, post_content,
                 (SELECT CAST(meta_value AS SIGNED) FROM wp_postmeta WHERE wp_postmeta.meta_key = 'views' AND wp_postmeta.post_id = wp_posts.ID) AS 'views',
                 (SELECT group_concat(wp_terms.name separator ',') 
                         FROM wp_terms
                  INNER JOIN wp_term_taxonomy 
                          ON wp_terms.term_id = wp_term_taxonomy.term_id
                  INNER JOIN wp_term_relationships wpr 
                          ON wpr.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
                  WHERE 
                         taxonomy= 'category' AND wp_posts.ID = wpr.object_id
                ) AS 'Categories',
                (SELECT group_concat(wp_terms.name separator ', ') 
                        FROM wp_terms
                 INNER JOIN wp_term_taxonomy 
                         ON wp_terms.term_id = wp_term_taxonomy.term_id
                 INNER JOIN wp_term_relationships wpr 
                         ON wpr.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
                 WHERE 
                        taxonomy= 'post_tag' AND wp_posts.ID = wpr.object_id
                ) AS 'Tags',
               ID, post_name, post_date, post_modified, post_type
          FROM wp_posts
          WHERE 
                post_type = 'post' AND post_status = 'publish'
          ORDER BY
                $s_sort";

$o_result = $o_conn->query($s_sql);

if ($o_result->num_rows > 0) {
    ?><table style="border:1px solid">
    <tr class="first"><th>Title</th><th style="min-width:100px">Views</th><th style="min-width:150px">Creation Date</th><th>Categories</th><th>Tags</th></tr>

    <?php
    $i_total_views = 0;
    $i_articles = 0;

    // output data of each row
    while($row = $o_result->fetch_assoc()) {
        $s_style='style="border:1px solid"';
        $s_style='';
        $s_url = $row['post_name'];
        print('<tr>');
        print("<td $s_style>");
        print('<a href="'.$s_url.'" target="_blank">');
        print($row["post_title"]);
        print('</a>');
        print("</td>");
        print('<td class="centered" '.$s_style.'>');
        print(number_format($row["views"]));
        print("</td>");
        print("<td $s_style>");
        print("<small>");
        print($row["post_date"]);
        print("</small>");
        print("</td>");
        print("<td $s_style>");
        $s_categories = $row["Categories"];
        $a_categories = explode (",", $s_categories);
        $s_categories_content = "";
        foreach($a_categories as $s_category) {
            $s_category_slug = "/category/".get_category_slug($s_category)."/";
            $s_categories_content = $s_categories_content .'<a href="'.$s_category_slug.'" target="_blank">';
            $s_categories_content = $s_categories_content .$s_category;
            $s_categories_content = $s_categories_content ."</a>, ";
        }

        if (strlen($s_categories_content) > 0) {
            $s_categories_content = substr($s_categories_content, 0, -2);
        }

        print($s_categories_content);
        print("</td>");
        print("<td $s_style>");
        print($row["Tags"]);
        print("</td>");
        // $row["post_content"];
        $i_total_views = $i_total_views + intval($row["views"]);
        $i_articles++;
        echo "</tr>";
    } ?></table><?php
    print("<strong>Total articles:</strong> ".number_format($i_articles)." <strong>Total Views:</strong> ".number_format($i_total_views));
    print("<br>");
} else {
    echo "<p>0 results</p>";
}
$o_conn->close();

?>
</body>
</html>

A small Python + MySql + Docker program as a sample

This article can be found in my book Python Combat Guide.

I wrote this code and article in order to help my Python students to mix together Object Oriented Programming, MySql, and Docker.

I prepared this video that walks through the steps and the code:

You can have everything in action by just downloading the code and running the build_docker.sh and docker_run.sh scripts.

You can download the source code from:

https://gitlab.com/carles.mateo/python-mysql-example

and clone with:

git clone https://gitlab.com/carles.mateo/python-mysql-example.git

Installing the MySql driver

We are going to use Oracle’s official MySql driver for Python.

All the documentation is here:

https://dev.mysql.com/doc/connector-python/en/

In order to install we will use pip.

To install it in Ubuntu:

pip install mysql-connector-python

On Mac OS X you have to use pip3 instead of pip.

However, we are going to run everything from a Docker Container, so the only thing you need is to have Docker installed.

If you prefer to install MySql in your computer (or Virtual Box instance) directly, skip the Docker steps.

Dockerfile

The Dockerfile is the file that Docker uses to build the Docker Image.

Ours looks like this:

FROM ubuntu:20.04

MAINTAINER Carles Mateo

ARG DEBIAN_FRONTEND=noninteractive

RUN apt update && apt install -y python3 pip mysql-server vim mc wget curl && apt-get clean

RUN pip install mysql-connector-python

EXPOSE 3306

ENV FOLDER_PROJECT /var/mysql_carles

RUN mkdir -p $FOLDER_PROJECT

COPY docker_run_mysql.sh $FOLDER_PROJECT
COPY start.sql $FOLDER_PROJECT
COPY src $FOLDER_PROJECT

RUN chmod +x /var/mysql_carles/docker_run_mysql.sh

CMD ["/var/mysql_carles/docker_run_mysql.sh"]

The first line defines that we are going to use Ubuntu 20.04 (it’s an LTS version).

We install all the apt packages in a single line, as Docker works in layers, and the disk space used in a previous layer is not freed even if we delete the files; so we want to run apt update, install all the packages, and clean the temporary files in one single step.

I also install some useful tools like vim, mc, wget and curl.

We expose port 3306 to the outside, in case you want to run the Python code from your computer while having MySql in the Container.

The last line executes a script that starts the MySql service, creates the table and the user, adds two rows, and runs an infinite loop so the Docker Container does not finish.

build_docker.sh

build_docker.sh is a Bash script that builds the Docker Image for you very easily.

It stops the previous container and removes it, so leftovers do not accumulate on your hard drive if you make modifications.

It checks for errors when building, and it also reminds you how to run and debug the Docker Container.

#!/bin/bash

# Execute with sudo

s_DOCKER_IMAGE_NAME="blog_carlesmateo_com_mysql"

printf "Stopping old image %s\n" "${s_DOCKER_IMAGE_NAME}"
sudo docker stop "${s_DOCKER_IMAGE_NAME}"

printf "Removing old image %s\n" "${s_DOCKER_IMAGE_NAME}"
sudo docker rm "${s_DOCKER_IMAGE_NAME}"

printf "Creating Docker Image %s\n" "${s_DOCKER_IMAGE_NAME}"
sudo docker build -t ${s_DOCKER_IMAGE_NAME} . --no-cache

i_EXIT_CODE=$?
if [ $i_EXIT_CODE -ne 0 ]; then
    printf "Error. Exit code %s\n" ${i_EXIT_CODE}
    exit
fi

echo "Ready to run ${s_DOCKER_IMAGE_NAME} Docker Container"
echo "To run type: sudo docker run -d -p 3306:3306 --name ${s_DOCKER_IMAGE_NAME} ${s_DOCKER_IMAGE_NAME}"
echo "or just use run_in_docker.sh"
echo
echo "Debug running Docker:"
echo "docker exec -it ${s_DOCKER_IMAGE_NAME} /bin/bash"
echo

docker_run.sh

I also provide a script named docker_run.sh that runs your Container easily, exposing the MySql port.

#!/bin/bash

# Execute with sudo

s_DOCKER_IMAGE_NAME="blog_carlesmateo_com_mysql"

docker run -d -p 3306:3306 --name ${s_DOCKER_IMAGE_NAME} ${s_DOCKER_IMAGE_NAME}

echo "Showing running Instances"
docker ps

As you saw before, I named the image blog_carlesmateo_com_mysql.

I did that basically to make sure the name was unique: as build_docker.sh stops and removes anything named like the name I chose, I didn’t want to use a generic name like “mysql” that might lead you to delete a Docker Container or Image inadvertently.

docker_run_mysql.sh

This script will run when the Docker Container is launched for the first time:

#!/bin/bash

# Allow to be queried from outside
sed -i '31 s/bind-address/#bind-address/' /etc/mysql/mysql.conf.d/mysqld.cnf

service mysql start

# Create a Database, a user with password, and permissions
cd /var/mysql_carles
mysql -u root < start.sql

while [ true ]; do sleep 60; done

With the sed command we modify line 31 of the MySQL config file (bind-address = 127.0.0.1) so we can connect from outside the Docker Instance.

As you can see, it starts MySql and then executes the SQL contained in the file start.sql as root.

Please note: our MySql installation has no password set for root. It is only for Development purposes.

start.sql

The SQL file that will be run inside our Docker Container.

CREATE DATABASE carles_database;

CREATE USER 'python'@'localhost' IDENTIFIED BY 'blog.carlesmateo.com-db-password';
CREATE USER 'python'@'%' IDENTIFIED BY 'blog.carlesmateo.com-db-password';
GRANT ALL PRIVILEGES ON carles_database.* TO 'python'@'localhost';
GRANT ALL PRIVILEGES ON carles_database.* TO 'python'@'%';

USE carles_database;

CREATE TABLE car_queue (
    i_id_car int,
    s_model_code varchar(25),
    s_color_code varchar(25),
    s_extras varchar(100),
    i_right_side int,
    s_city_to_ship varchar(25)
);

INSERT INTO car_queue (i_id_car, s_model_code, s_color_code, s_extras, i_right_side, s_city_to_ship) VALUES (1, "GOLF2021", "BLUE7", "COND_AIR, GPS, MULTIMEDIA_V3", 0, "Barcelona");
INSERT INTO car_queue (i_id_car, s_model_code, s_color_code, s_extras, i_right_side, s_city_to_ship) VALUES (2, "GOLF2021_PLUGIN_HYBRID", "BLUEMETAL_5", "COND_AIR, GPS, MULTIMEDIA_V3, SECURITY_V5", 1, "Cork");

As you can see it creates the user “python” with the password ‘blog.carlesmateo.com-db-password’ for local and remote (%) access.

It also creates a Database named carles_database and grants all the permissions to the user “python”, for local and remote.

This is the user we will use to authenticate from our Python code.

Then we switch to use the carles_database and we create the car_queue table.

We insert two rows, as an example.

select_values_example.py

Finally the Python code that will query the Database.

import mysql.connector

if __name__ == "__main__":
    o_conn = mysql.connector.connect(user='python', password='blog.carlesmateo.com-db-password', database='carles_database')
    o_cursor = o_conn.cursor()

    s_query = "SELECT * FROM car_queue"

    o_cursor.execute(s_query)

    for a_row in o_cursor:
        print(a_row)

    o_cursor.close()
    o_conn.close()

Nothing special: we open a connection to MySql, perform a query, and iterate over the cursor as rows/lists.

Please note: there is no error control, so any exception will surface directly.

Executing the Container

First step is to build the Container.

From the directory where you cloned the project, execute:

sudo ./build_docker.sh

Then run the Docker Container:

sudo ./docker_run.sh

The script also performs a docker ps command, so you can see that it’s running.

Entering the Container and running the code

Now you can enter inside the Docker Container:

docker exec -it blog_carlesmateo_com_mysql /bin/bash

Then change to the directory where I installed the sample files:

cd /var/mysql_carles

And execute the Python 3 example:

python3 select_values_example.py

Tying together MySql and a Python Menu with Object Oriented Programming

In order to tie it all together, and especially to give my students a consistent view (showing a complete program instead of only pieces) and a bit of Object Oriented Programming in action, I developed a small program that simulates the handling of a production queue for Volkswagen.

MySQL Library

First I created a library to handle MySQL operations.

lib/mysqllib.py

import mysql.connector


class MySql():

    def __init__(self, s_user, s_password, s_database, s_host="127.0.0.1", i_port=3306):
        self.s_user = s_user
        self.s_password = s_password
        self.s_database = s_database
        self.s_host = s_host
        self.i_port = i_port

        o_conn = mysql.connector.connect(host=s_host, port=i_port, user=s_user, password=s_password, database=s_database)
        self.o_conn = o_conn

    def query(self, s_query):
        a_rows = []

        o_cursor = self.o_conn.cursor()

        o_cursor.execute(s_query)

        for a_row in o_cursor:
            a_rows.append(a_row)

        o_cursor.close()

        return a_rows

    def insert(self, s_query):

        o_cursor = self.o_conn.cursor()

        o_cursor.execute(s_query)
        i_inserted_row_count = o_cursor.rowcount

        # Make sure data is committed to the database
        self.o_conn.commit()

        return i_inserted_row_count

    def delete(self, s_query):

        o_cursor = self.o_conn.cursor()

        o_cursor.execute(s_query)
        i_deleted_row_count = o_cursor.rowcount

        # Make sure data is committed to the database
        self.o_conn.commit()

        return i_deleted_row_count


    def close(self):

        self.o_conn.close()

Basically when this class is instantiated, a new connection to the MySQL specified in the Constructor is established.

We have a method query() to send SELECT queries.

We have an insert method, to send INSERT and UPDATE queries, which returns the number of rows affected.

This method performs a commit to make sure the changes persist.

We have a delete method, to send DELETE SQL queries, which returns the number of rows deleted.

We have a close method which closes the MySql connection.

A Data Object: CarDO

Then I defined a class to deal with the Data and interactions of the cars.

do/cardo.py


class CarDO():

    def __init__(self, i_id_car=0, s_model_code="", s_color_code="", s_extras="", i_right_side=0, s_city_to_ship=""):
        self.i_id_car = i_id_car
        self.s_model_code = s_model_code
        self.s_color_code = s_color_code
        self.s_extras = s_extras
        self.i_right_side = i_right_side
        self.s_city_to_ship = s_city_to_ship

        # Sizes for render
        self.i_width_id_car = 6
        self.i_width_model_code = 25
        self.i_width_color_code = 25
        self.i_width_extras = 50
        self.i_width_side = 5
        self.i_width_city_to_ship = 15

    def print_car_info(self):
        print("Id:", self.i_id_car)
        print("Model Code:", self.s_model_code)
        print("Color Code:", self.s_color_code)
        print("Extras:", self.s_extras)
        s_side = self.get_word_for_driving_side()
        print("Drive by side:", s_side)
        print("City to ship:", self.s_city_to_ship)

    def get_word_for_driving_side(self):
        if self.i_right_side == 1:
            s_side = "Right"
        else:
            s_side = "Left"

        return s_side

    def get_car_info_for_list(self):

        s_output = str(self.i_id_car).rjust(self.i_width_id_car) + " "
        s_output += self.s_model_code.rjust(self.i_width_model_code) + " "
        s_output += self.s_color_code.rjust(self.i_width_color_code) + " "
        s_output += self.s_extras.rjust(self.i_width_extras) + " "
        s_output += self.get_word_for_driving_side().rjust(self.i_width_side) + " "
        s_output += self.get_s_city_to_ship().rjust(self.i_width_city_to_ship)

        return s_output

    def get_car_header_for_list(self):
        s_output = str("Id Car").rjust(self.i_width_id_car) + " "
        s_output += "Model Code".rjust(self.i_width_model_code) + " "
        s_output += "Color Code".rjust(self.i_width_color_code) + " "
        s_output += "Extras".rjust(self.i_width_extras) + " "
        s_output += "Drive".rjust(self.i_width_side) + " "
        s_output += "City to Ship".rjust(self.i_width_city_to_ship)

        i_total_length = self.i_width_id_car + self.i_width_model_code + self.i_width_color_code + self.i_width_extras + self.i_width_side + self.i_width_city_to_ship
        # Add the space between fields
        i_total_length = i_total_length + 5

        s_output += "\n"
        s_output += "=" * i_total_length

        return s_output

    def get_i_id_car(self):
        return self.i_id_car

    def get_s_model_code(self):
        return self.s_model_code

    def get_s_color_code(self):
        return self.s_color_code

    def get_s_extras(self):
        return self.s_extras

    def get_i_right_side(self):
        return self.i_right_side

    def get_s_city_to_ship(self):
        return self.s_city_to_ship

Initially I was going to have a CarDO Object without any logic. Only with Data.

In OOP the variables of the Instance are called Properties, and the functions Methods.

Then I decided to add some logic, so I can show what the typical use of objects looks like.

So I will use CarDO as a Data Object, but also to do a few functions like printing the info of a Car.

Queue Manager

Finally the main program.

We also use Object Oriented Programming, and we use Dependency Injection to inject the MySQL Instance. That is very practical for Unit Testing.

from lib.mysqllib import MySql
from do.cardo import CarDO


class QueueManager():

    def __init__(self, o_mysql):
        self.o_mysql = o_mysql

    def exit(self):
        exit(0)

    def main_menu(self):
        while True:
            print("Main Menu")
            print("=========")
            print("")
            print("1. Add new car to queue")
            print("2. List all cars to queue")
            print("3. View car by Id")
            print("4. Delete car from queue by Id")
            print("")
            print("0. Exit")
            print("")

            s_option = input("Choose your option:")
            if s_option == "1":
                self.add_new_car()
            if s_option == "2":
                self.see_all_cars()
            if s_option == "3":
                self.see_car_by_id()
            if s_option == "4":
                self.delete_by_id()

            if s_option == "0":
                self.exit()

    def get_all_cars(self):
        s_query = "SELECT * FROM car_queue"

        a_rows = self.o_mysql.query(s_query)
        a_o_cars = []

        for a_row in a_rows:
            i_id_car = a_row[0]
            s_model_code = a_row[1]
            s_color_code = a_row[2]
            s_extras = a_row[3]
            i_right_side = a_row[4]
            s_city_to_ship = a_row[5]

            o_car = CarDO(i_id_car=i_id_car, s_model_code=s_model_code, s_color_code=s_color_code, s_extras=s_extras, i_right_side=i_right_side, s_city_to_ship=s_city_to_ship)
            a_o_cars.append(o_car)

        return a_o_cars

    def get_car_by_id(self, i_id_car):
        b_success = False
        o_car = None

        s_query = "SELECT * FROM car_queue WHERE i_id_car=" + str(i_id_car)

        a_rows = self.o_mysql.query(s_query)

        if len(a_rows) == 0:
            # False, None
            return b_success, o_car

        i_id_car = a_rows[0][0]
        s_model_code = a_rows[0][1]
        s_color_code = a_rows[0][2]
        s_extras = a_rows[0][3]
        i_right_side = a_rows[0][4]
        s_city_to_ship = a_rows[0][5]

        o_car = CarDO(i_id_car=i_id_car, s_model_code=s_model_code, s_color_code=s_color_code, s_extras=s_extras, i_right_side=i_right_side, s_city_to_ship=s_city_to_ship)
        b_success = True

        return b_success, o_car

    def replace_apostrophe(self, s_text):
        return s_text.replace("'", "´")

    def insert_car(self, o_car):

        s_sql = """INSERT INTO car_queue 
                                (i_id_car, s_model_code, s_color_code, s_extras, i_right_side, s_city_to_ship) 
                         VALUES 
                                (""" + str(o_car.get_i_id_car()) + ", '" + o_car.get_s_model_code() + "', '" + o_car.get_s_color_code() + "', '" + o_car.get_s_extras() + "', " + str(o_car.get_i_right_side()) + ", '" + o_car.get_s_city_to_ship() + "');"

        i_inserted_row_count = self.o_mysql.insert(s_sql)

        if i_inserted_row_count > 0:
            print("Inserted", i_inserted_row_count, " row/s")
            b_success = True
        else:
            print("It was impossible to insert the row")
            b_success = False

        return b_success

    def add_new_car(self):
        print("Add new car")
        print("===========")

        while True:
            s_id_car = input("Enter new ID: ")
            if s_id_car == "":
                print("A numeric Id is needed")
                continue

            i_id_car = int(s_id_car)

            if i_id_car < 1:
                continue

            # Check if that id existed already
            b_success, o_car = self.get_car_by_id(i_id_car=i_id_car)
            if b_success is False:
                # Does not exist
                break

            print("Sorry, this Id already exists")

        s_model_code = input("Enter Model Code:")
        s_color_code = input("Enter Color Code:")
        s_extras = input("Enter extras comma separated:")
        s_right_side = input("Enter R for Right side driven:")
        if s_right_side.upper() == "R":
            i_right_side = 1
        else:
            i_right_side = 0
        s_city_to_ship = input("Enter the city to ship the car:")

        # Sanitize SQL replacing apostrophe
        s_model_code = self.replace_apostrophe(s_model_code)
        s_color_code = self.replace_apostrophe(s_color_code)
        s_extras = self.replace_apostrophe(s_extras)
        s_city_to_ship = self.replace_apostrophe(s_city_to_ship)

        o_car = CarDO(i_id_car=i_id_car, s_model_code=s_model_code, s_color_code=s_color_code, s_extras=s_extras, i_right_side=i_right_side, s_city_to_ship=s_city_to_ship)
        b_success = self.insert_car(o_car)

    def see_all_cars(self):
        print("")

        a_o_cars = self.get_all_cars()

        if len(a_o_cars) > 0:
            print(a_o_cars[0].get_car_header_for_list())
        else:
            print("No cars in queue")
            print("")
            return

        for o_car in a_o_cars:
            print(o_car.get_car_info_for_list())

        print("")

    def see_car_by_id(self, i_id_car=0):
        if i_id_car == 0:
            s_id = input("Car Id:")
            i_id_car = int(s_id)

        s_id_car = str(i_id_car)

        b_success, o_car = self.get_car_by_id(i_id_car=i_id_car)
        if b_success is False:
            print("Error, car id: " + s_id_car + " not located.")
            return False

        print("")
        o_car.print_car_info()
        print("")

        return True

    def delete_by_id(self):

        s_id = input("Enter Id of car to delete:")
        i_id_car = int(s_id)

        if i_id_car == 0:
            print("Invalid Id")
            return

        # reuse see_car_by_id
        b_found = self.see_car_by_id(i_id_car=i_id_car)
        if b_found is False:
            return

        s_delete = input("Are you sure you want to DELETE. Type Y to delete: ")
        if s_delete.upper() == "Y":
            s_sql = "DELETE FROM car_queue WHERE i_id_car=" + str(i_id_car)
            i_num = self.o_mysql.delete(s_sql)

            print(i_num, " Rows deleted")

            # if b_success is True:
            #     print("Car deleted successfully from the queue")


if __name__ == "__main__":

    try:

        o_mysql = MySql(s_user="python", s_password="blog.carlesmateo.com-db-password", s_database="carles_database", s_host="127.0.0.1", i_port=3306)

        o_queue_manager = QueueManager(o_mysql=o_mysql)
        o_queue_manager.main_menu()
    except KeyboardInterrupt:
        print("Detected CTRL + C. Exiting")

This program talks to the MySQL that we started in a Docker Container previously.

We have access from inside the Docker Container, or from outside.

The idea of this simple program is to use a library for dealing with MySql, and objects for dealing with the Cars. The class CarDO also takes care of rendering its data on the screen.

To enter the Docker Container once you have generated it and it is running, do:

docker exec -it blog_carlesmateo_com_mysql /bin/bash

Then:

cd /var/mysql_carles 
python3 queue_manager.py

Bonus

I added a file called queue_manager.php so you can see how easy it is to render an HTML page with data coming from the Database, from PHP.
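
The actual queue_manager.php is in the repository; as an orientation only, a minimal page doing that kind of rendering could look like this (same credentials and table as in start.sql, hypothetical markup):

<?php
// Minimal sketch, not the real queue_manager.php: list the car_queue table as an HTML table
$o_conn = new mysqli("127.0.0.1", "python", "blog.carlesmateo.com-db-password", "carles_database");
if ($o_conn->connect_error) {
    die("Connection failed: " . $o_conn->connect_error);
}

$o_result = $o_conn->query("SELECT i_id_car, s_model_code, s_color_code, s_extras, i_right_side, s_city_to_ship FROM car_queue");

print("<table border='1'>");
print("<tr><th>Id</th><th>Model</th><th>Color</th><th>Extras</th><th>Drive</th><th>City to Ship</th></tr>");
while ($a_row = $o_result->fetch_assoc()) {
    print("<tr>");
    print("<td>".htmlspecialchars($a_row["i_id_car"])."</td>");
    print("<td>".htmlspecialchars($a_row["s_model_code"])."</td>");
    print("<td>".htmlspecialchars($a_row["s_color_code"])."</td>");
    print("<td>".htmlspecialchars($a_row["s_extras"])."</td>");
    print("<td>".($a_row["i_right_side"] == 1 ? "Right" : "Left")."</td>");
    print("<td>".htmlspecialchars($a_row["s_city_to_ship"])."</td>");
    print("</tr>");
}
print("</table>");

$o_conn->close();
?>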

News from the blog 2021-01-11

Happy New Year to all.

It is something very simple, but it will help my student friends to validate Input from Keyboard without losing too many hours.

The Input Validation Classes I created in PHP for Privalia, or in my PHP Catalonia Framework, are much, much more powerful, allowing the validation of complete forms, rendering errors, etc… although they were created for the Web, and not for Keyboard input.

It recursively goes through all the subdirectories looking for .py files, and then it counts the lines.

  • I updated the price of my books to be the minimum allowed by LeanPub, to $5 USD, and created a bundle of two of them for $7 USD.

So people can benefit from this during the lockdown.

  • I’ve updated the Python Combat Guide book with a sample of using the Paramiko library for SSH, and expanded the Object Oriented Programming and Unit Testing sections. I also added some books to the Bibliography.
  • I’ve read the initial post-mortem analysis of Slack’s incident. It’s really interesting.

I cannot share it, but I guess that at some point they will publish it on their blog:

https://slack.engineering/

  • As I’m giving more Python Classes, I decided to write a book to teach non-programmers how to code in Python.

Post-Mortem: The mystery of the duplicated Transactions into an e-Commerce

Together with 4 other Senior BackEnd Engineers, I wrote the new e-Commerce for a multinational.

The old legacy Software had evolved into different code for every country, making it impossible to maintain.

The new Software we created used inheritance to share the same base code across countries and overrode only the behavior specific to each country, like the payment methods: for example Brazil supporting “parcelados”, or Germany with specific payment players.

We rewrote the old procedural PHP BackEnd into modern PHP, with OOP and our own Framework, but we had to keep the transactional code in existing MySQL Procedures, so the logic was split. There was a Front End Team consuming our JSONs. Basically all the Front End code was cached in Akamai and pages were rendered according to the JSONs served from our BackEnd.

It was a huge success.

This e-Commerce site had Campaigns that started at a certain time, so the amount of traffic arriving at the same moment was challenging.

The project was working very well, and after some time the original Team was split across different projects in the company, and a Team for maintenance and new features was hired.

At a certain point they started to encounter duplicate transactions, and nobody was able to solve the mystery.

I specialize in fixing impossible problems. They used to send me on Impossible Missions, and I am famous for solving impossible problems easily.

So I started the task with an SRE approach.

The System had many components and layers. The problem could be in many places.

I had in my arsenal of tools Software like mysqldebugger, with which in the past I had found an unnoticed bug in decimals calculation, surprising everybody.

The Engineers previously involved believed the problem was on the Database side. They were having difficulties identifying the issue because of the random nature of the repetitions.

Sometimes the order lines were duplicated, and other times the payments were, which means charging the customer twice.

Redis Cluster could also play a part in this, as it stored the session information and the basket.

But I had to follow the logic sequence of steps.

If transactions from customers were duplicated, that meant that, first of all, those requests had arrived at the System. So that was a good starting point.

With a list of duplicated operations, I checked the Webservers’ logs.

That was a bit tricky, as the Webserver was recording the IP of the Load Balancer, not the IP of the customer. But we were tracking the sessionid, so with that I could track a user’s request history. A good thing was also that we were using cookies to stick the user to the same Webserver node. That has pros and cons, but in this case I didn’t have to worry about the combined logs of all the Webservers; I could just identify a transaction in one node and stick to that node’s log.

I was working with SSH and Bash; none of the log aggregators existing today were available at that time.

So when I started to catch web logs and grep a bit, a smile was drawn on my face. :)

There were no transactions repeated because of bad behavior of the MySQL Masters, or because of BackEnd problems. The HTTP requests were actually performed twice.

And the explanation for that was much simpler.

Many Windows and Mac users are used to double-clicking on the Desktop to open programs, so when they started to use the Internet, they did the same: they double-clicked the Submit button on the forms, causing two JavaScript requests in parallel.

When I explained it they were really surprised, but then they started to worry about how they could fix that.

Well, there are many ways, like using a UUID in each request and not accepting two concurrent ones, but I came up with something that we could deploy super fast.

I explained how to change the JavaScript code so the buttons would have no default submit action and would trigger a JavaScript method instead, which would set a boolean to True and also disable the button so it could not be clicked anymore. Only if the variable was False would the submit be performed. It was almost impossible to get a double click, as the JavaScript was so fast disabling the button that the second click would not trigger anything. But even if that were possible, only one request would be made, as the variable was already set to True on the first click event.
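
As a side note, the UUID idea mentioned above can also be sketched on the Server side; this is only an illustration in PHP, with hypothetical names, using the session as the store of already processed request ids:

<?php
// Hypothetical sketch: reject a request whose UUID was already processed in this session
session_start();

$s_request_uuid = isset($_POST["request_uuid"]) ? $_POST["request_uuid"] : "";

if ($s_request_uuid === "" || isset($_SESSION["a_processed_uuids"][$s_request_uuid])) {
    http_response_code(409);
    print("Duplicated or missing request id");
    exit;
}

$_SESSION["a_processed_uuids"][$s_request_uuid] = time();

// ... process the order normally here ...
?>

In our case the JavaScript change was enough, and much faster to ship.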

That case was very funny for me, because it was not necessary to go crazy inspecting the different layers of the system. The problem was detected simply with HTTP logs. :)

People often forget to follow the logical steps, while many problems are much simpler than they seem.

As a curious note, I still see people double-clicking on links and buttons on the Web, and some Software not handling it. :)

Resources for Microservices and Business Domain Solutions for the Cloud Architect / Microservices Architect

First you have to understand that Python, Java and PHP are completely different worlds.

In Python you’ll probably use Flask, and listen on the port you want, inside a Docker Container.

In PHP you’ll use a Framework like Laravel, or Symfony, or Catalonia Framework (my Framework) :) and one repo or many (as the idea is that a change in one microservice cannot break another, it is recommended to have one git repo per Service), and you’ll split the requests with the API Gateway and Filters (so /billing/ goes to the right path on the right Server; it is like rewriting URLs). You’ll rely on Software to split your microservices. Usually you’ll use Docker, but you have to add a Web Server and any other tools, as the source code is not packaged with a Web Server and other Dependencies like it is in Java Spring Boot.

In Java you’ll use Spring Cloud and Spring Boot, and every Service will be self-contained in its own JAR file, which includes Apache Tomcat and all the other Dependencies, and will normally run inside a Docker Container. The TCP/IP listening port will be set at start via the command line, or through the environment. You’ll have many git repositories, one per Service.

Using many repos, one per Service, also allows deploying only that repository and having better security, with independent deployment tokens.

It is not unlikely that you’ll use one language for some of your Services and another for others, as well as one Database or another, as each Service is the owner of its data.

In any case, you will be using CI/CD and your pipeline will be something like this:

  1. Pull the latest code for the Service from the git repository
  2. Compile the code (if needed)
  3. Run the Unit and Integration Tests
  4. Compile the service to an executable artifact (e.g. a Java JAR with the Tomcat server and other dependencies)
  5. Generate a Machine image with your JAR deployed (for Java; look at the Spotify Docker Plugin to run a Docker build from Maven), or with Apache, PHP, other dependencies, and the code. Normally it will be a Docker image. This image will be immutable. You will probably use Docker Hub.
  6. The Machine image is started. Platform tests are run.
  7. If the platform tests pass, the service is promoted to the next environment (for example Dev -> Test -> PreProd -> Prod); the exact same machine image is started in the next environment and the platform tests are repeated.
  8. Before deploying the new Service to Production, I recommend running special Application Tests / Behavior-driven tests. By this I mean conducting tests that really test the functionality of everything, using a real browser and emulating the acts of a user (for example with Behat, Cucumber or JMeter).
    I recommend this especially because Microservices are end-points, independent of the implementation, but normally they are APIs that serve a whole application. In an Application there are several components, and often a change in the Front End can break the application. Imagine a change in the JavaScript Front End that results in a slightly different call, for example with a space before a name. Imagine that the Unit Tests for the Service do not test that, that it was not causing a problem in the old version of the Service, and so it will crash when the new Service is deployed. Or another example: imagine that our Service for paying with Visa cards generates IDs for the Payment Gateway, and as a result of the new implementation the IDs generated are returned. With the mocked objects everything works, but when we deploy for real is when we are going to use the actual Bank Payment. This is also why it is a good idea to have a PreProduction environment, with PreProduction versions of the actual Services we use (all the banks, or the GDSs for flight/hotel reservations like Galileo or Amadeus, have a Test Gateway exactly like Production).

If you work with Microsoft .NET, you’ll probably use Azure DevOps.

We IT Engineers, CTOs and Architects serve the Business. We have to develop the most flexible approaches, enabling the business to release as fast as they need.

Take into account that Microservices are a tool, a pattern. We use them to bring more flexibility and speed in developing, resilience of the services, and speed and independence in deploying. However, this comes at the cost of complexity.

Microservices are more about giving flexibility to the Business, and developing according to the Business Domains, normally oriented to serve an API. If you have an API that is consumed by third parties you will have things like independence of Services (if one is down the others will still function), graceful degradation, being able to scale only the Services that have more load, being able to deploy a new version of a Service independently of the rest of the Services, etc… The complexity in the technical solution comes from all this resilience and flexibility.

If your Dev Team is up to 10 Developers, or you are writing just a CRUD Web Application or a PoC, or you are a Startup with a critical Time to Market, you probably will not want to use the Microservices approach. It is like killing flies with laser cannons. You can use the typical Web services approach: do everything in one single HTTPS request, have transactions, a single Database, etc…

But if your team is 100 Developers, like in a big eCommerce, you’ll have multiple Teams of between 5 and 10 Developers per Business Domain, and you need independence for each Service, with less interdependence. Each Service will own its own Data. That is normally around 5 to 7 tables. Each Service will serve a Business Domain. You’ll benefit from having different technologies for the different needs; however, be careful to avoid having Teams with knowledge so different that there can hardly be rotation, and projects become difficult to continue when the only 2 or 3 Devs that know that technology leave. Typical benefit scenarios can be having MySql for the Billing Services, but a NoSQL Database for the image catalog, or to store logs of account activity. With Microservices, some Services will be calling other Services, often asynchronously, using Queues or Streams; you’ll have Callbacks, Databases for reading, and you’ll probably want to have gradual and graceful failure of your applications, client load balancing, caches and read-only databases/in-memory databases… This complexity exists in order to protect one Service from the failure of others and to bring it the necessary speed under heavy load.

Here you can find a PDF Document of the typical resources I use for Microservice Projects.

You can also download it from my github repository:

https://github.com/carlesmateo/awesome-microservices

Do you use other solutions that are not listed? Leave a message. I’ll investigate them and update the Document, to share it with the Community.

Update 2020-03-06: I found this very nice article explaining the same. Microservices are not for everybody and not the default option: https://www.theregister.co.uk/AMP/2020/03/04/microservices_last_resort/

Update 2020-03-11: Monzo, with 1,600 microservices, says that a microservices architecture is the last resort: https://www.theregister.co.uk/AMP/2020/03/09/monzo_microservices/

Creating a Content Filter for Postfix in PHP

In this article I want to explain how I created a content filter for Postfix, in PHP.

The basic idea is to examine all the incoming messages, looking for a Credit Card pattern, and then send those emails to another Server that, for instance, is PCI compliant, and send an email to the original recipient telling them that they received an email with a CC, and that it is stored in a safe Server.

I chose the pipe mechanism because it is the last one in the chain of content filters, and first I want the message to pass the antivirus (Amavis), antispam and other content filters.

Then I inject the emails into sendmail, with the params -G -i, ensuring that the email will not be reprocessed, entering an infinite loop.

/usr/sbin/sendmail -G -i

A reminder about the SMTP protocol, which I’ll mention later. Another link on Wikipedia.

Edit the file /etc/postfix/master.cf to add these lines:

# ==========================================================================
# service type  private unpriv  chroot  wakeup  maxproc command + args
#               (yes)   (yes)   (yes)   (never) (100)
# ==========================================================================
smtp      inet  n       -       n       -       -       smtpd
  -o content_filter=filter:dummy

# Other external delivery methods.

filter    unix  -       n       n       -       10      pipe
  flags=Rq user=filter argv=/var/filtermails/filtercard.php -f ${sender} -- ${size} ${recipient}

The last parameter, ${recipient}, will expand to as many recipients (RCPT TO:) as the mail has.

Now the code for the PHP filter. Check a simple content filter example here.

The file /var/filtermails/filtercard.php

#!/usr/bin/php
<?php

/*
 * Carles Mateo
 */

date_default_timezone_set('Europe/Andorra');

$s_dest_mail_secure = 'secure@pciserver.carlesmateo.com';

$b_regex_found = false;
$st_emails_rcpt_to = Array();

// All major credit cards regex
// The CC anywhere
$s_cc_regex = '/(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6011[0-9]{12}|622((12[6-9]|1[3-9][0-9])|([2-8][0-9][0-9])|(9(([0-1][0-9])|(2[0-5]))))[0-9]{10}|64[4-9][0-9]{13}|65[0-9]{14}|3(?:0[0-5]|[68][0-9])[0-9]{11}|3[47][0-9]{13})/';

function log_event($s_message) {
 syslog(LOG_WARNING, $s_message);
}

function save_message_to_file($s_file, $s_message) {
 $o_file = fopen($s_file, "a");
 fwrite($o_file, $s_message);
 fclose($o_file);
}

function read_file($s_file) {
 $s_contents = file_get_contents($s_file);
 
 if ($s_contents === false) {
 return '';
 }
 
 return $s_contents;
}

function get_all_rcpt_to($st_emails_input) {
 // First email is pos 5 of the array
 $st_emails = $st_emails_input;
 
 unset($st_emails[0]);
 unset($st_emails[1]);
 unset($st_emails[2]);
 unset($st_emails[3]);
 unset($st_emails[4]);
 
 asort($st_emails);
 
 return $st_emails;
}

/*
 * Returns a @secure. email, from the original email
 */
function get_secure_email($s_email) {
 $i_pos = strpos($s_email, '@');
 
 $s_email_new = $s_email;
 
 if ($i_pos > 0) {
 $s_email_new = substr($s_email, 0, $i_pos);
 $s_email_new .= '@secure.';
 $s_email_new .= substr($s_email, $i_pos +1);
 }
 
 return $s_email_new;
}

function replace_tpl_variables($s_text, $s_sender_original) {

 // TODO: Replace static values
 
 $s_date_sent = date('r'); // RFC 2822 formatted date

 $s_text = str_replace('#DATE_NOW#', $s_date_sent, $s_text);
 $s_text = str_replace('#FROM_NAME#', 'Carles Mateo', $s_text);
 $s_text = str_replace('#FROM_EMAIL#', 'mateo@blog.carlesmateo.com', $s_text);
 $s_text = str_replace('#EMAIL_SENDER_ORIGINAL#', $s_sender_original, $s_text);

 return $s_text;
}

function delete_file($s_file) {
 unlink($s_file);
}

// Read the RCPT TO: fields ${recipient}
$st_emails_rcpt_to = get_all_rcpt_to($argv);

// Read the email
$email = '';
$fd = fopen("php://stdin", "r");
while (!feof($fd)) {
 $line = fread($fd, 1024);
 $email .= $line;
}
fclose($fd);

// Get the portion of the email without headers (to avoid id's being detected as CC numbers)
$i_pos_subject = strpos($email, 'Subject:');
if ($i_pos_subject > 0) {
 // Found
 $email_sanitized = substr($email, $i_pos_subject);
} else {
 // If we don't locate subject we look for From:
 $i_pos_from = strpos($email, 'From:');
 if ($i_pos_from > 0) {
 $email_sanitized = substr($email, $i_pos_from);
 } else {
 // Impossible email, but continue
 $email_sanitized = $email;
 }
}

// Remove spaces, and points so we find 4111.1111.1111.111 and so
$email_sanitized = str_replace(' ', '', $email_sanitized);
$email_sanitized = str_replace('.', '', $email_sanitized);
$email_sanitized = str_replace('-', '', $email_sanitized);

$s_message = "Script filtercard.php successfully ran\n";

log_event('Arguments: '.serialize($argv));

$i_result = preg_match($s_cc_regex, $email_sanitized, $s_matches);
if ($i_result == 1) {
 $b_regex_found = true;
 $s_message .= 'Card found'."\n";
 log_event($s_message);
} else {
 // No credit card
 $s_message .= 'No credit card found'."\n";
 log_event($s_message);
}

$s_dest_mail_original = $argv[5];
$s_sender_original = $argv[2];

// Generate a unique id
$i_unique_id = time().'-'.rand(0,99999).'-'.rand(0,99999);

$INSPECT_DIR='/var/spool/filter/';
// NEVER NEVER NEVER use "-t" here.
$SENDMAIL="/usr/sbin/sendmail -G -i";

$s_file_unique = $INSPECT_DIR.$i_unique_id;

# Exit codes from <sysexits.h>
$EX_TEMPFAIL=75;
$EX_UNAVAILABLE=69;

// Save the file
save_message_to_file($s_file_unique, $email);

$st_output = Array();

if ($b_regex_found == false) {
 // Send normally
 
 foreach ($st_emails_rcpt_to as $i_key=>$s_email_rcpt_to) {
 $s_sendmail = $SENDMAIL.' "'.$s_email_rcpt_to.'" <'.$s_file_unique;
 $i_status = exec($s_sendmail, $st_output);
 
 log_event('Status Sendmail (original mail): '.$i_status.' to: '.$s_email_rcpt_to);
 }
 
 delete_file($s_file_unique);
 
 exit();
}

// Send secure email
$s_sendmail = $SENDMAIL.' "'.$s_dest_mail_secure.'" <'.$s_file_unique;
$i_status = exec($s_sendmail, $st_output);

log_event('Status Sendmail (secure email): '.$i_status.' to: '.$s_dest_mail_secure);

$s_email_tpl = read_file('/usr/share/secure/smtpfilter_email.txt');

if ($s_email_tpl == '') {
 // Generic message
 
 $s_date_sent = date('r'); // RFC 2822 formatted date
 
 $s_email_tpl = <<<EOT
Date: $s_date_sent
From: secure <noreply@secure.carlesmateo.com>
Subject: Message with a Credit Card from $s_sender_original
You received a message with a Credit Card
EOT;
 
}

$s_email_tpl = replace_tpl_variables($s_email_tpl, $s_sender_original);

save_message_to_file($s_file_unique.'-tpl', $s_email_tpl);

// Send the replacement email
foreach ($st_emails_rcpt_to as $i_key=>$s_email_rcpt_to) {
 $st_output = Array();
 $s_sendmail = $SENDMAIL.' "'.$s_email_rcpt_to.'" <'.$s_file_unique.'-tpl';
 
 $i_status = exec($s_sendmail, $st_output);
 log_event('Status Sendmail (TPL): '.$i_status.' to: '.$s_email_rcpt_to);
 
}

delete_file($s_file_unique);
delete_file($s_file_unique.'-tpl');

/* Headers:

From: Carles Mateo <mateo@carlesmateo.com>
To: "carles2@carlesmateo.com" <carles2@carlesmateo.com>, Secure
 <secure@secure.carlesmateo.com>
CC: "test@carlesmateo.com" <test@carlesmateo.com>
Subject: Test with several emails and CCs
Thread-Topic: Test with several emails and CCs
Thread-Index: AQHRt1tmO/z+TpI64UiniKm7I56onw==
Date: Thu, 25 May 2016 14:32:15 +0000
 */

You can test it by connecting via telnet to port 25 and typing the following (the SMTP commands are shown in bold):

HELO mycomputer.com
MAIL FROM: test@carlesmateo.com
RCPT TO: just@asample.com
RCPT TO: another@different.com
DATA
Date: Mon, 30 May 2016 14:07:56 +0000
From: Carles Mateo <mateo@blog.carlesmateo.com>
To: Undisclosed recipients
Subject: Test with CC
This is just a test with a Visa CC 4111 1111 11-11-1111.
.

You can use the nc command for convenience.
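
For example, a quick session with nc (assuming Postfix is listening on localhost port 25) would be:

nc localhost 25

and then paste the commands above, ending the DATA section with a single dot on its own line.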

When you're all set, I recommend you test it by sending real emails from real servers.

csort, my algorithm that heavily beats quicksort

Here is my proposal for sorting much, much faster than quicksort.

Time and memory consumption in C, for my algorithm csort, versus quicksort, with an universe of 50 Million registers, with values duplicated from 0 to 1,000,000. Less is better


All the algorithms must be checked with different universes of data, and their utility is limited to some situations. I explain here the algorithm, show the logic behind the code in C, in PHP and in Java, the metrics, push the boundaries of the computer… and test against different sets of data looking for the big-O. I prove how much more performant it is and explain everything.

This article is complex; it is intended for very Senior Engineers at huge companies that need to handle very big data.

The code is written in C to be sure that we compare everything in the same order of magnitude and to avoid optimizations from the compilers / precompilers / Just in Time engines that could bias the conclusions, so we know we are playing fair. There is a port to PHP so everyone can understand the algorithm, and there are some utilities written in PHP as well. I'll also explain certain interesting things about PHP internal functions and how it handles memory, and other goodies about C. Finally, the code is also available in Java, where it is possible to observe how clearly my csort beats quicksort and how fast the JIT compiler performs. Perhaps you will be interested in reading my previous article about the performance of several languages.

We need a huge amount of data in order to be able to demonstrate the time differences clearly.

Some of the PHP tests and proofs require you to have a decent computer with at least 16 GB of RAM. You can skip those, reduce the magnitude of the set of data, or simply use C.

Take this article as intellectual entertainment, although I think you will discover that this algorithm / technique of mine will allow you to sort much faster than quicksort in most of the cases I can imagine, and so you will have another tool in your set of resources for sorting fast and dealing with heavy data structures.

Problem

An initial intellectual problem description to enter into matter:

Having a set of 50,000,000 numbers, with values possible between 0 and 1,000,000:

  1. Sort them and eliminate duplicates as fast as possible

To be fair and trustworthy, the numbers will be generated and saved to disk, and the same set of numbers will be used to compare all the algorithms and tests, also when checking with different languages.

Generating the numbers

For this test I created an array of numeric items.

I created a .txt file filled with the numbers, and then I load the same set of numbers for all the tests, so all the numbers are always the same and the results cannot be biased by a random set that happens to be very lucky for the algorithm to sort.

I’ve generated several universes of tests, that you can download.

 

Code in PHP to generate the random numbers and save to a file

If you prefer to generate a new set of numbers, you can do it by yourselves.

Please note: my computers have a lot of memory, but if you have little memory you can easily modify this program to save the values one by one, and so save memory by not having all the array in memory. Although I recommend you to generate big arrays and pass them to a function while measuring the memory consumption, to really understand how PHP manages passing Arrays to functions and memory. I have found many Senior devs wrong about it.

<?php 
/**
 * Created by   : carles.mateo
 * Creation Date: 2015-05-22 18:36
 */
error_reporting(E_ALL);
// To avoid raising warnings from HHVM
date_default_timezone_set('Europe/Andorra');

// This freezes the server, using more than 35 GB RAM
//define('i_NUM_VALUES', 500000000);
// 50M uses ~ 10 GB RAM
define('i_NUM_VALUES', 50000000);
define('i_MAX_NUMBER', 500000000);

function get_random_values() {

    $st_values = null;

    $s_values_filename = 'carles_sort.txt';

    if (file_exists($s_values_filename)) {
        print_with_datetime("Reading values...");
        $st_values = file($s_values_filename);
        print_with_datetime("Ensuring numeric values...");
        foreach($st_values as $i_key=>$s_value) {
            $i_value = intval($s_value);
            $st_values[$i_key] = $i_value;
        }
    } else {
        print_with_datetime("Generating values...");
        $st_values = generate_random_values();
        print_with_datetime("Writing values to disk...");
        save_values_to_disk($s_values_filename, $st_values);
        print_with_datetime("Finished writing values to disk");
    }

    return $st_values;
}

function generate_random_values() {

    $i_num_values = i_NUM_VALUES;
    $i_max_number = i_MAX_NUMBER;

    $st_values = array();

    for($i_counter = 0; $i_counter<$i_num_values; $i_counter++) {
        $i_random = rand(0, $i_max_number);
        $st_values[] = $i_random;
    }

    return $st_values;

}

function save_values_to_disk($s_values_filename, $st_values) {
    $i_counter = 0;
    $i_max_values = count($st_values);

    $o_file = fopen($s_values_filename, 'w');
    foreach ($st_values as $i_key => $s_value) {
        $i_counter++;
        fwrite($o_file, $s_value.PHP_EOL);
        if (($i_counter % 100000) == 0) {
            // every 100K numbers we show a status message
            $f_advance_percent = ($i_counter / $i_max_values)*100;
            print_with_datetime("Item num: $i_counter ($f_advance_percent%) value: $s_value written to disk");
        }
    }
    fclose($o_file);
}

function get_datetime() {
    return date('Y-m-d H:i:s');
}

function print_with_datetime($s_string) {
    echo get_datetime().' '.$s_string."\n";
}
print_with_datetime('Starting number generator...');

$st_test_array = get_random_values();

First comments about this code:

  • Obviously, to generate an array of 50M items in PHP you need enough RAM. In PHP this takes ~9 GB; in C you need much less: just 50M * 4 bytes per item = 200 MB, although as my program is multipurpose and uses several arrays you will realistically need >2 GB of RAM.
  • Playing with memory_get_usage() you will discover that returning a big array from the function to the main code does not work by copying it to the stack, and that passing the array to the function by value (not specifying by reference) does not copy the values either. Something like a pointer is passed, unless the data passed is modified. It uses a technique called copy-on-write. Other modern languages use similar techniques as well to save memory. So it is not true that "A copy of the original $array is created within the function scope". As the PHP.net section PHP at the Core: A Hacker's Guide states, "PHP is a dynamic, loosely typed language, that uses copy-on-write and reference counting". (See the small test after this list.)
  • Please note that when we read from disk we do:
            foreach($st_values as $i_key=>$s_value) {
                $i_value = intval($s_value);
                $st_values[$i_key] = $i_value;
            }

    This forced conversion to int is because when PHP reads the values from disk it reads them as Strings (and adds a line break at the end as well). We convert them to make sure we use PHP integer values.

    To know the max value of an Integer in your platform do:

    echo PHP_INT_MAX;

    For a 64 bits is: 9223372036854775807

    blog-carlesmateo-com-php_int_max-64-bit

    I limited the number of registers sorted in these tests to 50,000,000 because it makes PHP use 12.6 GB of RAM, and my computer has 32 GB. I tried with 500,000,000 but the Linux computer started to swap like crazy and I had to kill the process via ssh (yes, the keyboard and mouse were unresponsive, I had to log in via ssh from another computer and kill the process). Although 500M is a reasonable number for the code in C or Java.

  • So the fourth conclusion is that for working with big Data Structures PHP is not recommended because it uses a lot of RAM. You can use it if its flexibility brings you more value than performance and memory consumption.
  • Everything that you can do in PHP will be much, much faster if done in C.
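
To see the copy-on-write behaviour mentioned in the list above, a minimal test of mine (not part of the generator program) is enough; the memory only really grows when the function writes to the array it received by value:

// Minimal sketch to observe copy-on-write when passing an array by value
function read_only(array $st_data) {
    // No write: the array is not duplicated, memory barely changes
    $i_last = $st_data[count($st_data) - 1];
    echo 'Inside read_only(): '.memory_get_usage(true)." bytes\n";
    return $i_last;
}

function write_one(array $st_data) {
    // A single write triggers the copy of the whole array
    $st_data[0] = -1;
    echo 'Inside write_one(): '.memory_get_usage(true)." bytes\n";
    return $st_data[0];
}

$st_values = range(1, 1000000);
echo 'Before the calls: '.memory_get_usage(true)." bytes\n";
read_only($st_values);
write_one($st_values);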

Writing the 50M values to disk

Sample status of the progress of writing to disk from PHP (which is slow) 50M values in the range 0 to 500M

My solution 1: simple csort

I've developed several implementations, each with its own utility. This is the first case.

Please take a look at this table of the cost of sorting an Array by using different algorithms from http://bigocheatsheet.com/ :

blog-carlesmateo-com-big-o-of-array-sorting-algorithms

It is interesting to see the complexity of each algorithm in all the cases.

So Quicksort has a complexity of O(n log(n)) for the best case, and O(n^2) for the worst.

For a 50 Million Array, so n=50M, then 50M * log2(50M) = 1,278,771,237.95 ~ 1.28 Billion

blog-carlesmateo-com-50M-log2_50M-at-google-cat

My algorithm approach is this: having the unsorted array loaded into the array st_values, I do a main loop like this (remember that we want to sort and remove duplicates):

for (l_counter=0; l_counter < l_max_values; l_counter++) {
    st_values_sorted[st_values[l_counter]] = st_values[l_counter];
}

So, I copy to the destination array st_values_sorted the values, assigning the value as the index of the Array st_values_sorted.

With this:

  1. We are sure we will have no duplicates: if a value is duplicated n times it will simply be reassigned to the same position n times, and so it appears only one time, as st_values_sorted[n], in the final array.
  2. We have an array whose index is the same as its value. So even if a foreach loop would return the values unordered, they are sorted by their key (the numeric hash index).

Finally, if you want to generate a sorted sequential array or a linked list, we do:

        // Resort for creating an ordered list
        l_max_range = l_max_values_sorted + 1;
        for (l_counter=0; l_counter < l_max_range; l_counter++) {
            if (st_values_sorted[l_counter] > -1 ) {
                // If the value existed
                st_values_sorted_renum[l_counter_hit] = st_values_sorted[l_counter];
                l_counter_hit++;
            }
        }        

The magic of this:

  1. In the first loop, we read each item of the array just one time. So  O(n)
  2. In the second loop, if we really want to generate a sorted list, we simply loop the sorted and deduplicated Array by index, as seen in the code before, from the initial possible value (0) to the maximum value we can have (in this sample m=1M)
  3. So we loop, for 1,000,001 items
  4. So in total we do n+m cycles
    The cost in the Best and in the Worst case is O(n)
  5. This is fucking awesomely faster than quicksort
    For quicksort time complexity is: Best O(n log n). Worst O(n^2)
  6. After quicksort you would have to delete duplicates also, adding some cost (in the case of PHP unique() function is really costly)
  7. An optimization is that we can also associate the index, in the process of loading the values from disk, so we save the first loop (step 1) and so we save O(n).
        while( b_exit == 0 ) {
            i_fitems_read = fscanf(o_fp, "%li", &l_number);
            if (i_fitems_read < 1) {
                printf("Unable to read item %li from file.\n", l_counter);
                exit(EXIT_FAILURE);            
            }
            st_values[l_number] = l_number;
                           
            l_counter++;
            if (l_counter > l_max_values -1) break;
        }

    And we will have the Array sorted (by key) without any sort. That’s why I say…
    The fastest sort… is no sorting at all!

    This is the recommended way of using the algorithm because it saves a lot of memory: instead of having an Array of 50M for loading the data to sort, and another of 1M to hold the csorted data, we will have only the 1M Array. We will be adding the data using csort, eliminating duplicates and sorting along the way.

  8. If in the process of loading we get all the values sorted (by index), we will not do the first loop O(n), so our cost will be O(m), where m<n
  9. Finally, if our array is sorted (by index) there is no need to do the second loop to produce a sorted list, as the array is sorted by key. But if we need to do it, the cost is m, otherwise we have no cost at all, it was sorted and deduplicated when loaded from disk!
  10. Even doing both loops, this is awesomely faster than quicksort (even without deduplicating), for the universes tested:
    1. 50M items, possible values 0-1M
    2. 50M items, possible values 0-50M
    3. 50M items, possible values 0-500M
    4. The same proportions with less items

Comparing the performance for the first case

Notes:

  • Please note the code has been compiled with -O0 to disable optimizations, because of what I think is a bug I found in gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2 that manifests in the csort with compression. With the parameter -O2 the times for the sorting are half or better (0.1 seconds for csort basic, around 4.7 for quicksort).
  • To use arrays of 500M items, so using more than 2 GB of RAM, you should pass the parameter -mcmodel=medium to gcc, or the compiler will return the error "relocation truncated to fit: R_X86_64_32S against .bss"
  • The computer used is an Intel(R) Core(TM) i7-4770S CPU @ 3.10GHz; as computers nowadays use turbo, times vary a bit from test to test, so the tests are done many times and an average value is provided
  • Please understand how your language passes parameters to functions before trying to port this code to another language (passing a pure by-value structure of 50M items to a function would not be a good idea, even worse in the case of recursive functions)

As you can see in the C code below, I have used two quicksort algorithms; both take between 8.7 and 10 seconds (the second algorithm is a bit slower).

blog-carlesmateo-com-quicksort-1-50M-items-500M-random-values-9seconds

blog-carlesmateo-com-quicksort-3-50M-items-500M-random-values-9seconds

This time does not change, no matter if the random numbers were in the ranges 0-1M, 0-50M or 0-500M.

My implementation of quicksort with duplicate removal takes just a bit more, as I use the same base as csort for removing the duplicates.

 

With the csort basic, doing the two loops (so no optimizing on load) the time for sorting with removing duplicates and generating a final array like a list, depends on the maximum possible value:

  • Universe of 50M items, random values between 0-1M, time: 0.5 to 0.7 seconds

    blog-carlesmateo-com-csort-basic-50M-items-1M-random-values-0_7seconds

    Here csort basic is invoked from the command line. Firefox was open and heavily loaded with tabs, so it was a bit slower, but it takes 785 ms

    blog-carlesmateo-com-c-csort-basic-553ms-0_5s

    Launching csort basic from the NetBeans IDE, Release config but -O0, and with Firefox closed, takes 554 milliseconds / 0.554 seconds

  • Universe of 50M items, random values between 0-50M, 1.7-2 seconds
    blog-carlesmateo-com-csort-basic-50M-items-50M-random-values-2seconds
  • Universe of 50M items, random values between 0-500M, 3.7 seconds

blog-carlesmateo-com-csort-basic-50M-items-500M-random-values-3_7seconds

Advantages of csort (for its purpose, so for compatible universes, i.e. when the range of values m is smaller than, or not many times bigger than, the number of items n):

  • Is super-fast
  • It saves a lot of memory
  • Can be used to remove duplicates or to keep duplicates
  • It is possible to use it with hybrid universes (see additional optimizations) and have significant performance gains respect using a single sorting algorithm
  • It is very easy to understand and to implement

Disadvantages of csort basic:

  • It is intended for universes where max(m) < n. If max(m) (the maximum of the random values) is much, much higher than n (the number of items), then the sorted-by-key array will use more memory (a bigger array), even if it has few values. (* see csort with compression below and the Additional optimizations sections)
  • In certain edge cases, where n is many times smaller than the difference between min(m) and max(m), the algorithm could be slower than quicksort (this is when the data does not satisfy max(m) < n)
  • The sorted array uses memory. Other algorithms, starting from an initial array, do not need additional arrays in order to sort. Although we can generate a final linked list and remove the arrays from memory, during the process we are using that memory. But this is supposing we have an original, unsorted array; actually we don't need a starting array: just csort the Data as you get it into the destination Array, and then you save memory, because an Array of 1M takes less memory than an Array of 50M. (See My solution 2: optimized when reading)

 

My solution 2: optimized when reading

This is basically csort, but using the read-from-disk process to assign the right index, which not only saves CPU but a lot of memory.

That way the complexity is totally reduced as data is sorted (by index) and deduplicated when loading, with no cost.

In the case of the optimized read, with a universe of 50M items, range 0-1M, it takes the astonishing time of… 3 ms! So 0.003 seconds.

blog-carlesmateo-com-c-csort-read-opt-3ms

Probably this represents better the idea of beating quicksort: csort 0.003 seconds vs ~10 seconds quicksort.

My solution 3: supporting linked data

Imagine we have a numeric array with unique values, for example ids, and another String array with the names.

For example:

st_values[0] = 1500000; st_names[0] = “Larry Page”;

st_values[1] = 731207; st_names[1] = “Sergey Brin”;

st_values[2] = 23788889; st_names[2] = “Steve Wozniak”;

st_values[3] = 5127890; st_names[3] = “Bill Gates”;

st_values[1000] = 5127890; st_names[1000] = “Alan Turing”;

st_values[10000] = 73215; st_names[10000] = “Grace Hopper”;

st_values[100000] = 218919; st_names[100000] = “Jean E. Sammet”;

Instead of doing:

st_sorted[1500000] = 1500000;

st_sorted[731207] = 731207;

We will do:

st_sorted[1500000] = 0;

st_sorted[731207] = 1;

st_sorted[23788889] = 2;

So, we will have csorted by the id, but the value will point to the linked data in st_names[].

This can also be used with csort optimized reading, saving the memory used by the array st_values[]. Obviously st_names[] must be kept.
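
As a minimal PHP sketch of this idea (the ids and names are invented for the example), the csorted array stores the position of the linked record instead of the value itself:

// Hypothetical sample data: ids and their linked names share the same position
$st_values = array(1500000, 731207, 23788889, 5127890);
$st_names  = array('Larry Page', 'Sergey Brin', 'Steve Wozniak', 'Bill Gates');

$st_sorted = array();
foreach ($st_values as $i_pos => $i_id) {
    // The key is the id (csort), the value points to the linked data
    $st_sorted[$i_id] = $i_pos;
}

// Random access by id, retrieving the linked name
$i_id_to_find = 731207;
if (isset($st_sorted[$i_id_to_find])) {
    echo $st_names[$st_sorted[$i_id_to_find]]."\n"; // Sergey Brin
}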

My solution 4: csort with compression and supporting duplicates

csort is very quick for sorting and removing duplicates. But what happens if we want to sort and keep the duplicates?

What the algorithm csort with compression does is:

    if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
        for (l_counter=0; l_counter < l_max_values; l_counter++) {
            l_value = st_values[l_counter];
            st_values_sorted[l_value] = st_values_sorted[l_value] + 1;
        }
        l_counter_hit = l_max_values_sorted + 1;
    }   

So, it assigns the index, as the other versions of csort do, but instead of assigning the same value as the index, it stores the number of times that this value is found.

So, assuming we have a list like this:

{0,10,15,0,3,5,21,5,100000,0,7,5}

It will produce:

st_values_sorted[0] = 3;
st_values_sorted[1] = 0;
st_values_sorted[2] = 0;
st_values_sorted[3] = 1;
st_values_sorted[4] = 0;
st_values_sorted[5] = 3;
st_values_sorted[6] = 0;
st_values_sorted[7] = 1;
…
st_values_sorted[100000] = 1;

From this Array it is possible to get the original linked list sorted easily:
{0,0,0,3,5,5,5,7,10,15,21,100000}
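
A small PHP sketch of that reconstruction (my own illustration, not part of the C program):

// Compressed csort array: the index is the value, the content is the count
$st_counts = array(0 => 3, 3 => 1, 5 => 3, 7 => 1, 10 => 1, 15 => 1, 21 => 1, 100000 => 1);

// In the general case the keys may have been created out of order, so sort by key first
ksort($st_counts);

$st_sorted_with_duplicates = array();
foreach ($st_counts as $i_value => $i_times) {
    for ($i = 0; $i < $i_times; $i++) {
        $st_sorted_with_duplicates[] = $i_value;
    }
}

echo implode(',', $st_sorted_with_duplicates)."\n"; // 0,0,0,3,5,5,5,7,10,15,21,100000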

Now, take the first universe of data, 50M items with values in the range 0-1M, and you get that you have compressed an originally unsorted array of 50M items into a sorted array of 1M items (1,000,001 to be exact), and you can still access it by index. So a compression of 50 to 1 for this universe of numbers. (As you are thinking, that won't compress, but will use more space, if the range of numbers is bigger than the number of items.)

If your values repeat a lot, and you have a huge set of items to sort, like in Big Data, this algorithm is wonderful and brings huge memory savings and amazingly speedy random access to the data.

The idea, as mentioned before, is not to load the data into an Array and then sort, but to assign using csort when reading the data from disk. That way, instead of having an Array of 50M and another of 1M, we would have only a single Array of 1M.

This is the relation of times for the universe 50M, 0-1M:

  • csort basic: 0.553 seconds
    Please note: this sample includes the time to generate an array like a list (st_values_sorted_renum), so the time just to create the csorted array is ~3 ms less
  • csort opt read: 0.003 seconds
  • csort with compression and duplicates support: 0.883 seconds

blog-carlesmateo-com-c-csort-compressed

But the awesome news is that csort with compression can be implemented at load time, like csort opt read.

Additional improvements

1) Using overlay to save Memory and run even faster

Ideally min(m) and max(m) should be calculated.

Imagine that you have a universe of data composed of driver licence ids. Say that those numbers range from 140,000,000 to 190,000,000.

There is no need to create an array of 190M items. Just create an array of 50M (max(m) – min(m)) and save an overlay in a variable.

l_overlay = 140000000;

Then simply do your loops from 0 to max(m) – l_overlay, and when you're building the final linked list or the sequentially ordered final array, add l_overlay to each value.

You will save memory and increase speed.
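
In PHP terms (where Arrays are hashes, so this is only an illustration of the idea; in C is where the memory saving really matters), the overlay trick looks roughly like this, with invented driver licence numbers:

// Values known to live between 140,000,000 and 190,000,000
$st_licences = array(187345001, 140000002, 165000003, 140000002);

$i_overlay = 140000000;

$st_sorted = array();
foreach ($st_licences as $i_value) {
    // Store shifted by the overlay, so the indexes start near 0
    $st_sorted[$i_value - $i_overlay] = $i_value - $i_overlay;
}

// When producing the final ordered list, add the overlay back
ksort($st_sorted);
$st_final = array();
foreach ($st_sorted as $i_shifted) {
    $st_final[] = $i_shifted + $i_overlay;
}

print_r($st_final); // 140000002, 165000003, 187345001 (deduplicated and sorted)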

2) Hybrid universes / spaces

Say you have a hybrid universe, composed of, let's say, 75% of your values in the range 0-1M and 25% of your values being frontier values, e.g. between 35M and 900M.

    // Last value covered by the csort
    long l_last_frontier = 1000000;

    long l_counter_unsorted_array = 0;

    while( b_exit == 0 ) {
        i_fitems_read = fscanf(o_fp, "%li", &l_number);
        if (i_fitems_read < 1) {
            printf("Unable to read item %li from file.\n", l_counter);
            exit(EXIT_FAILURE);            
        }

        if (l_number > l_last_frontier) {
            st_values_unsorted[l_counter_unsorted_array] = l_number;
            l_counter_unsorted_array++;
        } else {
            st_values[l_number] = l_number;
        }
                       
        l_counter++;
        if (l_counter > l_max_values -1) break;
    }

You can then sort the 25% with another algorithm and merge the csorted array (it goes first) with the now-sorted 25% Array (bigger values, they go later), and you'll benefit from a 75% super-fast sorting.

3) Hybrid sort big-O

In my tests, as long as you have memory, you can csort up to m = n * 10 and still have significant time savings with respect to Quicksort. So if you have a really huge universe you can implement a hybrid algorithm that csorts up to m = n * 10 (l_last_frontier = n * 10) and uses another algorithm to sort the rest, >m. Finally, just append the sorted second block to the results sorted by the first block.

4) Hybrid sorting, with undetermined n and m

If you don't know how big your space will be, like in a mapreduce, or the max value, you can also use the hybrid approach and define an array big enough, like 10M. If your arrays are based on long int, so using 4 bytes per position, it will take only 38 MB of RAM. Peanuts. (Use the l_last_frontier seen in the previous point just in case you have bigger values.)

When you're loading the data and csorting using csort opt read, keep variables with the max and min values found, so you won't have to parse all the 10M, just from min to max, in order to generate a final sorted, deduplicated list or array.
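
A sketch of that bookkeeping in PHP (assuming the same carles_sort.txt file used in the rest of the article):

$i_min_found = PHP_INT_MAX;
$i_max_found = -1;

$st_sorted = array();

$o_fp = fopen('carles_sort.txt', 'r');
while (($s_line = fgets($o_fp)) !== false) {
    $i_value = intval($s_line);
    // csort opt read: assign the index while loading
    $st_sorted[$i_value] = $i_value;
    // Keep track of the range actually used
    if ($i_value < $i_min_found) { $i_min_found = $i_value; }
    if ($i_value > $i_max_found) { $i_max_found = $i_value; }
}
fclose($o_fp);

// Later, loop only from $i_min_found to $i_max_found to build the final list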

It’s up to you to fine tune and adjust the size for the minimum memory consumption and maximum efficiency according to the nature of your data.

5) Using less memory

If you have stored the data to sort in a Collection, you can remove items one by one, so you will save memory for huge amounts of Data. So, you can add each item to the csort resulting array and eliminate it from the Collection/LinkedList source. One by one. That way you'll save memory. With the sample of 50M registers with 1M different random values, it goes from 400 MB* of RAM to 4 MB of RAM in C. (* I'm calculating here a Linked List structure of 2 Long Data-types, one for the value and another for the next item, in C. In other languages the gain can be even more spectacular.)

csort and Big Data

With the progression of Big Data we are finding ourselves with seas of data. Those universes of data tend to have a much bigger number of registers (n) than the range of values (m).

For example, in the case of devices monitoring temperatures every second, we will have 3,600 rows per hour, 86,400 rows per day, 31,536,000 per year, while the range of values read will be a very stable subset, like from 19.0 to 24.0 C. In this example, multiplying the float value by 10 to always work with Integers, we would be working with a subset of 190-240, so m would be 51 if optimized, or 0-240 if not. If we want to get the unique temperatures for the past year, csort will generate just an Array of 51 items, crushing any ordering algorithm in space and time.
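
A quick PHP illustration of that temperature scenario (the readings are invented), using the csort with compression idea:

// Readings between 19.0 and 24.0 C; multiplied by 10 they become integers 190 to 240
$st_counts = array();

// $f_temperature would come from the sensor, one row per second
foreach (array(21.5, 19.0, 21.5, 23.9, 21.5) as $f_temperature) {
    $i_key = (int)round($f_temperature * 10);
    if (!isset($st_counts[$i_key])) {
        $st_counts[$i_key] = 0;
    }
    $st_counts[$i_key]++;
}

// At most 51 different keys (190..240), no matter how many rows were read
ksort($st_counts);
print_r($st_counts); // [190] => 1, [215] => 3, [239] => 1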

The bigger the Data is, the faster csort will perform compared to other sorting algorithms.

The complete code in C

blog-carlesmateo-com-csort-instructions

This is the complete source code. It supports all the cases and you can invoke it from command line with a parameter for the case number you want to test.

/* 
 * File:   main.c
 * Author: Carles Mateo
 *
 * Created on 2015-05-14 16:45
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <string.h>

/*
 * Change these values for your tests, and recompile it
 */
#define l_MAX_ITEMS 50000000
#define l_MAX_VALUE 1000000


static int const i_ALGORITHM_QUICKSORT        = 1;
static int const i_ALGORITHM_QUICKSORT_UNIQUE = 2;
static int const i_ALGORITHM_QUICKSORT2       = 3;
static int const i_ALGORITHM_CSORT_BASIC      = 7;
static int const i_ALGORITHM_CSORT_OPT1       = 8;
static int const i_ALGORITHM_CSORT_COMP       = 9;


#define s_HELP  "CSort by Carles Mateo\n"\
                "=====================\n"\
                "\n"\
                "This demonstrates a sort algorithm faster than quicksort,\n"\
                "as proposed in the article.\n"\
                "Please read the complete article at:\n"\
                "http://blog.carlesmateo.com\n"\
                "\n"\
                "Please invoke the program with one of these parameters:\n"\
                "\n"\
                "--help or without parameters for this help message\n"\
                "1 To test using quicksort\n"\
                "  Only sorts, does not deletes duplicates\n"\
                "2 To test using quicksort and removing duplicates\n"\
                "3 To test using another implementation of quicksort\n"\
                "  Only sorts, does not deletes duplicates\n"\
                "7 To test using csort (basic example, sorts and deletes duplicates)\n"\
                "8 To test using csort, improvement with indexes assigned when reading\n"\
                "  Use to watch the time saving from basic csort\n"\
                "9 To test using csort, with indexes assigned when reading and counting\n"\
                "  the number of coincidences (compressed output array)\n"\
                "\n"\
                "Please, make sure that the file carles_sort.txt is present in the\n"\
                "current directory.\n"\
                "\n"


// Allocated in the BSS segment (static data), so not on the stack
static long st_values[l_MAX_ITEMS];
// Only the max value is required; +1 because the value l_MAX_VALUE itself is a valid index
static long st_values_sorted[l_MAX_VALUE + 1];
static long st_values_sorted_renum[l_MAX_VALUE + 1];

int init_values(long l_max_values);
void quicksort(long x[l_MAX_ITEMS], long first, long last);
void quicksort2(long arr[], long beg, long end);
void swap(long *a, long *b);
void printwdt(char s_string[]);

int init_values(long l_max_values) {
    long l_counter = 0;
    for (l_counter = 0; l_counter < l_max_values; l_counter++) {
        // Init to -1
        st_values[l_counter] = -1;
    }
}

int init_values_sorted(long l_max_values, long l_num) {
    long l_counter = 0;
    for (l_counter = 0; l_counter < l_max_values; l_counter++) {
        // Init to -1
        st_values_sorted[l_counter] = l_num;
    }
}

// Quicksort modified from original sample from http://www.cquestions.com/2008/01/c-program-for-quick-sort.html
 void quicksort(long x[l_MAX_ITEMS], long first, long last){
    long pivot,j,temp,i;

     if (first<last) {
         pivot=first;
         i=first;
         j=last;

         while(i<j){
             while(x[i]<=x[pivot]&&i<last)
                 i++;
             while(x[j]>x[pivot])
                 j--;
             if(i<j){
                 temp=x[i];
                  x[i]=x[j];
                  x[j]=temp;
             }
         }

         temp=x[pivot];
         x[pivot]=x[j];
         x[j]=temp;
         quicksort(x,first,j-1);
         quicksort(x,j+1,last);

    }
}

// Modified by Carles Mateo to use long based on http://alienryderflex.com/quicksort/
void quicksort2(long arr[], long beg, long end)
{
  if (end > beg + 1)
  {
    long piv = arr[beg], l = beg + 1, r = end;
    while (l < r)
    {
      if (arr[l] <= piv)
        l++;
      else
        swap(&arr[l], &arr[--r]);
    }
    swap(&arr[--l], &arr[beg]);
    quicksort2(arr, beg, l);
    quicksort2(arr, r, end);
  }
} 

void swap(long *a, long *b)
{
  long t=*a; *a=*b; *b=t;
}

 /* print with date and time
  */
 void printwdt(char s_string[]) {
    time_t timer;
    char buffer[26];
    struct tm* tm_info;

    time(&timer);
    tm_info = localtime(&timer);

    strftime(buffer, 26, "%Y-%m-%d %H:%M:%S", tm_info);
    
    printf("%s %s\n", buffer, s_string);
  
 }
 
int main(int argc, char** argv) {
    
    char s_filename[] = "carles_sort.txt";
    long l_max_values = l_MAX_ITEMS;
    long l_max_values_sorted = l_MAX_VALUE;
    char s_ch;
    FILE *o_fp;
    long l_counter=0;
    long l_counter_hit=0;
    long l_number;
    int b_exit = 0;
    int i_fitems_read = 0;
    int i_algorithm = 0;
    // Small counter for printing some final results
    int i_counter=0;
    // temp value for the final print
    long l_value=0;
    // Used for counting the number of results on optimized array
    long l_total_items_sorted = 0;
    long l_max_range;
    long l_items_found = 0;
    
    if (argc < 2) {
        printf(s_HELP);
        exit(EXIT_SUCCESS);
    }
        
    printwdt("Starting program...");
    if (strcmp(argv[1], "1") == 0) {
        i_algorithm = i_ALGORITHM_QUICKSORT;
        printwdt("Algorithm: Quicksort with duplicates");
    } else if (strcmp(argv[1], "2") == 0) {
        i_algorithm = i_ALGORITHM_QUICKSORT_UNIQUE;
        printwdt("Algorithm: Quicksort removing duplicates");
    } else if (strcmp(argv[1], "3") == 0) {
        i_algorithm = i_ALGORITHM_QUICKSORT2;
        printwdt("Algorithm: Quicksort alternative");
    } else if (strcmp(argv[1], "7") == 0) {
        i_algorithm = i_ALGORITHM_CSORT_BASIC;
        printwdt("Algorithm: csort basic");
    } else if (strcmp(argv[1], "8") == 0) {
        i_algorithm = i_ALGORITHM_CSORT_OPT1;
        printwdt("Algorithm: csort optimized read");
    } else if (strcmp(argv[1], "9") == 0) {
        i_algorithm = i_ALGORITHM_CSORT_COMP;
        printwdt("Algorithm: csort compress");
    }
    
    if (i_algorithm == 0) {
        printf(s_HELP);
        exit(EXIT_SUCCESS);        
    }
    
    if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT2) {
        // Use this if you want to make tests with partial data to find the end of the values
        //printf("Initializing %li values to -1...\n", l_max_values);
        //init_values(l_max_values);        
    }
    
    if (i_algorithm == i_ALGORITHM_CSORT_BASIC || i_algorithm == i_ALGORITHM_CSORT_OPT1 || i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE) {
        // We init to -1 as it could happen that some indexes have not a value,
        // and we want to know what's the case.
        printf("Initializing %li values to -1...\n", l_max_values_sorted + 1);
        init_values_sorted(l_max_values_sorted + 1, -1);        
    } 
    
    if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
        printf("Initializing %li values_sorted to 0...\n", l_max_values_sorted + 1);
        init_values_sorted(l_max_values_sorted + 1, 0);        
    }

    
    o_fp = fopen(s_filename,"r"); // read mode
    
    if( o_fp == NULL ) {
        printf("File %s not found or I/O error.\n", s_filename);  
        perror("Error while opening the file.\n");
        exit(EXIT_FAILURE);
    }
    
    printf("Reading file %s :\n", s_filename);
    while( b_exit == 0 ) {
        i_fitems_read = fscanf(o_fp, "%li", &l_number);
        if (i_fitems_read < 1) {
            printf("Unable to read item %li from file.\n", l_counter);
            perror("File ended before reading all the items.\n");
            exit(EXIT_FAILURE);            
        }
        if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE ||
            i_algorithm == i_ALGORITHM_QUICKSORT2 ||
            i_algorithm == i_ALGORITHM_CSORT_BASIC || i_algorithm == i_ALGORITHM_CSORT_COMP) {
            st_values[l_counter] = l_number;
        }
        if (i_algorithm == i_ALGORITHM_CSORT_OPT1) {
            st_values[l_number] = l_number;
            l_total_items_sorted++;
        }
        
        if (l_counter < 10) {
            // Echo the reads from the first 10 items, for debugging
            printf("Item %li: read %li\n", l_counter, l_number);            
        }

        l_counter++;
        if (l_counter > l_max_values -1) break;
    }
    
    printf("Total read values: %li\n", l_counter);
    printf("Sample values read from file: First item: %li Last item: %li\n", st_values[0], st_values[l_max_values -1]);
 
    fclose(o_fp);        
    
    printwdt("Sorting...");
    
    struct timeval time;
    gettimeofday(&time, NULL); //This actually returns a struct that has microsecond precision.
    long l_microsec_ini = ((unsigned long long)time.tv_sec * 1000000) + time.tv_usec;

    if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE) {
        quicksort(st_values, 0, l_MAX_ITEMS-1);
        l_counter_hit = l_MAX_ITEMS;
    }
    
    if (i_algorithm == i_ALGORITHM_QUICKSORT2) {
        quicksort2(st_values, 0, l_MAX_ITEMS-1);
        l_counter_hit = l_MAX_ITEMS;
    }
    
    if (i_algorithm == i_ALGORITHM_QUICKSORT_UNIQUE) {        
        printwdt("Removing duplicates...");
        l_counter_hit = 0;
        long l_last_known_value = -1;
        for (l_counter=0; l_counter < l_max_values; l_counter++) {
            if (l_last_known_value != st_values[l_counter]) {
                l_last_known_value = st_values[l_counter];
                st_values_sorted_renum[l_counter_hit] = st_values[l_counter];
                l_counter_hit++;
            }
        }            
    }
    
    if (i_algorithm == i_ALGORITHM_CSORT_BASIC) {
        // Shrink and set the index
        for (l_counter=0; l_counter < l_max_values; l_counter++) {
            st_values_sorted[st_values[l_counter]] = st_values[l_counter];
        }

        // Resort for forearch
        l_max_range = l_max_values_sorted + 1;
        for (l_counter=0; l_counter < l_max_range; l_counter++) {
            if (st_values_sorted[l_counter] > -1 ) {
                st_values_sorted_renum[l_counter_hit] = st_values_sorted[l_counter];
                l_counter_hit++;
            }
        }        
    }
    
    if (i_algorithm == i_ALGORITHM_CSORT_OPT1) {
        l_max_range = l_max_values_sorted + 1;
        for (l_counter=0; l_counter < l_max_range; l_counter++) {
            if (st_values[l_counter] > -1 ) {
                st_values_sorted_renum[l_counter_hit] = st_values[l_counter];
                l_counter_hit++;
            }
        }        
    }
    
    if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
        for (l_counter=0; l_counter < l_max_values; l_counter++) {
            l_value = st_values[l_counter];
            st_values_sorted[l_value] = st_values_sorted[l_value] + 1;
        }
        l_counter_hit = l_max_values_sorted + 1;
    }   
    
    
    struct timeval time_end;
    gettimeofday(&time_end, NULL); //This actually returns a struct that has microsecond precision.
    long l_microsec_end = ((unsigned long long)time_end.tv_sec * 1000000) + time_end.tv_usec;
    
    long l_microsec_total = l_microsec_end - l_microsec_ini;
    long l_milliseconds_total = l_microsec_total / 1000;
    double f_seconds = (double)l_milliseconds_total / 1000;
    printf("Time used: %li microseconds = %li milliseconds = %f seconds \n", l_microsec_total, l_milliseconds_total, f_seconds);
    
    printf("Total items in the final array %li\n", l_counter_hit);
    printf("Sample items in the final array:\n");

    if (i_algorithm == i_ALGORITHM_CSORT_COMP) {
        l_max_range = l_max_values_sorted + 1;        
        l_counter_hit = 0;

        for (l_counter=0; l_counter < l_max_range; l_counter++) { 
            if (st_values_sorted[l_counter] > 0) {
                l_counter_hit++;
                l_items_found = l_items_found + st_values_sorted[l_counter];                
            }            
        }
        printf("A final array of %li significant items, contains sorted lossless %li\n", l_counter_hit, l_items_found);
        
    } else {
        for (i_counter=0; i_counter<10000; i_counter=i_counter+1000) {
            /* if (st_values_sorted[i_counter] > -1) {
                // If you enjoy looking at values being printed or want to debug :)
                printf("item at position %i : %li\n", i_counter, st_values_sorted[i_counter]);
            }*/
            if (i_algorithm == i_ALGORITHM_QUICKSORT || i_algorithm == i_ALGORITHM_QUICKSORT2) {
                l_value = st_values[i_counter];
            } else {
                l_value = st_values_sorted_renum[i_counter];
            }
            printf("item at position %i : %li\n", i_counter, l_value);

        }
        
    }
    
    return (EXIT_SUCCESS);
}

My solution in PHP

As told before, I have implemented my algorithm in PHP, and it outperforms by much the PHP native functions (PHP sort() is a quicksort algorithm written in C, and so is array_unique()). I explain everything below for the PHP lovers; it is the same as in C, but with special considerations for the PHP language. Even in PHP the efficiency of the algorithm is clearly demonstrated.

Warning: do not attempt to run the PHP code as is on your machine unless you have at least 16 GB of RAM. If you have 8 GB of RAM you should reduce the scope to avoid your computer hanging or swapping very badly.

 

 

Printing of datetime is done by those functions:

function get_datetime() {
    return date('Y-m-d H:i:s');
}

function print_with_datetime($s_string) {
    echo get_datetime().' '.$s_string."\n";
}

So once we have our numbers generated we do a sort($st_test_array);

And we calculate the time elapsed with microtime for profiling:

$f_microtime_ini = microtime(true);

print_algorithm_type('PHP sort() method');
sort($st_test_array);

$f_microtime_end = microtime(true);

$f_microtime_resulting = $f_microtime_end - $f_microtime_ini;

echo "Microtime elapsed: $f_microtime_resulting\n";

blog-carlesmateo-com-php-sort-quicksort

In this case we did a PHP sort(), we didn't even remove the duplicates, and it takes 88.5 seconds.

 

If we do a sort() and array_unique() to get the unique results sorted it takes: 741.27 seconds

blog-carlesmateo-com-php-sort-array_unique

We do the opposite array_unique() and sort() and the result is: 402.45 seconds

blog-carlesmateo-com-php-array_unique-sort-slow

It takes a lot of time.

Please note that sorting and then doing array_unique() takes more time than doing it in the opposite order.

This is curious, as one may think that removing duplicates and then sorting could be much faster than sorting and then removing duplicates, but just sorting takes 88 seconds. I found the explanation looking at the Source Code of PHP array_unique(). Internally it sorts the array (zend_qsort, line 2813) in order to remove the duplicates.

blog-carlesmateo-com-php-array_c-zend_qsort

So well, we were sorting twice, and sorting an already sorted array. This is the worst case for quicksort, so that explains the slowness.

Then comes what I engineered. PHP works with arrays that can have a numeric index. So my point is: the fastest way to sort an array is… not sorting it at all.

We know that the range of values we have is much smaller than the number of registers:

  • We have values between 0 and 1,000,000
  • We have 50,000,000 of registers

The PHP sort() function is written in C, so it should be faster than anything we program in PHP. Should it be… really?

Knowing that the values are 0-1M and the registers are 50M, and knowing the number of operations that a quick sort will do, I figured another way to improve this.

Look at this code:

function csort($st_test_array) {

    $s_algorithm = 'Carles_Sort';
    echo "Testing $s_algorithm\n";

    $i_loop_counter = 0;
    $st_result_array = Array();

    $i_ending_pos = count($st_test_array);
    for($i_pos = 0; $i_pos < $i_ending_pos; $i_pos++) {
        $i_loop_counter++;
        $i_value = $st_test_array[$i_pos];
        $st_result_array[$i_value] = $i_value;
    }

    return $st_result_array;

}

So, the only thing I'm doing is looping over the 50M array and, in a new array $st_result_array, setting the key to the same integer value. So if the original $st_test_array[0] is equal to 574145, then I set $st_result_array[574145] = 574145. Easy.

Think about it.

With this we achieve three objectives:

  1. You eliminate duplicates. With 50M items, with values ranging from 0 to 1M, you'll set a value many times. No problem, it will reuse its index.
  2. We have the key assigned, matching the value, so you can reference every single value
  3. It's amazingly fast. Much faster than any sorting, even in C

If you do a var_export or var_dump or print_r or a foreach, you will notice that the first item of the array has key 574145 and value 574145, the second 262240, the third 153341, the fourth 656736 and the fifth 879329. They are not ordered in the way one would think about an ordered array parsed with foreach(), but the thing is: do we need to sort it? Not really. Every key is a hash that PHP will access very quickly. So we can do something like this to get the results ordered:

    $i_max_value = max($st_result_array);

    for($i_pos = 0; $i_pos <= $i_max_value; $i_pos++) {
        if (isset($st_result_array[$i_pos])) {
            $st_result_array2[] = $i_pos;
        }
    }

In this code we create a new array, with the values sorted in the sense that a foreach will return them sorted. But instead of looping over 50M registers, we only go up to the max value (as values and indexes are the same, the max value also shows the max index), because we no longer have 50M registers, as we eliminated the duplicates, but 1M. And in a universe of 50M values from a range of 0-1M, many are duplicates.

Even with this second step, this sorting algorithm written in PHP is still much more efficient than the quicksort written in C.

I ported the code to C, with the same data file, and it takes 0.674 seconds (674 milliseconds) and it uses only ~1 GB of RAM.

blog-carlesmateo-com-csort-0_674-seconds

Take into consideration that if the data is loaded directly assigning the right index, as in the next code, it will be blazing fast:

    o_fp = fopen(s_filename,"r"); // read mode
    
    if( o_fp == NULL ) {
        printf("File %s not found or I/O error.\n", s_filename);  
        perror("Error while opening the file.\n");
        exit(EXIT_FAILURE);
    }
    
    printf("Reading file %s :\n", s_filename);
    while( 1 ) {
        fscanf(o_fp, "%li", &l_number);
        // The key is here ;)
        st_values[l_number] = l_number;
        l_counter++;
        if (l_counter > l_max_values -1) break;
    }
    
    printf("First item: %li Last item: %li\n", st_values[0], st_values[l_counter-1]);
 
    fclose(o_fp);

So if we only re-sort into another array to have the values sorted, eliminating the non-existing values (half of the tasks we did in the previous program), it takes on the order of milliseconds.

blog-carlesmateo-com-csort-second-part-fast-0_1seconds

Using csort opt read with a busy computer (Firefox using 16 GB of RAM and many tabs) it takes 108 ms; with an idle computer, 3 ms

Now imagine you don’t want to eliminate the duplicates.

You can create a memory-optimized array where the index is the number and the value represents the number of apparitions in the original file:

$st_example = Array( 0 => 1,
                     3 => 5,
                     4 => 1);

That would mean that the original sorted array, in form of list would be:

{0, 3, 3, 3, 3, 3, 4}

For the case of C, where the arrays have a static size, a value of 0 would indicate there is no presence of that number in the ordered list, so:

st_example[0] = 1;
st_example[1] = 0;
st_example[2] = 0;
st_example[3] = 5;
st_example[4] = 1;

Even in the case of the resulting C array, it will be much smaller than the original 50M Array, as it will have only the size corresponding to the max value. In our sample, 1M.

In PHP you can get an Array like the csort compressed one by using a function called array_count_values(). It will not have the non-existing items = 0, so it would generate neither st_example[1] = 0 nor st_example[2] = 0, but it would generate the others.
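
For example, a tiny sketch of mine:

$st_list = array(0, 3, 3, 3, 3, 3, 4);

// array_count_values() returns value => number of apparitions
$st_compressed = array_count_values($st_list);

print_r($st_compressed); // [0] => 1, [3] => 5, [4] => 1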

So conditions for the algorithm to work blazing fast:

  • The max number (max(m)) should ideally be lower than the number of registers (this is true for languages like C or Java; languages like PHP grow the Array size dynamically and work as an Array Hash, so it's not necessary, as we can have st_sorted[0] and st_sorted[1000000] and the array size will be only 2 items. In this case we can use less memory than the csort algorithm and be faster by using foreach() instead of for() from 0 to 1M)
  • The more the numbers are repeated, the faster the algorithm will work

Key conclusions and Advantages:

  • If you directly load the values from disk assigning the value as the key as well, you'll save one loop per item (n=50M in this universe) and make it faster (having only to parse count(m); in PHP it doesn't have to be 0-m).
  • The fastest way to sort is not sorting at all. You can have the values stored per key; even if a foreach doesn't start with 0,1,2…, simply accessing the value index you need will be super-fast.

Source code in PHP

<?php
/**
 * User: Carles Mateo
 * Date: 2015-04-05 14:19
 */

error_reporting(E_ALL);
// To avoid raising warnings from HHVM
date_default_timezone_set('Europe/Andorra');

// This freezes the server, using more than 35 GB RAM
//define('i_NUM_VALUES', 500000000);
// 50M uses ~ 10 GB RAM
define('i_NUM_VALUES', 50000000);
define('i_MAX_NUMBER', 1000000);

define('i_EXIT_SUCCESS', 0);
define('i_EXIT_ERROR', 1);

abstract class CsortDemo {

    const i_ALGORITHM_QUICKSORT        = 1;
    const i_ALGORITHM_PHP_QSORT        = 4;
    const i_ALGORITHM_PHP_ARRAY_FLIP   = 5;
    const i_ALGORITHM_PHP_ARRAY_UNIQUE = 6;
    const i_ALGORITHM_CSORT_BASIC      = 7;
    const i_ALGORITHM_CSORT_OPT1       = 8;


    public static function displayHelp() {
        $s_help = <<<EOT
CSort.php by Carles Mateo
=========================

This demonstrates a sort algorithm faster than quicksort,
as proposed in the article.
Please read the complete article at:
http://blog.carlesmateo.com

Please invoke the program with one of these parameters:

--help or without parameters for this help message
1 To test using a quicksort algorithm written in PHP
4 To test using PHP sort() (is a quicksort algorithm)
5 To test using PHP array_flip()
6 To test using PHP unique() (is a quicksort algorithm)
7 To test using csort (basic example, sorts and deletes duplicates)
8 To test using csort, improvement with indexes assigned when reading
  Use to watch the time saving from basic csort


9 To test using csort, with indexes assigned when reading and counting
  the number of coincidences (compressed output array)

Please, make sure that the file carles_sort.txt is present in the
current directory.

EOT;

        echo $s_help;

    }

    public static function get_random_values($i_algorithm) {

        $st_values = Array();
        $i_value = 0;

        $s_values_filename = 'carles_sort.txt';

        if (file_exists($s_values_filename)) {
            self::print_with_datetime("Reading values...");
            if ($i_algorithm == CsortDemo::i_ALGORITHM_CSORT_OPT1) {
                $o_fp = fopen($s_values_filename, 'r');
                if ($o_fp) {
                    while (($s_line = fgets($o_fp)) !== false) {
                        $i_value = intval($s_line);
                        $st_values[$i_value] = $i_value;
                    }
                }
            } else {
                $st_values = file($s_values_filename);
                self::print_with_datetime("Ensuring numeric values...");
                foreach($st_values as $i_key=>$s_value) {
                    $i_value = intval($s_value);
                    $st_values[$i_key] = $i_value;
                }

            }
        } else {
            self::print_with_datetime("Generating values...");
            $st_values = self::generate_random_values();
            self::print_with_datetime("Writing values to disk...");
            self::save_values_to_disk($s_values_filename, $st_values);
            self::print_with_datetime("Finished writing values to disk");
        }

        return $st_values;
    }

    public static function save_values_to_disk($s_values_filename, $st_values) {
        $i_counter = 0;
        $f_advance_percent = 0;
        $i_max_values = 0;

        $i_max_values = count($st_values);

        $o_file = fopen($s_values_filename, 'w');
        foreach ($st_values as $i_key => $s_value) {
            $i_counter++;
            fwrite($o_file, $s_value.PHP_EOL);
            if (($i_counter % 100000) == 0) {
                // every 100K numbers we show a status message
                $f_advance_percent = ($i_counter / $i_max_values)*100;
                self::print_with_datetime("Item num: $i_counter ($f_advance_percent%) value: $s_value written to disk");
            }
        }
        fclose($o_file);
    }

    public static function generate_random_values() {

        $i_num_values = i_NUM_VALUES;
        $i_max_number = i_MAX_NUMBER;

        $st_values = array();

        for($i_counter = 0; $i_counter<$i_num_values; $i_counter++) {
            $i_random = rand(0, $i_max_number);
            $st_values[] = $i_random;
        }

        return $st_values;

    }

    public static function get_datetime() {
        return date('Y-m-d H:i:s');
    }

    public static function print_with_datetime($s_string) {
        echo self::get_datetime().' '.$s_string."\n";
    }

    public static function print_algorithm_type($s_algorithm) {
        echo "Testing $s_algorithm\n";
    }

    public static function print_results($st_values, $i_initial_pos_to_display, $i_number_of_results_to_display) {

        self::print_with_datetime("Showing results from $i_initial_pos_to_display, $i_number_of_results_to_display items :");

        $i_counter = 0;
        $i_counter_found = 0;
        for($i_counter=0; ($i_counter_found<$i_number_of_results_to_display && (($i_counter + $i_initial_pos_to_display) < i_MAX_NUMBER)); $i_counter++) {
            $i_pos = $i_initial_pos_to_display + $i_counter;
            if (isset($st_values[$i_pos])) {
                echo '['.$i_pos.'] '.$st_values[$i_pos]."\n";
                $i_counter_found++;
            }
        }

    }

    // Found on the web
    public static function quicksort( $array ) {
        if( count( $array ) < 2 ) {
            return $array;
        }
        $left = $right = array( );
        reset( $array );
        $pivot_key  = key( $array );
        $pivot  = array_shift( $array );
        foreach( $array as $k => $v ) {
            if( $v < $pivot )
                $left[$k] = $v;
            else
                $right[$k] = $v;
        }
        return array_merge(self::quicksort($left), array($pivot_key => $pivot), self::quicksort($right));
    }

    public static function csort_basic($st_test_array) {
        // Note: PHP passes the array with copy-on-write semantics, so even without an explicit
        // reference no extra memory is used unless we write to the original array


        $s_algorithm = 'Carles_sort_basic';
        echo "Testing $s_algorithm\n";

        $st_result_array = Array();
        $st_result_array2 = Array();

        $i_min_value_pos = null;
        $i_max_value     = null;
        $i_max_value_pos = null;

        $i_temp_value = null;
        $i_loop_counter = 0;

        $i_num_items = count($st_test_array);

        $i_initial_pos = 0;
        $i_ending_pos  = $i_num_items;

        for($i_pos = $i_initial_pos; $i_pos < $i_ending_pos; $i_pos++) {
            $i_loop_counter++;

            $i_value = $st_test_array[$i_pos];
            $st_result_array[$i_value] = $i_value;

            // If you want to debug a bit
            //var_export($st_result_array); sleep(3);

            // If you want to debug...
            //if ($i_loop_counter % 1000 == 0) {
                //echo "Loop: $i_loop_counter i_pos: $i_pos Value: $i_value\n";
                //print_results($st_test_array, 0, 10);
                //print_results($st_test_array, $i_num_items - 2, 2);
            //}

        }

        // Return in the way of a foreach sorted Array
        $i_max_value = max($st_result_array);

        for($i_pos = 0; $i_pos <= $i_max_value; $i_pos++) {
            if (isset($st_result_array[$i_pos])) {
                $st_result_array2[] = $i_pos;
            }
        }

        return $st_result_array2;

    }

}

    // Checking parameters
    if (count($argv) < 2) {
        // Display help and exit
        CsortDemo::displayHelp();
        exit(i_EXIT_SUCCESS);
    }

    $i_algorithm = intval($argv[1]);

    CsortDemo::print_with_datetime('Starting program...');

    $st_test_array = CsortDemo::get_random_values($i_algorithm);

    $i_num_items = count($st_test_array);
    CsortDemo::print_with_datetime("Starting sorting with $i_num_items items, i_MAX_NUMBER was ".i_MAX_NUMBER);

    $i_loop_counter = 0;

    $i_initial_pos = 0;
    $i_ending_pos = $i_num_items - 1;

    $i_min_value     = i_MAX_NUMBER * 2; // A high number for the tests
    $i_min_value_pos = null;
    $i_max_value     = null;
    $i_max_value_pos = null;

    $i_temp_value = null;

    $b_changes_done = false;
    $b_exit = false;


    // TODO Test array_count_values
    $f_microtime_ini = microtime(true);

    if ($i_algorithm == CsortDemo::i_ALGORITHM_QUICKSORT) {
        $s_algorithm = 'Quicksort written in PHP';
        CsortDemo::print_algorithm_type($s_algorithm);
        CsortDemo::quicksort($st_test_array);
    }

    if ($i_algorithm == CsortDemo::i_ALGORITHM_CSORT_BASIC) {
        $st_test_array = CsortDemo::csort_basic($st_test_array);
    }

    if ($i_algorithm == CsortDemo::i_ALGORITHM_PHP_QSORT) {
        CsortDemo::print_algorithm_type('PHP sort() method');
        sort($st_test_array);
    }

    if ($i_algorithm == CsortDemo::i_ALGORITHM_PHP_ARRAY_FLIP) {
        CsortDemo::print_algorithm_type('PHP array_flip() method');
        $st_test_array = array_flip($st_test_array);
    }

    if ($i_algorithm == CsortDemo::i_ALGORITHM_PHP_ARRAY_UNIQUE) {
        CsortDemo::print_algorithm_type('PHP array_unique() method');
        $st_test_array = array_unique($st_test_array);
    }

    $f_microtime_end = microtime(true);

    $f_microtime_resulting = $f_microtime_end - $f_microtime_ini;

    echo "Microtime elapsed: $f_microtime_resulting\n";

    CsortDemo::print_with_datetime('Results:');
    //print_r($st_test_array);
    $i_total_items = count($st_test_array);
    echo 'Total items:'.$i_total_items."\n";
    CsortDemo::print_results($st_test_array, 0, 10);

    $i_last_number  = array_pop($st_test_array);
    $i_last_number1 = array_pop($st_test_array);
    CsortDemo::print_with_datetime('Last values:');
    echo $i_last_number1."\n";
    echo $i_last_number."\n";

    CsortDemo::print_with_datetime('Finished program');
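
If you just want to sanity-check csort_basic on a tiny array, without generating the 50 million value file, a few lines are enough. This is a minimal sketch; it assumes the lines are pasted right after the CsortDemo class definition, before the test code that reads carles_sort.txt:

    // Minimal sketch: sanity-check csort_basic with a small array containing duplicates
    $st_small  = array(5, 3, 9, 3, 0, 9, 1);
    $st_sorted = CsortDemo::csort_basic($st_small);
    print_r($st_sorted);   // The values 0, 1, 3, 5, 9: sorted, with duplicates removed
    exit(i_EXIT_SUCCESS);  // stop here so the full test below does not run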

PHP under HHVM

Additionally, I executed csort.php in Facebook's HipHop Virtual Machine (HHVM), version 3.4.0-dev, and as expected everything is much faster than with standard PHP.

For example:

  • Using PHP sort() 88-104 seconds, using HHVM PHP sort() 15.8 seconds
  • Using csort basic from PHP 34 to 40 seconds, using csort basic from HHVM 3.4 to 5.13 seconds
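
If you want each run to state which engine produced the timing, the engine can be detected from the PHP code itself. This is a minimal sketch (HHVM defines the constant HHVM_VERSION, standard PHP does not):

<?php
// Minimal sketch: print which engine is running, so PHP and HHVM timings are easy to tell apart
if (defined('HHVM_VERSION')) {
    echo "Running under HHVM ".HHVM_VERSION."\n";
} else {
    echo "Running under PHP ".PHP_VERSION."\n";
}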

The version of PHP used for these tests was:

PHP 5.5.9-1ubuntu4.9 (cli) (built: Apr 17 2015 11:44:57)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies

The version of hhvm used for the test was:

HipHop VM 3.4.0-dev (rel)
Compiler: heads/master-0-g96dec459bb84f606df2fe15c8c82df684b20ed83
Repo schema: 0100f1aaeeacd7b9d9a369b34ae5160acb0d1163
Extension API: 20140829

I must also mention that some operations, like initializing the arrays, are super-fast with HHVM.
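
As a rough illustration of the kind of initialization I mean, this minimal sketch can be run with both interpreters and the times compared (the 10 million item count is an assumption, adjust it to your memory_limit):

<?php
// Minimal sketch: time a large array initialization; run it with php and with hhvm and compare
$f_time_ini = microtime(true);
$st_zeros   = array_fill(0, 10000000, 0);   // 10M items, adjust to the memory available
$f_time_end = microtime(true);
echo "Initializing ".count($st_zeros)." items took ".($f_time_end - $f_time_ini)." seconds\n";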

Screenshot: HHVM executing csort.php in around 5 seconds

The solution in Java

The code has been executed with Oracle’s JDK 1.7.

I love Just In Time compilers. They are amazingly fast.

For this case, these are the results with Java, for the universe of 50M items with values in the range 0-1M:

  • quicksort in Java: 3555 ms = 3.555 seconds versus ~10 seconds in C
    Important: In C I had to disable the -O2 optimization because of what looks like a bug in gcc (only detected in csort with compression). With -O2 the times were around 4.7 seconds
  • csort.java basic: 26 ms to 120 ms = 0.026 seconds versus 0.5 seconds for csort in C
  • csort.java opt read: 2-3 ms = 0.002-0.003 seconds versus 3 ms = 0.003 seconds in C
  • csort.java with compression and duplicates support: 160 ms = 0.16 seconds versus 0.272-0.277 seconds in C (see the sketch after this list)
  • The execution in those languages (Java and C) is so fast that it is hard to measure accurately, as any background process in Ubuntu will slightly increase the measured time.
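
The "compression" mentioned above means storing, for each distinct value, how many times it appears, instead of repeating the value itself. In PHP the same idea can be expressed with array_count_values(), which is also the TODO left in the PHP listing. A minimal sketch:

<?php
// Minimal sketch of the compressed output idea: one slot per distinct value, holding its count
$st_values     = array(5, 3, 9, 3, 0, 9, 9);
$st_compressed = array_count_values($st_values);
ksort($st_compressed);   // order the counts by value
print_r($st_compressed); // [0] => 1, [3] => 2, [5] => 1, [9] => 3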

Screenshot: Quicksort in Java, 3555 ms = 3.555 seconds

Screenshot: Csort basic with Java, 26 milliseconds

Screenshot: Csort read opt with Java, 5 milliseconds

The code in Java:

/*
 * This code, except quicksort implementation, is written by Carles Mateo, 
 * and published as Open Source.
 */
package csort;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

/**
 *
 * @author carles
 */
public class Csort {

    public static final int i_MAX_REGISTERS = 50000000;
    public static final int i_MAX_VALUE     = 1000000;
    public static final String s_FILE = "/home/carles/Desktop/codi/php/tests/carles_sort.txt";
    
    public static final int i_ERROR_SUCCESS = 0;
    public static final int i_ERROR_FAILURE = 1;
    
    public static final int i_ALGORITHM_QUICKSORT        = 1;
    public static final int i_ALGORITHM_CSORT_BASIC      = 7;
    public static final int i_ALGORITHM_CSORT_OPT1       = 8;
    public static final int i_ALGORITHM_CSORT_COMP       = 9;

    // Original values read
    public int[] st_values; 
    // Our Csorted array
    public int[] st_values_sorted;
    // To re-sort into a list-like array
    public int[] st_values_sorted_renum;
    
    // For quicksort
    private int[] numbers;
    private int number;    
    
    public void systemExit(String s_reason, int i_error_code) {
        printWithDatetime("Exiting to system, cause: " + s_reason);
        System.exit(i_error_code);
    }
    
    public static String getDateTime() {
        DateFormat o_dateformat = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss");
        Date o_date = new Date();
        
        return o_dateformat.format(o_date);
    }
    
    public static void printWithDatetime(String s_message) {
        System.out.println(getDateTime() + " " + s_message);
    }
    
    public void printAlgorith(String s_algorithm) {
        Csort.printWithDatetime("Using algorithm: " + s_algorithm);
    }
    
    public static void showHelp() {
        
        String s_help = "" + 
            "CSort.java by Carles Mateo\n" +
            "==========================\n" +
            "\n" +
            "This demonstrates a sort algorithm faster than quicksort,\n" +
            "as proposed in the article.\n" +
            "Please read the complete article at:\n" +
            "http://blog.carlesmateo.com\n" +
            "\n" +
            "Please invoke the program with one of these parameters:\n" +
            "\n" +
            "--help or without parameters for this help message\n" +
            "1 To test using quicksort\n" +
            "  Only sorts, does not deletes duplicates\n" +
            "7 To test using csort (basic example, sorts and deletes duplicates)\n" +
            "8 To test using csort, improvement with indexes assigned when reading\n" +
            "  Use to watch the time saving from basic csort\n" +
            "\n" +
            "Please, make sure that the file carles_sort.txt is present in the\n" +
            "current directory.\n" +
            "\n";
        
        System.out.println(s_help);
        
    }
    
    public int readFile(int i_algorithm) {
        BufferedReader br = null;

        int i_value = 0;
        int i_counter = 0;        
        
        try {

                String sCurrentLine;

                br = new BufferedReader(new FileReader(Csort.s_FILE));

                while ((sCurrentLine = br.readLine()) != null) {
                    i_value = Integer.parseInt(sCurrentLine);
                    if (i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1) {
                        this.st_values[i_value] = i_value;
                    } else {
                        this.st_values[i_counter] = i_value;
                    }
                    
                    if (i_counter % 500000 == 0) {
                        float f_percent = ((float) i_counter / Csort.i_MAX_REGISTERS) * 100;
                        String s_percent_msg = "Read register " + i_counter + " value " + i_value + " " + f_percent + "% read";
                        System.out.println(s_percent_msg);
                    }
                    i_counter++;
                }

        } catch (IOException e) {
                e.printStackTrace();
        } finally {
                try {
                    if (br != null) br.close();
                } catch (IOException ex) {
                    ex.printStackTrace();
                    this.systemExit("Error processing the file: " + Csort.s_FILE, Csort.i_ERROR_FAILURE);
                }
        }
        
        return i_counter;
    }
    
    // http://www.vogella.com/tutorials/JavaAlgorithmsQuicksort/article.html
    public void qsort(int[] values) {
        // check for empty or null array
        if (values == null || values.length == 0){
            return;
        }
        this.numbers = values;
        number = values.length;
        quicksort(0, number - 1);
    }    
    
    private void quicksort(int low, int high) {
        int i = low, j = high;
        // Get the pivot element from the middle of the list
        int pivot = numbers[low + (high-low)/2];

        // Divide into two lists
        while (i <= j) {
          // If the current value from the left list is smaller than the pivot
          // element then get the next element from the left list
          while (numbers[i] < pivot) {
            i++;
          }
          // If the current value from the right list is larger than the pivot
          // element then get the next element from the right list
          while (numbers[j] > pivot) {
            j--;
          }

          // If we have found a value in the left list which is larger than
          // the pivot element and if we have found a value in the right list
          // which is smaller than the pivot element then we exchange the
          // values.
          // As we are done we can increase i and j
          if (i <= j) {
            exchange(i, j);
            i++;
            j--;
          }
        }
        // Recursion
        if (low < j)
          quicksort(low, j);
        if (i < high)
          quicksort(i, high);
    }

    private void exchange(int i, int j) {
        int temp = numbers[i];
        numbers[i] = numbers[j];
        numbers[j] = temp;
    }
    // End of http://www.vogella.com/tutorials/JavaAlgorithmsQuicksort/article.html
  
    public void csortBasic() {
        int i_counter = 0;
        int i_value = 0;
        int i_max_range = 0;
        int i_counter_hit = 0;
        
        try {
            for (i_counter=0; i_counter < i_MAX_REGISTERS; i_counter++) {
                i_value = this.st_values[i_counter];
                this.st_values_sorted[i_value] = i_value;
            }

            // Re-sort into a foreach/list-style array
            i_max_range = i_MAX_VALUE + 1;   // values go from 0 to i_MAX_VALUE inclusive
            for (i_counter=0; i_counter < i_max_range; i_counter++) {
                if (this.st_values_sorted[i_counter] > -1 ) {
                    this.st_values_sorted_renum[i_counter_hit] = this.st_values_sorted[i_counter];
                    i_counter_hit++;
                }
            }
        } catch (Exception e) {
            printWithDatetime("Exception captured at i_counter: " + i_counter);
            this.systemExit("Exception, so ending", i_ERROR_FAILURE);
        }        
        
    }
    
    public void csortOptimized() {
        int i_counter = 0;
        int i_value = 0;
        int i_max_range = 0;
        int i_counter_hit = 0;
        
        try {
            // It was sorted while reading :)

            // Re-sort for foreach/list-style lovers
            i_max_range = i_MAX_VALUE + 1;   // values go from 0 to i_MAX_VALUE inclusive
            for (i_counter=0; i_counter < i_max_range; i_counter++) {
                if (this.st_values[i_counter] > -1 ) {
                    this.st_values_sorted_renum[i_counter_hit] = this.st_values[i_counter];
                    i_counter_hit++;
                }
            }
        } catch (Exception e) {
            printWithDatetime("Exception captured at i_counter: " + i_counter);
            this.systemExit("Exception, so ending", i_ERROR_FAILURE);
        }        
        
    }    
    
    public void csortCompress() {
        int i_counter = 0;
        int i_value = 0;
        
        for (i_counter=0; i_counter < i_MAX_REGISTERS; i_counter++) {
            i_value = this.st_values[i_counter];
            this.st_values_sorted[i_value] = this.st_values_sorted[i_value] + 1;
        }

    }
    
    public void printItems(int i_num_items, int[] st_array) {
        int i_counter = 0;
        int i_value   = 0;
        
        for(i_counter=0;i_counter<i_num_items;i_counter++) {
            i_value = st_array[i_counter];
            System.out.println("Item[" + i_counter + "] value: " + i_value);
        }
    }
        
    public Csort(int i_registers, int i_max_value) {
        
        // int range is -2,147,483,648 to 2,147,483,647
        this.st_values = new int[i_registers];
        // If max value is 10 then we could have values from 0 to 10, so an array of 11 items
        this.st_values_sorted = new int[i_max_value + 1];
        this.st_values_sorted_renum = new int[i_max_value + 1];
        
    }
    
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
                
        long l_milliseconds_ini   = 0;
        long l_milliseconds_end   = 0;
        long l_milliseconds_total = 0;
        
        int i_items_read = 0;
        
        int i_algorithm = 0;
        String s_algorithm = "";
        
        int i_counter = 0;
        
        /* Check parameter consistency */
        if (args.length == 0 || args.length > 1 || (args.length == 1 && Integer.valueOf(args[0]) == 0)) {            
            showHelp();
            System.exit(i_ERROR_SUCCESS);
        }
        
        i_algorithm = Integer.valueOf(args[0]);
        
        Csort o_csort = new Csort(Csort.i_MAX_REGISTERS, Csort.i_MAX_VALUE);
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_BASIC || i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1 || i_algorithm == Csort.i_ALGORITHM_QUICKSORT) {
            // We init empty positions to -1 to differentiate empty items from the value 0 (which is valid)
            Csort.printWithDatetime("Initializing values to -1...");
            for(i_counter=0; i_counter <= Csort.i_MAX_VALUE; i_counter++) {
                o_csort.st_values_sorted[i_counter] = -1;
                o_csort.st_values_sorted_renum[i_counter] = -1;
            }            
        } else if (i_algorithm == Csort.i_ALGORITHM_CSORT_COMP) {
            // For the compression algorithm every position is a counter of occurrences, so we init them to 0
            Csort.printWithDatetime("Initializing values to 0...");
            for(i_counter=0; i_counter <= Csort.i_MAX_VALUE; i_counter++) {
                o_csort.st_values_sorted[i_counter] = 0;
            }            
        } else {
            showHelp();
            System.exit(i_ERROR_SUCCESS);            
        }        
        
        Csort.printWithDatetime("Starting csort.java...");
        
        Csort.printWithDatetime("Reading file...");
        i_items_read = o_csort.readFile(i_algorithm);
        Csort.printWithDatetime("Registers read:" + i_items_read);
        
        l_milliseconds_ini = System.currentTimeMillis();
        
        if (i_algorithm == Csort.i_ALGORITHM_QUICKSORT) {            
            s_algorithm = "Quicksort";
            o_csort.printAlgorith(s_algorithm);
            
            o_csort.qsort(o_csort.st_values);
        }
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_BASIC) {
            s_algorithm = "Csort basic";
            o_csort.printAlgorith(s_algorithm);
            
            o_csort.csortBasic();
        }
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1) {
            s_algorithm = "Csort Read optimized";
            o_csort.printAlgorith(s_algorithm);
            
            o_csort.csortOptimized();            
        }
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_COMP) {
            s_algorithm = "Csort compress with duplicates";
            o_csort.printAlgorith(s_algorithm);
            
            o_csort.csortCompress();
        }        
                
        l_milliseconds_end = System.currentTimeMillis();
        
        l_milliseconds_total = l_milliseconds_end - l_milliseconds_ini;
        
        Csort.printWithDatetime("Algorithm execution time: " + l_milliseconds_total + " milliseconds");
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_BASIC || i_algorithm == Csort.i_ALGORITHM_CSORT_OPT1) {
            o_csort.printItems(10, o_csort.st_values_sorted_renum);
        }
        
        if (i_algorithm == Csort.i_ALGORITHM_QUICKSORT) {
            o_csort.printItems(10, o_csort.numbers);
        }
        
        if (i_algorithm == Csort.i_ALGORITHM_CSORT_COMP) {
            o_csort.printItems(10, o_csort.st_values_sorted);
        }
                        
    }
    
}

 

Note: This article is growing. I'll keep expanding it with more samples, detailed timings from the different universes I've tested, and screenshots, especially in the PHP section, to add more tricks and explanations, but I wanted to publish the base so I can discuss it with some colleagues.