Updated on 2017-04-04 12:58 Barcelona Time 1491303515:
- A method writeValuesFromArrayListToDisk(String sFilename) has been introduced as per a request, to easily check that the data is properly sorted.
- A silly bug in the final ArrayList generation has been solved. It was storing iCounter that was always 1 as this is not the compressed version, for supporting repeated numbers, of the algorithm. I introduced this method for the article, as it is not necessary for the algorithm as it is already sorted, and unfortunately I didn’t do a final test on the output. My fault.
- Some JavaDoc has been updated
Past Friday I was discussing with my best friend about algorithms and he told me that hadoop is not fast enough, and about when I was in Amazon and as part of the test they asked me to defined an S3 system from the scratch, and I did using Java and multiple streams per file and per node (replication factor) and they told me that what I just created was the exact way their system works, and we ended talking about my sorting algorithm CSort, and he asked me if it could run in MultiThread. Yes, it is one of the advantages in front of QuickSort. Also it can run in multinode, different computers. So he asked me how much faster it would be a MultiThread version of CSort, versus a regular QuickSort.
Well here is the answer with 500 Million of registers, with values from 1 to 1000000, and deduplicating.
2017-03-26 18:50:41 CSort time in seconds:0.189129089
2017-03-26 18:51:47 QuickSort cost in seconds:61.853190885
That’s Csort is 327 times faster than QuickSort!. In this example and with my busy 4 cores laptop. In my 8 cores computer it is more than 525 times faster. Imagine in a Intel Xeon Server with a 64 cores!.
How is it possible? The answer is easy, it has O(n) complexity. I use linear access.
This depends on your universe of Data. Please read my original posting about CSort here, that explains it on detail.
Please note that CSort with compression is also available for keeping duplicated values and also saving memory (space) and time, with equally totally astonishing results.
Please note that in this sample I first load the values to an Array, and then I work from this. This is just to avoid bias by discarding the time of loading the data from disk, but, in the other article you have samples where CSort sorts at the same time that loads the data from disks. I have a much more advanced algorithm that self allocates the memory needed for handling an enormous universe of numbers (big numbers and small with no memory penalty), but I’m looking forward to discuss this with a tech giant when it hires me. ;) Yes, I’m looking for a job.
In my original article I demonstrated it in C and PHP, this time here is the code in Java. It uses the MT Notation.
You can download the file I used for the tests from here:
- 50 Million results, with values from 0 to 1M: carles_sort-50M.txt.zip [156 MB] (344 MB uncompressed)
Obviously it runs much more faster than hashing. I should note that hashing and CSorting with .containsKey() is faster than QuickSorting. (another day I will talk about sorting Strings faster) ;)
/*
* (c) Carles Mateo blog.carlesmateo.com
* Proof of concept of CSort, with multithread, versus QuickSort
* For the variables notation MT Notation is used.
*/
package com.carlesmateo.blog;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
/**
* @author carles mateo
*/
public class CSortMultiThread extends Thread {
// Download the original file from the blog
public static final int piNUM_REGISTERS_IN_FILE = 50000000;
// Max value found, to optimize the memory used for CSort
// Note that CSort can be implemented in the read from file mechanism
// in order to save memory (space)
public static int piMaxValue = 0;
// The array containing the numbers read from disk
public static int[] paiNumbers;
// The array used by CSort (if not using direct loading from disk)
public static int[] paiNumbersCsorted;
// Final ArrayList Sorted. CSort and QuickSort finally fullfil this
public static ArrayList<Integer> pliNumbers = new ArrayList<>();
// For the Threads
private Thread oT;
private String sThreadName;
private int piStart;
private int piEnd;
private boolean bFinished = false;
CSortMultiThread (String name, int iStart, int iEnd) {
sThreadName = name;
piStart = iStart;
piEnd = iEnd;
writeWithDateTime("Creating " + sThreadName );
}
public void run() {
writeWithDateTime("Running " + sThreadName + " to sort from " + piStart + " to " + piEnd);
int iCounter;
int iNumber;
for (iCounter=piStart; iCounter < piEnd; iCounter++) {
iNumber = paiNumbers[iCounter];
paiNumbersCsorted[iNumber] = 1;
}
System.out.println("Thread " + sThreadName + " exiting.");
bFinished = true;
}
public void start () {
writeWithDateTime("Starting " + sThreadName );
if (oT == null) {
oT = new Thread (this, sThreadName);
oT.start ();
}
}
/**
* Write values to Disk to demonstrate that are sorted ;)
* @param sFilenameData
*/
private static void writeValuesFromArrayListToDisk(String sFilenameData) {
ObjectOutputStream out = null;
int iCounter;
try {
out = new ObjectOutputStream(new FileOutputStream(sFilenameData));
for (iCounter=0; iCounter<pliNumbers.size(); iCounter++) {
out.writeChars(pliNumbers.get(iCounter).toString() + "\n");
}
// To store the object instead
//out.writeObject(pliNumbers);
} catch (IOException e) {
System.out.println("I/O Error!");
e.printStackTrace();
displayHelpAndQuit(10);
} finally {
if (out != null) {
try {
out.close();
} catch (IOException e) {
System.out.println("I/O Error!");
e.printStackTrace();
displayHelpAndQuit(10);
}
}
}
}
/**
* Reads the data from the disk. The file has 50M and we will be duplicating
* to get 500M registers.
* @param sFilenameData
*/
private static void readValuesFromFileToArray(String sFilenameData) {
BufferedReader oBR = null;
String sLine;
int iCounter = 0;
int iRepeat;
int iNumber;
// We will be using 500.000.000 items, so dimensionate the array
paiNumbers = new int[piNUM_REGISTERS_IN_FILE * 10];
try {
oBR = new BufferedReader(new FileReader(sFilenameData));
while ((sLine = oBR.readLine()) != null) {
for (iRepeat = 0; iRepeat < 10; iRepeat++) {
int iPointer = (piNUM_REGISTERS_IN_FILE * iRepeat) + iCounter;
iNumber = Integer.parseInt(sLine);
paiNumbers[iPointer] = iNumber;
if (iNumber > piMaxValue) {
piMaxValue = iNumber;
}
}
iCounter++;
}
if (iCounter < piNUM_REGISTERS_IN_FILE) {
write("Warning... only " + iCounter + " values were read");
}
} catch (FileNotFoundException e) {
System.out.println("File not found! " + sFilenameData);
displayHelpAndQuit(100);
} catch (IOException e) {
System.out.println("I/O Error!");
e.printStackTrace();
displayHelpAndQuit(10);
} finally {
if (oBR != null) {
try {
oBR.close();
} catch (IOException e) {
System.out.println("I/O Error!");
e.printStackTrace();
displayHelpAndQuit(10);
}
}
}
}
private static String displayHelp() {
String sHelp = "Help\n" +
"====\n" +
"Csort from Carles Mateo blog.carlesmateo.com\n" +
"\n" +
"Proof of concept of the fast load algorithm\n" +
"\n";
return sHelp;
}
/**
* Displays Help Message and Quits with and Error Level
* Errors:
* - 1 - Wrong number of parameters
* - 10 - I/O Error
* - 100 - File not found
* @param iErrorLevel
*/
private static void displayHelpAndQuit(int iErrorLevel) {
System.out.println(displayHelp());
System.exit(iErrorLevel);
}
// This is QuickSort from vogella http://www.vogella.com/tutorials/JavaAlgorithmsQuicksort/article.html
public static void sort() {
int piNumber = paiNumbers.length;
quicksort(0, piNumber - 1);
}
private static void quicksort(int low, int high) {
int i = low, j = high;
// Get the pivot element from the middle of the list
int pivot = paiNumbers[low + (high-low)/2];
// Divide into two lists
while (i <= j) {
// If the current value from the left list is smaller then the pivot
// element then get the next element from the left list
while (paiNumbers[i] < pivot) {
i++;
}
// If the current value from the right list is larger then the pivot
// element then get the next element from the right list
while (paiNumbers[j] > pivot) {
j--;
}
// If we have found a values in the left list which is larger then
// the pivot element and if we have found a value in the right list
// which is smaller then the pivot element then we exchange the
// values.
// As we are done we can increase i and j
if (i <= j) {
exchange(i, j);
i++;
j--;
}
}
// Recursion
if (low < j)
quicksort(low, j);
if (i < high)
quicksort(i, high);
}
private static void exchange(int i, int j) {
int temp = paiNumbers[i];
paiNumbers[i] = paiNumbers[j];
paiNumbers[j] = temp;
}
/**
* We want to remove duplicated values
*/
private static void removeDuplicatesFromQuicksort() {
int iCounter;
int iOldValue=-1;
int iNewValue;
for (iCounter=0; iCounter<paiNumbers.length; iCounter++) {
iNewValue = paiNumbers[iCounter];
if (iNewValue != iOldValue) {
iOldValue = iNewValue;
pliNumbers.add(iNewValue);
}
}
}
// End of vogella QuickSort code
/**
* Generate the final Array
*/
private static void copyFromCSortToArrayList() {
int iCounter;
int iNewValue;
for (iCounter=0; iCounter<=piMaxValue; iCounter++) {
iNewValue = paiNumbersCsorted[iCounter];
if (iNewValue > 0) {
pliNumbers.add(iCounter);
}
}
}
/**
* Write with the date
* @param sText
*/
private static void writeWithDateTime(String sText) {
DateFormat oDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date oDate = new Date();
String sDate = oDateFormat.format(oDate);
write(sDate + " " + sText);
}
/**
* Write with \n
* @param sText
*/
private static void write(String sText) {
System.out.println(sText + "\n");
}
public static void main(String args[]) throws InterruptedException {
// For Profiling
long lStartTime;
double dSeconds;
long lElapsedTime;
int iThreads = 8;
int iRegistersPerThread = piNUM_REGISTERS_IN_FILE / iThreads;
CSortMultiThread[] aoCSortThreads = new CSortMultiThread[iThreads];
writeWithDateTime("CSort MultiThread proof of concept by Carles Mateo");
writeWithDateTime("Reading values from Disk...");
readValuesFromFileToArray("/home/carles/Desktop/codi/java/carles_sort.txt");
writeWithDateTime("Readed");
writeWithDateTime("Total values to sort de deduplicate " + paiNumbers.length);
writeWithDateTime("The max value between the Data is " + piMaxValue);
paiNumbersCsorted = new int[piMaxValue + 1];
writeWithDateTime("Performing CSort with removal of duplicates");
lStartTime = System.nanoTime();
for (int iThread=0; iThread < iThreads; iThread++) {
int iStart = iThread * iRegistersPerThread;
int iEnd = ((iThread + 1) * iRegistersPerThread) - 1;
if (iThread == (iThreads -1)) {
// Last thread grabs the remaining.
// For instance 100/8 = 12 so each Thread orders 12 registers,
// but last thread orders has 12 + 4 = 16
iEnd = piNUM_REGISTERS_IN_FILE -1 ;
}
CSortMultiThread oThread = new CSortMultiThread("Thread-" + iThread, iStart, iEnd);
oThread.start();
aoCSortThreads[iThread] = oThread;
}
boolean bExit = false;
while (bExit == false) {
bExit = true;
for (int iThread=0; iThread < iThreads; iThread++) {
if (aoCSortThreads[iThread].bFinished == false) {
bExit = false;
// Note: 10 milliseconds. This takes some CPU cycles, but we need
// to ensure that all the threads did finish.
sleep(10);
continue;
}
}
}
writeWithDateTime("Main loop ended");
writeWithDateTime("Copy to the ArrayList");
copyFromCSortToArrayList();
writeWithDateTime("The final array contains " + pliNumbers.size());
lElapsedTime = System.nanoTime() - lStartTime;
dSeconds = (double)lElapsedTime / 1000000000.0;
writeWithDateTime("CSort time in seconds:" + dSeconds);
writeWithDateTime("Writing values to Disk...");
writeValuesFromArrayListToDisk("/home/carles/Desktop/codi/java/carles_sort-csorted.txt");
// Reset the ArrayList
pliNumbers = new ArrayList<>();
lStartTime = System.nanoTime();
/** QuickSort begin **/
writeWithDateTime("Sorting with QuickSort");
sort();
writeWithDateTime("Finished QuickSort");
writeWithDateTime("Removing duplicates from QuickSort");
removeDuplicatesFromQuicksort();
writeWithDateTime("The final array contains " + pliNumbers.size());
lElapsedTime = System.nanoTime() - lStartTime;
dSeconds = (double)lElapsedTime / 1000000000.0;
writeWithDateTime("QuickSort cost in seconds:" + dSeconds);
}
}
The complete traces:
run: 2017-03-26 19:28:13 CSort MultiThread proof of concept by Carles Mateo 2017-03-26 19:28:13 Reading values from Disk... 2017-03-26 19:28:39 Readed 2017-03-26 19:28:39 Total values to sort de deduplicate 500000000 2017-03-26 19:28:39 The max value between the Data is 1000000 2017-03-26 19:28:39 Performing CSort with removal of duplicates 2017-03-26 19:28:39 Creating Thread-0 2017-03-26 19:28:39 Starting Thread-0 2017-03-26 19:28:39 Creating Thread-1 2017-03-26 19:28:39 Starting Thread-1 2017-03-26 19:28:39 Running Thread-0 to sort from 0 to 6249999 2017-03-26 19:28:39 Creating Thread-2 2017-03-26 19:28:39 Starting Thread-2 2017-03-26 19:28:39 Running Thread-1 to sort from 6250000 to 12499999 2017-03-26 19:28:39 Creating Thread-3 2017-03-26 19:28:39 Running Thread-2 to sort from 12500000 to 18749999 2017-03-26 19:28:39 Starting Thread-3 2017-03-26 19:28:39 Creating Thread-4 2017-03-26 19:28:39 Starting Thread-4 2017-03-26 19:28:39 Running Thread-3 to sort from 18750000 to 24999999 2017-03-26 19:28:39 Creating Thread-5 2017-03-26 19:28:39 Running Thread-4 to sort from 25000000 to 31249999 2017-03-26 19:28:39 Starting Thread-5 2017-03-26 19:28:39 Creating Thread-6 2017-03-26 19:28:39 Starting Thread-6 2017-03-26 19:28:39 Running Thread-5 to sort from 31250000 to 37499999 2017-03-26 19:28:39 Creating Thread-7 2017-03-26 19:28:39 Starting Thread-7 2017-03-26 19:28:39 Running Thread-6 to sort from 37500000 to 43749999 2017-03-26 19:28:39 Running Thread-7 to sort from 43750000 to 49999999 Thread Thread-0 exiting. Thread Thread-2 exiting. Thread Thread-1 exiting. Thread Thread-6 exiting. Thread Thread-7 exiting. Thread Thread-5 exiting. Thread Thread-4 exiting. Thread Thread-3 exiting. 2017-03-26 19:28:39 Main loop ended 2017-03-26 19:28:39 Copy to the ArrayList 2017-03-26 19:28:39 The final array contains 1000001 2017-03-26 19:28:39 CSort time in seconds:0.189129089 2017-03-26 19:28:39 Sorting with QuickSort 2017-03-26 19:29:40 Finished QuickSort 2017-03-26 19:29:40 Removing duplicates from QuickSort 2017-03-26 19:29:41 The final array contains 1000001 2017-03-26 19:29:41 QuickSort cost in seconds:61.853190885 BUILD SUCCESSFUL (total time: 1 minute 28 seconds)