
Install Windows Subsystem for Linux, WSL 2 on Windows 10 64 bit, with Ubuntu, solution to error WslRegisterDistribution failed with error: 0x80070057

You know I love Linux. I was compiling my own kernels back in 1995, when it took more than 24 hours on a 386, and I was working at the first ISPs in Barcelona, managing the Linux systems.

For my own computers I prefer Linux, no doubt about it, but many of the multinationals I have worked for offer only Windows on their laptops and desktops.

For years I had to send files to Linux or Unix machines (HP-UX, Sun Solaris…) to process them and get the results back. It was a sort of ETL and Map-Reduce in the prehistory of personal computers, taking into account aspects like network speed, available space, and splitting files for processing.

When I was working as Senior Project Manager at Winterthur Insurance (now AXA), I had to run a lot of ETL (Extract, Transform, Load) jobs on considerably big files, and the same happened when I was project manager and later head of department at Volkswagen gedas, and later when helping startups like Privalia. I can tell you that Windows didn't like opening a 1GB text or CSV file in an editor, and it still doesn't, even if your computer has 16GB of memory. And even when it works, Bash scripts with pipes, grep, awk… are so simple and powerful that it is very convenient to process those files using Linux.

And honestly, it is a pain to send files back and forth to a UNIX system just for data crunching. A VM will be slow and use memory, and you have to enable some sort of sharing so it can access the data. Not to mention when you need to split the data files into blocks to be processed in parallel by several computers.
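To make the idea of splitting a data file into blocks concrete, here is a minimal Python sketch. The function name split_into_blocks and the block size are my own choices for illustration; it only shows the splitting, not the distribution to other machines.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_into_blocks(lines: Iterable[str], block_size: int) -> Iterator[List[str]]:
    """Yield consecutive blocks of at most block_size lines.

    The input is consumed lazily, so a very large file is never
    loaded into memory in one piece.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    iterator = iter(lines)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block
```

An open file object is an iterable of lines, so you could pass it directly, for example split_into_blocks(open_file, 100000), and hand each block to a different worker.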

There are many solutions, like using Virtual Machines, Docker, external Servers, etc…

WSL allows you to run Linux command line tools inside Windows.

https://wiki.ubuntu.com/WSL

Having WSL makes things much more straightforward, because you can process the files directly on your local Windows hard drives.

Please note: Git Bash may be enough for your needs.

Error installing: WslRegisterDistribution failed with error: 0x80070057

When I installed it I ran into this error and looked for an answer online. I found no solutions and many people suffering from the same problem, so I decided to publish an article on how to make it work.

The instructions I detail are based on the documentation from Canonical https://wiki.ubuntu.com/WSL and from Microsoft https://docs.microsoft.com/en-us/windows/wsl/install-win10 and the final solution I found.

Microsoft uses PowerShell to activate the features that are disabled in Windows. I did the same from the Command Line, which I find more convenient for people who are not extremely technical.

You will need:

  • For x64 systems: Version 1903 or higher, with Build 18362 or higher.
  • For ARM64 systems: Version 2004 or higher, with Build 19041 or higher.

You can check your version of Windows by opening a Terminal (CMD.exe) and typing:

winver

I’m not covering installing WSL for ARM, only for Intel/AMD Desktop/Laptops with Windows 10.

If you’re unsure which architecture you have, you can open a Terminal (CMD.exe) and run:

systeminfo | find "System Type"

If it is an x64 system it will return: x64-based PC
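If you prefer to check the requirement with a script, the comparison can be sketched in a few lines of Python. The function names are hypothetical, and the minimum builds are just the ones listed in the requirements of this article.

```python
# Minimum Windows 10 builds for WSL 2, taken from the requirements in this article.
MIN_BUILD_FOR_WSL2 = {"x64": 18362, "ARM64": 19041}


def build_from_version(version: str) -> int:
    """Extract the build number from a version string such as "10.0.19041"."""
    return int(version.split(".")[2])


def supports_wsl2(architecture: str, build: int) -> bool:
    """Return True when the build meets the minimum for the given architecture."""
    minimum = MIN_BUILD_FOR_WSL2.get(architecture)
    return minimum is not None and build >= minimum
```

I believe Python's platform.version() returns a string of the form "10.0.19041" on Windows 10, but treat that as an assumption and compare it with what winver shows.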

Launch CMD.exe as Administrator and type the following to enable the Windows Subsystem for Linux feature:

dism /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart

Enable the Virtual Machine Platform feature:

dism /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

At this point you have to restart Windows to complete this part of the installation; otherwise the next step won’t work.

Download the Linux Kernel Update Package from here:

https://wslstorestorage.blob.core.windows.net/wslblob/wsl_update_x64.msi

Execute wsl_update_x64.msi and grant permissions for modifying the system.

Now it’s crucial that you reboot again, even if you’re not asked to.

That’s the tricky part.

Then set WSL version 2 as the default:

C:\WINDOWS\system32>wsl --set-default-version 2
 For information on key differences with WSL 2 please visit https://aka.ms/wsl2

Installing Ubuntu (or Kali, or Debian, or openSUSE…)

Open the Microsoft Store, or use the Microsoft Store link to go directly to the installation of Ubuntu.

You also have several options, as indicated in https://wiki.ubuntu.com/WSL:

The recommended way to install Ubuntu on WSL is through the Microsoft Store.

The following Ubuntu releases are available as apps on the Microsoft Store:

  • Ubuntu 16.04 LTS (Xenial) is the first release available for WSL. It supports the x64 architecture only. (offline installer: x64)
  • Ubuntu 18.04 LTS (Bionic) is the second LTS release and the first one supporting ARM64 systems, too. (offline installers: x64, ARM64)
  • Ubuntu 20.04 LTS (Focal) is the current LTS release, supporting both x64 and ARM64 architecture.
  • Ubuntu (without the release version) always follows the recommended release, switching over to the next one when it gets the first point release. Right now it installs Ubuntu 20.04 LTS.

Each app creates a separate root file system in which Ubuntu shells are opened but app updates don’t change the root file system afterwards. Installing a different app in parallel creates a different root file system allowing you to have both Ubuntu LTS releases installed and running in case you need it for keeping compatibility with other external systems. You can also upgrade your Ubuntu 16.04 to 18.04 by running ‘do-release-upgrade’ and have three different systems running in parallel, separating production and sandboxes for experiments.

But if you prefer, instead of using the Windows Store, you can download the appx.

On the same page you can find downloads for several versions. Here is the link for Ubuntu 20.04 LTS: https://aka.ms/wslubuntu2004

Assuming you used the Windows Store: if you did not reboot and you now try to run it for the first time, you’ll get the abovementioned error whatever method you use, whether you go to the Command Line and type bash or open Ubuntu from the Windows menu.

If that happens to you, just reboot. When you open it again it will work: it will start the installation and ask for a username and password.

From here you’re able to update the system, run the text commands available in Linux, access the Windows drives, launch htop, git, Python3, apt, wget… copy and paste between the Windows and Linux terminals, share the PATH…

And of course you can run CTOP.py

Take into account that the space reported for the / partition is not real, and that you have a 4GB swap.

You can access your C:\ Windows files from:

/mnt/c/
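The mapping between a Windows path and its location under /mnt can be sketched like this. The function name is hypothetical, and I am assuming the default mount point /mnt/<drive letter>; if your mounts are configured differently the result will not match.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl_path(windows_path: str) -> str:
    """Translate an absolute Windows path such as C:\\data\\file.csv to /mnt/c/data/file.csv."""
    path = PureWindowsPath(windows_path)
    drive = path.drive  # for example "C:"
    if len(drive) != 2 or drive[1] != ":" or not path.root:
        raise ValueError("expected an absolute path with a drive letter")
    # parts[0] is the anchor (for example "C:\\"); the rest are folders and the file name.
    return str(PurePosixPath("/mnt", drive[0].lower(), *path.parts[1:]))
```

Inside WSL there is also the wslpath command for this kind of conversion, so a script like this is only needed when you build the paths from the Windows side.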

Upgrade your Scalability with NoSql

We’re experiencing another digital divide.

The first one was between people who knew about IT and those who didn’t, but now we’re living through another one, between IT people who are unable to Scale and those who can Scale well.

A few years ago I was working all the time with Relational Databases, designing cool relational Schemas for amazing projects. I had worked for years with Oracle, Microsoft Sql Server, Informix, Dbase, Trees, Xml, and more recently with PostgreSql and MySql.

I was doing a lot of improvements to MySql installations to allow Scaling and Scaling more, to bring more reliability, to improve performance, to allow more sessions… in short, to fit the needs of businesses in a challenging world that demanded more and more ability to handle more and more users.

Master-Master, Master with secondaries for reads, clusters of memcached or redis used as cache, database sharding, IP failover, load balancers, additional indexes, in-memory engines, Ramdisks… everything that could help to cope with an increase in load.

I used commercial products like CodeFutures dbShards, and I created my own database sharding solution to split the data across several MySql servers, etc.

Handcrafted setups and a lot of studying and testing, everything to Scale to the needs of the companies, to handle more and more traffic, more and more users…

And I was proud of my level, since I was able to succeed where few were.

But now that is not needed anymore.

Basically, the NoSql systems were born to deal with today’s problems.

NoSql servers (keep in mind that the term comprises a lot of different solutions) were born to:

  • Work in a cluster
  • Split the load among the nodes of the cluster
  • Work on cheap commodity servers (or small cloud instances)
  • Resist failure: allow the destruction of some nodes without data loss
  • Work with nodes in distant datacenters

There are many different NoSql products, like: Cassandra, Hadoop, MongoDb, Riak, Neo4J, Redis…

They do auto-sharding of the data, distribute the data across the network to satisfy the configured replication factor, and support load balancing. In the case of Cassandra, Scaling horizontally is as easy as adding more nodes to the Cassandra Cluster.

So yes, believe it. That’s why I’m writing this article: so you can improve your projects and save tons of money.

Databases like Cassandra allow you to Scale simply by adding new nodes. It is a peer-to-peer cluster with no single point of failure. All the nodes know the status of the other nodes and they distribute the load.

You can send all your queries to the same server, and it will still split the load among the other servers.
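To give an intuition of how a cluster can decide which nodes hold a given key, here is a toy hash ring in Python. This is only a sketch of the general consistent hashing idea, not Cassandra's actual implementation: the class name, the use of SHA-256 and the single token per node are all assumptions made for illustration.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Place a string on the ring using a stable hash (SHA-256 chosen only for illustration)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """A toy hash ring: one token per node, replicas on the next nodes clockwise."""

    def __init__(self, nodes: List[str], replication_factor: int = 3) -> None:
        unique_nodes = list(dict.fromkeys(nodes))  # drop duplicates, keep order
        if not unique_nodes:
            raise ValueError("the ring needs at least one node")
        # The replication factor cannot exceed the number of nodes.
        self.replication_factor = min(replication_factor, len(unique_nodes))
        self._ring = sorted((token(node), node) for node in unique_nodes)
        self._tokens = [node_token for node_token, _ in self._ring]

    def replicas_for(self, key: str) -> List[str]:
        """Return the nodes that would store the key, primary first."""
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return [
            self._ring[(start + offset) % len(self._ring)][1]
            for offset in range(self.replication_factor)
        ]
```

Whichever node receives the query can compute the same answer, for example ToyRing(["node1", "node2", "node3", "node4", "node5"]).replicas_for("user:42"), and forward the request to the nodes that own the key.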

NoSql like hadoop allows you to create a large filesystem across a cluster, with files as big as the whole cluster, but the best quality of HDFS is that it balances the load and replicates the blocks of data among different servers, so if you lose nodes of the cluster and you have a high enough replication factor you will not lose data. I know companies in Barcelona with 500+ TB in HDFS and companies in the States with thousands of nodes.
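The relation between the replication factor and data loss can be seen with a small simulation. This sketch assumes a simple round-robin placement of replicas, which is not how HDFS really places them (its real policy takes racks into account); the function names are hypothetical.

```python
from typing import Dict, List, Set


def place_blocks(num_blocks: int, nodes: List[str], replication_factor: int) -> Dict[int, List[str]]:
    """Assign each block to replication_factor distinct nodes, round-robin (an assumption)."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + offset) % len(nodes)] for offset in range(replication_factor)]
        for block in range(num_blocks)
    }


def lost_blocks(placement: Dict[int, List[str]], failed_nodes: Set[str]) -> List[int]:
    """Return the blocks whose replicas all live on failed nodes."""
    return [
        block
        for block, holders in placement.items()
        if all(holder in failed_nodes for holder in holders)
    ]
```

With 4 nodes and a replication factor of 3, losing any 2 nodes loses no block, because every block still has at least one replica on a surviving node. Losing 3 nodes does lose the blocks that were stored exactly on those 3.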

So, unlike what most people believe, NoSql is not mainly about how the information is stored in the database, that is, about being Schemaless. (* take a look at Graph NoSql databases for relations in NoSql)

NoSql does not have a Schema in the traditional sense of Relational Databases, but it has aggregation, columns, supercolumns, or documents depending on the solution, and the design has an impact on performance. The principal virtue of the NoSql systems, though, is that they were born to work in a cluster, to distribute the load, to be resilient to errors and to Scale.

I’ve seen many Startups suffering from overloaded MySql databases, but none of this has to happen with NoSql like Cassandra or MongoDb.

Before, they scaled the MySql server vertically: adding more RAM, adding more CPU, getting better disks, until it was impossible to upgrade any further. And if sharding was not possible due to joins, the project was in serious trouble.

But with NoSql you can have, instead of one expensive, very powerful server, 5 really cheap servers, and it could be faster, cheaper, resilient to errors, and with better uptime. And if you want to Scale, simply add more cheap servers.

The most important part of this article has been said, so you can start to look at NoSql solutions.

As a bonus, here is a list of NoSql systems and the kind of Data Model that they have:

 

| Database name | Type of data model | Extra info | Companies using it |
| --- | --- | --- | --- |
| Memcached | Key-Value | Storage is in Memory, so it is used mainly as cache | Companies I’ve worked for: ECManaged, privalia. Other well known companies: LiveJournal, Wikipedia, Flickr, Bebo, Twitter, Typepad, Yellowbot, Youtube, Digg, WordPress.com, Craigslist, Mixi |
| Redis | Key-Value | Works in cluster. Can be used in memory or persistent | Companies I’ve worked for: Atrapalo, ECManaged. Other well known companies: Twitter, Instagram, Github, Engine Yard, Craigslist, guardian.co.uk, blizzard, digg, flickr, stackoverflow, tweetdeck |
| Riak | Key-Value | Supports a REST API through HTTP and Protocol Buffers for basic PUT, GET, POST, and DELETE. MapReduce with native Javascript and Erlang. In multi-datacenter replication, one cluster acts as a “primary cluster”. | AT&T, AOL, Ask.com, Best Buy, Boeing, Bump, Braintree, Comcast, DataPipe, Gilt Group, UK National Health Services (NHS), OpenX, Rovio, Symantec, TBS, The Weather Channel, WorkDay, Voxer, Yahoo! Japan, Yandex |
| BerkeleyDB | Key-Value | | |
| LevelDB | Key-Value | | |
| Project Voldemort | Key-Value | | LinkedIn |
| Google BigTable | Key-Value | | |
| Amazon DynamoDB | Key-Value | DynamoDB from Amazon, run in their AWS Cloud solution. See info on wikipedia | |
| Cassandra | Column-Family | My favourite Db-alike. You can download my CQLSÍ wrapper for PHP :) | NetFlix, Spotify, Facebook used it until 2010, Instagram, Rackspace, Rockyou, Zoho, Soundcloud, Hailo, ComCast, Hulu |
| HBase | Column-Family | Provides BigTable-like, SQL alike, support on the Hadoop core | |
| Hypertable | Column-Family | | |
| Amazon SimpleDB | Column-Family | | |
| MongoDB | Document Databases | Written in C++, JSON-style documents, default stores to RAM until flush, high performance but dangerous for data integrity. Supports Map-Reduce | |
| CouchDB | Document Databases | | |
| OrientDb | Document Databases | | |
| RavenDB | Document Databases | | |
| Terrastore | Document Databases | (legacy) | |
| Infinite Graph | Graph Databases | | |
| HyperGraph DB | Graph Databases | | |
| FlockDB | Graph Databases | | |
| Neo4J | Graph Databases | | |
| OrientDB | Graph Databases | | |

Bonus for PHP Developers: two very simple, lightweight key-value store components useful for one-server PHP projects are APC (datastore capability) and Cache Lite (part of PEAR).

I can’t fail to mention hadoop, a NoSql that does not match the Data Storage categories above because it is a Framework for the distributed processing of large data sets across clusters. So it is a monster, able to do many, many things and to distribute loads across its nodes. The most well-known components are HDFS, the distributed filesystem, and Map-Reduce: a simple-to-develop, YARN-based system for parallel processing of large data sets across the clusters. All the big companies like Netflix, Amazon, Yahoo, etc… are using Hadoop. It is often used as a synonym for BigData.
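If you have never seen Map-Reduce, the classic word count shows the idea. The following is a plain Python sketch that runs on one machine and does not use the Hadoop API; the function names are my own. In a real cluster each map call would run on the node that holds the block.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_block(lines: List[str]) -> Counter:
    """Map step: count the words of one block, independently of the other blocks."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(blocks: Iterable[List[str]]) -> Counter:
    """Run the map step on every block and reduce the partial counts."""
    return reduce(reduce_counts, map(map_block, blocks), Counter())
```

Because each block is mapped without looking at the others, the map step is the part that can be spread across many machines, and only the small partial counts travel over the network to be reduced.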

Hadoop is a world in itself, together with the many projects surrounding it, but it is worth the effort because it allows incredible possibilities to distribute loads and to Scale.

Hadoop has a single point of failure in the namenode, which stores the names of the HDFS files in RAM, but solutions like MapR have overcome this.

Don’t get me wrong. Relational databases are wonderful, very useful, support transactions, stored procedures, have been tested for years, focused on consistency, and are very reliable.

They simply don’t allow us to Scale according to our current needs, while NoSql opens a wonderful world of easy, nearly infinite Scaling.

As you see Open Source is ruling the world. :)

Companies are still sleeping and not supporting NoSql. I’m particularly disappointed with Open Source CMSs that are still based on Relational Models and are very hard to Scale. Drupal, WordPress, Joomla… and e-commerce platforms like Magento, osCommerce… and plugins for the CMSs mentioned (Ubercart, WooCommerce, VirtueMart…) need to be ported to NoSql urgently. (Although partial support exists in some solutions, it is not complete.)
That’s why I’ve started to create a very simple Open Source CMS based on NoSql, to help companies and bloggers that can’t Scale their sites any further.