From SQL to NoSQL – and back again

We take a look at the suppliers behind the emergence of databases that organise web-scale datasets and the like – and the most recent set of upstarts dubbed NewSQL

The first thing to say about NoSQL databases is that the name is misleading.

Another term used to describe this family of technologies is “not only SQL”, which may be cumbersome, but at least acknowledges that SQL (structured query language) – the main interface to traditional relational database management systems (RDBMS) – is still sometimes used to query NoSQL databases.

That said, NoSQL has become the common parlance for databases used to process the huge datasets that many web-scale applications rely on.

NoSQL databases have a more flexible model than RDBMSs, making it easier to organise large amounts of data with varied formats that change over time.

For example, take a car manufacturer’s database, linked via an internet of things (IoT) application to every individual car that has been sold.

Each model has a different initial data structure, which may change over time as more sensors are provisioned and the data collected from them increases.

Customised records of the millions of individual cars can be maintained as they develop unique change and service histories.
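As a rough illustration of that flexibility, the sketch below keeps per-car records as plain Python dictionaries and lets each one gain fields independently. The vehicle identifiers, field names and values are all hypothetical, and a real NoSQL product would add persistence and distribution on top of this idea.

```python
import json

# Hypothetical records: two models that start with different structures.
cars = {
    "VIN001": {"model": "hatchback", "sensors": {"tyre_pressure_kpa": 220}},
    "VIN002": {
        "model": "saloon",
        "sensors": {"tyre_pressure_kpa": 230, "battery_temp_c": 31},
    },
}


def record_service(store, vin, event):
    """Append a service event; the history list is created on first use."""
    store[vin].setdefault("service_history", []).append(event)


def add_sensor(store, vin, name, value):
    """Attach a newly provisioned sensor reading to one car only."""
    store[vin]["sensors"][name] = value


# Only VIN001 changes shape; VIN002 is untouched and needs no migration.
record_service(cars, "VIN001", {"date": "2016-03-01", "work": "brake pads replaced"})
add_sensor(cars, "VIN001", "brake_wear_pct", 12)

print(json.dumps(cars["VIN001"], indent=2))
```

No table definition has to be altered when one car gains a new sensor or its first service entry, which is the property the example in the text relies on.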

NoSQL databases aim to provide unlimited scalability, delivering consistent high performance, however big the data set becomes and however many nodes the database is spread across.

To this end all the popular NoSQL products support Hadoop, the open source software ecosystem for big data that provides massively parallel processing across clusters of thousands of servers; just add more hardware or cloud resource to keep up with demand. Some large and complex transactions may take hours to process on a centralised RDBMS, but only a few minutes on a NoSQL database distributed across a Hadoop cluster.

Millions of simultaneous users can be supported over different time periods. For example, a simple finance transaction may take just milliseconds to complete, but the need to monitor a user’s online activity may persist for minutes or hours.

RDBMSs still dominate the overall database market on many measures and are still the obvious choice for some transactional applications, especially in finance, where great store is placed on the consistency of transactions. NoSQL trades stringent consistency for speed and agility. A user who interrupts a video stream is impressed when the service later picks up where they left off; if it does not, that is inconvenient, but it does not create a legal problem.

That said, one NoSQL supplier, DataStax, says it is seeing a move to more relaxed consistency accepted by banks in favour of the immediate customer experience, for example, authorising a cash withdrawal at a remote cashpoint before funds are confirmed to be available; better to please most customers and pursue the few that go over agreed limits at a later date.
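The trade-off between consistency and speed can be sketched with a toy store that acknowledges a write as soon as one replica has it and copies it to the others later. This is a simplified illustration only: the class names, the `sync` step and the key format are assumptions made for the example, not how any particular product replicates data.

```python
class Replica:
    """One copy of the data, held on a hypothetical node."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes land on replica 0 first and reach the rest on sync()."""

    def __init__(self, replica_count=2):
        self.replicas = [Replica() for _ in range(replica_count)]
        self._pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # Fast path: acknowledge once a single replica holds the value.
        self.replicas[0].data[key] = value
        self._pending.append((key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].data.get(key)

    def sync(self):
        # Stand-in for background replication between nodes.
        for key, value in self._pending:
            for other in self.replicas[1:]:
                other.data[key] = value
        self._pending.clear()


store = EventuallyConsistentStore()
store.write("viewer-7:position_s", 1325)

stale = store.read("viewer-7:position_s", replica=1)  # not replicated yet
store.sync()
fresh = store.read("viewer-7:position_s", replica=1)  # now visible everywhere
```

Between the write and the sync, a read from the second replica misses the saved viewing position. For a video stream that is a minor annoyance; for a bank balance it is the risk the supplier describes banks as increasingly willing to accept.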

The term NoSQL is not new. It was first used in the late 1990s, and some of the data models that underpin it go back even further. However, it has come to prominence with the rise of big data and the need for web-scale processing. Another supplier, Couchbase, describes three market phases. First, in the mid-2000s, the market was developer-driven and served small-scale, niche projects. Then, in the 2010s, NoSQL started to be used for mission-critical applications, where guarantees around scale and performance were required. The third phase, for Couchbase, is the beginning of enterprise-wide deployments: NoSQL is now being used to manage some of the largest global data stores, for applications ranging from social data mining, through sensor data analytics, to stock market analysis.

Open source

The developer-driven nature of the early NoSQL market lent itself to the open source licence model, and most products are licensed as such. Measured by downloads, MongoDB dominates the market, claiming 300,000 to 400,000 deployments (most using the free version). MongoDB is an example of company-driven open source, where a supplier owns the source code and publishes it under an open source licence, charging only those customers that want a certain service level agreement or add-ons. Couchbase operates in the same way. A comparison from the RDBMS world would be Oracle’s MySQL.

The alternative is “true” open source projects, such as those run by the Apache Software Foundation, which commercial suppliers can modify and offer as commercial distributions, in the way Red Hat does with the Linux operating system.

The most widely used commercial distribution of the Cassandra open source project is from DataStax. Other Apache-run NoSQL projects include CouchDB (not to be confused with Couchbase) and HBase. There are also pure commercial NoSQL offerings such as MarkLogic.

Flavours of NoSQL

There are four basic types of NoSQL database and a dizzying number of suppliers with overlapping support for the various systems. A given organisation may use multiple products to support different use cases, and a single supplier may support different models. For example, DataStax says it already supports the first three listed below and will soon support the fourth.

The most basic model is the key value store, which is supported by most NoSQL databases. Data is stored as objects, the structure and composition of which may vary for each entity. Each object is associated with a key, which is used to retrieve the required data quickly from look-up tables. Typical use cases are online advertising and gaming, where cookies stored on a server can be quickly retrieved and used to feed custom content to millions of individual users. Aerospike is a specialist in this area; others include Basho’s Riak, the Symas LMDB (lightning memory-mapped database), Redis and MemcacheDB.
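A minimal in-memory sketch of the key value idea follows. The `KeyValueStore` class, its method names and the cookie keys are hypothetical and do not reflect the API of any product named above; the point is only that the store looks values up by key and does not care what shape they take.

```python
class KeyValueStore:
    """Toy key value store: values are opaque and found only by their key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()

# Values under different keys need not share a structure.
store.put("cookie:8f3a", {"segment": "gamer", "last_ad": "console-bundle"})
store.put("cookie:c71d", b"\x00\x01opaque-bytes")

profile = store.get("cookie:8f3a")   # direct look-up by key
missing = store.get("cookie:none")   # unknown key falls back to the default
```

Because every operation is a single look-up by key, this model is straightforward to partition across many servers, which helps explain its use for per-user content at very large scale.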

Apache’s Cassandra and HBase are both examples of column databases which build on key value stores by creating collections of one or more key value pairs that match a record. This model is particularly good for handling time series data such as IoT tables, messaging and recommendation engines.
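To make the column model more concrete, the sketch below groups key value pairs under a row key and reads back a time-ordered slice, which is roughly how time series data tends to be laid out. The `WideColumnTable` class and the sensor names are assumptions made for illustration; this is not the Cassandra or HBase API.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy column store: each row key maps to a collection of column/value pairs."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def insert(self, row_key, column, value):
        self._rows[row_key][column] = value

    def slice(self, row_key, start, end):
        """Return (column, value) pairs with start <= column <= end, in column order."""
        row = self._rows.get(row_key, {})
        return [(col, row[col]) for col in sorted(row) if start <= col <= end]


readings = WideColumnTable()

# One row per car and sensor; ISO timestamps as column names sort chronologically.
readings.insert("car-42:engine_temp", "2016-03-01T10:00", 88)
readings.insert("car-42:engine_temp", "2016-03-01T10:05", 91)
readings.insert("car-42:engine_temp", "2016-03-01T10:10", 94)

window = readings.slice("car-42:engine_temp", "2016-03-01T10:05", "2016-03-01T10:10")
```

A row can keep growing as new readings arrive, and a query for a time window only needs to touch one row.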

Cassandra was initially developed by Facebook to power inbox search and was open-sourced in 2008. The DataStax product, the flagship commercial distribution of Cassandra, was released in 2010, and the company now has around 500 paying customers deploying cloud applications that are intensely transactional and can “never go down”, supporting needs such as user activity tracking and fraud detection.

MongoDB and Couchbase are both document databases; others include MarkLogic and CouchDB. Cassandra now supports the document model as well. JSON (JavaScript Object Notation) is used to describe documents and define dynamic schemas. MongoDB cites one customer, Sky’s Now TV, that is able to handle spikes in demand for online TV streaming during the screening of popular events, including English Premier League games or new episodes of Game of Thrones. This ensures that, wherever possible, a unique and consistent customer experience is maintained. Now TV hopes this will enable it to take on Netflix, which is itself a DataStax customer.

Couchbase says it focuses mainly on enterprise deployments and has 500 paying customers. Many are in digital economy segments such as e-commerce, online video, financial services and the IoT, where, for example, it works with GE on its Predix industrial internet platform. Couchbase is widely used for recording user profiles and sessions, storing millions of unique states, and for maintaining regionally dynamic inventories for retailers such as Tesco and Staples.

Finally, many consider graph databases – such as OrientDB and Neo4j – to be NoSQL databases. Here another layer is added to the relationship between objects and documents via links that map them together for rapid association. Examples include mapping connections on social media, finding patterns in crime data for predictive policing and optimising call routing across telco networks.
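The social media case can be sketched as a small graph in which links between people are stored directly and followed outwards from a starting point. The `Graph` class, the `within` method and the names are hypothetical; real graph databases offer their own query languages for this kind of traversal.

```python
from collections import defaultdict, deque


class Graph:
    """Toy undirected graph: nodes plus the links that connect them."""

    def __init__(self):
        self._edges = defaultdict(set)

    def link(self, a, b):
        self._edges[a].add(b)
        self._edges[b].add(a)

    def within(self, start, max_hops):
        """Return every node reachable from start in at most max_hops links."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for neighbour in self._edges.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, hops + 1))
        return seen - {start}


social = Graph()
social.link("alice", "bob")
social.link("bob", "carol")
social.link("carol", "dave")

friends_of_friends = social.within("alice", 2)
```

Because the links are stored with the nodes, the traversal follows them directly, whereas a relational design would usually need to join a relationships table to itself once per hop.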

Querying NoSQL databases

As mentioned at the start, NoSQL is a poor name, as SQL can be used to query NoSQL databases. Couchbase has a fully SQL-compliant interface. However, many have their own interfaces, for example the MongoDB and Cassandra query languages. Drivers and SDKs are provided for the most popular programming languages, allowing database calls to be embedded in applications. There are also a variety of ad hoc query and analysis tools, such as Drill, Spark SQL, Hive and Pig, all open source and available from Apache.
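The difference between a product-specific interface and SQL can be illustrated by asking the same question both ways. In this sketch the `find` function is a hypothetical filter-by-example helper, loosely in the spirit of document database queries, and Python's built-in `sqlite3` module stands in for a SQL interface; neither represents the actual query layer of the products named above.

```python
import sqlite3

# Hypothetical user documents.
docs = [
    {"name": "Ann", "city": "Leeds", "plan": "premium"},
    {"name": "Bob", "city": "York", "plan": "basic"},
    {"name": "Cy", "city": "Leeds", "plan": "basic"},
]


def find(collection, criteria):
    """Return documents whose fields equal every key/value pair in criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Native-style query: the filter is itself a small document.
native = [doc["name"] for doc in find(docs, {"city": "Leeds"})]

# The same question expressed in SQL against an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, city TEXT, plan TEXT)")
conn.executemany("INSERT INTO users VALUES (:name, :city, :plan)", docs)
via_sql = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM users WHERE city = ? ORDER BY rowid", ("Leeds",)
    )
]
conn.close()
```

Both routes return the same users; the choice between them is mostly about which interface developers and existing tools already speak.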

The fight back! NewSQL

The old school has not taken the NoSQL challenge lying down; Oracle, for example, has its own NoSQL implementation. However, RDBMS suppliers and the now almost-establishment NoSQL school are having to face up to a new set of upstarts dubbed NewSQL, a class of RDBMS that seeks to provide the same scalable performance as NoSQL while still maintaining transactional consistency. NewSQL suppliers include MemSQL, NuoDB, MariaDB and Google Spanner.

What goes around comes around!