Abstract

With the rise of Web 2.0 applications and the explosive growth of data,
traditional relational databases face many new challenges. It is therefore
hoped that a new type of database can take over from the traditional relational
database in large-scale, high-concurrency applications. In this context, the
NoSQL database came into being. NoSQL has since become a hot topic in industry
and academia because it addresses the high-concurrency, high-scalability and
high-availability requirements that traditional relational databases struggle
to meet.

In this paper, four types of storage model are introduced: column storage,
document storage, key-value storage and graph storage. According to the CAP
theorem, NoSQL databases are divided into three categories: those satisfying
the CA principle, those satisfying the CP principle and those satisfying the AP
principle. Finally, we introduce the advantages, disadvantages and architecture
of four mainstream NoSQL databases: BigTable, MongoDB, Redis and Neo4j.

 

Keywords: NoSQL; storage model; database architecture; BigTable; MongoDB;
Redis; Neo4j

 

1. Introduction

The term “relational database” [1] was coined in 1970 by IBM’s Edgar Codd in
his paper “A Relational Model of Data for Large Shared Data Banks”. Since then,
relational databases have been the main model of database management.
Traditional RDBMSs provide a powerful mechanism for storing and querying
structured data with strong consistency and transaction guarantees, and over
decades of development they have achieved high reliability, stability and
support.

With various advances in computing, scalability, resource utilization and
energy conservation have been given higher priority. To take advantage of cloud
computing technology, SQL vendors have proposed two methods: manual sharding
and caching. However, these are not enough for modern Web applications, which
must keep changing as requirements evolve and therefore need more flexibility
than traditional SQL databases can offer. A new database model named NoSQL is
being valued as an alternative approach to database management. Most people
agree that NoSQL stands for “Not Only SQL” [2]. The goal of NoSQL is not to
reject SQL, but to offer an alternative data model for applications that fit
poorly with the relational model. NoSQL databases meet these requirements with
high scalability and easy-to-program models. After introducing the background
of NoSQL, we focus on the advantages and disadvantages of NoSQL databases and
introduce some of the major ones.

 

2. Features, Categories and CAP Theorem

2.1 Features of NoSQL

The main advantages of NoSQL compared with relational databases are the
following:

1) Highly scalable: many NoSQL databases use a peer-to-peer architecture in
which all nodes are the same. This scales easily to accommodate the volume and
complexity of cloud applications;

2) Flexible: the multi-model capabilities of NoSQL databases make them
extremely flexible in processing data. They can easily handle structured,
semi-structured and unstructured data;

3) Fast reading/writing: NoSQL databases have very high read and write
performance, even with large data volumes. This is due to their non-relational
nature and simple structure;

4) Low-cost: open-source NoSQL databases do not require expensive licensing
fees and can run on modest hardware. Simpler data distribution and simpler data
models make the system less costly to maintain.

Of course, NoSQL databases are not perfect; they also have shortcomings and
limitations. First, most NoSQL databases do not support SQL, which means
queries must be hand-coded or written in proprietary query languages, adding
time and complexity. Second, most NoSQL databases do not support the
reliability guarantees provided by traditional relational databases, namely
atomicity, consistency, isolation and durability (ACID). NoSQL databases that
lack these guarantees sacrifice consistency for better performance and
scalability. To make a NoSQL database reliable and consistent, developers must
write additional code, which makes the system more complex.

2.2 Categories

NoSQL databases began to gain popularity around 2000, when many large
companies began to invest in and research distributed databases [3]. With the
rapid development of Web 2.0, non-relational and distributed data storage
developed rapidly, and the number of NoSQL database categories grew. The most
common categories are as follows:

1) Column storage: as the name implies, data is stored by column, so each
column can be read much like an index. Its main strengths are convenient
storage of structured and semi-structured data and efficient data compression.
It has a very large I/O advantage for queries against one column or a few
columns.

2) Document storage: data is typically stored in a JSON-like format, and the
unit of storage is a document. Individual fields can be indexed, which provides
some of the functions of a relational database.

3) Key-value storage: a hash table is used in which each unique key points to a
specific item, so a value can be looked up quickly by its key. In general, a
value of any format can be stored.

4) Graph storage: the best fit for graph-like relationships. Data is stored in
a graph structure so that related data can be reached easily. Solving such
problems with a traditional relational database performs worse and is
inconvenient to design. Graph databases are typically used for social
networking applications.
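To make the four storage models more concrete, the following minimal sketch lays
out the same small data set in each form using plain Python structures. All
names and sample values are illustrative assumptions and are not taken from any
particular product.

```python
# Illustrative only: the same tiny "users" data set in each storage model.

# Row-oriented layout (as in a traditional relational table), for comparison.
rows = [
    {"id": 1, "name": "Ann", "age": 31},
    {"id": 2, "name": "Bob", "age": 27},
]

# Column storage: one contiguous list per column.
columns = {
    "id": [1, 2],
    "name": ["Ann", "Bob"],
    "age": [31, 27],
}

# Document storage: each record is a self-describing, JSON-like document;
# documents in the same collection may carry different fields.
documents = [
    {"_id": 1, "name": "Ann", "age": 31, "tags": ["admin"]},
    {"_id": 2, "name": "Bob", "age": 27},
]

# Key-value storage: an opaque value looked up by a unique key.
key_value = {
    "user:1": b'{"name": "Ann", "age": 31}',
    "user:2": b'{"name": "Bob", "age": 27}',
}

# Graph storage: nodes plus explicit relationships between them.
nodes = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
edges = [(1, "FRIEND_OF", 2)]


def average_age_from_columns(cols):
    """Aggregate one column without reading any of the other columns."""
    ages = cols["age"]
    return sum(ages) / len(ages)


def friends_of(node_id):
    """Follow FRIEND_OF relationships that leave the given node."""
    return [nodes[dst]["name"] for src, rel, dst in edges
            if src == node_id and rel == "FRIEND_OF"]
```

Computing the average age touches only the "age" list in the column layout,
whereas the row layout would require reading every field of every row. This is
the I/O advantage mentioned for column storage.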

2.3 CAP Theorem

After Fischer, Lynch and Paterson proved in 1985 that no deterministic
algorithm can guarantee consensus in an asynchronous system in which even one
process may fail (the FLP impossibility result) [4], people began to examine
the trade-offs involved in distributed system design. The CAP conjecture [5]
was proposed by Professor Eric Brewer in 2000. Gilbert and Lynch proved
Brewer’s conjecture in 2002, which raised CAP to a theorem. The core of the CAP
theorem is that a distributed system can guarantee at most two of the three
characteristics shown in Figure 1 at the same time:

Figure 1 CAP Theorem

Consistency (C): reads and writes are atomic and strictly consistent [6]. In
other words, the data on all nodes stays in sync at all times.

Availability (A): as long as a server is working, it accepts read and write
requests from clients, and every request receives a response, whether the
operation succeeds or fails.

Partition tolerance (P): the system continues to provide service even if two
sets of servers are isolated from each other or messages are lost within the
system.

According to the CAP theorem, we can never have all three properties at the
same time. Based on which properties they prioritize, NoSQL databases were
initially divided into the following three categories:

CA without P: C (strong consistency) and A (availability) can both be
guaranteed if P is not required (no partitions). In practice, however,
partitions are not a matter of choice; they will always occur eventually. A CA
system therefore usually means that the system is allowed to split into
partitions, with each subsystem remaining CA.

CP without A: if A (availability) is not required, every request needs strong
consensus among the servers, and a partition (P) can extend the synchronization
time indefinitely; consistency and partition tolerance are preserved at the
cost of availability. Many traditional distributed database transactions follow
this model.

AP without C: to be highly available and tolerate partitions, the system must
give up consistency. Once a partition occurs, nodes may lose contact with one
another. To stay available, each node can serve requests only from its local
data, which can lead to inconsistencies in the global data. Many NoSQL
databases now belong to this category.
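The difference between the CP and AP choices can be illustrated with a toy
model. The sketch below is a simplified assumption, not a real replication
protocol: it keeps two in-memory replicas, and the `TwoNodeStore` class and its
`partitioned` flag are hypothetical names introduced only for this example.

```python
class Replica:
    """One node holding its own local copy of the data."""

    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store; `mode` is either "CP" or "AP"."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = {"a": Replica(), "b": Replica()}
        self.partitioned = False  # True means the replicas cannot communicate

    def write(self, node, key, value):
        local = self.replicas[node]
        remote = self.replicas["b" if node == "a" else "a"]
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the request rather than let the replicas diverge.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write locally; the replicas may now disagree.
            local.data[key] = value
            return
        # No partition: update both replicas so they stay in sync.
        local.data[key] = value
        remote.data[key] = value

    def read(self, node, key):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable during partition")
        return self.replicas[node].data.get(key)
```

With `mode="AP"`, a write sent to node "a" during a partition succeeds, but a
later read from node "b" still returns the old value. With `mode="CP"`, the same
write is rejected, so the two replicas never disagree.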

 

3. Typical NoSQL Databases

3.1 Column-Oriented Databases

These databases store data in a column-oriented schema, which is mainly
suitable for batch data processing and real-time queries. Although column
storage does not overturn the traditional way of storing data, compared with
most current relational databases, which store data by row, column-based
databases can sustain high-performance data processing and analysis with high
scalability. Column storage databases include BigTable, HBase, Cassandra,
Hypertable, etc.

3.1.1 BigTable

BigTable [7] is a distributed data storage system designed by Google; it is a
non-relational database used to process massive amounts of data. BigTable has
achieved several goals: wide applicability, scalability, high performance and
high availability. It has been used in more than 60 Google products and
projects, including Google Analytics, Google Finance, Orkut, Personalized
Search, Writely, Google Earth, etc.

3.1.2 Features of BigTable

BigTable is not a relational database, but in many ways it resembles one. It
uses many relational database terms such as table, row and column, which makes
it easy for readers to confuse the concepts of BigTable with those of
relational databases. In essence, BigTable is a sparse, distributed,
persistent, multi-dimensional sorted map (key => value). The key has three
dimensions: row key, column key and timestamp. A record is indexed by a key of
the form (row:string, col:string, time:int64), where the row and column keys
are byte strings and the timestamp is a 64-bit integer. Figure 2 presents an
example of BigTable storing web page information:

Figure 2 A snippet of the Webtable.

Webtable stores many web pages and related information. Each row stores one web
page, with the page’s reversed URL as the row key; for example, the data for
the page “maps.google.com/index.html” is stored in the row whose key is
“com.google.maps/index.html”. URLs are reversed so that pages from sub-domains
of the same domain are clustered together, which makes analysis of a single
domain more efficient.
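The row-key convention can be sketched in a few lines. The helper name
`row_key` is an assumption made for this example, and the function handles only
scheme-less URLs of the kind quoted above.

```python
def row_key(url):
    """Reverse the host-name components of a scheme-less URL.

    Only the host part (before the first "/") is reversed; the path is kept.
    """
    host, slash, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + slash + path


pages = ["maps.google.com/index.html", "www.cnn.com", "mail.google.com/inbox"]

# Sorting by the reversed key places pages of the same domain next to each other.
sorted_keys = sorted(row_key(page) for page in pages)
```

After sorting, the two google.com pages are adjacent, which is what allows a
scan over one domain to read a contiguous range of rows.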

Columns are the secondary index. The number of columns per row is unlimited,
and columns can be added or removed at any time. To ease management, column
keys are grouped into sets called column families (the basic unit of access
control). Columns in a column family generally store the same type of data.
Column families rarely change, but the columns within a family can be added or
deleted at will. The Webtable in Figure 2 has three columns, whose keys are
named using the format “family:qualifier”. The column family “contents” holds
the document content of the web page and has only one column, with an empty
qualifier. The column family “anchor” holds the anchor link text that refers to
the web page and contains two columns, “cnnsi.com” and “my.look.ca”.

The timestamp is the third index. BigTable can keep multiple versions of the
same data, and the versions are indexed by timestamp. The timestamp can be
assigned by BigTable, in which case it represents the time at which the data
entered BigTable, or it can be assigned by the client. In the Webtable, the
“contents:” column is set to keep only the latest three versions of the web
page (the number of versions is configurable), with timestamps t3, t5 and t6.
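The three-dimensional index described in this section can be modelled as a
small in-memory map. The sketch below is only an illustration of the data
model, not of how BigTable is implemented; the `TinyTable` class, its method
names and the integer timestamps are assumptions made for the example.

```python
from collections import defaultdict


class TinyTable:
    """In-memory stand-in for the (row, column, timestamp) -> value map."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # cells[(row, column)] -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        versions = self.cells[(row, column)]
        # Overwrite an existing version that has the same timestamp.
        versions[:] = [tv for tv in versions if tv[0] != timestamp]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # keep only the newest versions

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        for ts, value in self.cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, row_prefix):
        """Yield (row, column) keys in sorted order for rows with this prefix."""
        for key in sorted(self.cells):
            if key[0].startswith(row_prefix):
                yield key
```

For example, writing the “contents:” column of one row at timestamps 1, 3, 5
and 6 with `max_versions=3` leaves versions 6, 5 and 3 in the table, and a read
with `timestamp=4` returns the version written at 3.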

3.1.3 BigTable Architecture

The architecture of the BigTable database consists of three components: a
client library, one master server and multiple tablet servers (sub-table
servers), as shown in Figure 3. If we think of the database as one large table,
we can divide it into many small basic tables, called tablets (sub-tables),
which are the smallest units of distribution in BigTable.

Figure 3 BigTable Architecture

The master server is responsible for assigning tablets to tablet servers,
detecting tablet servers that join or fail, garbage-collecting files in the
Google File System, and balancing load between tablet servers.

Each tablet server handles read and write requests for the tablets it manages.
A tablet server manages a set of tablets, usually between 10 and 1000. When a
tablet grows too large, the tablet server splits it.

The client accesses the data in the tables through the client library API.
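The division of a table into tablets, each covering a contiguous range of row
keys, can be sketched as follows. This is a minimal sketch under stated
assumptions: the class name `TabletServerSketch` is hypothetical, and the split
threshold `MAX_ROWS` counts rows purely to keep the example small.

```python
import bisect


class TabletServerSketch:
    """Keeps tablets as contiguous row-key ranges and splits large ones."""

    MAX_ROWS = 4  # hypothetical split threshold, chosen only for illustration

    def __init__(self):
        self.start_keys = [""]  # sorted start key of each tablet
        self.tablets = [{}]     # tablets[i] maps row key -> value

    def _index_for(self, row_key):
        # The owning tablet is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key, value):
        i = self._index_for(row_key)
        self.tablets[i][row_key] = value
        if len(self.tablets[i]) > self.MAX_ROWS:
            self._split(i)

    def get(self, row_key):
        return self.tablets[self._index_for(row_key)].get(row_key)

    def _split(self, i):
        # Split at the median row key so both halves stay contiguous ranges.
        old = self.tablets[i]
        keys = sorted(old)
        middle = keys[len(keys) // 2]
        self.tablets[i] = {k: v for k, v in old.items() if k < middle}
        self.start_keys.insert(i + 1, middle)
        self.tablets.insert(i + 1, {k: v for k, v in old.items() if k >= middle})
```

Because the start keys are kept sorted, locating the tablet for a row key is a
binary search, and a split only inserts one new boundary without moving rows in
other tablets.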

3.2 Document Databases

Document databases are a very important branch of NoSQL. They are mainly used
to store, index and manage document-oriented or similar semi-structured data.
Rather than pursuing highly concurrent reads and writes, document databases aim
to provide large-scale data storage with good query performance. Typical
document databases include MongoDB, CouchDB, RavenDB, etc.

3.2.1 MongoDB

MongoDB [8] is an open-source, distributed, document-oriented non-relational
database. The main reason it abandons the relational model is to achieve easier
scalability and higher performance. The data structure supported by MongoDB is
very loose: data is stored as JSON-like documents made up of key-value pairs,
so it can store complex data types. The biggest feature of MongoDB is its
powerful query language, which can express most of the functions of relational
database query statements.

3.2.2 Features of MongoDB

The main features of MongoDB are: 1) scalability and high performance;
2) efficient data storage, with support for binary data and large objects;
3) dynamic queries: MongoDB supports rich query expressions, and query
instructions are written as JSON-style expressions; 4) complete index support:
the MongoDB query optimizer analyzes query expressions and generates an
efficient query plan; 5) support for replication and recovery. Owing to these
features, MongoDB has become the most popular document database.
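To show what a JSON-style query expression looks like, the sketch below
evaluates a small subset of MongoDB-style filters against in-memory
dictionaries. It is a toy evaluator written for this illustration and does not
use a MongoDB driver; the function names `matches` and `find` and the sample
collection are assumptions, while `$gt`, `$gte`, `$lt` and `$in` follow the
spelling of MongoDB's query operators.

```python
def _compare(op):
    """Build a comparison that treats a missing field (None) as no match."""
    return lambda value, operand: value is not None and op(value, operand)


OPERATORS = {
    "$gt": _compare(lambda a, b: a > b),
    "$gte": _compare(lambda a, b: a >= b),
    "$lt": _compare(lambda a, b: a < b),
    "$in": lambda value, operand: value in operand,
}


def matches(document, query):
    """Return True if `document` satisfies every condition in `query`.

    Supports plain equality and the operators in OPERATORS only; nested
    documents and logical operators are outside the scope of this sketch.
    """
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            for op, operand in condition.items():
                if not OPERATORS[op](value, operand):
                    return False
        elif value != condition:
            return False
    return True


def find(collection, query):
    """Return all documents in `collection` that match `query`."""
    return [doc for doc in collection if matches(doc, query)]


users = [
    {"name": "Ann", "age": 31, "city": "Paris"},
    {"name": "Bob", "age": 27, "city": "Paris"},
    {"name": "Cid", "city": "Rome"},  # documents need not share a schema
]

adults_in_paris = find(users, {"city": "Paris", "age": {"$gt": 30}})
```

The filter `{"city": "Paris", "age": {"$gt": 30}}` plays the role of a SQL
`WHERE city = 'Paris' AND age > 30` clause, which is the sense in which
JSON-style queries cover much of what relational query statements express.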

3.2.3 MongoDB Architecture

A MongoDB cluster consists of a number of shard nodes, mongos processes
(routing) and config servers (configuration), as shown in Figure 4.

Figure 4 MongoDB Architecture

Shard node: a shard contains a group of mongod processes and is responsible for
storing data. A group usually has two or more mongod processes, which keeps the
system available and avoids service interruption and data loss when a single
node in the cluster fails. In addition, data is partitioned in key order, and
each shard holds data blocks (chunks) that each cover a range of keys, so range
queries on the shard key are supported.

Mongos: mainly responsible for data routing and coordination; several mongos
processes can run at the same time. A mongos locates the data on the shard
nodes and then returns the query result to the client. Mongos nodes are
stateless: they persist no data or metadata of their own, so they can be scaled
out horizontally at will, and if any one node fails, clients can fail over to
another without serious impact.

Config server: responsible for storing the metadata and routing information of
the MongoDB cluster, including the location of the data and the cluster
configuration. Config servers are highly available to some extent: each config
server holds a copy of the information about all data blocks, and these copies
are kept consistent across the config servers.
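How the three components cooperate can be sketched with a range-based routing
table. In this minimal sketch the `CHUNKS` list stands in for the metadata a
config server would hold and the two functions stand in for the routing step of
a mongos; the shard names, key ranges and function names are all hypothetical.

```python
import bisect

# Hypothetical chunk metadata: (inclusive lower bound of the key range, shard).
# Each range extends up to the next entry's lower bound.
CHUNKS = [(0, "shard0"), (1000, "shard1"), (2000, "shard2")]
LOWER_BOUNDS = [low for low, _ in CHUNKS]


def route(shard_key):
    """Return the shard whose key range contains `shard_key`."""
    i = bisect.bisect_right(LOWER_BOUNDS, shard_key) - 1
    if i < 0:
        raise ValueError("key is below the lowest chunk range")
    return CHUNKS[i][1]


def route_range(low, high):
    """Return the shards that may hold keys in the closed range [low, high]."""
    first = max(bisect.bisect_right(LOWER_BOUNDS, low) - 1, 0)
    last = bisect.bisect_right(LOWER_BOUNDS, high) - 1
    return [shard for _, shard in CHUNKS[first:last + 1]]
```

Because the routing step reads only shared metadata and keeps no state of its
own, any number of routers can run side by side, and a range query needs to
contact only the shards whose ranges overlap it.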

3.3 Key-Value Databases

Key-value databases are the most widely used kind of database in the NoSQL
field and include the largest number of products. They use the associative
array as the basic data model, in which each key is associated with exactly one
value in a collection. Redis, LevelDB, Scalaris and Riak are examples of
key-value databases.

3.3.1 Redis

Redis [9] is an open-source key-value database released under the BSD license.
It is fast and supports many data types (String, Hash, List, Set, Sorted Set,
etc.). It uses RDB or AOF persistence and replication to keep data safe, and
client libraries are available for many programming languages. Redis is a NoSQL
database that runs in memory and supports persistence; it is one of the most
popular NoSQL databases at the moment and is also known as a data structure
server.

3.3.2 Features of Redis

Redis is very similar to Memcached [10], but there are some differences between
them. The main features of Redis are listed below:

1) Redis keeps data in memory and uses the hard disk for durability, so reads
and writes are fast.

2) It offers rich data structures and supports transactions, pipelining,
publish/subscribe, message queues and other functions.

3) It supports multiple programming languages, including Java, PHP, Python,
Ruby, Lua, Node.js, etc.

4) It supports master-slave replication: the master automatically synchronizes
data to the slaves, so reads and writes can be separated.

In addition, Redis also has some shortcomings, including:

1) No automatic fault tolerance and recovery;

2) Capacity is limited by physical memory;

3) Poor scalability;

4) Low availability.

3.3.3 Redis Architecture

The Redis architecture consists of two main parts, the Redis client and the
Redis server, as shown in Figure 5.

Figure 5 Redis Architecture

The Redis client and the Redis server can be installed on the same computer or
on different computers. The Redis server is responsible for storing the data in
memory and handling the various management tasks, and is the central part of
the architecture. The Redis client can be the Redis console client or a Redis
API for any programming language.

As shown, Redis stores all its content in main memory. Main memory is volatile:
once the Redis server or the computer restarts, everything held only in memory
is lost. Therefore, unless persistence is enabled, Redis is not a way to store
data durably.
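Section 3.3.1 names AOF persistence as one way Redis protects in-memory data
against such a loss. The general idea of an append-only log can be sketched as
follows. This is a simplified illustration and not the actual Redis AOF format:
the `AppendOnlyStore` class, the JSON-lines log layout and the `demo_restart`
helper are assumptions made for the example.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    """In-memory key-value store that logs every write to an append-only file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()

    def _replay(self):
        """Rebuild the in-memory state by re-applying the logged operations."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                op, key, value = json.loads(line)
                if op == "set":
                    self.data[key] = value
                elif op == "del":
                    self.data.pop(key, None)

    def _append(self, record):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # ask the OS to write the record to disk

    def set(self, key, value):
        self._append(["set", key, value])  # log first, then apply in memory
        self.data[key] = value

    def delete(self, key):
        self._append(["del", key, None])
        self.data.pop(key, None)

    def get(self, key):
        return self.data.get(key)


def demo_restart():
    """Write some keys, simulate a restart, and read the keys back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "appendonly.log")
        first = AppendOnlyStore(path)
        first.set("visits", 1)
        first.set("visits", 2)
        first.set("temp", "x")
        first.delete("temp")
        restarted = AppendOnlyStore(path)  # a new instance replays the log
        return restarted.get("visits"), restarted.get("temp")
```

All reads are served from memory, so they stay fast; the cost of durability is
one log append per write, and recovery time grows with the length of the log.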

3.4 Graph Databases

A graph database is a type of NoSQL database that applies graph theory to store
information about the relationships between entities. For a long time, graph
databases were confined to academic circles concerned with graph theory and
were not widely used in industry. Only in the mid-to-late 2000s did commercial
graph databases with atomicity, consistency, isolation and durability (ACID)
guarantees become available. With the emergence of social media companies,
graph databases of various kinds have become particularly popular for social
network analytics. Database vendors have since developed several graph
databases, such as Neo4j, OrientDB, AllegroGraph, FlockDB, etc.

3.4.1 Neo4j

Neo4j [11] is a high-performance, non-relational graph database written in
Java. Unlike other databases, Neo4j stores data as a network (graph) rather
than in tables or collections. The Neo4j storage engine uses fixed-size records
to store graph data, and it can represent almost any type of data gracefully
and in a highly accessible way.

3.4.2 Features of Neo4j

Neo4j is a fully transactional database. It can also be seen as a
high-performance graph engine with all the features of a mature database, such
as:

1) It has a simple query language (Cypher Query Language, CQL);

2) It follows the property graph data model;

3) It supports indexing by using Apache Lucene;

4) It has very efficient query performance;

5) Its schema-flexible data storage allows great flexibility in database
design.

At the same time, Neo4j also has some shortcomings, for example: 1) no sharded
storage mechanism; 2) difficulty handling very large nodes; 3) slow node
insertion.
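The property graph data model listed among the features above can be sketched
with plain Python structures: nodes and relationships both carry properties,
and queries are answered by traversal. The `PropertyGraph` class, its method
names and the sample relationship type are assumptions made for this example
and are not part of Neo4j's API.

```python
from collections import deque


class PropertyGraph:
    """Minimal property graph: labelled nodes and typed, directed relationships."""

    def __init__(self):
        self.nodes = {}  # node id -> {"labels": set, "props": dict}
        self.out = {}    # node id -> list of (rel_type, target id, props)

    def add_node(self, node_id, labels=(), **props):
        self.nodes[node_id] = {"labels": set(labels), "props": props}
        self.out.setdefault(node_id, [])

    def add_rel(self, src, rel_type, dst, **props):
        self.out[src].append((rel_type, dst, props))

    def neighbours(self, node_id, rel_type):
        return [dst for kind, dst, _ in self.out[node_id] if kind == rel_type]

    def within_hops(self, start, rel_type, max_hops):
        """Breadth-first traversal: ids reachable in 1..max_hops steps."""
        seen = {start}
        found = []
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue
            for nxt in self.neighbours(node, rel_type):
                if nxt not in seen:
                    seen.add(nxt)
                    found.append(nxt)
                    queue.append((nxt, depth + 1))
        return found


graph = PropertyGraph()
for person in ["alice", "bob", "carol", "dave"]:
    graph.add_node(person, labels=["Person"], name=person.title())
graph.add_rel("alice", "KNOWS", "bob", since=2015)
graph.add_rel("bob", "KNOWS", "carol", since=2018)
graph.add_rel("carol", "KNOWS", "dave", since=2020)

friends_and_fof = graph.within_hops("alice", "KNOWS", 2)
```

A friends-of-friends query here follows stored relationships directly from node
to node, whereas a relational design would typically express the same question
as repeated self-joins on a relationship table.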

3.4.3 Neo4j Architecture

Neo4j is designed to maximize traversal speed for any graph algorithm. Its
architecture is shown in Figure 6:

Figure 6 Neo4j Architecture

Rather than borrowing individual layers from non-graph technologies, the whole
architecture is optimized for storing graph data, from the Cypher query layer
at the top down to the files on disk at the bottom. Neo4j stores graph data in
several different store files, each of which contains the data for one part of
the graph (nodes, relationships or properties). Dividing the storage
responsibilities in this way facilitates higher-performance graph traversal.
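One reason fixed-size storage helps traversal is that the position of a record
can be computed directly from its id, without an index lookup. The sketch below
illustrates that idea only; the record layout (an in-use flag plus two 64-bit
ids) and the `NodeStore` class are hypothetical and do not reproduce Neo4j's
actual file format.

```python
import struct

# Hypothetical node record: in-use flag, id of the first relationship,
# id of the first property. "<" disables padding, giving 1 + 8 + 8 = 17 bytes.
NODE_RECORD = struct.Struct("<?qq")


class NodeStore:
    """Stores fixed-size node records back to back in one byte buffer."""

    def __init__(self):
        self.buffer = bytearray()

    def add(self, first_rel=-1, first_prop=-1):
        """Append a record and return its id (its position in the store)."""
        node_id = len(self.buffer) // NODE_RECORD.size
        self.buffer += NODE_RECORD.pack(True, first_rel, first_prop)
        return node_id

    def read(self, node_id):
        """Locate a record by arithmetic: offset = id * record size."""
        offset = node_id * NODE_RECORD.size
        return NODE_RECORD.unpack_from(self.buffer, offset)
```

Looking up node `n` costs one multiplication and one read regardless of how
many nodes the store holds, which is the property a traversal-heavy workload
benefits from.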

 

4. Conclusions

NoSQL has become increasingly popular in recent years, especially among large
Internet companies, and is becoming an important force in the database realm.
The emergence of NoSQL databases makes up for the deficiencies of relational
databases in some respects and can greatly reduce development and maintenance
costs. SQL and NoSQL databases each have their own characteristics and
application scenarios: relational databases can focus on relationships, while
NoSQL databases focus on storage. The close combination of the two will bring
new ideas to the development of Web 3.0 databases.

This paper has surveyed NoSQL databases. First, Section 1 introduced the
background of NoSQL databases and how they differ from traditional relational
databases. Section 2 then briefly introduced the features and classification of
NoSQL databases and the CAP theorem. Finally, organized by type of data model,
Section 3 introduced in detail some typical NoSQL databases that have appeared
in recent years, to help users find an appropriate alternative to the
traditional relational database based on their type of data and mode of
operation.
