Is Tarantool just a cache?

Like the title says - is it just a cache? What about persistence? What about storing on disk?
People often get the wrong idea that Tarantool is just a different version of Memcached.

The simple answer: Tarantool is not just a cache. Historically it was created to cache hot data for a social network, yes. But 10 years have passed since then and a lot has changed in Tarantool.
Tarantool is an in-memory data platform. Sounds fancy, but in simpler terms it is an in-memory DB + a built-in Lua application server.
Tarantool is a multi-paradigm DB: it supports key-value, document and relational models.
It is fully persistent. There are regular snapshots to disk and there is a write-ahead log (WAL).
You can have as many secondary indexes as you want.
You can write stored procedures in Lua.
Tarantool supports sync and async replication and sharding out of the box.

Related

When to Redis? When to Tarantool?

I do not want to get a detailed comparison. Neither do I want to define 'what is the best or fastest' in-memory DB.
They are both similar, so I want to get an overview of critical differences.
So what are they?
Let me offer an answer to the question above; I did a little research. From my point of view, these are the most critical things to know about them.
Overall
both are extremely fast in-memory technologies
open-source and Enterprise versions
store all the data in-memory
handle 10,000 requests per second (RPS) and more
Persistence: they both snapshot data to disk
Support async replication
Redis is a key-value storage
Tarantool supports key-value, but also supports documents and relational model
Why Redis is preferable
easier to get started
more information on the Internet (look at the number of questions here, for example)
a simpler technology overall
more people are familiar with it
Why Tarantool is preferable
supports secondary indexes
supports iteration over indexes
has a UI for cluster administration
has an app-server onboard by default
Conclusion
Redis is a great caching solution. It is not recommended as main storage
Tarantool is a multi-paradigm DB. It can be used as main storage
Redis has a lower entry barrier
Tarantool has a higher ceiling as a solution (relational model, distributed NoSQL storage, queues)

Using etcd as primary store/database?

Can etcd be used as a reliable database replacement? Since it is distributed and stores key/value pairs persistently, it would be a great alternative NoSQL database. In addition, it has a great API. Can someone explain why this is not a thing?
etcd
etcd is a highly available key-value store which Kubernetes uses for persistent storage of all of its objects like deployment, pod, service information.
In a Kubernetes cluster, access to etcd is tightly controlled: it can be accessed only through the API on the master node. Nodes in the cluster other than the master do not have access to the etcd store.
nosql database
There are currently more than 255 NoSQL databases, which can be broadly classified into key-value, column, document and graph based. Considering etcd as a key-value store, let's look at the available NoSQL key-value data stores.
Redis, memcached and memcacheDB are popular key-value stores. These are general-purpose distributed memory caching systems often used to speed up dynamic database-driven websites by caching data and objects in memory.
Why etcd is not an alternative
etcd data cannot be kept purely in memory (RAM); it can only be persisted to disk storage, whereas Redis can hold data in RAM and can also persist it to disk.
etcd does not have a variety of data types. It is made to store only Kubernetes objects, whereas Redis and other key-value stores offer data-type flexibility.
etcd guarantees only high availability, but does not give you fast querying and indexing. All the NoSQL key-value stores are built with the goal of fast querying and searching.
Even though it may seem obvious that etcd cannot be used as an alternative NoSQL database, I think the explanation above shows why it is not a suitable alternative.
From the ETCD.IO site:
etcd is a strongly consistent, distributed key-value store that
provides a reliable way to store data that needs to be accessed by a
distributed system or cluster of machines. It gracefully handles
leader elections during network partitions and can tolerate machine
failure, even in the leader node.
It has a simple interface using HTTP and JSON. It is NOT just for Kubernetes. Kubernetes is just an example of a critical application that uses it.
You are right, it should be a thing. It is a nice, reliable data store with an easy-to-use API and a nice way of telling you when things change, built on the Raft protocol. This is great for feature toggles and other items that every component needs to know about, and is much better than putting a trigger in an SQL database and getting it to send an event to an external application, or really horrible polling.
So if you are writing something like the Kubernetes use case, it is perfect: a well-proven store for a distributed application.
If you are writing something very different from the Kubernetes use case, then you are comparing it with all the other NoSQL databases. But it is very different from something like MongoDB, so it may suit you better if MongoDB or similar does not work for you.
Other example users
M3, a large-scale metrics platform for Prometheus created by Uber, uses etcd for rule storage and other functions
Consistency
There is a nice comparison of NoSQL database consistency by Jepsen at https://jepsen.io/analyses
The etcd team sums up their results at https://etcd.io/blog/jepsen-343-results/
The only answers I've come to see are those between our ears. I guess we need to show first that it can be done, and what the benefits are.
My colleagues seem to shy away from it because "it's for storing secrets, and common truth". The etcd v3 redesign made etcd capable of much more, but the news simply hasn't rippled down yet.
Let's make some showcases and success stories. Personally, I like etcd because of the reasons you mentioned, and because of its focus on dependable performance.
First, no. Etcd is not the next NoSQL replacement. But there are some scenarios where it can come in handy.
Let's imagine you have (configuration) data that is mostly static but may change at runtime. Maybe your frontend needs to know the backend endpoints based on the customer's country to comply with legal requirements, and you know the worldwide rollout is done in phases.
So you could just use a k8s configMap to store the array of data (country -> endpoint) and let your backend watch this configMap for changes.
On change, the application just reads in the list and provides a repository to allow access to the data from your service layer.
All operations need to be implemented in the repository (search, get, update, ...) but your data will be in memory (probably a linked hash map). So it will be very quick to retrieve (like a local cache).
If the data is changed by the application, just serialize the list and patch the configMap. Any other application watching the configMap will update its internal state.
However, there is no locking, so changes made in quick succession may result in race conditions.
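One common way to guard against this kind of race without locks is optimistic concurrency: every write states which version it was based on, and is rejected if the stored version has moved on in the meantime. The following is only an in-process sketch of that idea. `VersionedStore` and `compare_and_set` are hypothetical names made up for illustration, not part of any etcd or Kubernetes client, and the endpoint URLs are placeholders.

```python
class VersionedStore:
    """Toy single-value store with a version counter (hypothetical)."""

    def __init__(self, value):
        self.value = value
        self.version = 1

    def read(self):
        return self.value, self.version

    def compare_and_set(self, new_value, expected_version):
        # Reject the write if someone else updated the value in the meantime.
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version += 1
        return True


store = VersionedStore({"de": "https://de.example.invalid"})

# Two writers read the same version...
value_a, version_a = store.read()
value_b, version_b = store.read()

# ...writer A wins; writer B is told to re-read and retry.
ok_a = store.compare_and_set({**value_a, "fr": "https://fr.example.invalid"}, version_a)
ok_b = store.compare_and_set({**value_b, "it": "https://it.example.invalid"}, version_b)
```

Here writer B's update is refused instead of silently overwriting writer A's change; B would re-read the list, apply its change again and retry.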
etcd allows for 1 MB to be stored. That's enough for almost static data.
Another application might be feature toggles. They do not change that often, but when they do, every application needs to know quickly, and polling sucks.
See if this checklist of limitations of etcd compared to a more full-featured database will work for you:
Your database size is going to be within 2 GB (extensible to max 8 GB)
No sharding, and hence none of the data scalability that NoSQL db clusters (Mongo, Redis,...) provide
Meant for simple value stores with payloads limited to 1.5 MB. Can be increased but impacts other queries. Most dbs can store large BLOBs. Redis can store a value of 512 MB.
No query language for more complex searches beyond key prefix. Other databases provide more complex data types like document, graph storage with querying and indexing. Even key-value db Redis supports more complex types through modules along with querying and search capabilities
No ACID transactions
Having a hammer, everything may look like a potential nail. You need to make sure it is indeed one.
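To make the "nothing beyond key prefix" item in the checklist concrete: a prefix lookup over sorted keys is just a range scan. The sketch below is not etcd's API. `PrefixStore` and its methods are hypothetical names, the keys are invented, and it only mimics the idea with Python's `bisect` module.

```python
import bisect


class PrefixStore:
    """Toy sorted key-value store; a stand-in to show prefix range scans."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get_prefix(self, prefix):
        # All keys sharing a prefix form one contiguous run in sorted order,
        # starting at the first key >= prefix.
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._values[key]))
        return result


store = PrefixStore()
store.put("/toggles/checkout/new-ui", "on")
store.put("/toggles/checkout/fast-pay", "off")
store.put("/toggles/search/fuzzy", "on")
store.put("/endpoints/de", "https://de.example.invalid")

checkout = store.get_prefix("/toggles/checkout/")
```

Anything that cannot be phrased as "all keys under this prefix" (filtering by value, joins, aggregations) has to be done in application code, which is the limitation the checklist points at.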

Fast JSON/flat data server for mostly reads

I ask this question apprehensively because it is not a pure programming question, and because I am seeking a (well informed) suggestion.
I have an analytic front end, written in JavaScript, with lots of aggregations and charting happening in the browser (dimple.js, even stats.js, ...)
I want to feed this application with JSON or delimited data from some high-performance data structure server. No writes except for loading. Data will be maybe 1-5 GB in size and there could be dozens, if not hundreds, of concurrent readers, but only in peak hours. This data is collected from and fed by Apache Hive.
Now my question is about the selection of a database/datastore server choices for this.
(I have pretty good command of SQL/NoSQL choices, so I am really seeking advice for the very specific requirements)
Requirements and specifications for this datastore are:
Mostly if not all queries will be reads, initiated by the web, JS-based front end.
Data can be served as JSON or flat tabular csv, psv, tsv.
Total data size on this store will be 1-5 GB, with possible future growth, but nothing imminent (6-12 months)
Data on this datastore will be refreshed/loaded daily. Probably never in real time.
Data will/can be accessed via some RESTful web services, Socket IO, etc.
The faster the read access, the better. Speed matters.
There has to be a security/authentication method for sensitive data protection.
It needs to be reasonably stable, not bleeding-edge software that constantly requires patching.
Liberal, open source license.
So far, my initial candidates for examination were Postgres (optimized for large cache) and Mongo. Just because I know them pretty well.
I am also familiar with Redis, Couch.
I did not run benchmarks myself, but I have seen benchmarks where Postgres was faster than Mongo (while offering JSON format). Mongo is web-friendlier.
I am considering in-memory stores with persistence such as Redis, Aerospike, Memcached. Redis 3.0 is my favorite so far.
So, I ask you here if you have any recommendations for a production-quality datastore that would fit my needs well.
Any civil and informed suggestions are welcome.
What exactly does your data look like? Since you said CSV like exports, I'm assuming this is tabular, structured data that would usually be found in a relational database?
Some options:
1. Don't use a database
Given the small dataset, just serve it out of memory. You can probably spend a few hours to write a quick app with any decent web framework that just loads up the data into memory (for example, from a flat file) and then searches and returns this data in whatever format and way you need.
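A rough sketch of what option 1 could look like, using only the Python standard library. The CSV columns and the `handle_request` function are invented for illustration; a real app would read the daily export from a file and hook the function into whatever web framework is used.

```python
import csv
import io
import json

# Stand-in for the daily export file; the columns are made up.
SAMPLE_CSV = "region,product,revenue\nEU,widget,120\nEU,gadget,80\nUS,widget,200\n"


def load_rows(fileobj):
    """Read the whole export into memory as a list of dicts."""
    return list(csv.DictReader(fileobj))


def index_by(rows, column):
    """Group rows by one column so lookups avoid a full scan."""
    index = {}
    for row in rows:
        index.setdefault(row[column], []).append(row)
    return index


rows = load_rows(io.StringIO(SAMPLE_CSV))
by_region = index_by(rows, "region")


def handle_request(region):
    """What a web handler might return for a request filtered by region."""
    return json.dumps(by_region.get(region, []))


eu_json = handle_request("EU")
```

Note that `csv.DictReader` yields every value as a string, so numeric columns would need converting at load time if the frontend expects numbers.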
2. Use an embedded database
You can also try an embedded database like SQLite which gives you in-memory performance but with a reliable SQL interface. Since it's just a single-file database, you can have another process generate a new DB file, then swap it out when you update the data for the app.
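The "generate a new DB file, then swap it out" step might be sketched like this with Python's built-in `sqlite3` module. The table layout, file names and helper functions are assumptions made for the example; the one load-bearing call is `os.replace`, which renames the new file over the live one (atomically on POSIX systems when both paths are on the same filesystem).

```python
import os
import sqlite3
import tempfile


def build_db(path, rows):
    """Write a fresh database file at `path` (the daily load job)."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE metrics (region TEXT, revenue INTEGER)")
        conn.executemany("INSERT INTO metrics VALUES (?, ?)", rows)
        conn.commit()
    finally:
        conn.close()


def publish(new_path, live_path):
    """Swap the freshly built file into place."""
    os.replace(new_path, live_path)


def total_revenue(live_path):
    """Open a fresh connection per read so it sees the current file."""
    conn = sqlite3.connect(live_path)
    try:
        return conn.execute("SELECT SUM(revenue) FROM metrics").fetchone()[0]
    finally:
        conn.close()


workdir = tempfile.mkdtemp()
live = os.path.join(workdir, "live.db")

build_db(os.path.join(workdir, "day1.db"), [("EU", 100), ("US", 200)])
publish(os.path.join(workdir, "day1.db"), live)
day1_total = total_revenue(live)

build_db(os.path.join(workdir, "day2.db"), [("EU", 150), ("US", 250)])
publish(os.path.join(workdir, "day2.db"), live)
day2_total = total_revenue(live)
```

Readers that keep a long-lived connection open would continue to see the old file after the swap, so the app should reopen its connection (or open one per request, as here) once new data is published.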
3. Use a full database system
Use a regular relational database. MySQL, PostgreSQL, SQL Server (Express Edition) are all free, can handle that dataset easily and will just cache it all in RAM. If it's only read queries, I don't see any issues with a few hundred concurrent users. You can also use MemSQL community edition if you need more performance. They all support security, are very reliable, and you can't beat SQL for data access.
Use a key/value system if your data isn't relational or tabular and is more of a fit as simple values or documents. However remember KV stores aren't great at scans or aggregations and don't have joins. Memcached is just a distributed cache, don't use it for real data. Redis and Aerospike are both great key/value systems with Redis giving you lots of nice data structures to use. Mongo is good for data flexibility. Elasticsearch is a good option for advanced search-like queries.
If you're going to these database systems though, you will still need a thin app layer somewhere to interface with the database and then return the data in the proper format for your frontend.
If you want to skip that part, then just use CouchDB or Riak instead. Both are document oriented and have a native HTTP interface with JSON responses so you can consume it directly from your frontend, although this might cause security issues since anyone can see the javascript calls.

Why would someone need an in-memory database?

I read that a few databases can be used in-memory but can't think of a reason why someone would want to use this feature. I always use a database to persist data and memory caches for fast access.
Cache is also a kind of database, like a file system is. 'Memory cache' is just a specific application of an in-memory database and some in-memory databases are specialized as memory caches.
Other uses of in-memory databases have already been included in other answers, but let me enumerate the uses too:
Memory cache. Usually a database system specialized for that use (and probably known as 'a memory cache' rather than 'a database') will be used.
Testing database-related code. In this case often an 'in-memory' mode of some generic database system will be used, but also a dedicated 'in-memory' database may be used to replace other 'on-disk' database for faster testing.
Sophisticated data manipulation. In-memory SQL databases are often used this way. SQL is a great tool for data manipulation and sometimes there is no need to write the data on disk while computing the final result.
Storing of transient runtime state. There are applications that need to store their state in some kind of database but do not need to persist it across application restarts. Think of some kind of process manager – it needs to keep track of the sub-processes running, but that data is only valid as long as the application and the sub-processes run.
A common use case is to run unit/integration tests.
You don't really care about persisting data between each test run and you want tests to run as quickly as possible (to encourage people to do them often). Hosting a database in process gives you very quick access to the data.
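As an illustration of the testing use case, here is a small sketch using SQLite's in-memory mode through Python's `sqlite3` module. The `users` table and the `add_user` / `count_users` functions are invented for the example; the point is that `sqlite3.connect(":memory:")` gives each test its own throwaway database.

```python
import sqlite3


def add_user(conn, name):
    """Code under test: inserts a user and returns its row id."""
    cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    conn.commit()
    return cur.lastrowid


def count_users(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


def fresh_db():
    """Each test gets its own database that lives only in RAM."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    return conn


def test_add_user():
    conn = fresh_db()
    first_id = add_user(conn, "alice")
    assert first_id == 1
    assert count_users(conn) == 1
    conn.close()


def test_databases_are_isolated():
    conn = fresh_db()
    assert count_users(conn) == 0  # nothing leaked from the other test
    conn.close()


test_add_user()
test_databases_are_isolated()
```

Nothing touches the disk and there is no cleanup step: the data disappears when the connection is closed.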
Does your memory cache have SQL support?
How about you consider the in-memory database as a really clever cache?
That does leave questions of how the in-memory database gets populated, how updates are managed and how consistency is preserved across multiple instances.
Searching for something among 100000 elements is slow if you don't use tricks like indexes. Those tricks are already implemented in a database engine (be it persistent or in-memory).
An in-memory database might offer a more efficient search feature than what you might be able to implement yourself quickly over self-written structures.
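To give a feel for the difference an index makes, the sketch below compares a linear scan with a hash index over 100,000 made-up records. The record layout and function names are assumptions for illustration, and the dict stands in for the kind of index a database engine maintains for you.

```python
import timeit

# 100,000 made-up records.
records = [{"id": i, "name": f"user{i}"} for i in range(100_000)]


def find_by_scan(name):
    """No index: look at every element until one matches."""
    for record in records:
        if record["name"] == name:
            return record
    return None


# A hash index on `name`, built once up front.
name_index = {record["name"]: record for record in records}


def find_by_index(name):
    return name_index.get(name)


target = "user99999"  # worst case for the scan: the last element
scan_seconds = timeit.timeit(lambda: find_by_scan(target), number=20)
index_seconds = timeit.timeit(lambda: find_by_index(target), number=20)
```

Printing `scan_seconds` and `index_seconds` on your own machine shows the gap. The exact numbers depend on hardware, but the scan does work proportional to the number of records while the index lookup does not.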
In-memory databases are roughly an order of magnitude or more faster than traditional RDBMSs for general-purpose (read-side) queries. Most are disk-backed, providing the very same consistency as a normal RDBMS; the only catch is that the entire dataset must fit into RAM.
The core idea is that disk-backed storage has huge random-access penalties which do not apply to DRAM. Data can be indexed and organized in a random-access-optimized way that is not feasible with traditional RDBMS data caching schemes.
Applications that require real-time responses benefit from an in-memory database, for example applications that control aircraft or industrial plants, where response time is critical.
An in-memory database is also useful in game programming. You can store data in an in-memory database, which is much faster than persistent on-disk databases.
They are used as an advanced data structure to store, query and modify runtime data.
You may need a database if several different applications are going to access the dataset. A database has a consistent interface for accessing / modifying data, which your hash table (or whatever else you use) won't have.
If a single program is dealing with the data, then it's reasonable to just use a data structure in whatever language you are using though.
An in-memory database is better than database caching.
Database caching works similarly to in-memory databases when it comes to READ operations.
On the other hand, when it comes to WRITE operations, in-memory databases are faster than database caches, where the data is persisted to disk (which leads to IO overhead).
Also, with database caching you can end up with cache misses, but you will never end up with cache misses when using an in-memory database.
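The cache-miss point can be sketched as follows. `SlowDatabase` and `CacheAside` are hypothetical stand-ins, not any particular product: the cache only holds what has already been requested, so the first read of each key falls through to the backing database.

```python
class SlowDatabase:
    """Stand-in for an on-disk database; counts how often it is hit."""

    def __init__(self, data):
        self._data = dict(data)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._data[key]


class CacheAside:
    """Cache in front of a database: a miss falls through to the backend."""

    def __init__(self, backend):
        self._backend = backend
        self._cache = {}
        self.misses = 0

    def get(self, key):
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._backend.get(key)
        return self._cache[key]


db = SlowDatabase({"a": 1, "b": 2})
cached = CacheAside(db)
values = [cached.get("a"), cached.get("a"), cached.get("b")]
```

Three reads cause two trips to the backing database here. With an in-memory database acting as the primary store, the whole dataset is already in RAM, so there is no such fall-through path.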
Given their speed and the declining price of RAM, it’s likely that in-memory databases will become the dominant technology in the future. There are already some that have developed sophisticated features like SQL queries, secondary indexes, and engines for processing datasets larger than RAM.

voldemort vs. couchdb

I am trying to decide whether to use voldemort or couchdb for an upcoming healthcare project. I want a storage system that has high availability, fault tolerance, and can scale for the massive amounts of data being thrown at it.
What are the pros and cons of each?
Thanks
Project Voldemort looks nice, but I haven't looked deeply into it so far.
In its current state CouchDB might not be the right thing for "massive amounts of data". Distributing data between nodes and routing queries accordingly is on the roadmap but not implemented so far. The biggest known production setups of CouchDB use "tables" ("databases" in couch-speak) of about 200G.
HA is not natively supported by CouchDB but can be built easily: all CouchDB nodes replicate the databases between each other in a multi-master setup. We put two Varnish proxies in front of the CouchDB machines, and the Varnish boxes are made redundant with CARP. CouchDB's "built from the Web" design makes such things very easy.
The most pressing issue in our setup is the fact that there are still issues with the replication of large (multi MB) attachments to CouchDB documents.
I suggest you also check the traditional RDBMS route. There are huge issues with available talent outside the RDBMS approach and there are very capable offerings available from Oracle & Co.
Not knowing enough from your question, I would nevertheless say Project Voldemort or distributed hash tables (DHTs) like CouchDB in general are a solution to your problem of HA.
Those DHTs are very nice for high availability but harder to write code for than traditional relational databases (RDBMS) concerning consistency.
They are quite good at storing document-type information, which may fit nicely with your healthcare project, but they make development harder when it comes to working with the data.
The biggest limitation of most stores is that they are not transactionally safe (see Scalaris for a transactionally safe store), and you need to ensure data consistency yourself (most use read-time consistency, merging conflicting data). RDBMSs are much easier to use for consistency of data (ACID).
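"Merging conflicting data" at read time means the application receives several divergent versions of a record and must decide what the result is. A minimal sketch follows, under the assumption that every field is a set and the policy is to union them. `merge_siblings` and the patient record are invented for illustration and are not the API of Voldemort, CouchDB or any other store.

```python
def merge_siblings(siblings):
    """Resolve conflicting versions of a record at read time.

    Assumed policy: set-valued fields are unioned, so no added item is lost.
    Real applications need a policy that fits their data.
    """
    merged = {}
    for version in siblings:
        for field, values in version.items():
            merged.setdefault(field, set()).update(values)
    return merged


# Two replicas accepted different writes for the same patient.
replica_a = {"allergies": {"penicillin"}, "medications": {"aspirin"}}
replica_b = {"allergies": {"penicillin", "latex"}, "medications": set()}

record = merge_siblings([replica_a, replica_b])
```

Even this simple policy has a gap: if replica B had deliberately removed aspirin, the union brings it back. Deciding such cases is the extra work that an ACID database would otherwise take off your hands.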
Joining data is much harder too. In an RDBMS you can easily query data across several tables; in CouchDB you need to write code to aggregate data. For other stores Hadoop may be a good choice for aggregating information.
Read about BASE and the CAP theorem on consistency vs. availability.
See
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
http://queue.acm.org/detail.cfm?id=1394128
Is memcacheDB an option? I've heard that's how Digg handled HA issues.
