How do in-memory databases provide durability?

More specifically, are there any databases that don't require secondary storage (e.g. HDD) to provide durability?
Note: This is a follow-up to my earlier question.

If you want durable transactions, writing to persistent storage is the only real option (you probably do not want to build many clusters with independent power supplies in independent data centers and still hope they never fail simultaneously). On the other hand, it depends on how valuable your data is: if it is dispensable, a pure in-memory DB with sufficient replication may be appropriate. Note that even an HDD may fail after you have stored your data on it, so there is no ideal solution. You may look at http://www.julianbrowne.com/article/viewer/brewers-cap-theorem to choose your replication trade-offs.
Prevayler http://prevayler.org/ is an example of an in-memory system backed by persistent storage (and the code is extremely simple, by the way). Durability is provided via transaction logs that are persisted on an appropriate device (e.g. HDD or SSD). Each transaction that modifies data is written to the log, and the log is used to restore the DB state after a power failure or a database/system restart. Aside from Prevayler, I have seen a similar scheme used to persist message queues.
This is indeed similar to how a "classic" RDBMS works, except that the logs are the only data written to the underlying storage. The logs can also be used for replication, so you might send one copy of the log to a live replica and another to the HDD. Various combinations are possible, of course.
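The log-then-apply scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not Prevayler's actual API; the class and method names are made up for the example:

```python
import json
import os

class LoggedKV:
    """Tiny key-value store: state lives in memory, durability comes from an
    append-only transaction log that is replayed on restart."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        # Replay the log to rebuild the in-memory state after a restart.
        if os.path.exists(log_path):
            with open(log_path) as f:
                for line in f:
                    self._apply(json.loads(line))
        self.log = open(log_path, "a")

    def _apply(self, op):
        if op["type"] == "set":
            self.state[op["key"]] = op["value"]
        elif op["type"] == "delete":
            self.state.pop(op["key"], None)

    def set(self, key, value):
        op = {"type": "set", "key": key, "value": value}
        # Write and fsync the log entry *before* mutating memory: once the
        # record is on disk, the change survives a crash or restart.
        self.log.write(json.dumps(op) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self._apply(op)
```

Constructing a second `LoggedKV` on the same log file simulates a restart: the replayed state matches what was written before the "crash".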

All databases require non-volatile storage to ensure durability. A memory image is not a durable storage medium: very shortly after you lose power, the image becomes invalid. Likewise, as soon as the database process terminates, the operating system releases the memory containing the in-memory image. In either case, you lose your database contents.
Until changes have been written to non-volatile storage, they are not truly durable. This may consist of either writing all the data changes to disk or writing a journal of the changes as they are made.
In space- or size-critical situations, non-volatile memory such as flash can be substituted for an HDD. However, flash is reported to have a limited number of write cycles.
Having reviewed your previous post: multi-server replication would work as long as you can keep that last server running. As soon as it goes down, you lose your queue. However, there are a number of alternatives to Oracle that could be considered.
PDAs often use battery-backed memory to store their databases. These databases are non-durable once the battery runs down. Backups are important.

In-memory means all the data is stored in memory so it can be accessed quickly. When data is read, it can come either from disk or from memory; in an in-memory database, it is always retrieved from memory. However, if the server is turned off suddenly, the data is lost, so in-memory databases are said to lack support for the durability part of ACID. Many databases nevertheless implement techniques to achieve durability. These techniques are listed below.
Snapshotting - Record the state of the database at a given moment in time. In the case of Redis, for example, snapshots of the dataset are written to disk at configurable intervals for durability.
Transaction Logging - Changes to the database are recorded in a journal file, which facilitates automatic recovery.
Use of NVRAM, usually in the form of static RAM backed by battery power. In this case, data can be recovered after a reboot from its last consistent state.
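The snapshotting technique from the list above can be sketched in Python. The function names are illustrative, not any particular database's API; the temp-file-then-rename trick mirrors what real systems do to avoid leaving a half-written snapshot after a crash:

```python
import os
import pickle
import tempfile

def save_snapshot(state, path):
    # Write to a temp file, fsync, then rename over the target. The rename is
    # atomic on POSIX, so a crash mid-write never leaves a corrupt snapshot.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

def load_snapshot(path):
    # On restart, the database reloads the last complete snapshot.
    with open(path, "rb") as f:
        return pickle.load(f)
```

Between snapshots, any writes are still at risk, which is why snapshotting is usually combined with the transaction logging described next.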

A classic in-memory database can't provide classic durability, but depending on what your requirements are you can:
use memcached (or similar) to store data in memory across enough nodes that it is unlikely the data will be lost
store your Oracle database on a SAN-based filesystem; you can give it enough RAM (say 3 GB) that the whole database fits in RAM, so disk seeks never slow your application down. The SAN then takes care of delayed write-back of the cache contents to disk. This is a very expensive option, but it is common in places where high performance and high availability are needed and can be afforded.
if you can't afford a SAN, mount a RAM disk and install your database there, then use DB-level replication (like log shipping) to provide failover.
Any reason why you don't want to use persistent storage?

What are the major differences between snapshot, dump, mirror, backup, archive, and checkpoint?

I am new to the SDDC (software-defined data center), and the explanations of these concepts I have found on the Internet are vague at best.
In particular, the last three concepts seem to differ only trivially, and to make things worse, people sometimes use them interchangeably. What are their major differences? I also read this post, but the explanation still does not seem to be enough to answer my question.
Archiving is the process of moving data that is no longer actively used to a separate storage device for long-term retention.
A dump is a bulk export of data that can help users either back up or duplicate a database.
Mirroring refers to the real-time operation of copying data, as an exact copy, from one location to a local or remote storage medium.
A snapshot is the state of a system captured at a particular point in time.
Backup is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event.
Checkpoint is a test operation that verifies data retrieved from the database by comparing that data with the baseline copy stored in your project.

Approach to properly archiving and backing up data, preventing data loss and corruption

I'm looking for a proper way to archive and back up my data. This data consists of photos, videos, documents and more.
There are two main things I'm afraid might cause data loss or corruption: hard drive failure and bit rot.
I'm looking for a strategy that can ensure my data's safety.
I came up with the following. One hard drive which I will regularly use to store and display data. A second hard drive which will serve as an onsite backup of the first one. And a third hard drive which will serve as an offsite backup. I am however not sure if this is sufficient.
I would prefer to use regular drives, and not network attached storage, however if it's better suited I will adapt.
One of the things I read about that might help with bit rot is ZFS. ZFS does not prevent bit rot but can detect data corruption by using checksums. This would allow me to recover a corrupted file from a different drive and copy it to the corrupted one.
I need at least 2 TB of storage, but I'm considering 4 TB to cover potential future needs.
What would be the best way to safely store my data and prevent data loss and corruption?
For your local system plus local backup, I think a RAID configuration / ZFS makes sense because you're just trying to handle single-disk failures / bit rot, and having a synchronous copy of the data at all times means you won't lose the data written since your last backup was taken. With two disks, ZFS can do a mirror and handles bit rot well, and if you have more disks you might consider RAIDZ configurations, since they use less storage overall to provide single-disk failure recovery. I would recommend ZFS here over general RAID solutions because it has a better user interface.
For your offsite backup, ZFS could make sense too. If you go that route, periodically use zfs send to copy a snapshot on the source system to the destination system. Debatably, you should use mirroring or RAIDZ on the backup system to protect against bit rot there too.
That said, there are a lot of products that will do the offsite backup for you automatically, and if you have an offsite backup, the only advantage of having an on-site backup is faster recovery if you lose your primary. Since we're just talking about personal documents and photos, taking a little while to re-download them doesn't seem super high stakes. If you use Dropbox / Google Drive / etc. instead, this will all be automatic and have a nice UI and support people to yell at if anything goes wrong. Also, storage at those companies will have much higher failure tolerance, because they use huge numbers of disks (allowing things like RAIDZ with many parity disks, replicated across multiple geographic locations) and they have security experts to make sure your data is not stolen by hackers.
The only downsides are cost, and not being as intimately involved in building the system, if that part is fun for you like it is for me :).
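The bit-rot detection idea from the question (checksums that let you spot a silently corrupted copy and restore it from another drive) can be sketched in plain Python, independent of ZFS. All function names here are illustrative:

```python
import hashlib

def sha256_file(path):
    # Hash the file in 1 MiB chunks so large media files don't need to fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(paths):
    # Record a known-good checksum for every file right after a backup.
    return {p: sha256_file(p) for p in paths}

def find_corrupted(manifest):
    # Any file whose current hash differs from the recorded one has changed
    # (bit rot, truncation, or modification) and should be restored from
    # another copy.
    return [p for p, digest in manifest.items() if sha256_file(p) != digest]
```

This is essentially what ZFS does per block, automatically and on every read; a manifest script is a crude stand-in if you stay on a plain filesystem.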

In memory databases with LMDB

I have a project which uses BerkeleyDB as a key-value store for up to hundreds of millions of small records.
The way it's used is all the values are inserted into the database, and then they are iterated over using both sequential and random access, all from a single thread.
With BerkeleyDB, I can create in-memory databases that are "never intended to be preserved on disk". If the database is small enough to fit in the BerkeleyDB cache, it will never be written to disk. If it is bigger than the cache, then a temporary file will be created to hold the overflow. This option can speed things up significantly, as it prevents my application from writing gigabytes of dead data to disk when closing the database.
I have found that BerkeleyDB's write performance is too poor, even on an SSD, so I would like to switch to LMDB. However, based on the documentation, there doesn't seem to be an option for creating a non-persistent database.
What configuration/combination of options should I use to get the best performance out of LMDB if I don't care about persistence or concurrent access at all? i.e. to make it act like an "in-memory database" with temporary backing disk storage?
Just use MDB_NOSYNC and never call mdb_env_sync() yourself. You could additionally use MDB_WRITEMAP. The OS will still eventually flush dirty pages to disk; you can play with /proc/sys/vm/dirty_ratio etc. to control that behavior.
From this post: https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point all new I/O blocks until dirty pages have been written to disk.
If the dirty ratio is too small, then you will see frequent synchronous disk writes.

Are there databases that bases durability on redundancy and not on persistent storage?

Sorry that the title isn't exactly obvious, but I couldn't word it better.
We are currently using a conventional DB (Oracle) as our job queue, and these "jobs" are consumed by a number of nodes (machines). So the DB server gets hit by these nodes, and we have to pay a lot for the software and hardware of this database server.
Now, it occurred to me the other day that,
1) There are already multiple nodes in the system
2) "Jobs" must not be lost to node failures, but there is no reason they have to sit in secondary storage (no reason why they couldn't reside in memory, as long as they are not lost)
Given this, couldn't one keep these jobs in memory, making sure that at least n copies of each job are present in the entire cluster, thereby getting rid of the DB server?
Are such technologies available?
Did you take a look at GigaSpaces? At internet scale, you do not need to persist at all; you just have to know that sufficient copies are around. If you have low-latency connections to places that are not on the same power grid (or that have battery power), pushing your transactions out to the duplicates is enough.
If you're only looking at storing up to a few terabytes of data, and you're looking for redundancy vs. disk recoverability, then take a look at Oracle Coherence. For example:
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
Datacenter continuous availability (CA). Kill a data center, no data loss.
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
It depends on how much you expect these technologies to do for you. There are loads of basic in-memory databases (SQLite, Redis, etc) and you can use normal database replication techniques with multiple slaves in multiple data centers to pretty much ensure durability without persistence.
If you're storing data in memory, you're likely to run out of space eventually and require horizontal partitioning (sharding); you may want to check out something like VoltDB if you want to stick with SQL.
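The "at least n in-memory copies across the cluster" idea from the question can be sketched as a toy model. The class and node names are purely illustrative, not any product's API, and a real system would also re-replicate under-copied jobs after a failure:

```python
import random

class Cluster:
    """Toy model: each job is held in memory on at least `copies` nodes,
    so it survives the failure of any single node (illustrative only)."""

    def __init__(self, node_names, copies=2):
        self.copies = copies
        self.nodes = {name: {} for name in node_names}

    def put(self, job_id, payload):
        # Pick `copies` distinct nodes to hold replicas of this job.
        for name in random.sample(sorted(self.nodes), self.copies):
            self.nodes[name][job_id] = payload

    def fail_node(self, name):
        # Simulate a crash: that node's memory is simply gone. A real system
        # would now re-replicate any jobs that fell below `copies` replicas.
        del self.nodes[name]

    def get(self, job_id):
        # Any surviving replica can serve the job.
        for store in self.nodes.values():
            if job_id in store:
                return store[job_id]
        return None
```

With copies=2 over three nodes, any single node failure still leaves at least one replica of every job, which is the durability-through-redundancy trade the question describes.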

What happens to distributed in-memory cloud databases such as Hazelcast and Scalaris if there is more data to store than RAM in the cluster?

What happens to distributed in-memory cloud databases such as
Hazelcast
Scalaris
if there is more data to store than RAM in the cluster?
Are they going to swap? What if the swap space is full?
I can't see a disaster recovery strategy in either database! Maybe all data is lost if the memory is full?
Is there a way to write data to the hard disk when memory runs out?
Are there other databases out there which offer the same functionality as Hazelcast or Scalaris but with backup features / HDD storage / disaster recovery?
I don't know what the state of affairs was when the accepted answer by Martin K. was published, but Scalaris FAQ now claims that this is supported.
Can I store more data in Scalaris than ram+swapspace is available in the cluster?
Yes. We have several database backends, e.g. src/db_ets.erl (ets) and src/db_tcerl (tokyocabinet). The former uses main memory for storing data, while the latter uses tokyocabinet for storing data on disk. With tokyocabinet, only your local disks should limit the total size of your database. Note, however, that this still does not provide persistence. For instructions on switching the database backend to tokyocabinet, see Tokyocabinet.
Both the Hazelcast and Scalaris teams say that storing more data than the available RAM isn't supported.
The Hazelcast team is going to write a flat-file store in the near future.
