Currently we are evaluating several key-value data stores to replace an older ISAM store that has been in use by our main application (for twenty-something years!) ...
The problem is that our current ISAM doesn't support crash recovery.
So LevelDB seemed OK to us (we are also looking at BerkeleyDB, etc.).
But we ran into the question of hot backups, and, given that LevelDB is a library and not a server, it seems odd to ask for a 'hot backup', as that intuitively implies an external backup process.
Perhaps someone would like to propose options (or known solutions)?
For example:
- Hot backup through an inner thread of the main application?
- Hot backup by merely copying the LevelDB data directory?
Thanks in advance
You can do a snapshot iteration through a LevelDB, which is probably the best way to make a hot copy (don't forget to close the iterator).
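To make this concrete, here is a minimal C++ sketch of that snapshot iteration against the stock LevelDB API; the database path is a placeholder, and writing the copied keys out to an actual backup target is left as a comment:

```cpp
#include <cassert>
#include "leveldb/db.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  leveldb::Status status = leveldb::DB::Open(options, "/path/to/db", &db);
  assert(status.ok());

  // Pin a consistent point-in-time view of the database.
  const leveldb::Snapshot* snapshot = db->GetSnapshot();
  leveldb::ReadOptions read_options;
  read_options.snapshot = snapshot;

  leveldb::Iterator* it = db->NewIterator(read_options);
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // it->key() and it->value() are Slices; write them to your
    // backup target (another LevelDB, a flat file, S3, ...) here.
  }
  assert(it->status().ok());

  delete it;                      // close the iterator first...
  db->ReleaseSnapshot(snapshot);  // ...then release the snapshot
  delete db;
  return 0;
}
```

Writes that happen while the iteration runs simply won't be visible to the snapshot, so the copy stays internally consistent.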
To backup a LevelDB via the filesystem I have previously used a script that creates hard links to all the .sst files (which are immutable once written), and normal copies of the log (and MANIFEST, CURRENT etc) files, into a backup directory on the same partition. This is fast since the log files are small compared to the .sst files.
The DB must be closed (by the application) while the backup runs, but the time taken will obviously be much less than the time taken to copy the entire DB to a different partition or to upload it to S3, etc. That slower copy can then be made from the hard-linked backup directory once the DB has been re-opened by the application.
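The original was a shell script; the following is a rough C++17 equivalent using std::filesystem (the helper name is hypothetical). It assumes the backup directory sits on the same partition as the DB directory (hard links cannot cross filesystems) and that the DB is closed while it runs:

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// Hard-link the immutable .sst files and copy the small mutable files
// (log, MANIFEST, CURRENT, ...) into backup_dir on the same partition.
void backup_leveldb_dir(const fs::path& db_dir, const fs::path& backup_dir) {
  fs::create_directories(backup_dir);
  for (const auto& entry : fs::directory_iterator(db_dir)) {
    if (!entry.is_regular_file()) continue;
    const fs::path dest = backup_dir / entry.path().filename();
    if (entry.path().extension() == ".sst") {
      fs::create_hard_link(entry.path(), dest);  // instant, no data copied
    } else {
      fs::copy_file(entry.path(), dest, fs::copy_options::overwrite_existing);
    }
  }
}
```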
LMDB is an embedded key-value store, but unlike LevelDB it supports multi-process concurrency, so you can use an external backup process. The mdb_copy utility will make an atomic hot backup of a database; your app doesn't need to stop or do anything special while the backup runs. http://symas.com/mdb/
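If you would rather trigger the backup from inside the application (the OP's "inner thread" option) instead of running the external mdb_copy utility, LMDB's C API exposes the same operation as mdb_env_copy(). A minimal sketch, with error handling omitted and assuming the backup directory already exists and is empty:

```cpp
#include <lmdb.h>

int main() {
  MDB_env* env;
  mdb_env_create(&env);
  mdb_env_open(env, "/path/to/db", 0, 0664);  // normal application open

  // Hot backup: writes a consistent copy of the environment into the
  // backup directory while readers and writers keep running.
  int rc = mdb_env_copy(env, "/path/to/backup");

  mdb_env_close(env);
  return rc;
}
```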
I am coming a bit late to this question, but there are forks of LevelDB that offer good live backup capability, such as HyperLevelDB and RocksDB. Both are available as npm modules, namely level-hyper and level-rocksdb. For more discussion, see How to backup RocksDB? and HyperDex Question.
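As a sketch of what a live backup can look like with RocksDB's C++ API: its Checkpoint utility creates an openable point-in-time copy of the running DB, hard-linking the SST files when the target is on the same filesystem. The paths below are placeholders:

```cpp
#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/utilities/checkpoint.h"

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/path/to/db", &db);
  assert(s.ok());

  // Create a consistent, openable copy of the live DB in a new directory.
  rocksdb::Checkpoint* checkpoint = nullptr;
  s = rocksdb::Checkpoint::Create(db, &checkpoint);
  assert(s.ok());
  s = checkpoint->CreateCheckpoint("/path/to/backup");
  assert(s.ok());

  delete checkpoint;
  delete db;
  return 0;
}
```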
Related
I am new to the SDDC (software-defined data center), and the explanations of these concepts that I found on the Internet are vague at best.
In particular, the last three concepts seem to differ only trivially, and to make things worse, people sometimes use them interchangeably. What are their major differences? I also read this post, but the explanation there still doesn't answer my question.
Archiving is the process of moving data that is no longer actively used to a separate storage device for long-term retention.
Dumping is a bulk export of a database's contents that can be used to back up or duplicate the database.
Mirroring refers to the real-time operation of copying data, as an exact copy, from one location to a local or remote storage medium.
A snapshot is the state of a system at a particular point in time; snapshotting is the act of capturing that state.
Backup is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event.
A checkpoint is a point in time at which a system's state is known to be consistent and has been flushed to durable storage, so that recovery can resume from that point rather than from the beginning.
I'm learning how to create a fast PostgreSQL cluster for a web app at my job. I already know it's possible to create a tablespace on a virtual disk mounted with ramfs or tmpfs, so my idea is:
- One or more masters are used only for writes. They're persisted on (physical) disk.
- All slaves are mounted in RAM. If one fails because of, for example, an OS crash, no problem, because they're only used for reads.
Considering that a regular cache (memcached, Redis, that kind of thing) isn't enough for our demand, because we need really fast reads with all the features PostgreSQL provides, how can I make this architecture reliable? Is there any better idea?
My current idea is to create a master cluster managed by Heartbeat, to make error recovery easy, and a script that mounts the RAM disk, downloads the most recent dump, and creates the database on it.
You haven't really said how you are replicating the data, and there are so many replication solutions out there...
In general, my view is that with streaming replication you really want your slaves to be identical to the masters in as many ways as possible. Failing back is not a simple process, and it effectively requires restoring the primary from a backup made from the slave. For this reason it is good to plan on being able to do without your preferred master for a while, or even to fail back and forth with neither node being preferred in that role.
Your best bet is to keep these nodes identical and to scale reads by adding more slaves.
We want to add a load of records to an AS400 database file using COBOL several times a day. This file is also continually being updated and added to by 30 users through an online COBOL screen (albeit different records). My initial thought on this is one of horror, but is file sharing on the AS400 robust enough to handle this kind of multithreading, or does one task lock the file and release it when it has finished?
I'm an RPG programmer. I routinely have several hundred jobs adding, changing and deleting records from the same table all day long, for decades.
IBM i file sharing works very well - so well I never even think about it. There are a few tasks which require exclusive access to a file - backup & restore, for instance - but the sort of I/O that application programs perform works quite well with the typical 'shared update' access.
The COBOL application might lock records it has read until they are updated or released, or if processing under commitment control, until a commit or rollback.
Lustre and the Google File System (GFS) split a file into blocks and store them across various nodes, which gives them scalability and distributes traffic.
ZFS, btrfs, and WAFL support constant-time cloning, which gives them fast clones, writable snapshots, and storage savings.
I have been looking for a file system that supports both of the above features.
Although there are many file systems that support constant-time cloning, I can't find any distributed file system that does. The Lustre team seems to be working on ZFS support (which would also bring cloning), but it hasn't been released yet (it isn't part of the 2.0 beta either, so it probably won't appear any time soon).
Nexenta storage seemed to support these features through its "namespace NFS", but it doesn't: it only distributes data at the file level. That means that if a file exceeds the size of one node's volume, it can't be handled, and if a lot of cloned files grow large it can't cope either (at the least, it has to really copy the original file to another node, not just shadow it). Maybe I could attach SAN disks to the zvolume of a ZFS node, but I'm very worried about traffic concentrating on that ZFS node.
So I'm looking for a file system or a solution that can handle the above two issues.
One working solution is to combine the Lustre filesystem with the Robinhood Policy Engine in backup mode to continuously back up your filesystem's files. This mode makes it possible to back up a Lustre v2.x filesystem to external storage. It tracks modifications in the filesystem using the Lustre 2+ changelog feature (FS events) and copies modified files to the backend storage according to admin-defined migration policies. You can configure your own upcall commands in Robinhood, for example to provide a scalable way to clone your filesystem and schedule sync tasks on several nodes.
With Lustre on ZFS, it should be possible to use the ZFS snapshot feature, but the ZFS stack is not yet ready for production (it is currently being tested on the number-one supercomputer, Sequoia, at LLNL).
I have an application that uses SQL FILESTREAM to store images. I insert a LOT of images (several million images per day).
After a while, the machine stops responding and seems to be out of memory... Looking at the memory usage of the PC, we don't see any process taking a lot of memory (neither SQL Server nor our application). We tried killing our process, and it didn't restore the machine... We then killed the SQL services, and that didn't restore the system either. As a last resort, we even killed all processes (except the system ones) and the memory still remained high (we are looking at the Task Manager's Performance tab). Only a reboot does the job at that point. We have tried on Win7, WinXP, and Win2K3 Server, always with the same results.
Unfortunately, this isn't a one-shot deal, it happens every time.
Has anybody seen that kind of behaviour before? Are we doing something wrong using the SQL FILESTREAMS?
You say you insert a lot of images per day. What else do you do with the images? Do you update them? Do you do many reads?
Is your file system optimized for FILESTREAMs?
How do you read out the images?
If you do a lot of updates, remember that SQL Server will not modify the filestream object but create a new one and mark the old for deletion by the garbage collector. At some time the GC will trigger and start cleaning up the old mess. The problem with FILESTREAM is that it doesn't log a lot to the transaction log and thus the GC can be seriously delayed. If this is the problem it might be solved by forcing GC more often to maintain responsiveness. This can be done using the CHECKPOINT statement.
UPDATE: You shouldn't use FILESTREAM for small files (less than 1 MB). Millions of small files will cause problems for the file system and the Master File Table. Use varbinary instead. See also Designing and Implementing FILESTREAM Storage.
UPDATE 2: If you still insist on using FILESTREAM for storage (you shouldn't for large numbers of small files), you must at least configure the file system accordingly.
Optimize the file system for a large number of small files (use these as tips and make sure you understand what they do before you apply them):
- Change the Master File Table reservation to maximum in the registry (fsutil.exe behavior set mftzone 4)
- Disable 8.3 file names (fsutil.exe behavior set disable8dot3 1)
- Disable last access updates (fsutil.exe behavior set disablelastaccess 1)
- Reboot and create a new partition
- Format the storage volumes using a block size that will fit most of the files (2k or 4k depending on your image files)