MyRocks restoration vs HBase restoration - database

We know that HBase can serve queries while a restore is being performed. Is the same true of MyRocks?


Should I still VACUUM a postgresql database if I am truncating and refilling the tables instead of deleting/inserting/upserting?

The production database at my company is running significantly slower than the test database (local ~5ms, test ~18ms, production ~1-2 sec). We've been trying to look into why and will be doing some EXPLAIN ANALYZE on key queries on a secure shell psql instance in our cloud.
I've been trying to read up on database optimization and have come across postgresql's VACUUM and am wondering if running this might help. We don't update the production database often -- once each release, though the migrations involve dropping or truncating tables as necessary and then importing new data. I'm curious if VACUUM would be potentially helpful here? If it would be, would we be seeing similar slowdowns in a spiped instance of our test database?
VACUUM could help if there are a lot of DELETEs and UPDATEs in the database.
See https://dba.stackexchange.com/questions/36984/how-to-determine-if-a-postgres-database-needs-to-be-vaccumed
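If you want to check whether VACUUM actually has anything to do, pg_stat_user_tables shows dead-row counts and the last (auto)vacuum times; a minimal sketch (the LIMIT is arbitrary):

    -- tables with the most dead rows, and when they were last vacuumed
    SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;

Note that TRUNCATE reclaims space immediately, so tables that are only truncated and refilled tend to show few dead rows; it is heavy DELETE/UPDATE churn that makes VACUUM worthwhile.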

Difference between streaming replication and logical replication

Could anybody tell me more about the difference between physical replication and logical replication in PostgreSQL?
TL;DR: Logical replication sends row-by-row changes, physical replication sends disk block changes. Logical replication is better for some tasks, physical replication for others.
Note that in PostgreSQL 12 (current at time of update) logical replication is stable and reliable, but quite limited. Use physical replication if you are asking this question.
Streaming replication can be logical replication. It's all a bit complicated.
WAL-shipping vs streaming
There are two main ways to send data from master to replica in PostgreSQL:
WAL-shipping or continuous archiving, where individual write-ahead-log files are copied from pg_xlog by the archive_command running on the master to some other location. A restore_command configured in the replica's recovery.conf runs on the replica to fetch the archives so the replica can replay the WAL.
This is what's used for point-in-time recovery (PITR), which is used as a method of continuous backup.
No direct network connection is required to the master server. Replication can have long delays, especially without an archive_timeout set. WAL shipping cannot be used for synchronous replication.
streaming replication, where each change is sent to one or more replica servers directly over a TCP/IP connection as it happens. The replicas must have a direct network connection to the master, configured in their recovery.conf's primary_conninfo option.
Streaming replication has little or no delay so long as the replica is fast enough to keep up. It can be used for synchronous replication. You cannot use streaming replication for PITR, so it's not much use for continuous backup. If you drop a table on the master, oops, it's dropped on the replicas too.
Thus, the two methods have different purposes. However, both of them transport physical WAL archives from primary to replica; they differ only in the timing, and whether the WAL segments get archived somewhere else along the way.
You can and usually should combine the two methods: use streaming replication most of the time, but with archive_command enabled. Then, on the replica, set a restore_command so the replica can fall back to restoring from WAL archives if there are direct connectivity issues between primary and replica.
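As a rough sketch of that combination (hostnames, paths and credentials are placeholders, and from PostgreSQL 12 onward the recovery.conf settings move into postgresql.conf plus a standby.signal file):

    # postgresql.conf on the master
    wal_level = replica
    archive_mode = on
    archive_command = 'cp %p /mnt/wal_archive/%f'    # copy each completed WAL segment somewhere safe

    # recovery.conf on the replica
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com user=replicator password=secret'
    restore_command = 'cp /mnt/wal_archive/%f %p'    # fall back to archived WAL if streaming breaks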
Asynchronous vs synchronous streaming
On top of that, there's synchronous and asynchronous streaming replication:
In asynchronous streaming replication the replica(s) are allowed to fall behind the master in time when the master is faster/busier. If the master crashes you might lose data that wasn't replicated yet.
If the asynchronous replica falls too far behind the master, the master might throw away WAL the replica still needs if wal_keep_size (previously wal_keep_segments) is too low and no replication slot is used, meaning you have to re-create the replica from scratch. Or the master's pg_wal (previously pg_xlog) might fill up and stop the master from working until disk space is freed, if wal_keep_size is set too high or a slot retains too much WAL.
In synchronous replication the master doesn't finish committing until a replica has confirmed it received the transaction. You never lose data if the master crashes and you have to fail over to a replica. The master will never throw away data the replica needs or fill up its pg_wal and run out of disk space because of replica delays. In exchange it can cause the master to slow down or even stop working if replicas have problems, and it always has some performance impact on the master due to network latency.
When there are multiple replicas, only one acts as the synchronous standby at a time by default, though PostgreSQL 9.6 and later can require confirmation from more than one. See synchronous_standby_names.
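For example, on the master something like the following makes commits wait for a named standby (the standby name is a placeholder and must match the application_name the replica uses in its primary_conninfo):

    # postgresql.conf on the master
    synchronous_commit = on
    synchronous_standby_names = 'replica1'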
You can't have synchronous log shipping.
You can actually combine log shipping and asynchronous replication to protect against having to recreate a replica if it falls too far behind, without risking affecting the master. This is an ideal configuration for many deployments, combined with monitoring how far the replica is behind the master to ensure it's within acceptable disaster recovery limits.
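A simple way to watch that lag from the master is a query against pg_stat_replication; the sketch below uses the PostgreSQL 10+ function and column names (older releases used pg_xlog_location_diff and *_location columns):

    -- bytes of WAL each streaming replica still has to replay
    SELECT application_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
    FROM pg_stat_replication;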
Logical vs physical
On top of that we have logical vs physical streaming replication, as introduced in PostgreSQL 9.4:
In physical streaming replication changes are sent at nearly disk block level, like "at offset 14 of disk page 18 of relation 12311, wrote tuple with hex value 0x2342beef1222....".
Physical replication sends everything: the contents of every database in the PostgreSQL install, all tables in every database. It sends index entries, it sends the whole new table data when you VACUUM FULL, it sends data for transactions that rolled back, etc. So it generates a lot of "noise" and sends a lot of excess data. It also requires the replica to be completely identical, so you cannot do anything that'd require a write transaction, like creating temp or unlogged tables. Querying the replica can delay replication, so long-running queries on the replica may need to be cancelled.
In exchange, it's simple and efficient to apply the changes on the replica, and the replica is reliably exactly the same as the master. DDL is replicated transparently, just like everything else, so it requires no special handling. It can also stream big transactions as they happen, so there is little delay between commit on the master and commit on the replica even for big changes.
Physical replication is mature, well tested, and widely adopted.
Logical streaming replication, new in 9.4, sends changes at a higher level, and much more selectively.
It replicates only one database at a time. It sends only row changes and only for committed transactions, and it doesn't have to send vacuum data, index changes, etc. It can selectively send data only for some tables within a database. This makes logical replication much more bandwidth-efficient.
Operating at a higher level also means that you can do transactions on the replica databases. You can create temporary and unlogged tables. Even normal tables, if you want. You can use foreign data wrappers, views, create functions, whatever you like. There's no need to cancel queries if they run too long either.
Logical replication can also be used to build multi-master replication in PostgreSQL, which is not possible using physical replication.
In exchange, though, it can't (currently) stream big transactions as they happen. It has to wait until they commit. So there can be a long delay between a big transaction committing on the master and being applied to the replica.
It replays transactions strictly in commit order, so small fast transactions can get stuck behind a big transaction and be delayed quite a while.
DDL isn't handled automatically. You have to keep the table definitions in sync between master and replica yourself, or the application using logical replication has to have its own facilities to do this. It can be complicated to get this right.
The apply process itself is more complicated than "write some bytes where I'm told to" as well. It also takes more resources on the replica than physical replication does.
Current logical replication implementations are not mature or widely adopted, or particularly easy to use.
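For reference, the built-in logical replication that eventually shipped in PostgreSQL 10 (building on the 9.4 logical decoding described above) is configured roughly like this; database, table and connection details are invented for the example:

    -- on the publisher
    CREATE PUBLICATION orders_pub FOR TABLE orders, order_lines;

    -- on the subscriber (the tables must already exist with matching definitions)
    CREATE SUBSCRIPTION orders_sub
        CONNECTION 'host=master.example.com dbname=shop user=replicator password=secret'
        PUBLICATION orders_pub;

Note how this mirrors the points above: only the listed tables are sent, and keeping the table definitions in sync is your job.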
Too many options, tell me what to do
Phew. Complicated, huh? And I haven't even got into the details of delayed replication, slots, max_wal_size, timelines, how promotion works, Postgres-XL, BDR and multimaster, etc.
So what should you do?
There's no single right answer. Otherwise PostgreSQL would only support that one way. But there are a few common use cases:
For backup and disaster recovery use pgbarman to make base backups and retain WAL for you, providing easy to manage continuous backup. You should still take periodic pg_dump backups as extra insurance.
For high availability with zero data loss risk use streaming synchronous replication.
For high availability with low data loss risk and better performance you should use asynchronous streaming replication. Either have WAL archiving enabled for fallback or use a replication slot. Monitor how far the replica is behind the master using external tools like Icinga.
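As a concrete starting point, a streaming replica is usually seeded with pg_basebackup; something along these lines, with host, user and data directory as placeholders (-R writes the standby/recovery configuration for you):

    pg_basebackup -h master.example.com -U replicator \
        -D /var/lib/postgresql/data -X stream -R --checkpoint=fast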
References
continuous archiving and PITR
high availability, load balancing and replication
replication settings
recovery.conf
pgbarman
repmgr
wiki: replication, clustering and connection pooling

Backing up PostgreSQL

I'm new to PostgreSQL and I'm looking to back up the database. I understand that there are three methods: pg_dump, snapshot/copy, and using WAL. Which one do you suggest for a full backup of the database? If possible, provide code snippets.
It depends a lot more on your operational requirements than anything else.
All three will require shelling out to an external program. libpq doesn't provide those facilities directly; you'll need to invoke pg_basebackup or pg_dump via execv or similar.
All three have different advantages.
Atomic snapshot based backups are useful if the filesystem supports them, but become useless if you're using tablespaces since you then need a multivolume atomic snapshot - something most systems don't support. They can also be a pain to set up.
pg_dump is simple and produces compact backups, but requires more server resources to run and doesn't support any kind of point-in-time recovery or incremental backup.
pg_basebackup + WAL archiving and PITR is very useful, and has a fairly low resource cost on the server, but is more complex to set up and manage. Proper backup testing is imperative.
I would strongly recommend allowing the user to control the backup method(s) used. Start with pg_dump since you can just invoke it as a simple command line and manage a single file. Use the -Fc mode and pg_restore to restore it where needed. Then explore things like configuring the server for WAL archiving and PITR once you've got the basics going.
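A minimal sketch of that pg_dump workflow (database names and paths are placeholders):

    # custom-format dump: compressed, and restorable selectively with pg_restore
    pg_dump -Fc -f /backups/mydb.dump mydb

    # restore into a freshly created database
    createdb mydb_restored
    pg_restore -d mydb_restored /backups/mydb.dump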

Sql Server distribution and configuration for best performance

I want to design and implement an enterprise application with Silverlight, backed by a SQL Server database. Many users will run SQL queries against this database.
How can I configure the SQL Server database for best performance?
How can I distribute the SQL Server database between several servers for best performance?
And what technologies can I use in SQL Server for best performance?
In addition to replication you can use mirroring or log shipping for this. Note that I am talking only about scaling out reads, not writes. So reports etc. can be run from the copies of the database, but writes must go to the main copy (unless you are using merge replication, which is frightening to me). There are some caveats of course.
With database mirroring, you can use the secondary as a read-only reporting source by taking a snapshot. There are limits here to how many databases you can mirror and there is of course maintenance to manage the snapshots. It is not quite true distribution of resources here, but it can be helpful to offload some of the load. In the next version of SQL Server (Denali), you will be able to set secondaries as read-only, so you can avoid the maintenance of snapshots.
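Creating such a reporting snapshot on the mirror is a single statement; a rough sketch, with database, logical file and path names invented for the example:

    -- run against the mirror server; the snapshot is a read-only, point-in-time view
    CREATE DATABASE Sales_Reporting_Snapshot
    ON (NAME = Sales_Data, FILENAME = 'D:\Snapshots\Sales_Data.ss')
    AS SNAPSHOT OF Sales;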
With log shipping, you can essentially keep a stale version of the database around for reporting, and replace it periodically by restoring logs to it. You have a lot more flexibility here compared to replication or mirroring, as you can actually define a delay (like every 6 hours or once a day, you refresh the copy) - which can also serve as a "recover from a shoot-yourself-in-the-foot" scenario. The downside is that to restore a new copy of the database you need to kick all the current users out, as the database needs to be in single user mode in order to recover.
Those are just a couple of ideas for helping scale out reads, but deep down I agree with #gbn - are you solving a problem you don't have yet? It's one thing to design for scalability, but it's very easy to step over that line and completely over-engineer.
Well, SQL Server doesn't really have a load balancing mechanism in and of itself. What it does support, however, is an active/passive node configuration and also replication.
We are using the replication strategy in one application I support. You can read more about it here:
http://msdn.microsoft.com/en-us/library/ms151198.aspx
In our configuration, we basically have a transactional database and a reporting database. We replicate the data from our transactional DB to the reporting DB. Any reporting is done against this reporting DB, so that we don't slow down work being done on the transactional DB due to some long running report.
Note that the replication isn't truly real time. In other words, there's some time involved in replicating the data from the transactional to the reporting DB, albeit a very small amount. But replication is certainly one strategy you could consider if you are trying to balance workload.
Other things you might consider are partitioning large tables for better performance.
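As an illustration of what table partitioning looks like (table, function and boundary values are invented for the example):

    -- split an Orders table by year
    CREATE PARTITION FUNCTION pf_OrderYear (datetime)
        AS RANGE RIGHT FOR VALUES ('2010-01-01', '2011-01-01');

    CREATE PARTITION SCHEME ps_OrderYear
        AS PARTITION pf_OrderYear ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Orders (
        OrderID   int      NOT NULL,
        OrderDate datetime NOT NULL
    ) ON ps_OrderYear (OrderDate);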
As gbn pointed out in his comment though, it's better to determine if you actually need these strategies before implementing them, because they add a lot of complexity and maintenance efforts, which may not even be needed. It's important to properly analyze how much data you think you will have, and how much activity will be occurring against that data to determine if strategies such as the ones I just described are even needed.
Also, you can refer to this link for some other helpful information and some links to whitepapers you may find helpful:
http://social.msdn.microsoft.com/Forums/en/sqldisasterrecovery/thread/05cf41b7-c558-44bf-86c6-12f5c2b2ffe2

Has open source ever created a single file database that auto handles transactions?

Has open source ever created a single file database that has better performance when handling large sets of SQL queries that aren't delivered in formal SQL transaction sets? I work with a .NET server that does some heavy replication of thousands of rows of data from another server, and it does so in a 1-by-1 fashion without formal SQL transactions. Therefore I cannot use SQLite, Firebird, or JavaDB, because none of them automatically batch the transactions, so the performance is dismal. Each insert waits for the success of the previous one, etc. So I am forced to use a heavier database like SQL Server, MySQL, Postgres, or Oracle.
Does anyone know of a flat file database (that has a JDBC connect driver) that would support auto batching transactions and solve my problem?
The main thing I don't like about the heavier databases is the lack of the ability to see inside the database with a one-mouse-click operation, like you can with SQLite.
I tried creating a SQLite database and then set PRAGMA read_uncommitted=TRUE; and it didn't result in any performance improvement.
I think that Firebird can work for this.
Firebird has a good .NET provider and many solutions for replication.
Maybe you can read this article on Firebird transactions.
Try HSQLDB (HyperSQL) - http://hsqldb.org/doc/guide/ch02.html#N104FC
If you want your transactions to be durable (i.e. survive a power failure) then the database will HAVE to write to the disk after each transaction (this is usually a log of some sort).
If your transactions are very small this will result in a huge number of writes and very poor performance, even with a battery-backed RAID controller or SSD, and worse still on consumer-grade hardware.
The only way of avoiding this is to somehow disable the flush at txn commit (which of course breaks durability). I have no idea which ones support this, but it should be easy to find out.
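In SQLite, for example, the two usual workarounds are batching the inserts into explicit transactions or relaxing the sync-on-commit behaviour; a rough sketch (table and values invented, and PRAGMA synchronous = OFF trades durability for speed):

    -- group many single-row inserts into one transaction: one fsync instead of thousands
    BEGIN;
    INSERT INTO replicated_rows VALUES (1, 'a');
    INSERT INTO replicated_rows VALUES (2, 'b');
    -- ... thousands more ...
    COMMIT;

    -- or tell SQLite not to wait for the disk at commit (data can be lost on power failure)
    PRAGMA synchronous = OFF;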
