Which built-in Postgres replication fits my Django-based use-case best?

I've noticed that Postgres now has built-in replication, including synchronous replication, streaming replication and some other variants. It even provides the ability to control synchrony for specific operations at the application level (e.g., use synchronous replication for important things like money transfers, but not for less critical things like user comments).
I'm working on an application using Django 1.5 (i.e., the development version) and will possibly need synchronous replication (there will be commerce-related transactions going on).
Do you think the built-in tools are best for the job in most cases, and do you have any thoughts on one variant of the built-in replication versus another, in terms of ease of use, quality, etc.?
One final thing: Slony and PGPool II seem to be pretty popular (Slony, in particular) for replication. Is there A) a particular technical reason for their popularity over built-in replication, or B) is it just that a lot of people are using versions that don't have built-in replication, or C) have I been under a rock and PG built-in replication is already quite popular?
Update (more details)
I have only 2 physical servers, located in the same rack. My intention is to provide a slave which can automatically be promoted to master should something go catastrophically wrong on one machine (or even something simple like a double power-supply failure). I don't mind if my clients experience downtime during an automatic failover, so long as the downtime is a few minutes or so, not an hour.
I would like zero data loss, and am willing to spend more time in the failover process to get it. Is there a way to make this trade-off without going for synchronous replication (e.g., streaming logs without write-back confirmation, or something similar)?
What strategy or variant of replication would you recommend?

I think you misunderstand the benefits and cost of synchronous commit in replication. In PostgreSQL, replication works by bringing slaves up to date with the master using the standard crash-recovery features of PostgreSQL. If, for example, the power goes out, you can be sure that the write-ahead log segments will be replayed on both master and slave. With asynchronous commit, the commit is written to the WAL, and the application and the slave are notified more or less at the same time, depending on network characteristics, etc.
Where synchronous commit comes in handy is where two things are true:
You have more than one slave (this is critical!) and
You need durability guarantees that asynchronous commits can't offer you.
With synchronous commit, the database waits until it hears back from a configurable number of slaves to tell the application that the commit has happened. This offers durability guarantees in a few cases where asynchronous commits are unable to work.
For example, suppose your master server takes a bullet through a RAID array and immediately crashes (sorry, I couldn't think of a better example with good hardware). Or suppose someone trips on a power cord and not only powers off the server but corrupts the software RAID device. In this case it is possible that a couple of transactions have not been replicated and your WAL is unrecoverable, so those transactions are lost. With synchronous commit, the application would have waited until the durability guarantees were met.
One thing this means is that if you do synchronous commit with only one slave, your availability cannot outlast a crash of either master or slave, so your availability will be worse than it would have been with just one server. It also means that if your slave is geographically distant, you introduce significant latency into your application's commit requests.
Switching from async to sync commit is not a big change, but generally I think you get the most out of sync commit once you have already done as much as you can for assurance and availability at the hardware level. Start with async and move to synchronous commit when you can't optimize your asynchronous setup any further.
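To illustrate the per-operation control the question mentions, here is a minimal sketch of toggling synchronous_commit per transaction from Django. It assumes PostgreSQL 9.1+ with a synchronous standby already configured, and a hypothetical Comment model; it uses Django 1.5's commit_on_success, since atomic() only arrived in 1.6:

    from django.db import connection, transaction

    def post_comment(user, text):
        # Non-critical write: don't wait for the standby to confirm the commit.
        # SET LOCAL only affects the current transaction, so critical writes
        # elsewhere keep the synchronous default.
        with transaction.commit_on_success():
            cursor = connection.cursor()
            cursor.execute("SET LOCAL synchronous_commit TO off")
            Comment.objects.create(user=user, body=text)  # hypothetical model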

Re: "Slony and PGPool II seem to be pretty popular (Slony, in particular) for replication. Is there A) a particular, technical reason for their popularity over built-in replication or B) is it just because a lot of people are using versions that don't have built-in replication, or C) have I been under a rock and PG built-in replication is already quite popular?"
Slony is popular because it has been around for quite a long time, and the built-in PostgreSQL replication is relatively new. Cascading replication built into PostgreSQL is even newer, and is something Slony-I has had from the start.
There are two main advantages to Slony-I. First, you can replicate between differing versions of PostgreSQL, whereas with the built-in replication system the two servers must not only run the same version but also be binary-compatible. The other advantage is that with Slony-I you can replicate just certain tables instead of the whole database cluster. The disadvantages of Slony-I are numerous, and include poor user-friendliness, no synchronous commit, and difficult DDL (schema) changes. I believe that use of the built-in replication in Postgres will quickly exceed the Slony-I user base, if it hasn't already done so.
As far as I remember, PGPool II uses statement-based replication (like what MySQL has built in), which is the least desirable option, in my opinion.
I would use the built-in hot standby/streaming replication in PostgreSQL. You can start with synchronous commit turned on and turn it off if you don't need it or the penalty is too high, or vice versa. Over a LAN, asynchronous mode seems to reach the slave within on the order of a hundred milliseconds (from my own experience).
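If you want to sanity-check that figure in your own environment, here is a rough sketch (hypothetical host and database names) for measuring apply lag on a hot standby with PostgreSQL 9.1+ and psycopg2:

    import psycopg2

    # Connect to the standby itself and ask how far behind its replay is.
    conn = psycopg2.connect(host="standby.example.com", dbname="app")
    cur = conn.cursor()
    # Caveat: on an idle master this reads as large, since no new WAL
    # is arriving to be replayed.
    cur.execute("SELECT now() - pg_last_xact_replay_timestamp()")
    print("approximate replication lag:", cur.fetchone()[0])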

How does multi table schema create data consistency issues?

As per this answer, it is recommended to go for a single table in Cassandra.
Cassandra 3.0
We are planning on the schema below:
The second table has a composite primary key, PK(domain_id, item_id): domain_id is the partition key and item_id is the clustering key.
The GET request handler will access (read) two tables
The POST request handler will access (write) two tables
The PUT request handler will access (write) the details table only
As per the CAP theorem:
What are the consistency issues in having a multi-table schema in Cassandra?
Can we avoid consistency issues in Cassandra with QUORUM, consistency levels, etc.?
Re: "it is recommended to go for a single table in Cassandra"
I would recommend the opposite. If you have to support multiple queries for the same data in Apache Cassandra, you should have one table for each query.
Re: "What are the consistency issues in having a multi-table schema in Cassandra?"
Consistency issues between query tables can happen when writes are applied to one table but not the other(s). In that case, the application should have a way to gracefully handle it. If it becomes problematic, running a nightly job to keep them in sync might be necessary.
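One way to reduce (not eliminate) the chance of the query tables diverging is to write to them in a single logged batch, which Cassandra guarantees will eventually be applied in full or not at all. A minimal sketch with the Python driver, using the hypothetical id_table and domain_details_table from this question:

    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement, BatchType

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    insert_id = session.prepare(
        "INSERT INTO id_table (domain_id, item_id) VALUES (?, ?)")
    insert_details = session.prepare(
        "INSERT INTO domain_details_table (domain_id, item_id, details) "
        "VALUES (?, ?, ?)")

    # A LOGGED batch ensures all statements are eventually applied together,
    # at a latency cost versus separate unlogged writes.
    # (domain_id, item_id, details are assumed bound earlier.)
    batch = BatchStatement(batch_type=BatchType.LOGGED)
    batch.add(insert_id, (domain_id, item_id))
    batch.add(insert_details, (domain_id, item_id, details))
    session.execute(batch)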
You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process. In that case, a given data point may have only a subset of its intended replicas.
This scenario can be countered by running regularly-scheduled repairs. Additionally, consistency can be increased on a per-query basis (QUORUM vs. ONE, etc), and consistency levels of QUORUM and higher will occasionally trigger a read-repair (which syncs all replicas in the current operation).
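Raising consistency per query is straightforward in most drivers; for instance, with the Python driver (same hypothetical table as above):

    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    # QUORUM requires a majority of replicas to respond, and a QUORUM read
    # can trigger read-repair of any stale replica it touches.
    stmt = SimpleStatement(
        "SELECT * FROM domain_details_table WHERE domain_id = %s",
        consistency_level=ConsistencyLevel.QUORUM)
    rows = session.execute(stmt, (domain_id,))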
Re: "Can we avoid consistency issues in Cassandra with QUORUM, consistency levels, etc.?"
So Apache Cassandra was engineered to be highly available (HA), thereby embracing the paradigm of eventual consistency. Some might interpret that to mean Cassandra is inconsistent by design, and they would not be incorrect. I can say, after several years of supporting hundreds of clusters at web/retail scale, that consistency issues (while they do happen) are rare, and are usually caused by failures in components outside the Cassandra cluster.
Ultimately though, it comes down to the business requirements of the application. For some applications like product reviews or recommendations, a little inconsistency shouldn't be a problem. On the other hand, things like location-based pricing may need a higher level of query consistency. And if 100% consistency is indeed a hard requirement, I would question whether or not Cassandra is the proper choice for data storage.
Edit
I did not get this: "Consistency issues between query tables can happen when writes are applied to one table but not the other(s)." When writes are applied to one table but not the other(s), what happens?
So let's say that a new domain is added. Perhaps a scenario arises where the domain_details_table gets updated but the id_table does not. Nothing is wrong here on the database side, except that the application expects to find that domain_id in the id_table and cannot.
In that case, maybe the application can retry using a secondary index on domain_details_table.domain_id. It won't be fast, but the decision is really about which scenario is preferable: no answer, or a slow answer? Again, application requirements come into play here.
For your point: "You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process." How does an RDBMS (like MySQL) deal with this?
So the answer to this used to be simple. RDBMSs only ran on a single server, so there was only one replica to keep in sync. But today most RDBMSs have HA solutions which can be used, and thus have replicas that must be kept in sync. In that case (from what I understand), most of them will asynchronously update the secondary replica(s), while restricting traffic to the primary only.
It's also good to remember that RDBMSs enforce consistency through locking strategies, as well. So even a single-instance RDBMS will lock a data point during an update, blocking any reads until the lock is released.
In a node-down scenario, a single-instance RDBMS will be completely offline, so instead of inconsistent data you'd have data loss. In an HA RDBMS scenario, there would be a short pause (during which you would likely encounter connection/query failures) until it has failed over to the new primary. Once the failed node comes back up, there would probably be additional time needed to sync up the replicas before HA is restored.

Using Sqlite in a server application? Alternatives?

Consider the following scenario:
A database is to be used on a local network, to which a handful of clients connect at one time. There is one table. When one of the clients modifies the data in that table, the other connected clients should be notified. Perhaps there could be one thread in the server application per connection.
One constraint is that the solution should not require the user to install and configure any third-party database software. This is why SQLite came to mind, since the application itself can just interact with the .db file, which can be bundled with it.
Is this something that is achievable with SQLite, or is this idea totally wrong and misguided?
[Simple diagram illustrating this description.]
Summary of the exchange of comments:
The answer is yes, you can use SQLite for this architecture.
You may also want to enable WAL (Write-Ahead Logging) for increased concurrency in your case:
...
There are advantages and disadvantages to using WAL instead of a rollback journal. Advantages include:
WAL is significantly faster in most scenarios.
WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.
Disk I/O operations tends to be more sequential using WAL.
WAL uses many fewer fsync() operations and is thus less vulnerable to problems on systems where the fsync() system call is broken.
But there are also disadvantages:
...
For the notification part of the question, I suggest investigating triggers:
...
Triggers are database operations that are automatically performed when a specified database event occurs.
Each trigger must specify that it will fire for one of the following operations: DELETE, INSERT, UPDATE.
...
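Putting those two pieces together, here is a minimal sketch in Python (hypothetical table names): WAL mode for concurrency, plus a trigger that records changes in a log table which each per-client server thread can poll to notify its client:

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("PRAGMA journal_mode=WAL")  # readers no longer block the writer

    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, value TEXT)")
    conn.execute("""CREATE TABLE IF NOT EXISTS change_log (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        item_id INTEGER)""")
    conn.execute("""CREATE TRIGGER IF NOT EXISTS items_changed
        AFTER UPDATE ON items
        BEGIN
            INSERT INTO change_log (item_id) VALUES (NEW.id);
        END""")
    conn.commit()

    def changes_since(thread_conn, last_seen_seq):
        # Each connection-handling thread (with its own sqlite3 connection)
        # can poll this to learn which rows changed since it last looked.
        return thread_conn.execute(
            "SELECT seq, item_id FROM change_log WHERE seq > ?",
            (last_seen_seq,)).fetchall()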

Distributed transactions - why do we save tranlogs to file system?

All transaction managers (Atomikos, Bitronix, IBM WebSphere TM, etc.) save "transaction logs" into a 'tranlogs' folder on the file system.
When something terrible happens and the server goes down, the tranlogs sometimes become corrupted.
They then require a manual recovery procedure.
I've been told that by simply clearing the broken tranlogs folder, I risk leaving the resources that participated in the transactions in an inconsistent state.
As a "dumb" developer I feel more comfortable with simple concepts. I want to think that distributed transaction management should be alike the regular transaction management:
If something goes wrong at any party (network, app error, timeout), I expect the whole multi-resource transaction not to be committed in any of its parts. All leftovers should be cleaned up sooner or later automatically.
If the transaction manager fails (file system fault, power supply fault), I expect all transactions under this TM to be rolled back (apparently, at the DB timeout level).
File storage for tranlogs should be optional if I don't want automatic TX recovery (whatever that would mean).
Questions
Why can't I think like this? What's so complicated about 2PC?
What are the exact risks when I clear broken tranlogs?
If I am wrong and I really do need all this mess with 2PC file-system state: don't you feel sick about the fact that a TX manager can actually break storage state in such an easy and ugly manner?
When I was first confronted with two-phase commit in real life in 1994 (initially in a larger Oracle7 environment), I had a similar initial reaction. What a bloody shame that it is not generally possible to make it simple. But looking back at my university algorithm books, it became clear that there is no general solution for 2PC.
See for instance how to come to consensus in a distributed environment
Of course, there are many specific cases where a 2PC transaction can be resolved more easily, either completing or rolling back entirely and with less impact. But the general problem remains and cannot be solved.
In this case, a transaction manager has to decide at some point what to do; a transaction cannot remain open forever. Therefore, as an ultimate solution, it will always need to go back to its own transaction logs, since one or more of the other parties may not be able to reliably communicate their status now or in the near future. Some transaction managers might be more advanced and know how to resolve some cases more easily, but the need for an ultimate fallback remains.
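A toy sketch shows where that log becomes unavoidable. It uses psycopg2's DB-API two-phase-commit extensions against two hypothetical databases (PostgreSQL needs max_prepared_transactions > 0 for this); it is illustrative only, not a real transaction manager:

    import psycopg2

    def toy_two_phase_commit(work):
        conns = [psycopg2.connect(dsn)
                 for dsn in ("dbname=orders", "dbname=billing")]
        xids = [c.xid(0, "toy-tx-1", "branch-%d" % i)
                for i, c in enumerate(conns)]
        for c, x in zip(conns, xids):
            c.tpc_begin(x)
        work(conns)
        for c in conns:
            c.tpc_prepare()   # phase 1: every resource votes to commit
        # The coordinator MUST durably record "commit" here -- this is the
        # tranlog. If it crashes after committing some branches but not
        # others, only that record tells recovery whether the remaining
        # prepared branches should be committed or rolled back; deleting
        # it leaves them in doubt.
        for c in conns:
            c.tpc_commit()    # phase 2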
I am sorry for you. Fixing it generally seems to be identical to "Falsity implies anything" in binary logic.
Summarizing
On "Why can't I think like this?" and "What's so complicated about 2PC?": see above. This algorithmic problem can't be solved universally.
On "What are the exact risks when I clear broken tranlogs?": the transaction manager has some database backing it. Deleting tranlogs poses the same problem as in general relational database software: you lose information on the transactions in progress. Some platforms may still be able to recover from files that are partially or largely intact. For background and some database theory, see Wikipedia.
On "Don't you feel sick about the fact that a TX manager can actually break storage state in an easy and ugly manner?": yes, sometimes, when I have to get a lot of work done by the team, I really hate it. But well, it keeps me in a job :-)
Addition: to 2PC or not
From your addition I understand that you are considering whether or not to include 2PC in your projects.
In my opinion, your mileage may vary. Our company's policy for 2PC is: avoid it whenever possible. However, in some environments, especially with legacy systems and complex environments such as those found in banking, you cannot work around it. The customer requires it and may not be willing to let you make a major change to other infrastructural components.
When you must do 2PC: do it well. I like a clean architecture of the software and infrastructure, something so simple that even five years from now it is clear how it works.
For all other cases, we stay away from two-phase commit. We have our own framework (Invantive Producer) running from client to application server to database backend. In this framework we have chosen to sacrifice elements of ACID when working in a distributed environment. The application developer must take care of, for instance, atomicity himself. Often that is possible with little effort, or doesn't even require thinking about. For instance, all software must be safe to restart. Even with transactional atomicity, this requires some thought to do well in a massively multi-user environment (for instance, locking issues).
In general this deliberately simple approach is very easy to understand and maintain. In cases where we have been required to do two-phase commit, we have been able to just replace some plug-ins in the framework and make some changes to client-side code.
So my advice would be:
Try to avoid 2PC.
But encapsulate your transaction logic nicely,
allowing you to add 2PC later without a complete rebuild, changing things only where needed.
I hope this helps you. If you can tell me more about your typical environments (size in #tables, size in GB of persistent data, number of concurrent users, typical transaction-management software and platform), maybe I can make some additions or improvements.
Addition: Email and avoiding message loss in 2PC
Regarding the suggestion of combining the DB with JMS: no, combining a DB with JMS is normally of little use; JMS will itself already have some DB backing it, hence the original question on transaction logs.
Regarding your business case: I understand that per event an email is sent from a template and that the outgoing mail is registered as an event in the database.
This is a hard nut to crack; I've enjoyed doing security audits, and one of the easiest security issues to score was checking the use of email.
Email, besides being neither confidential nor tamper-proof in most situations (like a postcard), has no guarantees of delivery and/or reading without additional measures. For instance, even when email is delivered directly between your mail transfer agent and the recipient, data loss can occur without the transaction monitor being informed. That gets even worse when multiple hops are involved. For instance, each MTA has its own queueing mechanism on which a "bomb can be dropped", leading to data loss. You can also think of spam measures, bad configuration, mail loops, deleting a file by accident, etc. Even when you can register the sending of the email without any loss of transaction information using 2PC, this gives absolutely no clue as to whether the email will arrive at all, or even make it across the first hop.
The company I work for sells a large software package for project-driven businesses. This package has an integrated queueing mechanism, which also handles email events, typically combined with Exchange in most implementations nowadays. A few months ago we had a nice problem: transaction started, mail channel opened, mail delivered to Exchange as MTA, registered that the mail was handled... then the transaction aborted, since the Oracle tablespace was full. On the next run, the mail was delivered again to Exchange, again aborted, and so on. The algorithm has since been enhanced, but from this simple example you can see that you need all endpoints to cooperate in your 2PC, even when some of those endpoints are far away in the organisation receiving and displaying your email.
If you need measures to ensure that an email is delivered or read, you will need to supplement it with additional measures. Pick from application controls, user controls, and process controls in the literature.

Slow XA transactions in JBoss

We are running JBoss 4.2.2 with SQL Server 2005 (sqljdbc driver 1.2).
We have recently installed New Relic and can see a large bottleneck with our transactions.
Generally, for any one web request, the bottleneck sits on one of these:
master..xp_sqljdbc_xa_start
master..xp_sqljdbc_xa_commit
org.jboss.resource.adapter.jdbc.WrapperDataSource.getConnection()
master..xp_sqljdbc_xa_end
Several hundred ms are spent on one of these items (in some cases several seconds). Cumulatively most of the response time is spent on these items.
I'm trying to identify which of the following applies:
Will moving away from XA transactions help?
Is there a larger problem with my database that I don't have visibility of?
Can I upgrade my SQL driver to help with this?
Or is this an indication that there are just a lot of queries, and we should start by looking at our code, and trying to lower the number of queries overall?
XA transactions are necessary if you are performing work against more than one resource in a single transaction and need consistency. However, you are talking in terms of "queries", which might imply that you are mostly doing read-only activities, so XA may be overkill. Furthermore, you don't mention using multiple databases or other transactional resources, so do you really need XA at all?
So, first step: understand the requirements. Do you need transaction scopes that span several database interactions? If you are just doing a single query, then XA is not needed. If you have a mixture of activities needing XA and simple queries not needing it, then use two different connections, one with XA and one without; this clarifies your intention. However, I would expect XA drivers to use single-resource optimisations, so that if XA is not needed you don't pay the overhead. Hence I suspect something more is going on here (disclaimer: I don't use JBoss, so my intuition is suspect).
So look at whether your transaction scopes are appropriate, whether isolation levels are sensible, and so on. Are you getting contention because transactions are unreasonably long; for example, are transactions held open over user think time?
Next, those multi-second waits: that implies contention (or some bizarre network issue). The only reason I can think of for xa_start being slow is that writing the transaction log is taking unreasonably long; are your logs perhaps on some slow network device? Waits for getConnection() might simply imply that your connection pool is too small (or that you're holding connections for too long). If xa_commit and xa_end are taking a long time, I'd want to know what the resource managers are doing; can you get any information from the database server?
My overall position: if you truly need XA, then you will pay some logging and network-message overheads, but these should not be costing you hundreds or thousands of milliseconds. Most business systems need XA for a small subset of their overall resource accesses, typically when updating two otherwise-independent systems, and almost never in read-only scenarios; absolute consistency across distributed systems is pretty much meaningless, so using XA for queries is almost certainly overkill.

Articles about replication schemes/algorithms?

I'm designing a distributed system with a certain flow of data in it. I'd like to guarantee that at least N nodes have almost-current data at any given time.
I do not need complete consistency, only eventual consistency (i.e., for any time instant, the current snapshot of the data should eventually appear on at least N nodes; it is tricky to define the term "current" here, but still). Nodes may fail and come back up at any moment, and there is no single "central" node.
O overflowers! Point me to some good papers describing replication schemes. So far I've found one, Consistency Management in Optimistic Replication Algorithms, plus a broader and more recent article by the same author: Optimistic Replication.
A lot of the trick to this is pinning down your exact requirements, and yours still sound pretty vague. Do you just need to support operations like these?
Update key K to value V.
Look up a somewhat-recent value of key K.
You mentioned you need eventual consistency. So if you do a single update, it will eventually replicate everywhere. If you do two nearly-simultaneous updates, do you care which one wins? If one replica reports that an update was successfully completed, do you care if the value could be lost if that replica were to temporarily crash shortly afterward? Or if that replica were permanently destroyed?
How precise should somewhat-recent be? If there's a netsplit or something, a lookup might return a very stale result or just fail. Do you care which?
Do you ever need to support fancier operations like...
Get the absolute latest value of key K?
Update the value of key K to value V' provided the latest value is currently V?
Do you have rigid reliability, latency, and/or bandwidth requirements? How far apart are your replicas, and how good is the network between them? This affects whether you can have cross-replica communication on every update (and even on every lookup), and whether you can or should fail operations over to a remote replica when the local one seems to be down.
Depending on your answers here, I've worked with a couple different schemes that might meet your requirements. There are several possible variations on them.
The simplest thing is to have the application always talk to the local replica. Replicas timestamp values (using NTP-synced clocks) and talk to each other only for asynchronous replication. The highest timestamp wins during replication. Of course, if applications on two different replicas each do a read/modify/write nearly simultaneously, one of the modifications can easily be lost. (In fact, without a conditional-update scheme, the same is true even for nearly simultaneous changes on the same replica.) If a replica permanently fails, recent-ish updates can be lost. This is more or less what Bigtable's built-in replication does. In the paper you linked, it'd be the "Optimistic - Multimaster" branch, but not caring too much about losing some updates makes it simpler than they suggest.
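A toy sketch of that last-write-wins scheme (illustrative only; a real system would also need tombstones for deletes, anti-entropy scheduling, failure detection, etc.):

    import time

    class Replica:
        def __init__(self):
            self.store = {}  # key -> (timestamp, value)

        def write(self, key, value):
            # Assumes NTP-synced clocks; clock skew directly translates
            # into "wrong" winners under concurrent updates.
            self.store[key] = (time.time(), value)

        def read(self, key):
            entry = self.store.get(key)
            return entry[1] if entry else None

        def merge_from(self, other):
            # Asynchronous replication: the highest timestamp wins.
            for key, (ts, value) in other.store.items():
                if key not in self.store or self.store[key][0] < ts:
                    self.store[key] = (ts, value)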
Some databases use the Paxos algorithm (see, for example, "Data Management for Internet-Scale Single-Sign-On" here) to make fancier things possible. Each replica can know how far behind it might be, so you can say "give me a value that's no more than 1 minute old" or "give me the absolute latest value". An update isn't considered complete until a quorum of replicas has accepted it, so "give me the absolute latest value" will always return that value until another update happens. You can use the conditional-update operation I mentioned to prevent simultaneous writers from trampling each other. This doesn't fit neatly into either the optimistic or the pessimistic category as defined by that author, because updates are replicated synchronously to a quorum, yet replicas which didn't vote in the latest Paxos round may still be able to answer some queries. The scheme can get very complicated, though...
Not RDBMS-agnostic, but SQL Server 2008 (2005 onwards) supports peer-to-peer replication.
