How to handle deleting databases in Couch across multiple clients?

In CouchDB you will often end up with more deleted documents in a database than active ones. After a while this becomes suboptimal, because you are syncing more deleted-document data than anything else.
The official documentation recommends periodically destroying the database to get around this. However, I've noticed that when I do this, any client with a local copy of the database (e.g. a database named "username" that is designed to replicate to a client device via Pouch) sees the blank database and fills it back up, deleted document records and all.
Short of changing the database name every time, is there any way to signal to other Couch instances that they shouldn't repopulate the new fresh and clean database, and instead take it as a new database entirely? Or, in fact, any other solution at all?

Yes, if you have bidirectional replication then the "other side" will replicate all the deleted docs back to the new DB. The only two options I can think of are to have a new database (with a new name, which is what the docs you linked to probably meant), or to use filtered replication so the client doesn't push up deleted docs (or doesn't push up deleted docs older than a certain point).
The latter of these options is significantly more complex than the former.
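
To make the filtered-replication idea concrete, here is a minimal sketch of the decision such a filter would make. Real CouchDB/PouchDB replication filters are written in JavaScript; this Python version only illustrates the predicate. The `deleted_at` field is an assumption: a plain tombstone does not carry a timestamp, so the application would have to write one when it deletes a document.

```python
from datetime import datetime, timezone


def should_push(doc: dict, cutoff: datetime) -> bool:
    """Decide whether a client should replicate this document upstream."""
    # Live documents always replicate.
    if not doc.get("_deleted", False):
        return True
    # Hypothetical field: only present if the app stamps its tombstones.
    deleted_at = doc.get("deleted_at")
    if deleted_at is None:
        return False  # unstamped tombstone: treat as old, do not push
    # Push only deletions that happened at or after the cutoff.
    return datetime.fromisoformat(deleted_at) >= cutoff


# Example: everything deleted before the database was recreated is skipped.
recreated_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
old_tombstone = {"_id": "x", "_deleted": True,
                 "deleted_at": "2023-06-01T00:00:00+00:00"}
print(should_push(old_tombstone, recreated_at))
```

The cutoff would most naturally be the time the server-side database was destroyed and recreated, which the client has to learn somehow; that part is not covered here.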

Related

Why do search engines need to reindex periodically but databases don't?

For example, search engines such as Sphinx and Lucene must merge their indexes periodically, but the index of a database can be updated dynamically. Why must the index of a search engine be merged?
I don't know much about Sphinx but I believe the answer to this question will not be related to it.
First, why don't databases need periodic updates? Because most of the time the database is the primary data store for an application. By this I mean that whenever you create, delete or update data, that data is a database record. You remove data from the database to get rid of it within the application, and you fetch data from the database before updating it, since the old version is kept there. All this means that the database is being updated all the time and the data in it is always up to date.
Why does a search engine's index need periodic reindexing? The index is essentially the data store for the search engine: you process your data, put it into the index, and then retrieve it through your search system. That index is your secondary data source. This does not hold for all applications, but most of the time you have the database as the primary source, kept in sync with your application as explained above, and then the index, where you are not reflecting all changes in real time. As a result, the data in the index is a little out of date relative to the database. The reindexing step is what keeps the two data sources consistent.
As I said, this explanation does not hold for all applications, but it can give you the basic idea.
PS: Your question uses the phrase "index of database", but that is a totally different topic.

SQL Server switching live database

A client has one of my company's applications which points to a specific database and tables within the database on their server. We need to update the data several times a day. We don't want to update the tables that the users are looking at in live sessions. We want to refresh the data on the side and then flip which database/tables the users are accessing.
What is the accepted way of doing this? Do we have two databases and rename the databases? Do we put the data into separate tables, then rename the tables? Are there other approaches that we can take?
Based on the information you have provided, I believe your best bet would be partition switching. I've included a couple links for you to check out because it's much easier to direct you to a source that already explains it well. There are several approaches with partition switching you can take.
Links: Microsoft and Cathrine Wilhelmsen's blog
Hope this helps!
I think I understand what you're saying: if the user is on a screen you don't want the screen updating with new information while they're viewing it, only update when they pull a new screen after the new data has been loaded? Correct me if I'm wrong. And Mike's question is also a good one, how is this data being fed to the users? Possibly there's a way to pause that or something while the new data is being loaded. There are more elegant ways to load data like possibly partitioning the table, using a staging table, replication, have the users view snapshots, etc. But we need to know what you mean by 'live sessions'.
Edit: with the additional information you've given me, partition switching could be the answer. The process takes virtually no time; it just changes the pointers from the old records to the new ones. The only issue is that you have to partition on something partitionable, like a date or timestamp, to differentiate old and new data. It's also an Enterprise-Edition feature and I'm not sure what version you're running.
Possibly a better thing to look at is Read Committed Snapshot Isolation. It will ensure that your users only see the new data after it's committed; it provides a statement-level consistent view of the data (the separate SNAPSHOT isolation level is the one that gives transaction-level consistency) and has minimal concurrency issues, though there is more overhead in TempDB. Here are some resources for more research:
http://www.databasejournal.com/features/mssql/snapshot-isolation-level-in-sql-server-what-why-and-how-part-1.html
https://msdn.microsoft.com/en-us/library/tcbchxcb(v=vs.110).aspx
Hope this helps and good luck!
The question details are a little vague so to clarify:
What is a live session? Is it a session in the application itself (with app code managing its own connections to the database) or is it a low-level connection per user/session situation? Are users just running reports or actively reading from and writing to the database during the session? When is a session over and how do you know?
Some options:
1) Pull all the data into the client for the entire session.
2) Use read committed snapshot isolation or partitions as mentioned in other answers (however, this requires careful setup for your queries and increases the requirements for the database)
3) Use a replica database for all queries, and pause/resume replication when necessary (updating data should be faster than your process but it still might take a while depending on volume and complexity)
4) Use a replica database and automate a backup/restore from the master (this might be the quickest depending on the overall size of your database)
5) Use multiple replica databases with replication or backup/restore and then switch the connection string (this allows you to update the master constantly and then update a replica and switch over at a certain predictable time)
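
As a rough illustration of option 5, the sketch below keeps two copies of the data, rebuilds the one nobody is reading, and then flips which copy new connections use. It uses SQLite files as a stand-in for SQL Server databases, and the names `CopySwitch`, `refresh_standby`, `read_prices` and the `prices` table are all made up for the example.

```python
import sqlite3


class CopySwitch:
    """Tracks which of two database copies readers should connect to."""

    def __init__(self, path_a, path_b):
        self._paths = [path_a, path_b]
        self._active = 0

    @property
    def active_path(self):
        return self._paths[self._active]

    @property
    def standby_path(self):
        return self._paths[1 - self._active]

    def flip(self):
        self._active = 1 - self._active


def refresh_standby(switch, rows):
    """Rebuild the standby copy, then flip so new connections see it."""
    con = sqlite3.connect(switch.standby_path)
    with con:  # commits the inserts on success
        con.execute("DROP TABLE IF EXISTS prices")
        con.execute("CREATE TABLE prices (item TEXT, amount REAL)")
        con.executemany("INSERT INTO prices VALUES (?, ?)", rows)
    con.close()
    switch.flip()


def read_prices(switch):
    """Open a fresh connection to whichever copy is currently active."""
    con = sqlite3.connect(switch.active_path)
    try:
        return con.execute(
            "SELECT item, amount FROM prices ORDER BY item").fetchall()
    finally:
        con.close()
```

Sessions that already hold a connection keep reading the old copy until they reconnect, which is roughly the behaviour the question asks for. How the active pointer is shared between application servers is left out of this sketch.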

Standard practice/API for sharing database data without giving direct database access

We would like to give some of our customers the option to read data from our central database. The data is live and new records are being added every few seconds. Our database is MySQL running on Amazon RDS.
I was wondering what is the common practice for doing so.
One option would be to give them SELECT rights on specific tables, but in that case they would be able to access other customers' data as well.
I have tried searching for database, interface, and API key words and some other key words, but I couldn't find a good answer.
Thanks!
Use REST for exposing specific tables to do CRUD operations. You can control the access on it too.
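
The access control part is what matters here, so below is a minimal sketch of the query layer such a REST endpoint might sit on. It assumes each customer gets an API key that maps to a customer ID, and that every query is filtered by that ID on the server side. The `API_KEYS` mapping, the `readings` table and the function name are hypothetical, and SQLite stands in for MySQL.

```python
import sqlite3

# Hypothetical mapping of API keys to customer IDs; in practice this
# would live in a table or a secrets store, not in source code.
API_KEYS = {"key-alice": 1, "key-bob": 2}


def fetch_readings(con, api_key):
    """Return only the rows that belong to the caller's customer ID."""
    customer_id = API_KEYS.get(api_key)
    if customer_id is None:
        raise PermissionError("unknown API key")
    # The customer filter is applied by the server, never by the caller,
    # so one customer cannot ask for another customer's rows.
    cur = con.execute(
        "SELECT id, value FROM readings WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    )
    return [{"id": row[0], "value": row[1]} for row in cur]
```

An HTTP handler would read the key from a request header, call `fetch_readings`, and serialize the result as JSON.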

Database synchronization between a new greenfield project database and an old project's database

I'm thinking about developing a new greenfield app using DDD/TDD/NHibernate with a new database schema reflecting the domain, where changes in the DB would need to be synchronized both ways with the old project's database. The requirement is that both projects will run in parallel, and once the new project starts adding more business value than the old project, the old project would be shut down.
One approach I have in mind is to achieve the DB synchronization via DB triggers. Once you insert/update/delete in the new database, the trigger for the table would need to correctly update the old database. The same goes for changes in the old database: its triggers would need to update the new database.
Example:
The old project has one table, Quote, with columns QuoteId and QuoteVersion. The correct domain model is one Quote object with many QuoteVersion objects, so the new database would have two tables, Quote and QuoteVersion. If you change the Quote table in the new DB, the trigger would need to update either all records with that QuoteId in the old DB or only the latest version. Likewise, if you update a Quote record in the old DB, you either always update the record in the new DB or update it only if the latest version of the Quote in the old DB was changed.
So, there would need to be some logic in the triggers. Those sql statements might be kind of non-trivial. To ensure maintainability, there would need to be thorough tests for triggers (save data in one db, test data in the second db, for different cases).
The question: do you think this trigger idea for DB synchronization is viable (I'm not sure yet how to ensure one trigger won't fire the other database's trigger)? Has anybody tried that and found out it goes to hell? Do you have a better idea for how to fulfil the requirement of syncing the databases?
This is a non-trivial challenge, and I would not really want to use triggers - you've identified a number of concerns yourself, and I would add to this concerns about performance and availability, and the distinct likelihood of horrible infinite loop bugs - trigger in legacy app inserts record into greenfield app, causes trigger to fire in greenfield app to insert record in legacy app, causes trigger to fire in legacy app...
The cleanest option I've seen is based on a messaging system. Every change in the application fires a message, which is handled by a recipient at the receiving end. The recipient can validate the message, and - ideally - forward it to the "normal" code which handles that particular data item.
For example:
legacy app creates new "quote" record
legacy app sends a message with a representation of the new "quote"
message bus forwards message to greenfield app "newQuoteMessageHandler"
greenfield app "newQuoteMessageHandler" validates data
greenfield "newQuoteMessageHandler" instantiates "quote" domain entity, and populates it with data
greenfield domain entity deals with remaining persistence and associated business logic.
Your message handlers should be relatively easy to test - and you can use them to isolate each app from the crazy in the underlying data layer. It also allows you to deal with evolving data schemas in the greenfield app.
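
A minimal sketch of that flow, under some assumptions: each message carries the name of the app that produced it, and a handler applies incoming changes without re-announcing them, which is one way to avoid the infinite loop described above. The `App`, `QuoteMessage` and `deliver` names are invented for the example, and an in-memory list stands in for a real message bus.

```python
from dataclasses import dataclass, field


@dataclass
class QuoteMessage:
    origin: str    # name of the app that made the change
    quote_id: int
    version: int


@dataclass
class App:
    name: str
    quotes: dict = field(default_factory=dict)  # quote_id -> latest version
    outbox: list = field(default_factory=list)  # messages waiting to be sent

    def save_quote(self, quote_id, version):
        """Normal write path: persist locally and announce the change."""
        self.quotes[quote_id] = version
        self.outbox.append(QuoteMessage(self.name, quote_id, version))

    def handle(self, msg):
        """Apply a change from the other app without re-announcing it."""
        if msg.origin == self.name:
            return  # our own change echoed back: ignore it
        if not isinstance(msg.version, int) or msg.version < 1:
            raise ValueError("invalid quote version")
        self.quotes[msg.quote_id] = msg.version


def deliver(sender, receiver):
    """Stand-in for the message bus: move pending messages across."""
    pending, sender.outbox = sender.outbox, []
    for msg in pending:
        receiver.handle(msg)
    return len(pending)
```

Because `handle` writes to the local store directly instead of going through `save_quote`, a change received from the legacy app never produces a message back to it.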
Retro-fitting this into the legacy app could be tricky - and may well need to involve triggers to capture data updates, but the logic inside the trigger should be pretty straightforward - "send new message".
Bi-directional sync is hard! You can expect to spend a significant amount of time on getting this up and running, and maintaining it as your greenfield project evolves. If you're working on MS software, it's worth looking at http://msdn.microsoft.com/en-us/sync/bb736753.

What are the advantages of using a single database for EACH client?

In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!
Assume there's no scaling penalty for storing all the clients in one database; for most people, with well-configured databases and queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database per client is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get the benefit of separability: it's trivial to pull out the data associated with a given client and move it to a different server, or to restore a backup of that client when they call up to say "We've deleted some key data!", using the built-in database mechanisms.
You get easy and free server mobility - if you outscale one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware, or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "But the database also contains other clients" issue. It fits the mental model of the user, where they exist alone. Advantages like being able to do easy reporting on all clients at once are minimal: how often do you want a report on the whole world, rather than just one client?
Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
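
A minimal sketch of the lookup described above, assuming a master table that maps each customer to a connection string. The table name, column names and example connection strings are hypothetical, and SQLite stands in for whatever the master customer database really is.

```python
import sqlite3


def create_master(con):
    """Create the master table mapping customers to their database."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "customer_id INTEGER PRIMARY KEY, "
        "connection_string TEXT NOT NULL)"
    )


def register(con, customer_id, connection_string):
    """Add a customer, or repoint an existing one after a migration."""
    with con:
        con.execute(
            "INSERT OR REPLACE INTO customers VALUES (?, ?)",
            (customer_id, connection_string),
        )


def connection_string_for(con, customer_id):
    """Look up where a customer's data currently lives."""
    row = con.execute(
        "SELECT connection_string FROM customers WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown customer {customer_id}")
    return row[0]


# Two customers start on a shared database; one is later moved out.
master = sqlite3.connect(":memory:")
create_master(master)
register(master, 1, "server-a/shared")
register(master, 2, "server-a/shared")
register(master, 2, "server-b/customer2")  # after migrating customer 2
```

The application resolves the connection string on every request (or caches it briefly), so moving a customer only requires copying their data and updating one row.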
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.
To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of competing with hundreds of thousands of records that may be in the database but are not theirs. I don't care how well everything is indexed and optimized; there will be queries where the optimizer decides it has to scan every record.
Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".
As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article): http://msdn.microsoft.com/en-us/library/aa479086.aspx.
Another discussion of multi-tenancy as it relates to your data model here: http://www.ayende.com/Blog/archive/2008/08/07/Multi-Tenancy--The-Physical-Data-Model.aspx
Scalability. Security. Our company uses the one-DB-per-customer approach as well. It also makes the code a bit easier to maintain.
In regulated industries such as health care, one database per customer may be a requirement, possibly even a separate database server.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
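
A minimal sketch of upgrading many databases this way, assuming each database records its own schema version and that the engine supports transactional schema changes. SQLite is used as a stand-in, with its `PRAGMA user_version` as the version marker; the `upgrade` function and the `quote` table are illustrative only.

```python
import sqlite3


def upgrade(con, target_version, statements):
    """Apply schema statements atomically; return True if an upgrade ran."""
    con.isolation_level = None  # manage transactions explicitly
    current = con.execute("PRAGMA user_version").fetchone()[0]
    if current >= target_version:
        return False  # already upgraded, nothing to do
    con.execute("BEGIN")
    try:
        for sql in statements:
            con.execute(sql)
        con.execute(f"PRAGMA user_version = {int(target_version)}")
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")  # leave this database untouched
        raise
    return True


# Apply the same upgrade to every customer database in turn.
databases = [sqlite3.connect(":memory:") for _ in range(3)]
for con in databases:
    con.execute("CREATE TABLE quote (id INTEGER PRIMARY KEY)")
steps = ["ALTER TABLE quote ADD COLUMN version INTEGER DEFAULT 1"]
results = [upgrade(con, 2, steps) for con in databases]
```

Because the version check makes the upgrade safe to repeat, the loop can be re-run after a partial failure and will skip the databases that already succeeded.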
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have many smaller databases to distribute over multiple machines, you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster, but the majority probably don't).
I'd be interested in hearing a little more context from you on this, because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world (Google Bigtable, etc.), but those systems are solving a different set of problems and lose some of the useful features of an RDBMS.
There are a couple of meanings of "database":
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case with an errant where clause that showed me your customer data as well as my own.

Resources