What is the best approach for decoupled database design in terms of data sharing? - database

I have a series of Oracle databases that need to access each other's data. The most efficient way to do this is to use database links - setting up a few database links I can get data from A to B with the minimum of fuss. The problem for me is that you end up with a tightly-coupled design and if one database goes down it can bring the coupled databases with it (or perhaps part of an application on those databases).
What alternative approaches have you tried for sharing data between Oracle databases?
Update after a couple of responses...
I wasn't thinking so much a replication, more on accessing "master data". For example, if I have a central database with currency conversion rates and I want to pull a rate into a separate database (application). For such a small dataset igor-db's suggestion of materialized views over DB links would work beautifully. However, when you are dynamically sampling from a very large dataset then the option of locally caching starts to become trickier. What options would you go for in these circumstances. I wondered about an XML service but tuinstoel (in a comment to le dorfier's reply) rightly questioned the overhead involved.
Summary of responses...
On the whole I think igor-db is closest, which is why I've accepted that answer, but I thought I'd add a little to bring out some of the other answers.
For my purposes, where I'm looking at data replication only, it looks like Oracle BASIC replication (as opposed to ADVANCED) replication is the one for me. Using materialized view logs on the master site and materialized views on the snapshot site looks like an excellent way forward.
Where this isn't an option, perhaps where the data volumes make full table replication an issue, then a messaging solution seems the most appropriate Oracle solution. Oracle Advanced Queueing seems the quickest and easiest way to set up a messaging solution.
The least preferable approach seems to be roll-your-own XML web services but only where the relative ease of Advanced Queueing isn't an option.

Streams is the Oracle replication technology.
You can use MVs over database links (so database 'A' has a materialized view of the data from database 'B'. If 'B' goes down, the MV can't be refreshed but the data is still in 'A').
Mileage may depend on DB volumes, change volumes...

It looks to me like it's by definition tightly coupled if you need simultaneous synchronous access to multiple databases.
If this is about transferring data, for instance, and it can be asynchronous, you can install a message queue between the two and have two processes, with one reading from the source and the other writing to the sink.

The OP has provided more information. He states that the dataset is very large. Well how large is large? And how often are the master tables changed?
With the use of materialized view logs Oracle will only propagate the changes made in the master table. A complete refresh of the data isn't necessary. Oracle streams also only communicate the modifications to the other side.
Buying storage is cheap, so why not local caching? Much cheaper than programming your own solutions.
An XML service doesn't help you when its database is not available so I don't understand why it would help? Oracle has many options for replication, explore them.
edit
I've build xml services. They provide interoperability between different systems with a clear interface (contract). You can build a xml service in C# and consume the service with Java. However xml services are not fast.

Why not use Advanced Queuing? Why roll your own XML service to move messages (DML) between Oracle instances - It's already there. You can have propagation move messages from one instance to another when they are both up. You can process them as needed in the destination servers. AQ is really rather simple to set up and use.

Why do they need to be separate databases?
Having a single database/instance with multiple schemas might be easier.
Keeping one database up (with appropriate standby databases etc) will be easier than keeping N up.

What kind of immediacy do you need and how much bi-directionality? If the data can be a little older and can be pulled from one "master source", create a series of simple ETL scripts run on a schedule to pull the data from the "source" database into the others.
You can then tailor the structure of the data to feed the needs of the client database(s) more precisely and you can change the structure of the source data until you're blue in the face.

Related

SQL Server switching live database

A client has one of my company's applications which points to a specific database and tables within the database on their server. We need to update the data several times a day. We don't want to update the tables that the users are looking at in live sessions. We want to refresh the data on the side and then flip which database/tables the users are accessing.
What is the accepted way of doing this? Do we have two databases and rename the databases? Do we put the data into separate tables, then rename the tables? Are there other approaches that we can take?
Based on the information you have provided, I believe your best bet would be partition switching. I've included a couple links for you to check out because it's much easier to direct you to a source that already explains it well. There are several approaches with partition switching you can take.
Links: Microsoft and Catherin Wilhelmsen blog
Hope this helps!
I think I understand what you're saying: if the user is on a screen you don't want the screen updating with new information while they're viewing it, only update when they pull a new screen after the new data has been loaded? Correct me if I'm wrong. And Mike's question is also a good one, how is this data being fed to the users? Possibly there's a way to pause that or something while the new data is being loaded. There are more elegant ways to load data like possibly partitioning the table, using a staging table, replication, have the users view snapshots, etc. But we need to know what you mean by 'live sessions'.
Edit: with the additional information you've given me, partition switching could be the answer. The process takes virtually no time, it just changes the pointers from the old records to the new ones. Only issue is you have to partition on something patitionable, like a date or timestamp, to differentiate old and new data. It's also an Enterprise-Edition feature and I'm not sure what version you're running.
Possibly a better thing to look at is Read Committed Snapshot Isolation. It will ensure that your users only look at the new data after it's committed; it provides a transaction-level consistent view of the data and has minimal concurrency issues, though there is more overhead in TempDB. Here are some resources for more research:
http://www.databasejournal.com/features/mssql/snapshot-isolation-level-in-sql-server-what-why-and-how-part-1.html
https://msdn.microsoft.com/en-us/library/tcbchxcb(v=vs.110).aspx
Hope this helps and good luck!
The question details are a little vague so to clarify:
What is a live session? Is it a session in the application itself (with app code managing it's own connections to the database) or is it a low level connection per user/session situation? Are users just running reports or active reading/writing from the database during the session? When is a session over and how do you know?
Some options:
1) Pull all the data into the client for the entire session.
2) Use read committed or partitions as mentioned in other answers (however requires careful setup for your queries and increases requirements for the database)
3) Use replica database for all queries, pause/resume replication when necessary (updating data should be faster than your process but it still might take a while depending on volume and complexity)
4) Use replica database and automate a backup/restore from the master (this might be the quickest depending on the overall size of your database)
5) Use multiple replica databases with replication or backup/restore and then switch the connection string (this allows you to update the master constantly and then update a replica and switch over at a certain predictable time)

Microservices and database joins

For people that are splitting up monolithic applications into microservices how are you handling the connundrum of breaking apart the database. Typical applications that I've worked on do a lot of database integration for performance and simplicity reasons.
If you have two tables that are logically distinct (bounded contexts if you will) but you often do aggregate processing on a large volumes of that data then in the monolith you're more than likely to eschew object orientation and are instead using your database's standard JOIN feature to process the data on the database prior to return the aggregated view back to your app tier.
How do you justify splitting up such data into microservices where presumably you will be required to 'join' the data through an API rather than at the database.
I've read Sam Newman's Microservices book and in the chapter on splitting the Monolith he gives an example of "Breaking Foreign Key Relationships" where he acknowledges that doing a join across an API is going to be slower - but he goes on to say if your application is fast enough anyway, does it matter that it is slower than before?
This seems a bit glib? What are people's experiences? What techniques did you use to make the API joins perform acceptably?
When performance or latency doesn't matter too much (yes, we don't
always need them) it's perfectly fine to just use simple RESTful APIs
for querying additional data you need. If you need to do multiple
calls to different microservices and return one result you can use
API Gateway pattern.
It's perfectly fine to have redundancy in Polyglot persistence environments. For example, you can use messaging queue for your microservices and send "update" events every time you change something. Other microservices will listen to required events and save data locally. So instead of querying you keep all required data in appropriate storage for specific microservice.
Also, don't forget about caching :) You can use tools like Redis or Memcached to avoid querying other databases too often.
It's OK for services to have read-only replicated copies of certain reference data from other services.
Given that, when trying to refactor a monolithic database into microservices (as opposed to rewrite) I would
create a db schema for the service
create versioned* views** in that schema to expose data from that schema to other services
do joins against these readonly views
This will let you independently modify table data/strucutre without breaking other applications.
Rather than use views, I might also consider using triggers to replicate data from one schema to another.
This would be incremental progress in the right direction, establishing the seams of your components, and a move to REST can be done later.
*the views can be extended. If a breaking change is required, create a v2 of the same view and remove the old version when it is no longer required.
**or Table-Valued-Functions, or Sprocs.
CQRS---Command Query Aggregation Pattern is the answer to thi as per Chris Richardson.
Let each microservice update its own data Model and generates the events which will update the materialized view having the required join data from earlier microservices.This MV could be any NoSql DB or Redis or elasticsearch which is query optimized. This techniques leads to Eventual consistency which is definitely not bad and avoids the real time application side joins.
Hope this answers.
I would separate the solutions for the area of use, on let’s say operational and reporting.
For the microservices that operate to provide data for single forms that need data from other microservices (this is the operational case) I think using API joins is the way to go. You will not go for big amounts of data, you can do data integration in the service.
The other case is when you need to do big queries on large amount of data to do aggregations etc. (the reporting case). For this need I would think about maintaining a shared database – similar to your original scheme and updating it with events from your microservice databases. On this shared database you could continue to use your stored procedures which would save your effort and support the database optimizations.
In Microservices you create diff. read models, so for eg: if you have two diff. bounded context and somebody wants to search on both the data then somebody needs to listen to events from both bounded context and create a view specific for the application.
In this case there will be more space needed, but no joins will be needed and no joins.

Best highperf database for simple read/write (no update) scenario

I'm interested in opinions on what database system to select for this project where I basically need to persist a constant stream of messages at potentially high speed. There's basically four types of messages with some commonalities. No relations needed. I guess you could call it an event store.
I will need to read (query by a non-unique key), but I don't need to update any data. I will have to delete old data though.
Considerations:
Database must be able to scale out
Performance is crucial
as well as up-time (system allowing live updates would be nice)
Preferably something running on Windows Server, but this is not a requirement
I'm familiar with document databases (MongoDB), but don't know what other kinds of NoSQL solutions would fit my problem, or how they compare.
MongoDb would be ideal. But if all you want to do is read from the stream and serve up content, more than database consideration (use any db - mysql, access, sql server express, xml files), I would suggest you look at putting all your data in memory (maybe at app startup); and then serve up data from memory.
You should also look at some caching solutions like Memcached (http://memcached.org/)

The best technology to synchronize data between different database schemas?

I have an existing SQL Server 2005 database that runs our accounting/inventory application. We are looking at using a new on-line ordering framework - which has it's own database.
If we use this new framework, we will need to transfer the on-line ordering data (inventory, prices, orders, customers) - almost realtime - to and from, our existing inventory database. The transfer of data doesn't have to be real-time, but it has to be quick. Both databases will be in SQL Server.
So my question is... what is the best way to transer data back and forth between two databases, with have different schemas?
Replication? SSIS? What would you suggest, and why?
Any help would be appreciated!
The Business Rules are the Hard Part
One-way sync? Two-way sync? Real-time push? Nightly updates? Dump and reload? Compare and update? Conflict resolution? Which side wins? Push read-only info one way, and order info the other way? What about changes/cancellations/etc? Do order statuses get pushed back?
You can see where I'm going here. Technology is a secondary question.
Because of the business rules issue, and because the two systems have different schemas (and different purposes), this isn't a standard data move, and most of the "standard" answers (replication, log shipping, etc) are off the table.
There are frameworks out there designed to help with this, like Microsoft BizTalk or Scribe Insight. These are cumbersome and expensive, though.
It isn't too difficult to create a custom queue-ing system either based on SQL triggers, or scheduled pushes (depending on your needs) in C# or your favorite language. That's probably the route I would go. It would probably involve a third "transfer" database to hold the queue of changes made by one side, and a module to apply the business rules and push the data to the other.
Personally I would run away from this nightmare as fast as I could. Since you have not yet bought this online ordering I would suggest that keeping the data in synch with the existing application is a valid reason for not doing such a thing. If you buy this you will eternally regret how mucked up your data will become and how much time and money you spend trying to get things to work properly. This is a disaster waiting to happen. You''ll end up having people order items supposedly in iventory when there are none in the warehouse. Do not do this. This is a guarantee of angry customers and angry managers. Far far cheaper over time to hire some developers to put together your own online ordering that accesses your data base. If they go ahead over your objections, I'd update my resume.
From personal experience, I would only use replication if there was no other choice. You have to tear it down for any schema change and it has a tendency to just blow up.
For this, I'd most likely use SSIS. It's fairly easy to build a transformation package and fairly simple to maintain.
Replication works well, and if it's two way, it might be your only viable option, since conflict resolution is built in.
If you're going one-way, SSIS or triggers on tables would fine, and would push the data real-time (for triggers) or at whatever interval you want (SSIS). The upside to SSIS would be that it's a background process, whereas triggers could potentially hold up transactions on the supply side while they push the data.
If you're looking to move massive amounts of data, there are other products out there that can do it for you, but if it's not too much data, a solution using SQL Servers tools should do all you need.

Extreme Sharding: One SQLite Database Per User

I'm working on a web app that is somewhere between an email service and a social network. I feel it has the potential to grow really big in the future, so I'm concerned about scalability.
Instead of using one centralized MySQL/InnoDB database and then partitioning it when that time comes, I've decided to create a separate SQLite database for each active user: one active user per 'shard'.
That way backing up the database would be as easy as copying each user's small database file to a remote location once a day.
Scaling up will be as easy as adding extra hard disks to store the new files.
When the app grows beyond a single server I can link the servers together at the filesystem level using GlusterFS and run the app unchanged, or rig up a simple SQLite proxy system that will allow each server to manipulate sqlite files in adjacent servers.
Concurrency issues will be minimal because each HTTP request will only touch one or two database files at a time, out of thousands, and SQLite only blocks on reads anyway.
I'm betting that this approach will allow my app to scale gracefully and support lots of cool and unique features. Am I betting wrong? Am I missing anything?
UPDATE I decided to go with a less extreme solution, which is working fine so far. I'm using a fixed number of shards - 256 sqlite databases, to be precise. Each user is assigned and bound to a random shard by a simple hash function.
Most features of my app require access to just one or two shards per request, but there is one in particular that requires the execution of a simple query on 10 to 100 different shards out of 256, depending on the user. Tests indicate it would take about 0.02 seconds, or less, if all the data is cached in RAM. I think I can live with that!
UPDATE 2.0 I ported the app to MySQL/InnoDB and was able to get about the same performance for regular requests, but for that one request that requires shard walking, innodb is 4-5 times faster. For this reason, and other reason, I'm dropping this architecture, but I hope someone somewhere finds a use for it...thanks.
The place where this will fail is if you have to do what's called "shard walking" - which is finding out all the data across a bunch of different users. That particular kind of "query" will have to be done programmatically, asking each of the SQLite databases in turn - and will very likely be the slowest aspect of your site. It's a common issue in any system where data has been "sharded" into separate databases.
If all the of the data is self-contained to the user, then this should scale pretty well - the key to making this an effective design is to know how the data is likely going to be used and if data from one person will be interacting with data from another (in your context).
You may also need to watch out for file system resources - SQLite is great, awesome, fast, etc - but you do get some caching and writing benefits when using a "standard database" (i.e. MySQL, PostgreSQL, etc) because of how they're designed. In your proposed design, you'll be missing out on some of that.
Sounds to me like a maintenance nightmare. What happens when the schema changes on all those DBs?
http://freshmeat.net/projects/sphivedb
SPHiveDB is a server for sqlite database. It use JSON-RPC over HTTP to expose a network interface to use SQLite database. It supports combining multiple SQLite databases into one file. It also supports the use of multiple files. It is designed for the extreme sharding schema -- one SQLite database per user.
One possible problem is that having one database for each user will use disk space and RAM very inefficiently, and as the user base grows the benefit of using a light and fast database engine will be lost completely.
A possible solution to this problem is to create "minishards" consisting of maybe 1024 SQLite databases housing up to 100 users each. This will be more efficient than the DB per user approach, because data is packed more efficiently. And lighter than the Innodb database server approach, because we're using Sqlite.
Concurrency will also be pretty good, but queries will be less elegant (shard_id yuckiness). What do you think?
If you're creating a separate database for each user, it sounds like you're not setting up relationships... so why use a relational database at all?
If your data is this easy to shard, why not just use a standard database engine, and if you scale large enough that the DB becomes the bottleneck, shard the database, with different users in different instances? The effect is the same, but you're not using scores of tiny little databases.
In reality, you probably have at least some shared data that doesn't belong to any single user, and you probably frequently need to access data for more than one user. This will cause problems with either system, though.
I am considering this same architecture as I basically wanted to use the server side SQLLIte databases as backup and synching copy for clients. My idea for querying across all the data is to use Sphinx for full-text search and run Hadoop jobs from flat dumps of all the data to Scribe and then expose the results as webservies. This post gives me some pause for thought however, so I hope people will continue to respond with their opinion.
Having one database per user would make it really easy to restore individual users data of course, but as #John said, schema changes would require some work.
Not enough to make it hard, but enough to make it non-trivial.

Resources