Best solution for reporting database - sql-server

Here is the situation:
There is a transaction intensive database - used for both routine transactions and reports.
I was wondering if I could isolate these two operations into two independent databases, so that reports run off one database and all the transactions occur in the other. This would improve performance for the OLTP SQL database.
I have gone over a few options, such as mirroring, log shipping, replication, snapshots, and clustering, but would like to discuss the best possible strategy for the desired result.
Please advise the best solution to implement this strategy, or any other thoughts/suggestion you may have.

I am thinking this is a classic textbook case of separation of frontend and backend database.
For the projects and people I have worked with, there was a strong agreement that the two should be separated.
In one case, there were three tiers of databases:
Frontend transactions
Middle summary repository for reference by frontend transactions
Backend information repository
The frontend transaction speed was so critical, even that layer was dissected into multiple databases, one database per manufacturing area. The transactions were performed by equipment requiring very fast response.
Data from the frontend databases were used, together with customer- and management-oriented databases, to construct records for the backend reporting repository at an hourly frequency, because management needed short information latency for their operational and engineering decisions. If we could have performed the information compilation at 15-minute intervals, we would have done so. Depending on the project, that backend repository could be either Oracle or Sybase IQ.
However, the frontend transactions performed by equipment needed to refer to some meta information. Given the response time required by the equipment, we could not run the risk of being interrupted by someone running a huge ad hoc query on the backend repository, which happened frequently.
So, a middle-layer bridging database was created, which consisted of nightly abstracts of information from the backend repository.
Schema designed with commonality-keys
Schema design is very important, to optimise the response and performance of all the databases. You have to ensure your database records are commonality-key-indexed and discrete-time-indexed.
For a manufacturing plant filled with robots and equipment, divided into manufacturing areas, each area has a frontend transaction database. Each area database needs to have a commonality-key dispatcher. When a piece of equipment needs to perform a batch of operations, the beginOp event requests a discrete key from the dispatcher. An operation cycle may take seconds, days, or weeks. Every time a piece of equipment needs to perform a transaction on its state of operation, it includes that commonality-key. An operation could have sub-operations, sub-sub-operations, and so on, but each such operation is required to obtain a commonality-key from its area dispatcher.
The commonality-key dispatcher is simply the beginOp table in the database with an auto-increment key. Any equipment sharing the same begun operation is able to infer/obtain that commonality-key from the table, thanks to a meticulous process-sequencing strategy.
For areas where we could ensure that no two operations on the whole floor could start at the same 100 millisecond, there was no need for a dispatcher because we could simply use the date-time of a beginOp event, where the datetime function of the database server is the natural/spontaneous key dispatcher.
The reason for this discussion of commonality-keys is that the transaction response required is so quick that you do not want pieces of equipment to have to communicate with each other unnecessarily just to tell each other they are recording events of the same operation. The robots and equipment simply perform the transactions with the commonality-key they are holding.
The hourly compilation of information for insertion into the backend repository conveniently uses the composite-key of commonality+area, to construct the hierarchy of events.
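The dispatcher idea described above can be sketched roughly as follows, using Python's built-in sqlite3 module as a stand-in for one area's frontend database. The table and column names (begin_op, op_event, parent_key) and the helper functions are hypothetical illustrations, not the schema the original system used.

```python
import sqlite3


def open_area_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for one area's frontend database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- The dispatcher is just a table with an auto-increment key.
        CREATE TABLE begin_op (
            commonality_key INTEGER PRIMARY KEY AUTOINCREMENT,
            parent_key      INTEGER,   -- NULL for a top-level operation
            equipment       TEXT NOT NULL,
            begun_at        TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        -- Every transaction carries the key of the operation it belongs to.
        CREATE TABLE op_event (
            commonality_key INTEGER NOT NULL
                REFERENCES begin_op (commonality_key),
            equipment       TEXT NOT NULL,
            state           TEXT NOT NULL
        );
        """
    )
    return conn


def begin_op(conn, equipment, parent_key=None) -> int:
    """Register a new (sub-)operation and return its commonality-key."""
    cur = conn.execute(
        "INSERT INTO begin_op (equipment, parent_key) VALUES (?, ?)",
        (equipment, parent_key),
    )
    conn.commit()
    return cur.lastrowid


def record_event(conn, key, equipment, state) -> None:
    """Record a state transaction tagged with an existing commonality-key."""
    conn.execute(
        "INSERT INTO op_event (commonality_key, equipment, state) "
        "VALUES (?, ?, ?)",
        (key, equipment, state),
    )
    conn.commit()
```

A second machine working on the same operation would call record_event with the key it looked up from begin_op, so no machine-to-machine messaging is needed. The hourly compilation could then group op_event rows by area plus commonality_key.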
Frontend piping database
OK, this is really extreme. In some areas, the transactions were so frequent that we had a FIFO database. We introduced a fourth-tier database. For optimal transaction response, we had to keep a database size below 1GB. A transaction-piping process existed to empty old transactions into the fourth-tier databases. I found that it was easier (and gave better response) to create a pool of new databases, so that every time the active database reached 1GB, it was moved out and immediately replaced with a new database from the pool, leaving the machines performing the hourly compilation to join up the databases. That left us depending on an existing metadata database to house the commonality-key dispatcher table along with some metadata tables.
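As a rough illustration of the pool-swapping idea, here is a toy sketch in which each "database" is just a Python list of byte records and the size limit is a parameter. The class name PipingPool and its fields are hypothetical; a real implementation would swap database files or instances rather than lists.

```python
from collections import deque


class PipingPool:
    """Toy model: swap the active store for a fresh one when it fills up."""

    def __init__(self, pool_size, max_bytes):
        self.max_bytes = max_bytes
        # Pre-created empty stores, ready to be swapped in immediately.
        self.spares = deque([] for _ in range(pool_size))
        self.active = []
        self.active_bytes = 0
        # Full stores, waiting for the hourly compilation to join them up.
        self.retired = []

    def write(self, record: bytes) -> None:
        # Rotate before the write that would push the store over the limit.
        if self.active and self.active_bytes + len(record) > self.max_bytes:
            self._rotate()
        self.active.append(record)
        self.active_bytes += len(record)

    def _rotate(self) -> None:
        self.retired.append(self.active)
        # Take a spare if one is available; otherwise create one on demand.
        self.active = self.spares.popleft() if self.spares else []
        self.active_bytes = 0
```

The point of the pool is that the swap itself is cheap: the replacement already exists, so the writer never waits for old transactions to be emptied out.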
In retrospect, one might think the commonality-key dispatcher table and metadata tables could have been housed in the middle tier bridging database, but because the database management processes were automated and cookie-cut, it was cleaner to create a new process than to modify the process managing the mid-tier bridging database. Those management routines were used across the world, so you cannot willy-nilly change them without causing havoc to the financial performance of the company or offending the respective data layer architects maintaining them.
It took a lot of organisational skill for the managers to pull all of this together. So transactional data design is not simply a technical skill; it is also process planning involving a whole lot of people head-butting each other until you get it right.

What you ask for is totally standard - OLAP and OLTP do not mix in heavy load scenarios.
You use SQL Server. Look into SSAS (SQL Server Analysis Services) to build cubes (a different approach from SQL) that you can then report against.
If you do not want that, then mirroring is the next best solution: you can make the mirror available read-only for reporting (in SQL Server, by creating a database snapshot on the mirror), and it also gives you a backup to activate if the main server fails ;) Always good.
Clustering is a non-issue - it will allow you to move the database to another node, but it does not solve the performance issue at all. Log file shipping, replication - good, though I would go with mirroring, read only copy for reporting, loading the data into SSAS.

We have a read/write cluster which replicates (using transactional replication) to "read only" servers (not physically read only; the web app just performs reads on them). We do the same for reporting and this scales pretty well.
We have multiple sites, 32+ servers and a couple of reporting servers in this configuration with very high volume of inserts, updates and reads.
We primarily use Reporting Services for internal reporting. Reporting doesn't affect our core business, which I guess is your main concern.

Related

Load balancer and multiple instance of database design

The current single application server can handle about 5000 concurrent requests. However, the user base will be in the millions, and I may need two application servers to handle the requests.
So the design is to add a load balancer in the hope that it will handle over 10000 concurrent requests. However, the data of each user is stored in one single database. So if the design has two or more servers, should I do the following?
Having two instances of databases
Real-time sync between two database
Is this correct?
However, if so, will the sync process lower the performance of the servers, as database replication seems costly?
Thank you.
You probably want to think of your service in "tiers". In this instance, you've got two tiers; the application tier and the database tier.
Typically, your application tier is going to be considerably easier to scale horizontally (i.e. by adding more application servers behind a load balancer) than your database tier.
With that in mind, the best approach is probably to overprovision your database (i.e. put it on its own, meaty server) and have your application servers all connect to that same database. Depending on the database software you're using, you could also look at using read replicas (AWS docs) to reduce the strain on your database.
You can also look at caching via Memcached / Redis to reduce the amount of load you're placing on the database.
So – tl;dr – put your DB on its own, big, server, and spread your application code across many small servers, all connecting to that same DB server.
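To make the caching suggestion concrete, here is a minimal cache-aside sketch. It keeps entries in a local dict purely for illustration; in practice the dict would be replaced by Memcached or Redis. The class name CacheAside, the load_from_db callback and the ttl_seconds parameter are assumptions of this sketch, not part of any particular library.

```python
import time


class CacheAside:
    """Serve reads from a cache; fall back to the database on a miss."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db      # hypothetical function: key -> value
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            return entry[1]            # cache hit: no database round trip
        value = self._load(key)        # miss or expired: query the database
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Call after a write so the next read sees fresh data."""
        self._entries.pop(key, None)
```

Every application server behind the load balancer can use the same pattern against a shared cache, so repeated reads of the same user data stop reaching the single database server.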
The best option could be synchronizing a standby node with data from the active node. This is a cost-effective solution, since it can be achieved with an open source relational database (e.g. MariaDB).
Do not store computable results and statistics that can easily be derived at run time; this may help reduce data size.
If historical data is not urgently needed for inquiries, it can be written to a text file in a format that is easy to import into the database (e.g. .csv).
Data objects that are updated very often can be kept in an in-memory database as key-value pairs; use a scheduled task to perform batch updates/inserts into the relational database to achieve persistence.
Implement retry logic for database batch update tasks to handle DB downtime or network errors.
Consider writing data to the relational database as serialized objects.
Cache configuration data in memory from the database, refreshing the changing part either periodically or via an API.
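Two of the points above (keeping hot data in memory and flushing it in batches, with retry logic) can be sketched together. This is a single-threaded illustration using sqlite3 as a stand-in relational database; the kv table, the WriteBehindBuffer class and its parameters are hypothetical, and a real scheduler would call flush() on a timer.

```python
import sqlite3
import time


class WriteBehindBuffer:
    """Hold hot key-value pairs in memory and persist them in batches."""

    def __init__(self, conn, max_retries=3, backoff_seconds=0.0):
        self.conn = conn
        self.max_retries = max_retries
        self.backoff_seconds = backoff_seconds
        self.pending = {}  # key -> latest value; repeated updates collapse

    def put(self, key, value):
        self.pending[key] = value

    def flush(self):
        """Write all pending pairs in one transaction; retry on failure."""
        if not self.pending:
            return 0
        rows = list(self.pending.items())
        for attempt in range(1, self.max_retries + 1):
            try:
                with self.conn:  # commits on success, rolls back on error
                    self.conn.executemany(
                        "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
                        rows,
                    )
                break
            except sqlite3.OperationalError:
                if attempt == self.max_retries:
                    raise  # pending data is kept for the next scheduled run
                time.sleep(self.backoff_seconds * attempt)
        self.pending.clear()
        return len(rows)
```

Because pending is keyed, an object updated a hundred times between flushes costs one database write instead of a hundred. The trade-off is that anything still in memory is lost if the process dies before the next flush.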

Synchronize data b/w two data stores

I have two different databases. One is an old legacy one, which I'll be decommissioning because the old service is no longer being used. The other is for a new service that will eventually replace the old system. Before that happens we need both services running for a while.
Both have two tables for users: one stores the email address and password, and the other stores simple user-related data (addresses).
I need to synchronize data between these two databases. The old one is an MS SQL Server DB and the new one is a NoSQL DB (DynamoDB).
My strategy would be to copy all the users from the old DB to the new one before going live, and then, once the new system is running, synchronize the users between the two DBs.
I'll do this by having a tool run periodically to check for any users added after the last run, by querying the users table with something like WHERE CreationDate >= LastRunTime, and then checking for each such user whether it exists in the other database. I'll do this both ways, i.e. from old DB -> new DB and from new DB -> old DB.
Is this a good way of doing this? Any other better, fast solutions to achieve this?
How can I detect changes to existing users' data? Is there any better solution than checking and matching every user's record in both systems' tables, taking the one that was last modified (by checking the LastModifiedDate timestamp for each record), and updating it in the other system's table?
Solution 1 (my recommendation): Whenever the system inserts or updates a record in either of the databases, write the record to that database and also add the change to a queue.
A separate reader reads from the queue and periodically replicates the data to the other database; this way your data stays in sync between the databases.
Note: Another advantage of using the queue is that you don't have to set very high throughput on your DynamoDB table.
Solution 2: As you suggested in your question, you can add a cron job that replicates between the databases by checking records based on timestamp.
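A minimal sketch of the queue-based approach, with plain dicts standing in for the two databases and Python's in-process queue.Queue standing in for a durable message queue. The function names and the max_items cap are hypothetical; the cap is there to show how the reader, not the writers, decides how fast the second store is written to.

```python
import queue


def write_user(primary, change_queue, user_id, record):
    """Write to the local store, then publish the change for the other side."""
    primary[user_id] = record
    change_queue.put((user_id, record))


def drain(change_queue, replica, max_items=100):
    """Apply up to max_items queued changes to the other store."""
    applied = 0
    while applied < max_items:
        try:
            user_id, record = change_queue.get_nowait()
        except queue.Empty:
            break
        replica[user_id] = record
        applied += 1
    return applied
```

Running drain on a schedule with a modest max_items smooths bursts of writes into a steady trickle, which is why the target table does not need to be provisioned for peak load.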
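The timestamp-based job can be sketched as below, again with dicts standing in for the two user tables. The last_modified field name and both function names are assumptions for illustration. The rule applied is "newest record wins", which is the reconciliation the question describes.

```python
from datetime import datetime, timezone


def sync_since(source, target, last_run):
    """Copy records changed since last_run into target; newest wins."""
    copied = 0
    for user_id, record in source.items():
        if record["last_modified"] < last_run:
            continue  # unchanged since the previous run
        existing = target.get(user_id)
        if existing is None or existing["last_modified"] < record["last_modified"]:
            target[user_id] = dict(record)
            copied += 1
    return copied


def sync_both_ways(old_db, new_db, last_run):
    """One scheduled run: old -> new, then new -> old."""
    return sync_since(old_db, new_db, last_run) + sync_since(new_db, old_db, last_run)
```

Note that this relies on both systems stamping records with comparable clocks. If the same user is edited on both sides between runs, the older edit is silently overwritten.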
I've executed several table migrations from Oracle / MySQL to DynamoDB with no downtime and the approach I used was a little different than what you described. This approach ends up requiring more coding but I would consider it a lower risk approach than the hard cutover you described.
This approach requires multiple phases as described below:
Phase 1
Create the new DynamoDB table(s) for the data in your legacy system.
Phase 2
Update your application to write/update data in both the legacy database and in DynamoDB. Your application will still read and write to the legacy system so this should be a low risk change.
Immediately before deploying this code, load DynamoDB up with all of the old data.
Immediately after deploying, audit the databases to make sure they are in sync.
Phase 3
Update your application to start reading from DynamoDB. This should be low risk because your application will have been maintaining data in DynamoDB for some time.
Keep your application writing to the legacy database so you can cut back if you identify any problems in the new implementation. This ensures the cutover is low risk and you can easily roll back.
Phase 4
Remove the code from your application that reads and writes to the legacy database and deploy this to production.
You can now decommission the legacy database!
This is definitely more steps and will take more time than just taking the application down, migrating all of the data, and then deploying a new version of the application to read/write from DynamoDB. However, the main benefit to this approach is that it not only requires no downtime but is lower risk as it tests the change in phases and allows for easy rollback if any issues are encountered.
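The phases above can be modelled as a small wrapper around the two data stores. This is only a sketch: dicts stand in for the legacy database and DynamoDB, and the class name MigratingUserStore, the phase numbers as constructor arguments and the audit helper are all hypothetical.

```python
class MigratingUserStore:
    """Route reads and writes according to the current migration phase."""

    def __init__(self, legacy, new, phase):
        if phase not in (2, 3, 4):
            raise ValueError("phase must be 2, 3 or 4")
        self.legacy = legacy
        self.new = new
        self.phase = phase

    def save(self, user_id, record):
        # Phases 2 and 3 write to both stores so rollback stays possible.
        if self.phase in (2, 3):
            self.legacy[user_id] = record
        self.new[user_id] = record

    def load(self, user_id):
        # Phase 2 still trusts the legacy store; later phases read the new one.
        source = self.legacy if self.phase == 2 else self.new
        return source.get(user_id)


def audit(legacy, new):
    """Return the ids whose records differ between the two stores."""
    all_ids = legacy.keys() | new.keys()
    return sorted(uid for uid in all_ids if legacy.get(uid) != new.get(uid))
```

Moving from one phase to the next is then a configuration change, and moving back from phase 3 to phase 2 is the rollback path the answer describes.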
At a high level, a sync job could be 1> cron job based or 2> notification based.
The cron job could do sync as well as auditing if you have "creation time" and "last_updated_by time". In this case the master DB (the one the data is synced from) is normally the SQL DB, since it's much easier to do a table scan in SQL than in NoSQL (in DynamoDB you need to use its scan function, and it's limited by the table's hash key).
The second option is to build a notification mechanism, which could be based on DynamoDB Streams http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html. It's a mature DynamoDB feature; it guarantees event order and can achieve near-real-time event delivery. What you need to do is build a listener for those events.
Lastly, you could take a look at AWS Database Migration Service https://aws.amazon.com/dms/ to see if it satisfies your requirement.

Architecting a high performing "inserting solution"

I am tasked with putting together a solution that can handle a high level of inserts into a database. There will be many AJAX type calls from web pages. It is not only one web site/page, but several different ones.
It will be dealing with tracking people's behavior on a web site, triggered by various javascript events, etc.
It is important for the solution to be able to handle the heavy database inserting load.
After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.
We are initially looking at using the MEAN stack with MongoDB and migrating some data to MySQL for reporting purposes. I am also wondering about using some sort of queueing before inserting into the DB, or caching such as memcached.
I didn't manage to find much help on this elsewhere. I did see this post, but it is now close to 5 years old, feels a bit outdated, and doesn't quite ask the same questions.
Your thoughts and comments are most appreciated. Thanks.
Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?
It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.
Some initial theory
To understand how you can optimize for the heavy insert workload, I suggest understanding the main overheads involved in inserting data into a database. Once the various overheads are understood, all kinds of optimizations will come to you naturally. The bonus is that you will have more confidence in the solution, you will know more about databases, and you can apply these optimizations to multiple engines (MySQL, PostgreSQL, Oracle, etc.).
I'll first make a non-exhaustive list of insertion overheads and then show simple solutions to avoid them.
1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on.
And once the query is accepted, it needs to be parsed and its grammar validated, data types must be parsed and validated, the objects (tables, indexes, etc.) referenced by the query must be looked up, access permissions must be checked, etc. All of these steps (and I'm sure I forgot quite a few things here) represent significant overheads when inserting a single value. The overheads are so large that some databases, e.g. Oracle, have a SQL cache to avoid some of these overheads.
Solution: Reuse database connections, use prepared statements, and insert many values at every SQL query (1000s to 100000s).
2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging all logical and physical modification to the database ahead of time and require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The actual time required to deal with the ACID guarantees can be several orders of magnitude higher than the time it takes to actually copy a 200B row in a database page.
Solution: Disable undo/redo logging when you import data in a table. Alternatively, you could also (1) drop the isolation level to trade off weaker ACID guarantees for lower overhead or (2) use asynchronous commit (a feature that allows the DB engine to complete an insert before the redo logs are properly hardened to disk).
3. Updating the physical design / database constraints: Inserting a value in a table usually requires updating multiple indexes, materialized views, and/or executing various triggers. These overheads can again easily dominate over the insertion time.
Solution: You can consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done you can re-create them. For example, it is significantly faster to create an index from scratch than to populate it through individual insertions.
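The first and third solutions above can be combined in one short sketch: reuse a single connection, send many rows per parameterized statement, commit once per batch, and rebuild the index after the load. It uses sqlite3 as a stand-in engine; the events table, the idx_events_user index and the bulk_load function are hypothetical names, and the sketch assumes the events table already exists.

```python
import sqlite3


def bulk_load(conn, rows, batch_size=1000):
    """Insert (user_id, action) rows in batches with the index dropped."""
    # Drop the secondary index so each insert only touches the table itself.
    conn.execute("DROP INDEX IF EXISTS idx_events_user")
    sql = "INSERT INTO events (user_id, action) VALUES (?, ?)"
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # one transaction (one commit) per batch
            conn.executemany(sql, batch)  # one statement, many rows
    # Rebuild the index once, from scratch, after the data is in place.
    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
    conn.commit()
    return len(rows)
```

Dropping the index only pays off when nothing needs to query the table during the load, which is another reason to keep the insert-heavy table separate from the one used for reporting.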
In practice
Now let's see how we can apply these concepts to your particular design. The main issue I see in your case is that the insert requests are sent by many distributed clients, so there is little chance for bulk processing of the inserts.
You could consider adding a caching layer in front of whatever database engine you end up having. I don't think memcached is good for implementing such a caching layer -- memcached is typically used to cache query results, not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source but not all features are free, so I'm not sure if you need to pay for a license or not. If you cannot use VoltDB you could look at the memory engine for MySQL or other similar in-memory engines.
Another optimization you can consider is to have a different database for doing the analytics. Most likely, a database with a high data ingest volume is quite bad at executing OLAP-style queries, and the other way around. Coming back to my recommendation, VoltDB is no exception and is also suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (i.e. this would be a VoltDB cluster) and moves it in bulk to the backend DB for the analytics (MongoDB or maybe something more efficient). You can then apply all the optimizations above for the bulk data movement, create a rich set of additional index structures to speed up data access, then run your favourite analytical queries and save the results as a new set of tables/materialized views for later access. The import/analysis process can be repeated continuously in the background.
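A minimal sketch of such a background mover, using two sqlite3 connections as stand-ins for the frontend and backend databases. The table names (events, events_archive), the id watermark and the function name are assumptions; the sketch relies on the frontend assigning monotonically increasing ids.

```python
import sqlite3


def move_new_rows(frontend, backend, last_id, batch_size=5000):
    """Copy rows with id > last_id to the backend; return the new watermark."""
    rows = frontend.execute(
        "SELECT id, user_id, action FROM events "
        "WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()
    if not rows:
        return last_id  # nothing new since the previous run
    with backend:  # one bulk insert, one commit
        backend.executemany(
            "INSERT INTO events_archive (id, user_id, action) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return rows[-1][0]
```

The caller stores the returned watermark and passes it to the next run, so the process can be repeated continuously. Rows already moved can be deleted from the frontend afterwards to keep it small.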
Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts. So the table is optimized for queries with indexes and such. If you have a table where DML (particularly Inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including a primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.
Fronting your web application with a NoSQL table to handle the high insert rate then moving the data more or less at your leisure to a standard relational db for further processing is a good idea.

Reliable alternative to replication for continous data sync between two databases

I have one central database and 25 client databases and all have same schema.
I want changes made to some tables of the central database to flow down to the client databases.
The databases used is SQL Express so I cannot use replication.
The solution I have today is to keep track of the changes in the central database; a program then writes these changes to a text file and sends it down to the client databases. Another program reads these text files and updates the client database.
There are three problems with this:-
1. The files get lost or arrive in jumbled order which messes up the client data
2. the process is slow
3. the programs are sometimes shutdown so the whole sync flow gets stopped.
Is there a reliable alternative that is fast and secure?
I wonder how banking software is made... it never loses transactions and it is fast.
Add an UpdateDate column to all the entities that need to be replicated. At each client add a linked server to the central repository. Now, every 5 minutes or so, poll your central repository for changes using the last UpdateDate of a client entity and grab the delta.
Then use merge or insert and update to merge data on the client. That's a very reliable way of doing homebrew replication. To keep track of deleted elements you would either want to mark them as deleted or have another table to keep track of entity kind and its reference, again combined with UpdateDate for replication.
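A rough sketch of this polling scheme, with two sqlite3 connections standing in for the central and client databases (in SQL Server the client would reach the central database through the linked server and use MERGE instead). The customer table, the is_deleted flag, the ISO-8601 text timestamps and the pull_delta function are all hypothetical choices for this illustration.

```python
import sqlite3

# Hypothetical replicated table; deletes are recorded as is_deleted = 1.
CUSTOMER_SCHEMA = """
CREATE TABLE customer (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    is_deleted  INTEGER NOT NULL DEFAULT 0,
    update_date TEXT NOT NULL      -- ISO-8601, so text order = time order
)
"""


def pull_delta(central, client, last_sync):
    """Fetch rows changed after last_sync and merge them into the client."""
    rows = central.execute(
        "SELECT id, name, is_deleted, update_date FROM customer "
        "WHERE update_date > ? ORDER BY update_date",
        (last_sync,),
    ).fetchall()
    if not rows:
        return last_sync
    with client:  # apply the whole delta atomically
        client.executemany(
            "INSERT OR REPLACE INTO customer "
            "(id, name, is_deleted, update_date) VALUES (?, ?, ?, ?)",
            rows,
        )
    return rows[-1][3]  # newest update_date becomes the next watermark
```

Because each run asks for everything newer than what the client already has, a missed or failed run is simply caught up by the next one, which avoids the lost and out-of-order file problems. One caveat: with a strict > comparison, a row committed later with the same timestamp as the watermark would be missed, so a real job may want to overlap the window slightly.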
Update
Then you mention transactions and banking software. When you do your replication via files, we ain't talkin' about no transactional replication here, not by a long shot.
If you need transactional consistency you need to subscribe to the transaction flow of the data warehouse.
I don't want to be unhelpful and you haven't given any background about your business needs, but you have to decide if your priority is really "fast and secure" or if it's actually "cheap". Replicating changes between multiple databases in a reliable, consistent way is not easy (as you know) and it's highly unlikely that you will be able to develop a solution yourself that has the features, stability and performance of SQL Server replication.
SQL Express can be a replication subscriber, by the way, so it's not clear why it doesn't meet your needs. But if it doesn't, you should estimate the cost to your business (or customer) of dealing with issues caused by an unreliable solution: your time, business downtime, finding and correcting incorrect data, customer complaints, lost business etc. Then compare that to the cost of 25 SQL Server licenses (you should certainly be able to get a good discount when you order that volume), additional hardware (if any) and the costs of training, consulting and/or learning how to use replication. Then extrapolate those costs over 5 years or so. You may find that it's cheaper just to buy the solution you need. And of course buying the full SQL Server edition means you get a lot of other new features that might be useful to you.
If you (or your boss) are really determined to get something for nothing, you might want to investigate PostgreSQL or MySQL. They both have free replication solutions that seem to be widely enough used to be reliable for many companies. Of course, you then need to calculate the costs of switching to a new database platform.
If you have one central database and 25 clients, you can easily do it with one (yes, only one) SQL Server licence for the main database. Subscribers to this database can run SQL Express. As long as users access the client databases, you are not even obliged to buy SQL CALs.
Back to banking software, be sure that they are paying good money for their server licenses! So don't be surprised if these are reliable and fast ...

Transactional Replication For Write Heavy Medium Sized Database

We have a decent-sized, write-heavy database that is about 426 GB (including indexes) with about 300 million rows. We currently collect location data from devices that report to our server every couple of minutes, and we serve about 10,000 devices - so lots of writes every second. The location table that stores the location of each device has about 223 million rows. The data is currently archived by year.
Problems occur when users run large reports on this database: the whole database grinds almost to a halt.
I understand I need a reporting database, but my question is if anyone has experience of using SQL Server Transactional Replication on a database of equivalent size, and their experience of using this technology?
My rough plan is to point all the reports in our application to the Reporting Database, use Transactional Replication to replicate the data over from the master to the slave (Reporting Database).
Anyone have any thoughts on this strategy and the problems I may encounter?
Many thanks!
Transactional replication should work well in this scenario (the only effect the size of the database will have is the time taken to generate the initial snapshot). However, it may not solve your problem.
I think the issue you'll have if you choose transactional replication is that the slave server is going to be under the same load as the master machine as changes are applied - it will still crawl when users run large reports (assuming it's of a similar spec).
Depending on the acceptable latency of reporting data to the live data, this may or may not be OK for your users.
If some latency is acceptable you may get better performance from log shipping, since changes are applied in batches.
Before acquiring a reporting server, another approach would be to investigate the queries that your users are running and look at modifying either their code or the indexing strategy to better match what they're trying to do.
Transactional Replication could work well for you. The things to consider:
The target database tables must be read-only.
The server containing the target database should be stout enough to handle the SELECT traffic from the reporting applications.
Depending on the INSERT/UPDATE traffic, you may need to have a third server act as the Distribution server.
You also have to consider the size of the Distribution database.
Based on what I read here, I'd use a pull subscription from the Reporting server to offload traffic from the OLTP server.
You can skip the torment of a snapshot by initializing the reporting database from a backup of the OLTP database. See https://msdn.microsoft.com/en-us/library/ms151705.aspx
There will be INSERT/UPDATE/DELETE traffic from the Replication into both the Distribution and the Subscriber databases. That requires consideration, but lock/block issues should be no worse (and probably better) than running those reports off of OLTP.
I am running multiple publications on a 2.6TB database with 2.5GB/day of growth, using both pure transactional to drive reports (to two reporting servers) and Peer-to-Peer Transactional to replicate data in a scale-out for a SaaS offering (to three more servers). Because of this, we have a separate distributor.
Hope this helps.
Thanks
John.
