I have a situation I'm sure has been resolved by many, many others. I have an idea of how to resolve it, but when I research it, it doesn't seem like many folks have implemented what I'm thinking about.
Here is the state of affairs we are in--
--We have a single Database (SQL Server) that's structured and used as an OLTP DB
--We have a need to Batch in data that's needed in transactional context
--We have a need for up-to-date Reporting via an internal UI
The problem, as you may have already guessed, is that when we Batch data in and out of the OLTP database, it competes for resources and, in some cases, locks transactional tables.
What I'd like to do is introduce a second database, let the 2 databases mirror, and only let one DB be accessible by transactional applications. If data needs to be Batched in or out of the transactional database, it goes in or out of the secondary, non-transactional database, and the mirroring will take care of the syncing.
When I researched this approach, I didn't get many hits.
Is there a better / more accepted way of handling this?
This is just my opinion, since my searching didn't turn up anyone doing the same thing either, but I would think that batching into the mirrored database and letting it handle the synchronization would cause the exact same locking, or possibly worse.
As a first step, I suggest enabling snapshot isolation (ALLOW_SNAPSHOT_ISOLATION) and READ_COMMITTED_SNAPSHOT.
https://msdn.microsoft.com/en-us/library/tcbchxcb(v=vs.110).aspx
https://www.brentozar.com/archive/2013/01/implementing-snapshot-or-read-committed-snapshot-isolation-in-sql-server-a-guide/
Note the warnings by Brent Ozar about the possible repercussions if you have long-running transactions that you're expecting synchronous results from across multiple threads.
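For reference, a minimal T-SQL sketch of enabling both options; the database name is a placeholder, and turning on READ_COMMITTED_SNAPSHOT needs near-exclusive access, so schedule it for a quiet window:

    -- Database name is a placeholder; run during a maintenance window.
    ALTER DATABASE MyOltpDb SET ALLOW_SNAPSHOT_ISOLATION ON;
    -- ROLLBACK IMMEDIATE kicks out open transactions so the option can take effect.
    ALTER DATABASE MyOltpDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;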
I'm familiar with replication from other systems and I've read the docs about sync, async and semi-sync replication in Memgraph. What I'm most interested in is the why: what benefits do we get from replication? Disaster recovery, high availability, parallel processing, etc?
You get several of those things at once. Basically, replication makes your data redundant: it copies the data to multiple machines and gains several benefits from that.
The queries you execute against your database can be served by any machine that holds a copy of the data, so replication provides higher throughput for your read queries. On the other hand, if one machine goes down, the others can compensate for it, since they hold the same data. So all of the things you listed are true.
Need a sanity check.
Imagine having one SQL Server instance on a beefy system (e.g., 48 GB of RAM and tons of storage). Obviously there comes a point where it gets hammered when lots of jobs are running.
These jobs/DB are part of an external piece of software and cannot be controlled or modified by us directly.
Now, when these jobs run, besides the queries probably being inefficient, they do bring the DB down: everything becomes very slow, so any "regular" users get slow responses.
The immediate thing I can think of is replication of some kind, where the "secondary" DB would be the one these jobs point to and do their hammering against, leaving the primary available and active, but the primary would still receive any updates from the secondary for data consistency/integrity.
Would this be the right thing to do? Ultimately I want the load to be elsewhere but have the primary be aware of updates and update itself without bringing it down or being very slow.
What is this called in MS SQL Server? Does such a thing exist? The jobs will be doing reads and writes, FYI.
There are numerous approaches to this, all of which are native to SQL Server, but I think you should look into Transactional Replication:
https://learn.microsoft.com/en-us/sql/relational-databases/replication/transactional/transactional-replication?view=sql-server-ver16
It effectively creates a read-only replica, based on reading the transaction log, that for reporting purposes is practically real time.
From the documentation:
"By default, Subscribers to transactional publications should be treated as read-only, because changes are not propagated back to the Publisher. However, transactional replication does offer options that allow updates at the Subscriber."
Your scenario likely has nuances I don't know about, but you can use various flavors of SQL replication, custom triggers, linked servers, three-part-name queries, etc. to fill in the holes.
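To give a feel for what the setup looks like, here is a heavily abbreviated T-SQL skeleton of creating a transactional publication and a push subscription. It assumes a Distributor is already configured, omits the snapshot and log reader agent setup, and every database, publication, table, and server name below is a placeholder:

    -- Run in the publication (OLTP) database; names are placeholders.
    EXEC sp_replicationdboption @dbname = N'OltpDb', @optname = N'publish', @value = N'true';
    EXEC sp_addpublication @publication = N'OltpDb_Reporting', @status = N'active';
    EXEC sp_addarticle @publication = N'OltpDb_Reporting',
                       @article = N'Orders', @source_object = N'Orders';
    EXEC sp_addsubscription @publication = N'OltpDb_Reporting',
                            @subscriber = N'REPORTSRV',
                            @destination_db = N'ReportingDb',
                            @subscription_type = N'push';

In practice you would add one sp_addarticle call per table you want replicated and let the snapshot agent initialize the subscriber.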
I am tasked with putting together a solution that can handle a high level of inserts into a database. There will be many AJAX type calls from web pages. It is not only one web site/page, but several different ones.
It will be dealing with tracking people's behavior on a web site, triggered by various javascript events, etc.
It is important for the solution to be able to handle the heavy database inserting load.
After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.
We are initially looking at using the MEAN stack with MongoDB and migrating some data to MySQL for reporting purposes. I am also wondering about the use of some sort of queuing before inserting into the DB, or caching with something like memcached.
I didn't manage to find much help on this elsewhere. I did see this post, but it is now close to 5 years old, feels a bit outdated, and doesn't quite ask the same questions.
Your thoughts and comments are most appreciated. Thanks.
Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?
It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.
Some initial theory
To understand how you can optimize for a heavy insert workload, I suggest understanding the main overheads involved in inserting data into a database. Once the various overheads are understood, all kinds of optimizations will come to you naturally. The bonus is that you will have more confidence in the solution, you will know more about databases, and you will be able to apply these optimizations to multiple engines (MySQL, PostgreSQL, Oracle, etc.).
First I'll make a non-exhaustive list of insertion overheads, then show simple solutions for avoiding them (a short SQL sketch of those solutions follows the list).
1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on.
And once the query is accepted, it needs to be parsed, its grammar validated, its data types checked, the objects (tables, indexes, etc.) it references looked up, and access permissions verified, etc. All of these steps (and I'm sure I forgot quite a few things here) represent significant overheads when inserting a single value. The overheads are so large that some databases, e.g. Oracle, have a SQL cache to avoid some of them.
Solution: Reuse database connections, use prepared statements, and insert many rows per SQL statement (thousands to hundreds of thousands).
2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging all logical and physical modifications to the database ahead of time, and they require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The time actually spent providing the ACID guarantees can be several orders of magnitude higher than the time it takes to copy a 200-byte row into a database page.
Solution: Disable undo/redo logging when you import data into a table. Alternatively, you can (1) drop the isolation level to trade weaker ACID guarantees for lower overhead, or (2) use asynchronous commit (a feature that allows the DB engine to complete an insert before the redo logs are properly hardened to disk).
3. Updating the physical design / database constraints: Inserting a value into a table usually requires updating multiple indexes, materialized views, and/or executing various triggers. These overheads can again easily dominate the insertion time.
Solution: Consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done, you can re-create them. For example, it is significantly faster to create an index from scratch than to populate it through individual insertions.
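Here is the promised SQL sketch of those three solutions; the table, index, and database names are invented, and the exact statements differ per engine, so the comments note which dialect is assumed:

    -- 1. Amortize per-statement overhead: one multi-row insert instead of many single-row ones.
    INSERT INTO events (user_id, event_type, occurred_at)
    VALUES (1, 'click', '2024-01-01 12:00:00'),
           (2, 'view',  '2024-01-01 12:00:01'),
           (3, 'click', '2024-01-01 12:00:02');   -- in practice, thousands of rows per statement

    -- 2. Relax durability / use asynchronous commit during the load (SQL Server syntax shown;
    --    PostgreSQL has synchronous_commit, MySQL/InnoDB has innodb_flush_log_at_trx_commit).
    ALTER DATABASE StagingDb SET DELAYED_DURABILITY = FORCED;

    -- 3. Drop secondary structures for the load and rebuild them afterwards.
    DROP INDEX ix_events_user ON events;
    -- ... bulk insert here ...
    CREATE INDEX ix_events_user ON events (user_id);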
In practice
Now let's see how we can apply these concepts to your particular design. The main issue I see in your case is that the insert requests are sent by many distributed clients, so there is little chance to process the inserts in bulk.
You could consider adding a caching layer in front of whatever database engine you end up with. I don't think memcached is a good fit for such a layer -- memcached is typically used to cache query results, not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source, but not all features are free, so I'm not sure whether you would need to pay for a license. If you cannot use VoltDB, you could look at the MEMORY engine for MySQL or other similar in-memory engines.
Another optimization you can consider is to use a different database for the analytics. Most likely, a database with a high data-ingest rate is quite bad at executing OLAP-style queries, and vice versa. Coming back to my recommendation, VoltDB is no exception and is also suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (i.e., the VoltDB cluster) and moves it in bulk to the backend DB for the analytics (MongoDB or maybe something more efficient). You can then apply all the optimizations above to the bulk data movement, create a rich set of additional index structures to speed up data access, then run your favourite analytical queries and save the results as a new set of tables/materialized views for later access. The import/analysis process can be repeated continuously in the background.
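Purely as an illustration of that background mover, and pretending for the sake of a SQL sketch that both tiers expose a SQL interface and that rows carry a monotonically increasing event_id (all table and column names here are made up):

    -- Copy only rows newer than the high-water mark already present in the analytics store.
    INSERT INTO analytics_events (event_id, user_id, event_type, occurred_at)
    SELECT event_id, user_id, event_type, occurred_at
    FROM   frontend_events
    WHERE  event_id > (SELECT COALESCE(MAX(event_id), 0) FROM analytics_events);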
Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts. So the table is optimized for queries with indexes and such. If you have a table where DML (particularly Inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including a primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.
Fronting your web application with a NoSQL table to handle the high insert rate then moving the data more or less at your leisure to a standard relational db for further processing is a good idea.
I will build a system where I want to reduce single points of failure, and I need a database. Are there any (free) relational database systems that handle multi-master setups well (i.e., where it is easy to add and remove nodes), or is it better to go with a NoSQL database?
From what I have understood, a key-value store will handle this better. What database system do you recommend for a multi-master (cluster) setup?
MySQL's NDB Cluster WILL do this. But it's far from easy to set up and has a lot of gotchas.
And also, its performance is generally fairly sucky and it keeps data in memory (yes, I know they sound contradictory).
Essentially, updates need to acquire distributed locks throughout the cluster (or at least in the storage node group where those table(s) are held)
It is not easy to manage, but you can do some level of hot-add.
Unless you require very rapid failover and consistency, I'd recommend against it.
I'd recommend ignoring multi-master and using an HA MySQL setup instead (e.g. with InnoDB), which is easy to set up and works very well, with typical sub-30-second failover times. This is a master-slave system where the slave cannot even serve reads (but you can add read slaves with replication, provided you don't need them to be completely up to date).
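If you do go the read-slave route, a minimal sketch of pointing a replica at the master looks like this (classic MySQL syntax; the host, user, password, and binlog coordinates are placeholders, and the master must already have binary logging and a replication user set up):

    -- Run on the replica.
    CHANGE MASTER TO
        MASTER_HOST = 'primary.example.com',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'replica_password',
        MASTER_LOG_FILE = 'binlog.000001',
        MASTER_LOG_POS = 4;
    START SLAVE;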
Key-value stores are not necessarily fault tolerant. They are primarily performance tools. Only when data is stored on more than one server is there any form of fault tolerance. If it is just about safety and reducing the single point of failure, the simplest solution is probably to set up a mirroring solution, where you have a mirror that just tracks the master database. When the master somehow fails, you quickly switch over (hopefully automatically).
The complexity of this is much lower, as there is no consistency management needed during normal operation. The mirror is read-only and just tracks the master database. When the master fails, the mirror is switched to master and the link is broken. After the master gets back up, the state between them is inconsistent, and you must make sure to update the original master from the mirror that is now acting as master. Most database systems can handle this scenario, and if you have no insane uptime requirements or a very heavy load, it is the most pragmatic solution.
I think Oracle has nailed this concept. However, if you're a mortal without a Swiss bank account, then maybe you should look into MySQL's NDB Cluster.
I am working with a half dozen DBs. The DBs all have the same schemas, the same SPs, etc. Speaking to the person who originally designed the DBs, a big part of the motivation for using many DBs was efficiency; the alternative would be to add a column to pretty much every table and SP in the database indicating which set of data was being worked in, resulting in one giant (and thus slower) DB instead of several small DBs. In place of having a column to indicate which set of data is being queried, the connection string is used to select which database is being hit.
The only reason I really dislike this organization is that it involves a lot of code duplication and thus hurts maintenance. For example, every time I wish to change a stored procedure, I need to run the alter statement on every database.
One solution I have considered is to combine all of the data into one big database, adding an extra column all over the place to indicate which database the data would be in if I had not combined it. Then I could partition all of the tables by this column's value. In theory, the result of all of this is that the underlying representation of the data itself will be essentially the same as it is now, but without the redundancies in the indexes, schemas, SPs, etc.
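For what it's worth, a rough T-SQL sketch of what that could look like (the column name, table, and boundary values are invented, with one boundary per original database):

    -- Hypothetical: partition the combined table on the new "which database" column.
    CREATE PARTITION FUNCTION pfSourceDb (int)
        AS RANGE LEFT FOR VALUES (1, 2, 3, 4, 5);

    CREATE PARTITION SCHEME psSourceDb
        AS PARTITION pfSourceDb ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Orders
    (
        SourceDbId int       NOT NULL,   -- indicates which original DB the row came from
        OrderId    int       NOT NULL,
        OrderDate  datetime2 NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (SourceDbId, OrderId)
    ) ON psSourceDb (SourceDbId);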
My questions are these:
Is this a good idea? Is there a better way to accomplish this?
Are there any gotchas in doing this?
Will this have any impact on performance?
Everyone will deal with this at some point. My own personal opinion is that multiple databases are a pain in the backside and are not faster. They are a pain because of the maintenance headaches. Adding an extra column in each table as necessary will not slow your process down that much, if indexing is set up properly. And your maintenance will be much easier. Plus, doing transactions across multiple DBs can be a hassle and involve the Distributed Transaction Coordinator (MSDTC).
BTW, using a single database for all the data sets is often called a multi-tenant database. You might want to research this a bit. But I would avoid multiple DBs like this if possible.
I'm of a different mind than Randy.
The database-per-tenant model has its advantages.
For one, maintenance is not really much different whether you have 5 databases or 500. At some point you stop looking at the maintenance of individual databases and look at the set. Yes, you must serialize backups, and you can't be performing index reorgs/rebuilds across all databases at once.
But for code changes across multiple more-or-less identical databases, there are easy ways to script a lot of things to be done to multiple databases without really lifting an extra finger. I use a tool called SQLFarms Combine (now sold by JNetDirect), but there are other offerings such as RedGate MultiScript that I haven't played with.
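If you'd rather script it yourself, here's a hedged T-SQL sketch of the same idea using nothing but a cursor over sys.databases and dynamic SQL (the 'Tenant%' naming filter and the dbo.MyProc body are placeholders for your own naming convention and procedure):

    DECLARE @db sysname, @sql nvarchar(max);
    DECLARE dbs CURSOR FOR
        SELECT name FROM sys.databases WHERE name LIKE N'Tenant%';
    OPEN dbs;
    FETCH NEXT FROM dbs INTO @db;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- USE switches context inside the dynamic batch; the inner sp_executesql
        -- gives ALTER PROCEDURE its own batch, since it must be first in one.
        SET @sql = N'USE ' + QUOTENAME(@db) + N';
            EXEC sp_executesql N''ALTER PROCEDURE dbo.MyProc AS SELECT 1;'';';
        EXEC (@sql);
        FETCH NEXT FROM dbs INTO @db;
    END
    CLOSE dbs;
    DEALLOCATE dbs;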
What I like most about the database-per-tenant model is that when you grow and scale and suddenly need a new database server, it is very easy to move one of the tenants (say, the busiest or fastest growing) to the new server. If everybody is jammed into the same database, extracting only their data becomes quite difficult, especially if downtime is to be minimized. In the database-per-tenant model, you can set up mirroring for just their database and then switch the primary when you're ready.
I'd be in favor of combining these databases. There are other facilities built into SQL Server to account for the potential performance downfalls of a very large database, like additional indexing on a second physical disk, partitioning, clustering, etc. The headache and overhead involved in deploying schema updates to that many different databases can be time consuming when it's easily handled in a single database. I think SQL Server scales really well in cases like this - let the database server do what it's designed to do and provide responsive access to your data. You can focus on application design and leave the storage model to SQL Server.
Also, though this isn't mentioned above, I'd suspect that there's some level of dynamic SQL involved in the applications that use this "many database" model, because you've got to switch between databases based on something you know at runtime; it can't be hard-coded into the application or a configuration file. That means either connection strings or actual SQL statements have to be generated on the fly, and that can be a really big security risk (read about "SQL injection" if you're unfamiliar with the potential risks of dynamic SQL).
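On that last point, the usual T-SQL mitigations are QUOTENAME for identifiers you have to splice in and parameters via sp_executesql for the values; a minimal sketch, with all names invented:

    DECLARE @tenantDb sysname = N'Tenant042';   -- looked up from trusted config, never typed by a user
    DECLARE @sql nvarchar(max) =
        N'SELECT * FROM ' + QUOTENAME(@tenantDb) + N'.dbo.Orders WHERE CustomerId = @cust;';
    EXEC sp_executesql @sql, N'@cust int', @cust = 12345;

The database name is quoted as an identifier and the user-supplied value travels as a parameter, so neither can break out of the statement.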