What's the difference between a transaction manager and a database manager?

Reading about both, it seems that they have similar responsibilities: managing the sharing and integrity of resources, as well as prioritizing execution. But I cannot seem to find how they differ. Can someone clarify this misunderstanding?
Thank You

In addition to what Oded already said:
A transaction manager manages transactions - and a transaction can include/address other resources than just databases. I have given the example of a printer on several occasions before.
A database manager manages data - and not necessarily in a transactional way. There is a very popular SQL system whose 1.0 version did not have commit/rollback; in other words, it did not offer transactional functionality and thus did not offer much support for data integrity.
In practice the distinction is mostly moot, however, because:
a great many real-life transactions involve no other recoverable resources than just the database,
in order to guarantee data consistency, DBMSs cannot avoid having to offer most if not all of the functionality of transactions.

A transaction manager manages transactions - these can be distributed (i.e. involving several databases/systems).
A database manager deals with a single database - managing it on disk, memory consumption, query parsing, etc.
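To make the distinction concrete, here is a minimal sketch in T-SQL of a transaction that a transaction manager (MS DTC in this case) coordinates across two databases, which is something a single database manager cannot do on its own. The linked server RemoteSrv and the Accounts tables are hypothetical:

-- Sketch: a distributed transaction in T-SQL. The transaction manager (MS DTC)
-- coordinates two resource managers: the local database and a linked server.
BEGIN DISTRIBUTED TRANSACTION;

    -- debit an account in the local database
    UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountId = 42;

    -- credit an account in a second database on another server
    UPDATE RemoteSrv.Bank.dbo.Accounts SET Balance = Balance + 100 WHERE AccountId = 42;

COMMIT TRANSACTION;  -- MS DTC runs two-phase commit across both participants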

Just to ensure understanding:
Transaction Manager deals with multiple levels of control and the physical database.
Database Manager deals with the direct access of the physical database.
I would also like to add to both of these answers that the transaction manager is also responsible for enforcing ACID (Atomicity, Consistency, Isolation, and Durability). I was pretty confused about this as well.

Related

In a horizontally scalable architecture, how to deal with database concurrency when multiple instances read/write to the same row simultaneously?

I know SQL Server is very robust in this sense (transactions and locking), but how would that work with NoSQL databases like AWS DocumentDB with Mongo API?
There's no shortcut to diving in and learning each individual system's concurrency model and offerings :/
These guarantees can be found by searching for "Isolation Levels" or "Default Isolation Levels" for your target database.
https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/
https://www.postgresql.org/docs/7.2/xact-read-committed.html
https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
One thing to note is that PostgreSQL's default isolation level is "Read Committed" (MySQL's InnoDB defaults to "Repeatable Read"), and neither default protects a plain read-modify-write pattern from lost updates, which can lead to incorrect applications in concurrent environments for common types of queries.
For example, suppose you have a multi-threaded web application that allows users to update their account balance. If two threads both fetch the balance, compute a new value, and write it back, the last writer silently overwrites the first thread's update. This is described in detail in each of the documents above.
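Here is a minimal sketch of the problem and two common fixes, assuming a hypothetical accounts table; the FOR UPDATE syntax is as in PostgreSQL/MySQL:

-- Lost update with a read-modify-write pattern:
--   Thread A: SELECT balance FROM accounts WHERE id = 1;        -- reads 100
--   Thread B: SELECT balance FROM accounts WHERE id = 1;        -- also reads 100
--   Thread A: UPDATE accounts SET balance = 150 WHERE id = 1;   -- 100 + 50
--   Thread B: UPDATE accounts SET balance = 120 WHERE id = 1;   -- 100 + 20, A's deposit is lost

-- Fix 1: do the arithmetic in a single atomic statement.
UPDATE accounts SET balance = balance + 50 WHERE id = 1;

-- Fix 2: lock the row for the duration of the read-modify-write.
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;  -- blocks concurrent writers to this row
UPDATE accounts SET balance = 150 WHERE id = 1;
COMMIT;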
In Amazon DocumentDB, all CRUD statements (findAndModify, update, insert, delete) guarantee atomicity and consistency, even for operations that modify multiple documents. For more information, see Implicit Transactions.
Additionally, reads from an Amazon DocumentDB cluster’s primary instance are strongly consistent under normal operating conditions and have read-after-write consistency. For more information, see Read Preference Options.

Should I use in-memory SQL (Hekaton) for a queue messaging system?

I work with a platform that has a messaging system that uses SQL Server tables as queues.
That system was based on this: Using Tables as queues
At the moment we are facing some scalability issues, since this distributed scheme is mainly based on SQL locks and disk operations to guarantee the durability/coherence of the data.
In order to solve the disk-based I/O bottleneck and to improve the distribution logic, we are thinking of changing the disk-based SQL tables to In-Memory OLTP (Hekaton), available in SQL Server 2014 and 2016.
I've read some material about Hekaton already, but I'm still not sure whether it is possible to implement those queues with memory-optimized tables, or whether that is the best approach.
Most of those queues implement pessimistic concurrency, while Hekaton uses a lock-free, optimistic concurrency model (based on multi-versioning). Is it "always" (I know that is a bad word) possible to change pessimistic concurrency into optimistic concurrency, for example for the above queues?
Is Hekaton made for many inserts/deletes (enqueue/dequeue), ordered rows (FIFO queues), and large variations in table size (workload variations on the server will grow and shrink the queues)? Will it be possible to keep statistics updated properly so that natively compiled stored procedures keep performing well?
I feel like natively compiled stored procedures will improve performance a lot, but I'm not sure whether this kind of implementation (correlated FIFO queues) is a good fit for Hekaton, since I'm not finding any examples of "in-memory queue" implementations using Hekaton.
You can implement what you described in Hekaton - as you mentioned, the app will have to take care of retrying in case a transaction is aborted due to a conflict on the same row. Having said that, you also have to consider that SQL Server 2014 does not support large binary objects in memory-optimized tables; you will need SQL Server 2016 or a workaround like the one we use for ASP.NET session state:
http://blogs.msdn.com/b/kenkilty/archive/2014/07/03/asp-net-session-state-using-sql-sever-in-memory.aspx
Hekaton is designed for OLTP, that is, lots of inserts, updates, and deletes.
Plan the memory requirements ahead of time:
https://msdn.microsoft.com/en-us/library/dn133186.aspx
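For what it's worth, here is a minimal sketch of the kind of memory-optimized queue being discussed, assuming SQL Server 2016, a database that already has a MEMORY_OPTIMIZED_DATA filegroup, and hypothetical object names. As noted above, the caller must still be prepared to retry when a dequeue hits a write conflict:

-- Memory-optimized queue table; the range (non-hash) primary key supports ordered FIFO scans.
CREATE TABLE dbo.MessageQueue
(
    MessageId BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    Payload   NVARCHAR(4000) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO

CREATE PROCEDURE dbo.EnqueueMessage @Payload NVARCHAR(4000)
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    INSERT INTO dbo.MessageQueue (Payload) VALUES (@Payload);
END;
GO

-- Dequeue the oldest message; an empty queue leaves the output parameters NULL.
CREATE PROCEDURE dbo.DequeueMessage @MessageId BIGINT OUTPUT, @Payload NVARCHAR(4000) OUTPUT
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    SELECT TOP (1) @MessageId = MessageId, @Payload = Payload
    FROM dbo.MessageQueue
    ORDER BY MessageId;

    DELETE FROM dbo.MessageQueue WHERE MessageId = @MessageId;
END;
GO

Two concurrent dequeuers can read the same oldest row; one delete wins and the other is aborted with a write conflict and must retry, which is exactly the retry logic mentioned above.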

Architecting a high performing "inserting solution"

I am tasked with putting together a solution that can handle a high level of inserts into a database. There will be many AJAX-type calls from web pages. It is not just one web site/page, but several different ones.
It will be dealing with tracking people's behavior on a web site, triggered by various JavaScript events, etc.
It is important for the solution to be able to handle the heavy database inserting load.
After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.
We are initially looking at using the MEAN stack with MongoDB and migrating some data to MySQL for reporting purposes. I am also wondering about using some sort of queueing before inserting into the DB, or a caching layer such as memcached.
I didn't manage to find much help on this elsewhere. I did see this post, but it is now close to 5 years old, feels a bit outdated, and doesn't quite ask the same questions.
Your thoughts and comments are most appreciated. Thanks.
Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?
It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.
Some initial theory
To understand how you can optimize for a heavy insert workload, I suggest first understanding the main overheads involved in inserting data into a database. Once the various overheads are understood, all kinds of optimizations will come to you naturally. The bonus is that you will have more confidence in the solution, you will know more about databases, and you can apply these optimizations to multiple engines (MySQL, PostgreSQL, Oracle, etc.).
I'll first make a non-exhaustive list of insertion overheads and then show simple solutions to avoid them.
1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on.
And once the query is accepted, it needs to be parsed, its grammar validated, data types parsed and validated, the objects (tables, indexes, etc.) referenced by the query looked up and their access permissions checked, and so on. All of these steps (and I'm sure I forgot quite a few things here) represent significant overheads when inserting a single value. The overheads are so large that some databases, e.g. Oracle, have a SQL cache to avoid some of them.
Solution: Reuse database connections, use prepared statements, and insert many rows with every SQL statement (thousands to hundreds of thousands).
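A small sketch of both ideas, using PostgreSQL syntax and a hypothetical events table (real batches would carry thousands of rows per statement):

-- One round trip and one parse for many rows instead of one per row.
INSERT INTO events (user_id, event_type, occurred_at) VALUES
    (101, 'click',    '2016-01-01 10:00:00'),
    (102, 'pageview', '2016-01-01 10:00:01'),
    (103, 'click',    '2016-01-01 10:00:02');

-- With a prepared statement the parse/plan cost is paid once per session.
PREPARE ins_event (INT, TEXT, TIMESTAMP) AS
    INSERT INTO events (user_id, event_type, occurred_at) VALUES ($1, $2, $3);
EXECUTE ins_event(104, 'click', '2016-01-01 10:00:03');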
2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging all logical and physical modifications to the database ahead of time, and they require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The actual time required to provide the ACID guarantees can be several orders of magnitude higher than the time it takes to actually copy a 200-byte row into a database page.
Solution: Disable undo/redo logging when you import data into a table. Alternatively, you could (1) drop the isolation level to trade weaker ACID guarantees for lower overhead, or (2) use asynchronous commit (a feature that allows the DB engine to complete an insert before the redo logs are properly hardened to disk).
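For instance, in PostgreSQL (other engines have similar knobs) these options might look like the sketch below, again against the hypothetical events table; both trade durability for speed, so use them deliberately:

-- Commit returns before the WAL is flushed to disk (a crash can lose the last few commits).
SET synchronous_commit = off;

-- For one-off imports, an unlogged staging table skips WAL entirely
-- (its contents are lost after a crash, so copy the data out afterwards).
CREATE UNLOGGED TABLE events_staging (LIKE events INCLUDING DEFAULTS);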
3. Updating the physical design / database constraints: Inserting a value into a table usually requires updating multiple indexes, materialized views, and/or executing various triggers. These overheads can again easily dominate the insertion time.
Solution: You can consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done, you can re-create them. For example, it is significantly faster to create an index from scratch than to populate it through individual insertions.
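A sketch against the same hypothetical events table, with an assumed index name:

DROP INDEX idx_events_user_id;                         -- drop secondary indexes before the bulk load
-- ... run the bulk inserts / COPY here ...
CREATE INDEX idx_events_user_id ON events (user_id);   -- rebuild once, from scratch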
In practice
Now let's see how we can apply these concepts to your particular design. The main issue I see in your case is that the insert requests are sent by many distributed clients, so there is little opportunity for bulk processing of the inserts.
You could consider adding a caching layer in front of whatever database engine you end up having. I don't think memcached is a good fit for such a layer -- memcached is typically used to cache query results, not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source, but not all features are free, so I'm not sure whether you would need to pay for a license. If you cannot use VoltDB, you could look at the memory engine for MySQL or other similar in-memory engines.
Another optimization you can consider is to use a different database for the analytics. Most likely, a database with a high data ingest rate is quite bad at executing OLAP-style queries, and vice versa. Coming back to my recommendation, VoltDB is no exception and is also suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (i.e. the VoltDB cluster) and moves it in bulk to the backend DB for the analytics (MongoDB or maybe something more efficient). You can then apply all the optimizations above for the bulk data movement, create a rich set of additional index structures to speed up data access, run your favourite analytical queries, and save the results as a new set of tables/materialized views for later access. The import/analysis process can be repeated continuously in the background.
Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts. So the table is optimized for queries with indexes and such. If you have a table where DML (particularly Inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including a primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.
Fronting your web application with a NoSQL table to handle the high insert rate then moving the data more or less at your leisure to a standard relational db for further processing is a good idea.

What's the best DB to store banking transactions?

We are planning to create a web app to store banking transactions for customers, e.g. purchases, transfers, etc., and to allow them to tag/categorize each transaction.
Could someone point us to the best DB for this purpose? It needs to scale horizontally and we also need to perform analysis on all transactions.
Thanks
The best database to store banking transactions is the one the banks use, DB2/z.
But, since I doubt you'd be able to afford a System z mainframe, that's probably not an option. That doesn't make it any less the best database of course.
If, however, you're talking about storing transactions for Joe Bloggs or Dodgy Brothers Rug Emporium (as opposed to the two hundred million or so customers of ICBC), pretty well any database will be up to the task - Oracle (despite its inability to differentiate NULLs from empty strings), SQL Server, MySQL, PostgreSQL, probably even SQLite.
I'm going to start this by saying it's almost impossible to recommend a system based on what you've described. It could be for any number of uses, ranging from mission-critical real-time financial data that needs to be there and needs to be accurate, through to a web app that sucks in financial records from a bank/credit-card statement and lets the user annotate them, in which case it isn't as sensitive.
If you're storing mission critical, sensitive data, I'd go with a commercial option that includes significant support. Also a DBA would be a good idea.
Oracle or MS SQL would be my inclination, and probably Oracle over MS SQL because of its multi-platform support. If you're happy to run on Windows, then MS SQL is fine.
If you're storing existing transactions that can be tagged (à la Blippy), then any database would be sufficient. If you're thinking of scaling this out to the nth degree, you might like one of the document-database flavours of the month (MongoDB, Couch, etc.).
Really, I think the question should be reconsidered in the context of what your application will do, not the fact that it happens to do it with financial data. Financial data may require additional security or additional accuracy checks, and that forms part of what the system will do, as does the way the user interacts with your web app, etc.
This may not answer your question directly, but here is what I have experienced.
I think it's really about how you'd save your banking transactions. Most database vendors provide a sufficient amount of database performance, so all you have to do is choose one over another.
What you are left with is the actual information to be saved (besides the schema). You might think about using a database encryption option, but that is not free in your case: because you are talking about transactions, I assume there are quite a lot of transactions coming in, and you are doing a large amount of reads for your reporting (besides the writes), possibly for mining, etc.
Usually (in SQL Server), with encryption enabled, any data that is written to the database file is encrypted. Snapshots and backups also use encryption, and the transaction log is protected as well, so it can hit the performance that you might desire (a sketch of enabling this kind of encryption appears after the article links below).
So I see your question really boiling down to: how do you protect sensitive data?
By the way, I have deployed solutions with Oracle, SQL Server, and even Sybase as back ends, with transactions pouring in from ATMs, and what I really look for is performance, besides security. Except for minor limitations of one over another, they are all much the same.
The following articles might help:
Database security: protecting sensitive and critical information
Using One-Way Functions to Protect Sensitive Information in SQL Server Databases
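As a concrete illustration of the encryption option mentioned above, enabling Transparent Data Encryption in SQL Server looks roughly like this (a sketch; the password, certificate, and database names are placeholders):

USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'a-strong-password-here';  -- placeholder password
CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';
GO
USE BankingDb;
CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER CERTIFICATE TdeCert;
-- From this point on, data files, the transaction log, and backups are encrypted at rest.
ALTER DATABASE BankingDb SET ENCRYPTION ON;
GO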

What alternatives do I have if I want a distributed multi-master database?

I will build a system where I want to reduce single points of failure, and I need a database. Are there any (free) relational database systems that handle multi-master setups well (i.e. where it is easy to add and remove nodes), or is it better to go with a NoSQL database?
From what I have understood, a key-value store will handle this better. What database system would you recommend for a multi-master (cluster) setup?
MySQL's NDB Cluster WILL do this. But it's far from easy to set up and has a lot of gotchas.
And also, its performance is generally fairly sucky and it keeps data in memory (yes, I know they sound contradictory).
Essentially, updates need to acquire distributed locks throughout the cluster (or at least in the storage node group where those table(s) are held)
It is not easy to manage, but you can do some level of hot-add.
Unless you require very rapid failover and consistency, I'd recommend against it.
I'd recommend ignoring multi-master and using an HA MySQL setup instead (with e.g. InnoDB), which is easy to set up and works very well, with typical sub-30-second failover times. This is a master-slave system where the standby slave does not even serve reads (but you can add read slaves with replication, provided you don't need them to be completely up to date).
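For the read-slave part, adding a replica in MySQL is roughly the following sketch (pre-8.0 CHANGE MASTER TO syntax; the host, credentials, and binlog coordinates are hypothetical):

-- On the replica, point it at the primary and start replicating.
CHANGE MASTER TO
    MASTER_HOST = 'primary.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'repl-password',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS = 4;
START SLAVE;

-- Check replication lag before sending it read traffic (mysql client syntax).
SHOW SLAVE STATUS\G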
Key-value stores are not necessarily fault tolerant. They are primarily performance tools. Only when data is stored on more than one server is there any form of fault tolerance. If it is just about safety, i.e. reducing single points of failure, the simplest solution is probably to set up a mirroring solution, where you have a mirror that just tracks the master database. When the master somehow fails, you quickly switch over (hopefully automatically).
The complexity of this is much lower as there is no consistency management needed during normal operation. The mirror is read-only and just tracks the master database. When the master fails, the mirror is switched to master and the link broken. After the master gets back up the state between them is inconsistent and you must make sure to update the original master from the mirror now acting as master. Most database systems can handle this scenario, and if you have no insane uptime requirements or a very heavy load it is the most pragmatic solution.
I think Oracle has nailed this concept. However, if you're a mortal without a Swiss bank account, then maybe you should look into MySQL's NDB Cluster.
