What is the meaning of "transaction" in OLTP? Is it the same as an ACID transaction, or is it something else?
NoSQL databases like Cassandra don't follow the ACID properties; they are usually discussed in terms of the CAP theorem instead. Which category does Cassandra fall into? Is it OLAP or OLTP?
Despite not supporting true ACID transactions, Cassandra is first and foremost an OLTP database, which by definition means it is about the capture, storage and processing of data from transactions in real time.
Cassandra can be used for OLAP workloads, usually in combination with Apache Spark for analytics queries against the OLTP data. This is one of the really powerful use cases for Cassandra + Spark: real-time analytics, so you can get insights in real time.
If you are running OLAP queries against Cassandra, the recommendation is to deploy in a multi-datacentre configuration such that:
the primary DC is used for OLTP workloads
Spark/OLAP queries are run against a separate second DC
By isolating OLTP workloads in their own DC, users of your application aren't affected by the load that analytics queries impose on the database. This ensures that the performance of your application and the user experience are consistent.
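As an illustration, a keyspace replicated across the two DCs might be defined like this; the keyspace and DC names below are hypothetical placeholders, and the DC names must match your snitch configuration:

    -- Keep 3 replicas in the OLTP DC and 2 in the DC that
    -- Spark reads from.
    CREATE KEYSPACE app_data
      WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc_oltp': 3,
        'dc_analytics': 2
      };

Spark jobs would then connect to the nodes in dc_analytics and use a DC-local consistency level such as LOCAL_ONE or LOCAL_QUORUM, so analytics reads never touch the OLTP DC. Cheers!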
In both contexts, they mean a database transaction.
OLTP emphasizes that it is a transactional system, and such systems usually come with better transaction support, such as ACID guarantees.
Currently, I generate data on a different datastore and replicate it to Snowflake staging; that data then moves to the data warehouse DB through ELT ingestion for analytics purposes. However, this approach can be considered as creating data silos in itself, since we already have three copies of the same data:
Transactional datastore DB
Replicated Snowflake staging
Snowflake data warehouse DB
From a technical architecture point of view, is it a good idea to use Snowflake as a direct datastore for a transactional application (one that does many CRUD operations)? That could help avoid the cost of replication and ingestion.
The main problem I see with this approach is that Snowflake does not enforce any referential integrity (primary keys, foreign keys), so within the CRUD app I either have to always use a MERGE statement or somehow make sure I don't create duplicate records.
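For reference, the kind of MERGE upsert this implies looks roughly like the following sketch; the table and column names are made up for illustration:

    -- Upsert pattern the app is forced into because Snowflake
    -- defines but does not enforce PRIMARY KEY constraints.
    MERGE INTO customers AS t
    USING (SELECT 42 AS id, 'Alice' AS name) AS s
        ON t.id = s.id
    WHEN MATCHED THEN
        UPDATE SET name = s.name
    WHEN NOT MATCHED THEN
        INSERT (id, name) VALUES (s.id, s.name);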
The other problem is that, being in the cloud, the distance (i.e. the network) between the app and Snowflake determines the performance of the transactions, and I want good, consistent performance for my CRUD operations.
Any thoughts/suggestions are much appreciated.
Snowflake, as of today, does not perform well with singleton updates and inserts, which is mostly what we see with transactional databases. I have seen performance degradation when singleton inserts are submitted against Snowflake.
On the other hand, it is heavily optimized for bulk ingestion of structured and semi-structured data and is designed for OLAP warehouse workloads. You can still use it, but you may see the same performance degradation. Also, primary keys can be defined, but they are not enforced.
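To illustrate the difference, the bulk path Snowflake is built for is stage-and-copy rather than row-by-row inserts; the table and stage names below are hypothetical:

    -- Load a batch of files from a stage in one operation,
    -- instead of issuing thousands of singleton INSERTs.
    COPY INTO customers
    FROM @customer_stage
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);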
In my opinion, if you are faced with that challenge, you have the option to use a PostgreSQL DB (open source) in the cloud as your transactional database; it acts as a good complement to Snowflake as the OLAP database.
No. Snowflake isn't good as a transactional/OLTP database for the reasons you've mentioned. Plus, it won't perform well with many individual CRUD operations due to how it structures the data (optimised for OLAP workloads).
Just want to point out that there are benefits to keeping separate databases. For one, you want to isolate your transactional database from your analytics database; otherwise you could significantly affect the performance of the application. Secondly, the data in the transactional database could change, and if you had to reprocess the data for whatever reason, you might not be able to do so. There are many more, but I will stop here :-)
I've been working on a project for a dating-style app, kind of like Tinder/Bumble. I've been debating which database to use: Cassandra or MongoDB. So far I have experience only with MS SQL, MySQL and UniData. I've been looking into Cassandra and MongoDB because of scalability, but I've heard Tinder had issues with their MongoDB, to the point that they had to call in outside help. Even if it is not one of those two, what else would you suggest? Learning a new DB would not be an issue for me, but I am looking for performance and scalability. The main programming language will be C# (if it helps), and preferably I am looking at building this in the cloud (Azure Cosmos DB, AWS DynamoDB or similar). My thoughts are a NoSQL DB because of scalability, but I wouldn't be opposed to selecting an RDBMS if there is a strong reason.
Suggestions, comments, thoughts?
Cassandra has some advantages over MongoDB:
There is no master-slave in Cassandra; any node can receive any query. If the master goes down in MongoDB, you'll face a little downtime.
It is easy to scale Cassandra; adding a node is not a challenge.
Writes are very fast.
Reads by primary key are fast.
Also:
There is no aggregation framework in Cassandra.
Bad performance for very heavy update/delete workloads (accumulating tombstones cause a serious performance impact: http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html)
Not efficient for full-text search applications.
No transactions.
No joins.
Secondary indexes are not equivalent to RDBMS indexes and should not be used often.
So you cannot use Cassandra for every use case. If your data model does not fit Cassandra, consider another DB that fits your requirements; see the CQL sketch below for what a Cassandra-friendly model looks like.
Also take a look at: https://blog.pythian.com/cassandra-use-cases/
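To make the data-modelling point concrete, here is a minimal CQL sketch of query-first design for a hypothetical matching app; the table and column names are assumptions, not a recommendation for your schema:

    -- Query-first design: one table per query pattern.
    -- Partitioning by user means "fetch this user's matches"
    -- reads a single partition.
    CREATE TABLE matches_by_user (
        user_id    uuid,
        matched_at timestamp,
        match_id   uuid,
        PRIMARY KEY (user_id, matched_at)
    ) WITH CLUSTERING ORDER BY (matched_at DESC);

    -- Fast: addresses one partition via the primary key.
    SELECT match_id FROM matches_by_user
    WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
    LIMIT 20;

If your access patterns can't be expressed as lookups like this, that's a sign Cassandra may not be the right fit.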
I'm currently scoping out a potential development project where we will develop an analytical solution to support a production application. Obviously we want to run queries on reasonably up-to-date data, but we don't want the operational risk of hitting the main database directly with (possibly expensive) analytical queries.
To do this I believe we would like to do the following:
Make a replica of a "production" PostgreSQL database into a separate "analytics" database
Add additional tables / views etc to the "analytics" database, which will support the analytics solution only and not be part of the application DB.
Maintain the replica copy of the production data in a reasonably up-to-date fashion (real-time replication is not strictly needed, but no more than a few seconds of lag would be good)
The database will not be excessively large (it is a web/mobile application with a lot of users, but most are not likely to be active at any one time).
Is this likely to be feasible with PostgreSQL, and if so what is the best strategy / replication technique to use?
You cannot use streaming replication for that, because you cannot add tables to a read-only standby database. But you might rethink the requirement not to add the additional tables to the production database.
However, there are other replication techniques such as Slony, Bucardo or Londiste.
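If you are on PostgreSQL 10 or later, built-in logical replication is also worth evaluating, because the subscriber database stays writable and can hold your extra analytics tables. A minimal sketch, assuming hypothetical table names and connection details:

    -- On the production database: publish the tables to replicate.
    CREATE PUBLICATION analytics_pub FOR TABLE users, orders;

    -- On the analytics database (the table definitions must already
    -- exist there; logical replication copies rows, not DDL):
    CREATE SUBSCRIPTION analytics_sub
        CONNECTION 'host=prod-db dbname=app user=replicator'
        PUBLICATION analytics_pub;

On a healthy network the replication lag is typically well under a few seconds, which fits the requirement above.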
One thing that you should keep in mind is that a data model that is suitable for an online transaction processing database is usually not well suited for analytical applications, and you might end up being pretty unhappy with the performance of your analytical queries. For these, the normal thing to do is to build some sort of data warehouse where data are stored in a more denormalized form, usually in something like a star schema.
But for that you cannot have "no more than a few seconds lag". Double-check whether that is really essential; it usually isn't for analytical queries.
How would you scale writes without resorting to sharding (specifically with SQL Server 2008)?
Normally ... avoid indexes and foreign keys on big tables. Every insert/update on an indexed column implies partially rebuilding the index, and sometimes this can be very costly. Of course, you'll have to trade query speed against write speed, but this is a known issue in database design. You can combine this with a NoSQL database that has some sort of query-caching mechanism, perhaps a fast NoSQL system sitting in front of your transactional system.
Another option is to use transactions to do many writes in one go: with autocommit, every statement is its own transaction and forces its own log flush, whereas a single explicit transaction flushes the log once at COMMIT.
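A rough T-SQL illustration of that batching; the table name is hypothetical:

    -- Wrapping a batch of inserts in one explicit transaction
    -- defers the log flush to the single COMMIT, instead of
    -- paying for one flush per autocommitted statement.
    BEGIN TRANSACTION;

    INSERT INTO dbo.Events (EventType, Payload) VALUES ('click', '...');
    INSERT INTO dbo.Events (EventType, Payload) VALUES ('view',  '...');
    -- ... many more rows ...

    COMMIT TRANSACTION;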
Why not shard? The complexity in application code can be avoided by using transparent sharding tools, which take care of the heavy lifting associated with sharding.
Check out ScaleBase for more info.
What is the difference between the three types of replication?
Is replication suitable for data archiving?
What are the replication steps?
The following are the three types of replication in SQL Server:
Transactional replication
Merge replication
Snapshot replication
For more, see http://technet.microsoft.com/en-us/library/ms152531.aspx
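As for the steps, here is a rough sketch of publishing a single table for transactional replication using the replication stored procedures. It is not a complete setup (a Distributor must already be configured), and the database, publication and server names are hypothetical:

    -- Enable the database for publishing.
    EXEC sp_replicationdboption
         @dbname = 'AppDB', @optname = 'publish', @value = 'true';

    USE AppDB;

    -- Create a continuously replicating publication.
    EXEC sp_addpublication
         @publication = 'AppPub', @repl_freq = 'continuous';

    -- Add a table (an "article") to the publication.
    EXEC sp_addarticle
         @publication = 'AppPub', @article = 'Orders',
         @source_object = 'Orders';

    -- Point a subscriber at the publication.
    EXEC sp_addsubscription
         @publication = 'AppPub', @subscriber = 'ReportServer',
         @destination_db = 'AppArchive';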
Replication can be used for archiving purposes as well, but only with some additional mechanisms. Most of the time I have seen it used in data-warehousing scenarios to reduce the load on the OLTP system.
See http://msdn.microsoft.com/en-us/library/ms151198.aspx
From MSDN:
Transactional replication is typically used in server-to-server scenarios that require high throughput, including: improving scalability and availability; data warehousing and reporting; integrating data from multiple sites; integrating heterogeneous data; and offloading batch processing.
Merge replication is primarily designed for mobile applications or distributed server applications that have possible data conflicts. Common scenarios include: exchanging data with mobile users; consumer point of sale (POS) applications; and integration of data from multiple sites.
Snapshot replication is used to provide the initial data set for transactional and merge replication; it can also be used when complete refreshes of data are appropriate.
With these three types of replication, SQL Server provides a powerful and flexible system for synchronizing data across your enterprise.
There are lots of other articles on the Internet and lots of good books on SQL Server. It's a bit of a broad question to ask what it is AND how to implement it.