I've been working on a project for a dating-like app, along the lines of Tinder/Bumble, and I've been debating which database to use: Cassandra or MongoDB. So far I only have experience with MS SQL, MySQL and UniData. I've been looking into Cassandra and MongoDB because of scalability, but I've heard Tinder had issues with their MongoDB and had to call in outside help. If neither of those two, what else would you suggest? Learning a new DB would not be an issue for me; what I am looking for is performance and scalability. The main programming language will be C# (if it helps), and I would prefer to build this in the cloud (Azure Cosmos DB, AWS DynamoDB or similar). I'm leaning towards a NoSQL DB because of scalability, but I wouldn't be opposed to choosing an RDBMS if there is a strong reason.
Suggestions, comments, thoughts?
Cassandra has some advantages over MongoDB:
There is no master-slave setup in Cassandra; any node can receive any query. If the master goes down in MongoDB, you'll face a little downtime.
It is easy to scale Cassandra; adding a node is not a challenge.
Writes are very fast.
Reads by primary key are fast.
It also has drawbacks:
There is no aggregation in Cassandra.
Performance is bad under very heavy update/delete workloads, because the growing number of tombstones slows reads down: http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
It is not efficient for full-text search applications.
No transactions.
No joins.
Secondary indexes are not equivalent to RDBMS indexes and should not be used very often.
So you cannot use Cassandra for every use case. If your data model does not fit Cassandra, consider another DB that fits your requirements.
Also take a look at : https://blog.pythian.com/cassandra-use-cases/
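To make the tombstone point a bit more concrete, here is a toy sketch in Python. It is not Cassandra's actual storage code; `ToyTable` and the `match:` keys are made up for illustration. The only assumption it models is that a delete is written as a marker rather than removing data in place, so reads have to step over those markers until they are purged (in real Cassandra that happens during compaction, after `gc_grace_seconds`).

```python
# Toy model of tombstones in an append-only store (hypothetical, not Cassandra code).
TOMBSTONE = object()  # marker written in place of a deleted value


class ToyTable:
    def __init__(self):
        # Append-only list of (key, value) cells, oldest first.
        self.cells = []

    def write(self, key, value):
        self.cells.append((key, value))

    def delete(self, key):
        # A delete is just another write: it appends a tombstone marker.
        self.cells.append((key, TOMBSTONE))

    def read_all(self):
        """Return the live rows plus the number of tombstones scanned."""
        latest = {}
        for key, value in self.cells:  # later cells overwrite earlier ones
            latest[key] = value
        live = {k: v for k, v in latest.items() if v is not TOMBSTONE}
        tombstones = sum(1 for _, v in self.cells if v is TOMBSTONE)
        return live, tombstones


table = ToyTable()
for i in range(1000):
    table.write(f"match:{i}", "pending")
for i in range(990):
    table.delete(f"match:{i}")

live_rows, tombstones_scanned = table.read_all()
print(len(live_rows), tombstones_scanned)
```

The read returns only a handful of live rows but still has to walk past every tombstone, which is roughly why delete-heavy workloads (queues, "swipe then discard" patterns) tend to hurt.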
Currently, I generate data in a separate datastore and replicate it to Snowflake staging; the data then moves to the data warehouse DB through ELT ingestion for analytics purposes. However, this approach arguably creates data silos of its own, since we already have 3 copies of the same data:
Transactional data-store DB
Replicated snowflake staging
Snowflake Data Warehouse DB
From a technical architecture point of view, is it a good idea to use Snowflake directly as the datastore for a transactional application (one that does many CRUD operations)? That might avoid the cost of replication and ingestion.
The main problem I see with this approach is that Snowflake does not enforce any referential integrity (primary keys, foreign keys), so within the CRUD app I would either have to always use a MERGE statement or otherwise make sure I don't create duplicate records.
The other problem is that, being in the cloud, the distance (i.e. the network) between the app and Snowflake determines the performance of the transactions, and I want good, consistent performance for my CRUD operations.
Any thoughts/suggestions are much appreciated.
Snowflake as of today does not perform well with singleton updates and inserts, which is mostly what we see with transactional databases. I have seen performance degrade when singleton inserts are submitted against Snowflake.
On the other hand, it is very well optimized for bulk ingestion of both structured and unstructured data and is designed for OLAP warehouses. You can still use it for transactional work, but you may see that same performance degradation. Also, primary keys can be defined but they are not enforced.
In my opinion, if you are faced with that challenge, you have the option of using a PostgreSQL DB (open source) in the cloud as your transactional database; it is a good complement to Snowflake as the OLAP database.
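To illustrate the "defined but not enforced" point, here is a small sketch using Python's built-in `sqlite3` purely as a stand-in (it is not Snowflake, and the `users` table and helper names are made up). The table is deliberately created without a key constraint to mimic a store that does not enforce one; the `upsert` helper plays the role that a MERGE statement would play.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY constraint: stands in for a key that is declared but not enforced.
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")


def naive_insert(user_id, name):
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))


def upsert(user_id, name):
    # Update first; insert only if no row with this id existed.
    cur = conn.execute("UPDATE users SET name = ? WHERE id = ?", (name, user_id))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))


naive_insert(1, "alice")
naive_insert(1, "alice")  # silently creates a duplicate row
dupes = conn.execute("SELECT COUNT(*) FROM users WHERE id = 1").fetchone()[0]

upsert(2, "bob")
upsert(2, "robert")  # updates in place, no duplicate
rows_for_2 = conn.execute("SELECT name FROM users WHERE id = 2").fetchall()
print(dupes, rows_for_2)
```

With an enforcing OLTP database the second `naive_insert` would simply fail, which is one less thing the application has to get right on every write.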
No. Snowflake isn't good as a transactional / OLTP database for the reasons you've mentioned. Plus, it won't perform well with many individual CRUD operations due to how they structure the data (optimised for OLAP workloads).
Just want to point out that there are benefits to keeping separate databases. For one, you want to isolate your transactional database from your analytics database, otherwise analytical queries could significantly affect the performance of the application. Secondly, the data in the transactional database can change, so if you had to reprocess the data for whatever reason you might not be able to do so. There are many more, but I will stop here :-)
What is the meaning of "transaction" in OLTP? Is it the same as an ACID transaction, or is it something else?
NoSQL databases like Cassandra don't follow the ACID properties; they are usually described in terms of CAP instead. Which category does Cassandra fall into? Is it OLAP or OLTP?
Despite not supporting true ACID, Cassandra is first and foremost an OLTP database, which by definition means it's about the capture, storage and processing of data from transactions in real time.
Cassandra can be used for OLAP workloads usually in combination with Apache Spark for analytics queries against the OLTP data. This is one of the really powerful use cases for Cassandra + Spark -- real time analytics so you can get insights in real time.
If you are running OLAP queries against Cassandra, the recommendation is to deploy in a multi-datacentre configuration such that:
the primary DC is used for OLTP workloads
Spark/OLAP queries are run against a separate second DC
By isolating OLTP workloads in their own DC, users of your application don't get affected by the load that analytics queries impose on the database. This ensures that the performance of your application and the user experience stay consistent. Cheers!
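As a rough sketch of what the two-DC layout looks like at the schema level, the snippet below just builds the CQL for a keyspace replicated with `NetworkTopologyStrategy`. The keyspace name, the datacentre names (`dc_oltp`, `dc_analytics`) and the replica counts are assumptions for illustration; the DC names have to match whatever your cluster actually reports.

```python
def keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that replicates to each named DC."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {count}" for dc, count in replicas_per_dc.items()]
    return (
        "CREATE KEYSPACE IF NOT EXISTS " + name
        + " WITH replication = {" + ", ".join(parts) + "};"
    )


# Hypothetical names: one DC serves the application, the other serves Spark.
cql = keyspace_cql("dating_app", {"dc_oltp": 3, "dc_analytics": 2})
print(cql)
```

The application would then typically connect with a DC-aware load-balancing policy and a local consistency level such as LOCAL_QUORUM so its requests stay in the OLTP DC, while Spark jobs are pointed at the analytics DC.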
In both contexts, it means a database transaction.
OLTP emphasizes that it's a transactional system, and such systems usually come with better transaction support, such as ACID.
I'm currently scoping out a potential development project in which we will develop an analytical solution to support a production application. Obviously we want to run queries on reasonably up-to-date data, but we don't want the operational risk of querying the main database directly with (possibly expensive) analytical queries.
To do this I believe we would like to do the following:
Make a replica of a "production" PostgreSQL database into a separate "analytics" database
Add additional tables / views etc to the "analytics" database, which will support the analytics solution only and not be part of the application DB.
Maintain the replica copy of the production data in a reasonably up-to-date fashion (realtime replication not strictly needed, but no more than a few seconds lag would be good)
The database will not be excessively large (it is a web/mobile application with a lot of users, but most of them are unlikely to be active at any one time).
Is this likely to be feasible with PostgreSQL, and if so what is the best strategy / replication technique to use?
You cannot use streaming replication for that, because you cannot add tables to a read-only database. But you might rethink the requirement to not add the additional tables to the production database.
However, there are other replication techniques like Slony, Bucardo or Londiste.
One thing that you should keep in mind is that a data model that is suitable for an online transaction processing database is usually not well suited for analytical applications, and you might end up being pretty unhappy with the performance of your analytical queries. For these, the normal thing to do is to build some sort of data warehouse where data are stored in a more denormalized form, usually in something like a star schema.
But for that you cannot have “no more than a few seconds lag”. Double check whether that is really essential; it usually isn't for analytical queries.
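As a small illustration of the denormalization idea, here is a sketch using Python's built-in `sqlite3` as a stand-in for the real databases. All table and column names are hypothetical. It builds a pre-joined, pre-aggregated reporting table from normalized OLTP-style tables, the way a periodic batch job might; it is a simplified flat table, not a full star schema with separate dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized OLTP-style tables (hypothetical).
    CREATE TABLE users  (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER,
                         ordered_on TEXT, amount REAL);
    INSERT INTO users  VALUES (1, 'DE'), (2, 'US');
    INSERT INTO orders VALUES (10, 1, '2024-01-05', 20.0),
                              (11, 1, '2024-01-05', 5.0),
                              (12, 2, '2024-01-06', 7.5);

    -- Denormalized reporting table, as a periodic batch job might build it.
    CREATE TABLE fact_sales AS
    SELECT o.ordered_on AS day, u.country AS country,
           COUNT(*) AS order_count, SUM(o.amount) AS revenue
    FROM orders o JOIN users u ON u.id = o.user_id
    GROUP BY o.ordered_on, u.country;
""")

# Analytical queries read the flat table and never touch the OLTP tables.
report = conn.execute(
    "SELECT day, country, order_count, revenue FROM fact_sales ORDER BY day, country"
).fetchall()
print(report)
```

Because the reporting table is rebuilt or refreshed in batches, its freshness is bounded by the job schedule, which is where the trade-off against the "few seconds lag" requirement comes from.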
We have multiple databases which we query to generate reports. Since we have to write complex queries with a lot of joins etc., is it a good idea to load the data into Cassandra, Hadoop or Elasticsearch (via daily load jobs or incremental updates) and query that database for all of these tasks?
Which would be the preferred choice: Cassandra, Hadoop, Elasticsearch or MongoDB?
We also want to build a Web UI for reporting and analytics on the consolidated database.
I cannot recommend MongoDB. It is subpar for big data analysis: its Map-Reduce implementation is poor, slow and single-threaded. Cassandra + Hadoop or HDFS + Hadoop is your choice. With Hadoop you are not limited in storage type; you can flush your data to HDFS (or store it there initially) and iterate over it with MapReduce.
If you need durability, look at Cassandra. Cassandra is very easy to maintain and very reliable. I believe Cassandra is the most reliable NoSQL DB in the world. It is fully horizontally scalable: no name nodes, no master/slaves, all nodes are equal.
With Elasticsearch you can only do search. If you have a lot of data and need analytics, you should look towards Hadoop and MapReduce.
With Hadoop you can start using Hive or Pig, the most powerful map-reduce abstractions I've ever seen. With Hadoop you can even start thinking about migrating to Spark/Shark.
Cassandra would be the best if your choice is limited to those three, because writing joins in MapReduce programs takes a lot of effort: you often have to chain multiple MapReduce programs to get a single join right. If your options are open, Apache Hive can be used for non-interactive or reporting applications, as it supports quite a number of SQL features such as joins, GROUP BY, ORDER BY etc. Hive supports SQL-like queries, so there wouldn't be much difference from traditional SQL.
You could also consider Apache Drill, Hortonworks Stinger and Cloudera Impala for interactive reporting applications.
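To give a feel for why hand-written joins in MapReduce take effort, here is a minimal in-memory sketch of a reduce-side join in plain Python. It is not Hadoop code; the `users`/`orders` data and the function names are made up. The map step tags each record with its source, the shuffle groups by join key, and the reducer pairs the two sides up.

```python
from collections import defaultdict

# Hypothetical inputs: (user_id, name) and (user_id, item).
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "lamp"), (3, "orphan")]


def map_phase():
    # Tag every record with its source so the reducer can tell them apart.
    for user_id, name in users:
        yield user_id, ("U", name)
    for user_id, item in orders:
        yield user_id, ("O", item)


def shuffle(pairs):
    # Group all values by join key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Inner join: emit one row per (user, order) pair that shares a key.
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "U"]
        items = [v for tag, v in groups[key] if tag == "O"]
        for name in names:
            for item in items:
                yield key, name, item


joined = list(reduce_phase(shuffle(map_phase())))
print(joined)
```

All of that corresponds to a single `SELECT ... FROM users JOIN orders USING (user_id)` in Hive or any SQL engine, and a query joining several tables would need several such passes chained together.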
I am a newbie in NoSQL databases and this may sound a bit stupid but I was wondering if NoSQL databases use or need indexes?
If yes, how do I create or manage them? Any links?
Thanks
CouchDB and MongoDB definitely yes. I mentioned that in my book:
http://use-the-index-luke.com/sql/testing-scalability/response-time-throughput-scaling-horizontal
Here are the respective docs:
http://guide.couchdb.org/draft/btree.html
http://www.mongodb.org/display/DOCS/Indexes
NoSQL is, however, too fragmented to give a definite "yes, all NoSQL systems need indexes", I believe. Most systems require and provide indexes, but not at the level most SQL databases do. Recently, the Cassandra people were proudly introducing secondary indexes, i.e., more than a single clustered index.
http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes (well, not so recently as I remember)
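For a rough idea of what a secondary index buys you in a key-value or document store, here is a toy sketch in Python. `TinyStore` and its `city` field are hypothetical and do not reflect how any particular database implements indexes. The primary index maps key to document; the secondary index maps a field value back to the keys that hold it, and has to be maintained on every write.

```python
from collections import defaultdict


class TinyStore:
    def __init__(self):
        self.rows = {}                   # primary index: key -> document
        self.by_city = defaultdict(set)  # secondary index: city -> keys

    def put(self, key, doc):
        old = self.rows.get(key)
        if old is not None:
            # Keep the secondary index in sync when a document changes.
            self.by_city[old["city"]].discard(key)
        self.rows[key] = doc
        self.by_city[doc["city"]].add(key)

    def find_by_city_scan(self, city):
        # Without an index: inspect every document.
        return sorted(k for k, d in self.rows.items() if d["city"] == city)

    def find_by_city_indexed(self, city):
        # With the index: a single lookup, independent of table size.
        return sorted(self.by_city.get(city, ()))


store = TinyStore()
store.put("u1", {"city": "Berlin"})
store.put("u2", {"city": "Paris"})
store.put("u3", {"city": "Berlin"})
store.put("u1", {"city": "Paris"})  # u1 moves; the index must follow

berlin = store.find_by_city_indexed("Berlin")
paris = store.find_by_city_indexed("Paris")
print(berlin, paris)
```

Both lookups return the same keys; the difference is that the scan touches every document while the indexed version pays its cost at write time instead.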
NoSQL databases definitely need indexes.
However, in most popular databases you do not need to maintain the indexes yourself: the NoSQL communities keep adding features to meet current needs, in the spirit of "Code Less, Get More".