Eventually consistent document store database similar to Cassandra - database

I'm looking for an open source data store that scales as easily as Cassandra but whose data can be queried as documents, as in MongoDB.
Are there currently any databases out there that do this?

On this website, http://nosql-database.org, you can find a list of many NoSQL databases sorted by datastore type; you should check the document stores there.
I'm not naming any specific database to avoid a biased/opinion-based answer, but if you are interested in a data store that is as scalable as Cassandra, you probably want to check those that use a master-master/multi-master/masterless architecture (call it what you like, the idea is the same), where both writes and reads can be split among all nodes in the cluster.
I know Cassandra is optimized for writes rather than reads, but without further details in the question I can't refine the answer any further.
Update:
Disclaimer: I haven't used CouchDB at all, and haven't tested its performance either.
Since you spotted CouchDB, I'll add what I've found in the official documentation, in the distributed database and replication section:
CouchDB is a peer-based distributed database system. It allows users
and servers to access and update the same shared data while
disconnected. Those changes can then be replicated bi-directionally
later.
The CouchDB document storage, view and security models are designed to
work together to make true bi-directional replication efficient and
reliable. Both documents and designs can replicate, allowing full
database applications (including application design, logic and data)
to be replicated to laptops for offline use, or replicated to servers
in remote offices where slow or unreliable connections make sharing
data difficult.
The replication process is incremental. At the database level,
replication only examines documents updated since the last
replication. Then for each updated document, only fields and blobs
that have changed are replicated across the network. If replication
fails at any step, due to network problems or crash for example, the
next replication restarts at the same document where it left off.
Partial replicas can be created and maintained. Replication can be
filtered by a javascript function, so that only particular documents
or those meeting specific criteria are replicated. This can allow
users to take subsets of a large shared database application offline
for their own use, while maintaining normal interaction with the
application and that subset of data.
This looks quite scalable to me, as it seems you can add new nodes to the cluster and the data then gets replicated to them.
Partial replicas also seem an interesting option for really big data sets, though I'd configure them very carefully to prevent situations where a query might not yield valid results, for example during a network partition when only a partial set is accessible.
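To make the incremental and filtered replication described in the quote more concrete, here is a minimal sketch in Python. It is a toy in-memory model, not CouchDB's actual replication protocol or API: the `ToyDatabase` class, the `replicate` function and the `region` field are all hypothetical names introduced for illustration, and whole documents are copied rather than only the changed fields.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, object]


@dataclass
class ToyDatabase:
    """Hypothetical in-memory database that keeps an ordered change log."""
    docs: Dict[str, Doc] = field(default_factory=dict)
    changes: List[Tuple[int, str]] = field(default_factory=list)  # (seq, doc_id)
    seq: int = 0

    def put(self, doc_id: str, doc: Doc) -> None:
        # Every write gets a new sequence number and is appended to the log.
        self.seq += 1
        self.docs[doc_id] = dict(doc)
        self.changes.append((self.seq, doc_id))


def replicate(
    source: ToyDatabase,
    target: ToyDatabase,
    since: int,
    doc_filter: Callable[[Doc], bool] = lambda doc: True,
) -> int:
    """Copy documents changed after `since` that pass the filter.

    Returns the new checkpoint, so the next run only examines later changes.
    """
    checkpoint = since
    for seq, doc_id in source.changes:
        if seq <= since:
            continue  # already examined by an earlier replication run
        doc = source.docs[doc_id]
        if doc_filter(doc):
            target.put(doc_id, doc)
        checkpoint = seq
    return checkpoint


source = ToyDatabase()
replica = ToyDatabase()
source.put("a", {"region": "eu", "value": 1})
source.put("b", {"region": "us", "value": 2})

# Partial replica: only documents from one (hypothetical) region are copied.
eu_only = lambda doc: doc["region"] == "eu"
checkpoint = replicate(source, replica, 0, eu_only)

source.put("c", {"region": "eu", "value": 3})
checkpoint = replicate(source, replica, checkpoint, eu_only)

print(sorted(replica.docs))  # ['a', 'c']
```

The second call only looks at the change made after the first checkpoint, which is the "incremental" part. The replica never sees document `b`, which is exactly why queries against a partial replica need care.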

Related

Are CoW snapshots the solution to safely pull data from critical OLTP databases for reporting?

Our IT team copies data from mission-critical SQL Server OLTP databases in what seems to be a naive way: basically just INSERT INTO ... SELECT * every night. We use this copied database for reporting. This is unsatisfactory for various reasons, but we're told it is the only way because uncontrolled user query execution could compromise OLTP performance and data integrity. I want an improvement that still addresses their concerns.
Copy-on-write snapshots are the best solution I've read about (we don't need up-to-the-minute data for reporting), but please comment on the following:
The snapshot's sparse files should be placed on a separate physical drive (so that snapshot reads/writes can occur without limiting disk throughput for OLTP tasks).
There should be a single NTFS filesystem spanning all physical disks (on a hunch that this would work better than putting the online database and its snapshots on logically separated volumes).
Create the filesystem with the /L:enable flag (so it works better with large sparse files).
Avoid multiple snapshots (since original data would have to be copied to each one).
We could use a single snapshot MyDB_LatestSnapshot that could be deleted and very quickly re-created every day, or even throughout the day (so long as kicking users running reports off it is acceptable).
Since the database snapshots will always be recent, most data will not have changed and so it will still have to be retrieved from the same drive as the online OLTP database, so increased resource (CPU/RAM) use is inevitable. Won't a long-running reporting query that pulls years of historical data (including data that hasn't changed and therefore doesn't exist in the snapshot) block writes just as if it were running against the online database?
Is there any way to tell SQL Server to prioritize resources for the needs of the OLTP database?
I've found examples of how original rows are copied from the online database when they're updated, but how do snapshots handle structural changes in the new database, like new/altered tables, indexes, etc.?
Can snapshots have different user permissions versus the online database (so that users can read from the snapshot, but not the online database)?
The OLTP system runs core banking applications, so I understand utmost caution is justified, but I can't believe the current approach is best practice in 2022.

Loading data from SQL Server to Elasticsearch

Looking for suggestions on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for reporting.
We currently use a 3rd party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs and as a result, there is a 24 hour data latency.
We are looking to build something out that would allow for more real time availability of the data, similar to SSRS, for our Clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data during the weekend and apply writes in real time during weekdays.
Thanks.
ElasticSearch's main use case is providing search capabilities on top of large volumes of unstructured, text-based data. For example, if you were ingesting large batches of emails into your data store every day, ElasticSearch is a good tool for parsing out pieces of those emails, based on rules you set up, to make those messages searchable (and to some degree queryable).
If your data is already in SQL Server, it sounds like it's structured already and therefore there's not much gained from ElasticSearch in terms of reportability and availability. Rather you'd likely be introducing extra complexity to your data workflow.
If you have structured data in SQL Server already, and you are experiencing issues with reporting directly off of it, you should look to building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features to accomplish this that you could look into are AlwaysOn Availability Groups, Replication, or SSIS.
Each option above (like the other out-of-the-box features of SQL Server) has its own pros and cons. For example, AlwaysOn Availability Groups are very easy to set up and can automatically fail over if your main server has an outage, but they clone the entire database to a replica. Replication lets you choose more granularly to copy only specific Tables and Views, but then you can't fail over as easily if your main server has an outage. So you should read up on all three options and understand their differences.
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before looking into replicating your data as a solution for reporting (although it's a fairly common solution). You may find that a simple architectural change like using a columnstore index on the correct Table will improve your reporting capabilities immensely.
I've been down both pathways, implementing ElasticSearch as well as a data warehouse using all three of the main data synchronization features above, for structured data and for unstructured large text data, and have experienced the proper use cases for both. One data warehouse I managed in the past had Tables with billions of rows (each Table terabytes in size), and it performed very well for reporting on fairly modest hardware in AWS (we weren't even using Redshift).

When to prefer master-slave and when to cluster?

I know there have been many articles written about database replication. Trust me, I spent some time reading those articles, including this SO one that explains the pros and cons of replication. This SO article goes in depth about replication and clustering individually, but doesn't answer these simple questions that I have:
When do you replicate your database, and when do you cluster?
Can both be performed at the same time? If yes, what are the inspirations for each?
Thanks in advance.
MySQL currently supports two different solutions for creating a high availability environment and achieving multi-server scalability.
MySQL Replication
The first form is replication, which MySQL has supported since MySQL version 3.23. Replication in MySQL is currently implemented as an asynchronous master-slave setup that uses a logical log-shipping backend.
A master-slave setup means that one server is designated to act as the master. It is then required to receive all of the write queries. The master executes and logs the queries, and the log is then shipped to the slave, which executes the same queries and so keeps the same data across all of the replication members.
Replication is asynchronous, which means that the slave server is not guaranteed to have the data when the master performs the change. Normally, replication will be as real-time as possible. However, there is no guarantee about the time required for the change to propagate to the slave.
Replication can be used for many reasons. Some of the more common reasons include scalability, server failover, and backup solutions.
Scalability can be achieved because you can now run SELECT queries across any of the slaves. Write statements, however, are generally not improved, because writes have to occur on each of the replication members.
Failover can be implemented fairly easily using an external monitoring utility that uses a heartbeat or similar mechanism to detect the failure of a master server. MySQL does not currently do automatic failover, as the logic is generally very application dependent. Keep in mind that because replication is asynchronous, it is possible that not all of the changes made on the master will have propagated to the slave.
MySQL replication works very well even across slower connections, and with connections that aren't continuous. It also is able to be used across different hardware and software platforms. It is possible to use replication with most storage engines including MyISAM and InnoDB.
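The asynchronous log-shipping behaviour described above can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not MySQL's implementation: the `Master` and `Slave` classes and the `pull` method are hypothetical names, and the "log" is just an in-memory list of key/value writes.

```python
from typing import Dict, List, Optional, Tuple


class Master:
    """Hypothetical master: applies every write and records it in an ordered log."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.log: List[Tuple[str, str]] = []

    def write(self, key: str, value: str) -> None:
        # The write completes here; nothing waits for any slave to catch up.
        self.data[key] = value
        self.log.append((key, value))


class Slave:
    """Hypothetical slave: replays the master's log some time later."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.applied = 0  # number of log entries replayed so far

    def pull(self, master: Master, max_entries: int) -> int:
        """Replay up to `max_entries` pending log entries; return how many were applied."""
        pending = master.log[self.applied:self.applied + max_entries]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)
        return len(pending)

    def read(self, key: str) -> Optional[str]:
        return self.data.get(key)


master = Master()
slave = Slave()
master.write("x", "1")
master.write("y", "2")

slave.pull(master, max_entries=1)      # the slave has only caught up on one entry
lag = len(master.log) - slave.applied  # entries the slave has not applied yet
stale = slave.read("y")                # None: the change has not propagated yet

slave.pull(master, max_entries=10)     # the slave catches up
fresh = slave.read("y")

print(lag, stale, fresh)  # 1 None 2
```

Reads served by the slave can be stale while `lag` is non-zero, and if the master failed at that point the unreplayed entries would be the changes that "have not propagated".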
MySQL Cluster
MySQL Cluster is a shared nothing, distributed, partitioning system that uses synchronous replication in order to maintain high availability and performance.
MySQL Cluster is implemented through a separate storage engine called NDB Cluster. This storage engine will automatically partition data across a number of data nodes. The automatic partitioning of data allows for parallelization of queries that are executed. Both reads and writes can be scaled in this fashion since the writes can be distributed across many nodes.
Internally, MySQL Cluster also uses synchronous replication in order to remove any single point of failure from the system. Since two or more nodes are always guaranteed to have each data fragment, at least one node can fail without any impact on running transactions. Failure detection is handled automatically, with the dead node being removed transparently to the application. Upon restart, the node is automatically re-integrated into the cluster and begins handling requests as soon as possible.
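A rough sketch of the two ideas above, automatic partitioning plus synchronous replication inside each partition, might look as follows in Python. It is a toy model only: `NodeGroup` and `ToyCluster` are hypothetical names, the CRC32-based key hashing is an assumption chosen for simplicity, and nothing here reflects how NDB Cluster is actually implemented.

```python
import zlib
from typing import Dict, List, Optional


class NodeGroup:
    """Hypothetical group of nodes that all hold the same data fragment."""

    def __init__(self, replicas: int = 2) -> None:
        self.replicas: List[Dict[str, str]] = [{} for _ in range(replicas)]
        self.alive: List[bool] = [True] * replicas

    def _live(self) -> List[Dict[str, str]]:
        live = [r for r, up in zip(self.replicas, self.alive) if up]
        if not live:
            raise RuntimeError("all replicas in this node group are down")
        return live

    def write(self, key: str, value: str) -> None:
        # Synchronous replication: every live replica is updated before returning.
        for replica in self._live():
            replica[key] = value

    def read(self, key: str) -> Optional[str]:
        # Any live replica can answer, since they are kept identical.
        return self._live()[0].get(key)


class ToyCluster:
    """Routes each key to one node group, so reads and writes are spread out."""

    def __init__(self, groups: int = 2) -> None:
        self.groups = [NodeGroup() for _ in range(groups)]

    def _group_for(self, key: str) -> NodeGroup:
        # Assumed partitioning scheme: hash of the key modulo the group count.
        return self.groups[zlib.crc32(key.encode("utf-8")) % len(self.groups)]

    def write(self, key: str, value: str) -> None:
        self._group_for(key).write(key, value)

    def read(self, key: str) -> Optional[str]:
        return self._group_for(key).read(key)


cluster = ToyCluster(groups=2)
for i in range(10):
    cluster.write(f"user:{i}", f"value-{i}")

# Simulate one node failing in every group; the surviving replica still has the data.
for group in cluster.groups:
    group.alive[0] = False

survived = all(cluster.read(f"user:{i}") == f"value-{i}" for i in range(10))
print(survived)  # True
```

Because each write only touches one node group, adding groups spreads the write load, which is the contrast with master-slave replication, where every write lands on every member.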
There are a number of limitations that currently exist and have to be kept in mind while deciding if MySQL Cluster is the correct solution for your situation.
Currently all of the data and indexes stored in MySQL Cluster are stored in main memory across the cluster. This does restrict the size of the database based on the systems used in the cluster.
MySQL Cluster is designed to be used on an internal network as latency is very important for response time.
As a result, it is not possible to run a single cluster across a wide geographic distance. In addition, while MySQL Cluster will work over commodity network setups, in order to attain the highest performance possible special clustering interconnects can be used.
In a master-slave configuration, write operations are performed by the master and reads by the slave. All SQL requests first reach the master, which maintains a queue of requests, and a read operation is executed only after the write completes. A common problem with master-slave configurations, which I have also witnessed, is that when the queue becomes too large for the master to maintain, the architecture collapses and the slave starts behaving like the master.
For clusters, I have worked with Cassandra, where a request reaches a node (table) and a commit hash is maintained that records the changes made to a node and updates the other nodes based on it. So here the operations do not all depend on a single node.
We use master-slave when the write data is not big in size or count; otherwise we use clusters.
Clusters are expensive in space and master-slave in time, so your decision depends on what you want to save.
We can also use both at the same time; I have done this at my current company.
We moved the tables with the most write operations to Cassandra and wrote four APIs to perform the CRUD operations on those tables. Whenever an HTTP request comes in, it first hits our web server, and the code running there decides which CRUD operation has to be performed and calls the corresponding API to make changes to the Cassandra database.

SQL Server vs. No-SQL Database

I have inherited a legacy content delivery system and I need to re-design & re-build it. The content is delivered by content suppliers (e.g. Sony Music) and is ingested by a legacy .NET app into a SQL Server database.
Each content has some common properties (e.g. Title & Artist Name) as well as some content-type specific properties (e.g. Bit Rate for MP3 files and Frame Rate for video files).
This information is stored in a relational database in multiple tables. These tables might have null values in some of their fields because those fields might not belong to a property of the content. The database is constantly under write operations because the content ingestion system is constantly receiving content files from the suppliers and then adds their metadata to the database.
Also, there is a public-facing web application which lets end users buy the ingested content (e.g. music, videos etc.). This web application relies entirely on an Elasticsearch index. In fact, this application does not see the database at all and uses the Elasticsearch index as its source of data. The reason is that SQL Server does not perform as fast or as efficiently as Elasticsearch when it comes to text search.
To keep the database and Elasticsearch in sync, there is a Windows service which reads the updates from SQL Server and writes them to the Elasticsearch index!
As you can see there are a few problems here:
The data is saved in a relational database, which makes the data hard to manage. E.g. there is a table of 3 billion records that stores the metadata of each content item as key-value pairs! To me, using a NoSQL database or index would make a lot more sense, as they allow storing documents with different formats.
The Elasticsearch index needs to be kept in sync with the database. If the Windows service does not work for any reason, the index will not get updated. Also, when there are too many inserts/updates in the database, it takes a while for the index to get updated.
We need to maintain two sources of data which has cost overhead.
Now my question: is there a NoSQL database which has these characteristics?
Allows me to store documents with different structures in it?
Provides good text-search functions and performance? e.g. Fuzzy search etc.
Allows multiple updates to be made to its data concurrently? Based on my experience Elasticsearch has problems with concurrent updates.
Can be installed and used on AWS infrastructure, because our new products will be hosted on AWS? Auto scaling and clustering are important. E.g. DynamoDB.
It would have a kind of GUI so that support staff or developers could modify the data to some extent.
A combination of DynamoDB and ElasticSearch may work for your use case.
DynamoDB certainly supports characteristics 1, 3, 4, and 5.
There is now a Logstash Input Plugin for DynamoDB that can be combined with an ElasticSearch output plugin to keep your table and index in sync in real time. ElasticSearch provides characteristic 2.

Architecting a high performing "inserting solution"

I am tasked with putting together a solution that can handle a high level of inserts into a database. There will be many AJAX type calls from web pages. It is not only one web site/page, but several different ones.
It will be dealing with tracking people's behavior on a web site, triggered by various javascript events, etc.
It is important for the solution to be able to handle the heavy database inserting load.
After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.
We are initially looking at using the MEAN stack with MongoDB and migrating some data to MySQL for reporting purposes. I am also wondering about using some sort of queueing before inserting into the DB, or caching with something like memcached.
I didn't manage to find much help on this elsewhere. I did see this post, but it is now close to 5 years old, feels a bit outdated and doesn't quite ask the same questions.
Your thoughts and comments are most appreciated. Thanks.
Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?
It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.
Some initial theory
To understand how you can optimize for a heavy insert workload, I suggest first understanding the main overheads involved in inserting data into a database. Once the various overheads are understood, all kinds of optimizations will come to you naturally. The bonus is that you will have more confidence in the solution, you will know more about databases, and you can apply these optimizations to multiple engines (MySQL, PostgreSQL, Oracle, etc.).
I'll first give a non-exhaustive list of insertion overheads and then show simple solutions to avoid them.
1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on.
And once the query is accepted, it needs to be parsed, its grammar validated, data types must be parsed and validated, the objects (tables, indexes, etc.) referenced by the query searched and access permissions are checked, etc. All of these steps (and I'm sure I forgot quite a few things here) represent significant overheads when inserting a single value. The overheads are so large that some databases, e.g. Oracle, have a SQL cache to avoid some of these overheads.
Solution: Reuse database connections, use prepared statements, and insert many values at every SQL query (1000s to 100000s).
2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging all logical and physical modifications to the database ahead of time, and they require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The actual time required to deal with the ACID guarantees can be several orders of magnitude higher than the time it takes to actually copy a 200-byte row into a database page.
Solution: Disable undo/redo logging when you import data in a table. Alternatively, you could also (1) drop the isolation level to trade off weaker ACID guarantees for lower overhead or (2) use asynchronous commit (a feature that allows the DB engine to complete an insert before the redo logs are properly hardened to disk).
3. Updating the physical design / database constraints: Inserting a value in a table usually requires updating multiple indexes, materialized views, and/or executing various triggers. These overheads can again easily dominate over the insertion time.
Solution: You can consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done, you can re-create them. For example, it is significantly faster to create an index from scratch than to populate it through individual insertions.
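As a minimal sketch of solutions 1 and 3, the following Python example reuses a single connection, sends many rows per parameterized statement, commits once per batch, and builds the secondary index only after the load. It uses SQLite from the standard library purely so that it runs anywhere; the `events` table, its columns, the index name and the `bulk_load` helper are hypothetical, and the batch size is an arbitrary assumption. The same pattern should carry over to other engines through their own drivers.

```python
import sqlite3
from typing import List, Tuple


def bulk_load(conn: sqlite3.Connection, rows: List[Tuple[int, str]], batch_size: int = 1000) -> None:
    """Insert rows in large batches, then create the secondary index once."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, action TEXT)")
    insert_sql = "INSERT INTO events (user_id, action) VALUES (?, ?)"

    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One parameterized statement and one commit per batch, instead of per row.
        with conn:
            conn.executemany(insert_sql, batch)

    # The index is built from scratch after the load rather than updated per insert.
    with conn:
        conn.execute("CREATE INDEX IF NOT EXISTS idx_events_user ON events (user_id)")


conn = sqlite3.connect(":memory:")  # a single connection, reused for every batch
rows = [(i % 100, "click") for i in range(5000)]
bulk_load(conn, rows)

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 5000
```

Solution 2 (relaxing logging or commit durability) is engine specific and is deliberately left out of this sketch.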
In practice
Now let's see how we can apply these concepts to your particular design. The main issue I see in your case is that the insert requests are sent by many distributed clients, so there is little chance for bulk processing of the inserts.
You could consider adding a caching layer in front of whatever database engine you end up having. I don't think memcached is good for implementing such a caching layer; memcached is typically used to cache query results, not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source, but not all features are free, so I'm not sure whether you would need to pay for a license. If you cannot use VoltDB, you could look at the memory engine for MySQL or other similar in-memory engines.
Another optimization you can consider is to have a different database for doing the analytics. Most likely, a database with a high data ingest volume is quite bad at executing OLAP-style queries, and the other way around. Coming back to my recommendation, VoltDB is no exception and is also suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (in this case a VoltDB cluster) and moves it in bulk to the backend DB for the analytics (MongoDB or maybe something more efficient). You can then apply all the optimizations above for the bulk data movement, create a rich set of additional index structures to speed up data access, then run your favourite analytical queries and save the results as a new set of tables/materialized views for later access. The import/analysis process can be repeated continuously in the background.
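The background data movement described above could be sketched like this in Python. Two in-memory SQLite databases stand in for the frontend and backend stores, which is an assumption made only so that the example runs; the table names `events` and `events_archive`, the `move_new_rows` helper and the batch size are all hypothetical. The process remembers a high-water mark (the last copied `id`) so that each run moves only rows that arrived since the previous one.

```python
import sqlite3


def move_new_rows(frontend: sqlite3.Connection, backend: sqlite3.Connection,
                  last_id: int, batch_size: int = 500) -> int:
    """Copy rows with id > last_id to the backend in batches; return the new high-water mark."""
    while True:
        batch = frontend.execute(
            "SELECT id, user_id, action FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            return last_id
        with backend:  # one transaction per batch on the analytics side
            backend.executemany(
                "INSERT INTO events_archive (id, user_id, action) VALUES (?, ?, ?)",
                batch,
            )
        last_id = batch[-1][0]


frontend = sqlite3.connect(":memory:")  # stand-in for the insert-optimized store
backend = sqlite3.connect(":memory:")   # stand-in for the analytics store
frontend.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, action TEXT)")
backend.execute("CREATE TABLE events_archive (id INTEGER PRIMARY KEY, user_id INTEGER, action TEXT)")

with frontend:
    frontend.executemany("INSERT INTO events (user_id, action) VALUES (?, ?)",
                         [(i, "view") for i in range(1200)])
mark = move_new_rows(frontend, backend, last_id=0)

# New rows arrive later; the next run only moves those.
with frontend:
    frontend.executemany("INSERT INTO events (user_id, action) VALUES (?, ?)",
                         [(i, "click") for i in range(3)])
mark = move_new_rows(frontend, backend, last_id=mark)

archived = backend.execute("SELECT COUNT(*) FROM events_archive").fetchone()[0]
print(mark, archived)  # 1203 1203
```

In a real deployment this function would be called periodically by a scheduler or worker, and rows already moved could then be purged from the frontend store.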
Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts. So the table is optimized for queries with indexes and such. If you have a table where DML (particularly Inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including a primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.
Fronting your web application with a NoSQL table to handle the high insert rate then moving the data more or less at your leisure to a standard relational db for further processing is a good idea.
