Are staging tables / staging databases an anti-pattern? - database

Are staging tables an anti-pattern, used where RPC (such as Java RMI or some kind of web service call) or a message queue (such as JMS) would be a better solution, or are there problems better served by staging tables?
To clarify:
By staging tables I mean those cases where records are appended to a table or tables by one process and then read and acted on by a second process or processes. I am not referring to tables which are meant to reflect end-of-interval status (end of day, end of pay period, etc.). In most cases, the schema of the staging tables closely mimics one or more application data types, such as customer or account.
Potential causes for this anti-pattern:
1) A business-unit wall between the owners of the two processes prevents the process that writes to or reads from staging from being modified.
2) Low confidence in the process that writes to or reads from staging leads developers to use a table to prevent data loss "in case something fails".
3) Lack of knowledge or DGAS (don't give a ^%$#) attitude

Staging tables, as you describe them, are an essential part of most data warehouse or BI environments. You could argue that reliable/resilient RPC would do the same job, but I think you'd be incorrect.
By pulling data to a staging table, you're moving it out of the production environment, potentially to do further calculation, summarisation, re-indexing, re-keying and so on, most of which is achieved 'in database'. By replacing this with RPC you're moving the code and CPU cycles out of the DB and into an app server for no real benefit. For instance, an app server has a much higher chance of crashing, and you can't (easily) roll back an RPC.
Of course there are many ways of moving data reliably between systems. Staging tables just happen to be one of the easiest, most performant, most reliable and, in development terms, cheapest. That doesn't always mean they're the right approach, but more often than not they are.

Why would they be an anti-pattern? Staging tables are incredibly useful for decoupling a receiving service from a processing service. When two such services are decoupled you are much more resilient to processing errors and network errors as all messages are stored in the staging table.
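As a rough illustration of that decoupling, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name `customer_staging`, its columns and the `processed` flag are hypothetical choices made for this example, not something taken from a real system.

```python
import sqlite3


def make_staging_db():
    # An in-memory database stands in for the shared staging database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_staging ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def receive(conn, payload):
    # Receiving service: append the record, commit, and return.
    with conn:
        conn.execute(
            "INSERT INTO customer_staging (payload) VALUES (?)", (payload,)
        )


def process_pending(conn, handler):
    # Processing service: handle each unprocessed row and mark it done in
    # its own transaction, so a failed row simply stays in staging.
    rows = conn.execute(
        "SELECT id, payload FROM customer_staging"
        " WHERE processed = 0 ORDER BY id"
    ).fetchall()
    done = 0
    for row_id, payload in rows:
        try:
            handler(payload)
        except Exception:
            continue  # row stays unprocessed and is retried on the next pass
        with conn:
            conn.execute(
                "UPDATE customer_staging SET processed = 1 WHERE id = ?",
                (row_id,),
            )
        done += 1
    return done
```

If the handler fails for a row, nothing is lost: the row is still in the table and is picked up the next time `process_pending` runs.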

The only time I have really seen this is for reporting, where denormalised tables are used to hold data while a report is generated. I don't think it is a problem for that use.

My first response is yes, but it's mostly just because of my situation; yours may be different. We have a system where some relatively time-sensitive information needs to go from a command component to a receiver component. The command information is put into a database table and then the receiver polls the table for updates. This is horrible. They did it so there would be a record of the commands in the database, but it ends up just making the actual commanding take forever, and the decoupling sometimes causes the receiver to be out of sync with the database.
I'd rather see an EMS (like JMS) broadcast the message to a topic that both the receiver and a database inserter listen to, or a queue from commander to receiver, and then the receiver notify a status listener to put its status in the database.
I can't wait to fix that code.
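The topic-based alternative from this answer can be sketched with a tiny in-process publish/subscribe class. This is not JMS or any real EMS client: `Topic` and `wire_up` are hypothetical stand-ins that only show the shape of the idea, where the receiver and the database inserter each get their own delivery of every command.

```python
class Topic:
    """In-process stand-in for a message topic (not a real JMS/EMS client)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        # Every subscriber gets its own delivery of the same message.
        for callback in self._subscribers:
            callback(message)


def wire_up(topic, received, audit_log):
    # The receiver acts on commands as they arrive, with no polling ...
    topic.subscribe(received.append)
    # ... while a separate listener records them, standing in for the
    # database inserter.
    topic.subscribe(lambda command: audit_log.append(("command", command)))
```

The point of the design is that the record in the database becomes a side effect of the message, rather than the transport for it.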

Related

Problems and solutions when using a secondary datastore alongside the main database?

I am in the middle of an interview simulation and I got stuck on one question. Can someone provide the answer for me please?
The question:
We use a secondary datastore (we use Elasticsearch alongside our main database) for real-time analytics and reporting. What problems might you anticipate with this sort of approach? Explain how you would go about solving or mitigating them.
Thank you
There are several problems:
No transactional cover: If your main database is transactional (which it usually is), you either commit or you don't. After a record is inserted into your main database, there is no guarantee that it will also be written to ES. In fact, if you commit several records to your primary DB, you may end up with some of them written to ES and a few others not. This is a MAJOR issue.
Refresh interval: Elasticsearch by default refreshes every second. That means "real time" is generally up to 1 second behind. If you commit a record to your primary DB and immediately search for it via ES, it may not be found. One way around this is to GET the record by its ID.
Data duplication: Elasticsearch cannot do joins. You need to denormalize all data that comes from an RDBMS. If one user has many posts, you cannot "join" to search. You have to add the user ID and any other user-specific details to every post object.
Hardware: Elasticsearch needs RAM (a bare minimum of 1 GB) to work properly. This is assuming you don't use anything else from the ELK stack. This is an important cost consideration.
One problem might be synchronization, where the Elasticsearch store gets out of sync and starts serving stale data. To avoid this, you will have to implement monitoring on your data pipeline, Elasticsearch and the primary database, detecting problems by checking update times, delay, number of records (within some level of error) in each of them, and overall system operation status (up / down).
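A minimal sketch of such a check, assuming you can already obtain a record count and a last-update timestamp from each store. The function name `check_sync` and both thresholds are hypothetical, and how the numbers are fetched is left out on purpose.

```python
from datetime import datetime, timedelta


def check_sync(primary_count, secondary_count,
               primary_last_update, secondary_last_update,
               max_count_drift=0.01, max_lag=timedelta(seconds=30)):
    """Return a list of human-readable problems; an empty list means in sync."""
    problems = []

    # Record counts may differ slightly while writes are in flight,
    # so compare them within a tolerance instead of exactly.
    if primary_count == 0:
        drift = 0.0 if secondary_count == 0 else 1.0
    else:
        drift = abs(primary_count - secondary_count) / primary_count
    if drift > max_count_drift:
        problems.append(f"record count drift {drift:.1%}")

    # The newest record in the secondary store should not trail the
    # newest record in the primary by more than max_lag.
    lag = primary_last_update - secondary_last_update
    if lag > max_lag:
        problems.append(f"secondary is {lag.total_seconds():.0f}s behind")

    return problems
```

A scheduled job could call this with values read from both stores and raise an alert whenever the returned list is not empty.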
Another is disconnection and recovery - what happens if your data pipeline or Elasticsearch loses its connection to the rest of the system? You will need an automatic way to reconnect when the network is restored and start synchronising data again.
You also have to take into account a sudden influx of data - how to scale Elasticsearch ingestion or your data processor (data pipeline) if there is a large number of updates and inserts in peak hours, or after reconnection following network issues.

Make microservice application resilient to db downtime

We have a microservice application that saves its data to an Oracle DB.
So far the DB is our single point of failure, which we want to improve (we are using a single Oracle DB with a cold failover instance).
Now the company is asking us to upgrade the Oracle DB; the issue is that this requires downtime.
For that reason we were thinking about:
add a global/geo-replicated cache layer (e.g. Redis) between the microservice and the DB
for each new record that should be saved on the db:
Add the record to the cache (storing the entries on disk in case the whole cache layer crashes)
throw an event to a queue (we have RabbitMQ). On the other side of the queue we can create a new service to consume the events and add them to the DB asynchronously.
It's basically adding a write-behind cache layer.
In the above scenario we are confident that we can easily keep 1 week of data in the cache, or more.
If the DB is down, the new service listening to the queue will simply keep retrying to add the rows to the DB; as soon as an event has been written to the DB, it can be acked and the next one consumed. This way, if the DB is down or we have to do some maintenance, the main application should not be affected: users can still "save" data and retrieve it (with the 1-week maximum whenever the DB is down).
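The consume/retry/ack loop described here might look roughly like the sketch below. It uses an in-memory `deque` in place of RabbitMQ, and `DatabaseDown`, `drain` and the `save` callback are hypothetical names for this example. A real consumer would also wait between attempts, which is left out here.

```python
from collections import deque


class DatabaseDown(Exception):
    """Raised by the save callback when the database is unreachable."""


def drain(queue, save, max_attempts=3):
    """Consume events in order; remove (ack) an event only after save succeeds."""
    saved = 0
    while queue:
        event = queue[0]  # peek: the event stays queued until it is stored
        for _ in range(max_attempts):
            try:
                save(event)
                break
            except DatabaseDown:
                continue
        else:
            # All attempts failed: stop and leave the event at the head of
            # the queue so ordering is preserved for the next run.
            return saved
        queue.popleft()  # ack
        saved += 1
    return saved
```

Because an event is only removed after a successful save, a crash between the save and the ack leads to a redelivery, so the insert into the DB should be idempotent.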
The downside is that the architecture is more complex and our data is now only eventually consistent.
Is there another design pattern to better deal with database downtime without having the users feel that something is wrong?
Do you know of any existing tools that we can use to automatically read an event from Rabbit and save it in the DB? (We already use Logstash to automatically forward some Rabbit events to Elastic.)
The next step would be to have a DB cluster (Cassandra, Mongo, etc.), but for now we do not have the capacity for that.
Adding a cache to increase availability is probably an awkward solution, as you will eventually run into the same issue of keeping the cache available. Also, handling cold caches is not a simple task.
I am not familiar with Oracle, but most databases do support replication; and you have options for synchronous/asynchronous/semi-synchronous patterns.
A quick search led me to "Oracle Data Guard", which seems to be the tool you need. The docs say that Data Guard supports data replication and failover.
As for using Cassandra - I highly recommend evaluating it carefully first. Oracle gives you ACID properties and joins, which makes application code much simpler. Consistency patterns will also be different. Lots of details to think about.
My general recommendation is to look into your data layer (Oracle in this case) and follow the vendor's recommendations for achieving high availability. Oracle is a mature product, and availability is well supported.

Can Snowflake be used to mitigate application failure for business continuity?

I would like your opinions or experiences around the following possible solution idea. I know Snowflake is primarily a data analytics platform. But why could we not use it for some creative scenarios like business continuity?
Problem
Imagine an application that supports a critical business process. There is a risk that the application could become unavailable for an extended period. The application in this case is a SaaS solution by a reputable vendor, Salesforce. So it does not go down often. And when it does, they normally restore it in less than a day. But the business process is a critical medical logistics process - meaning if a transaction is delayed for a few days, lives may be lost.
Background
Our transaction volumes are moderate. We probably serve 25 new patients per day, with a few hundred interactions each day to support those. In the event of an outage, a subset of those might need immediate manual intervention to keep things moving. Others might be able to wait a couple of days.
We already use Snowflake to store replicas of the application's data. We use Looker to write analytics reports.
Proposed Solution
Write reports that expose critical data that may be needed if the primary application fails. Then, when the primary application fails, users can view reports using the latest replicated data to enable manual activities to keep things going until the primary application is restored to working order.
If data changes are needed, they must be written down somewhere and then applied to the application when its availability is restored.
Your only issue could be latency: as it stands today, Snowflake is built for OLAP workloads, not OLTP workloads.
If the latency you get when running queries from Snowflake is fine then you have a valid use case.
Snowflake is used as an application backend, particularly where the queries are historical analysis and a latency of a few seconds, rather than an immediate response, is acceptable.
See: https://www.snowflake.com/workloads/data-applications/

How does multi table schema create data consistency issues?

As per this answer, it is recommended to go for single table in Cassandra.
Cassandra 3.0
We are planning for below schema:
Second table has composite key. PK(domain_id, item_id). So, domain_id is partition key & item_id will be clustering key.
GET request handler will access(read) two tables
POST request handler will access(write) into two tables
PUT request handler will access(write) details table(only)
As per CAP theorem,
What are the consistency issues in having multi-table schema? in Cassandra...
Can we avoid consistency issues in Cassandra? with these terms QUORUM, consistency level etc...
recommended to go for single table in Cassandra.
I would recommend the opposite. If you have to support multiple queries for the same data in Apache Cassandra, you should have one table for each query.
What are the consistency issues in having multi-table schema? in Cassandra...
Consistency issues between query tables can happen when writes are applied to one table but not the other(s). In that case, the application should have a way to gracefully handle it. If it becomes problematic, perhaps running a nightly job to keep them in-sync might be necessary.
You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process. In that case, a given data point may have only a subset of its intended replicas.
This scenario can be countered by running regularly-scheduled repairs. Additionally, consistency can be increased on a per-query basis (QUORUM vs. ONE, etc), and consistency levels of QUORUM and higher will occasionally trigger a read-repair (which syncs all replicas in the current operation).
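The QUORUM vs. ONE trade-off can be made concrete with a little arithmetic. The sketch below assumes a single datacenter and uses the commonly documented rule that a read is guaranteed to overlap the latest acknowledged write when the write and read replica counts add up to more than the replication factor. The function names are made up for this example and are not part of any Cassandra driver.

```python
def quorum(replication_factor):
    # A quorum is a majority of replicas: floor(RF / 2) + 1.
    return replication_factor // 2 + 1


def replicas_for(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    table = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return table[level]


def read_overlaps_write(write_level, read_level, replication_factor):
    # If W + R > RF, every read set shares at least one replica with every
    # successful write set, so the read sees the latest acknowledged write.
    w = replicas_for(write_level, replication_factor)
    r = replicas_for(read_level, replication_factor)
    return w + r > replication_factor
```

With a replication factor of 3, writing and reading at QUORUM gives 2 + 2 > 3, so the sets overlap; writing and reading at ONE gives 1 + 1, which does not, and that is where stale reads can appear.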
Can we avoid consistency issues in Cassandra? with these terms QUORUM, consistency level etc...
So Apache Cassandra was engineered to be highly-available (HA), thereby embracing the paradigm of eventual consistency. Some might interpret that to mean Cassandra is inconsistent by design, and they would not be incorrect. I can say after several years of supporting hundreds of clusters at web/retail scale, that consistency issues (while they do happen) are rare, and are usually caused by failures to components outside of a Cassandra cluster.
Ultimately though, it comes down to the business requirements of the application. For some applications like product reviews or recommendations, a little inconsistency shouldn't be a problem. On the other hand, things like location-based pricing may need a higher level of query consistency. And if 100% consistency is indeed a hard requirement, I would question whether or not Cassandra is the proper choice for data storage.
Edit
I did not get this: "Consistency issues between query tables can happen when writes are applied to one table but not the other(s)." When writes are applied to one table but not the other(s), what happens?
So let's say that a new domain is added. Perhaps a scenario arises where the domain_details_table gets updated, but the id_table does not. Nothing wrong here on the database side. Except that when the application expects to find that domain_id in the id_table, but cannot.
In that case, maybe the application can retry using a secondary index on domain_details_table.domain_id. It won't be fast, but the decision to be made is more around which scenario is more preferable; no answer, or a slow answer? Again, application requirements come into play here.
For your point: "You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process." How does RDBMS(like MySQL) deal with this?
So the answer to this used to be simple. RDBMSs only run on a single server, so there's only one replica to keep in-sync. But today, most RDBMSs have HA solutions which can be used, and thus have to be kept in-sync. In that case (from what I understand), most of them will asynchronously update the secondary replica(s), while restricting traffic only to the primary.
It's also good to remember that RDBMSs enforce consistency through locking strategies, as well. So even a single-instance RDBMS will lock a data point during an update, blocking any reads until the lock is released.
In a node-down scenario, a single-instance RDBMS will be completely offline, so instead of inconsistent data you'd have data loss. In an HA RDBMS scenario, there would be a short pause (during which you would likely encounter connection/query failures) until it has failed over to the new primary. Once the replica comes up, there would probably be additional time necessary to sync up the replicas, until HA can be restored.

Managing high-volume writes to SQL Server database

I have a web service that is used to manage files on a filesystem that are also tracked in a Microsoft SQL Server database. We have a .NET system service that watches for files that are added using the FileSystemWatcher class. When a file-added callback comes from FileSystemWatcher, metadata about the file is added to our database, and it works fairly well.
I've now come to a bit of a scalability problem. I'm adding large quantities of files to the filesystem in rapid succession, and this ends up hammering the database with file adds, which results in locking up my web front-end.
I have yet to work on database scalability issues, so I'm trying to come up with mitigation tactics. I was thinking of perhaps caching file adds and only writing them off to the database every five minutes or so, but I'm not sure how practical that is. This is data that needs to find its way into our database at some point anyway, and so it's going to have to get hammered at some point. Maybe I could limit the number of file db entries written per second to a certain amount, but then I risk having that amount be less than the rate at which files are added. How can I best tackle this?
Have you thought about using something like SQL Server Service Broker? That way you could push through tons of entries in a burst and it would level out the inserts into your database.
Basically you'd be pushing messages onto a queue which would then be consumed by a receiver stored procedure that would perform the insert for you. You could limit the maximum number of receivers executing to help with the responsiveness issues in your web interface.
There's a nice intro paper here. Although it's for 2005, not much has changed between 2005 and the newer versions of SQL Server.
You have a performance problem and you should approach it with a performance investigation methodology like Waits and Queues. Once you identify the actual problem, we can discuss solutions.
This is just a guess, but assuming the notification 'update metadata' code is a straightforward insert, the likely problem is that you're generating one transaction per notification. This results in commit flush waits; see Diagnosing Transaction Log Performance. Batch commit (aggregating multiple notifications before committing) is the canonical solution.
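To show the shape of a batch commit without tying it to SQL Server or .NET, here is a small sketch using Python's built-in sqlite3 module. The `file_metadata` table, the function names and the batch size are hypothetical. The point is only that many notifications share one transaction, and therefore one commit.

```python
import sqlite3


def make_metadata_db():
    # In-memory database standing in for the real metadata store.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file_metadata (path TEXT NOT NULL)")
    return conn


def insert_batched(conn, file_paths, batch_size=100):
    """Insert file rows, committing once per batch instead of once per row."""
    commits = 0
    batch = []
    for path in file_paths:
        batch.append((path,))
        if len(batch) >= batch_size:
            with conn:  # one transaction, one commit for the whole batch
                conn.executemany(
                    "INSERT INTO file_metadata (path) VALUES (?)", batch
                )
            commits += 1
            batch = []
    if batch:
        with conn:  # flush the final, partially filled batch
            conn.executemany(
                "INSERT INTO file_metadata (path) VALUES (?)", batch
            )
        commits += 1
    return commits
```

In a watcher-driven service the batch would more likely be flushed on a timer as well as on size, so that a slow trickle of files does not sit unwritten for long.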
The first option is to use caching to handle the high-volume data. Another is to use clusters for analysing high-volume data. Please click here for more information.

Resources