What does "Online" in Online Analytical Processing mean?

Is there any offline analytical processing? If there is, how does it differ from online analytical processing?
According to "What are OLTP and OLAP. What is the difference between them?", OLAP deals with historical or archival data and is characterized by a relatively low volume of transactions; queries are often very complex and involve aggregations.
I don't understand what the word "online" in online analytical processing means.
Is it related to real-time processing? (My understanding of real-time: shortly after the data is generated, it can be analyzed. Am I wrong about this?)
When does the analysis happen?
I imagine a design like this:
logs generated in many apps -> Kafka -> (relational DB) -> Flink as ETL -> HBase, with the analysis happening after the data is inserted into HBase. Is this correct?
If yes, why is it called online?
If no, when does the analysis happen? Please correct me if this design is not typical in industry.
P.S. Assume that the logs generated by the apps in one day are at the petabyte level.

TLDR: as far as I can tell, "online" appears to stem from the characteristics of a scenario where handling transactions with satellite devices (ATMs) was a new thing.
Long version
To understand what "online" in OLTP means, you have to go back to when ATMs first came out, in the 1970s.
If you're a bank back then, you've got two types of system: your central banking system (i.e. mainframe), and these newfangled ATMs connected to it... online.
So if you're a bank, and someone wants to get money out, you have to do a balance check; and if cash is withdrawn you need to do a debit. That last action - or transaction - is key, because you don't want to miss that - you want to update your central record back in the bank's central systems. So that's the transactional processing (TP) part.
The OL part just refers to the remote / satellite / connected devices that participate in the transaction processing.
OLTP is all about making sure that happens reliably.
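To make the TP part concrete, here is a toy sketch of that balance-check-then-debit done as a single transaction, so the withdrawal either happens completely or not at all (Python's built-in sqlite3 standing in for the bank's central database; the accounts table and amounts are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('acct-1', 100)")
conn.commit()

def withdraw(conn, account_id, amount):
    # The 'with' block is one transaction: the balance check and the debit
    # commit together, or roll back together if anything fails mid-way.
    with conn:
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )

withdraw(conn, "acct-1", 40)
print(conn.execute("SELECT balance FROM accounts WHERE id = 'acct-1'").fetchone())  # (60,)
```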

Related

Can Snowflake be used to mitigate application failure for business continuity?

I would like your opinions or experiences around the following possible solution idea. I know Snowflake is primarily a data analytics platform. But why could we not use it for some creative scenarios like business continuity?
Problem
Imagine an application that supports a critical business process. There is a risk that the application could become unavailable for an extended period. The application in this case is a SaaS solution by a reputable vendor, Salesforce. So it does not go down often. And when it does, they normally restore it in less than a day. But the business process is a critical medical logistics process - meaning if a transaction is delayed for a few days, lives may be lost.
Background
Our transaction volumes are moderate. We probably serve 25 new patients per day, with a few hundred interactions each day to support those. In the event of an outage, a subset of those might need immediate manual intervention to keep things moving. Others might be able to wait a couple of days.
We already use Snowflake to store replicas of the application's data. We use Looker to write analytics reports.
Proposed Solution
Write reports that expose critical data that may be needed if the primary application fails. Then, when the primary application fails, users can view reports using the latest replicated data to enable manual activities to keep things going until the primary application is restored to working order.
If data changes are needed, they must be written down somewhere and then applied to the application once its availability is restored.
Your only issue could be latency: as it stands today, Snowflake is built for OLAP workloads, not OLTP workloads.
If the latency you get when running queries from Snowflake is acceptable, then you have a valid use case.
Snowflake is used as an application backend, particularly when the queries are about historical analysis and a latency of a few seconds (as opposed to immediate) is acceptable.
See: https://www.snowflake.com/workloads/data-applications/
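For what it's worth, a rough sketch of the kind of fallback report query described above, using the snowflake-connector-python package (the connection parameters, table and columns are placeholders, not your actual schema):

```python
import snowflake.connector

# Placeholder credentials; in practice these come from config/secrets management.
conn = snowflake.connector.connect(
    account="my_account",
    user="report_reader",
    password="********",
    warehouse="REPORTING_WH",
    database="SFDC_REPLICA",
    schema="PUBLIC",
)

# Pull the open shipments that staff would need to work manually during an outage.
cur = conn.cursor()
cur.execute("""
    SELECT patient_id, order_id, status, ship_by_date
    FROM shipments
    WHERE status NOT IN ('DELIVERED', 'CANCELLED')
    ORDER BY ship_by_date
""")
for patient_id, order_id, status, ship_by_date in cur:
    print(patient_id, order_id, status, ship_by_date)

cur.close()
conn.close()
```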

When is data consistency not an issue?

I am new to distributed systems and I have read about the CAP theorem. I am interested in AP systems such as Cassandra.
My question is: in what cases can you actually sacrifice consistency? Effectively, sacrificing consistency means serving inaccurate data. In what cases would you then actually use an AP datastore like Cassandra? I can't think of any case where I wouldn't want my reads to be consistent.
By an AP system, I assume you will at least aim to ensure eventual consistency.
Imagine you're developing a social network where users have friends and their own news feeds. It doesn't matter if a particular user's feed has an occasional five-minute lag (the feed list is eventually consistent). Missing two or three very recent updates in the news feed is okay in this scenario, as long as those updates eventually appear. And in fact, Facebook built its news feed using Cassandra.
Imagine a distributed key-value cache where updates are very rare. If there are almost no update operations, ensuring strong consistency is unnecessary, so you can focus on availability. An occasional cache miss (the key-value entry is not populated yet) followed by a request to the database due to eventual consistency should be okay.
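A minimal sketch of that second scenario as a read-through cache (redis-py assumed; the database lookup is a stand-in): a miss or a stale entry just costs an extra trip to the source of truth.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def load_from_database(key):
    # Stand-in for the authoritative store.
    return f"value-for-{key}"

def get(key, ttl_seconds=300):
    cached = r.get(key)
    if cached is not None:
        return cached.decode()
    # Cache miss (entry not populated/replicated yet): fall back to the DB
    # and repopulate. Eventual consistency only costs us this extra lookup.
    value = load_from_database(key)
    r.setex(key, ttl_seconds, value)
    return value
```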
My question is in what cases can you actually sacrifice consistency?
One case would be when building a recommendation engine data set and serving it with Cassandra. These data sets are essentially the aggregation of many, many users to determine purchasing/viewing patterns.
For example: If I add a Rey Star Wars action figure to my shopping cart, the underlying recommendation engine runs a query for similar resulting purchasing patterns based on others who have also purchased an action figure of Rey. The query returns the top 5 product results, and puts them at the bottom of the page.
Those 5 products returned are the result of analysis and aggregation of several thousand prior purchases. Let's assume that some of that data isn't consistent, causing a variance in the 5 products returned. Is that really a big deal?
tl;dr: The real question to ask is whether getting a somewhat accurate list of 5 product recommendations in less than 10 ms is better than getting a 100% accurate list of 5 product recommendations in 100 ms.
Both result sets will help drive sales, but the one that is returned fast enough not to hinder the user experience is much preferred.
The 'C' in CAP refers to linearizability, which is a very strong form of consistency that you don't need most of the time.
Linearizability is a recency guarantee which makes it appear that there is a single copy of data. As soon as you make a change in the data, all subsequent reads will return the changed data. Such a level of consistency is expensive and doesn't scale well. Yet in certain scenarios we need linearizability, viz.
Leader election
Allowing end users to create their unique user id
Distributed locking etc.
When you have these use cases, you'd use something like ZooKeeper, etcd, etc. Cassandra also has Lightweight Transactions (LWT), which use an extension of the classic Paxos algorithm to implement linearizability. This feature can be used to address those rare use cases where you must have linearizability and serializability, but it is expensive. In the vast majority of cases you are just fine with slightly weaker consistency: you trade a little bit of consistency for better scalability and performance.
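As a small illustration of the second use case above (unique user ids), here is a rough sketch of an LWT insert with the Python cassandra-driver (keyspace and table are hypothetical):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")   # assumed keyspace

# IF NOT EXISTS turns this insert into a lightweight transaction (Paxos under
# the hood): of several concurrent claimants for the same id, only one wins.
result = session.execute(
    "INSERT INTO users (user_id, email) VALUES (%s, %s) IF NOT EXISTS",
    ("alice", "alice@example.com"),
)

if result.was_applied:
    print("user id claimed")
else:
    print("user id is already taken")
```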
Some e-commerce websites send apology letters to customers for not being able to fulfill their orders. That is because the last copy of the product was sold to more than one customer due to the lack of linearizability. They prefer to deal with that over not being able to scale with the customer base and not being able to respond to requests within stringent SLAs.
Cassandra is said to have tunable consistency. You may want to record user clicks or activities for analysis. You are okay if some data is lost, but you cannot compromise on performance. You'd probably use a write consistency level of ANY with hints enabled (sloppy quorum).
If you want a little more consistency, you'd use the QUORUM consistency level for reads and writes, along with hints and read repair. In the vast majority of cases all nodes are updated very quickly. Even if one or two nodes go down, a majority of nodes will have the data, and the failed nodes will be repaired when they come back via hints, read repair, and anti-entropy repair.
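A rough sketch of that tunable consistency with the Python cassandra-driver (keyspace and schema are made up): ANY for fire-and-forget click writes, QUORUM where you want stronger guarantees.

```python
from datetime import datetime

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("analytics")   # assumed keyspace

# Fire-and-forget click tracking: with CL ANY the write succeeds even if only
# a hint is stored, trading durability/consistency for availability.
click = SimpleStatement(
    "INSERT INTO clicks (user_id, ts, url) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ANY,
)
session.execute(click, ("alice", datetime.utcnow(), "/checkout"))

# Stronger read: QUORUM reads overlap with QUORUM writes, so a majority of
# replicas agree on what is returned.
lookup = SimpleStatement(
    "SELECT ts, url FROM clicks WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(lookup, ("alice",))
```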
Cassandra is particularly useful for cases where you won't have many concurrent updates to the same data. The reason is that, unlike the Dynamo architecture, it does not use vector clocks for conflict resolution between replicas. Instead it uses Last Write Wins (LWW) based on timestamps. If the timestamps are the same, it falls back to lexicographical order. Since clocks on nodes cannot be perfectly accurate even with NTP, there is a possibility of data loss, although Cassandra has taken some steps to mitigate that, e.g. client-side timestamps instead of server-side timestamps.
The CAP theorem says that, given partition tolerance, you can choose either availability or consistency in a distributed database (no one would want to give up partition tolerance in any case). So if you want maximum availability, you'll have to give up on consistency. This depends, of course, on how critical the business is.
You answered something on SO but the answer doesn't show up when you visit the page? That can be tolerated. SO being down? That can't be. Critical financial systems would rather have strong consistency than availability: every once in a while, my bank's servers go offline when I try to make a payment.
Normally, you choose availability and eventual consistency. The answer you wrote on SO will eventually show up.
Apart from the above-mentioned cases where inconsistent data is tolerable, there are also scenarios where we can defer to the user to resolve the inconsistency.
For example, if we find two different versions of someone's address in the database, we can prompt the user to identify the correct one.

Best DB architecture to maintain/update counters in near real time

I am at the beginning of a project where we will need to manage a near real-time flow of messages containing some ids (e.g. sender's id, receiver's id, etc.). We expect a throughput of about 100 messages per second.
What we will need to do is to keep track of the number of times these ids appeared in a specific time frame (e.g. last hour or last day) and store these values somewhere.
We will use the values to perform some real-time analysis (i.e. apply a predictive model) and update them as needed while parsing the messages.
Considering the high throughput and the real-time requirement, what DB solution would be the best choice?
I was thinking about an in-memory key-value DB that persists data to disk periodically (like Redis).
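To make the idea concrete, a minimal sketch of what I have in mind (redis-py; the key scheme and hourly buckets are just illustrative):

```python
import time

import redis

r = redis.Redis(host="localhost", port=6379)

def record(sender_id, now=None):
    hour = int((now or time.time()) // 3600)      # hourly bucket
    key = f"count:{sender_id}:{hour}"
    pipe = r.pipeline()
    pipe.incr(key)                                # atomic increment
    pipe.expire(key, 25 * 3600)                   # keep a bit more than a day
    count_this_hour, _ = pipe.execute()
    return count_this_hour

def count_last_day(sender_id, now=None):
    hour = int((now or time.time()) // 3600)
    keys = [f"count:{sender_id}:{h}" for h in range(hour - 23, hour + 1)]
    return sum(int(v) for v in r.mget(keys) if v is not None)
```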
Thanks in advance for the help.
The best choice depends on many factors we don’t know, like what tech stack your team is already using, how open they are to learning new things, how much operational burden you are willing to take on, etc.
That being said, I would build a counter on top of DynamoDB. Since DynamoDB is fully managed, you have no operational burden (no database server upgrades, etc.). It can handle very high throughput, and it has single-digit millisecond latency for writes and reads to a single row. AWS even has documentation describing how to use DynamoDB as a counter.
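For illustration, a rough sketch of that atomic-counter pattern with boto3 (the table name, key and attribute are hypothetical; the ADD-expression trick is the one the AWS docs describe):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("MessageCounters")   # hypothetical table, partition key: counter_id

def bump(sender_id, hour_bucket, amount=1):
    # ADD is an atomic in-place increment, so concurrent writers don't clobber
    # each other; one item per (id, hour) gives per-hour counts.
    resp = table.update_item(
        Key={"counter_id": f"{sender_id}#{hour_bucket}"},
        UpdateExpression="ADD message_count :inc",
        ExpressionAttributeValues={":inc": amount},
        ReturnValues="UPDATED_NEW",
    )
    return resp["Attributes"]["message_count"]
```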
I’m not as familiar with other cloud platforms, but you can probably find something in Azure or GCP that offers similar functionality.

Tech-stack for querying and alerting on GB scale (streaming and at rest) datasets

Trying to scope out a project that involves data ingestion and analytics, and could use some advice on tooling and software.
We have sensors creating records with 2-3 fields, each producing ~200 records per second (~2 KB/second) and sending them to a remote server once per minute, resulting in roughly 18 million records and ~200 MB of data per day per sensor. Not sure how many sensors we will need, but it will likely start off in the single digits.
We need to be able to take action (alert) on recent data (not sure of the time period; guessing less than 1 day), as well as run queries on the past data. We'd like something that scales and is relatively stable.
I was thinking about using Elasticsearch (then maybe using X-Pack or Sentinl for alerting). I thought about Postgres as well. Kafka and Hadoop are definitely overkill. We're on AWS, so we have access to tools like Kinesis as well.
Question is, what would be an appropriate set of software / architecture for the job?
Have you talked to your AWS Solutions Architect about the use case? They love this kind of thing and will be happy to help you figure out the right architecture. It may be a good fit for the AWS IoT services.
If you don't go with the managed IoT services, you'll want to push the messages to a scalable queue like Kafka or Kinesis (IMO, if you are processing 18M * 5 sensors = 90M events per day, that's >1000 events per second. Kafka is not overkill here; a lot of other stacks would be under-kill).
From Kinesis you then flow the data into a faster stack for analytics / querying, such as HBase, Cassandra, Druid or ElasticSearch, depending on your team's preferences. Some would say that this is time series data so you should use a time series database such as InfluxDB; but again, it's up to you. Just make sure it's a database that performs well (and behaves itself!) when subjected to a steady load of 1000 writes per second. I would not recommend using a RDBMS for that, not even Postgres. The ones mentioned above should all handle it.
Also, don't forget to flow your messages from Kinesis to S3 for safe keeping, even if you don't intend to keep the messages forever (just set a lifecycle rule to delete old data from the bucket if that's the case). After all, this is big data and the rule is "everything breaks, all the time". If your analytical stack crashes you probably don't want to lose the data entirely.
As for alerting, it depends on 1) what stack you choose for the analytical part, and 2) what kinds of triggers you want to use. From your description I'm guessing you'll soon end up wanting to build more advanced triggers, such as machine learning models for anomaly detection, and for that you may want something that doesn't poll the analytical stack but rather consumes events straight out of Kinesis.
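As a rough illustration of the "consume straight out of Kinesis" option, a minimal boto3 polling loop (stream name, shard handling, threshold and alert hook are all placeholders; a real deployment would use the KCL or Lambda with proper checkpointing):

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM = "sensor-readings"          # placeholder stream name
THRESHOLD = 100.0                   # placeholder alert rule

def alert(record):
    print("ALERT:", record)         # stand-in for SNS / PagerDuty / etc.

shard_iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId="shardId-000000000000",  # single-shard example; enumerate shards in real code
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=500)
    for rec in resp["Records"]:
        reading = json.loads(rec["Data"])
        if reading.get("value", 0) > THRESHOLD:
            alert(reading)
    shard_iterator = resp["NextShardIterator"]
    time.sleep(1)                    # stay under the per-shard read limits
```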

Distributed store with transactions

I currently develop an application hosted on Google App Engine. However, GAE has many disadvantages: it's expensive, and it is very hard to debug since we can't attach to real instances.
I am considering moving from GAE to an open-source alternative. Unfortunately, none of the existing NoSQL solutions that satisfy me support transactions similar to GAE's (GAE supports transactions inside entity groups).
How would you approach this problem? I am currently considering a store like Apache Cassandra plus some locking service (Hazelcast) for transactions. Does anyone have experience in this area? What can you recommend?
There are plans to support entity groups in Cassandra in the future; see CASSANDRA-1684.
If your data can't be easily modelled without transactions, is it worth using a non-transactional database? Do you need the scalability?
The standard way to do transaction-like things in Cassandra is described in this presentation, starting at slide 24. Basically, you write something similar to a WAL (write-ahead log) entry to one row, then perform the actual writes on multiple rows, then delete the WAL row. On failure, simply read the WAL entry and perform the actions recorded in it. Since all Cassandra writes carry a user-supplied timestamp, all writes can be made idempotent; just store the timestamp of your write with the WAL entry.
This strategy gives you the Atomicity and Durability in ACID, but you do not get Consistency and Isolation. If you are working at a scale that requires something like Cassandra, you probably need to give up full ACID transactions anyway.
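For concreteness, here is a rough sketch of that WAL pattern with the Python cassandra-driver (keyspace, tables and the recovery job are hypothetical; only the write/apply/delete flow described above is shown):

```python
import json
import time
import uuid

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")   # keyspace and tables are illustrative

def create_user(user_id, email, name):
    ts = int(time.time() * 1_000_000)   # client-supplied timestamp makes replays idempotent
    writes = [
        (f"INSERT INTO users_by_id (id, email, name) VALUES (%s, %s, %s) USING TIMESTAMP {ts}",
         (user_id, email, name)),
        (f"INSERT INTO users_by_email (email, id, name) VALUES (%s, %s, %s) USING TIMESTAMP {ts}",
         (email, user_id, name)),
    ]
    log_id = uuid.uuid4()
    # 1. Record the intent (the "WAL" row), including the timestamp for replays.
    session.execute(
        "INSERT INTO wal_log (id, ts, payload) VALUES (%s, %s, %s)",
        (log_id, ts, json.dumps({"id": user_id, "email": email, "name": name})),
    )
    # 2. Perform the actual writes on the real rows.
    for cql, params in writes:
        session.execute(cql, params)
    # 3. On success, delete the WAL row. If the client dies before this point,
    #    a recovery job re-reads wal_log and replays the writes; since replays
    #    reuse the stored timestamp, they are harmless.
    session.execute("DELETE FROM wal_log WHERE id = %s", (log_id,))
```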
You may want to try AppScale or TyphoonAE for hosting applications built for App Engine on your own hardware.
If you are developing in Python, you have very interesting debugging options with the Werkzeug debugger.
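For example, a minimal sketch of wrapping a WSGI app in Werkzeug's debugger middleware (plain Werkzeug shown here, nothing GAE-specific):

```python
from werkzeug.debug import DebuggedApplication
from werkzeug.serving import run_simple
from werkzeug.wrappers import Request, Response

@Request.application
def app(request):
    return Response("hello")

# Wrap the WSGI app to get interactive in-browser tracebacks.
# Never enable evalex on anything publicly reachable.
app = DebuggedApplication(app, evalex=True)
run_simple("localhost", 5000, app)
```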

Resources