Best DB architecture to maintain/update counters in near real time

I am at the beginning of a project where we will need to manage a near real-time flow of messages containing some ids (e.g. sender's id, receiver's id, etc.). We expect a throughput of about 100 messages per second.
What we will need to do is to keep track of the number of times these ids appeared in a specific time frame (e.g. last hour or last day) and store these values somewhere.
We will use the values to perform some real time analysis (i.e. apply a predictive model) and update them when needed while parsing the messages.
Considering the high throughput and the need to operate in near real time, what DB solution would be the best choice?
I was thinking about a key-value in memory DB that will persist data on disk periodically (like Redis).
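Roughly what I have in mind is the sketch below (Python with redis-py; the one-counter-per-id-per-hour key layout and the TTL are just my first guess):

    import time
    import redis  # assumes the redis-py client and a local Redis instance

    r = redis.Redis(host="localhost", port=6379)

    def record_id(sender_id: str) -> None:
        """Increment the counter for this id in the current hourly bucket."""
        hour_bucket = int(time.time()) // 3600
        key = f"count:{sender_id}:{hour_bucket}"
        pipe = r.pipeline()
        pipe.incr(key)              # atomic increment
        pipe.expire(key, 2 * 3600)  # keep buckets a bit longer than the window
        pipe.execute()

    def count_last_hour(sender_id: str) -> int:
        """Read the counter for the current hourly bucket."""
        hour_bucket = int(time.time()) // 3600
        value = r.get(f"count:{sender_id}:{hour_bucket}")
        return int(value) if value else 0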
Thanks in advance for the help.

The best choice depends on many factors we don't know, such as what tech stack your team is already using, how open they are to learning new things, and how much operational burden you are willing to take on.
That being said, I would build a counter on top of DynamoDB. Since DynamoDB is fully managed, you have no operational burden (no database server upgrades, etc.). It can handle very high throughput, and it has single-digit millisecond latency for writes and reads to a single row. AWS even has documentation describing how to use DynamoDB as a counter.
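As a rough illustration of that counter pattern (Python with boto3; the MessageCounters table, its key schema and the hourly bucketing are my assumptions, not taken from the AWS docs), an atomic counter is a single UpdateItem with an ADD action:

    import time
    import boto3  # assumes AWS credentials and a "MessageCounters" table already exist

    table = boto3.resource("dynamodb").Table("MessageCounters")

    def increment_counter(entity_id: str) -> int:
        """Atomically add 1 to this id's counter for the current hourly bucket."""
        hour_bucket = int(time.time()) // 3600
        response = table.update_item(
            Key={"entity_id": entity_id, "hour_bucket": hour_bucket},
            UpdateExpression="ADD message_count :inc",
            ExpressionAttributeValues={":inc": 1},
            ReturnValues="UPDATED_NEW",
        )
        return int(response["Attributes"]["message_count"])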
I’m not as familiar with other cloud platforms, but you can probably find something in Azure or GCP that offers similar functionality.

Related

When is data consistency not an issue?

I am new to distributed systems and have been reading about the CAP theorem; I am interested in AP systems such as Cassandra.
My question is: in what cases can you actually sacrifice consistency? Effectively, what I am saying is that sacrificing consistency means serving inaccurate data. In what cases, then, would you actually use an AP datastore like Cassandra? I can't think of any case where I wouldn't want my reads to be consistent.
By an AP system, I assume you will at least aim to ensure eventual consistency.
Imagine you're developing a social network where users have friends and their own news feeds. It doesn't matter if a particular user's feed has an occasional five-minute lag (their feed list is eventually consistent). Missing two or three very recent updates in the news feed is okay in this scenario, as long as those updates eventually appear. And in fact, Facebook built its news feed using Cassandra.
Imagine a distributed key-value cache system where updates are very rare. If there are almost no update operations, ensuring strong consistency is unnecessary, so you can focus on availability. The occasional cache miss (a key-value entry not populated yet) and resulting request to the database due to eventual consistency should be okay.
My question is in what cases can you actually sacrifice consistency?
One case would be when building a recommendation engine data set and serving it with Cassandra. These data sets are essentially the aggregation of many, many users to determine purchasing/viewing patterns.
For example: If I add a Rey Star Wars action figure to my shopping cart, the underlying recommendation engine runs a query for similar resulting purchasing patterns based on others who have also purchased an action figure of Rey. The query returns the top 5 product results, and puts them at the bottom of the page.
Those 5 products returned are the result of analysis and aggregation of several thousand prior purchases. Let's assume that some of that data isn't consistent, causing a variance in the 5 products returned. Is that really a big deal?
tl;dr: The real question to ask is whether getting a somewhat accurate list of 5 product recommendations in less than 10 ms is better than getting a 100% accurate list of 5 product recommendations in 100 ms.
Both result sets will help drive sales, but the one returned fast enough that it doesn't hinder the user experience is far preferable.
The 'C' in CAP refers to linearizability, which is a very strong form of consistency that you don't need most of the time.
Linearizability is a recency guarantee which makes it appear that there is a single copy of data. As soon as you make a change in the data, all subsequent reads will return the changed data. Such a level of consistency is expensive and doesn't scale well. Yet in certain scenarios we need linearizability, viz.
Leader election
Allowing end users to create their unique user id
Distributed locking etc.
When you have these use cases, you'd use something like ZooKeeper, etcd, etc. Cassandra also has Lightweight Transactions (LWT), which use an extension of the classic Paxos algorithm to implement linearizability. This feature can be used to address those rare use cases where you must have linearizability and serializability, but it is expensive. In the vast majority of cases you are just fine trading a little bit of consistency for better scalability and performance.
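For instance, the unique-user-id case above maps directly to an LWT INSERT ... IF NOT EXISTS. A minimal sketch with the DataStax Python driver, assuming a hypothetical accounts keyspace and users table:

    from cassandra.cluster import Cluster  # assumes the cassandra-driver package

    session = Cluster(["127.0.0.1"]).connect("accounts")  # hypothetical keyspace

    def create_user(user_id: str, email: str) -> bool:
        """Returns True only if this user_id did not exist yet (linearizable check)."""
        result = session.execute(
            "INSERT INTO users (user_id, email) VALUES (%s, %s) IF NOT EXISTS",
            (user_id, email),
        )
        # The first column of an LWT result is the [applied] flag.
        return bool(result.one()[0])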
Some eCommerce websites send apology letters to customers for not being able to fulfill their orders. That happens because the last copy of the product was sold to more than one customer due to the lack of linearizability. They prefer to deal with that over not being able to scale with the customer base and not being able to respond to requests within stringent SLAs.
Cassandra is said to have tunable consistency. You may want to record user clicks or activities for analysis. You are okay if some data are lost, but you cannot compromise on performance. You'd probably use a write consistency level of ANY with hints enabled (sloppy quorum).
If you want a little more consistency, you'd use a QUORUM consistency level for reads and writes, along with hints and read repair. In the vast majority of cases all nodes are updated almost instantaneously. Even if one or two nodes go down, a majority of nodes will have the data, and the failed nodes will be repaired when they come back using hinted handoff, read repair, or anti-entropy repair.
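A minimal sketch of that tunable consistency with the Python driver (the keyspace, tables and columns here are made up for illustration; consistency is set per statement):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("tracking")  # hypothetical keyspace

    # Fire-and-mostly-forget click tracking: prioritise availability.
    click_write = SimpleStatement(
        "INSERT INTO clicks (user_id, ts, url) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.ANY,
    )
    session.execute(click_write, ("u42", 1700000000, "/home"))

    # Stronger path: a majority of replicas must acknowledge the read.
    profile_read = SimpleStatement(
        "SELECT email FROM users WHERE user_id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    row = session.execute(profile_read, ("u42",)).one()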
Cassandra is particularly useful for cases where you won't have many concurrent updates to the same data. The reason is that, unlike the Dynamo architecture, it does not use vector clocks for conflict resolution between replicas. Instead it uses Last Write Wins (LWW) based on timestamps; if the timestamps are the same, it uses lexicographical order. Since the time on nodes cannot be perfectly accurate even with NTP, there is a possibility of data loss, although Cassandra has taken some steps to mitigate that, e.g. client-side timestamps instead of server-side timestamps.
The CAP theorem says that, given partition tolerance, you can choose either availability or consistency in a distributed database (no one would want to give up partition tolerance in any case). So if you want maximum availability, you'll have to give up consistency. Which you choose depends, of course, on how critical the business is.
You answered something on SO but the answer doesn't show up when you visit the page? That can be tolerated. SO being down? That can't be. Critical financial systems would rather have strong consistency than availability; every once in a while, my bank's servers go offline when I try to make a payment.
Normally, you choose availability and eventual consistency. The answer you wrote on SO will eventually show up.
Apart from the above-mentioned cases where inconsistent data is tolerable, there are also scenarios where we can defer to the user to resolve the inconsistency.
For example, if we find two different versions of someone's address in the database, we can prompt the user to identify the correct address.

Tech-stack for querying and alerting on GB scale (streaming and at rest) datasets

Trying to scope out a project that involves data ingestion and analytics, and could use some advice on tooling and software.
We have sensors creating records with 2-3 fields, each producing ~200 records per second (~2 KB/second) and sending them off to a remote server once per minute, resulting in about 18 million records and 200 MB of data per day per sensor. Not sure how many sensors we will need, but it will likely start off in the single digits.
We need to be able to take action (alert) on recent data (not sure of the time period; probably less than 1 day), as well as run queries on the past data. We'd like something that scales and is relatively stable.
Was thinking about using Elasticsearch (then maybe X-Pack or Sentinl for alerting). Thought about Postgres as well. Kafka and Hadoop are definitely overkill. We're on AWS, so we have access to tools like Kinesis as well.
Question is, what would be an appropriate set of software / architecture for the job?
Have you talked to your AWS Solutions Architect about the use case? They love this kind of thing and will be happy to help you figure out the right architecture. It may be a good fit for the AWS IoT services.
If you don't go with the managed IoT services, you'll want to push the messages to a scalable queue like Kafka or Kinesis (IMO, if you are processing 18M * 5 sensors = 90M events per day, that's >1000 events per second. Kafka is not overkill here; a lot of other stacks would be under-kill).
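As a rough sketch of the ingestion side with boto3 (the "sensor-readings" stream name and record fields are assumptions), batching a minute's worth of readings into PutRecords calls:

    import json
    import boto3  # assumes AWS credentials and a Kinesis stream named "sensor-readings"

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    def ship_batch(sensor_id: str, readings: list[dict]) -> None:
        """Send one minute's worth of readings in chunks of 500 (the PutRecords limit)."""
        for i in range(0, len(readings), 500):
            kinesis.put_records(
                StreamName="sensor-readings",
                Records=[
                    {"Data": json.dumps(r).encode(), "PartitionKey": sensor_id}
                    for r in readings[i : i + 500]
                ],
            )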
From Kinesis you then flow the data into a fast analytics/query stack, such as HBase, Cassandra, Druid or Elasticsearch, depending on your team's preferences. Some would say that this is time series data, so you should use a time series database such as InfluxDB; but again, it's up to you. Just make sure it's a database that performs well (and behaves itself!) when subjected to a steady load of 1000 writes per second. I would not recommend using an RDBMS for that, not even Postgres. The ones mentioned above should all handle it.
Also, don't forget to flow your messages from Kinesis to S3 for safekeeping, even if you don't intend to keep the messages forever (just set a lifecycle rule to delete old data from the bucket if that's the case). After all, this is big data and the rule is "everything breaks, all the time". If your analytical stack crashes, you probably don't want to lose the data entirely.
As for alerting, it depends on 1) what stack you choose for the analytical part, and 2) what kinds of triggers you want to use. From your description I'm guessing you'll soon end up wanting to build more advanced triggers, such as machine learning models for anomaly detection, and for that you may want something that doesn't poll the analytical stack but rather consumes events straight out of Kinesis.
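For illustration, a bare-bones trigger that consumes straight from the stream instead of polling the analytical store might look like the sketch below (stream name, shard id, threshold and field names are all placeholders; a real consumer would use the KCL, enhanced fan-out, or a Lambda trigger):

    import json
    import time
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    THRESHOLD = 100.0  # placeholder alerting rule

    shard_iterator = kinesis.get_shard_iterator(
        StreamName="sensor-readings",
        ShardId="shardId-000000000000",   # single-shard example only
        ShardIteratorType="LATEST",
    )["ShardIterator"]

    while True:
        batch = kinesis.get_records(ShardIterator=shard_iterator, Limit=1000)
        for record in batch["Records"]:
            reading = json.loads(record["Data"])
            if reading.get("value", 0) > THRESHOLD:
                print(f"ALERT: sensor {reading.get('sensor_id')} reported {reading['value']}")
        shard_iterator = batch["NextShardIterator"]
        time.sleep(1)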

Immutable database to huge write volume

I'm building an application that needs to use an immutable database. I know about Datomic, but it's not recommended for huge data volumes (my application will have thousands of writes per second, or more).
I already did some searching and couldn't find any similar database that does not have this "issue".
My application will use event sourcing and microservices pattern.
Any suggestions about what database should I use?
Greg Young's Event Store appears to fit your criteria.
Stores your data as a series of immutable events over time.
Claims to be benchmarked at 15,000 writes per second and 50,000 reads per second.
Amazon's DynamoDB can scale to meet very high TPS demands. It can certainly handle tens to hundreds of thousands of writes per second sustained if your schema is designed properly, but it is not cheap.
Your question is a bit vague about whether you need to be able to sustain tens of thousands of writes per second, or you need to be able to burst to tens of thousands of writes. It's also not clear how you intend to read the data.
How big is a typical event/record?
Could you batch the writes?
Could you partition your writes?
Have you looked into something like Amazon's Kinesis Firehose? With small events you could have a relatively cheap ingestion pipeline and then perhaps use S3 for long term storage. It would certainly be cheaper than DynamoDB.
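As a rough sketch of that approach with boto3, assuming a Firehose delivery stream (here called "events-to-s3") already configured to batch records into an S3 bucket:

    import json
    import boto3  # assumes AWS credentials and a delivery stream named "events-to-s3"

    firehose = boto3.client("firehose", region_name="us-east-1")

    def append_events(events: list[dict]) -> None:
        """Hand events to Firehose; it batches and writes them to S3 for you."""
        for i in range(0, len(events), 500):  # PutRecordBatch accepts up to 500 records
            firehose.put_record_batch(
                DeliveryStreamName="events-to-s3",
                Records=[
                    {"Data": (json.dumps(e) + "\n").encode()} for e in events[i : i + 500]
                ],
            )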
Azure offers similar services as well but I'm not as familiar with their offerings.

How expensive is database access? How often do we access it?

I'm about to write an application for Android, and it will use MySQL.
I know that DB access is really expensive in terms of time, and I would like to know how often applications like instant messaging or online games access their databases.
For example, in a game we would like to save the positions of a player in the world while he's moving all the time.
Is database access actually not that expensive, and is there a way to stay connected to it all the time and just make requests that are cheap?
Or is it really expensive in any case, and are there techniques to access it only every X interval of time, saving the data locally in the meantime?
I know that my question is really general, and the answer always depends on what we need and want.
My question came up because I made a really simple login application that connects and makes one request to the database, and it takes one second (a lot!!) to get the result, so how can online applications be so fast?
Thank you
Before answering this, I would recommend simulating the process as much as possible and benchmarking; then you can work towards the best solution for your use case.
E.g. if I have an application submitting data to a database, I simulate the submission so I can easily run multiple submissions at the same time, see where the bottleneck is... and see how it compares when I use caching, replication, indexes, etc.
Also, reading company blogs can be helpful, as they often share success stories that support the use of a particular approach.
How expensive is access to database?
Accessing a database can be a pretty quick operation
SELECT 1; // 0.005 Secs :D
However, there are situations that can lead to poor performance (slow reads, writes and updates), but there are some relatively simple ways to combat this:
Indexes
The best way to improve the performance of SELECT operations is to create indexes on one or more of the columns that are tested in the query. The index entries act like pointers to the table rows, allowing the query to quickly determine which rows match a condition in the WHERE clause, and retrieve the other column values for those rows.
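For example (a sketch in Python with mysql-connector; the game database and positions table are hypothetical), a single index on the filtered column turns the hot lookup into an index seek rather than a full table scan:

    import mysql.connector  # assumes the mysql-connector-python package

    conn = mysql.connector.connect(
        host="localhost", user="app", password="secret", database="game"
    )
    cur = conn.cursor()

    # One-off: index the column the hot query filters on.
    cur.execute("CREATE INDEX idx_positions_player ON positions (player_id)")

    # The lookup now uses the index instead of scanning every row.
    cur.execute("SELECT x, y FROM positions WHERE player_id = %s", (42,))
    print(cur.fetchone())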
Replication
spreading the load among multiple slaves to improve performance. In this environment, all writes and updates must take place on the master server. Reads, however, may take place on one or more slaves. This model can improve the performance of writes (since the master is dedicated to updates), while dramatically increasing read speed across an increasing number of slaves.
How often do we access it?
If you are solely using a database, you will access it every time you need to update a player's position and every time you need to find out their position.
This is where you would explore options to prevent accessing the database.
Memory caches such as Redis or Memcached (a cache-aside sketch follows this list)
Replication - Only read from slaves
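A cache-aside sketch of the first option (Python with redis-py and mysql-connector; the key layout and 60-second TTL are arbitrary choices):

    import json
    import redis  # assumes redis-py and a local Redis instance
    import mysql.connector

    cache = redis.Redis(host="localhost", port=6379)
    db = mysql.connector.connect(
        host="localhost", user="app", password="secret", database="game"
    )

    def get_position(player_id: int) -> dict:
        """Serve from cache when possible; fall back to MySQL and repopulate."""
        key = f"position:{player_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)

        cur = db.cursor()
        cur.execute("SELECT x, y FROM positions WHERE player_id = %s", (player_id,))
        x, y = cur.fetchone()  # error handling for a missing row omitted
        position = {"x": x, "y": y}
        cache.setex(key, 60, json.dumps(position))  # expire after 60 seconds
        return position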
It depends on your design and requirements.
1) Most applications manage connection pools to minimize connection initialization time.
2) Most ORM frameworks have an external cache to improve read performance. So if you do heavy data reading in your application, don't worry about storing it locally; the cache will be effective in this case.
3) Storing data locally, whether in a file or some other format, also adds its own performance overhead.
4) If you keep the data in main memory, game performance will obviously be better. That's why gamers prefer high-end graphics cards and lots of RAM.
For most databases there is the option of batch insertions. Obviously even a small overhead will accumulate if you make too many connections over time, and performing single insertions has greater overhead than a batch. The only question is how often? You should test how often you want to insert and how much information you should store locally before doing a batch insertion.
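A sketch of that local-buffer-then-batch idea (Python with mysql-connector; the 100-row flush threshold and the positions table are assumptions):

    import mysql.connector  # assumes the mysql-connector-python package

    conn = mysql.connector.connect(
        host="localhost", user="app", password="secret", database="game"
    )
    buffer: list[tuple] = []

    def record_position(player_id: int, x: float, y: float, flush_at: int = 100) -> None:
        """Collect positions locally and write them in one batched INSERT."""
        buffer.append((player_id, x, y))
        if len(buffer) >= flush_at:
            cur = conn.cursor()
            cur.executemany(
                "INSERT INTO positions (player_id, x, y) VALUES (%s, %s, %s)", buffer
            )
            conn.commit()
            buffer.clear()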

Trading applications trade-offs : Databases and low-latency

Databases are required for almost every business application to store data and the transactions done on that data. The transactions typically take a time of the order of milliseconds. At the same time, in a trading application the one thing which is not at all acceptable is "latency". So, what are the trade-offs made in such applications which require an upper limit on latency?
For example, a trade has been placed by the customer; it must pass a few checks, which are stored in the database, requiring a DB fetch. Then the trade should be passed on to an OMS/ORS or the exchange. And at each layer, it would be required to store some sort of transaction data in the database. How should one maintain a balance between transaction persistence and low latency?
Two things:
Caching: Cache the rules in your application, so you don't need to hit the database for every trade
Threading: Put the code that stores the transaction data in the database into another thread. This way, you can route the trade to the exchange immediately and save the data in the database in parallel.
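A sketch of that threading idea in Python (SQLite stands in for whatever database you persist trades to; the point is that the hot path only enqueues and never waits on a disk write):

    import queue
    import sqlite3
    import threading

    trade_log: queue.Queue = queue.Queue()

    def persist_worker() -> None:
        """Drain the queue and write trades to the database off the hot path."""
        db = sqlite3.connect("trades.db")
        db.execute("CREATE TABLE IF NOT EXISTS trades (trade_id TEXT, symbol TEXT, qty INTEGER)")
        while True:
            trade = trade_log.get()
            db.execute("INSERT INTO trades VALUES (?, ?, ?)", trade)
            db.commit()
            trade_log.task_done()

    threading.Thread(target=persist_worker, daemon=True).start()

    def handle_trade(trade_id: str, symbol: str, qty: int) -> None:
        # 1. Validate against rules already cached in memory (not shown).
        # 2. Route the trade to the exchange immediately.
        # 3. Enqueue persistence; the worker thread writes it in parallel.
        trade_log.put((trade_id, symbol, qty))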
Nowadays people use in-memory transaction systems, so you cut out the network latency you get with a database.
Here are some things that you can consider for low latency:
To achieve low latency in Java you have to take control of GC. There are many ways to do that, e.g. pre-allocate objects (i.e. use the flyweight design pattern); use primitives - Trove is very good for that, since all its data structures are based on primitives; and reuse object instances, e.g. create a system-wide dictionary to reduce the creation of new objects, which is a very good option when reading data from a stream/socket/DB.
- Try to reduce contention: use wait-free algorithms (which is a bit difficult) or lock-free algorithms. You can find tons of examples of these.
- Use in-memory computing. Memory is cheap; you can have terabytes of data in memory.
- Use mechanical sympathy - refer to the LMAX Disruptor, an excellent framework.
