Immutable database to huge write volume - database

I'm building a application that need to be created using an immutable database, I know about Datomic, but's not recommended to huge data volume (my application will have thousands, or more, writes per second).
I already did some search about it and I could't find any similar database that do not have this "issue".
My application will use event sourcing and microservices pattern.
Any suggestions about what database should I use?

Greg Young's Event Store appears to fit your criteria.
Stores your data as a series of immutable events over time.
Claims to be benchmarked at 15,000 writes per second and 50,000 reads per second.

Amazon's DynamoDB can scale to meet very high TPS demands. It can certainly handle 10 to 100 of thousands of writes per second sustained if your schema is designed properly but it is not cheap.
Your question is a bit vague about whether you need to be able to sustain tens of thousands of writes per second, or you need to be able to burst to tens of thousands of writes. It's also not clear how you intend to read the data.
How big is a typical event/record?
Could you batch the writes?
Could you partition your writes?
Have you looked into something like Amazon's Kinesis Firehose? With small events you could have a relatively cheap ingestion pipeline and then perhaps use S3 for long term storage. It would certainly be cheaper than DynamoDB.
Azure offers similar services as well but I'm not as familiar with their offerings.

Related

Best DB architecture to maintain/update counters in near real time

I am at the beginning of a project where we will need to manage a near real-time flow of messages containing some ids (e.g. sender's id, receiver's id, etc.). We expect a throughput of about 100 messages per second.
What we will need to do is to keep track of the number of times these ids appeared in a specific time frame (e.g. last hour or last day) and store these values somewhere.
We will use the values to perform some real time analysis (i.e. apply a predictive model) and update them when needed while parsing the messages.
Considering the high throughput and the need to be in real time what DB solution would be the better choice?
I was thinking about a key-value in memory DB that will persist data on disk periodically (like Redis).
Thanks in advance for the help.
The best choice depends on many factors we don’t know, like what tech stack is your team already using, how open are they to learning new things, how much operational burden are you willing to take on, etc.
That being said, I would build a counter on top of DynamoDB. Since DynamoDB is fully managed, you have no operational burden (no database server upgrades, etc.). It can handle very high throughput, and it has single-digit millisecond latency for writes and reads to a single row. AWS even has documentation describing how to use DynamoDB as a counter.
I’m not as familiar with other cloud platforms, but you can probably find something in Azure or GCP that offers similar functionality.

Tech-stack for querying and alerting on GB scale (streaming and at rest) datasets

Trying to scope out a project that involves data ingestion and analytics, and could use some advice on tooling and software.
We have sensors creating records with 2-3 fields, each one producing ~200 records per second (~2kb/second) and will send them off to a remote server once per minute resulting in about ~18 mil records and 200MB of data per day per sensor. Not sure how many sensors we will need but it will likely start off in the single digits.
We need to be able to take action (alert) on recent data (not sure the time period guessing less than 1 day), as well as run queries on the past data. We'd like something that scales and is relatively stable .
Was thinking about using elastic search (then maybe use x-pack or sentinl for alerting). Thought about Postgres as well. Kafka and Hadoop are definitely overkill. We're on AWS so we have access to tools like kinesis as well.
Question is, what would be an appropriate set of software / architecture for the job?
Have you talked to your AWS Solutions Architect about the use case? They love this kind of thing, they'll be happy to help you figure out the right architecture. It may be a good fit for the AWS IoT services?
If you don't go with the managed IoT services, you'll want to push the messages to a scalable queue like Kafka or Kinesis (IMO, if you are processing 18M * 5 sensors = 90M events per day, that's >1000 events per second. Kafka is not overkill here; a lot of other stacks would be under-kill).
From Kinesis you then flow the data into a faster stack for analytics / querying, such as HBase, Cassandra, Druid or ElasticSearch, depending on your team's preferences. Some would say that this is time series data so you should use a time series database such as InfluxDB; but again, it's up to you. Just make sure it's a database that performs well (and behaves itself!) when subjected to a steady load of 1000 writes per second. I would not recommend using a RDBMS for that, not even Postgres. The ones mentioned above should all handle it.
Also, don't forget to flow your messages from Kinesis to S3 for safe keeping, even if you don't intend to keep the messages forever (just set a lifecycle rule to delete old data from the bucket if that's the case). After all, this is big data and the rule is "everything breaks, all the time". If your analytical stack crashes you probably don't want to lose the data entirely.
As for alerting, it depends 1) what stack you choose for the analytical part, and 2) what kinds of triggers you want to use. From your description I'm guessing you'll soon end up wanting to build more advanced triggers, such as machine learning models for anomaly detection, and for that you may want something that doesn't poll the analytical stack but rather consumes events straight out of Kinesis.

Database and large Timeseries - Downsampling - OpenTSDB InfluxDB Google DataFlow

I have a project where we sample "large" amount of data on per-second basis. Some operation are performed as filtering and so on and it needs then to be accessed as second, minute, hour or day interval.
We currently do this process with an SQL based system and a software that update different tables (daily average, hourly averages, etc...).
We are currently looking if other solution could fit our needs and I went across several solutions, as open tsdb, google cloud dataflow and influxdb.
All seem to address timeseries needs, but it gets difficult to get information about the internals. opentsdb do offer downsampling but it is not clearly specified how.
The need is since we can query vast amount of data, for instance a year, if the DB downsample at the query and is not pre-computed, it may take a very long time.
As well, downsampling needs to be "updated" when ever "delayed" datapoint are added.
On top of that, upon data arrival we perform some processing (outliner filter, calibration) and those operation should not be written on the disk, several solution can be used like a Ram based DB but perhaps some more elegant solution that would work together with the previous specification exists.
I believe this application is not something "extravagant" and that it must exist some tools to perform this, I'm thinking of stock tickers, monitoring and so forth.
Perhaps you may have some good suggestions into which technologies / DB I should look on.
Thanks.
You can accomplish such use cases pretty easily with Google Cloud Dataflow. Data preprocessing and optimizing queries is one of major scenarios for Cloud Dataflow.
We don't provide a "downsample" primitive built-in, but you can write such data transformation easily. If you are simply looking at dropping unnecessary data, you can just use a ParDo. For really simple cases, Filter.byPredicate primitive can be even simpler.
Alternatively, if you are looking at merging many data points into one, a common pattern is to window your PCollection to subdivide it according to the timestamps. Then, you can use a Combine to merge elements per window.
Additional processing that you mention can easily be tacked along to the same data processing pipeline.
In terms of comparison, Cloud Dataflow is not really comparable to databases. Databases are primarily storage solutions with processing capabilities. Cloud Dataflow is primarily a data processing solution, which connects to other products for its storage needs. You should expect your Cloud Dataflow-based solution to be much more scalable and flexible, but that also comes with higher overall cost.
Dataflow is for inline processing as the data comes in. If you are only interested in summary and calculations, dataflow is your best bet.
If you want to later take that data and access it via time (time-series) for things such as graphs, then InfluxDB is a good solution though it has a limitation on how much data it can contain.
If you're ok with 2-25 second delay on large data sets, then you can just use BigQuery along with Dataflow. Dataflow will receive, summarize, and process your numbers. Then you submit the result into BigQuery. HINT, divide your tables by DAYS to reduce costs and make re-calculations much easier.
We process 187 GB of data each night. That equals 478,439,634 individual data points (each with about 15 metrics and an average of 43,000 rows per device) for about 11,512 devices.
Secrets to BigQuery:
LIMIT your column selection. Don't ever do a select * if you can help it.
;)

Is GAE optimized for database-heavy applications?

I'm writing a very limited-purpose web application that stores about 10-20k user-submitted articles (typically 500-700 words). At any time, any user should be able to perform searches on tags and keywords, edit any part of any article (metadata, text, or tags), or download a copy of the entire database that is recent up-to-the-hour. (It can be from a cache as long as it is updated hourly.) Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously requiring 100% availability and fast downloads) and itermittent weeks of low activity. This usage pattern is set in stone.
Is GAE a wise choice for this application? It appeals to me for its low cost (hopefully free), elasticity of scale, and professional management of most of the stack. I like the idea of an app engine as an alternative to a host. However, the excessive limitations and quotas on all manner of datastore usage concern me, as does the trade-off between strong and eventual consistency imposed by the datastore's distributed architecture.
Is there a way to fit this application into GAE? Should I use the ndb API instead of the plain datastore API? Or are the requirements so data-intensive that GAE is more expensive than hosts like Webfaction?
As long as you don't require full text search on the articles (which is currently still marked as experimental and limited to ~1000 queries per day), your usage scenario sounds like it would fit just fine in App Engine.
stores about 10-20k user-submitted articles (typically 500-700 words)
Maximum entity size in App Engine is 1 MB, so as long as the total size of the article is lower than that, it should not be a problem. Also, the cost for reading data in is not tied to the size of the entity but to the number of entities being read.
At any time, any user should be able to perform searches on tags and keywords.
Again, as long as the search on the tags and keywords are not full text searches, App Engine's datastore queries could handle these kind of searches efficiently. If you want to search on both tags and keywords at the same time, you would need to build a composite index for both fields. This could increase your write cost.
download a copy of the entire database that is recent up-to-the-hour.
You could use cron/scheduled task to schedule a hourly dump to the blobstore. The cron could be targeted to a backend instance if your dump takes more than 60 seconds to be finished. Do remember that with each dump, you would need to read all entities in the database, and this means 10-20k read ops per hour. You could add a timestamp field to your entity, and have your dump servlet query for anything newer than the last dump instead to save up read ops.
Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously requiring 100% availability and fast downloads) and itermittent weeks of low activity.
This is where GAE shines, you could have very efficient instance usages with GAE in this case.
I don't think your application is particularly "database-heavy".
500-700 words is only a few KB of data.
I think GAE is a good fit.
You could store each article as a textproperty on an entity, with tags in a listproperty. For searching text you could use the search service https://developers.google.com/appengine/docs/python/search/ (which currently has quota limits).
Not 100% sure about downloading all the data, but I think you could store all the data in the blobstore (possibly as pdf?) and then allow users to download that blob.
I would choose NDB over regular datastore, mostly for the built-in async functionality and caching.
Regarding staying below quota, it depends on how many people are accessing the site and how much data they download/upload.

Trading applications trade-offs : Databases and low-latency

Databases are required for almost every business application to store data and the transactions done on that data. The transactions typically take a time of the order of milliseconds. At the same time, in a trading application the one thing which is not at all acceptable is "latency". So, what are the trade-offs made in such applications which require an upper limit on latency?
For example, a trade has been placed by the customer, it must pass a few checks, which are stored in the database, requiring a DB fetch. Then, the trade should be passed on to an OMS/ORS or the exchange. And, at each layer, it would be required to store some sort of transaction data in the database. How should one maintain a balance between transaction persistence and low-latency?
Two things:
Caching: Cache the rules in your application, so you don't need to hit the database for every trade
Threading: Put the code that stores the transaction data in the database into another thread. Like this, you can route the trade to the exchange immediately and safe the data in the database parallel to that.
Now a days people use in-memory transaction system, so you cut down network latency that you get by database.
Here are some things that you can consider for low latency
To achieve low latency in java you have to take control of GC in java, there are many ways to do that for eg pre-allocate objects(i.e use flyweight design pattern), use primitive objects - trove is very good for that, all data structure are based on primitive, Reuse object instance for eg create system wide dictionary to reduce creating new objects, very good option when reading data from stream/socket/db
-Try to reduce contention use wait-free algo( which is bit difficult), lock free algo. You can find tons of example for that
-Use in-memory computing. Memory is cheap, you can have tera byte of data in memory.
-Use mechnical sympathy - Refer lmax disruptor, excellent framework

Resources