Flink or Kafka Streams for a case where any change in the stream results in reprocessing all data - apache-flink

I have a use case where, let's say, I get balances by date and I want to show the correct balance for each day. If I get an update for an older date, all balances of that account from that date onwards change.
For example:
Account  Date    Balance  Total balance
IBM      1 Jun   100      100
IBM      2 Jun   50       150
IBM      10 Jun  200      350
IBM      12 Jun  200      550
Now I get a message for 4 Jun (this is the scenario where some transaction is back-dated, or some correction is made; it's a frequent scenario):
Account  Date    Balance  Total balance
IBM      1 Jun   100      100
IBM      2 Jun   50       150
IBM      4 Jun   300      450   ----------- all data from this point changes
IBM      10 Jun  200      650
IBM      12 Jun  200      850
It's streaming data, and at any point I want the correct balance to be shown for each account.
I know Flink and Kafka are good for streaming use cases where an update for a particular date doesn't trigger updates on all data from that point onwards. But can this scenario also be handled efficiently, or is it NOT a use case for these streaming technologies at all?
Please help

You can't modify a past message in the queue, so you should introduce a new message that invalidates the previous one. For instance, you can use an ID for each transaction (and reuse it if you need to modify that transaction). If you have two or more messages with the same ID, you keep only the last one.
Take a look at KTable from Kafka Streams. It can help you aggregate data using that ID (or any other aggregation key) and produce a table with the valid summary up to now. When a new message arrives, table updates will be emitted.
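A minimal Kafka Streams sketch of that idea is below. The topic names, the "account|txId" key layout, and the Double serde are assumptions for illustration; the sketch only covers keeping the latest version per transaction ID and the running total per account, not the per-day running balances from the question, which would still need extra logic downstream.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class AccountTotals {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Each record is keyed "account|txId"; re-publishing the same key with a
        // corrected amount replaces the old version in the table.
        KTable<String, Double> latestByTx = builder.table(
                "transactions", Consumed.with(Serdes.String(), Serdes.Double()));

        // Re-key by account and aggregate. The subtractor removes the old amount
        // when a transaction is corrected, so the total stays consistent.
        KTable<String, Double> totalByAccount = latestByTx
                .groupBy((key, amount) -> KeyValue.pair(key.split("\\|")[0], amount),
                         Grouped.with(Serdes.String(), Serdes.Double()))
                .aggregate(() -> 0.0,
                           (account, amount, total) -> total + amount,   // adder
                           (account, amount, total) -> total - amount,   // subtractor
                           Materialized.with(Serdes.String(), Serdes.Double()));

        // Every correction re-emits the updated total for that account.
        totalByAccount.toStream()
                .to("account-totals", Produced.with(Serdes.String(), Serdes.Double()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "account-totals-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}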

Related

Two questions about Time Travel storage costs in Snowflake

I have read the Snowflake documentation a lot. Snowflake incurs storage costs when data is updated.
"tables-storage-considerations.html" mentioned that:
As an extreme example, consider a table with rows associated with every micro-partition within the table (consisting of 200 GB of physical storage). If every row is updated 20 times a day, the table would consume the following storage:
Active 200 GB | Time Travel 4 TB | Fail-safe 28 TB | Total Storage 32.2 TB
The first question is: if a periodic task runs 20 times a day, and each run updates exactly one row in each micro-partition, does the table still consume 32.2 TB of total storage?
"data-time-travel.html" mentioned that:
Once the defined period of time has elapsed, the data is moved into Snowflake Fail-safe and these actions can no longer be performed.
So my second question is: why does Fail-safe cost 28 TB and not 24 TB (i.e. with the 4 TB already counted in Time Travel deducted)?
https://docs.snowflake.com/en/user-guide/data-cdp-storage-costs.html
https://docs.snowflake.com/en/user-guide/tables-storage-considerations.html
https://docs.snowflake.com/en/user-guide/data-time-travel.html
First question: yes. It's the fact that the micro-partition is changing that matters, not how many rows within it change.
Second question: Fail-safe holds 7 days of data, and the 4 TB is one day of churned micro-partitions (200 GB x 20 rewrites per day), so 4 TB x 7 = 28 TB.

Merging time series of different granularity in InfluxDB

I want to store trades as well as best ask/bid data, where the latter updates much more rapidly than the former, in InfluxDB.
I want to, if possible, use a schema that allows me to query: "for each trade on market X, find the best ask/bid on market Y whose timestamp is <= the timestamp of the trade".
(I'll use any version of Influx.)
For example, trades might look like this:
Time       Price  Volume  Direction  Market
00:01.000  100    5       1          foo-bar
00:03.000  99     50      0          bar-baz
00:03.050  99     25      0          foo-bar
00:04.000  101    15      1          bar-baz
And tick data might look more like this:
Time       Ask  Bid  Market
00:00.763  100  99   bar-baz
00:01.010  101  99   foo-bar
00:01.012  101  98   bar-baz
00:01.012  101  99   foo-bar
00:01.238  100  99   bar-baz
...
00:03.021  101  98   bar-baz
I would want to be able to somehow join each trade for some market, e.g. foo-bar, with only the most recent ask/bid data point on some other market, e.g. bar-baz, and get a result like:
Time       Trade price  Ask  Bid
00:01.000  100          100  99
00:03.050  99           101  98
Such that I could compute the difference between the trade price on market foo-bar and the most recently quoted ask or bid on market bar-baz.
Right now, I store trades in one time series and ask/bid data points in another and merge them on the client side, with logic along the lines of:
function merge(trades, quotes, data_points)
    next_trade, more_trades = first(trades), rest(trades)
    quotes = drop-while (quote.timestamp < next_trade.timestamp) quotes
    data_point = join(next_trade, first(quotes))
    if more_trades
        return merge(more_trades, quotes, data_points + data_point)
    return data_points + data_point
The problem is that the client has to discard tons of ask/bid data points because they update so frequently, and only the most recent update before the trade is relevant.
There are tens of markets whose most recent ask/bid I might want to compare a trade with, otherwise I'd simply store the most recent ask/bid in the same series as the trades.
Is it possible to do what I want with Influx, or with another time series database? An alternative solution that produces lower-quality results is to group the ask/bid data by some time interval, say 250 ms, and take the last point from each interval, to at least impose an upper bound on the number of quotes the client has to drop before finding the one closest to the next trade.
NB. Just a clarification on InfluxDB terminology: you're probably storing trade and tick data in different measurements (analogous to tables). A series is a subdivision within a measurement based on tag values, e.g.
Time Ask Bid Market
00:00.763 100 99 bar-baz
is one series
Time Ask Bid Market
00:01.010 101 99 foo-bar
is another series (assuming you are storing the market name/id as a tag and not a field).
Answer
InfluxQL https://docs.influxdata.com/influxdb/v1.7/query_language/spec/ - I can't think of a way to achieve what you need with InfluxQL (Influx Query Language) as it does not support joins.
Perhaps what you could do on the client side, instead of requesting all tick data for a period and discarding most of it, is make a request per trade and market to get exactly the ask/bid data point that you need (the most recent with respect to the trade). Something like:
function merge(trades, market)
    points = <empty list>
    for next_trade in trades
        quote = db.query("select last(ask), last(bid) from tick_data
                          where time <= next_trade.timestamp
                            and Market = market
                            and time > next_trade.timestamp - 1m")
        // or, to get a list per market with one query:
        // quote_per_market = db.query("select last(ask), last(bid) from tick_data
        //                              where time <= next_trade.timestamp group by Market")
        points = points + join(next_trade, quote)
    return points
Of course you'd have the overhead of querying the database more frequently, but depending on the number of trades and your resource constraints it may be more efficient. NB: a potential pitfall here is that the ask and bid retrieved this way are not fetched as a pair but independently, so even though they are returned together they could come from different timestamps. If for some reason you only have an ask or only a bid price at some timestamp, you might run into this problem. However, as long as you write them in pairs and have no missing data, it should be OK.
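For reference, the same per-trade lookup written against the influxdb-java client might look roughly like the sketch below. The connection URL, the database name "marketdata", and the epoch-nanosecond timestamp are assumptions; the InfluxQL itself just mirrors the pseudocode above.

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Query;
import org.influxdb.dto.QueryResult;

public class LastQuoteBeforeTrade {
    public static void main(String[] args) {
        InfluxDB influxDB = InfluxDBFactory.connect("http://localhost:8086");

        long tradeTimeNs = 1_546_300_800_000_000_000L; // trade timestamp in epoch ns (illustrative)
        long oneMinuteNs = 60_000_000_000L;

        // last() picks the most recent point in the window, i.e. the quote
        // at or immediately before the trade time.
        String influxQl = String.format(
                "SELECT last(ask), last(bid) FROM tick_data "
              + "WHERE \"Market\" = 'bar-baz' AND time <= %d AND time > %d",
                tradeTimeNs, tradeTimeNs - oneMinuteNs);

        QueryResult result = influxDB.query(new Query(influxQl, "marketdata"));
        System.out.println(result);
        influxDB.close();
    }
}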
Flux https://www.influxdata.com/products/flux/ - Flux is a more sophisticated query language, part of InfluxDB 1.7 and 2.0, that allows you to do joins and operations across different measurements. I can't give you any examples yet, but it's worth having a look at.
Other (relational) time series DBs that you could have a look at, and that would also allow you to do joins, are CrateDB https://crate.io/ or Postgres + TimescaleDB https://www.timescale.com/products.
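With Postgres/TimescaleDB, the "most recent quote at or before each trade" join can be expressed with a lateral subquery. Below is a hedged JDBC sketch; the table names (trades, quotes), column names, and connection settings are assumptions mirroring the example data in the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AsOfJoin {
    public static void main(String[] args) throws Exception {
        // For each foo-bar trade, pick the latest bar-baz quote at or before the trade time.
        String sql =
            "SELECT t.time, t.price, lq.ask, lq.bid "
          + "FROM trades t "
          + "JOIN LATERAL ("
          + "  SELECT q.ask, q.bid FROM quotes q"
          + "  WHERE q.market = 'bar-baz' AND q.time <= t.time"
          + "  ORDER BY q.time DESC LIMIT 1"
          + ") lq ON true "
          + "WHERE t.market = 'foo-bar' ORDER BY t.time";

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/marketdata", "user", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%s trade=%s ask=%s bid=%s%n",
                        rs.getTimestamp("time"), rs.getBigDecimal("price"),
                        rs.getBigDecimal("ask"), rs.getBigDecimal("bid"));
            }
        }
    }
}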

Redis storing historical data

I have data generated on a daily basis.
Let me explain through an example:
On the world market, the price of gold changes at an interval of seconds, and I want to store that price in Redis.
Gold    22 JAN 11.02 PM   X-price
        22 JAN 11.03 PM   Y-price
        ...
        24 DEC 11.04 PM   X1-price
Silver  22 JAN 11.02 PM   M-price
        22 JAN 11.03 PM   N-price
I want to store this data on a daily basis and apply ML (machine learning) to the last 52 weeks of data. Is this possible?
Because, as far as my knowledge goes, Redis works on key-value pairs.
If this is possible, can I get data for a specific date (04 July) and a date range (01 Feb to 31 Mar)?
In Redis, a sorted set is appropriate for time series data. If you score each entry with the timestamp of the price quote, you can quickly access a whole day or a group of days using the ZRANGEBYSCORE command (or ZSCAN if you need paging).
The quote data can be stored right in the sorted set. If you do this, make sure each entry is unique. Adding a record to a sorted set that is identical to an existing one just updates the existing record's score (timestamp). This moves the old record to the present and erases it from the past, which is not what you want.
I would recommend only storing a unique key/ID for each quote in the sorted set, and storing the data in its own key or hash field. This will allow you to create additional indexes for the data as needed and to access specific records more easily if necessary.
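A minimal sketch of that layout with the Jedis client is below. The key names (prices:gold, quote:gold:<timestamp>), the hash fields, and the epoch-millisecond values are assumptions for illustration.

import java.util.Map;
import redis.clients.jedis.Jedis;

public class GoldPriceStore {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            long ts = 1642892520000L;                       // quote time as epoch millis (illustrative)
            String quoteId = "quote:gold:" + ts;            // must be unique per quote

            // Store the quote itself in its own hash...
            jedis.hset(quoteId, Map.of("price", "1839.50", "ts", String.valueOf(ts)));

            // ...and index its ID in a sorted set scored by the timestamp.
            jedis.zadd("prices:gold", ts, quoteId);

            // Fetch a specific day or date range by score (epoch-millis bounds).
            long from = 1642809600000L;                     // illustrative range start
            long to = 1642895999999L;                       // illustrative range end
            for (String id : jedis.zrangeByScore("prices:gold", from, to)) {
                System.out.println(jedis.hgetAll(id));      // the full quote record
            }
        }
    }
}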

Datastore Read Operations Calculation

So I am currently performing a test to estimate how much work my Google App Engine app can do without going over quotas.
This is my test:
I have an entity in the Datastore that, according to my local dashboard, needs 18 write operations. I have 5 entries of this type in a table.
Every 30 seconds, I fetch those 5 entities mentioned above. I DO NOT USE MEMCACHE FOR THESE!!!
That means 5 * 18 = 90 read operations per fetch, right?
In 1 minute that means 180, and in 1 hour 10,800 read operations, which is ~20% of the daily limit quota...
However, after my test had been running for 1 hour, I noticed on my online dashboard that only 2% of the read operations had been used. My question is: why is that? Where is the flaw in my calculations?
Also, where can I see in the online dashboard how many read/write operations an entity needs?
Thanks
A write on your entity may need 18 writes, but a get on your entity will cost you only 1 read.
So if you get 5 entries every 30 seconds for one hour, you'll have about 5 reads * 120 = 600 reads.
This is the case if you do a get on your 5 entries (fetching each entity by its id).
If you make a query to fetch them, the cost is "1 read + 1 read per entity retrieved", which means 2 reads per entity, so around 1200 reads in one hour.
For more detailed information, here is the documentation for estimating costs.
You can't see on the dashboard how many read/write operations an entity needs, but I invite you to check Appstats for that.
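To make the get-vs-query distinction concrete, here is a hedged sketch using the classic App Engine Datastore Java API. The kind name "Record" and the numeric IDs are made up for illustration; the per-operation costs in the comments follow the pricing rule quoted above.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;

public class ReadCostExample {
    public static void demo() throws EntityNotFoundException {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        // Get by key: billed as 1 read per entity, no matter how many
        // write operations the entity originally cost.
        for (long id = 1; id <= 5; id++) {
            Key key = KeyFactory.createKey("Record", id);
            Entity entity = datastore.get(key);             // 1 read each -> 5 reads per fetch
            System.out.println(entity.getProperties());
        }

        // Query: billed as 1 read for the query itself plus 1 read per
        // entity retrieved, so fetching the same entities this way costs more.
        Query query = new Query("Record");
        for (Entity entity : datastore.prepare(query)
                .asIterable(FetchOptions.Builder.withLimit(5))) {
            System.out.println(entity.getProperties());
        }
    }
}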

How do I measure response time in seconds given the following benchmarking data?

We recently got some data back on a benchmarking test from a software vendor, and I think I'm missing something obvious.
If there were 17 transactions (I assume they mean successfully completed requests) per second, and 1500 of these requests could be served in 5 minutes, then how do I get the response time for a single user? Is this sort of thing even possible with benchmarking? I have a lot of other data from them, including Apache config settings, but I'm not sure how to do all the math.
Given the server setup they sent, I want to know how I can deduce the user response time. I have looked at other similar benchmarking tests, but I'm having trouble measuring requests to response time. What other data do I need to provide here to get that?
If only 1500 of these can be served in 5 minutes, then:
1500 / 5 = 300 transactions per minute can be served
300 / 60 = 5 transactions per second can be served
So how are they getting 17 completed transactions per second? Last time I checked, 5 < 17!
This doesn't seem to fit. Or am I looking at it wrongly?
I presume by user response time you mean the time it takes to serve a single transaction:
If they can serve 5 per second, then it takes 200 ms (1/5) per transaction.
If they can serve 17 per second, then it takes 59 ms (1/17) per transaction.
That is all we can tell from the given data. Perhaps clarify how many transactions are being done per second.
