Could I set DataStream time window to a large value like 24 hours? The reason for the requirement is that I want to make data statistics based on the latest 24 hours client traffic to the web site. This way, I can check if there are security violations.
For example, check if a user account used multiple source IPs to log on to the web site. Or check how many unique pages a certain IP accessed in the latest 24 hours. If security violation is detected, the configured action will be taken in real time such as blocking the source IP or locking the relevant user account.
The throughput of the web site is around 200Mb/s. I think setting the time window to a large value will cause memory issue. Should I store the statistics results of each time window like 5 minutes into database?
Then make statistics based on database query for the date generated in the latest 24 hours?
I don't have any experience with big data analysis. Any advice will be appreciated.
It depends on what type of window and aggregations we're talking about:
Window where no eviction is used: in this case Flink will only save one accumulated result per physical window. This means that for a sliding window of 10h with 1h slide that computes a sum it would have to have a number 10 times. For a tumbling window (regardless of the parameters) it only saves the result of the aggregation once. However this is not the whole story: because state is keyed you have to multiply all of this for every distinct value of the field used in the group by.
Window with eviction: saves all events that were processed but still weren't evicted.
In short, generally the memory consumption is not tied to how many events you processed or the window's durations but to:
The number of windows (considering that one sliding window actually maps to several physical windows).
The cardinality of the field you're using in the group by.
All things considered, I'd say a simple 24-hour window has an almost nonexistent memory footprint.
You can check the relevant code here.
Related
A couple of points i'll volunteer up front:
I'm new to Flink (working with it for about a month now)
I'm using Kinesis Analytics (AWS hosted Flink solution). By all accounts this doesn't really limit the versatility of Flink or the options for fault tolerance, but I'll call it out anyways.
We have a fairly straight forward sliding window application. A keyed stream organizes events by a particular key, IP address for example, and then processes them in a ProcessorFunction. We mostly use this to keep track of counts of things. For example, how many logins for a particular IP address in the last 24 hours. Every 30 seconds we count the events in the window, per key, and save that value to an external data store. State is also updated to reflect the events in that window so that old events expire and aren't taking up memory.
Interestingly enough, cardinality is not an issue. If we have 200k folks logging in, in a 24 hour period, everything is perfect. Things start to get hairy when one IP logs in 200k times in 24 hours. At this point, checkpoints start to take longer and longer. An average checkpoint takes 2-3 seconds, but with this user behaviour, the checkpoints start to take 5 minutes, then 10, then 15, then 30, then 40, etc etc.
The application can run smoothly in this condition for a while, surprisingly. Perhaps 10 or 12 hours. But, sooner or later checkpoints completely fail and then our max iterator age starts to spike, and no new events are processed etc etc.
I've tried a few of things at this point:
Throwing more metal at the problem (auto scaling turned on as well)
Fussing with CheckpointingInterval and MinimumPauseBetweenCheckpoints https://docs.aws.amazon.com/kinesisanalytics/latest/apiv2/API_CheckpointConfiguration.html
Refactoring to reduce the footprint of the state we store
(1) didn't really do much.
(2) This appeared to help but then another much larger traffic spike then what we'd seen before squashed any of the benefits
(3) It's unclear if this helped. I think our application memory footprint is fairly small compared to what you'd imagine from a Yelp or an Airbnb who both use Flink clusters for massive applications so I can't imagine that my state is really problematic.
I'll say I'm hoping we don't have to deeply change the expectations of the application output. This sliding window is a really valuable piece of data.
EDIT: Somebody asked about what my state looks like it's a ValueState[FooState]
case class FooState(
entityType: String,
entityID: String,
events: List[BarStateEvent],
tableName: String,
baseFeatureName: String,
)
case class BarStateEvent(target: Double, eventID: String, timestamp: Long)
EDIT:
I want to highlight something that user David Anderson said in the comments:
One approach sometimes used for implementing sliding windows is to use MapState, where the keys are the timestamps for the slices, and the values are lists of events.
This was essential. For anybody else trying to walk this path, I couldn't find a workable solution that didn't bucket events into some time slice. My final solution involves bucketing events into batches of 30 seconds and then writing those into map state as David suggested. This seems to do the trick. For our high periods of load, checkpoints remain at 3mb and they always finish in under a second.
If you have a sliding window that is 24 hours long, and it slides by 30 seconds, then every login is assigned to each of 2880 separate windows. That's right, Flink's sliding windows make copies. In this case 24 * 60 * 2 copies.
If you are simply counting login events, then there is no need to actually buffer the login events until the windows close. You can instead use a ReduceFunction to perform incremental aggregation.
My guess is that you aren't taking advantage of this optimization, and thus when you have a hot key (ip address), then the instance handling that hot key has a disproportionate amount of data, and takes a long time to checkpoint.
On the other hand, if you are already doing incremental aggregation, and the checkpoints are as problematic as you describe, then it's worth looking more deeply to try to understand why.
One possible remediation would be to implement your own sliding windows using a ProcessFunction. By doing so you could avoid maintaining 2880 separate windows, and use a more efficient data structure.
EDIT (based on the updated question):
I think the issue is this: When using the RocksDB state backend, state lives as serialized bytes. Every state access and update has to go through ser/de. This means that your List[BarStateEvent] is being deserialized and then re-serialized every time you modify it. For an IP address with 200k events in the list, that's going to be very expensive.
What you should do instead is to use either ListState or MapState. These state types are optimized for RocksDB. The RocksDB state backend can append to ListState without deserializing the list. And with MapState, each key/value pair in the map is a separate RocksDB object, allowing for efficient lookups and modifications.
One approach sometimes used for implementing sliding windows is to use MapState, where the keys are the timestamps for the slices, and the values are lists of events. There's an example of doing something similar (but with tumbling windows) in the Flink docs.
Or, if your state can fit into memory, you could use the FsStateBackend. Then all of your state will be objects on the JVM heap, and ser/de will only come into play during checkpointing and recovery.
I'd like to know if a time-series database will crumble with this scenario:
I have tens of thousands of IoTs sending 4 different values each 5min.
I will query those values for each IoT, for certain time spans. My question is:
Is a tsdb approach feasible and scalable up to, e.g., a million IoTs, having metrics like:
iot.key1.value1
iot.key1.value2
iot.key1.value3
iot.key1.value4
iot.key2.value1
.
.
.
iot.key1000000.value4
? Or are they way too much "amount of metrics"?
The retention policy will be 2 years, with possible roll ups maybe after (TBA) months. But I think this consideration only matters for disk size afaik.
Right now I'm using graphite
A reporting frequency of five minutes that should be fairly manageable, just be sure to set your storage schema to five-minutes being the smallest resolution data in order to save space, as you won't be needing to hold on to data at shorter periods.
With that said, scaling a graphite cluster to meet your needs isn't easy as Whisper isn't optimized for this. There are several resources/stories where others have shared their dismay trying to achieve this, for example: here and here
There are other limitations to consider too, Whisper is configured in such a way that it can record only one datapoint per timestamp, and the last datapoint received "wins". This might not be an issue to you now, but later down the road you might find that you need to increase the datapoint reporting requency to get a better insight into your data.
Therein comes the question, how can I get around that? Often, StatsD is the answer - it's an aggregator that takes your individual metrics over a defined period of time, and churns out a histogram-like set of metrics with different statistical derivatives of your data (minimum, maximum, X-percentile, and so on). Suddenly you're then faced with the prospect of managing a Graphite instance or cluster, one (or more) StatsD service, and that's before you even get to the fun part of visualising your data: Grafana is often used here and also requires you to set up and maintain.
Conversely, assuming you will maintain that reporting frequency, but increase the number of devices (as you mentioned), you might find another component of your Graphite stack - Carbon-relay - running into some bottlenecking issues (as described here).
I work at MetricFire, formerly Hosted Graphite, where we had a lot of these considerations in mind when building our product/service. Collectively we process millions of datapoints per second across hundreds of accounts. Data is rolled up and stored at four resolutions: 5-seconds, 30-seconds, 5-minute, 1-hour, where each resolution is available for 24 hours, 3 days, six months and two years, respectively.
A key component of our set-up is that our storage is not built on the typical Whisper backend - instead we use a custom-built data store using Riak allowing us to do many things: scale easily and aggregate datapoints per metric into Data Views, to name a few. That article about Data Views was written by one of our engineers and goes into some detail about the decisions we made when building our storage layer.
When my cluster runs for a certain time(maybe a day, maybe two days), some queries may become very slow, about 2~10min to finish, when this happens, I need to restart the whole cluster, and the query become normal, but after some time, very slow queries happen again
The query response time depends on multiple factor including
1. Table size, if table size grows with time then response time will also increase
2. If it is open source version then time spent in GC pauses, which in turn will depend on number of objects/grabage present in the JVM Heap
3. Number of Concurrent queries being run
4. Amount of data overflown to disk
You will need to describe in detail your usage pattern of snappydata. Only then It would be possible to characterise the issue.
Some of the questions that should be answered are like
1. What is cluster size?
2. What are the table sizes?
3. Are writes happening continuously on the tables or only queries are getting executed?
You can engage us at slack channel to provide us informations related to your clusters.
In Flink, I am reading a file using readTextFile and applying SlidingProcessingTimeWindows.of(Time.milliseconds(60), Time.milliseconds(60)) of 60 msec with slide of 60 msec on it. On windowed stream I am calculating the mean of the second filed of the tuple. My text file contains 1100 lines and each line is tuple (String, Integer). I have set the parallelism to 1 and keyed the messages on first field of the tuple.
When I run the code, each time I get different answers. I mean that it seems like, sometime it reads entire file and sometime it reads one first some lines of the file. Does it have some relation with window size of sliding amount? How this relation can be found out so that I can decide the size and sliding amount of window?
The answer in the comment of AlpineGizmo is correct. I'll add a few more details here.
Flink aligns time windows to the begin of epoch (1970-01-01-00:00:00). This means that a window operator with a 1 hour window starts a new window with every new hour (i.e., at 00:00, 01:00, 02:00, ...) and not with the first arriving record.
Processing time windows are evaluated based on the current time of the system.
As said in the comment above, this means that the amount of data which can be processed depends on the processing resources (hardware, CPU/IO load, ...) of the machine that an operator runs on. Therefore, processing time window cannot produce reliable and consistent results.
I your case, both described effects might cause results which are inconsistent across jobs. Depending on when you start the job, the data will be assigned to different windows (if the first record arrives just before the first 60 msecs window is closed, only this element will be in the window). Depending on the IO load of the machine it might take more or less time to access and read the file.
If you want to have consistent results, you need to use event-time. In this case, the records are processed based on the time which is encoded in the data, i.e., the results depend on the data only and not on external effects such as the starting time of the job or the load of the processing machine.
I need a sequencer for the entire application's data.
Using a counter entity is a bad idea (5 writes per second limit), and Sharding counters are not an option.
GMT time stamp seems unsafe due to clock variances with servers, plus a possible server time being set/reset.
Any idea?
How do I get a entity property which I can query for all entities changed since a given value?
TIA
Distributed datastores such as the app engine datastore don't have a global sequence - there's literally no way to determine if entity A was written to server A' before entity B was written to server B' if those events occur sufficiently close together, unless you have a single machine mediating all transactions and serializing them, which places a hard upper bound on how scalable your system can be.
For your actual practical problem, the easiest solution would be to assign a modification timestamp to each record, and each time you need to sync, look for records newer than (that timestamp) - (epsilon), where epsilon is a short time interval that is longer than the expected difference in time synchronization between servers (something like 10 seconds should be ample). Your client can then discard any duplicate records it receives.