Apache Flink - Filter Performance Tips

Let's say you are working on a big Flink project in which you keyBy the client IP addresses of your customers, and you realize that you are applying the same filter in different places in the code, like this:
public void calculationOne() {
    kafkaSource.filter(isContainsSmthA).keyBy(clientip).process(processA).sink(...);
}

public void calculationTwo() {
    kafkaSource.filter(isContainsSmthA).keyBy(clientip).process(processB).sink(...);
}
Assume there are many occurrences of kafkaSource.filter(isContainsSmthA).
Does this structure lead to a performance issue in Flink? Would it be much better if I did something like the following?
public Stream filteredA() {
    return kafkaSource.filter(isContainsSmthA);
}

public void calculationOne() {
    filteredA().keyBy(clientip).process(processA).sink(...);
}

public void calculationTwo() {
    filteredA().keyBy(clientip).process(processB).sink(...);
}

It depends a bit on how it should behave operationally.
The first way is more friendly to the Kafka cluster: all records are read once. The filter itself is a very cheap operation, so you don't need to worry too much about it. However, the big downside of this approach is that if one calculation is much slower than the others, it will slow them all down. If you do not process historic events, it shouldn't matter, as you'd size your application cluster to keep up with all events anyway. Another current downside is that if you have a failure in calculationTwo, tasks in calculationOne are also restarted. The community is actively working to mitigate that, though.
The second way would allow only the affected source -> ... -> sink subtopology to be restarted. So if you expect frequent restarts or need to guarantee certain SLAs, this approach is better. An extension is to actually have separate Flink applications for each of these pipelines. You can share the same jar, but use different arguments to select the correct pipeline on submission (see the sketch below). This approach also makes updating applications much easier, as you would only experience downtime for the pipeline that you actually modify.
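As a rough sketch of that last variant, under the assumption that the placeholders from the question (the Kafka source, isContainsSmthA, clientip, processA/processB and the sinks) are replaced by trivial stand-ins, a single main method could select the pipeline from a submission argument like this:

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FilterJob {

    public static void main(String[] args) throws Exception {
        // e.g. flink run filter-job.jar --pipeline calculationOne
        ParameterTool params = ParameterTool.fromArgs(args);
        String pipeline = params.getRequired("pipeline");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for kafkaSource.filter(isContainsSmthA).keyBy(clientip):
        // the "events" here are just strings of the form "<clientip> <payload>".
        KeyedStream<String, String> filtered = env
                .fromElements("10.0.0.1 A foo", "10.0.0.2 B bar", "10.0.0.1 A baz")
                .filter(line -> line.contains(" A "))          // isContainsSmthA
                .keyBy(new KeySelector<String, String>() {     // clientip
                    @Override
                    public String getKey(String line) {
                        return line.split(" ")[0];
                    }
                });

        if ("calculationOne".equals(pipeline)) {
            filtered.map(line -> "processA: " + line).print(); // stand-in for processA + sink
        } else if ("calculationTwo".equals(pipeline)) {
            filtered.map(line -> "processB: " + line).print(); // stand-in for processB + sink
        } else {
            throw new IllegalArgumentException("Unknown pipeline: " + pipeline);
        }

        env.execute("filter-job-" + pipeline);
    }
}

Each pipeline then fails, restarts and gets redeployed independently, while all of them still ship in one jar.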

I might do something like below, where a simple wrapper operator can run data through two different functions, and generate two side outputs.
SingleOutputStreamOperator comboResults = kafkaSource
    .filter(isContainsSmthA)
    .keyBy(clientip)
    .process(new MyWrapperFunction(processA, processB));

comboResults
    .getSideOutput(processATag)
    .sink(...);

comboResults
    .getSideOutput(processBTag)
    .sink(...);
Though I don't know how that compares with what Arvid suggested.
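For illustration, a rough sketch of what such a wrapper might look like, under the assumption that processA and processB can be reduced to simple per-element functions rather than full ProcessFunctions (wrapping two stateful ProcessFunctions would need considerably more plumbing); the tag names match the getSideOutput calls above:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Runs every element through two stand-in functions and emits each result
// to its own side output; the main output is unused.
public class MyWrapperFunction extends KeyedProcessFunction<String, String, Void> {

    // Anonymous subclasses so the tags keep their generic type at runtime.
    public static final OutputTag<String> processATag = new OutputTag<String>("processA") {};
    public static final OutputTag<String> processBTag = new OutputTag<String>("processB") {};

    private final MapFunction<String, String> processA;
    private final MapFunction<String, String> processB;

    public MyWrapperFunction(MapFunction<String, String> processA,
                             MapFunction<String, String> processB) {
        this.processA = processA;
        this.processB = processB;
    }

    @Override
    public void processElement(String value, Context ctx, Collector<Void> out) throws Exception {
        ctx.output(processATag, processA.map(value));
        ctx.output(processBTag, processB.map(value));
    }
}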

Related

Any examples of using a WandSearcher in Vespa? (after a weighted set query)

Currently I am using the REST interface to query Vespa, which seems to work great, but something tells me that I should be using searchers in the application to make the client (server-side code) a bit lighter (bundle the jar file in the application package) and make things a bit smoother. I have managed to do some simple searcher/processor applications, but this is a bit overwhelming.
So are there any readily available examples?
Basically I want to:
Send to /search?query=someId
Do an ordinary search for the weighted set on this document ID (I guess this one can be handy: https://docs.vespa.ai/documentation/reference/inspecting-structured-data.html)
Take the items in the response, add them to a WandItem, and query with the WandSearcher on a given field, similar to the YQL:
"select * from sources * where wand(interest, some weightedset);" with "ranking": "combined_score", and return the matches.
Also, just curious: apart from the trouble of building query strings for the HTTP requests I am doing at the moment, are there any performance gains from using a searcher (going the Java route) vs REST?
Thanks for any insight or example code I can start with.
There is an example of using the WandItem (YQL wand()) here: https://docs.vespa.ai/documentation/advanced-ranking.html, and see also https://docs.vespa.ai/documentation/using-wand-with-vespa.html, as there are two wand implementations available in Vespa; from your description, wand() sounds like what you want for this use case. For the first call you probably want a dedicated document summary to reduce the amount of data fetched by that query, with the option of serving it out of memory only (see https://docs.vespa.ai/documentation/document-summaries.html).
Also see https://docs.vespa.ai/documentation/searcher-development.html as a general resource on writing searchers.
For your use case it makes a lot of sense to write a searcher that performs these two queries, since your second query depends on the first, and you avoid the cost of rendering/HTTP/YQL parsing, which might matter if your client is remote with high network latency.
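As a starting point, a rough sketch of such a searcher, assuming the standard Searcher/Execution API; the documentId request parameter, the id index, the interest field, the target-hits value and the extractWeights helper are all placeholders to adapt to your schema and document summaries:

import java.util.Collections;
import java.util.Map;

import com.yahoo.prelude.query.WandItem;
import com.yahoo.prelude.query.WordItem;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.result.Hit;
import com.yahoo.search.searchchain.Execution;

public class InterestWandSearcher extends Searcher {

    @Override
    public Result search(Query query, Execution execution) {
        // Hypothetical request parameter carrying the document id.
        String documentId = query.properties().getString("documentId");

        // Query 1: fetch the document that holds the weighted set.
        Query lookup = query.clone();
        lookup.getModel().getQueryTree().setRoot(new WordItem(documentId, "id"));
        lookup.setHits(1);
        Result lookupResult = execution.search(lookup);
        execution.fill(lookupResult); // make sure summary fields are filled
        if (lookupResult.hits().size() == 0) {
            return lookupResult; // nothing to expand from
        }
        Hit source = lookupResult.hits().get(0);

        // Query 2: build a wand() over the weighted set and run it.
        WandItem wand = new WandItem("interest", 100); // 100 = target number of hits
        for (Map.Entry<String, Integer> token : extractWeights(source).entrySet()) {
            wand.addToken(token.getKey(), token.getValue());
        }
        Query wandQuery = query.clone();
        wandQuery.getModel().getQueryTree().setRoot(wand);
        wandQuery.getRanking().setProfile("combined_score");
        return execution.search(wandQuery);
    }

    // Placeholder: how the weighted set is read out of the hit depends on the
    // field type and document summary; see the structured-data docs linked above.
    private Map<String, Integer> extractWeights(Hit source) {
        return Collections.emptyMap();
    }
}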

GAE MapReduce Huge Query

Abstract: Is MapReduce a good idea for processing a collection of data from the database, as opposed to finding the answer to some complex (or just big) question?
I would like to sync a set of syndication sources (e.g. URLs like http://xkcd.com/rss.xml), which are stored in GAE's datastore as a collection/table. I see two options; one is straightforward: make simple tasks which you put in a queue, where each task handles 100 or 1000 or whatever natural number of sources seems to fit in each task. The other option is MapReduce.
In the latter case, the Map does everything, and the Reduce does nothing. Moreover, the map has no result; it just alters the 'state' (of the datastore).
@Override
public void map(Entity entity) {
    // One feed entity in, many posts out: fetch the feed and persist each post.
    String url = (String) entity.getProperty("url");
    for (Post p : www.fetchPostsFromFeed(url)) {
        p.save();
    }
}
As you can see, one source can map to many posts, so my map might as well be called "Explode".
So there are no emits and nothing for reduce to do. The reason I like this map approach is that I tell Google: here, take my collection/table, split it however you see fit across different mappers, and then store the posts wherever you like. The datastore uses 'high replication', so availability of the data is high, and choosing which 'computational unit' handles which entity doesn't really reduce network communication. The same goes for saving the posts, as they need to go to all datastore units. What I do like is that MapReduce has some way of fault recovery for map computations that get stuck, and that it knows how many tasks to send to which node, instead of my queueing some number of entities somewhere and hoping it makes sense.
Maybe my way of thinking here is wrong, in which case, please correct me. Anyhow, is this approach 'wrong' for the lack of a reduce and the map being an 'explode'?
Nope, Map does pretty much the same as your manual enqueuing of tasks.
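For comparison, the "plain task queue" option from the question might look roughly like the sketch below, assuming the standard App Engine task queue API; the /tasks/syncFeeds handler, the parameter name and the batching scheme are hypothetical:

import java.util.List;

import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class FeedSyncScheduler {

    // Enqueue one task per batch of feed keys. /tasks/syncFeeds is a
    // hypothetical handler that loads each feed, fetches its posts and saves them.
    public void scheduleSync(List<Key> feedKeys, int batchSize) {
        Queue queue = QueueFactory.getDefaultQueue();
        for (int start = 0; start < feedKeys.size(); start += batchSize) {
            List<Key> batch = feedKeys.subList(start,
                    Math.min(start + batchSize, feedKeys.size()));
            TaskOptions task = TaskOptions.Builder.withUrl("/tasks/syncFeeds");
            for (Key key : batch) {
                task = task.param("feedKey", KeyFactory.keyToString(key));
            }
            queue.add(task);
        }
    }
}

The Map variant saves you exactly this scheduling code, plus the retry bookkeeping for tasks that fail partway through.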

Synchronizing Asynchronous request handlers in Silverlight environment

For our senior design project my group is making a Silverlight application that utilizes graph theory concepts and stores the data in a database on the back end. We have a situation where we add a link between two nodes in the graph and upon doing so we run analysis to re-categorize our clusters of nodes. The problem is that this re-categorization is quite complex and involves multiple queries and updates to the database so if multiple instances of it run at once it quickly garbles data and breaks (by trying to re-insert already used primary keys). Essentially it's not thread safe, and we're trying to make it safe, and that's where we're failing and need help :).
The create link function looks like this:
private Semaphore dblock = new Semaphore(1, 1);

// This function is on our service reference and gets called
// by the client code.
public int addNeed(int nodeOne, int nodeTwo)
{
    dblock.WaitOne();
    submitNewNeed(createNewNeed(nodeOne, nodeTwo));
    verifyClusters(nodeOne, nodeTwo);
    dblock.Release();
    return 0;
}

private void verifyClusters(int nodeOne, int nodeTwo)
{
    // Run analysis of nodeOne and nodeTwo in graph
}
All copies of addNeed should wait for the first one that comes in to finish before another can execute, but instead they all seem to be running and conflicting with each other in the verifyClusters method. One solution would be to force our front-end calls to be made synchronously, and in fact, when we do that, everything works fine, so the code logic isn't broken. But when it's launched, our application will be deployed within a business setting and used by internal IT staff (or at least that's the plan), so we'll have the same problem. We can't force all clients to submit data at different times, so we really need to get it synchronized on the back end. Thanks for any help you can give; I'd be glad to supply any additional information you might need!
I wrote a series to specifically address this situation (sequential asynchronous workflows) - let me know if this works for you.
Part 2 (it has a link back to part 1):
http://csharperimage.jeremylikness.com/2010/03/sequential-asynchronous-workflows-part.html
Jeremy
Wrap your database updates in a transaction. Escalate to a table lock if necessary.
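The stack in the question is .NET, but as a language-neutral sketch of that idea (written in JDBC terms, with hypothetical table and method names, and SERIALIZABLE isolation chosen as an assumption rather than taken from the question), the shape would be roughly:

import java.sql.Connection;
import java.sql.SQLException;

public class NeedService {

    // The insert and the cluster re-verification run as one unit at SERIALIZABLE
    // isolation, so concurrent addNeed calls cannot interleave their reads and writes.
    public void addNeed(Connection conn, int nodeOne, int nodeTwo) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        int oldIsolation = conn.getTransactionIsolation();
        conn.setAutoCommit(false);
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        try {
            submitNewNeed(conn, nodeOne, nodeTwo);   // hypothetical: insert the link
            verifyClusters(conn, nodeOne, nodeTwo);  // hypothetical: re-categorize clusters
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setTransactionIsolation(oldIsolation);
            conn.setAutoCommit(oldAutoCommit);
        }
    }

    private void submitNewNeed(Connection conn, int nodeOne, int nodeTwo) throws SQLException {
        // insert into needs (node_one, node_two) values (?, ?) -- omitted
    }

    private void verifyClusters(Connection conn, int nodeOne, int nodeTwo) throws SQLException {
        // the multiple selects/updates that re-categorize the clusters -- omitted
    }
}

The key point is that the serialization then happens in the database rather than in the service process, so it still holds when the service is instantiated per call or scaled out.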

Can I get the instances of alive objects of a certain type in C#?

This is a C# 3.0 question. Can I use reflection or the memory management classes provided by the .NET Framework to count the total number of live instances of a certain type in memory?
I can do the same thing using a memory profiler, but that requires extra time to dump the memory and involves third-party software. What I want is only to monitor a certain type, and I want a lightweight method that fits easily into unit tests. The purpose of counting the live instances is to ensure I don't have any unexpected living instances causing a "memory leak".
Thanks.
To do it entirely within the application you could keep an instance counter, but it would need to be explicitly coded and managed inside each class; there's no silver bullet that I'm aware of to let you query the framework from within the executing code to see how many instances are alive.
What you're asking for is really the domain of a profiler. You can purchase one or build your own, but it requires your application to run as a child process of the profiler. Rolling your own isn't an easy undertaking, by the way.
If you want to consider the instance counter it would have to be something like:
public class MyClass : IDisposable
{
    private static readonly object _ClassInstancesLock = new object();
    private static int _ClassInstances;

    public MyClass()
    {
        ++_ClassInstances;
    }

    public void Dispose()
    {
        --_ClassInstances;
    }

    // Count of constructed-but-not-yet-disposed instances.
    private static int ClassInstances
    {
        get
        {
            lock (_ClassInstancesLock)
            {
                return _ClassInstances;
            }
        }
    }
}
This is just a really rough sample: there's no guarantee of thread-safety (critical for this type of approach), and it leaves the door wide open for Dispose to be called, the instance counter to decrement, but the object not to be properly GC'd. To diagnose that bundle of joy you'll need, you guessed it, a professional profiler, or at least WinDbg.
Edit: I just noticed the very last line of your question and needed to say that my above approach, as shoddy and failure-prone as it is, is almost guaranteed to deceive and lie to you about the true number of instances if you're experiencing a leak. The best tool, IMO, for attacking these problems is ANTS Memory Profiler. Version 5 is a double-edged sword in that they broke the Performance and Memory profilers into two separate SKUs (they used to be bundled together), but Memory Profiler 5.0 is absolutely lightning fast. Profiling these problems used to be slow as molasses, but they've somehow gotten around that.
Unless this is for a personal project with zero intent of redistribution, you should invest the few hundred dollars needed for ANTS, but by all means use its trial period first. It's a great tool for exactly this kind of analysis.
The only way I see to do this without any form of instrumentation is to use the CLR Profiling API to track object lifetimes. I'm not aware of any APIs available to managed code to do the same thing, and, as far as I know, the CLR doesn't keep a list of live objects anywhere (so even with the profiling API you have to build the data structures for that yourself).
VB.NET has a feature that lets you track objects in the debugger, but it actually emits additional code specifically for that (which basically registers all created objects in an internal list of weak references). You could do the same, using e.g. PostSharp to post-process your assemblies.
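Sketched in Java rather than C# purely as an illustration of that weak-reference registry idea (InstanceTracker is a hypothetical name, not anything PostSharp generates for you), the bookkeeping might look like:

import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Every tracked object is registered through a WeakReference, so counting
// live instances never keeps them alive.
public final class InstanceTracker {

    private static final List<WeakReference<Object>> instances = new ArrayList<>();

    public static synchronized void register(Object o) {
        instances.add(new WeakReference<>(o));
    }

    // Note: the count only drops once the GC has actually cleared a reference.
    public static synchronized int liveCount() {
        int count = 0;
        Iterator<WeakReference<Object>> it = instances.iterator();
        while (it.hasNext()) {
            if (it.next().get() != null) {
                count++;
            } else {
                it.remove(); // drop entries whose targets were collected
            }
        }
        return count;
    }
}

A post-processing tool would then inject a register(this) call into every constructor of the types you want to watch.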

Is it bad practice to "go deep" with your application of callbacks?

Weird question, but I'm not sure whether this is an anti-pattern or not.
Say I have a web app that will be rendering 1000 records to an html table.
The typical approach I've seen is to send a query down to the database, translate the records in some way into some abstract form (be it an array, an object, etc.), and place the translated records into a collection that is then iterated over in the view.
As the number of records grows, this approach uses up more and more memory.
Why not send, along with the query, a callback that performs an operation on each of the translated rows as they are read from the database? This would mean that you don't need to collect the data for further iteration in the view, so the memory footprint shrinks, and you're not iterating over the data twice.
There must be something implicitly wrong with this approach, because I rarely see it used anywhere. What's wrong with this approach?
Thanks.
Actually, this is exactly how a well-developed application should behave.
There is nothing wrong with this approach, except that not all database interfaces allow you to do this easily.
If we are talking about tabularizing 10 records for yet another social network, there is no need to mess with callbacks if you can get an array of hashes or whatever with a single call that is already implemented for you.
There must be something implicitly wrong with this approach, because I rarely see it used anywhere.
I use it. Frequently. Even when I wouldn't use too much memory by repeatedly copying the data, using a callback just seems cleaner. In languages with closures, it also lets you keep relevant code together while factoring out the messy DB stuff.
This is a "limited by your tools" class of problem: Most programming languages don't allow to say "Do something around this code". This was solved in recent years with the advent of closures. Think of a closure as a way to pass code into another method which is then executed in a context. For example, in GSQL, you can write:
def l = []
sql.eachRow("select id from table where time > ?", [time]) { row ->
    l << row[0]
}
This will open a connection to the database, create a statement and a result set, and then run l << row[0] for each row the DB returns. Note that the code runs inside of eachRow(), but it can access local variables (l) as well as the closure parameter the call provides (row).
With this kind of code, you can even generate the result of an HTTP request on the fly without keeping much of the page in RAM at any time. In my case, I'd stream a 2 MB document to the browser using only a few KB of RAM, and the browser would then chew for 83 s to parse it.
This is roughly what the iterator pattern allows you to do. In many cases this breaks down at the interface between your application and the database. Technologies like LINQ even have solutions that can send code back to the database.
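To make the same pattern concrete outside of Groovy, here is a minimal Java/JDBC sketch of a data-access method that accepts a callback and invokes it once per row instead of building a collection; the table, columns and class name are hypothetical, and in practice you might define a small functional interface that permits SQLException instead of java.util.function.Consumer:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.function.Consumer;

public class RecordStreamer {

    private final Connection connection;

    public RecordStreamer(Connection connection) {
        this.connection = connection;
    }

    // Runs the query and hands each row to the callback as it is read,
    // so no intermediate collection of all the records is ever built.
    public void eachRecordSince(Timestamp since, Consumer<ResultSet> callback) throws SQLException {
        String sql = "select id, title from records where created_at > ?"; // hypothetical table
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setTimestamp(1, since);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    callback.accept(rs); // e.g. render one HTML table row straight to the response
                }
            }
        }
    }
}

A caller would pass a callback that writes one table row per record directly to the output stream, rather than accumulating a list and iterating over it again in the view.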
I've found it easier to use an interface resolver than deep callbacks where they're hooked up through several classes. MS has a much fancier version than mine, called Unity. It provides a much cleaner way of accessing classes that should not be tightly coupled:
http://www.codeplex.com/unity
