Camel aggregator does not aggregate all

I am splitting some Java objects and then aggregating them, and I am confused about how the completion strategy works in Camel (2.15.2). I am using both completion size and completion timeout. If I understand correctly, the completion timeout should not have much effect here, because there is not much waiting going on.
Altogether I have 3000+ objects, but only part of them seem to be aggregated, and the number changes with the completion size: with a size of 100 around 800 are aggregated, and with 200 up to around 1600. I don't know the number of objects in advance, so I cannot rely on an assumed number.
Can anyone explain what I am doing wrong here?
If I use eagerCheckCompletion, it aggregates the whole thing in one go, which I don't want.
Below is my route:
from("direct:specializeddatavalidator")
    .to("bean:headerFooterValidator")
    .split(body())
        .process(rFSStatusUpdater)
        .process(dataValidator)
        .choice()
            .when(header("discrepencyList").isNotNull()).to("seda:errorlogger")
            .otherwise().to("seda:liveupdater")
        .end();

from("seda:liveupdater?concurrentConsumers=4&timeout=5000")
    .aggregate(simple("${in.header.contentType}"), batchAggregationStrategy())
        .completionSize(MAX_RECORDS)
        .completionTimeout(BATCH_TIME_OUT)
    .to("bean:liveDataUpdater");

from("seda:errorlogger?concurrentConsumers=4")
    .aggregate(simple("${in.header.contentType}"), batchAggregationStrategy("discrepencyList"))
        .completionSize(MAX_RECORDS_FOR_ERRORS)
        .completionTimeout(BATCH_TIME_OUT)
    .process(errorProcessor)
    .to("bean:liveDataUpdater");

If you want to aggregate all the split messages, you can simply pass the aggregation strategy to the splitter:
.split(body(), batchAggregationStrategy())
Depending on how you want it to behave, you can also add
.shareUnitOfWork().stopOnException()
See http://camel.apache.org/splitter.html for more info.


How can I g_signal_connect() by ID rather than string name?

The typical way to handle a button press in GTK is:
g_signal_connect(GTK_BUTTON(myButton), "pressed", G_CALLBACK(myButtonHandler), NULL);
However, I find it bad, slow, and unnecessary to use strings (such as "pressed") for internal identification. If I can find the numerical signal ID that corresponds to this name, I can skip the parsing step. But how do I connect a signal by ID rather than by string name? I did a lot of digging and found this, and I also learned that g_signal_connect is a macro that expands to g_signal_connect_data, but neither quite solves my problem.
Is this possible? If so, how do I do it?
You can use g_signal_connect_closure_by_id(), but then you have to create a GClosure to hold your callback and callback data.
I would really recommend against this, as it adds boilerplate to your code for little benefit. You generally connect a signal only once. If you are connecting a signal in a tight loop, then you are probably doing something wrong, or you have a very unusual use case. In any case, signal names are interned, which means you are not even incurring the cost of string comparisons, only the cost of splitting the string at :: if the signal has a detail. Don't bother optimizing this unless it actually shows up as a bottleneck in your profiler.

Flink trigger on a custom window

I'm trying to evaluate Apache Flink for the use case we're currently running in production using custom code.
Say there is a stream of events, each carrying an attribute X that is a continuously increasing integer. That is, a run of contiguous events has this attribute set to N, the next run has it set to N+1, and so on.
I want to break the stream into windows of events with the same value of X and then do some computation on each window separately.
So I define a GlobalWindow and a custom Trigger. In the trigger's onElement method I check the attribute of each element against the saved value of the current X (held in a state variable). If they differ, I conclude that all events with X=CURRENT have been accumulated, so it is time to run the computation and increase the X value in the state.
The problem with this approach is that the first element of the next logical batch (with X=CURRENT+1) has already been consumed at that point, even though it is not part of the previous batch.
Is there a way to put it back into the stream so that it is properly accounted for in the next batch?
Or maybe my approach is entirely wrong and there is an easier way to achieve what I need?
Thank you.
I think you are on the right track.
A Trigger specifies when a window can be processed and when its results can be emitted.
The WindowAssigner is the part that decides which window an element is assigned to. So I would say you also need to provide a custom WindowAssigner that assigns the same window to all elements with an equal value of X.
A more idiomatic way to do this with Flink would be to use stream.keyBy(X).window(...). The keyBy(X) takes care of grouping elements by their value of X. You then apply whatever kind of window you like. In your case a session window may be a good choice: it fires for each key after that key hasn't been seen for some configurable period of time.
This approach is also much more robust to out-of-order data, which you must always assume in a stream processing system.

threadpools - boss/worker vs peer (workcrew) models

I'm aiming to use a threadpool with pthreads and am trying to choose between these two threading models. It seems to me that the peer model is more suitable when working with fixed input, whereas the boss/worker model is better for dynamically changing work items. However, I'm a little unsure of how exactly to get the peer model to work with a threadpool.
I have a number of tasks that all need to be performed on the same data set. Here's some simple pseudocode for how I would look at tackling this:
data = [0 ... 999]
data_index = 0
data_size = 1000
tasks = [0 ... 99]
task_index = 0
threads = [0 ... 31]
thread_function()
{
    while (true)
    {
        index = data_index++ (using atomics)
        if index >= data_size
        {
            sync
            if thread_index == 0
            {
                data_index = 0
                task_index++
                sync
            }
            else
            {
                sync
            }
            continue
        }
        tasks[task_index](data[index])
    }
}
(Firstly, it seems like there should be a way of making this use just one synchronisation point, but I'm not sure whether that's possible.)
The above code seems like it will work well when the tasks are known in advance, though I guess a threadpool is unnecessary for this particular problem. However, even if the data items are still predefined across all tasks, the boss/worker model seems better suited when the tasks are not known in advance. Is it possible to use the boss/worker model but still allow the work to be picked up by the threads themselves (as above), with the boss essentially suspending itself until all tasks are complete? (Maybe this is still termed the peer model?)
My final question is about the synchronisation: barrier or condition variable, and why?
If anyone can suggest a better approach to this problem, or poke holes in any of my assumptions, that would be great. Unfortunately I'm restricted from using a higher-level library such as tbb for this.
Edit: I should point out, in case this isn't clear, that each task needs to be completed in its entirety before moving on to the next.
I'm a bit confused by your description, so I hope the following is relevant.
I have always found this pattern very useful: the "boss" is responsible for detecting work and dispatching it to a worker pool based on some algorithm; from then on, the worker is independent.
In this scenario a worker is always waiting for work and is not aware of any other worker. It processes requests and, when it finishes, may send a notification of completion.
This has the advantage of a clean separation between the work itself and the algorithm that balances it across the threads.
The other option is for the "boss" to maintain a pool of work items and for the workers to pick them up as soon as they are free. I would guess this is more complex to implement and requires more synchronization, and I do not see its benefit over the first approach.
Control logic and worker state are maintained by the "boss" in both scenarios.
Since the parallel work is done within a task, each "boss" object handles one task. In a simple implementation the "boss" blocks until its task is finished, which then allows the next "boss" in line to be called.
Regarding synchronization: unless I'm missing something, you only need to sync once, when all the workers have finished. That sync is done at the "boss"; the workers just send it notifications that they are done.
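As a rough illustration of the question's variant, where the boss posts one task at a time, the workers claim data items themselves through an atomic index, and the boss sleeps on a condition variable until the last worker reports completion, here is a minimal sketch. All names (struct crew, run_task, run_all, add_one, times_two) and the pool size of 4 are assumptions made for the example, and error checking is omitted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define DATA_SIZE   1000
#define NUM_WORKERS 4                /* assumed pool size */

typedef void (*task_fn)(int *item);
static int data[DATA_SIZE];          /* shared data set, zero-initialised */

struct crew {
    pthread_mutex_t lock;
    pthread_cond_t  work_ready;      /* boss -> workers: a task was posted   */
    pthread_cond_t  work_done;       /* workers -> boss: last worker is idle */
    task_fn         task;            /* task currently being run             */
    unsigned        generation;      /* incremented once per posted task     */
    int             active;          /* workers still busy with this task    */
    int             shutdown;        /* set by the boss when no tasks remain */
    atomic_int      next_index;      /* next data item to claim              */
};

static void *worker(void *arg)
{
    struct crew *c = arg;
    unsigned seen = 0;               /* last generation this worker handled */
    for (;;) {
        pthread_mutex_lock(&c->lock);
        while (c->generation == seen && !c->shutdown)
            pthread_cond_wait(&c->work_ready, &c->lock);
        if (c->shutdown) {
            pthread_mutex_unlock(&c->lock);
            return NULL;
        }
        seen = c->generation;
        task_fn task = c->task;
        pthread_mutex_unlock(&c->lock);

        /* Claim items one at a time until the data set is exhausted. */
        for (;;) {
            int i = atomic_fetch_add(&c->next_index, 1);
            if (i >= DATA_SIZE)
                break;
            task(&data[i]);
        }

        /* Report completion; only the last worker to finish wakes the boss. */
        pthread_mutex_lock(&c->lock);
        if (--c->active == 0)
            pthread_cond_signal(&c->work_done);
        pthread_mutex_unlock(&c->lock);
    }
}

/* Boss side: post one task and block until every worker has finished it. */
static void run_task(struct crew *c, task_fn task)
{
    pthread_mutex_lock(&c->lock);
    c->task = task;
    c->active = NUM_WORKERS;
    atomic_store(&c->next_index, 0);
    c->generation++;
    pthread_cond_broadcast(&c->work_ready);
    while (c->active > 0)
        pthread_cond_wait(&c->work_done, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

/* Runs the tasks strictly one after another over the whole data set. */
static void run_all(const task_fn *tasks, size_t ntasks)
{
    struct crew c;
    pthread_t threads[NUM_WORKERS];
    pthread_mutex_init(&c.lock, NULL);
    pthread_cond_init(&c.work_ready, NULL);
    pthread_cond_init(&c.work_done, NULL);
    c.task = NULL; c.generation = 0; c.active = 0; c.shutdown = 0;
    atomic_init(&c.next_index, 0);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&threads[i], NULL, worker, &c);
    for (size_t t = 0; t < ntasks; t++)
        run_task(&c, tasks[t]);

    pthread_mutex_lock(&c.lock);     /* no tasks left: tell workers to exit */
    c.shutdown = 1;
    pthread_cond_broadcast(&c.work_ready);
    pthread_mutex_unlock(&c.lock);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);
}

/* Two hypothetical tasks whose results depend on the order they run in. */
static void add_one(int *x)   { *x += 1; }
static void times_two(int *x) { *x *= 2; }
```

Calling run_all with the array { add_one, times_two } should leave every element of data at 2, which only holds if the first task has finished on the whole data set before the second begins. A condition variable is used here rather than a barrier so that the boss is the only thread that waits for everyone; the workers wait only for new work.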

Create a non-blocking timer to erase data

Can someone show me how to create a non-blocking timer that deletes the data in a struct?
I have this struct:
struct info{
    char buf;
    int expire;
};
When the time given by expire has elapsed, I need to delete the data in my struct. The catch is that my program is doing something else at the same time. How can I do this, preferably without using signals?
It won't work. The time it takes to delete the structure is most likely much less than the time it would take to arrange for the structure to be deleted later. The reason is that in order to delete the structure later, some other structure has to be created to hold the information needed to find it when we get around to deleting it. And then that structure itself will eventually need to be freed. For a task so small, it's not worth the overhead of dispatching.
In a different case, where the deletion is really complicated, it may be worth it. For example, if the structure contains lists or maps with numerous sub-elements that must each be traversed and destroyed, then it might be worth dispatching a thread to do the deletion.
The details vary depending on what platform and threading standard you're using. But the basic idea is that somewhere you have a function that causes a thread to be tasked with running a particular chunk of code.
Update: Hmm, wait, a timer? If no code is going to access the structure, why not delete it now? And if code is going to access it, why are you setting the timer now? Something's fishy with your question. Don't even think of arranging to have anything deleted until everything is 100% finished with it.
If you don't want to use signals, you're going to need threads of some kind. Any more specific answer will depend on what operating system and toolchain you're using.
I think the idea is to have a timer per entry, as in client/server logic, and to delete each entry when its timer expires.
If so, it can be implemented in a couple of ways.
a) Single-threaded: keep a queue sorted by time remaining (expiry time minus now), so that the entry with the shortest remaining time gets its callback first. You can implement the timer queue using a map in C++. Whenever your other work is done, call the timer function to check whether any entry in the queue has expired; if so, it deletes that data. The prototypes might look like set_timer( void (pf)(void)); and add_timer(void * context, long time_to_expire); to add a timer.
b) Multi-threaded: the add_timer logic is the same, except that it takes a lock before adding to the global map. A dedicated timer thread sleeps (on a condition variable) for the shortest interval in the map, and the thread that adds an entry notifies it. It needs to sleep on a condition variable because a new timer may have a shorter interval than the current minimum.
Suppose the first call was for 5 secs from now
and the second timer is 3 secs from now.
If the timer thread simply sleeps instead of waiting on a condition variable, it will wake up after 5 secs, whereas it is expected to wake up after 3 secs.
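The single-threaded variant (a) could be sketched in plain C roughly as follows, using a sorted linked list in place of a C++ map. This is only a sketch under assumptions: set_timer here takes a callback that receives the context pointer, which differs from the prototype above, and add_timer_at, poll_timers, erase_info and struct timer_entry are hypothetical names introduced for the example. The current time is passed to poll_timers explicitly so the behaviour is easy to follow.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct info {                 /* the struct from the question */
    char buf;
    int expire;
};

typedef void (*expire_fn)(void *context);

struct timer_entry {          /* hypothetical queue node */
    time_t deadline;          /* absolute expiry time */
    void *context;            /* data handed to the callback on expiry */
    struct timer_entry *next;
};

static struct timer_entry *timer_head = NULL;   /* sorted, earliest first */
static expire_fn timer_callback = NULL;

/* Register the function to call for every expired entry. */
void set_timer(expire_fn pf) { timer_callback = pf; }

/* Insert an entry with an absolute deadline, keeping the list sorted.
   Returns 0 on success, -1 if memory could not be allocated. */
int add_timer_at(void *context, time_t deadline)
{
    struct timer_entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->deadline = deadline;
    e->context = context;

    struct timer_entry **pp = &timer_head;
    while (*pp != NULL && (*pp)->deadline <= deadline)
        pp = &(*pp)->next;
    e->next = *pp;
    *pp = e;
    return 0;
}

/* Convenience wrapper: expire time_to_expire seconds from now. */
int add_timer(void *context, long time_to_expire)
{
    return add_timer_at(context, time(NULL) + (time_t)time_to_expire);
}

/* Call this whenever the program has a spare moment. It fires the callback
   for every entry whose deadline has passed and returns how many fired. */
int poll_timers(time_t now)
{
    int fired = 0;
    while (timer_head != NULL && timer_head->deadline <= now) {
        struct timer_entry *e = timer_head;
        timer_head = e->next;
        if (timer_callback != NULL)
            timer_callback(e->context);
        free(e);
        fired++;
    }
    return fired;
}

/* Example callback: wipe the contents of a struct info. */
void erase_info(void *context)
{
    memset(context, 0, sizeof(struct info));
}
```

In a real program the main loop would call poll_timers(time(NULL)) between units of work. Nothing blocks, but an entry is only erased the next time the loop gets around to polling, so expiry is as precise as the polling interval.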
Hope this clarifies your question.
Cheers,

Arrays in PowerBuilder

I have this code
n_userobject inv_userobject[]
For i = 1 to dw_1.Rowcount()
    inv_userobject[i] = create n_userobject
    .
    .
    .
NEXT
dw_1.rowcount() returns only 210 rows. Oddly, somewhere from row 170 onward the application stops and crashes on inv_userobject[i] = create n_userobject.
My question: is there any limit on arrays, or on declaring userobjects in arrays?
I already tried destroying the objects after the loop to check whether that would be a possible solution, but it still crashes.
Or how can I somehow refresh the userobject?
Has anyone else encountered this?
Thanks for all your help.
First, your memory problem. You're definitely not running into an array limit. If I were to guess, one of the instance variables in n_userobject isn't being cleaned up properly (i.e. it points to a class that isn't destroyed when the parent class is destroyed), or it points to a class that similarly doesn't clean up after itself. If you've got PB Enterprise, I'd do a profiling trace with a smaller loop and see what is being garbage collected (there's a utility called CDMatch that really helps with this process).
Secondly, let's face it, you're just doing this to avoid writing a reset method. Even if you get this functional, it will never be as efficient as writing your own reset method and reusing the same instance over again. Yes, it's another method you'll have to maintain whenever the instance variable list changes or the defaults change, but you'll easily gain that back in performance.
Good luck,
Terry.
I'm assuming the crash you're facing is at the PBVM level, and not a regular PB exception (which you can catch in your code). If I'm wrong, please add the exception details.
A loop of 170-210 iterations really isn't a large one. However, crashes within loops are usually the result of resource exhaustion. What we usually do in long loops is call GarbageCollect() occasionally. How often it should be called depends on what your code does: calling it frequently can reduce memory use, but it will slow down the run. Read this for more.
If this doesn't help, make sure the error does not come from some non-PB code (imported DLL or so). You can check the stack trace during the crash to see the exception's origin.
Lastly, if you're supported by Sybase (or a local representative), you can send them a crash dump. They can analyze it, and see if it's a bug in PB, and if so, let you know when it was (or will be) fixed.
What I would normally do with a DataWindow is to create an object that processes the data in a row and call it for each row.
The only suggestion I have is to take the RowCount() call out of the For statement (For i = 1 to dw_1.Rowcount()). As written, the code recounts the rows on every iteration. Get the count into a variable first and use the variable in the loop. It should run a bit better and be far easier to debug.
