How to benchmark DB operations using JMH? - benchmarking

Sometimes we have to perform the same DB operation multiple times within a loop. How can I compute the execution time for each operation using JMH?
public void applyAll(ArrayList<parameter_type> lists) {
    for (parameter_type param : lists) {
        saveToDB(param);
    }
}
How can I measure the execution time of saveToDB(param) each time it is called?

DB operations are really not something to microbenchmark. Their performance depends on many factors that are nearly impossible to isolate.
As for using parameters, have a look at this answer that explains the use of the @Param annotation.

As @RafaelWinterhalter said, this type of call is prone to giving misleading results in benchmarks. But if you still want to try, then:
Serialize and save a reference list of calls.
Then, in the benchmark, use a @State(Scope.Thread) object to restore this list into an array, and keep a loop counter variable there.
Then: @Benchmark public int test1_saveToDB(MyState state) { saveToDB(state.params[state.i]); return state.i++; }
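A minimal sketch of the state-plus-counter idea, written as plain Java so it runs without JMH on the classpath. In a real benchmark the state class would carry @State(Scope.Thread) and the method @Benchmark. The names ReplayState, nextParam and the in-memory saveToDB stub are hypothetical. The index wraps around here, on the assumption that JMH invokes the benchmark method many more times than the list has entries, in which case a bare state.i++ would eventually run past the end of the array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the @State(Scope.Thread) object from the answer.
class ReplayState {
    final String[] params; // reference list of calls, restored once during setup
    int i = 0;             // loop counter kept in the state, not in the benchmark method

    ReplayState(List<String> recorded) {
        this.params = recorded.toArray(new String[0]);
    }

    // Returns the next recorded parameter and wraps at the end of the array.
    String nextParam() {
        String p = params[i];
        i = (i + 1) % params.length;
        return p;
    }
}

class ReplayBenchmarkSketch {
    static final List<String> saved = new ArrayList<>();

    // Stub for the real DB call; an assumption for illustration only.
    static int saveToDB(String param) {
        saved.add(param);
        return saved.size();
    }

    // Body of what would be the @Benchmark method: one DB call per invocation.
    static int test1_saveToDB(ReplayState state) {
        return saveToDB(state.nextParam());
    }

    public static void main(String[] args) {
        ReplayState state = new ReplayState(Arrays.asList("a", "b", "c"));
        for (int n = 0; n < 5; n++) {
            test1_saveToDB(state);
        }
        System.out.println(saved); // [a, b, c, a, b]
    }
}
```

Returning the result of saveToDB from the benchmark method keeps the call from being treated as dead code.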


Unique Count for Multiple timewindows - Process or Reduce function combined with ProcessWindowFunction?

We need to find the number of unique elements in the input stream for multiple time windows.
The input data object is defined as InputData(ele1: Integer, ele2: String, ele3: String).
The stream is keyed by ele1 and ele2. The requirement is to find the number of unique ele3 values in the last 1 hour, last 12 hours, and last 24 hours, and the result should refresh every 15 minutes.
We are using sliding time windows with a slide interval of 15 minutes and window sizes of 1, 12, and 24 hours.
Since we need to find unique elements, we are using a process function as the window function, which stores all the elements (events) for each key until the end of the window and then counts the unique elements. We thought its memory consumption could be optimized.
Instead, we tried combining a reduce function with a process function to aggregate incrementally: the reduce function keeps the unique elements in a HashSet, and the process window function then returns the size of that HashSet.
public class UserDataReducer implements ReduceFunction<UserData> {
public UserData reduce(UserData u1, UserData u2) {
return new UserData.Builder(u1.getElement1(), u1.getElement2(),)
public class UserDataProcessor extends ProcessWindowFunction<UserData,Metrics,
Tuple2<Integer, String>,TimeWindow> {
public void process(Tuple2<Integer, String> key,
ProcessWindowFunction<UserData, Metrics, Tuple2<Integer, String>, TimeWindow>.Context context,
Iterable<UserData> elements,
Collector<Metrics> out) throws Exception {
if (Objects.nonNull(elements.iterator().next())) {
UserData aggregatedUserAttribution = elements.iterator().next();
out.collect(new Metrics(
We expected heap memory consumption to drop, since we are now storing only one object per key per slide as state.
But there was no decrease in heap memory consumption; it was almost the same or slightly higher.
In the heap dump of the new job we observed a high number of HashMap instances, consuming more memory than the input data objects occupied in the earlier job.
What would be the best way to solve this: a process function, or incremental aggregation with a combination of reduce and process functions?
State Backend: Hashmap
Flink Version: 1.14.2 on Yarn
In this case I'm not really sure that partial aggregation will reduce heap size. It should let you reduce state size by some factor that depends on how unique the dataset is. Heap usage stays high because (as far as I understand) you are effectively copying the HashSet for every single element assigned to the window. Those copies are garbage collected, but not immediately, so you will see quite a few of those HashSets in heap dumps.
Overall, the ProcessFunction will quite probably generate larger state, but in terms of heap size the two may be quite similar, as you have noticed.
One thing you might consider is more advanced processing. You can read up on Triggers and implement a trigger such that you have a single 24h window that emits results every 1h, 12h and 24h (after which the window is purged). Note that in that case you would need to do some work in the ProcessFunction to make sure the results are correct. One more thing you can look at is this post.
Note that both proposed solutions require some understanding of Flink and more manual processing of window elements.
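For the HashSet copying mentioned above, one option that neither the question nor the answer uses is an accumulator that is mutated in place. The sketch below is plain Java with no Flink types; its method names mirror Flink's AggregateFunction (createAccumulator, add, merge, getResult), and the class name DistinctCountSketch is hypothetical. It only illustrates avoiding a new set per element. It does not reduce the number of overlapping sliding windows, each of which would still hold its own set.

```java
import java.util.HashSet;
import java.util.Set;

// Plain-Java sketch; method names follow Flink's AggregateFunction, but nothing here depends on Flink.
class DistinctCountSketch {
    Set<String> createAccumulator() {
        return new HashSet<>();
    }

    Set<String> add(String ele3, Set<String> acc) {
        acc.add(ele3); // mutate in place; no new set is allocated per element
        return acc;
    }

    Set<String> merge(Set<String> a, Set<String> b) {
        a.addAll(b);
        return a;
    }

    long getResult(Set<String> acc) {
        return acc.size(); // number of distinct ele3 values seen
    }

    public static void main(String[] args) {
        DistinctCountSketch agg = new DistinctCountSketch();
        Set<String> acc = agg.createAccumulator();
        for (String e : new String[] {"x", "y", "x", "z", "y"}) {
            acc = agg.add(e, acc);
        }
        System.out.println(agg.getResult(acc)); // 3
    }
}
```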

Is FLIP-140 still correct in how it describes sorting/spilling data?

FLIP-140 states:
We will introduce a sorting step (with potential spilling, reusing the UnilateralSortMerger implementation) before every keyed operator for sorting/grouping inputs by their keys. This will allow us to process records in per-key groups, which will enable us to use a simplified implementation of a StateBackend that is not organized in key groups and only ever keeps values for a single key.
The single key at a time execution will be used for the Batch style execution as decided by the algorithm described in FLIP-134: DataStream Semantics for Bounded Input .
Moreover it will be possible to disable it through a execution.sorted-shuffles.enabled configuration option.
However, I see no documentation for execution.sorted-shuffles.enabled, and no references to it in the code. So is the above description of how things work still correct? I'm wondering how the "only keep one key's state around" approach would work without sorting.
This code makes me think that both the sorting and special state backend are being used with batch execution:
private void setBatchStateBackendAndTimerService(StreamGraph graph) {
    boolean useStateBackend = configuration.get(ExecutionOptions.USE_BATCH_STATE_BACKEND);
    boolean sortInputs = configuration.get(ExecutionOptions.SORT_INPUTS);
    checkState(
            !useStateBackend || sortInputs,
            "Batch state backend requires the sorted inputs to be enabled!");
    if (useStateBackend) {
        LOG.debug("Using BATCH execution state backend and timer service.");
        graph.setStateBackend(new BatchExecutionStateBackend());
        graph.setCheckpointStorage(new BatchExecutionCheckpointStorage());
    } else {

Using Active Record pattern in CakePHP, and avoiding passing arrays around

As my CakePHP 2.4 app gets bigger, I'm noticing I'm passing a lot of arrays around in the model layer. Cake has kinda led me down this path because it returns arrays, not objects, from its find calls. But more and more, it feels like terrible practice.
For example, in my Job model, I've got a method like this:
public function durationInSeconds($job) {
    return $job['Job']['estimated_hours'] * 3600; // convert to seconds
}
Whereas I imagine that, using the active record pattern, it should look more like this:
public function durationInSeconds() {
    return $this->data['Job']['estimated_hours'] * 3600; // convert to seconds
}
(i.e., take no parameter, and assume the current instance represents the Job you want to work with)
Is that second way better?
And if so, how do I use it when, for example, I'm looping through the results of a find('all') call? Cake returns an array - do I loop through that array and do a read for every single row? (seems a waste to re-fetch the info from the database)
Or should I implement a kind of setActiveRecord method that emulates read, like this:
function setActiveRecord($row) {
    $this->id = $row['Job']['id'];
    $this->data = $row;
}
Or is there a better way?
EDIT: The durationInSeconds method was just the simplest possible example. I know that for that particular case I could use virtual fields. But in other cases I've got methods that are somewhat complex, where virtual fields won't do.
The best solution depends on the problem you need to solve. But if you have to call a function for each result row, it may be worth redesigning the query so that it fetches all the necessary data.
In the case you have shown, you can simply use a virtual field on the Job model:
$this->virtualFields = array(
    'duration_in_seconds' => 'Job.estimated_hours * 3600',
);
...and/or you can use a method like this:
public function durationInSeconds($id = null) {
    if (!empty($id)) {
        $this->id = $id;
    }
    return $this->field('estimated_hours') * 3600; // convert to seconds
}

Testing an algorithm's speed. How?

I'm currently testing different algorithms that determine whether an integer is a perfect square. During my research I found this question on SO:
Fastest way to determine if an integer's square root is an integer
I'm comparatively new to programming. When testing the different algorithms presented in that question, I found that this one
bool istQuadratSimple(int64 x)
{
    int32 tst = (int32)sqrt(x);
    return tst*tst == x;
}
actually runs faster than the one provided by A. Rex in the question I linked. I used an NSTimer object for this testing, printing my results with NSLog.
My question now is: How is speed-testing done in a professional way? How can I achieve equivalent results to the ones provided in the question I posted above?
The problem with calling just this function in a loop is that everything will be in the cache (both the data and the instructions). You wouldn't measure anything sensible; I wouldn't do that.
Given how small this function is, I would try to look at the generated assembly code of this function and the other one and I would try to reason based on the assembly code (number of instructions and the cost of the individual instructions, for example).
Unfortunately, it only works in trivial / near trivial cases. For example, if the assembly codes are identical then you know there is no difference, you don't need to measure anything. Or if one code is like the other plus additional instructions; in that case you know that the longer one takes longer to execute. And then there are the not so clear cases... :(
(See the update below.)
You can get the assembly with the -S flag from both gcc and clang; with clang, adding -emit-llvm (-S -emit-llvm) gives you the LLVM intermediate representation instead.
Hope this helps.
UPDATE: Response to Prateek's question in the comment "is there any way to determine the speed of one particular algorithm?"
Yes, it is possible, but it gets horribly complicated REALLY quickly. Long story short: ignoring the complexity of modern processors and simply accumulating some predefined cost associated with each instruction can lead to very inaccurate results (the estimate can be off by a factor of 100, due to the cache and the pipeline, among other things). If you try to take into consideration the complexity of modern processors (the hierarchical cache, the pipeline, etc.), things get very difficult. See for example Worst Case Execution Time Prediction.
Unless you are in a clear situation (trivial / near trivial case), for example the generated assembly codes are identical or one is like the other plus a few instructions, it is also hard to compare algorithms based on their generated assembly.
However, here a simple function of two lines is shown, and for that, looking at the assembly could help. Hence my answer.
I am not sure if there is any professional way of checking the speed (if there is, let me know as well). For the method that you pointed to in your question, I would probably do something like this in Java.
package Programs;

import java.math.BigDecimal;
import java.math.RoundingMode;

public class SquareRootInteger {

    public static boolean isPerfectSquare(long n) {
        if (n < 0)
            return false;
        long tst = (long) (Math.sqrt(n) + 0.5);
        return tst * tst == n;
    }

    public static void main(String[] args) {
        long iterator = 1;
        int precision = 10;
        long startTime = System.nanoTime(); // system time before calling isPerfectSquare repeatedly
        while (iterator < 1000000000) {
            isPerfectSquare(iterator);
            iterator++;
        }
        long endTime = System.nanoTime(); // system time after the repeated executions of isPerfectSquare
        long duration = endTime - startTime;
        BigDecimal dur = new BigDecimal(duration);
        BigDecimal iter = new BigDecimal(iterator);
        System.out.println("Speed "
                + dur.divide(iter, precision, RoundingMode.HALF_UP).toString()
                + " nano secs"); // average time taken for one execution of the method
    }
}
You can check your method in a similar fashion and see which one outperforms the other.
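A hedged variation on the timing loop above. It adds a warm-up pass so the JIT has a chance to compile the hot loop before the clock starts, and it uses the result of every call so the work cannot be optimised away as dead code. The class name TimingSketch, the helper countSquares and the workload size are assumptions for illustration. This is a rough sketch, not a substitute for a proper benchmark harness.

```java
class TimingSketch {
    static boolean isPerfectSquare(long n) {
        if (n < 0) return false;
        long tst = (long) (Math.sqrt(n) + 0.5);
        return tst * tst == n;
    }

    // Counts perfect squares in [1, limit]; returning the count keeps every call observable.
    static long countSquares(long limit) {
        long count = 0;
        for (long i = 1; i <= limit; i++) {
            if (isPerfectSquare(i)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long limit = 10_000_000L;   // assumed workload size
        countSquares(limit);        // warm-up run, not timed

        long start = System.nanoTime();
        long found = countSquares(limit);
        long elapsed = System.nanoTime() - start;

        System.out.println("found " + found + " squares, "
                + (double) elapsed / limit + " ns per call on average");
    }
}
```

Running the timed section several times and comparing the figures gives a feel for how noisy the measurement is.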
Record the time value before your massive calculation and the value after it. The difference is the elapsed time.
Write a shell script in which you run the program, and run it with 'time ./' to get its running time.

How to make the program work this way?

So I have a program that does calculations with numbers. The program is threaded, and the number of threads is specified by the user.
I will give a close example
static void *program_thread(void *thread)
bool somevar = true;
work = getwork();
if(condition1 blah blah)
somevar = false; /* disable getwork */
somevar = true; /* condition was either met or not met, so we request
new work either way */
Then with pthreads (and I will skip some code) I do
int main(blah)
if (pthread_create(&thr->pth, NULL, program_thread, thread_number)) {
printf("%s","program thread create failed");
return 1;
Now I will start explaining. The number of threads created is specified by the user, so I run a for loop and create as many threads as I need.
Each thread calls
work = getwork();
Thus each thread gets independent work to do; however, the CPU is slow for this kind of job. It tries to compute something by trying 2^32 numbers (that is, from 1 to 4 294 967 296).
But my CPU can only do around 3 million numbers per second, and long before it could reach 4 billion numbers, it is restarted (for new work).
So I then thought of a better method. Instead of each thread getting totally different work, all the threads should get the same work and split the numbers they need to try.
The problem is that I can't control what work it gets, so I must fetch
work = getwork();
before initiating the threads. The question is HOW? Using pthread_create obviously... but then what?
There is more than one way to do it:
Split your work package into smaller parts (so that your getWork returns a new, smaller piece of work).
Store your work in a common place that you access from your threads using a reader-writer pattern.
Use the pthread API: the 4th parameter of pthread_create is passed to your thread, so you can do something like the following code:
Work work = getWork();
if (pthread_create(&thr->pth, NULL, program_thread, (void*) &work))
And your program_thread function would look like this:
static void *program_thread(void *pxThread) {
    Work* pWork = (Work*) pxThread;
    /* ... */
}
Of course, you need to check the validity of the pointer and the usual things (in my example I created it on the stack, which is most probably a bad idea). Note that your code passes thread_number as a pointer, which is usually a bad idea. If you want to transfer more information to your thread, simply wrap it in a structure.
I'm not sure I fully understood your issue, but this could give you some hints most probably. Please note also that when doing multithreading, you need to take into account specific issues like race conditions, concurrent access and more complex lifecycle of objects...
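The "same work, split numbers" idea from the question comes down to range arithmetic, sketched here in Java rather than with pthreads. The work is fetched once before any thread starts, and each thread receives the shared work plus its own half-open slice of the 2^32 candidates. The names RangeSplitSketch and slice, and the "shared-work" placeholder standing in for one getwork() result, are hypothetical. In C, the same start/end pair could be put in a structure together with the work pointer and passed as the 4th argument of pthread_create.

```java
import java.util.ArrayList;
import java.util.List;

class RangeSplitSketch {
    // Returns {start, end} (end exclusive) of the slice for thread `index` out of `threads`.
    static long[] slice(long total, int threads, int index) {
        long base = total / threads;
        long extra = total % threads; // the first `extra` slices get one more number
        long start = index * base + Math.min(index, extra);
        long end = start + base + (index < extra ? 1 : 0);
        return new long[] {start, end};
    }

    public static void main(String[] args) throws InterruptedException {
        final long total = 1L << 32;        // 2^32 candidates, as in the question
        final int threads = 4;              // would come from the user
        final String work = "shared-work";  // placeholder for a single getwork() result

        List<Thread> pool = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final long[] range = slice(total, threads, t);
            // Each thread sees the same work but only its own slice of numbers.
            Thread th = new Thread(() ->
                    System.out.println(work + ": [" + range[0] + ", " + range[1] + ")"));
            pool.add(th);
            th.start();
        }
        for (Thread th : pool) {
            th.join();
        }
    }
}
```

Because the slices do not overlap and each thread only reads the shared work, no locking is needed for the search itself in this sketch.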