How to calculate the power consumption of a VM?

How to calculate the power consumption of a VM? - cloudsim

Respected researchers, i want to calculate the power consumed by the virtual machines of a physical server in a cloud datacenter.
please help me i will be very thankful for this appreciation.

/**
* The cost of each byte of bandwidth (bw) consumed.
*/
protected double costPerBw;
/**
* The total bandwidth (bw) cost for transferring the cloudlet by the
* network, according to the {#link #cloudletFileSize}.
*/
protected double accumulatedBwCost;
// Utilization
/**
* The utilization model that defines how the cloudlet will use the VM's
* CPU.
This segment are taken from Cloudlet.java.line 212. This may be helpful.
Or, as you set each VM properties you can calculate the power consumption.
//VM description
int vmid = 0;
int mips = 250;
long size = 10000; //image size (MB)
int ram = 2048; //vm memory (MB)
long bw = 1000;
int pesNumber = 1; //number of cpus
String vmm = "Xen"; //VMM name
This segment taken from CloudSimExample3.java line 64

CloudSim Plus has a built-in feature to compute VM's power consumption.
The method below shows how to use such a feature. You can get the complete example here.
private void printVmsCpuUtilizationAndPowerConsumption() {
for (Vm vm : vmList) {
System.out.println("Vm " + vm.getId() + " at Host " + vm.getHost().getId() + " CPU Usage and Power Consumption");
double vmPower; //watt-sec
double utilizationHistoryTimeInterval, prevTime = 0;
final UtilizationHistory history = vm.getUtilizationHistory();
for (final double time : history.getHistory().keySet()) {
utilizationHistoryTimeInterval = time - prevTime;
vmPower = history.vmPowerConsumption(time);
final double wattsPerInterval = vmPower*utilizationHistoryTimeInterval;
System.out.printf(
"\tTime %8.1f | Host CPU Usage: %6.1f%% | Power Consumption: %8.0f Watt-Sec * %6.0f Secs = %10.2f Watt-Sec\n",
time, history.vmCpuUsageFromHostCapacity(time) *100, vmPower, utilizationHistoryTimeInterval, wattsPerInterval);
prevTime = time;
}
System.out.println();
}
}

Related

How do I obtain the theoretical/Linpack FLOPS performance on basic vector/matrix operations?

The question is simple. How do I further optimize my code as the basic matrix operations are critical and common to my calculation. BLAS and LAPACK operations are good in linear algebra but neither of them provides basic element by element addition/multiply operations (Hadamard). Theoretical performance maybe difficult, but Linpack performance or 60~80% Linpack performance should be achievable. (I can only do 12%, if I use multiply-add, then only 25%)
For references
Theoretical performance: 8259u has 4 cores * 3.8GHz * 16 FLOPS = 240 GFlops
Linpack performance: 8259u can run as fast as 140~160 GFlops double precision operations.
Platform: Macbook Pro 2018, Monterey
CPU: i5-8259u, 4c8t
RAM: 8GB
CC: gcc 11.3.0
CFLAGS: -mavx2 -mfma -fopenmp -O3
Here's my attempt
the flops are calculated as follows:
double time = stop - start;
double ops = 1.0 * Nx * Ny * iterNum; //2.0 for complex numbers
double flops = ops / time;
double gFlops = flops / 1E9;
Here's some results when I run my code. real and complex results are almost the same. Only showing the real results (roughly):
//Nx = Ny = 2048, iterNum = 10000
//Typical matrix size and iteration depth for my calculation
threads = 1: 1 GFlops
threads = 2: 2 GFlops
threads = 4: 3 GFlops
threads = 8: 4 GFlops
threads = 16: 9 GFlops
threads = 32: 11 GFlops
threads = 64: 15 GFlops
threads = 128: 18 GFlops
threads = 256: 19 GFlops
threads = 512: 21 GFlops
threads = 1024: 20 GFlops
threads = 2048: 40 GFlops // wrong answer
For the convenience of large matrix on heap and integrating with mathGL, the matrix is flattened as a vector consisting of Nx * Ny elements cascading by rows.
// for real numbers
x = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
y = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
z = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
sum = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
// for complex numbers
x = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
y = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
z = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
sum = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
and the addition was done parallelly using openmp.
double start = omp_get_wtime();
#pragma omp parallel private(shift)
{
for (int tds = omp_get_thread_num(); tds < threads; tds = tds + threads)
{
shift = Nx * Ny / threads * tds;
for (int i = 0; i < iterNum; i++)
{
AddComplex(sum+shift, sum+shift, z+shift, Nx/threads, Ny);
}
}
}
double stop = omp_get_wtime();
I wrote explicit vectorization code using AVX intrinsics "immintrin.h".
//real matrix addition
void AddReal(double *summation, const double *summand, const double *addend, int Nx, int Ny)
{
int nBlock = Nx * Ny / realPackSize;
int nRem = Nx * Ny % realPackSize;
register __m256d packSummand, packAddend, packSum;
const double *px = summand;
const double *py = addend;
double *pSum = summation;
for (int i = 0; i < nBlock; i++)
{
packSummand = _mm256_load_pd(px);
packAddend = _mm256_load_pd(py);
packSum = _mm256_add_pd(packSummand, packAddend);
_mm256_store_pd(pSum, packSum);
px = px + realPackSize;
py = py + realPackSize;
pSum = pSum + realPackSize;
}
for (int i = 0; i < nRem; i++)
{
pSum[i] = px[i] + py[i];
}
px = NULL;
py = NULL;
pSum = NULL;
return;
}
//Complex matrix addition
void AddComplex(double complex *summation, const double complex *summand, const double complex *addend, int Nx, int Ny)
{
int nBlock = Nx * Ny / complexPackSize;
int nRem = Nx * Ny % complexPackSize;
register __m256d packSummand, packAddend, packSum;
const double complex *px = summand;
const double complex *py = addend;
double complex *pSum = summation;
for (int i = 0; i < nBlock; i++)
{
packSummand = _mm256_load_pd(px);
packAddend = _mm256_load_pd(py);
packSum = _mm256_add_pd(packSummand, packAddend);
_mm256_store_pd(pSum, packSum);
px = px + complexPackSize;
py = py + complexPackSize;
pSum = pSum + complexPackSize;
}
for (int i = 0; i < nRem; i++)
{
pSum[i] = px[i] + py[i];
}
px = NULL;
py = NULL;
pSum = NULL;
return;
}

Level 1 (eg. dot product) and level 2 (eg. vector-matrix multiplication) BLAS functions are known not to scale (especially level 1 BLAS functions) as opposed to level 3 (eg. matrix-multiplication). Indeed, they are generally memory-bound: the amount of data read/written is O(n) while the amount of floating-point operation is also O(n). This is not the case for level 3 BLAS which are generally clearly compute-bound.
Theoretical performance maybe difficult, but Linpack performance or 60~80% Linpack performance should be achievable
If the computation is memory bound, then, no, this is not possible. Linpack is generally clearly compute bound on nearly all machine. The think is memory is slow and the speed of the RAM is not increasing as fast as the speed of processors over the last decades. This is known as a memory wall (formulated few decades ago and still true nowadays).
Here's some results when I run my code.
Having a faster computation with from using 1024 threads instead of 512 on a mobile processor with 4 core and 8 thread make me think that there is a huge problem somewhere. The maximum should be reached with 8 threads, or otherwise this means the computation is clearly inefficient. Indeed, running more threads than hardware threads cause the OS scheduler to make expensive context-switch (higher overhead). In the end, your processor never runs more that 8 tasks at a time. There are two possibility:
The timings are not correct (the provided piece of code about that seems fine to me)
The program is bogus
The computation exhibit a super-linear speed up (possibly due to cache)
I wrote explicit vectorization code using AVX intrinsics "immintrin.h".
The hot loop contains 2 loads, 1 store, 1 add and few instructions incrementing integers. Your processor can do 2 loads and 1 store per cycle so the SIMD part can be done in 1 cycle of throughput (though the latency can be much bigger) assuming nBlock is large enough.
Your processor can do 2 add per cycle so half the throughput is lost. However, you cannot write something faster than that if the load/write are mandatory.
If complexPackSize is smaller than a SIMD lane, then I think the processor has to make complex operation due to the overlapp with the past iteration that will certainly make it run the loop much less efficiently (a loop carried dependency will make the loop latency bound which is very inefficient here). If complexPackSize is much larger than a cache line, then prefetching will likely be an issue.
Your processor cannot execute too many instructions at the same time. The increment instruction and the loop check cause 5 instruction to be executed, which consume at least 1 cycle. This reduce the throughput by a factor of 2 again so not more than 25% of the theoretical performance can be reached. This can be improved a bit by unrolling the loop. Unrolling might also improve the execution because the _mm256_add_pd instruction has a pretty high latency. One should keep in mind that SIMD instructions are great for throughput but not for latency. Thus, when the latency is not an issue, SIMD codes should be fast.
Note that the write allocate cache policy cause data to be read when _mm256_store_pd is used increasing the amount of data transferred from the RAM and reducing the observed throughput. _mm256_stream_pd can be used to avoid this effect but it is fast only if data are not read just after or when data do not fit in the cache anyway. It also require data to be aligned. In fact, _mm256_store_pd also requires that and if it is not the case, it certainly cause a silent bug. The same applies for _mm256_load_pd: _mm256_loadu_pd should be used instead for unaligned data. I am not sure data read is always aligned. It should be fine if complexPackSize is a power of two divisible by 32 as well as shift. However, I highly doubt this is the case for shift, especially with a large number of threads. I also find very suspicious to use a constant complexPackSize while the SIMD lanes have a fixed size. Did you checked the results in all cases?

Strategy for doing final reduction

I am trying to implement an OpenCL version for doing reduction of a array of float.
To achieve it, I took the following code snippet found on the web :
__kernel void sumGPU ( __global const double *input,
__global double *partialSums,
__local double *localSums)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Copy from global memory to local memory
localSums[local_id] = input[get_global_id(0)];
// Loop for computing localSums
for (uint stride = group_size/2; stride>0; stride /=2)
{
// Waiting for each 2x2 addition into given workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Divide WorkGroup into 2 parts and add elements 2 by 2
// between local_id and local_id + stride
if (local_id < stride)
localSums[local_id] += localSums[local_id + stride];
}
// Write result into partialSums[nWorkGroups]
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
}
This kernel code works well but I would like to compute the final sum by adding all the partial sums of each work group.
Currently, I do this step of final sum by CPU with a simple loop and iterations nWorkGroups.
I saw also another solution with atomic functions but it seems to be implemented for int, not for floats. I think that only CUDA provides atomic functions for float.
I saw also that I could another kernel code which performs this operation of sum but I would like to avoid this solution in order to keep a simple readable source. Maybe I cannot do without this solution...
I must tell you that I use OpenCL 1.2 (returned by clinfo) on a Radeon HD 7970 Tahiti 3GB (I think that OpenCL 2.0 is not supported with my card).
More generally, I would like to get advice about the simplest method to perform this last final summation with my graphics card model and OpenCL 1.2.

If that float's order of magnitude is smaller than exa scale, then:
Instead of
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
You could use
if (local_id == 0)
{
if(strategy==ATOMIC)
{
long integer_part=getIntegerPart(localSums[0]);
atom_add (&totalSumIntegerPart[0] ,integer_part);
long float_part=1000000*getFloatPart(localSums[0]);
// 1000000 for saving meaningful 7 digits as integer
atom_add (&totalSumFloatPart[0] ,float_part);
}
}
this will overflow float part so when you divide it by 1000000 in another kernel, it may have more than 1000000 value so you get its integer part and add it to the real integer part:
float value=0;
if(strategy==ATOMIC)
{
float float_part=getFloatPart_(totalSumFloatPart[0]);
float integer_part=getIntegerPart_(totalSumFloatPart[0])
+ totalSumIntegerPart[0];
value=integer_part+float_part;
}
just a few atomic operations shouldn't be effective on whole kernel time.
Some of these get___part can be written easily already using floor and similar functions. Some need a divide by 1M.

Sorry for previous code.
also It has problem.
CLK_GLOBAL_MEM_FENCE effects only current workgroup.
I confused. =[
If you want to reduction sum by GPU, you should enqueue reduction kernel by NDRangeKernel function after clFinish(commandQueue).
Plaese just take concept.
__kernel void sumGPU ( __global const double *input,
__global double *partialSums,
__local double *localSums)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Copy from global memory to local memory
localSums[local_id] = input[get_global_id(0)];
// Loop for computing localSums
for (uint stride = group_size/2; stride>0; stride /=2)
{
// Waiting for each 2x2 addition into given workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Divide WorkGroup into 2 parts and add elements 2 by 2
// between local_id and local_id + stride
if (local_id < stride)
localSums[local_id] += localSums[local_id + stride];
}
// Write result into partialSums[nWorkGroups]
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
barrier(CLK_GLOBAL_MEM_FENCE);
if(get_group_id(0)==0){
if(local_id < get_num_groups(0)){ // 16384
for(int n=0 ; n<get_num_groups(0) ; n+= group_size )
localSums[local_id] += partialSums[local_id+n];
barrier(CLK_LOCAL_MEM_FENCE);
for(int s=group_size/2;s>0;s/=2){
if(local_id < s)
localSums[local_id] += localSums[local_id+s];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(local_id == 0)
partialSums[0] = localSums[0];
}
}
}

Performance penalty on misaligned data

As a CS student I'm trying to understand the very basics of a computer. As I stumbled across this website, I wanted to test those performance penalties on my own. I understand what he's talking about and why this happens / should happen.
Anyway, here's my code which I used to call those functions he wrote:
int main(void)
{
int i = 0;
uint8_t alignment = 0;
uint8_t size = 1024 * 1024 * 10; // 10MiB
uint8_t* block = malloc(size);
for(alignment = 0; alignment <= 17; alignment++)
{
start_t = clock();
for(i = 0; i < 100000; i++)
Munge8(block + alignment, size);
end_t = clock();
printf("%i\n", end_t - start_t);
}
// Repeat, but next time with Munge16, Munge32, Munge64
}
I don't know if my CPU & RAM are so blazingly fast, but the output of all 4 functions (Munge8, Munge16, Munge32 and Munge64) is always 3 or 4 (random, no pattern).
Is this possible? 100000 repetitions should be alot more work to do, or am I that wrong? I'm working on a Windows 7 Enterprise x64, Intel Core i7-4600U CPU # 2.10GHz. All compiler optimizations are turned off i.e. /Od.
All the related questions on SO didn't answer why my solution isn't working.
What am I doing wrong? Any help is greatly appreciated.
Edit:
First of all: Thank you very much for your help. After changing the type of size from uint8_t to uint32_t I altered all the inside loops causing undefined behaviour of the test functions to two separate lines:
while( data32 != data32End )
{
data32++;
*data32 = -(*data32);
}
Now I'm getting a relatively stable output of 25/26, 12/13, 6 and 3 ticks, calculating the average of 100 repetitions. Is this a logical result? Does this mean that my architecture handles unaligned access as fast (or as slow) as aligned access? Do I measure the time to inexactly? Or is there a problem with accuracy when dividing by 10? My new code:
int main(void)
{
int i = 0;
uint8_t alignment = 0;
uint64_t size = 1024 * 1024 * 10; // 10MiB
uint8_t* block = malloc(size);
printf("%i\n\n", CLOCKS_PER_SEC); // yields 1000, just for comparison how fast my machine 'ticks'
for(alignment = 0; alignment <= 17; alignment++)
{
start_t = clock();
for(i = 0; i < 100; i++)
singleByte(block + alignment, size);
end_t = clock();
printf("%i\n", (end_t - start_t)/100);
}
// Again, repeat with all different functions
}
General criticism is, of course, also appreciated. :)

This fails due to integer overflow:
uint8_t size = 1024 * 1024 * 10; // 10MiB
it should be:
const size_t size = 1024 * 1024 * 10; // 10MiB
No idea why you'd ever use an 8-bit quantity to hold something that large.
Investigate how to enable all warnings for your compiler.

It seems there is a problem with your clock function. 1000 for CLOCKS_PER_SEC is way too low for your processor even if CPU throttling is activated (you should get around 2100000 if frequency scaling is turned off). How much cycles do you get for each averaged mesure by using cycle.h ?

analysis of cpu cache access time

I have the following program which I with the help of someother on stackoverflow wrote to understand cachelines and CPU caches.I have the result of the calculation posted below.
1 450.0 440.0
2 420.0 230.0
4 400.0 110.0
8 390.0 60.0
16 380.0 30.0
32 320.0 10.0
64 180.0 10.0
128 60.0 0.0
256 40.0 10.0
512 10.0 0.0
1024 10.0 0.0
I have plotted a graph using gnuplot which is posted below.
I have the following questions.
is my timing calculation in milliseconds correct ? 440ms seems to
be a lot of time?
From the graph cache_access_1 (redline) can we conclude that the
size of cache line is 32 bits (and not 64-bits?)
Between the for loops in the code is it a good idea to clear the
cache? If yes how do I do that programmatically?
As you can see I have some 0.0 values in the result above.?
What does this indicate? is the granularity of measurement too
coarse?
Kindly reply.
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>
#define MAX_SIZE (512*1024*1024)
int main()
{
clock_t start, end;
double cpu_time;
int i = 0;
int k = 0;
int count = 0;
/*
* MAX_SIZE array is too big for stack.This is an unfortunate rough edge of the way the stack works.
* It lives in a fixed-size buffer, set by the program executable's configuration according to the
* operating system, but its actual size is seldom checked against the available space.
*/
/*int arr[MAX_SIZE];*/
int *arr = (int*)malloc(MAX_SIZE * sizeof(int));
/*cpu clock ticks count start*/
for(k = 0; k < 3; k++)
{
start = clock();
count = 0;
for (i = 0; i < MAX_SIZE; i++)
{
arr[i] += 3;
/*count++;*/
}
/*cpu clock ticks count stop*/
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("cpu time for loop 1 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
}
printf("\n");
for (k = 1 ; k <= 1024 ; k <<= 1)
{
/*cpu clock ticks count start*/
start = clock();
count = 0;
for (i = 0; i < MAX_SIZE; i += k)
{
/*count++;*/
arr[i] += 3;
}
/*cpu clock ticks count stop*/
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("cpu time for loop 2 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
}
printf("\n");
/* Third loop, performing the same operations as loop 2,
but only touching 16KB of memory
*/
for (k = 1 ; k <= 1024 ; k <<= 1)
{
/*cpu clock ticks count start*/
start = clock();
count = 0;
for (i = 0; i < MAX_SIZE; i += k)
{
count++;
arr[i & 0xfff] += 3;
}
/*cpu clock ticks count stop*/
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("cpu time for loop 3 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
}
return 0;
}

Since you are on Linux, I'll answer from that perspective. I will also write with an Intel (i.e., x86-64) architecture in mind.
440 ms is probably accurate. A better way to look at the results would be time per element or access. Note that increasing your k reduces the number of elements accessed. Now, cache access 2 shows a fairly steady result of 0.9ns / access. This time is roughly comparable to 1 - 3 cycles per access (depending on CPU's clock rate). So sizes 1 - 16 (maybe 32) are accurate.
No (although I will first assume you mean 32 versus 64 byte). You should ask yourself, what does "cache line size" look like? If you access smaller than the cache line, then you will miss and subsequently hit one or more times. If you are greater than or equal to the cache line size, every access will miss. At k=32 and above, the access time for access 1 is relatively constant at 20ns per access. At k=1-16, overall access time is constant, suggesting that there are approximately the same number of cache misses. So I would conclude that the cache line size is 64 bytes.
Yes, at least for the last loop that is only storing ~16KB. How? Either touch a lot of other data, like another GB array. Or call an instruction like x86's WBINVD, which writes to memory and then invalidates all cache contents; however, it requires you to be in kernel-mode.
As you noted, beyond size 32, the times hover around 10ms, which is showing your timing granularity. You need to either increase the time required (so that a 10ms granularity is sufficient) or switch to a different timing mechanism, which is what the comments are debating. I'm a fan of using the instruction rdtsc (read timestamp counter (i.e., cycle count)), but this can be even more problematic than the suggestions above. Switching your code to rdtsc basically required switching clock, clock_t, and CLOCKS_PER_SEC. However, you could still face clock drift if your thread migrates, but this is a fun test so I wouldn't concern myself with this issue.
More caveats: the trouble with consistent strides (like powers of 2) is that the processor likes to hide the cache miss penalty by prefetching. You can disable the prefetcher on many machines in the BIOS (see "Changing the Prefetcher for Intel Processors").
Page faults may also be impacting your results. You are allocating 500M ints or about 2GB of storage. Loop 1 tries to touch the memory so that the OS will allocate pages, but if you don't have this much available memory (not just total, as the OS, etc takes up some space) then your results will be skewed. Furthermore, the OS may start reclaiming some of the space so that you will always be page faulting on some of your accesses.
Related to the previous, the TLB is also going to have some impact on the results. The hardware keeps a small cache of mappings from virtual to physical address in a translation lookaside buffer (TLB). Each page of memory (4KB on Intel) needs a TLB entry. So your experiment is going to need 2GB / 4KB => ~500,000 entries. Most TLBs hold less than 1000 entries, so the measurements are also skewed by this miss. Fortunately, it is only once every 4KB or 1024 ints. It is possible that malloc is allocating "large" or "huge" pages for you, for more details - Huge Pages in Linux.
Another experiment would be to repeat the third loop, but change the mask that you are using, so that you can observe the size of each cache level (L1, L2, maybe L3, rarely L4). You may also find that different cache levels use different cacheline sizes.

how to set a fps of a camera?

I am using a frame grabber inspecta-5 with 1GB memory, also a high speed camera "EoSens Extended Mode, 640X480 1869fps, 10X8 taps". I am new to coding for grabbers and also to controlling the camera. the Inspecta-5 grabber, gives me different options, like changing the requested number of frames from the camrea to grabber and also from grabber to main memory. also I can use camrea links to send signal to camera and have different exposure times.
but Im not really sure what should I use to obtain 1000 frame per second rate, and how can I test it?
according to the software manual if I set the following options in the camera profile :
ReqFrame=1000
GReqFrame=1000
it means transfer 1000 frames from the camera to grabber and also transfer 1000 frame from grabber to Main memory, respectively.
but does it mean that I have 1000fps?
what would be the option for setting the fps to 1000 ? and also how can I test it that I really grabbed 1000 frames in One Second????
here is a link to the grabber software manual : mikrotron.de/index.php?de_downloadfiles you can find the software manual under the "Inspecta Level1 API for WinNT/2000/XP" section. the file name is "i5-level1-sw_manual_e.pdf" , just in case if anybody needs it.
THANK YOU

At 1,000fps you don't have much time to snap a frame or even save a frame. Use the following example and plug in your estimated FPS, capture and save latencies. At 1,000fps, you can have a total of about .8ms latency (and why not .99999? I don't know - something to do with unattainable theoretical max or my old PC).
public static void main(String args[]) throws Exception {
int fps = 1000;
float simulationCaptureNowMS = .40f;
float simulationSaveNowNowMS = .40f;
final long simulationCaptureNowNS = (long)(simulationCaptureNowMS * 1000000.0f);
final long simulationSaveNowNowNS = (long)(simulationSaveNowNowMS * 1000000.0f);
final long windowNS = (1000*1000000)/fps;
final long movieDurationSEC = 2;
long dropDeadTimeMS = System.currentTimeMillis() + (1000* movieDurationSEC);
while(System.currentTimeMillis() < dropDeadTimeMS){
long startNS = System.nanoTime();
actionSimulator(simulationCaptureNowNS);
actionSimulator(simulationSaveNowNowNS);
long endNS = System.nanoTime();
long sleepNS = windowNS-(endNS-startNS);
if (sleepNS<0) {
System.out.println("Data loss. Try again.");
System.exit(0);
}
actionSimulator(sleepNS);
}
System.out.println("No data loss at "+fps+"fps with interframe latency of "+(simulationCaptureNowMS+simulationSaveNowNowMS)+"ms");
}
private static void actionSimulator(long ns) throws Exception {
long d = System.nanoTime()+ns;
while(System.nanoTime()<d) Thread.yield();
}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight