why TDengine database 3.0.1.8's query performance is far better than 3.0.1.4 - tdengine

this SQL:
select * from meters limit 1000000 >> /dev/null;
3.0.1.4 takes 14 seconds, while 3.0.1.8 takes 2.4 seconds, which is quite a big difference.
May I know which part of the code was optimized?

In TDengine 3.0, the shell called fflush to flush the redirected output file after each record was written. This per-record flush was new in 3.0 (2.6 did not do it), which is why redirected output was slow.
We removed that per-record flush to increase performance.
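As a rough illustration of the cost (a minimal C sketch, not TDengine's actual shell code; the file name and record format are invented):

#include <stdio.h>

/* Write a million short records, optionally calling fflush() after each one.
 * A per-record fflush() turns every record into its own write() system call,
 * which is roughly what made the ">> file" redirection slow; letting stdio
 * buffer the output batches many records into each write(). */
static void write_records(const char *path, int flush_each_record)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return;
    for (int i = 0; i < 1000000; i++) {
        fprintf(fp, "record %d\n", i);
        if (flush_each_record)
            fflush(fp);           /* slow path: one syscall per record */
    }
    fclose(fp);                   /* fast path: buffered, flushed on close */
}

Dropping the per-record flush leaves normal stdio buffering in place, which is essentially what 3.0.1.8 goes back to.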

Related

why a single connection to redis performs badly and how to make it faster

I have been benchmarking redis recently and here is the result I got:
ubuntu 13.10 x86_64 with kernel version 3.11,
Intel® Core™ i5 CPU M 430 @ 2.27GHz × 4
8GB Memory
So given the same load, multiple connections to redis can perform 8x faster than a single connection. I am not considering pipelining here, and I have already tried many optimization approaches in my test (using taskset to pin redis to a single core, using a unix domain socket).
Two questions:
Why can multiple connections to redis perform this much faster than a single connection?
Is there any other way (other than pipelining) to improve performance with a single connection?
I did some performance testing on this problem over the past few days and got some results.
The key is to figure out where the extra latency comes from in the single-connection case. My first hypothesis was that it comes from epoll. To find that out, I used systemtap with a script that measures epoll latency (first the 1-connection result, then the 10-connection result; the unit is nanoseconds):
From the results, you can see that the average time spent inside epoll is almost the same: 8655 ns vs 10254 ns. However, there is a significant difference in the total number of calls. With 10 connections we call epoll_wait 444528 times, but in the single-connection case we call it over 4 times as often, 2000054 times, and that is what leads to the additional time.
The next question is why epoll is called so many fewer times with multiple connections. After exploring the redis source code a little, I found the reason. Every time epoll_wait returns, it reports the number of events it is going to handle. The pseudocode looks like this (hiding most of the details):
fds = epoll_wait(fds_we_are_monitoring);
for fd in fds:
    handle_event_happening_in(fd);
The return value fds is a collection of all the events on which IO is happening, for example readable input from a socket. In the single-connection benchmark, fds_we_are_monitoring contains only 1 fd, since there is only one connection, so every call returns at most 1 event. In the 10-connection case, epoll_wait can return up to 10 events, which are then handled together in the for loop. The more events a single epoll_wait call returns, the faster we get through the workload, because the total number of requests is fixed (1M SET requests in this case).
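For reference, a bare-bones C sketch of this kind of event loop (my own illustration of the epoll pattern, not redis's actual ae.c code; handle_event stands in for the per-client handler):

#include <sys/epoll.h>

#define MAX_EVENTS 64

/* One pass of an epoll-based event loop: a single epoll_wait() call can return
 * events for many ready fds, and all of them are handled before the next
 * system call. With only one client, at most one fd can ever be ready, so the
 * loop body runs once per epoll_wait(). */
void process_events(int epfd, void (*handle_event)(int fd))
{
    struct epoll_event events[MAX_EVENTS];
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < n; i++)
        handle_event(events[i].data.fd);
}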
To verify this, I used systemtap to plot the distribution of return values of the function aeProcessEvents, which returns the number of events it handled.
We can see the average is 1 in the single-connection case vs 7 in the 10-connection case, which supports the hypothesis that the more events epoll returns per call, the faster we can handle the requests, until we become CPU bound.
I think I have the answer to the first question: why multiple connections to redis perform this much faster than a single connection. However, I am still wondering if there is any other way (other than pipelining) to improve performance with a single connection. I would appreciate it if anyone could share their thoughts on this.

get an accurate time with NTP, GPS and PPS as soon as possible

Because I work on an astronomical application, I need accurate UTC time.
Currently I use NTP with GPS and 1PPS over RS232 as the refclock:
server 127.127.20.0 mode 18 minpoll 4 maxpoll 4 prefer
fudge 127.127.20.0 flag1 1 flag2 0 flag3 1 flag4 0 time2 0.475
I read that NTP syncs the time by increasing or decreasing the speed of the system clock, e.g. to avoid duplicate timestamps that a large step could cause.
Because the computer is turned off from time to time, NTP needs a while to get in sync, depending on how far the system clock has drifted. If I'm wrong, please correct me.
Is there a way to force the almost instant use of the preferred GPS time? Getting in sync faster than usual would also be sufficient.
In an astronomical environment it's more important to have an accurate time as soon as possible than to have a gentle slewing algorithm, which can be rather slow.
When starting ntpd, the NTP daemon, there is an option -q that causes ntpd to set the time immediately and exit. The command "man ntpd" describes it, along with the -g option, which you might also want to use.
Most people prefer to do a quick time reset only at startup, using the -q option of ntpd (or the old way of doing it, ntpdate). Once the clock is set, you can use the standard startup command.
On many Linux systems, ntpd is started at boot time, in which case the time will already be correct to within 128 ms. If the computer was suspended or hibernated, though, the error could be larger. A manual procedure to increase the accuracy would be:
sudo service ntp stop
sudo ntpd -q
sudo service ntp start

Looking for an explanation for thread synchronization performance issue

When using kernel objects to synchronize threads running on different CPUs, is there perhaps some extra runtime cost when using Windows Server 2008 R2 relative to other OS's?
Edit: And as found out via the answer, the question should also include the phrase, "when running at lower CPU utilization levels." I included more information in my own answer to this question.
Background
I work on a product that uses shared memory and semaphores for communication between processes (when the two processes are running on the same machine). Reports of performance problems on Windows Server 2008 R2 (which I shorten to Win2008R2 after this) led me to find that sharing a semaphore between two threads on Win2008R2 was relatively slow compared to other OS’s.
Reproducing it
I was able to reproduce it by running the following bit of code concurrently on two threads:
for ( i = 0; i < N; i++ )
{
    WaitForSingleObject( globalSem, INFINITE );
    ReleaseSemaphore( globalSem, 1, NULL );
}
Testing with a machine that would dual boot into Windows Server 2003 R2 SP2 and Windows Server 2008 R2, the above snippet would run about 7 times faster on the Win2003R2 machine versus the Win2008R2 (3 seconds for Win2003R2 and 21 seconds for Win2008R2).
Simple Version of the Test
The following is the full version of the aforementioned test:
#include <windows.h>
#include <stdio.h>
#include <time.h>

HANDLE gSema4;
int gIterations = 1000000;

/* Each thread repeatedly acquires and releases the shared semaphore. */
DWORD WINAPI testthread( LPVOID tn )
{
    int count = gIterations;
    while ( count-- )
    {
        WaitForSingleObject( gSema4, INFINITE );
        ReleaseSemaphore( gSema4, 1, NULL );
    }
    return 0;
}

int main( int argc, char* argv[] )
{
    DWORD threadId;
    clock_t ct;
    HANDLE threads[2];

    gSema4 = CreateSemaphore( NULL, 1, 1, NULL );   /* initial count 1, max 1 */
    ct = clock();
    threads[0] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );
    threads[1] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );
    WaitForMultipleObjects( 2, threads, TRUE, INFINITE );
    printf( "Total time = %ld\n", (long)( clock() - ct ) );
    CloseHandle( gSema4 );
    return 0;
}
More Details
I updated the test so that each thread runs a single iteration and then forces a switch to the next thread on every loop: each thread signals the next thread to run at the end of each iteration (round-robin style). I also updated it to use a spinlock as an alternative to the semaphore (which is a kernel object).
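For illustration, a simplified sketch of the round-robin/spinlock idea (not the exact code that produced the numbers below):

#include <windows.h>

volatile LONG gTurn = 0;              /* index of the thread allowed to run next */
int gSpinIterations = 1000000;

DWORD WINAPI spinThread( LPVOID arg )
{
    LONG me = (LONG)(LONG_PTR)arg;    /* thread index: 0 or 1 */
    int i;
    for ( i = 0; i < gSpinIterations; i++ )
    {
        /* Spin in user mode (no kernel object) until it is this thread's turn. */
        while ( InterlockedCompareExchange( &gTurn, me, me ) != me )
            YieldProcessor();
        /* ... one iteration of "work" goes here ... */
        InterlockedExchange( &gTurn, 1 - me );    /* hand off to the other thread */
    }
    return 0;
}

Two threads running this function (created with arguments 0 and 1) can be timed exactly like the semaphore test above.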
All machines I tested on were 64-bit machines. I compiled the test mostly as 32-bit. If built as 64-bit, it ran a bit faster overall and changed the ratios some, but the final result was the same. In addition to Win2008R2, I also ran against Windows 7 Enterprise SP 1, Windows Server 2003 R2 Standard SP 2, Windows Server 2008 (not R2), and Windows Server 2012 Standard.
Running the test on a single CPU was significantly faster (“forced” by setting thread affinity with SetThreadAffinityMask and checked with GetCurrentProcessorNumber). Not surprisingly, it was faster on all OS’s when using a single CPU, but the ratio between multi-cpu and single cpu with the kernel object synchronization was much higher on Win2008R2. The typical ratio for all machines except Win2008R2 was 2x to 4x (running on multiple CPUs took 2 to 4 times longer). But on Win2008R2, the ratio was 9x.
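For reference, forcing a run onto a single CPU looks roughly like this (a simplified sketch, not the exact test code):

#include <windows.h>
#include <stdio.h>

/* Pin the calling thread to CPU 0 and report where it is running; this is
 * roughly how the "single CPU" runs were forced and then verified. */
static void pinToCpu0( void )
{
    SetThreadAffinityMask( GetCurrentThread(), 1 );    /* mask bit 0 = CPU 0 */
    printf( "running on CPU %lu\n", GetCurrentProcessorNumber() );
}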
However ... I was not able to reproduce the slowdown on all Win2008R2 machines. I tested on 4, and it showed up on 3 of them. So I cannot help but wonder if there is some kind of configuration setting or performance tuning option that might affect this. I have read performance tuning guides, looked through various settings, and changed various settings (e.g., background service vs foreground app) with no difference in behavior.
It does not seem to be necessarily tied to switching between physical cores. I originally suspected that it was somehow tied to the cost of accessing global data on different cores repeatedly. But when running a version of the test that uses a simple spinlock for synchronization (not a kernel object), running the individual threads on different CPUs was reasonably fast on all OS types. The ratio of the multi-cpu semaphore sync test vs multi-cpu spinlock test was typically 10x to 15x. But for the Win2008R2 Standard Edition machines, the ratio was 30x.
Here are some actual numbers from the updated test (times are in milliseconds):
+----------------+-----------+---------------+----------------+
| OS             | 2 cpu sem | 1 cpu sem     | 2 cpu spinlock |
+----------------+-----------+---------------+----------------+
| Windows 7      | 7115 ms   | 1960 ms (3.6) | 504 ms (14.1)  |
| Server 2008 R2 | 20640 ms  | 2263 ms (9.1) | 866 ms (23.8)  |
| Server 2003    | 3570 ms   | 1766 ms (2.0) | 452 ms (7.9)   |
+----------------+-----------+---------------+----------------+
Each of the 2 threads in the test ran 1 million iterations. Those tests were all run on identical machines. The Win Server 2008 R2 and Server 2003 numbers are from a dual-boot machine. The Win 7 machine has the exact same specs but was a different physical machine. The machine in this case is a Lenovo T420 laptop with a Core i5-2520M 2.5GHz. Obviously not a server-class machine, but I get similar results on true server-class hardware. The numbers in parentheses are the ratio of the first column to the given column.
Any explanation for why this one OS would seem to introduce extra expense for kernel level synchronization across CPUs? Or do you know of some configuration/tuning parameter that might affect this?
While it would make this exceedingly verbose and long post longer, I could post the enhanced version of the test code that the above numbers came from if anyone wants it. That would show the enforcement of the round-robin logic and the spinlock version of the test.
Extended Background
To try to answer some of the inevitable questions about why things are done this way. And I'm the same way: when I read a post, I often wonder why the question is even being asked. So here are some attempts to clarify:
What is the application? It is a database server. In some situations, customers run the client application on the same machine as the server. In that case, it is faster to use shared memory for communications (versus sockets). This question is related to the shared memory comm.
Is the workload really that dependent on events? Well ... the shared memory comm is implemented using named semaphores. The client signals a semaphore, the server reads the data, and the server signals a semaphore for the client when the response is ready. On other platforms, it is blindingly fast. On Win2008R2, it is not. It is also very dependent on the customer application. If they write it with lots of small requests to the server, then there is a lot of communication between the two processes.
Can a lightweight lock be used? Possibly. I am already looking at that. But it is independent of the original question.
Pulled from the comments into an answer:
Maybe the server is not set to the high-performance power plan? Win2k8 might have a different default. Many servers aren't by default, and this hits performance very hard.
The OP confirmed this as the root cause.
This is a funny cause for this behavior. The idea came to me while I was doing something completely different.
It could well be that the OS installation configuration varies. Perhaps the slow system is configured to disallow multiple threads from your process being scheduled simultaneously. If some other high-priority process were always (or mostly) ready to run, the only alternative would be for your threads to run sequentially, not in parallel.
I'm adding this additional "answer" information here rather than burying it in my overly long OP. @usr pointed me in the right direction with the power management suggestion. The contrived test in the OP as well as the original problem involve a lot of handshaking between different threads. The handshaking in the real-world app was across different processes, but testing showed the results do not differ whether threads or processes do the handshaking. The sharing of the semaphore (kernel sync object) across CPUs seems to be greatly affected on Windows Server 2008 R2 by the power settings when running at low (e.g., 5% to 10%) CPU usage. My understanding of this at this point is based purely on measuring and timing applications.
A related question on Serverfault talks about this some as well.
The Test Settings
OS Power Options Setting The default power plan for Windows Server 2008 R2 is "Balanced". Changing it to the "High Performance" option helped the performance of this test quite a bit. In particular, one setting under "Change advanced power settings" seems to be the critical one: the advanced settings have an option under Processor power management called Minimum processor state. The default value for this under the Balanced plan seems to be 5%. Changing it to 100% was the key in my testing.
BIOS Setting In addition, a BIOS setting affected this test greatly. I'm sure this varies a lot across hardware, but the primary machine I tested on has a setting named "CPU Power Management". The description of the BIOS setting is, "Enables or disables the power saving feature that stop (sic) the microprocessor clock automatically when there are no system activities." I changed this option to "Disabled".
Empirical Results
The two test cases shown are:
(a) Simple. A modified version of the one included in the OP. This simple test enforced round-robin switching at every iteration between two threads on two CPUs. Each thread ran 1 million iterations (thus, there were 2 million context switches across CPUs).
(b) Real World. The real world client/server test where a client was making many "small" requests of the server via shared memory and synchronized with global named semaphores.
The three test scenarios are:
(i) Balanced. Default installation of Windows Server 2008 R2, which uses the Balanced power plan.
(ii) HighPerf. I changed the power option from "Balanced" to "High Performance". Equivalently, the same results occurred by setting the Minimum Processor State CPU option as described above to 100% (from 5%).
(iii) BIOS. I disabled the CPU Power Management BIOS option as described above and also left the High Performance power option selected.
The times given are in seconds:
╔════════════════╦═════════════╦═══════════════╦════════════╗
║                ║ (i)Balanced ║ (ii) HighPerf ║ (iii) BIOS ║
╠════════════════╬═════════════╬═══════════════╬════════════╣
║ (a) Simple     ║ 21.4 s      ║ 9.2 s         ║ 4.0 s      ║
║ (b) Real World ║ 9.3 s       ║ 2.2 s         ║ 1.7 s      ║
╚════════════════╩═════════════╩═══════════════╩════════════╝
So after both changes were made (OS and BIOS), both the real world test and the contrived test ran about 5 times faster than under the default installation and default BIOS settings.
While I was testing these cases, I sometimes encountered a result I could not explain. When the CPU was busy (some background process would fire up), the test would run faster. I would file it away in my head and be puzzled for a while. But now it makes sense. When another process would run, it would bump up the CPU usage past whatever threshold was needed to keep it in a high power state and the context switches would be fast. I still do not know what aspect is slow (the primary cost is buried in the WaitForSingleObject call) but the end results now all kind of make sense.
This isn't a reasonable benchmark: your semaphores are always frobbed in the same process (and so presumably on the same CPU/core). An important part of the cost of locking in real-world cases is the memory traffic incurred when different CPUs/cores fight over exclusive access to the memory area (which bounces back and forth between their caches). Look for some more real-world benchmarks (sorry, not my area), or (even better) measure (some cut-down version of) your application with (contrived, but realistic) test data.
[Test data for benchmarks should never be the data used for testing or regression testing: the latter pokes at (probably rarely used) corner cases, while you want "typical" runs for benchmarking.]

Increasing CPU Utilization and keep it at a certain level using C code

I am writing C code (on Linux) that needs to consume a certain amount of CPU while it's running. I am carrying out an experiment in which I trigger certain actions upon reaching a certain CPU threshold. So, once the utilization reaches that threshold, I need to keep it at that level for, say, 30 seconds until I complete my experiments. I am monitoring the CPU utilization using the top command.
So my questions are -
1. How do I increase the CPU Utilization to a given value (in a deterministic way if possible)?
2. Once I get to the threshold, is there a way to keep it at that level for a pre-defined time?
Sample output of top command (the 9th column is CPU used by the 'top' process) -
19304 abcde 16 0 5448 1212 808 R 0.2 0.0 0:00.06 top
Similar to the above, I will look at the line in top that shows the utilization of my binary.
Any help would be appreciated. Also, let me know if you need more details.
Thanks!
Edit:
The following lines of code allowed me to control the CPU utilization quite well. In this case I have 2 options: keep it above 50% or keep it below 50%. After some trial and error, I settled on the given usleep values.
endwait = clock() + ( seconds * CLOCKS_PER_SEC );
while ( clock() < endwait ) { }       /* busy-wait to burn CPU time */
if ( cpu_utilization > 50 )           /* option 1: keep utilization above 50% */
    usleep( 250000 );
else                                  /* option 2: keep utilization below 50% */
    usleep( 700000 );
Hope this helps!
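For completeness, here is a self-contained sketch of the same duty-cycle idea (the 100 ms period and the hold_cpu_load name are illustrative choices, not the exact code I used):

#include <time.h>
#include <unistd.h>

/* Hold roughly target_percent CPU load on one core for the given number of
 * seconds by alternating a busy-wait and a sleep inside a 100 ms period. */
void hold_cpu_load( int target_percent, int seconds )
{
    const long period_us = 100000;                     /* 100 ms duty cycle */
    long busy_us  = period_us * target_percent / 100;
    long sleep_us = period_us - busy_us;
    time_t end = time( NULL ) + seconds;

    while ( time( NULL ) < end )
    {
        /* clock() counts CPU time, so sleeping does not advance it. */
        clock_t spin_until = clock() + (clock_t)( (double)busy_us * CLOCKS_PER_SEC / 1e6 );
        while ( clock() < spin_until )
            ;                                          /* burn CPU */
        if ( sleep_us > 0 )
            usleep( sleep_us );                        /* idle for the rest */
    }
}

Calling hold_cpu_load(50, 30) should keep one core near 50% for about 30 seconds; the exact value reported by top will wobble a bit depending on its sampling interval.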
cpuburn is known to drive CPU utilization so high that it raises the CPU's temperature to its maximum level.
It seems there is no longer an official website for it, but you can still get the source code via the Debian package or Google Code.
It's implemented in assembly, so you'll have to write some glue code in order to interact with it from C.
Something of this sort should have a constant CPU utilization, in my opinion:
md5sum < /dev/urandom

How to measure the power consumed by a C algorithm while running on a Pentium 4 processor?

How can I measure the power consumed by a C algorithm while running on a Pentium 4 processor (and any other processor will also do)?
Since you know the execution time, you can estimate the energy used by the CPU by looking up its power consumption in the P4 datasheet. For example, a 2.2 GHz P4 with a 400 MHz FSB has a typical Vcc of 1.3725 volts and Icc of 47.9 amps, which is (1.3725 * 47.9 =) 65.74 watts. Since your loop of 10,000 algorithm runs took 46.428570 s, a single run takes 46.428570 / 10000 = 0.0046428570 s. The amount of energy consumed by one run of your algorithm would then be 65.74 watts * 0.0046428570 s = 0.305 watt-seconds (or joules).
To convert to kilowatt-hours: 0.305 watt-seconds / (3,600,000 watt-seconds per kWh) ≈ 8.5e-8 kWh. A utility company charges around $0.11 per kWh, so running this algorithm once costs about a millionth of a cent, i.e., essentially nothing.
Keep in mind this is the CPU only...none of the rest of the computer.
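For reference, a tiny C program that performs the same arithmetic with the datasheet figures above (purely illustrative):

#include <stdio.h>

int main( void )
{
    double vcc = 1.3725, icc = 47.9;          /* volts, amps from the P4 datasheet */
    double power = vcc * icc;                 /* ~65.74 W */
    double t_run = 46.428570 / 10000.0;       /* seconds per algorithm run */
    double energy_j = power * t_run;          /* joules (watt-seconds) */
    double kwh = energy_j / 3.6e6;            /* 1 kWh = 3.6e6 J */
    printf( "%.3f J per run = %.2e kWh, costing $%.2e at $0.11/kWh\n",
            energy_j, kwh, kwh * 0.11 );
    return 0;
}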
Run your algorithm in a long loop with a Kill-a-Watt attached to the machine?
Excellent question; I upvoted it. I haven't got a clue, but here's a methodology:
-- get CPU spec sheet from Intel (or AMD or whoever) or see Wikipedia; that should tell you power consumption at max FLOP rate;
-- translate algorithm into FLOPs;
-- do some simple arithmetic;
-- post your data and calculations to SO and invite comments and further data
Of course, you'll have to frame your next post as another question. I'll watch with interest.
Unless you run the code on a simple single-tasking OS such as DOS, or an RTOS where you get precise control of what runs at any time, the OS will typically be running many other processes simultaneously. It may be difficult to distinguish between your process and any others.
First, you need to be running the simplest OS that supports your code (probably a server version of unix of some sort; I expect this to be impractical on Windows). That's to keep the OS from skewing your measurements.
Then you need to instrument the box with a sensitive datalogger between the power supply and motherboard. This is going to need some careful hardware engineering so as not to mess up the PCs voltage regulation, but someone must have done it.
I have actually done this with an embedded MIPS box and a logging multimeter, but that had a single 12V power supply. Actually, come to think of it, if you used a power supply built for running a PC in a vehicle, you would have a 12V supply and all you'd need then is a lab PSU with enough amps to run the thing.
It's hard to say.
I would suggest you use a current clamp, so you can measure all the power consumed by your CPU. Then measure the idle consumption of your system to establish a baseline with as low a standard deviation as possible.
Then run the critical code in a loop.
The previous suggestions about running your code under DOS or an RTOS are also valid, but it may not compile the same way as your production build...
Sorry, I find this question senseless.
Why? Because an algorithm itself has (with the exceptions* below) no correlation with the power consumption; what matters is the priority at which the program/thread/process runs. If you change the priority, you change the amount of idle time the processor has and therefore the power consumption. I think the only difference in energy consumption between instructions is the number of cycles needed, so fast code will be power friendly.
Measuring the power consumption of an "algorithm" is senseless unless you really mean its performance.
*Exceptions: Threads which can be idle while waiting for other threads, programs which use the HLT instruction.
Sure, running the processor as fast as possible increases the energy consumption superlinearly (more heat, more cooling needed), but that is a hardware problem. If you want to save energy, you can downclock the processor or use energy-efficient ones (Atom processors), but changing/tweaking the code won't change anything.
So I think it makes much more sense to ask the processor manufacturer for specifications of the different processor modes and their energy consumption. You also need to keep in mind that the peripherals (fan, power supply, graphics card (!)) and the other software running on the system will influence the results when measuring the computer's power.
Why do you need this anyway?
