I am studying a thesis. This paper describes a side channel attack. It measures the cache miss rate when there is only the attacker's code, and the cache miss rate when other programs and the attack code run on the same core as interference. I found that the cache miss rate can be obtained through perf. But after thinking about it for a long time, I can't fully understand what this interference program is.
The following are the original words in the paper.
we also show the results when there is only the sender process running on the physical core (denoted by sender only) and the results with the sender sharing the physical core with a benign gcc workload (denoted by sender & gcc). When there is only the sender process, it has the smallest L1 miss rate. When it is sharing the core with a benign program, the benign program, e.g., the gcc, will cause contention in the cache.
What does the interference program mentioned here refer to? Is it a benign C program, or is it the gcc program itself, running while it compiles C code?
If the interference comes from running gcc to compile C code, gcc compiles the code in an instant. How can we make it run for a long time? This may be a very basic question, but I haven't figured it out after thinking about it for a long time.
The URL of the paper is:
https://caslab.csl.yale.edu/publications/xiong2020leaking.pdf
Thank you to everyone who is willing to provide suggestions.
“gcc” refers to the primary command of the GNU Compiler Collection. There are several reasons the run time of gcc might not be too short for the work in the paper:
The short running time of gcc may suffice for the work in the paper.
You may experience gcc as taking a short time for the programs you use it for, but it may take much longer for longer programs in more complicated languages.
gcc can be asked to operate on many source files in a single command.
gcc may be run repeatedly during the work in the paper.
You may be accustomed to running programs on a computer used exclusively by you. Commonly, in environments where side channel attacks are a concern, programs are run on “server” systems concurrently by multiple users. Such systems may be continuously busy. (This is not to exclude such attacks being a concern in a system normally used only by a single person, where, aside from the software desired by the rightful user, some software is executing at the instigation of a malicious person.)
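As a rough illustration of the points above (this is not taken from the paper), here is a minimal sketch of how a benign workload could be kept busy on one core: pin the process to a CPU, then run a command again and again. The helper names pin_to_first_allowed_cpu and run_repeatedly are hypothetical, the affinity calls are Linux-specific, and the command you pass in is up to you.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

/* Pin the calling process to the first CPU it is allowed to run on.
   Returns the chosen CPU number, or -1 on failure. (Hypothetical helper.) */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof one, &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}

/* Run `command` through the shell `rounds` times.
   Returns how many of the runs reported success. (Hypothetical helper.) */
int run_repeatedly(const char *command, int rounds)
{
    int succeeded = 0;
    for (int i = 0; i < rounds; i++) {
        if (system(command) == 0)
            succeeded++;
    }
    return succeeded;
}
```

After pinning, a call such as run_repeatedly with a gcc command line of your choice keeps the compiler busy on that CPU, because child processes inherit the CPU affinity of their parent. To share a physical core with another process you would pick the CPU number deliberately rather than taking the first allowed one; that choice is left out of this sketch.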
I have read the article C Is Not a Low-level Language, which contains this paragraph:
Unfortunately, simple translation providing fast code is not true for
C. In spite of the heroic efforts that processor architects invest in
trying to design chips that can run C code fast, the levels of
performance expected by C programmers are achieved only as a result of
incredibly complex compiler transforms. The Clang compiler, including
the relevant parts of LLVM, is around 2 million lines of code. Even
just counting the analysis and transform passes required to make C run
quickly adds up to almost 200,000 lines (excluding comments and blank
lines).
What does the bold sentence mean? Does it mean that manufacturers design processors with some optimizations and architecture decisions targeted firstly or even specifically at C (C++) code? Or does it just mean that they are trying to design processors that execute any code faster, including code written in C?
If some preference for C exists, what is it?
My couple of thoughts:
a branch prediction algorithm tuned in to patterns happening mainly in C code.
instructions which are useful and used in C but aren't useful in other languages. Otherwise other languages (compilers) will use them too.
I know about language-specific processors like Jazelle or Lisp machines, for Java and Lisp respectively, but similar technologies can't be applied to C, because there is no bytecode.
Processors don't necessarily have optimizations targeted at C, but they do provide features to make C (and other procedural languages in general) map more cleanly to the platform.
Take cache-coherency in a multi-threaded environment as an example. From a C perspective, a global variable shared by two threads should look the same to both threads. If one thread writes to it the other should be able to see those modifications. But in a multi-core CPU with independent caches, that takes extra effort to support. Core 1 has to be able to detect that core 2 is accessing an address it has modified in cache and flush that out to memory (or somehow share it directly to core 2's cache).
That's essentially the thesis of that entire article. C's abstract machine model doesn't necessarily map cleanly to real modern high-performance processors like it did to the (by comparison extremely simple) PDP-11, and CPUs and compilers have to take great pains to paper over those differences.
The "heroic efforts" of the processor architects largely refers to the design of cache and memory subsystems on the CPUs.
For a very long time now, the instruction execution circuits inside CPUs have been far, far quicker than the electronics that look after fetching/writing data from/to memory, largely because the technology we have for RAM chips hasn't really got much faster. Where the cores have sped up, the memory hasn't, and so the cache and memory subsystem has to get ever more elaborate in order to be able to pre-fetch data and move it towards the execution circuits ahead of time. Needless to say, this doesn't always pan out well.
It's also partly because of the physical distance between the CPU and RAM chips. Though only a few inches (if that) of track on a motherboard, that distance is significant; the speed of a signal down the track is about 1ns every 8 inches. For signals clocked in the GHz range (1 cycle ≤ 1ns), a short track is a long way. This is partly why Apple have gone down the route of putting RAM onto the same package as the CPU in the home-grown M1 silicon.
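To put a number on that, the arithmetic can be sketched as below. It takes the 8 inches per nanosecond figure quoted above at face value; the function names are just for illustration.

```c
/* Signal speed along a motherboard track, using the figure quoted above. */
#define INCHES_PER_NS 8.0

/* Length of one clock cycle in nanoseconds for a clock given in GHz. */
double cycle_time_ns(double clock_ghz)
{
    return 1.0 / clock_ghz;
}

/* Distance a signal covers along the track during one clock cycle. */
double inches_per_cycle(double clock_ghz)
{
    return INCHES_PER_NS * cycle_time_ns(clock_ghz);
}
```

At 4 GHz a cycle lasts 0.25ns, so under this assumption a signal covers only about 2 inches of track per cycle, and a round trip to a RAM chip a few inches away already costs several cycles before the RAM has done any work.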
Back to caches - the likes of Intel (and AMD, ARM) have strived to make CPUs that have good, general purpose performance, so that they run pretty much any code well. Modern compilers help a lot - if they know what the cache in the CPU is likely to do in any particular circumstance, the compilers can arrange code to fit in with what the hardware is likely to do.
A reasonable question then is, is that effective? Well, yes and no. Yes, because compiled code does run quite well, but no for a couple of reasons. The first is that ultimate performance for any given algorithm is rarely reached by the compiler / CPU, and the second is that all this complexity makes it nigh on impossible for a good programmer to do their own optimisation.
Some CPUs help out the hero-programmer here. PowerPC (at least some variants) has instructions where the programmer can give the cache system a hint that the programme will shortly need data from such-and-such a location in RAM. The CPU uses that instruction to pre-load the L1 cache with that data, so that when the program actually starts to perform operations on data at that address it's already in cache.
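The same idea is exposed to C programmers by GCC and Clang through the __builtin_prefetch builtin, which compiles to a prefetch hint on targets that have one. Below is a minimal sketch; the function name and the prefetch distance of 16 elements are assumptions chosen for illustration, not tuned values.

```c
#include <stddef.h>

/* How many elements ahead to request; an assumed value, not a tuned one. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting that data a little further on will be read soon. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 = prefetch for reading,
               third argument 3 = keep in cache (high temporal locality). */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

For a simple sequential walk like this one the hardware prefetcher will usually do the job on its own; explicit hints tend to matter more for access patterns the hardware cannot guess, such as following pointers.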
The IBM Cell processor took this to a whole new level. The SPE math cores (there were 8 of them) had no cache, and no way of addressing data in CPU RAM at all. What there was instead was 256K of static RAM per core into which all code and data had to fit, and a way for code to push code and data in and out of that static RAM very quickly (256Gbyte/sec at the time, which was very very quick). The developer was completely on their own; they had to write code to load code and data into a core, get that executed, and then write more code to get the results out to wherever. This was actually pretty liberating; instead of having a cache and memory subsystem trying to automatically deliver data to execution cores, get in the way or (worse) just hide inefficiencies from you, one had the freedom to break down an algorithm into core-sized lumps knowing that if it fitted, it'd be very quick, or knowing for sure it didn't fit.
Miles Budnek's answer addresses the issues that arise from multi-core CPUs with cache-coherency and a Symmetric Multi Processing (SMP) environment. It's even harder for the cache designer to get it right if there are multiple cores that might very well start tampering with a value. The difficulties involved have led to vulnerabilities like Meltdown and Spectre.
SMP could be said to be an "optimisation" put into CPUs by designers to aid the C (or other) developer in transitioning code from single to multiple thread. It's an attractive thought - in the way that a single thread programme can see all of its data merely by addressing it, why not extend the same visibility of data to all threads in the programme?
Turns out that this is what makes it very difficult to design modern CPUs. However the reasons why the industry went this way are plain enough - the smallest possible delta between single and multicore CPUs was going to be the least troublesome for the existing software community to adopt. That's perfectly reasonable.
But it is running out of steam, fast. A better approach (if the goal is the outright pursuit of performance) would be to go back to the old Transputer architectures from Inmos from the 1980s and early 1990s. In such architectures, data held by one core could only be processed by another if the software was written to explicitly transfer the data. Sound familiar? Yes - the Cell processor was a bit like that.
Interestingly, languages such as Rust, Go, Erlang have all implemented Communicating Sequential Processes as a parallel processing paradigm. The irony is that, these days, CSP has to be implemented on top of a SMP environment, which is itself an artificial construct brought about by the interconnect between CPUs, cores and memory (e.g. QPI, Hypertransport). Basically, if the whole software world got fully comfortable with CSP then CPU designers wouldn't have to design cache-coherency into their multi-core CPUs. Rust in particular is very well suited, as it already has a strong concept of data ownership in its syntax (which could be leveraged to shovel data around between cores automatically).
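To show what "explicitly transfer the data" looks like from the software side, here is a minimal sketch of a one-way channel built on a POSIX pipe. It is only an illustration of the message-passing style: the channel type and function names are hypothetical, and on today's hardware the pipe itself still sits on top of shared, cache-coherent memory.

```c
#include <unistd.h>

/* A one-way channel: fd[0] is the receiving end, fd[1] is the sending end. */
typedef struct {
    int fd[2];
} channel;

/* Create the channel; returns 0 on success, -1 on failure. */
int channel_open(channel *ch)
{
    return pipe(ch->fd);
}

/* Copy one int into the channel; returns 0 on success, -1 on failure. */
int channel_send(channel *ch, int value)
{
    ssize_t n = write(ch->fd[1], &value, sizeof value);
    return n == (ssize_t)sizeof value ? 0 : -1;
}

/* Copy one int out of the channel; returns 0 on success, -1 on failure. */
int channel_recv(channel *ch, int *value)
{
    ssize_t n = read(ch->fd[0], value, sizeof *value);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}

/* Close both ends of the channel. */
void channel_close(channel *ch)
{
    close(ch->fd[0]);
    close(ch->fd[1]);
}
```

The receiver never touches the sender's variable; it only ever sees its own copy of whatever was sent. That is the property that would let hardware drop cache-coherency if all software were written this way.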
The article referred to by the OP seems to me to have it in for C for some reason. There were so many points in it I felt triggered by, but I don't want to go addressing each one point by point. Maybe there is some bias or special interest that has not been declared. As a C programmer, with a particular interest in writing high performance programs, I thought I'd give my two cents on some of the issues raised. Hopefully, this might be of interest to others in the industry with or without a programming background.
From my point of view, the strengths of C are mainly as follows....
C allows you to do things you just can't do in 'higher level' languages.
A well-written C program (see weakness no. 1) is hard to beat on performance by a program written in another language on the same hardware.
C is comfortable handling binary data allowing for memory conservation.
C is well established with lots of libraries and programmers.
Objects in memory can be made easy to process from anywhere in the program by using pointers so the data itself doesn't need to be passed around.
Multi-threaded and multi-process programs are quite easy to implement.
It has Read-Write shared memory between threads (and processes with some fancy low-level stuff?)
Assembly can be inlined where needed (though it's not C then I know!).
... and main weaknesses...
Utilising SIMD capabilities is not possible in standard C, and difficult to implement in a portable way with intrinsics.
It takes a lot of code to do simple things for which there are no library functions.
Buffer overflow potential is easily missed, even for experienced programmers.
C pointers can be confusing.
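To illustrate the SIMD weakness listed above, here is a minimal sketch of what the intrinsics route tends to look like. The SSE path only exists on x86 targets, so a plain C loop has to be kept alongside it for everything else; the function name add_floats is just for illustration.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* x86 SSE intrinsics; not part of standard C */
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per step, using unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Portable fallback, which also handles any leftover elements. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Every additional instruction set (AVX, NEON and so on) needs another guarded branch like the SSE one, which is where the portability pain comes from.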
The C programming language has a special place in the evolution of programming languages and I for one, would welcome a replacement that is a better fit to what is possible with modern hardware if it doesn't tie the hands of the programmer and offers better security and performance. From the article,...
'A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.'
Such things exist already: GPUs! Modern CPUs are much more like GPUs than they used to be, now that core counts can be 100+. I have used OpenCL C to write programs with amazing computational performance, but they can't do everything well. Some applications cannot be efficiently parallelised, if at all. OpenCL C program performance can become terrible when there is even a small amount of branching. Also, it is so much easier to exhaust your memory bandwidth and fast cache when running many threads that it might be judged not worth the added complexity over a good single threaded implementation.
In OpenCL C, the programmer has somewhat more control of where data is stored in memory, which can definitely aid performance. Maybe it's a costly mistake to try to make programming languages too hardware independent. Might it be better to review some (LLVM like) intermediate standard, like in OpenCL C, where one can define 'private', 'local' and 'constant' memory objects to get performance improvements over 'global' memory objects? Such a standard wouldn't need to be tied to an instruction set. As a programmer, I welcome fast CPU instructions but it would be nice if they could be much more easily utilised in portable code AND compilable to portable binaries. Maybe this is something compiler writers could look into along with using SIMD vector registers rather than memory for pushing and popping. As I see it, there are four levels of portability.
Hardware independent source code to run on any hardware conforming to the intermediate standard. The burden is on the compiler to create binaries that will run correctly and efficiently on any hardware conforming to the intermediate standard.
Hardware independent source code to run on any hardware conforming to the intermediate standard. The burden is on the host compiler to create binaries that will run on the host's hardware configuration conforming to the intermediate standard, but may not run on other hardware conforming to the same.
Hardware dependent source code where the logical execution path through the source depends on the architecture of the hardware on which it is run. Programs need to 'query' the hardware configuration.
Hardware specific source code.
In a fantasy world where one can just imagine new standards, hardware, and programming languages, one could choose which level of portability to aim for. I think that C was supposed to be hardware independent, but it isn't really if you want to get the best performance out of your hardware. OpenCL C tries also, but doesn't quite make it, though with run-time kernel compilation it does a pretty good job. The host program has the same issues though as any other. I don't think there are any 'Level 1' portable languages currently.
Sorry my response is a bit rambling. It's unfortunate that it's difficult to have an objective constructive discussion about the pros and cons of different ideas about future changes in software and hardware. Personally, I think FPGAs have huge potential but are still a long way from where they would need to be to go mainstream. Any new computing language will probably become out of date when hardware changes occur and software trends change. It's remarkable that C still occupies such a prominent space. In another 10 or 20 years time, C will probably still be going strong. How many other modern languages will still be commonplace then?
I'm working on a project that will have builds for Windows and Linux, 32 and 64 bits.
This project is based on loading strings from a text file, processing them, and writing the results to a SQLite3 database.
On Linux it reaches almost 400k sequences per second, compiled by GCC without any optimization. However, on Windows it is stuck at 100k sequences per second, compiled with VS2010 without any optimization.
I tried using optimizations in compilers but nothing changed.
Is this right? C code on Windows runs slower?
EDIT:
I think I need to be more clear on some points.
I made tests with code optimization enabled AND disabled. Performance didn't change, probably because my program's bottleneck is the time spent reading data from the HD.
This program takes advantage of parallel computing. There is a queue where one thread queues processed data and another dequeues it to write to the SQLite database. So I don't think there is any performance loss from this.
Is this right? C code on Windows runs slower?
No. C doesn't have speed. It's the implementations of C that introduce speed. There are implementations that produce fast behaviour (generally "compilers that produce fast machine code") and implementations that produce slow behaviour for both Windows and Linux.
It isn't just Windows and Linux that are significant here, either. Some compilers optimise for specific processors, and will produce slow machine code for any other processors.
I tried using optimizations in compilers but nothing changed.
Testing speed without optimisations enabled makes no sense. However, this does tend to indicate that something else is slow. Perhaps the implementation that produced the library files for SQLite3 client in Windows is an implementation that produces slow code. I'd start by rebuilding the lot (including the SQLite3 library) with full optimisations enabled. Following that, you could try using a profiler to determine where the difference is and use the results to perform intelligent optimisations to your code.
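Before reaching for a full profiler, a rough first step is to time the reading, processing and database-writing phases separately on both systems and compare the rates. Here is a minimal sketch of the timing helpers; the function names are hypothetical, and timespec_get is a C11 library function that older compilers may not provide, in which case the platform's own clock call would have to be substituted.

```c
#include <time.h>

/* Current wall-clock time in seconds, as a double. */
double wall_seconds(void)
{
    struct timespec now;
    timespec_get(&now, TIME_UTC);
    return (double)now.tv_sec + (double)now.tv_nsec / 1e9;
}

/* Items handled per second; returns 0 when no time has been measured. */
double rate_per_second(unsigned long items, double seconds)
{
    return seconds > 0.0 ? (double)items / seconds : 0.0;
}
```

Take a wall_seconds reading before and after each phase and feed the difference, together with the number of sequences handled, to rate_per_second. If only one phase is slower on Windows, that narrows the search a lot before any optimisation is attempted.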
I'm using MinGW, which is gcc for Windows. My program involves multiple windows, two different main threads, and several worker threads in a thread pool for overlapped network I/O.
It works perfectly fine without compiler optimization.
A) Is compiler optimization even necessary? My program's already very fast. Is it at all likely that it will provide a significant improvement?
B) Are there any articles on how to properly build a multithreaded program so compiler optimization can do its job?
“Imploded aggressively” is a bit weird (is your program a controller for a fission bomb?), but I understand that your program behaved as desired without compiler optimizations and mysteriously with compiler optimizations.
The technical term for this is that your program is buggy.
Multithreaded programming is intrinsically hard. Multithreaded programming when the threads share memory is very hard; it's the masochist way of concurrent programming (message passing is a lot easier to get right). You don't just need to read an article or two, you need to read several books and get a few years' programming experience.
You were unlucky that your program seemed to work without optimizations. It probably wouldn't work on a different machine where the timings are a bit different, or with a different compiler, or on a different operating system, either. So you ended up wasting your time thinking your program worked. But it doesn't. A compiler transforms correct source code into correct executables, no matter what optimization level you choose.¹
¹ Barring compiler bugs, sure. But the odds are very strongly stacked against you.
99.9% of all household failures in one optimization mode and not another are due to serious bugs. Multithreading races etc. are very sensitive to code performance. An instruction reorder or loop shortcut can turn a test pass into a debugging nightmare.
I'm assuming that the server runs up OK and detonates under load in apparently different places, so making conventional debugging useless?
You are going to have to rely on logging and changing the test conditions to narrow down the point of ignition. My guess is this is going to be a Heisenbug that mutates with changes to the code, optimization, options, load profile, buffer sizes etc.
Not fixing the problem is not a good plan since it will just show up in another form on next year's boxes with more cores etc. Even with optimization off, it's still there, lurking, waiting for the opportunity to strike.
I hope I'm providing some comfort.
Seriously - log everything you can with a good logger - one that queues up the logs so as to keep disk latency out of the main app. Change things around to try and make the bug mutate and perhaps show up in the non-optimized build too. Write down, (type in), absolutely everything that you do and what happens after any change, good or bad. Making the bug worse is actually better than making its symptoms go away, (without knowing exactly why). Try the server on various hardware configs, if you can.
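As a sketch of the "queue up the logs" idea: messages are copied into a fixed-size in-memory ring, and only a separate flush step touches the disk. The names and sizes below are assumptions for illustration. This version is single-threaded on purpose; in a real multithreaded server the queue would need a lock (or one queue per thread) and a dedicated thread doing the flushing.

```c
#include <stdio.h>

#define LOG_SLOTS   256   /* assumed capacity */
#define LOG_MSG_LEN 128   /* assumed maximum message length, including '\0' */

typedef struct {
    char   msg[LOG_SLOTS][LOG_MSG_LEN];
    size_t head;    /* index of the oldest queued message */
    size_t count;   /* number of queued messages */
} log_queue;

/* Queue a message in memory (truncating if too long).
   Returns 0 on success, -1 if the queue is full. */
int log_push(log_queue *q, const char *text)
{
    if (q->count == LOG_SLOTS)
        return -1;
    size_t slot = (q->head + q->count) % LOG_SLOTS;
    snprintf(q->msg[slot], LOG_MSG_LEN, "%s", text);
    q->count++;
    return 0;
}

/* Write every queued message to `out`, oldest first.
   Returns the number of messages written. */
size_t log_flush(log_queue *q, FILE *out)
{
    size_t written = 0;
    while (q->count > 0) {
        fprintf(out, "%s\n", q->msg[q->head]);
        q->head = (q->head + 1) % LOG_SLOTS;
        q->count--;
        written++;
    }
    fflush(out);
    return written;
}
```

The point of the split is that log_push costs only a memory copy, so adding log lines changes the timing of the code under test far less than writing to disk at each call would.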
Eventually, you will find the bug!
You have one thing going for you - it seems that you can reliably reproduce the problem. That, in itself, is a massive plus.
Forgot to ask - apart from the nuclear explosive metaphor, what is the main symptom? Is it AV'ing/segfaulting all over the place, or is it locked or livelocked up?
To answer part "A" of your question, the unoptimized version of your code still has the concurrency bugs in it, but the timing of how the threads run is such that the bugs have not yet been exposed with your test workloads. The current version of the unoptimized program will eventually fail in use, so you will need to fix the concurrency bugs before using the program for real work.
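For illustration only, since the actual bug in the question is unknown: one very common concurrency bug is two threads updating a shared variable with an ordinary read-modify-write. The sketch below shows the corrected form, using a C11 atomic counter and POSIX threads (an assumption; the names are hypothetical). With a plain long and counter++ in place of the atomic, the final total could come out short, and how often that happens depends on timing and optimization level.

```c
#include <pthread.h>
#include <stdatomic.h>

#define INCREMENTS_PER_THREAD 100000

/* Shared between both worker threads. */
static atomic_long shared_counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++)
        atomic_fetch_add(&shared_counter, 1);   /* indivisible increment */
    return NULL;
}

/* Run two workers to completion and return the final count, or -1 on error. */
long count_with_two_threads(void)
{
    pthread_t first, second;

    atomic_store(&shared_counter, 0);
    if (pthread_create(&first, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_create(&second, NULL, worker, NULL) != 0) {
        pthread_join(first, NULL);
        return -1;
    }
    pthread_join(first, NULL);
    pthread_join(second, NULL);
    return atomic_load(&shared_counter);
}
```

With the atomic in place the result is 200000 at every optimization level. Depending on the toolchain, building this may need the -pthread option.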
I just used gprof to analyze my program. I wanted to see what functions were consuming the most CPU time. However, now I would like to analyze my program in a different way. I want to see which LINES of the code consume the most CPU time. At first, I read that gprof could do that, but I couldn't find the right option for it.
Now, I found gcov. However, the third-party program I am trying to execute has no "./configure" so I could not apply the "./configure --enable-gcov".
My question is simple. Does anyone know how to get execution time for each line of code for my program?
(I prefer suggestions with gprof, because I found its output to be very easy to read and understand.)
I think oprofile is what you are looking for. It does statistical based sampling, and gives you an approximate indication of how much time is spent executing each line of code, both at the C level of abstraction, and at the assembler code level.
As well as simply profiling the relative number of cycles spent at each line, you can also instrument for other events like cache misses and pipeline stalls.
Best of all: you don't need to do special builds for profiling, all you need to do is enable debug symbols.
Here is a good introduction to oprofile: http://people.redhat.com/wcohen/Oprofile.pdf
If your program isn't taking too long to execute, Valgrind/Callgrind + KCacheGrind + [compiling with debugging turned on (-g)] is one of the best methods of how to tell where a program is spending time while it is running in user mode.
valgrind --tool=callgrind ./program
kcachegrind callgrind.out.12345
The program should have a stable IPC (instructions per clock) in the parts that you want to optimize.
A drawback is that Valgrind cannot be used to measure I/O latency or to profile kernel space. Also, its usability with programming languages that use a toolchain incompatible with the C/C++ toolchain is limited.
In case Callgrind's instrumentation of the whole program takes too much time to execute, there are macros CALLGRIND_START_INSTRUMENTATION and CALLGRIND_STOP_INSTRUMENTATION.
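Here is a minimal sketch of how those two macros might be placed around a region of interest. The macros come from the valgrind/callgrind.h header; the functions sum_of_squares and profiled_sum_of_squares are hypothetical stand-ins for your own hot spot, and the no-op fallback is only there so the file still builds on a machine without the Valgrind headers.

```c
/* Use the real Callgrind client macros when the Valgrind headers are
   installed; otherwise define them as no-ops so the program still builds. */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#    define HAVE_CALLGRIND_H 1
#  endif
#endif
#ifndef HAVE_CALLGRIND_H
#  define CALLGRIND_START_INSTRUMENTATION ((void)0)
#  define CALLGRIND_STOP_INSTRUMENTATION  ((void)0)
#endif

/* Hypothetical hot spot that we want profiled. */
unsigned long sum_of_squares(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; i++)
        total += i * i;
    return total;
}

/* Instrument only the call we care about. */
unsigned long profiled_sum_of_squares(unsigned long n)
{
    CALLGRIND_START_INSTRUMENTATION;
    unsigned long result = sum_of_squares(n);
    CALLGRIND_STOP_INSTRUMENTATION;
    return result;
}
```

For this to save any time, the program has to be started with instrumentation switched off, for example with valgrind --tool=callgrind --instr-atstart=no ./program, so that only the code between the two macros is instrumented.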
In some cases, Valgrind requires libraries with debug information (such as /usr/lib/debug/lib/libc-2.14.1.so.debug), so you may want to install Linux packages providing the debug info files or to recompile libraries with debugging turned on.
oprofile is probably, as suggested by Anthony Blake, the best answer.
However, a trick to force a compiler, or a compiler flag (such as -pg for gprof profiling), when compiling an autoconf-ed software, could be
CC='gcc -pg' ./configure
or
CFLAGS='-pg' ./configure
This is also useful for some newer modes of compilation. For instance, gcc 4.6 provides link time optimization with the -flto flag passed at compilation and at linking; to enable it, I often do
CC='gcc-4.6 -flto' ./configure
For a program not autoconf-ed but still built with a reasonable Makefile you might edit that Makefile or try
make CC='gcc -pg'
or
make CC='gcc -flto'
It usually (but not always) works.
Is there a way I can create a virtual instance of gcc compiler on the client browser when the client opens my website??
By doing so, I can directly pass the user .c file as argument to my compiler instance and then execute it without having to make a POST call to server and execute the file there???
Originally I understood your question to be targeting the native platform on which the browser is running:
Consider that Browsers may be running
on many different platforms,
operating systems and processor
architectures. Compiling C in the way
you describe might be technically
doable, but practically infeasible.
I was basing "practically infeasible" on the difficulty of supporting the plethora of widely used browser platforms.
Now I understand that you are thinking more on the lines of targeting a virtual environment. I'll amend practically infeasible to "a large amount of work".
If I understand your intent, it is to run a C compiler which emits, shall we say, x86 compiled code and executes it. So to do that we need an emulation of the x86 environment in, say, JavaScript. What's more, I think your intent is that the compiler itself execute in this environment, so that you can re-use gcc. So you'll need to emulate a file-system too. It's "obvious" that this could be done, but it really is a lot of work. Is it really worth it?
Competition code is small (I guess). Even with lots of programmers, the number of simultaneous compiles can't be so huge. With a decent queued request system, a touch of Ajax, and a bit of back-end scaling, how costly is it to support the expected population? What's the ratio of developers to back end systems?
Anyway, if I were to address this problem I'd go for taking the code for an opensource browser and melding in the gcc code. Produce a compiler/browser hybrid. Give that to the developers and tell them "Use this and get zippy compilation speeds, or use your own browser and join the queue."
You're not going to use GCC as it is written for this. AT BEST, you could accomplish something similar if you had a compiler written in Java that targeted the JVM and could be run as an applet. I don't know what it would take to get something like this working, but I suspect it would take a bit of work to get it up and going. As far as I know, nothing currently exists that does this.
Perhaps using a jsLinux in background? There the making process can run in the virtual machine. Communication could be done by extending the clipboard transfer, perhaps into multiple pipes...
I would be interested in javascript based gcc solutions, too.