Is OpenCL on an APU capable of using the whole memory?

Is it possible to build a machine with something like 32GB of RAM and use ~28GB of it with OpenCL?
My current APU is an Athlon 5350, with a reported "global memory size" of 2142658560. I played a little with pyopencl and CL_MEM_USE_HOST_PTR, but I didn't find a way to do that.
Is that possible at all?
Maybe with some new-generation APU, like a Ryzen with Vega graphics?
NOTE: I'm a non-professional and a newbie, and I haven't yet spent even an hour studying OpenCL, because before investing money and time in this I want to know whether it's possible at all... so sorry if this is a stupid question.

Yes, it is possible to have a 32GB computer and to devote ~28GB of its RAM to a program. When you are writing an OpenCL program, all management of memory spaces (on-chip and off-chip) must be done manually. I do not think you can run an OpenCL kernel that appears to directly access RAM, but even if you could, that would not be particularly worth pursuing, because the power of OpenCL is in fine-grained management of RAM, L2, and L1 - not in allowing programmers to treat their program as operating against just RAM.
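For concreteness, here is a minimal host-side sketch in C of the mechanism the question is poking at - querying CL_DEVICE_GLOBAL_MEM_SIZE and wrapping an existing host allocation with CL_MEM_USE_HOST_PTR. It is only a sketch: the device choice and the 1 GiB size are arbitrary assumptions, and error handling is pared down.

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* How much global memory does the implementation expose? */
        cl_ulong mem = 0;
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof mem, &mem, NULL);
        printf("CL_DEVICE_GLOBAL_MEM_SIZE: %llu bytes\n",
               (unsigned long long)mem);

        /* Wrap an existing host allocation so the device can work on it
           in place (zero-copy on many APUs). */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        size_t size = (size_t)1 << 30;  /* 1 GiB, arbitrary */
        void *host = malloc(size);
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    size, host, &err);
        printf("clCreateBuffer: %s\n", err == CL_SUCCESS ? "ok" : "failed");

        if (buf) clReleaseMemObject(buf);
        clReleaseContext(ctx);
        free(host);
        return 0;
    }

Whether a buffer larger than CL_DEVICE_GLOBAL_MEM_SIZE succeeds is implementation-dependent; on most platforms CL_DEVICE_MAX_MEM_ALLOC_SIZE is the hard per-buffer ceiling.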
Take some time, dive deep into memory management, and gain a very firm grasp of your computer's several memory spaces of varying sizes, connection speeds, and connection bandwidths.
You seem to be thinking about buying a huge amount of RAM to solve your problem. Hopefully you can find a better way to architect your solution - one that does not involve buying 128GB of RAM.
That said, some programs are inherently hard to parallelize. For those programs you might just want to buy a ton of RAM (and maybe even skip OpenCL entirely and run the work on the CPU).

Related

Do processors have optimizations and architecture preferences targeted firstly or mainly to C/C++ languages?

I have read the article C Is Not a Low-level Language, which contains this paragraph:
Unfortunately, simple translation providing fast code is not true for C. In spite of the heroic efforts that processor architects invest in trying to design chips that can run C code fast, the levels of performance expected by C programmers are achieved only as a result of incredibly complex compiler transforms. The Clang compiler, including the relevant parts of LLVM, is around 2 million lines of code. Even just counting the analysis and transform passes required to make C run quickly adds up to almost 200,000 lines (excluding comments and blank lines).
What does the bold sentence mean? Does it mean that manufacturers design processors with some optimizations and architecture decisions targeted firstly or even specifically at C (C++) code? Or does it just mean that they are trying to design processors that execute any code faster, including code written in C?
If some preferences for C exist, what are they?
A couple of my thoughts:
a branch prediction algorithm tuned to patterns occurring mainly in C code.
instructions which are useful and used in C but aren't useful in other languages. Otherwise other languages (compilers) would use them too.
I know about language-specific processors like Jazelle and Lisp machines, for Java and Lisp respectively, but a similar technology can't be applied to C, because there is no bytecode.
Processors don't necessarily have optimizations targeted at C, but they do provide features to make C (and other procedural languages in general) map more cleanly to the platform.
Take cache-coherency in a multi-threaded environment as an example. From a C perspective, a global variable shared by two threads should look the same to both threads. If one thread writes to it the other should be able to see those modifications. But in a multi-core CPU with independent caches, that takes extra effort to support. Core 1 has to be able to detect that core 2 is accessing an address it has modified in cache and flush that out to memory (or somehow share it directly to core 2's cache).
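As an illustrative sketch in C (C11 atomics are just one way to express the required synchronization; the point is that the hardware's coherency protocol is what makes the write visible across cores):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* A global shared by two threads. C's model says a properly
       synchronized write by one thread is visible to the other; on a
       multi-core CPU, the cache-coherency protocol (e.g. MESI) is what
       actually makes that true across private L1/L2 caches. */
    atomic_int shared = 0;

    static void *writer(void *arg) {
        (void)arg;
        atomic_store(&shared, 42);  /* may dirty this core's cache line */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
        pthread_join(t, NULL);
        /* This core must observe the other core's write. */
        printf("%d\n", atomic_load(&shared));  /* prints 42 */
        return 0;
    }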
That's essentially the thesis of that entire article. C's abstract machine model doesn't necessarily map cleanly to real modern high-performance processors like it did to the (by comparison extremely simple) PDP-11, and CPUs and compilers have to take great pains to paper over those differences.
The "heroic efforts" of the processor architects is largely referring to the design of cache and memory subsystems on the CPUs.
For a very long time now, the instruction-execution circuits inside CPUs have been far, far quicker than the electronics that looks after fetching/writing data from/to memory, largely because the technology we have for RAM chips hasn't really got better. Where the cores have sped up, the memory hasn't, and so the cache and memory subsystem has to get ever more elaborate in order to pre-fetch data and move it towards the execution circuits ahead of time. Needless to say, this doesn't always pan out well.
It's also partly because of the physical distance between the CPU and RAM chips. Though only a few inches (if that) of track on a motherboard, that distance is significant; the speed of a signal down the track is about 1ns every 8 inches. For signals clocked in the GHz range (1 cycle << 1ns), a short track is a long way. This is partly why Apple have gone down the route of putting RAM onto the same package as the CPU in their home-grown M1 silicon.
Back to caches - the likes of Intel (and AMD, ARM) have strived to make CPUs that have good, general purpose performance, so that they run pretty much any code well. Modern compilers help a lot - if they know what the cache in the CPU is likely to do in any particular circumstance, the compilers can arrange code to fit in with what the hardware is likely to do.
A reasonable question then is: is that effective? Well, yes and no. Yes, because compiled code does run quite well; but no, for a couple of reasons. The first is that the ultimate performance for any given algorithm is rarely reached by the compiler / CPU, and the second is that all this complexity makes it nigh on impossible for a good programmer to do their own optimisation.
Some CPUs help out the hero-programmer here. PowerPC (at least some variants) has instructions with which the programmer can give the cache system a hint that the programme will shortly need data from such-and-such a location in RAM. The CPU uses that instruction to pre-load the L1 cache with that data, so that when the program actually starts to perform operations on data at that address, it's already in cache.
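In portable C you would normally reach such hints through a compiler builtin rather than raw assembly; GCC and Clang provide __builtin_prefetch, which on PowerPC can lower to the dcbt (data cache block touch) instruction. A rough sketch - the look-ahead distance of 64 elements is an arbitrary guess that would need tuning:

    #include <stddef.h>

    /* Sum an array while asking the cache system to start fetching
       data some distance ahead of where we are currently working. */
    double sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 64 < n)
                __builtin_prefetch(&a[i + 64], 0 /* read */, 1 /* low locality */);
            s += a[i];
        }
        return s;
    }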
The IBM Cell processor took this to a whole new level. The SPE math cores (there were 8 of them) had no cache, and no way of addressing data in CPU RAM at all. What there was instead was 256K of static RAM per core into which all code and data had to fit, and a way for code to push code and data in and out of that static RAM very quickly (256Gbyte/sec at the time, which was very, very quick). The developer was completely on their own: they had to write code to load code and data into a core, get that executed, and then write more code to get the results out to wherever they were needed. This was actually pretty liberating; instead of having a cache and memory subsystem trying to automatically deliver data to execution cores, getting in the way, or (worse) just hiding inefficiencies from you, one had the freedom to break down an algorithm into core-sized lumps, knowing that if it fitted it'd be very quick, or knowing for sure that it didn't fit.
Miles Budnek's answer addresses the issues that arise from multi-core CPUs with cache-coherency in a Symmetric Multi-Processing (SMP) environment. It's even harder for the cache designer to get it right if there are multiple cores that might very well start tampering with a value. The difficulties involved have led to vulnerabilities like Meltdown and Spectre.
SMP could be said to be an "optimisation" put into CPUs by designers to aid the C (or other) developer in transitioning code from single to multiple threads. It's an attractive thought: in the way that a single-threaded programme can see all of its data merely by addressing it, why not extend the same visibility of data to all threads in the programme?
Turns out that this is what makes it very difficult to design modern CPUs. However, the reasons why the industry went this way are plain enough - the smallest possible delta between single- and multi-core CPUs was going to be the least troublesome for the existing software community to adopt. That's perfectly reasonable.
But it is running out of steam, fast. A better approach (if the goal is the outright pursuit of performance) would be to go back to the old Transputer architectures from Inmos, from the 1980s and early 1990s. In such architectures, data held by one core could only be processed by another if the software was written to explicitly transfer the data. Sounds familiar? Yes - the Cell processor was a bit like that.
Interestingly, languages such as Rust, Go and Erlang have all implemented Communicating Sequential Processes (CSP) as a parallel-processing paradigm. The irony is that, these days, CSP has to be implemented on top of an SMP environment, which is itself an artificial construct brought about by the interconnects between CPUs, cores and memory (e.g. QPI, HyperTransport). Basically, if the whole software world got fully comfortable with CSP, then CPU designers wouldn't have to design cache-coherency into their multi-core CPUs. Rust in particular is very well suited, as it already has a strong concept of data ownership in its syntax (which could be leveraged to shovel data around between cores automatically).
The article referred to by the OP seems to me to have it in for C for some reason. There were so many points in it that I felt triggered by, but I don't want to address each one point by point. Maybe there is some bias or special interest that has not been declared. As a C programmer with a particular interest in writing high-performance programs, I thought I'd give my two cents on some of the issues raised. Hopefully this will be of interest to others in the industry, with or without a programming background.
From my point of view, the strengths of C are mainly as follows....
C allows you to do things you just can't do in 'higher level' languages.
A well-written C program (see weakness no. 1) is hard to beat on performance, on the same hardware, by a program written in another language.
C is comfortable handling binary data, allowing for memory conservation.
C is well established with lots of libraries and programmers.
Objects in memory can be made easy to process from anywhere in the program by using pointers, so the data itself doesn't need to be passed around.
Multi-threaded and multi-process programs are quite easy to implement.
It has read-write shared memory between threads (and between processes, with some fancy low-level stuff; see the sketch after these lists).
Assembly can be inlined where needed (though it's not C then I know!).
... and main weaknesses...
Utilising SIMD capabilities is not possible in standard C, and difficult to implement in a portable way with intrinsics.
It takes a lot of code to do simple things for which there are no library functions.
Buffer overflow potential is easily missed, even for experienced programmers.
C pointers can be confusing.
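On the shared-memory point in the strengths list above: between threads, a global variable plus synchronisation suffices; between processes, POSIX systems offer shm_open plus mmap (Windows has CreateFileMapping as an analogue). A minimal sketch of the POSIX route - the name "/demo_shm" and the 4 KiB size are arbitrary, and error handling is pared down:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Create (or open) a named shared-memory object and size it. */
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0) return 1;
        ftruncate(fd, 4096);

        /* Map it; any other process mapping the same name sees the
           same bytes. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        strcpy(p, "hello from one process");
        printf("%s\n", p);

        munmap(p, 4096);
        close(fd);
        shm_unlink("/demo_shm");  /* remove the name when done */
        return 0;
    }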
The C programming language has a special place in the evolution of programming languages, and I, for one, would welcome a replacement that is a better fit for what is possible with modern hardware, provided it doesn't tie the hands of the programmer and offers better security and performance. From the article:
'A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.'
Such things exist already: GPUs! Modern CPUs are much more like GPUs than they used to be, now that core counts can be 100+. I have used OpenCL C to write programs with amazing computational performance, but they can't do everything well. Some applications cannot be efficiently parallelised, if at all. OpenCL C program performance can become terrible when there is even a small amount of branching. Also, it is so much easier to exhaust your memory bandwidth and fast cache when running many threads that it might be judged not worth the added complexity over a good single-threaded implementation.
In OpenCL C, the programmer has somewhat more control over where data is stored in memory, which can definitely aid performance. Maybe it's a costly mistake to try to make programming languages too hardware-independent. Might it be better to review some (LLVM-like) intermediate standard, as in OpenCL C, where one can define 'private', 'local' and 'constant' memory objects to get performance improvements over 'global' memory objects (see the kernel sketch below)? Such a standard wouldn't need to be tied to an instruction set. As a programmer, I welcome fast CPU instructions, but it would be nice if they could be much more easily utilised in portable code AND compilable to portable binaries. Maybe this is something compiler writers could look into, along with using SIMD vector registers rather than memory for pushing and popping. As I see it, there are four levels of portability.
Hardware independent source code to run on any hardware conforming to the intermediate standard. The burden is on the compiler to create binaries that will run correctly and efficiently on any hardware conforming to the intermediate standard.
Hardware independent source code to run on any hardware conforming to the intermediate standard. The burden is on the host compiler to create binaries that will run on the host's hardware configuration conforming to the intermediate standard, but may not run on other hardware conforming to the same.
Hardware dependent source code where the logical execution path through the source depends on the architecture of the hardware on which it is run. Programs need to 'query' the hardware configuration.
Hardware specific source code.
In a fantasy world where one can just imagine new standards, hardware, and programming languages, one could choose which level of portability to aim for. I think C was supposed to be hardware-independent, but it isn't really, if you want to get the best performance out of your hardware. OpenCL C tries too, but doesn't quite make it, though with run-time kernel compilation it does a pretty good job. The host program has the same issues as any other, though. I don't think there are any 'Level 1' portable languages currently.
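To make the memory-object point above concrete, here is a sketch of an OpenCL C reduction kernel that stages data in fast on-chip 'local' memory instead of working directly out of 'global' memory. It assumes the global size is a multiple of the work-group size and that the work-group size is a power of two:

    // Each work-group sums its chunk of the input into one partial sum.
    __kernel void partial_sum(__global const float *in,
                              __global float *out,
                              __local float *scratch)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        size_t lsize = get_local_size(0);

        scratch[lid] = in[gid];            // stage in local memory
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction entirely in local memory.
        for (size_t s = lsize / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            out[get_group_id(0)] = scratch[0];  // one write to global
    }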
Sorry my response is a bit rambling. It's unfortunate that it's difficult to have an objective, constructive discussion about the pros and cons of different ideas about future changes in software and hardware. Personally, I think FPGAs have huge potential, but they are still a long way from where they would need to be to go mainstream. Any new computing language will probably become out of date when hardware changes occur and software trends change. It's remarkable that C still occupies such a prominent place. In another 10 or 20 years' time, C will probably still be going strong. How many other modern languages will still be commonplace then?

Simple demo program which shows the performance of RAM

How can I create a simple demo program, in Python or C, which strongly depends on the performance of RAM? For example, a program which can show the performance difference between high and low RAM frequencies and/or bandwidth. This would help me understand in what ways better RAM is useful, without going into the details of more complicated tasks.
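One hedged starting point in C: allocate a buffer much larger than the last-level cache and time a sequential pass against a strided one, so the run time is dominated by RAM rather than cache. The 1 GiB size and the 16-int stride (one int per 64-byte cache line on typical x86) are assumptions to tune per machine:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 28)  /* 256M ints = 1 GiB, far bigger than any cache */

    static double secs(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void) {
        int *a = malloc((size_t)N * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < N; i++) a[i] = (int)i;

        struct timespec t0, t1;
        long long sum = 0;

        /* Sequential pass: limited mainly by RAM bandwidth. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i++) sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("sequential: %.3f s (sum=%lld)\n", secs(t0, t1), sum);

        /* Strided pass: touches one int per cache line, wasting most of
           each memory transfer and exposing latency effects. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i += 16) sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("strided:    %.3f s (sum=%lld)\n", secs(t0, t1), sum);

        free(a);
        return (int)(sum & 1);  /* keep the sums from being optimized away */
    }

Running it with the RAM clocked at different speeds (e.g. via BIOS/XMP settings) should shift both numbers measurably.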

Can bad C code cause a Blue Screen of Death?

I am a new coder in C, having recently moved over from Python, but I still like to challenge myself with fairly ambitious projects (like a chess program), and I have found that my computer suffers an unusual number of BSODs, both when I am running a program and when I am not (admittedly, attempting to use the entirety of my memory as a hash table may not have been the greatest idea).
So my question is: are these BSODs most likely caused by my crappy C code, or is it more likely that my 3-year-old, overworked laptop is the culprit?
If it could be the code, what are the big things I should avoid doing so as to prevent this?
A BSOD usually contains some information as to what caused it.
What information it contains, and how exactly it is displayed depends on the version of Windows you are running.
As can be seen from the list here:
https://hetmanrecovery.com/recovery_news/bsod-errors
Most BSOD errors come from device / driver / kernel code, and not from your typical userland program.
That said, it might be possible to trigger a BSOD if your code uses a particularly low-level Windows API, especially if you run it with administrator privileges.
Note that simply filling up memory will result in allocations failing, and possibly your program crashing, but not the whole OS.
Also, Windows places limits on how much an individual process can allocate.
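To see that for yourself, here is a small, deliberately abusive C sketch. It will make the machine sluggish for a while, but the failure mode is a NULL return from malloc (or, on Linux, the OOM killer ending the process), not a blue screen:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t mib = 0;
        for (;;) {
            char *p = malloc(1u << 20);  /* 1 MiB at a time */
            if (!p) {
                printf("malloc failed after ~%zu MiB\n", mib);
                return 0;
            }
            p[0] = 1;  /* touch it so at least one page is committed */
            mib++;
        }
    }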
One final note:
"3 year old laptop" does not provide enough information to tell anything about your hardware, since there are different tiers of laptops available, and some of the high end 3 year old ones will still be better performing then a mid tier one bought yesterday.
As a troubleshooting measure, I would recommend backing up your data, making a clean install of your OS (aka "format the machine"), then making sure all your drivers are up to date.
You may also want to try hardware diagnostic tools such as memtest86, check the SMART data on your storage, etc.
It's not supposed to be possible for anything you do in an ordinary "user space" program to crash the whole computer. Something else must also be wrong. Here are some possibilities:
If you are making the computer do CPU- and RAM-intensive work for long periods, you may stress the hardware to the point where a marginally defective component fails. Usually it's either the RAM, the power supply, or the cooling fans at fault.
Make sure your power supply is rated for all of the kit you have, running simultaneously. Make sure you have enough airflow for the amount of heat you're generating; check for dust-clogged heatsinks and fans that aren't actually spinning. If you have more than one RAM stick, take one out at a time and see if that makes the problem disappear.
I'd like to tell you to get error-correcting RAM if you don't have it already, but for infuriating market differentiation reasons you'd have to replace the motherboard and CPU as well. It's still worth doing, in the long run, but it amounts to replacing the whole computer.
You may be tickling a bug in the OS or the drivers. The most probable culprit is the GPU driver, particularly if your program does anything graphical. Regrettably, all you can do about this is make sure you're fully patched up.

Is there any way to reduce the amount of RAM in a system using software?

I am a beginner at coding, working on a web application. I am currently working on a desktop with 16GB of DDR4 memory. However, potential clients in the future would be using laptops, most likely with only 8GB of RAM, potentially at a slower clock speed. While I could turn off my computer and take out a stick of memory for testing, I was wondering if there was a software-based solution in Windows that would allow me to temporarily shut off a stick of memory, or limit the amount that is allowed to be used, so that I could run some tests. If anyone knows how to do this, I would greatly appreciate any help. If it helps to know, my development environment is VS Code and I am using the MERN stack (React on the front end).
I think it's best if you use some virtual-machine software to create a machine with your desired spec. Using an approach like the one discussed here: https://superuser.com/questions/1263090/is-it-possible-to-limit-the-memory-usage-of-a-particular-process-on-windows requires more configuration and fiddling than a VM does. You can use https://www.virtualbox.org/ - it's free.
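If you do want the per-process route from that link, one real mechanism on Windows is a job object with a process-memory limit. A minimal C sketch - the 1 GiB cap is an arbitrary stand-in for your target laptop:

    #include <windows.h>

    int main(void) {
        /* Cap the current process (and children assigned to the job)
           at 1 GiB of committed memory. */
        HANDLE job = CreateJobObjectA(NULL, NULL);
        JOBOBJECT_EXTENDED_LIMIT_INFORMATION info = {0};
        info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
        info.ProcessMemoryLimit = (SIZE_T)1 << 30;
        SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                &info, sizeof info);
        AssignProcessToJobObject(job, GetCurrentProcess());

        /* ... run or spawn the code under test here ... */
        return 0;
    }

Note this limits what one process may commit; it does not slow the RAM down, so it only approximates a low-memory machine.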
Your OS by default takes the RAM it's given when booting up; it's nearly impossible to switch off RAM mid-use, at least in Windows. Memory management is not something external software is allowed to alter - only the operating system can, at least to my knowledge. You can use a virtual machine to test your software as an ideal solution, but you can't lower the RAM usage of the OS you're running now.

Footprint of Lua on a PPC Micro

We're developing some code on Freescale PPC micros (5517 and 5668 at the moment), and I was wondering if we could put Lua on them.
The devices need to be easily programmed/reconfigured in the field. The current product uses a proprietary interpreted logic language that can be loaded in, and our software (written in C) runs an interpreter. I would like to move to a better language (the current implementation is a bit buggy and slow), so I'm considering Lua, but the memory footprint must be very low. For the 5517 (which we may not use), the maximum RAM is 80K. Things are better on the 5668, with 592K of RAM.
So does anyone know if I can put Lua on bare metal? We're effectively not running an OS. If so, are there any estimates of what kind of memory footprint we might see, and how much effort it would take?
Failing this, does anyone know of any kind of interpreter that might do better in a memory-constrained environment without an OS? Or are we better off just rolling our own?
See the eLua project.
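One reason Lua is a plausible fit: the stock interpreter routes every allocation through a single user-supplied function, so on bare metal you can point it at your own small heap. A sketch of the hook - here it just delegates to realloc/free to show the contract; on the 5668 you would substitute your own allocator:

    #include <stdlib.h>
    #include "lua.h"
    #include "lauxlib.h"
    #include "lualib.h"

    /* Lua calls this for every alloc, realloc and free. */
    static void *l_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
        (void)ud; (void)osize;
        if (nsize == 0) { free(ptr); return NULL; }
        return realloc(ptr, nsize);
    }

    int main(void) {
        lua_State *L = lua_newstate(l_alloc, NULL);
        luaL_openlibs(L);                  /* optional; costs extra RAM */
        luaL_dostring(L, "print(2 + 2)");
        lua_close(L);
        return 0;
    }

The allocator hook is also a convenient place to enforce a hard memory budget, or to track peak usage while you evaluate whether 80K/592K is enough.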
