I've been working on using a C library from R by writing custom C functions that wrap the library's functionality, and then accessing these C functions from R through the .C interface.
In some of the C code, I allocate space for some custom structures and want to store pointers to them in R so I can use these structures in successive calls to .C. While toying around with the .C function, I noticed I can simply cast the pointer to a C structure to int and store it in R as an integer. Passing this integer to later calls via .C works fine: I can keep track of my structures and use them without problems.
My somewhat naive question: what is wrong with storing these pointers in integers in R? It works fine, so I'm assuming there has to be some downside, but I couldn't find any info on it.
R's integers are 32 bits wide even on a 64-bit platform. Therefore, when working on a 64-bit system this won't work: the pointers are 64 bits wide and won't fit in an integer.
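A quick check on the C side makes the mismatch concrete (a minimal sketch; the sizes in the comments are the typical 64-bit LP64/LLP64 values):

#include <stdio.h>

int main(void) {
    /* R's integer is a C int; on common 64-bit data models: */
    printf("sizeof(int)    = %zu\n", sizeof(int));    /* 4 - what fits in an R integer */
    printf("sizeof(void *) = %zu\n", sizeof(void *)); /* 8 - what the cast truncates   */
    return 0;
}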
R has functionality for this. See the 'Writing R Extensions' manual, the section on 'External pointers and weak references'.
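A minimal sketch of that approach, using the .Call interface (an external pointer is an R object, so .C won't do); the struct and function names here are illustrative, not from your code:

#include <stdlib.h>
#include <R.h>
#include <Rinternals.h>

struct my_state { int counter; };   /* stand-in for your custom structure */

static void my_state_finalizer(SEXP ptr) {
    struct my_state *s = R_ExternalPtrAddr(ptr);
    if (s) free(s);                 /* freed when R garbage-collects the pointer */
    R_ClearExternalPtr(ptr);
}

SEXP my_state_new(void) {
    struct my_state *s = malloc(sizeof *s);
    s->counter = 0;
    SEXP ptr = PROTECT(R_MakeExternalPtr(s, R_NilValue, R_NilValue));
    R_RegisterCFinalizerEx(ptr, my_state_finalizer, TRUE);
    UNPROTECT(1);
    return ptr;                     /* store this in an R variable */
}

SEXP my_state_bump(SEXP ptr) {
    struct my_state *s = R_ExternalPtrAddr(ptr);
    s->counter++;
    return R_NilValue;
}

On the R side you would then do ptr <- .Call("my_state_new") and hand ptr back to later .Call invocations; the finalizer also frees your structure when the R object is garbage-collected, which the cast-to-integer trick can never do.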
If you are willing to switch to C++ (which doesn't mean you have to rewrite all of your code), you can use the Rcpp package, which makes this easier. See, for example, External pointers with Rcpp.
I was trying to write a library for linear algebra operations in Haskell. In order to be able to define safe operations for matrices and vectors I wanted to encode their dimensions in their types. After some research I found that using DataKinds one is able to do that, similar to the way it's done here. For example:
data Vector (n :: Nat) a
dot :: Num a => Vector n a -> Vector n a -> a
In the aforementioned article, as well as in some libraries, the size of the vector is a phantom type and the vector type itself is a wrapper around an Array. While trying to figure out whether there is an array type with its size at the type level in the standard library, I started wondering about the underlying representation of arrays. From what I could gather from this commentary on GHC memory layout, arrays need to store their size on the heap, so a 3-dimensional vector would take up one more word than necessary. Of course we could use the following definition:
data Vector3 a = Vector3 a a a
which might be fine if we only care about 3D geometry, but it doesn't allow for vectors of arbitrary size, and it also makes indexing awkward.
So, my question is this. Wouldn't it be useful and a potential memory optimization to have an array type with statically known size in the standard library? As far as I understand, the only thing it would need is a different info table, which would store the size, instead of it being stored at each heap object. Also, the compiler could choose between Array and SmallArray automatically.
Wouldn't it be useful and a potential memory optimization to have an array type with statically known size in the standard library?
Sure. I suspect if you wrote up your use case carefully and implemented this, GHC HQ would accept a patch. You might want to do the writeup first and double-check that they're into it to avoid wasting time on a patch they won't accept, though; I certainly don't speak for them.
Also, the compiler could choose between Array and SmallArray automatically.
I'm not an expert here, but I kinda doubt this. Usually supporting polymorphism means you need a uniform representation.
I'm working on some Zig bindings, but since the language doesn't have complete C ABI support, I'm trying to hack it to at least work somewhat. From the issue above I know that normal struct and union parameters <= 16 bytes are broken down into pieces without any modification, with the exception of structs or unions that are made up only of floats and are <= 16 bytes. My question is: what does C do with those structs?
Edit (additional information):
When passing a struct as a parameter in C, C handles different kinds of structs differently. For structs that are 16 bytes or smaller there are two kinds: "normal" structs and structs that are made up only of floats. I know how C passes normal structs, by breaking them down into their pieces and passing those individually, but I don't know how it passes floats-only structs.
Platform information:
64-bit macOS, compiler is apple clang version 11.0.3
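For concreteness, here is the kind of thing I'm looking at. The guesses in the comments are based on my reading of the x86-64 System V ABI document (which 64-bit macOS on Intel follows); I've been checking them by compiling with clang -O2 -S and reading the assembly:

typedef struct { float x, y; } Vec2;       /* 8 bytes, floats only */
typedef struct { int tag; float v; } Mix;  /* 8 bytes, mixed       */

float vec2_sum(Vec2 a) { return a.x + a.y; } /* a seems to arrive packed in xmm0 */
int   mix_tag(Mix m)   { return m.tag; }     /* m seems to arrive in rdi         */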
I'm going through O'Reilly's Practical C Programming book and, having read the K&R book on the C programming language, I am really having trouble grasping the concept behind unions.
They take the size of the largest data type that makes them up...and the most recently assigned one overwrites the rest...but why not just use / free memory as needed?
The book mentions that it's used in communication, where you need to set flags of the same size; and on a googled website, that it can eliminate odd-sized memory chunks...but is it of any use in a modern, non-embedded memory space?
Is there something crafty you can do with it and CPU registers? Is it simply a hold over from an earlier era of programming? Or does it, like the infamous goto, still have some powerful use (possibly in tight memory spaces) that makes it worth keeping around?
Well, you almost answered your own question: memory.
Back in the day, memory was scarce, and even saving a few kilobytes was useful.
But even today there are scenarios where unions are useful. For example, if you'd like to implement some kind of variant datatype, the best way to do it is with a union.
This doesn't sound like much, but let's assume you want a variable that stores either a 4-character string (like an ID) or a 4-byte number (which could be some hash, or indeed just a number).
If you use a classic struct, this would be 8 bytes long (at least; if you're unlucky, there are padding bytes as well). Using a union, it's only 4 bytes. So you're saving 50% memory, which isn't a lot for one instance, but imagine having a million of these.
While you can achieve similar things by casting or subclassing, a union is still the easiest way to do this.
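A minimal sketch of that 4-byte example (the names are mine):

#include <stdint.h>

union IdOrHash {
    char     id[4];   /* a four-character ID, not NUL-terminated */
    uint32_t hash;    /* a four-byte number */
};                    /* sizeof(union IdOrHash) == 4, not 8 */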
One use of unions is having two variables occupy the same space, with a second variable in the struct deciding which data type to read it as.
E.g. you could have a boolean isDouble and a union doubleOrLong which holds both a double and a long. If isDouble == true, interpret the union as a double; otherwise interpret it as a long.
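Sketched out, with illustrative names:

#include <stdbool.h>

struct Number {
    bool isDouble;            /* the tag: which member is currently live */
    union {
        double d;
        long   l;
    } doubleOrLong;
};

double as_double(struct Number n) {
    /* read whichever member the tag says is live */
    return n.isDouble ? n.doubleOrLong.d : (double)n.doubleOrLong.l;
}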
Another use of unions is accessing data types in different representations. For instance, if you know how a double is laid out in memory, you could put a double in a union, access it as a different data type like a long, directly access its bits, its mantissa, its sign, its exponent, whatever, and do some direct manipulation with it.
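For example, here is a sketch that pulls a double apart into its IEEE 754 fields. This kind of union read is allowed in C (the bytes are simply reinterpreted), though it is formally undefined behaviour in C++; it also assumes a 64-bit IEEE 754 double:

#include <stdint.h>
#include <stdio.h>

union pun { double d; uint64_t u; };

int main(void) {
    union pun p = { .d = -1.5 };
    printf("sign=%llu exponent=%llu mantissa=%#llx\n",
           (unsigned long long)(p.u >> 63),                     /* sign bit        */
           (unsigned long long)((p.u >> 52) & 0x7FF),           /* biased exponent */
           (unsigned long long)(p.u & 0x000FFFFFFFFFFFFFULL));  /* mantissa        */
    return 0;
}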
You don't really need this nowadays since memory is so cheap, but in embedded systems it has its uses.
The Windows API makes use of unions quite a lot. LARGE_INTEGER is an example of such a usage. Basically, if the compiler supports 64-bit integers, use the QuadPart member; otherwise, set the low DWORD and the high DWORD manually.
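Its shape is roughly the following (simplified from <winnt.h>; the real header also exposes the two 32-bit parts through an anonymous struct):

typedef union _LARGE_INTEGER {
    struct {
        DWORD LowPart;   /* low 32 bits           */
        LONG  HighPart;  /* high 32 bits (signed) */
    } u;
    LONGLONG QuadPart;   /* the full 64-bit value */
} LARGE_INTEGER;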
It's not really a holdover, as the C language was created in 1972, when memory was a real concern.
You could make the argument that in a modern, non-embedded space, you might not want to use C as a programming language to begin with. If you've chosen C as your language of choice for the implementation, you're looking to harness the benefits of C: it's efficient and close to the metal, which results in tight, fast binaries.
As such, when choosing to use C, you'd still want to take advantage of its benefits, which include memory-space efficiency. The union works very well for this, allowing you to have some degree of type safety while enforcing the smallest memory footprint available.
One place where I have seen it used is in the Doom 3/idTech 4 Fast Inverse Square Root implementation.
For those unfamiliar with this algorithm, it essentially requires treating a floating point number as an integer. The old Quake (and earlier) version of the code does this by the following:
float y = 2.0f;
// treat the bits of y as an integer
long i = * ( long * ) &y;
// do some stuff with i
// treat the bits of i as a float
y = * ( float * ) &i;
original source on GitHub
This code takes the address of a floating-point number y, casts it to a pointer to a long (i.e., a 32-bit integer in Quake days), and dereferences it into i. Then it does some incredibly bizarre bit-twiddling stuff, and the reverse.
There are two disadvantages of doing it this way. One is that the convoluted address-of, cast, dereference process forces the value of y to be read from memory rather than from a register1, and ditto on the way back. On Quake-era computers, however, floating-point and integer registers were completely separate, so you pretty much had to go through memory anyway to deal with this restriction.
The second is that, at least in C++, doing such casting is deeply frowned upon, even when doing what amounts to voodoo such as this function does. I'm sure there are more compelling arguments, however I'm not sure what they are :)
So, in Doom 3, id included the following bit in their new implementation (which uses a different set of bit twiddling, but a similar idea):
union _flint {
    dword i;
    float f;
};
...
union _flint seed;
seed.i = /* look up some tables to get this */;
double r = seed.f; // <- access the bits of seed.i as a floating-point number
original source on GitHub
Theoretically, on an SSE2 machine, this can be accessed through a single register; I'm not sure in practice whether any compiler would do this. It's still somewhat cleaner code in my opinion than the casting games in the earlier Quake version.
1 - ignoring "sufficiently advanced compiler" arguments
Is there a simple way to compile 32-bit C code into a 64-bit application, with minimal modification? The code was not setup to use fixed type sizes.
I am not interested in taking advantage of 64-bit memory addressing. I just need to compile into a 64-bit binary while maintaining 4-byte longs and pointers.
Something like:
#define long int32_t
But of course that breaks a number of long use cases and doesn't deal with pointers. I thought there might be some standard procedure here.
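For example, the preprocessor mangles compound type names:

#define long int32_t
unsigned long a;  /* expands to "unsigned int32_t a;" - not valid C */
long long b;      /* expands to "int32_t int32_t b;" - also broken  */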
There seem to be two orthogonal notions of "portability":
My code compiles everywhere out of the box. Its general behaviour is the same on all platforms, but details of available features vary depending on the platform's characteristics.
My code contains a folder for architecture-dependent stuff. I guarantee that MYINT32 is always 32 bit no matter what. I successfully ported the notion of 32 bits to the nine-fingered furry lummoxes of Mars.
In the first approach, we write unsigned int n; and printf("%u", n), and we know that the code always works, but details like the numeric range of unsigned int are up to the platform and not our concern. (wchar_t comes in here, too.) This is what I would call the genuinely portable style.
In the second approach, we typedef everything and use types like uint32_t. Formatted output with printf triggers tons of warnings, and we must resort to monsters like PRIu32. In this approach we derive a strange sense of power and control from knowing that our integer is always 32 bits wide, but I hesitate to call this "portable" -- it's just stubborn.
The fundamental concept that requires a specific representation is serialization: The document you write on one platform should be readable on all other platforms. Serialization is naturally where we forgo the type system, must worry about endianness and need to decide on a fixed representation (including things like text encoding).
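For instance, a sketch of a fixed-representation writer (the function name is mine): it emits a 32-bit value as four little-endian bytes no matter what the host's byte order is.

#include <stdint.h>
#include <stdio.h>

static void write_u32_le(FILE *out, uint32_t v) {
    unsigned char b[4] = {
        (unsigned char)( v        & 0xFF),
        (unsigned char)((v >> 8)  & 0xFF),
        (unsigned char)((v >> 16) & 0xFF),
        (unsigned char)((v >> 24) & 0xFF),
    };
    fwrite(b, 1, sizeof b, out);  /* byte order fixed by the shifts, not the host */
}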
The upshot is this:
Write your main program core in portable style using standard language primitives.
Write well-defined, clean I/O interfaces for serialization.
If you stick to that, you should never even have to think about whether your platform is 32 or 64 bit, big or little endian, Mac or PC, Windows or Linux. Stick to the standard, and the standard will stick with you.
No, this is not, in general, possible. Consider, for example, malloc(). What is supposed to happen when it returns a pointer value that cannot be represented in 32 bits? How can that pointer value possibly be passed to your code as a 32 bit value, that will work fine when dereferenced?
This is just one example - there are numerous other similar ones.
Well-written C code isn't inherently "32-bit" or "64-bit" anyway - it should work fine when recompiled as a 64 bit binary with no modifications necessary.
Your actual problem is wanting to load a 32 bit library into a 64 bit application. One way to do this is to write a 32 bit helper application that loads your 32 bit library, and a 64 bit shim library that is loaded into the 64 bit application. Your 64 bit shim library communicates with your 32 bit helper using some IPC mechanism, requesting the helper application to perform operations on its behalf, and returning the results.
The specific case - a Matlab MEX file - might be a bit complicated (you'll need two-way function calling, so that the 64 bit shim library can perform calls like mexGetVariable() on behalf of the 32 bit helper), but it should still be doable.
The one area that will probably bite you is if any of your 32-bit integers are manipulated bit-wise. If you assume that some status flags are stored in a 32-bit register (for example), or if you are doing bit shifting, then you'll need to focus on those.
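A sketch of the kind of code that bites, assuming an LP64 target where long grows to 64 bits:

long flags = ~0L;      /* was 0xFFFFFFFF on 32-bit; now 0xFFFFFFFFFFFFFFFF */
flags &= ~(1L << 31);  /* meant "clear the top bit": bits 32..63 stay set  */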
Another place to look would be any networking code that assumes the size (and endian) of integers passed on the wire. Once those get moved into 64-bit ints you'll need to make sure that you don't lose sign bits or precision.
Structures that contain integers will no longer be the same size. Any assumptions about size and alignment need to be cleaned out.
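For example (sizes assume typical ILP32 vs. LP64 data models):

#include <stdio.h>

struct record {
    int   id;     /* 4 bytes on both             */
    long  flags;  /* 4 bytes on ILP32, 8 on LP64 */
    char *name;   /* 4 bytes on ILP32, 8 on LP64 */
};

int main(void) {
    /* 12 bytes on ILP32; 24 on LP64 (4 + 4 padding + 8 + 8) */
    printf("sizeof(struct record) = %zu\n", sizeof(struct record));
    return 0;
}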
I have a project that is half in C and half in Fortran 77. [No, not Fortran 90 or 03, Fortran 77.] The code would be much cleaner if I could pass pointers generated on the C side back to Fortran, which would then pass them back as necessary for handling in other C functions. As it is, the C code is filled with global variables that shouldn't be global, and is otherwise on the verge of becoming an unstructured mess. So are there any reasonably reliable ways to pass an opaque pointer between C and Fortran?
If you are on a 32-bit platform, consider casting the pointers to integers and passing those integers to the Fortran code. When the Fortran passes them back, convert the integer back into a pointer, cross your fingers, and use it.
From what I remember (from 25+ years ago), Fortran 77 tends to pass everything to C by pointer anyway - and character strings get passed with a length, and arrays get passed with their dimensions.
If you're on a 64-bit platform, you'll have to work out whether the Fortran 77 compiler provides any 8-byte integers (INTEGER*8?) - my suspicion is that it won't (largely confirmed by looking at the GNU documentation; if you were using Fortran 2003, you'd be in better shape, it seems). If it does, the same trick works. If it does not, you are into much dodgier territory.
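The C side of that trick might look like this (a sketch; the lowercase names with trailing underscores follow the common f77 linkage convention, and the Fortran side would declare the handle as INTEGER*8):

#include <stdint.h>
#include <stdlib.h>

void statenew_(int64_t *handle) {      /* Fortran: CALL STATENEW(HANDLE) */
    *handle = (int64_t)(intptr_t)malloc(1024);
}

void statefree_(int64_t *handle) {     /* Fortran: CALL STATEFREE(HANDLE) */
    free((void *)(intptr_t)*handle);
    *handle = 0;
}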
You could try - against recommendations - using a union of a double and a pointer. On the C side, you'd set the pointer member of the union from your C pointer, then copy the double member out into a Fortran REAL*8; as long as no one touches that value except to copy it or pass it back, maybe you will be OK if the gods smile favourably upon your endeavours. Most likely, though, the whole thing will explode - this sort of union has an incredible ability to detect when the customer will be most annoyed if something doesn't work, and will proceed to explode at exactly the right moment - partway through the demo, or fifteen minutes after the program goes live.
An alternative to consider (still with gritted teeth) is a union of a 64-bit pointer and an array of two 32-bit integers, then requiring the Fortran code to pass an array of two integers when you need to return a (64-bit) pointer. Clearly, an array of one integer would work for 32-bit code; maybe just require the calling code to pass an array of two integers in all cases, zeroing the unused integer value in the 32-bit pointer case? That gives you forward migratability.
You can do this with the (non-standard) Cray pointer extension:
http://gcc.gnu.org/onlinedocs/gfortran/Cray-pointers.html