Related
On cppreference.com, in the section Implicit conversions, in the subsection "Lvalue conversion", it is noted that
[i]f the lvalue designates an object of automatic storage duration whose address was never taken and if that object was uninitialized (not declared with an initializer and no assignment to it has been performed prior to use), the behavior is undefined. [emphasis mine]
From that, I understand that the "act of taking an address" of an object at some point in time may influence whether undefined behavior occurs later, when that object "is used". If I'm right, that seems at least unusual.
Am I right? If so, how is that possible? If not, what am I missing?
cppreference.com is deriving this from a rule in the C standard. C 2018 6.3.2.1 2 says:
If the lvalue designates an object of automatic storage duration that could have been declared with the register storage class (never had its address taken), and that object is uninitialized (not declared with an initializer and no assignment to it has been performed prior to use), the behavior is undefined.
So, the reason that taking an address matters is fundamentally because “the C standard says so” rather than because taking the address “does something” in the model of computing.
The reason this rule was added to the C standard was to support some behaviors Hewlett-Packard (HP) desired for its Itanium processor. In that processor, each of certain registers has an associated bit that indicates the register is “uninitialized.” So HP was able to make programs detect and trap in some instances where an object had not been initialized. (This detection does not extend to memory; the bit was only associated with a processor register.)
By saying the behavior is undefined if an uninitialized object is used, the C standard allows a trap to occur but also allows that a trap might not occur. So it allowed HP’s behavior of trapping, it allowed it when HP’s software did not detect the issue and so did not trap, and it allowed other vendors to ignore this and provide whatever value happened to be in a register, as well as other behaviors that might arise from optimization by the compiler.
As for predicating the undefined behavior on automatic storage duration and not taking the address, I suspect this was a bit of a kludge. It provides a criterion that worked for the parties involved: HP was able to design their compiler to use the "uninitialized" feature of their registers, but the rule does not carve out a great deal of object use as undefined behavior. For example, somebody might want to write an algorithm that processes large parts of many arrays en masse, ignoring that a few values "along the edges" of defined regions might be uninitialized. The idea there is that, in some situations, it is more efficient to do a block of operations and then, at the end, carve away the results you do not care about. For these situations, programmers want the code to work with "indeterminate values": the code will execute, will reach the end of the operations, and will have valid values in the results it cares about, and there will not have been any traps or other undefined behavior arising from the values they did not care about. So limiting the undefined behavior of uninitialized objects to automatic objects whose address is not taken may have provided a boundary that worked for all parties concerned.
If an object had never had its address taken, it could potentially be optimized away. In such cases, attempting to read an uninitialized variable need not yield the same value on multiple reads.
Taking the address of an object guarantees that storage is set aside for it, which can subsequently be read. The value read will then at least be consistent (though not necessarily predictable), assuming it is not a trap representation.
I'm working on a malloc implementation as a school exercise, using mmap.
I would like to compute the size of my block of memory, in my free list, by using the address of the metadata.
But I am not sure this solution would be well defined by the C standard; I didn't find a reference on whether the mmap-allocated region is considered an "object" in the meaning of that part of the C standard:
§6.5.8.5 (quote taken from that answer to a somewhat related question) :
When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. If two pointers to object or incomplete types both point to the same object, or both point one past the last element of the same array object, they compare equal. If the objects pointed to are members of the same aggregate object, pointers to structure members declared later compare greater than pointers to members declared earlier in the structure, and pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values. All pointers to members of the same union object compare equal. If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P. In all other cases, the behavior is undefined.
In other words, can I consider the mmap region as an array of bytes (or char) within the standard?
Yes, you can reasonably do so, treating it as an object that initially has no effective type. Otherwise the mmap system call would be completely useless, and one can expect that a C compiler targeting a POSIX system will not render mmap useless.
The C Standard only describes the semantics of pointers formed in certain ways. Implementations are free to assign whatever semantics they see fit to pointers formed in other ways. According to the authors of the Standard, the Spirit of C includes the fundamental principle "Don't prevent the programmer from doing what needs to be done", and they presumably intended that quality implementations intended to be suitable for various tasks should avoid imposing needless obstacles on programmers attempting to accomplish those tasks. That would suggest that if a quality implementation defines ways of creating pointers to regions of storage that are not associated with objects of static, automatic, or allocated duration, it should process such pointers usefully even though the Standard does not require it to do so.
Unfortunately, compiler writers are not always clear about the range of purposes for which their compilers are designed to be suitable when configured in various ways. There are many situations where compilers will describe the behavior of a category of actions in more detail than required by the Standard, but the Standard will characterize an overlapping category of actions as invoking UB. Some compiler writers think that UB merely means that the Standard imposes no requirements, but behavioral descriptions that go beyond those required by the Standard should be unaffected. Others view the fact that an action invokes UB as overriding all other behavioral descriptions.
Actions involving addresses that are allocated in ways that the implementations don't understand are only going to be defined to the extent described by the implementations. On some implementations, the fact that the Standard would characterize e.g. comparisons involving unrelated pointers as UB should be viewed as irrelevant, since the Standard doesn't say anything about how such pointers will behave. On others, the fact that the standard characterizes some actions as UB would dominate, however. Unfortunately, it's hard to know which scenario would apply in any particular situation.
Comparing pointers with a relational operator (e.g. <, <=, >= or >) is only defined by the C standard when the pointers both point within the same aggregate object (struct, array or union). In practice this means that a comparison of the shape
if (start_object <= my_pointer && my_pointer < end_object+1) {
can be turned into
if (1) {
by an optimising compiler. Despite this, in K&R, section 8.7 "Example—A Storage Allocator", the authors make comparisons similar to the one above. They excuse this by saying
There is still one assumption, however, that pointers to different blocks returned by sbrk can be meaningfully compared. This is not guaranteed by the standard, which permits pointer comparisons only within an array. Thus this version of malloc is portable only among machines for which general pointer comparison is meaningful.
Furthermore, it appears the implementation of malloc used in glibc does the same thing!
What's worse (and the reason I stumbled across this to begin with): for a school assignment I'm supposed to implement a rudimentary malloc-like function, and the instructions for the assignment require us to use the K&R code, but we have to replace the sbrk call with a call to mmap!
While comparing pointers from different sbrk calls is probably undefined, it is also only slightly dubious, since you have some sort of mental intuition that the returned pointers should come from sort of the same region of memory. Pointers returned by different mmap calls have, as far as I understand, no guarantee to even be remotely similar to each other, and consolidating/merging memory blocks across mmap calls should be highly illegal (and it appears glibc avoids this, resorting to only merging memory returned by sbrk or internally inside mmap pages, not across them), yet the assignment requires this.
Question: could someone shine some light on
Whether or not comparing pointers from different calls to sbrk may be optimised away, and
If so, what glibc does that lets them get away with it.
The language-lawyer answer is (I believe) to be found in §6.5.8.5 of the C99 standard (or more precisely in ISO/IEC 9899:TC3 Committee Draft — September 7, 2007, WG14/N1256, which is nearly identical, but I don't have the original to hand), which says the following with regard to relational operators (i.e. <, <=, >, >=):
When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. If two pointers to object or incomplete types both point to the same object, or both point one past the last element of the same array object, they compare equal. If the objects pointed to are members of the same aggregate object, pointers to structure members declared later compare greater than pointers to members declared earlier in the structure, and pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values. All pointers to members of the same union object compare equal. If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P. In all other cases, the behavior is undefined.
(the C11 text is identical or near identical)
This at first seems unhelpful, or at least suggests that the implementations each exploit undefined behaviour. I think, however, you can either rationalise the behaviour or use a workaround.
The C pointers specified are either going to be NULL, or derived from taking the address of an object with &, or by pointer arithmetic, or from the result of some function. In the case concerned, they are derived from the result of the sbrk or mmap system calls. What do these system calls really return? At a register level, they return an integer with the size of uintptr_t (or intptr_t). It is in fact the system call interface which is casting them to a pointer. As casts between pointers and uintptr_t (or intptr_t) are, for these types, by definition bijective, we know we could cast the pointers to uintptr_t (for instance) and compare them, which will impose a well-ordering on the pointers. The Wikipedia link gives more information, but this will in essence ensure that every comparison is well defined, as well as giving other useful properties such as a<b and b<c implying a<c. (I also can't choose an entirely arbitrary order, as it would need to satisfy the other requirements of C99 §6.5.8.5, which pretty much leaves me with intptr_t and uintptr_t as candidates.)
We can exploit this and write the (arguably better):
if ((uintptr_t)start_object <= (uintptr_t)my_pointer && (uintptr_t)my_pointer < (uintptr_t)(end_object+1)) {
There is a nit here. You'll note I cast to uintptr_t and not intptr_t. Why was that the right choice? In fact, why did I not choose a rather bizarre ordering, such as reversing the bits and comparing? The assumption here is that I'm choosing the same ordering as the kernel, specifically that my definition of < (given by the ordering) is such that the start and end of any allocated memory block will always satisfy start < end. On all modern platforms I know, there is no 'wrap around' (e.g. the kernel will not allocate 32-bit memory starting at 0xffff8000 and ending at 0x00007fff), though note that similar wrap-around has been exploited in the past.
The C standard specifies that pointer comparisons give undefined results under many circumstances. However, here you are building your own pointers out of integers returned by system calls. You can therefore either compare the integers, or compare the pointers by casting them back to integers (exploiting the bijective nature of the cast). If you merely compare the pointers, you rely on the C compiler's implementation of pointer comparison being sane, which it almost certainly is, but which is not guaranteed.
Are the possibilities I mention so obscure that they can be discounted? Nope, let's find a platform example where they might be important: the 8086. It's possible to imagine an 8086 compilation model where every pointer is a 'far' pointer (i.e. contains a segment register). Pointer comparison could do a < or > on the segment register and, only if they are equal, do a < or > on the offset. This would be entirely legitimate so long as all the structures in C99 §6.5.8.5 are in the same segment. But it won't work as one might expect between segments, as 1000:1234 (which is equal to 1010:1134 as a memory address) will appear smaller than 1010:0123. mmap here might well return results in different segments. Similarly, one could imagine another memory model where the segment register is actually a selector, and a pointer comparison uses a processor comparison instruction that aborts if an invalid selector or an offset outside a segment is used.
You ask two specific questions:
Whether or not comparing pointers from different calls to sbrk may be optimised away, and
If so, what glibc does that lets them get away with it.
In the formulation given above, where start_object etc. are actually void *, the comparison may be optimized away (i.e. might not do what you want), as the behaviour is undefined. A cast would guarantee that it does what you want, provided the kernel uses the same well-ordering as implied by the cast.
In answer to the second question, glibc is relying on a behaviour of the C compiler which is technically not required, but is very likely (per the above).
Note also (at least in the K&R in front of me) that the line you quote doesn't exist in the code. The caveat is in relation to the comparison of header * pointers with < (as far as I can see, comparison of void * pointers with < is always UB), which may derive from separate sbrk() calls.
The answer is simple enough. The C library implementation is written with some knowledge of (or perhaps expectations for) how the C compiler will handle certain code that has undefined behaviour according to the official specification.
There are many examples I could give; but that pointers actually refer to an address in the process' address space and can be compared freely is relied on by the C library implementation (at least by Glibc) and also by many "real world" programs. While it is not guaranteed by the standard for strictly conforming programs, it is true for the vast majority of real-world architectures/compilers. Also, note footnote 67, regarding conversion of pointers to integers and back:
The mapping functions for converting a pointer to an integer or an
integer to a pointer are intended to be consistent with the addressing
structure of the execution environment.
While this doesn't strictly give license to compare arbitrary pointers, it helps to understand how the rules are supposed to work: as a set of specific behaviours that are certain to be consistent across all platforms, rather than as a limit to what is permissible when the representation of a pointer is fully known and understood.
You've remarked that:
if (start_object <= my_pointer && my_pointer < end_object+1) {
Can be turned into:
if (1) {
With the assumption (which you didn't state) that my_pointer is derived in some way from the value of start_object or the address of the object which it delimits - then this is strictly true, but it's not an optimisation that compilers make in practice except in the case of static/automatic storage duration objects (i.e. objects which the compiler knows weren't dynamically allocated).
Consider the fact that calls to sbrk are defined to increment or decrement the number of bytes allocated in some region (the heap) for some process by the given incr parameter, relative to the current brk address. sbrk is really just a wrapper around brk, which allows you to adjust the current top of the heap. When you call brk(addr), you're telling the kernel to allocate space for your process all the way up to addr (or, if addr is below the current top, to free the space between the previous, higher top of the heap and the new address). sbrk(incr) would be exactly equivalent if incr == new_top - original_top. Thus to answer your question:
Because sbrk just adjusts the size of the heap (i.e. some contiguous region of memory) by incr bytes, comparing values returned by sbrk is just a comparison of points in some contiguous region of memory. That is exactly equivalent to comparing points in an array, and so it is a well defined operation according to the C standard. Therefore, pointer comparisons around sbrk cannot legitimately be optimized away.
glibc doesn't do anything special to "get away with it"; it just assumes that the assumption mentioned above holds true (which it does). In fact, when it checks the state of a chunk of memory that was allocated with mmap, it explicitly verifies that the mmap'd memory is outside the range allocated with sbrk.
Edit: Something I want to make clearer about my answer: The key here is that there is no undefined behavior! sbrk is defined to allocate bytes in some contiguous region of memory, which is itself an 'object' as specified by the C-standard. Therefore, comparison of pointers within that 'object' is a completely sane and well defined operation. The assumption here is not that glibc is taking advantage of undefined pointer comparison, it's that it's assuming that sbrk grows / shrinks memory in some contiguous region.
The authors of the C Standard recognized that there are some segmented-memory hardware platforms where an attempt to perform a relational comparison between objects in different segments might behave strangely. Rather than say that such platforms could not efficiently accommodate efficient C implementations, the authors of the Standard allow such implementations to do anything they see fit if an attempt is made to compare pointers to objects that might be in different segments.
For the authors of the Standard to have said that comparisons between disjoint objects should only exhibit strange behavior on such segmented-memory systems that can't efficiently yield consistent behavior would have been seen as implying that such systems were inferior to platforms where relational comparisons between arbitrary pointers will yield a consistent ranking, and the authors of the Standard went out of their way to avoid such implications. Instead, they figured that since there was no reason for implementations targeting commonplace platforms to do anything weird with such comparisons, such implementations would handle them sensibly whether the Standard mandated them or not.
Unfortunately, some people who are more interested in making a compiler that conforms to the Standard than in making one that's useful have decided that any code which isn't written to accommodate the limitations of hardware that has been obsolete for decades should be considered "broken". They claim that their "optimizations" allow programs to be more efficient than would otherwise be possible, but in many cases the "efficiency" gains are only significant in cases where a compiler omits code which is actually necessary. If a programmer works around the compiler's limitations, the resulting code may end up being less efficient than if the compiler hadn't bothered with the "optimization" in the first place.
So, the standard (referring to N1570) says the following about comparing pointers:
C99 6.5.8/5 Relational operators
When two pointers are compared, the result depends on the relative
locations in the address space of the objects pointed to.
... [snip obvious definitions of comparison within aggregates] ...
In all other cases,
the behavior is undefined.
What is the rationale for this instance of UB, as opposed to specifying (for instance) conversion to intptr_t and comparison of that?
Is there some machine architecture where a sensible total ordering on pointers is hard to construct? Is there some class of optimization or analysis that unrestricted pointer comparisons would impede?
A deleted answer to this question mentions that this piece of UB allows for skipping comparison of segment registers and only comparing offsets. Is that particularly valuable to preserve?
(That same deleted answer, as well as one here, note that in C++, std::less and the like are required to implement a total order on pointers, whether the normal comparison operator does or not.)
Various comments in the ub mailing list discussion Justification for < not being a total order on pointers? strongly allude to segmented architectures being the reason, including the following comments. 1:
Separately, I believe that the Core Language should simply recognize the fact that all machines these days have a flat memory model.
and 2:
Then we maybe need an new type that guarantees a total order when
converted from a pointer (e.g. in segmented architectures, conversion
would require taking the address of the segment register and adding the
offset stored in the pointer).
and 3:
Pointers, while historically not totally ordered, are practically so
for all systems in existence today, with the exception of the ivory tower
minds of the committee, so the point is moot.
and 4:
But, even if segmented architectures, unlikely though it is, do come
back, the ordering problem still has to be addressed, as std::less
is required to totally order pointers. I just want operator< to be an
alternate spelling for that property.
Why should everyone else pretend to suffer (and I do mean pretend,
because outside of a small contingent of the committee, people already
assume that pointers are totally ordered with respect to operator<) to
meet the theoretical needs of some currently non-existent
architecture?
Counter to the trend of comments from the ub mailing list, FUZxxl points out that supporting DOS is a reason not to support totally ordered pointers.
Update
This is also supported by the Annotated C++ Reference Manual (ARM), which says this was due to the burden of supporting it on segmented architectures:
The expression may not evaluate to false on segmented architectures
[...] This explains why addition, subtraction and comparison of
pointers are defined only for pointers into an array and one element
beyond the end. [...] Users of machines with a nonsegmented address
space developed idioms, however, that referred to the elements beyond
the end of the array [...] was not portable to segmented architectures
unless special effort was taken [...] Allowing [...] would be costly
and serve few useful purposes.
The 8086 is a processor with 16 bit registers and a 20 bit address space. To cope with the lack of bits in its registers, a set of segment registers exists. On memory access, the dereferenced address is computed like this:
address = 16 * segment + offset
Notice that, among other things, an address generally has multiple representations. Comparing two arbitrary addresses is tedious, as the compiler has to first normalize both addresses and then compare the normalized addresses.
Many compilers specify (in the memory models where this is possible) that when doing pointer arithmetic, the segment part is to be left untouched. This has several consequences:
objects can have a size of at most 64 kB
all addresses in an object have the same segment part
comparing addresses in an object can be done just by comparing the offset part; that can be done in a single instruction
This fast comparison of course only works when the pointers are derived from the same base-address, which is one of the reasons why the C standard defines pointer comparisons only for when both pointers point into the same object.
If you want a well-ordered comparison for all pointers, consider converting the pointers to uintptr_t values first.
I believe it's undefined so that C can be run on architectures where, in effect, "smart pointers" are implemented in hardware, with various checks to ensure that pointers never accidentally point outside of the memory regions they're defined to refer to. I've never personally used such a machine, but the way to think about them is that computing an invalid pointer is precisely as forbidden as dividing by 0; you're likely to get a run-time exception that terminates your program. Furthermore, what's forbidden is computing the pointer, you don't even have to dereference it to get the exception.
Yes, I believe the definition also ended up permitting more-efficient comparisons of offset registers in old 8086 code, but that was not the only reason.
Yes, a compiler for one of these protected pointer architectures could theoretically implement the "forbidden" comparisons by converting to unsigned or the equivalent, but (a) it would likely be significantly less efficient to do so and (b) that would be a wantonly deliberate circumvention of the architecture's intended protection, protection which at least some of the architecture's C programmers would presumably want to have enabled (not disabled).
Historically, saying that an action invoked Undefined Behavior meant that any program which made use of such actions could be expected to work correctly only on those implementations which defined, for that action, behavior meeting its requirements. Specifying that an action invoked Undefined Behavior didn't mean that programs using such an action should be considered "illegitimate"; rather, it was intended to allow C to be used to run programs that didn't require such actions on platforms which could not efficiently support them.
Generally, the expectation was that a compiler would either output the sequence of instructions which would most efficiently perform the indicated action in the cases required by the standard, and do whatever that sequence of instructions happened to do in other cases, or would output a sequence of instructions whose behavior in such cases was deemed to be in some fashion more "useful" than the natural sequence. In cases where an action might trigger a hardware trap, or where triggering an OS trap might plausibly in some cases be considered preferable to executing the "natural" sequence of instructions, and where a trap might cause behaviors outside the control of the C compiler, the Standard imposes no requirements. Such cases are thus labeled as "Undefined Behavior".
As others have noted, there are some platforms where p1 < p2, for unrelated pointers p1 and p2, could be guaranteed to yield 0 or 1, but where the most efficient means of comparing p1 and p2 that would work in the cases defined by the Standard might not uphold the usual expectation that exactly one of p1 < p2, p1 > p2, and p1 == p2 holds. If a program written for such a platform knows that it will never deliberately compare unrelated pointers (implying that any such comparison would represent a program bug), it may be helpful to have stress-testing or troubleshooting builds generate code which traps on any such comparison. The only way for the Standard to allow such implementations is to make such comparisons Undefined Behavior.
Until recently, the fact that a particular action would invoke behavior that was not defined by the Standard would generally only pose difficulties for people trying to write code on platforms where the action would have undesirable consequences. Further, on platforms where an action could only have undesirable consequences if a compiler went out of its way to make it do so, it was generally accepted practice for programmers to rely upon such an action behaving sensibly.
If one accepts the notions that:
The authors of the Standard expected that comparisons between unrelated pointers would work usefully on those platforms, and only those platforms, where the most natural means of comparing related pointers would also work with unrelated ones, and
There exist platforms where comparing unrelated pointers would be problematic
Then it makes complete sense for the Standard to regard unrelated-pointer comparisons as Undefined Behavior. Had they anticipated that even compilers for platforms which define a disjoint global ranking for all pointers might make unrelated-pointer comparisons negate the laws of time and causality (e.g. given:
int needle_in_haystack(char const *hs_base, int hs_size, char *needle)
{ return needle >= hs_base && needle < hs_base+hs_size; }
a compiler may infer that the program will never receive any input which would cause needle_in_haystack to be given unrelated pointers, and any code which would only be relevant when the program receives such input may be eliminated) I think they would have specified things differently. Compiler writers would probably argue that the proper way to write needle_in_haystack would be:
int needle_in_haystack(char const *hs_base, int hs_size, char *needle)
{
    for (int i = 0; i < hs_size; i++)
        if (hs_base + i == needle) return 1;
    return 0;
}
since their compilers would recognize what the loop is doing and also recognize that it's running on a platform where unrelated pointer comparisons work, and would thus generate the same machine code as older compilers would have generated for the earlier-stated formulation. As to whether it would be better to require that compilers provide a means of specifying that code resembling the former version should either behave sensibly on platforms that will support it or refuse compilation on those that won't, or better to require that programmers intending the former semantics write the latter and hope that optimizers turn it into something useful, I leave that to the reader's judgment.
I have a large block of data where some operations would be fastest if the block were viewed as an array of 64 bit unsigned integers and others would be fastest if viewed as an array of 32 bit unsigned integers. By 'fastest', I mean fastest on average for the machines that will be running the code. My goal is to be near optimal in all the environments running the code, and I think this is possible if I use a void pointer, casting it to one of the two types for dereferencing. This brings me to my questions:
1) If I use a void pointer, will casting it to one of the two types for dereferencing be slower than directly using a pointer of the desired type?
2) Am I correct in my understanding of the standard that doing this will not violate the anti-aliasing rules, and that it will not produce any undefined or unspecified behaviour? The 32 and 64 bit types I am using exist and have no padding (this is a static assertion).
3) Am I correct in understanding the anti-aliasing rules to basically serve two purposes: type safety and compiler guarantees to enable optimization? If so, if all situations where the code I am discussing will be executed are such that no other dereferencing is happening, am I likely to lose out on any significant compiler optimizations?
I have tagged this with 'c11' because I need to prove from the c11 standard that the behaviour is well defined. Any references to the standard would be appreciated.
Finally, I would like to address a likely concern to be brought up in the responses, regarding "premature optimization". First off, this code is being run on a diverse computing cluster, where performance is critical, and I know that even a one-instruction slowdown in dereferencing would be significant. Second, testing this on all the hardware would take time I don't have to finish the project. There are a lot of different types of hardware, and I have a limited amount of time on site to actually work with the hardware. However, I am confident that an answer to this question will enable me to make the right design choice anyway.
EDIT: An answer and comments pointed out that there is an aliasing problem with this approach, which I verified directly in the c11 standard. An array of unions would require two address calculations and dereferencings in the 32 bit case, so I'd prefer a union of arrays. The questions then become:
1) Is there a performance problem in using a union member as an array as opposed to a pointer to the memory? I.e., is there a cost in union member access? Note that declaring two pointers to the arrays violates the anti-aliasing rules, so access would need to be made directly through the union.
2) Are the contents of the array guaranteed invariant when accessed through one array then through the other?
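For concreteness, here is a minimal sketch of the union-of-arrays layout I have in mind (the block size and the names `block`, `sum32` are made up for illustration):

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */
#include <stdint.h>

#define BLOCK_WORDS64 1024  /* hypothetical block size */

/* A union of arrays: the same storage viewed as 64-bit or 32-bit
   words. Accesses go through the union object itself, not through
   separately declared pointers to the members. */
union block {
    uint64_t w64[BLOCK_WORDS64];
    uint32_t w32[2 * BLOCK_WORDS64];
};

static_assert(sizeof(union block) == 8 * BLOCK_WORDS64,
              "unexpected padding");

/* Sum the block as 32-bit words, directly through the union member. */
uint32_t sum32(const union block *b)
{
    uint32_t s = 0;
    for (size_t i = 0; i < 2 * BLOCK_WORDS64; i++)
        s += b->w32[i];
    return s;
}
```

The 32-bit access here is a single indexed load off the union's base address, which is why I hope it costs no more than a plain array access.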
There are different aspects to your question. First of all, interpreting memory with different types has several problems:
aliasing
alignment
padding
Aliasing is a "local" problem. Inside a function, you don't want to have pointers to the same object with different target types. If you modify objects through such pointers, the compiler may pretend not to know that the object has changed and mis-optimize your program. If you don't do that inside a function (e.g. do a cast right at the beginning and stick with that interpretation) you should be fine as far as aliasing is concerned.
Alignment problems are often overlooked nowadays because many processors are quite tolerant of misalignment, but this is not portable and may also have performance impacts. So you'd have to ensure that your array is aligned in a way that is suitable for all types through which you access it. This can be done with _Alignas in C11; older compilers have extensions that also allow for this. C11 adds some restrictions on alignment, e.g. that it is always a power of 2, which should enable you to write portable code with respect to this problem.
Integer type padding is rare these days (the only exception is _Bool), but to be sure you should use types that are known not to have problems with it. In your case these are [u]int32_t and [u]int64_t, which are known to have exactly the number of bits requested and to have two's complement representation for the signed types. If a platform doesn't support them, your program simply won't compile.
I would refrain from using a void pointer. A union of two arrays or an array of union will do better.
Use a proper alignment on the whole type. C11 provides the _Alignas keyword (and an alignas macro in stdalign.h). GCC has attributes for alignment which are non-standard (and work in pre-C11 modes as well). Other compilers may have none at all.
Depending on your architecture, there should be no performance impact. But this cannot be guaranteed (I do not see an issue here, however). You might even align the type to something larger than 64 bits to fill a cache line perfectly. That might speed up prefetch and write-back.
Aliasing refers to the fact that an object is referenced through multiple pointers at the same time, i.e. the same memory address can be reached through two different "sources". The problem is that the compiler may not be aware of this and thus may hold the value of a variable in a CPU register during some calculation without writing it back to memory immediately.
If the same variable is then referenced by the other "source" (i.e. pointer), the compiler may read invalid data from the memory location.
Imo, aliasing is only relevant within a function if two pointers are passed in. So, if you do not intend to pass two pointers to the same object (or part of it), there should be no problem at all. Otherwise, you should get comfortable with (compiler) barriers. Edit: the C standard seems to be a bit stricter on that, as it only requires the lvalues accessing an object to fulfill certain criteria (C11 6.5/7 (n1570) - thanks Matt McNabb).
Oh, and don't use int/long/etc. You should use the stdint.h types if you need properly sized types.
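For example, a sketch of handling a 64-bit value portably with stdint.h/inttypes.h (the function name is made up):

```c
#include <inttypes.h>  /* PRIu64 */
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit count portably; "%lu" or "%llu" would be wrong on
   some platforms, while PRIu64 always matches uint64_t. */
int format_count(char *buf, size_t n, uint64_t count)
{
    return snprintf(buf, n, "%" PRIu64, count);
}
```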