Fast non-blocking keyboard IO in C under MinGW

I have written a CPU emulator in C on Windows for fun, and I want it to handle its own IO in a non-blocking fashion: if there has been a keypress, return the char value of that keypress; otherwise return 0.
At the moment I am using the following:
#include <conio.h>
...
unsigned int input(){
    unsigned int input_data;
    if (_kbhit()){
        input_data = (unsigned int)_getch();
    }
    else {
        input_data = 0;
    }
    return input_data;
}
Functionally, it is fine. The one problem I have is that it is very detrimental to the emulator's speed - it can drop from 60-100 million instructions per second to tens or hundreds of thousands just by running programs with lots of IO instructions. Is there a faster way to do this, whilst still keeping the same functionality?

Two options come to mind:
The first option is the easiest one: do not check every time. OS calls are expensive, and if your emulator calls this very often, it will slow everything down.
#include <conio.h>
...
unsigned int input(){
    static int cheat = 0;
    cheat = (cheat + 1) % 128;
    if (cheat){
        return 0;
    }
    unsigned int input_data;
    if (_kbhit()){
        input_data = (unsigned int)_getch();
    }
    else {
        input_data = 0;
    }
    return input_data;
}
The second option is to receive the actual keyboard input asynchronously and store it in a buffer; your input() function then only checks that buffer. This removes the OS call from the tight loop altogether.
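For illustration, here is a minimal sketch of that approach, assuming a Win32 build (windows.h available) and a single consumer thread; the names keyboard_thread, input_init and the ring-buffer layout are just one possible shape, not a fixed API. A background thread blocks in _getch() and pushes characters into a small ring buffer, so the emulator's hot loop never makes an OS call. For a production version you would want real atomics (C11 <stdatomic.h> or the Interlocked functions) on the indices rather than plain volatile.
#include <conio.h>
#include <windows.h>

#define KBUF_SIZE 256

static unsigned char kbuf[KBUF_SIZE];
static volatile LONG kbuf_head = 0;   /* written only by the keyboard thread */
static volatile LONG kbuf_tail = 0;   /* written only by the emulator thread */

static DWORD WINAPI keyboard_thread(LPVOID arg)
{
    (void)arg;
    for (;;) {
        int c = _getch();                       /* blocks, but only in this thread */
        LONG next = (kbuf_head + 1) % KBUF_SIZE;
        if (next != kbuf_tail) {                /* drop keys if the buffer is full */
            kbuf[kbuf_head] = (unsigned char)c;
            kbuf_head = next;
        }
    }
    return 0;
}

void input_init(void)
{
    CreateThread(NULL, 0, keyboard_thread, NULL, 0, NULL);
}

unsigned int input(void)
{
    if (kbuf_tail == kbuf_head)
        return 0;                               /* no key pending */
    unsigned int c = kbuf[kbuf_tail];
    kbuf_tail = (kbuf_tail + 1) % KBUF_SIZE;
    return c;
}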

Related

2D array, prototype function and random numbers [duplicate]

I need a 'good' way to initialize the pseudo-random number generator in C++. I've found an article that states:
    In order to generate random-like numbers, srand is usually initialized to some distinctive value, like those related with the execution time. For example, the value returned by the function time (declared in header ctime) is different each second, which is distinctive enough for most randoming needs.
Unixtime isn't distinctive enough for my application. What's a better way to initialize this? Bonus points if it's portable, but the code will primarily be running on Linux hosts.
I was thinking of doing some pid/unixtime math to get an int, or possibly reading data from /dev/urandom.
Thanks!
EDIT
Yes, I am actually starting my application multiple times a second and I've run into collisions.
This is what I've used for small command line programs that can be run frequently (multiple times a second):
unsigned long seed = mix(clock(), time(NULL), getpid());
Where mix is:
// Robert Jenkins' 96 bit Mix Function
unsigned long mix(unsigned long a, unsigned long b, unsigned long c)
{
    a=a-b; a=a-c; a=a^(c >> 13);
    b=b-c; b=b-a; b=b^(a << 8);
    c=c-a; c=c-b; c=c^(b >> 13);
    a=a-b; a=a-c; a=a^(c >> 12);
    b=b-c; b=b-a; b=b^(a << 16);
    c=c-a; c=c-b; c=c^(b >> 5);
    a=a-b; a=a-c; a=a^(c >> 3);
    b=b-c; b=b-a; b=b^(a << 10);
    c=c-a; c=c-b; c=c^(b >> 15);
    return c;
}
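For completeness, a minimal sketch of the seeding call above together with the headers it needs (assuming a POSIX getpid; mix is the function defined above):
#include <stdlib.h>   /* srand, rand */
#include <time.h>     /* clock, time */
#include <unistd.h>   /* getpid */

unsigned long mix(unsigned long a, unsigned long b, unsigned long c);

int main(void)
{
    srand((unsigned int)mix(clock(), time(NULL), getpid()));
    /* ... use rand() as usual ... */
    return 0;
}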
The best answer is to use <random>. If you are using a pre-C++11 version, you can look at the Boost random number stuff.
But if we are talking about rand() and srand(), the simplest way is just to use time():
int main()
{
    srand(time(nullptr));
    ...
}
Be sure to do this at the beginning of your program, and not every time you call rand()!
Side Note:
NOTE: There is a discussion in the comments below about this being insecure (which is true, but ultimately not relevant (read on)). So an alternative is to seed from the random device /dev/random (or some other secure real(er) random number generator). BUT: Don't let this lull you into a false sense of security. This is rand() we are using. Even if you seed it with a brilliantly generated seed it is still predictable (if you have any value you can predict the full sequence of next values). This is only useful for generating "pseudo" random values.
If you want "secure" you should probably be using <random> (though I would do some more reading on a security-informed site). See this answer as a starting point: https://stackoverflow.com/a/29190957/14065
Secondary note: Using the random device actually solves the issues with starting multiple copies per second better than my original suggestion below (just not the security issue).
Back to the original story:
Every time you start up, time() will return a unique value (unless you start the application multiple times a second). On 32-bit systems, it will only repeat every 68 years or so.
I know you don't think time is unique enough but I find that hard to believe. But I have been known to be wrong.
If you are starting a lot of copies of your application simultaneously you could use a timer with a finer resolution. But then you run the risk of a shorter time period before the value repeats.
OK, so if you really are starting multiple instances of the application each second, then use a finer-grained timer:
int main()
{
    struct timeval time;
    gettimeofday(&time, NULL);

    // A second has 1 000 000 microseconds.
    // Assuming you do not need quite that accuracy
    // (and do not assume the system clock has it either):
    srand((time.tv_sec * 1000) + (time.tv_usec / 1000));

    // The trouble here is that the seed will repeat every 24 days or so.
    // If you use 100 (rather than 1000) the seed repeats every 248 days.
    // Do not make the MISTAKE of using just tv_usec:
    // that would mean your seed repeats every second.
}
If you need a better random number generator, don't use the libc rand. Instead just use something like /dev/random or /dev/urandom directly (read an int directly from it, or something like that).
The only real benefit of the libc rand is that given a seed, it is predictable which helps with debugging.
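A minimal sketch of reading a seed (or the random value itself) straight from /dev/urandom, as suggested above; the function name urandom_seed is just illustrative:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

unsigned int urandom_seed(void)
{
    unsigned int seed = 0;
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        if (fread(&seed, sizeof seed, 1, f) != 1)
            seed = (unsigned int)time(NULL);   /* fall back if the read fails */
        fclose(f);
    } else {
        seed = (unsigned int)time(NULL);       /* fall back if the open fails */
    }
    return seed;
}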
On Windows:
srand(GetTickCount());
provides a better seed than time() since it's in milliseconds.
C++11 random_device
If you need reasonable quality then you should not be using rand() in the first place; you should use the <random> library. It provides lots of great functionality like a variety of engines for different quality/size/performance trade-offs, re-entrancy, and pre-defined distributions so you don't end up getting them wrong. It may even provide easy access to non-deterministic random data (e.g., /dev/random), depending on your implementation.
#include <random>
#include <iostream>
int main() {
    std::random_device r;
    std::seed_seq seed{r(), r(), r(), r(), r(), r(), r(), r()};
    std::mt19937 eng(seed);
    std::uniform_int_distribution<> dist{1, 100};

    for (int i = 0; i < 50; ++i)
        std::cout << dist(eng) << '\n';
}
eng is a source of randomness, here a built-in implementation of the Mersenne Twister. We seed it using random_device, which in any decent implementation will be a non-deterministic RNG, and seed_seq to combine more than 32 bits of random data. For example, in libc++ random_device accesses /dev/urandom by default (though you can give it another file to access instead).
Next we create a distribution such that, given a source of randomness, repeated calls to the distribution will produce a uniform distribution of ints from 1 to 100. Then we proceed to using the distribution repeatedly and printing the results.
The best way is to use another pseudorandom number generator.
Mersenne Twister (and Wichmann-Hill) is my recommendation.
http://en.wikipedia.org/wiki/Mersenne_twister
I suggest you look at the unix_random.c file in the Mozilla code (I guess it is under mozilla/security/freebl/...); it should be in the freebl library.
There it uses system call info (like pwd, netstat, ...) to generate noise for the random number; it is written to support most platforms (which can gain me the bonus point :D).
The real question you must ask yourself is what quality of randomness you need.
libc rand is an LCG (linear congruential generator), so the quality of randomness will be low whatever input you provide to srand.
If you simply need to make sure that different instances will have different initializations, you can mix the process id (getpid), the thread id and a timer. Mix the results with xor. The entropy should be sufficient for most applications.
Example:
#include <sys/timeb.h>   // ftime
#include <unistd.h>      // getpid
#include <pthread.h>     // pthread_self

struct timeb tp;
ftime(&tp);

srand(static_cast<unsigned int>(getpid()) ^
      static_cast<unsigned int>(pthread_self()) ^
      static_cast<unsigned int>(tp.millitm));
For better random quality, use /dev/urandom. You can make the above code portable by using boost::thread and boost::date_time.
The C++11 version of the top-voted post by Jonathan Wright:
#include <ctime>
#include <random>
#include <thread>
...
const auto time_seed = static_cast<size_t>(std::time(0));
const auto clock_seed = static_cast<size_t>(std::clock());
const size_t pid_seed =
std::hash<std::thread::id>()(std::this_thread::get_id());
std::seed_seq seed_value { time_seed, clock_seed, pid_seed };
...
// E.g seeding an engine with the above seed.
std::mt19937 gen;
gen.seed(seed_value);
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    printf("%ld\n", (long)tv.tv_usec);
    return 0;
}
tv.tv_usec is in microseconds. This should be an acceptable seed.
As long as your program is only running on Linux (and your program is an ELF executable), you are guaranteed that the kernel provides your process with a unique random seed in the ELF aux vector. The kernel gives you 16 random bytes, different for each process, which you can get with getauxval(AT_RANDOM). To use these for srand, use just an int of them, as such:
#include <sys/auxv.h>
void initrand(void)
{
unsigned int *seed;
seed = (unsigned int *)getauxval(AT_RANDOM);
srand(*seed);
}
It may be possible that this also translates to other ELF-based systems. I'm not sure what aux values are implemented on systems other than Linux.
Suppose you have a function with a signature like:
int foo(char *p);
An excellent source of entropy for a random seed is a hash of the following:
Full result of clock_gettime (seconds and nanoseconds) without throwing away the low bits - they're the most valuable.
The value of p, cast to uintptr_t.
The address of p, cast to uintptr_t.
At least the third, and possibly also the second, derive entropy from the system's ASLR, if available (the initial stack address, and thus current stack address, is somewhat random).
I would also avoid using rand/srand entirely, both for the sake of not touching global state, and so you can have more control over the PRNG that's used. But the above procedure is a good (and fairly portable) way to get some decent entropy without a lot of work, regardless of what PRNG you use.
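A rough sketch of that recipe, assuming POSIX clock_gettime is available; the FNV-style mixing and the function name entropy_seed are only illustrative, not a specific recommended hash:
#include <stdint.h>
#include <time.h>

unsigned int entropy_seed(char *p)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);

    /* Mix seconds, nanoseconds, the value of p and the address of p. */
    uint64_t v[4] = { (uint64_t)ts.tv_sec, (uint64_t)ts.tv_nsec,
                      (uint64_t)(uintptr_t)p, (uint64_t)(uintptr_t)&p };

    uint64_t h = 1469598103934665603ull;   /* FNV-1a style mixing over words */
    for (int i = 0; i < 4; i++) {
        h ^= v[i];
        h *= 1099511628211ull;
    }
    return (unsigned int)h;
}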
For those using Visual Studio here's yet another way:
#include "stdafx.h"
#include <time.h>
#include <windows.h>
const __int64 DELTA_EPOCH_IN_MICROSECS= 11644473600000000;
struct timezone2
{
__int32 tz_minuteswest; /* minutes W of Greenwich */
bool tz_dsttime; /* type of dst correction */
};
struct timeval2 {
__int32 tv_sec; /* seconds */
__int32 tv_usec; /* microseconds */
};
int gettimeofday(struct timeval2 *tv/*in*/, struct timezone2 *tz/*in*/)
{
FILETIME ft;
__int64 tmpres = 0;
TIME_ZONE_INFORMATION tz_winapi;
int rez = 0;
ZeroMemory(&ft, sizeof(ft));
ZeroMemory(&tz_winapi, sizeof(tz_winapi));
GetSystemTimeAsFileTime(&ft);
tmpres = ft.dwHighDateTime;
tmpres <<= 32;
tmpres |= ft.dwLowDateTime;
/*converting file time to unix epoch*/
tmpres /= 10; /*convert into microseconds*/
tmpres -= DELTA_EPOCH_IN_MICROSECS;
tv->tv_sec = (__int32)(tmpres * 0.000001);
tv->tv_usec = (tmpres % 1000000);
// _tzset() doesn't work properly, so we use GetTimeZoneInformation
rez = GetTimeZoneInformation(&tz_winapi);
tz->tz_dsttime = (rez == 2) ? true : false;
tz->tz_minuteswest = tz_winapi.Bias + ((rez == 2) ? tz_winapi.DaylightBias : 0);
return 0;
}
int main(int argc, char** argv) {
struct timeval2 tv;
struct timezone2 tz;
ZeroMemory(&tv, sizeof(tv));
ZeroMemory(&tz, sizeof(tz));
gettimeofday(&tv, &tz);
unsigned long seed = tv.tv_sec ^ (tv.tv_usec << 12);
srand(seed);
}
Maybe a bit overkill but works well for quick intervals. gettimeofday function found here.
Edit: upon further investigation, rand_s might be a good alternative for Visual Studio. It's not just a safe rand(); it's totally different and doesn't use the seed from srand. I had presumed it was almost identical to rand, just "safer".
To use rand_s just don't forget to #define _CRT_RAND_S before stdlib.h is included.
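For example (a minimal sketch; rand_s is the MSVC CRT function and returns 0 on success):
#define _CRT_RAND_S
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    unsigned int value;
    if (rand_s(&value) == 0)            /* 0 means success */
        printf("%u\n", value % 100);    /* e.g. a value in [0, 99] */
    return 0;
}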
Assuming that the randomness of srand() + rand() is enough for your purposes, the trick is in selecting the best seed for srand. time(NULL) is a good starting point, but you'll run into problems if you start more than one instance of the program within the same second. Adding the pid (process id) is an improvement as different instances will get different pids. I would multiply the pid by a factor to spread them more.
But let's say you are using this for some embedded device and you have several in the same network. If they are all powered at once and you are launching the several instances of your program automatically at boot time, they may still get the same time and pid and all the devices will generate the same sequence of "random" numbers. In that case, you may want to add some unique identifier of each device (like the CPU serial number).
The proposed initialization would then be:
srand(time(NULL) + 1000 * getpid() + (uint) getCpuSerialNumber());
On a Linux machine (at least on the Raspberry Pi, where I tested this), you can implement the following function to get the CPU serial number:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Gets the CPU Serial Number as a 64 bit unsigned int. Returns 0 if not found.
uint64_t getCpuSerialNumber() {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) {
        return 0;
    }

    char line[256];
    uint64_t serial = 0;
    while (fgets(line, 256, f)) {
        if (strncmp(line, "Serial", 6) == 0) {
            serial = strtoull(strchr(line, ':') + 2, NULL, 16);
        }
    }
    fclose(f);
    return serial;
}
Include the <cstdlib> and <ctime> headers at the top of your program, and write:
srand(time(NULL));
in your program before you generate your random number. Here is an example of a program that prints a random number between one and ten:
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <ctime>
using namespace std;

int main()
{
    // Initialize srand
    srand(time(NULL));

    // Create a random number
    int n = rand() % 10 + 1;

    // Print the number and end the line
    cout << n << endl;

    // The main function is an int, so it must return a value
    return 0;
}

Calculating the delay between write and read on I2C in Linux

I am currently working with I2C on Arch Linux ARM and am not quite sure how to calculate the absolute minimum delay required between a write and a read. If I don't have this delay, the read naturally does not come through. I have just applied usleep(1000) between the two commands, which works, but it was chosen empirically and has to be optimized to the real value (somehow). But how?
Here is a code sample for the write_and_read function I am using:
int write_and_read(int handler, char *buffer, const int bytesToWrite, const int bytesToRead) {
    write(handler, buffer, bytesToWrite);
    usleep(1000);
    int r = read(handler, buffer, bytesToRead);
    if (r != bytesToRead) {
        return -1;
    }
    return 0;
}
Normally there's no need to wait. If your writing and reading functions are somehow threaded in the background (why would you do that???) then synchronizing them is mandatory.
I2C is a very simple linear communication, and all the devices I have used were able to produce the output data within microseconds.
Are you using 100 kHz, 400 kHz or 1 MHz I2C?
Edited:
After some discussion, I suggest you try this:
void dataRequest() {
    Wire.write(0x76);
    x = 0;
}

void dataReceive(int numBytes)
{
    x = numBytes;
    for (int i = 0; i < numBytes; i++) {
        Wire.read();
    }
}
Where x is a global variable defined in the header and assigned 0 in setup(). You may try to add a simple if condition in the main loop, e.g. if x > 0, send something with Serial.print() as a debug message, then reset x to 0.
With this you are not blocking the I2C operation with the serial traffic.

Is this a decent home-made mutex implementation? Criticism? Potential Problems?

I'm wondering if anyone sees anything that would likely cause problems in this code. I know there are other ways/API calls I could have used to do this, but I'm trying to lay the foundation for my own platform-independent / cross-platform mutex framework.
Obviously I need to do some #ifdefs and define some macros for the Win32 Sleep() and GetCurrentThreadId() calls...
typedef struct aec {
    unsigned long long lastaudibleframe; /* time stamp of last audible frame */
    unsigned short aiws;                 /* Average mike input when speaker is playing */
    unsigned short aiwos;                /* Average mike input when speaker ISNT playing */
    unsigned long long t_aiws, t_aiwos;  /* Internal running total */
    unsigned int c_aiws, c_aiwos;        /* Internal counters */
    unsigned long lockthreadid;
    int stlc;                            /* Same thread lock count */
} AEC;

char lockecho( AEC *ec ) {
    unsigned long tid=0;
    static int inproc=0;

    while (inproc) {
        Sleep(1);
    }
    inproc=1;

    if (!ec) {
        inproc=0;
        return 0;
    }

    tid=GetCurrentThreadId();
    if (ec->lockthreadid==tid) {
        inproc=0;
        ec->stlc++;
        return 1;
    }

    while (ec->lockthreadid!=0) {
        Sleep(1);
    }
    ec->lockthreadid=tid;
    inproc=0;
    return 1;
}
char unlockecho( AEC *ec ) {
    unsigned long tid=0;

    if (!ec)
        return 1;

    tid=GetCurrentThreadId();
    if (tid!=ec->lockthreadid)
        return 0;

    if (tid==ec->lockthreadid) {
        if (ec->stlc>0) {
            ec->stlc--;
        } else {
            ec->lockthreadid=0;
        }
    }
    return 1;
}
No, it's not. AFAIK you can't implement a mutex in plain C code without some low-level atomic operations (RMW, test-and-set, etc.). In your particular example, consider what happens if a context switch interrupts the first thread after it has seen inproc == 0 but before it gets a chance to set inproc: the second thread will resume, also see 0, and set it to 1, and now both threads "think" they have exclusive access to the struct. This is just one of many things that could go wrong with your approach.
Also note that even if a thread gets a chance to set inproc, the assignment is not guaranteed to be atomic (it could be interrupted in the middle of assigning the variable).
As mux points out, your proposed code is incorrect due to many race conditions. You could solve this using atomic instructions like "Compare and Set", but you'll need to define those separately for each platform anyway. You're better off just defining a high-level "Lock" and "Unlock" interface, and implementing those using whatever the platform provides.
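For illustration, a minimal sketch of the kind of atomic primitive the home-made flag is missing, built on the Win32 Interlocked functions; the names spin_lock/spin_unlock are illustrative, and a real mutex would block rather than spin:
#include <windows.h>

typedef struct { volatile LONG locked; } spinlock_t;

void spin_lock(spinlock_t *s)
{
    /* Atomically set locked to 1 only if it is currently 0. */
    while (InterlockedCompareExchange(&s->locked, 1, 0) != 0)
        Sleep(0);   /* yield the time slice instead of burning CPU */
}

void spin_unlock(spinlock_t *s)
{
    InterlockedExchange(&s->locked, 0);   /* atomic release */
}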

Output text one letter at a time in C

How would I output text one letter at a time like it's typing without using Sleep() for every character?
Sleep is the best option, since it doesn't waste CPU cycles.
The other option is busy waiting, meaning you spin constantly executing no-ops. You can do that with any loop structure that does absolutely nothing. I'm not sure what this is for, but it seems like you might also want to randomize the time you wait between characters to give it a natural feel.
I would have a Tick() method that would loop through the letters and only progress if a random number was smaller than a threshold I set.
Some pseudocode might look like:
int escapeIndex = 0;
int escapeMax = 1000000;
boolean exportCharacter = false;
int letterIndex = 0;
float someThresh = 0.000001;
String typedText = "somethingOrOther...";
int letterMax = typedText.length();

while (letterIndex < letterMax){
    escapeIndex++;
    if (random(1.0) < someThresh){
        exportCharacter = true;
    }
    if (escapeIndex > escapeMax) {
        exportCharacter = true;
    }
    if (exportCharacter) {
        cout << typedText.charAt(letterIndex);
        escapeIndex = 0;
        exportCharacter = false;
        letterIndex++;
    }
}
If I were doing this in a video game, let's say to simulate a player typing text into a terminal, this is how I would do it. It's going to be different every time, and its escape mechanism provides a maximum time limit for the operation.
Sleeping is the best way to do what you're describing, as the alternative, busy waiting, is just going to waste CPU cycles. From the comments, it sounds like you've been trying to manually hard-code every single character you want printed with a sleep call, instead of using loops...
Since there's been no indication that this is homework after ~20 minutes, I thought I'd post this code. It uses usleep from <unistd.h>, which sleeps for X amount of microseconds, if you're using Windows try Sleep().
#include <stdio.h>
#include <unistd.h>

void type_text(char *s, unsigned ms_delay)
{
    unsigned usecs = ms_delay * 1000; /* 1000 microseconds per ms */

    for (; *s; s++) {
        putchar(*s);
        fflush(stdout); /* alternatively, do once: setbuf(stdout, NULL); */
        usleep(usecs);
    }
}

int main(void)
{
    type_text("hello world\n", 100);
    return 0;
}
Since stdout is buffered, you're going to have to either flush it after printing each character (fflush(stdout)), or set it to not buffer the output at all by running setbuf(stdout, NULL) once.
The above code will print "hello world\n" with a delay of 100ms between each character; extremely basic.

Do we need a lock in a single writer multi-reader system?

In OS books they say there must be a lock to protect data from being accessed by a reader and a writer at the same time,
but when I test this simple example on an x86 machine, it works well.
I want to know: is the lock really necessary here?
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
struct doulnum
{
int i;
long int l;
char c;
unsigned int ui;
unsigned long int ul;
unsigned char uc;
};
long int global_array[100] = {0};
void* start_read(void *_notused)
{
int i;
struct doulnum d;
int di;
long int dl;
char dc;
unsigned char duc;
unsigned long dul;
unsigned int dui;
while(1)
{
for(i = 0;i < 100;i ++)
{
dl = global_array[i];
//di = d.i;
//dl = d.l;
//dc = d.c;
//dui = d.ui;
//duc = d.uc;
//dul = d.ul;
if(dl > 5 || dl < 0)
printf("error\n");
/*if(di > 5 || di < 0 || dl > 10 || dl < 5)
{
printf("i l value %d,%ld\n",di,dl);
exit(0);
}
if(dc > 15 || dc < 10 || dui > 20 || dui < 15)
{
printf("c ui value %d,%u\n",dc,dui);
exit(0);
}
if(dul > 25 || dul < 20 || duc > 30 || duc < 25)
{
printf("uc ul value %u,%lu\n",duc,dul);
exit(0);
}*/
}
}
}
int start_write(void)
{
int i;
//struct doulnum dl;
while(1)
{
for(i = 0;i < 100;i ++)
{
//dl.i = random() % 5;
//dl.l = random() % 5 + 5;
//dl.c = random() % 5 + 10;
//dl.ui = random() % 5 + 15;
//dl.ul = random() % 5 + 20;
//dl.uc = random() % 5 + 25;
global_array[i] = random() % 5;
}
}
return 0;
}
int main(int argc,char **argv)
{
int i;
cpu_set_t cpuinfo;
pthread_t pt[3];
//struct doulnum dl;
//dl.i = 2;
//dl.l = 7;
//dl.c = 12;
//dl.ui = 17;
//dl.ul = 22;
//dl.uc = 27;
for(i = 0;i < 100;i ++)
global_array[i] = 2;
for(i = 0;i < 3;i ++)
if(pthread_create(pt + i,NULL,start_read,NULL) < 0)
return -1;
/* for(i = 0;i < 3;i ++)
{
CPU_ZERO(&cpuinfo);
CPU_SET_S(i,sizeof(cpuinfo),&cpuinfo);
if(0 != pthread_setaffinity_np(pt[i],sizeof(cpu_set_t),&cpuinfo))
{
printf("set affinity %d\n",i);
exit(0);
}
}
CPU_ZERO(&cpuinfo);
CPU_SET_S(3,sizeof(cpuinfo),&cpuinfo);
if(0 != pthread_setaffinity_np(pthread_self(),sizeof(cpu_set_t),&cpuinfo))
{
printf("set affinity recver\n");
exit(0);
}*/
start_write();
return 0;
}
If you don't synchronise reads and writes, a reader could read while a writer is writing, and read the data in a half-written state if the write operation is not atomic. So yes, synchronisation would be necessary to keep that from happening.
You surely need synchronization here. The simple reason is that there is a distinct possibility the data will be in an inconsistent state while start_write is updating the global array and one of your 3 threads tries to read the same data from it.
What you quote is also imprecise: "there must be a lock to protect data from being accessed by reader and writer at the same time" should be "there must be a lock to protect data from being modified by one thread while another is accessing it":
if the shared data is being modified by one of the threads while another thread is reading from it, you need a lock to protect it;
if the shared data is only being read by two or more threads, you don't need to protect it.
It will work fine if the threads are just reading from global_array. printf should be fine since it does a single IO operation in append mode.
However, since the main thread calls start_write to update global_array at the same time the other threads are in start_read, they are going to read the values in a very unpredictable manner. It depends heavily on how threads are implemented in the OS, how many CPUs/cores you have, etc. This might work well on your dual-core development box but then fail spectacularly when you move to a 16-core production server.
For example, if the threads don't synchronize, they might never see any updates to global_array under the right circumstances, or some threads might see changes sooner than others. It's all about the timing of when memory pages are flushed to main memory and when the threads see the changes in their caches. To ensure consistent results you need synchronization (memory barriers) to force the caches to be updated.
The general answer is you need some way to ensure/enforce necessary atomicity, so the reader doesn't see an inconsistent state.
A lock (done correctly) is sufficient but not always necessary. But in order to prove that it's not necessary, you need to be able to say something about the atomicity of the operations involved.
This involves both the architecture of the target host and, to some extent, the compiler.
In your example, you're writing a long to an array. In this case, the question is: is the store of a long atomic? It probably is, but it depends on the host. It's possible that the CPU writes out a portion of the long (upper/lower words/bytes) separately, and thus the reader could get a value that was never written. (This is, I believe, unlikely on most modern CPU architectures, but you'd have to check to be sure.)
It's also possible for there to be write buffering in the CPU. It's been a long time since I looked at this, but I believe it's possible to get store reordering if you don't have the necessary write barrier instructions. It's unclear from your example if you would be relying on this.
Finally, you'd probably need to flag the array as volatile (again, I haven't done this in a while so I'm rusty on the specifics) in order to ensure that the compiler doesn't make assumptions about the data not changing underneath it.
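As a hedged sketch of that point, C11 atomics (where <stdatomic.h> is available) give you per-element atomicity and the acquire/release ordering without a full mutex; the function names here are only illustrative:
#include <stdatomic.h>

_Atomic long global_array[100];

/* writer */
void write_slot(int i, long v)
{
    atomic_store_explicit(&global_array[i], v, memory_order_release);
}

/* reader */
long read_slot(int i)
{
    return atomic_load_explicit(&global_array[i], memory_order_acquire);
}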
It depends on how much you care about portability.
At least on an actual Intel x86 processor, when you're reading/writing dword (32-bit) data that's also dword aligned, the hardware gives you atomicity "for free" -- i.e., without your having to do any sort of lock to enforce it.
Changing much of anything (up to and including compiler flags that might affect the data's alignment) can break that - but in ways that might remain hidden for a long time (especially if you have low contention over a particular data item). It also leads to extremely fragile code - for example, switching to a smaller data type can break the code, even if you're only using a subset of the values.
The current atomic "guarantee" is pretty much an accidental side-effect of the way the cache and bus happen to be designed. While I'm not sure I'd really expect a change that broke things, I wouldn't consider it particularly far-fetched either. The only place I've seen documentation of this atomic behavior was in the same processor manuals that cover things like model-specific registers that definitely have changed (and continue to change) from one model of processor to the next.
The bottom line is that you really should do the locking, but you probably won't see a manifestation of the problem with your current hardware, no matter how much you test (unless you change conditions like mis-aligning the data).