Why are udelay and ndelay not accurate in the Linux kernel? - timer

I wrote a function like this:
trace_printk("111111");
udelay(4000);
trace_printk("222222");
and the log shows 4.01 ms, which is OK.
But when I call it like this:
trace_printk("111111");
ndelay(10000);
ndelay(10000);
ndelay(10000);
ndelay(10000);
....
....//totally 400 ndelay calls
trace_printk("222222");
the log shows 4.7 ms, which is not acceptable.
Why is the error of ndelay so huge?
Looking deeper into the kernel code, I found the implementation of these two functions:
void __udelay(unsigned long usecs)
{
    __const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
}

void __ndelay(unsigned long nsecs)
{
    __const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
}
I thought udelay would be 1000 times ndelay, but it isn't. Why?

As you've already noticed, the nanosecond delay implementation is quite a coarse approximation compared to the microsecond delay, because of the 0x5 constant factor used. 0x10c7 / 0x5 is approximately 859. Using 0x4 would be closer to 1000 (approximately 1073).
However, using 0x4 would cause ndelay to delay for less than the number of nanoseconds requested. In general, delay functions aim to provide a delay at least as long as requested (see here: http://practicepeople.blogspot.jp/2013/08/kernel-programming-busy-waiting-delay.html).

Every time you call it, a rounding error is added. Note the comment 2**32 / 1000000000: that value is really ~4.29, but it was rounded up to 5. That's a pretty hefty error.
By contrast, the udelay error is small: ~4294.97 versus 4295 (0x10c7).
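That rounding error accounts for your measurement: each ndelay(10000) busy-waits for roughly 10000 * 5 / 4.295 ≈ 11600 ns instead of 10000 ns, so 400 calls add up to about 4.66 ms, which together with the per-call overhead matches the 4.7 ms in your trace.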

You can use ktime_get_ns() to get a high-precision time since boot, so you can use it not only for a high-precision delay but also as a high-precision timer. Here is an example:
u64 t;
int i;

t = ktime_get_ns();              /* get current nanoseconds since boot */
for (i = 0; i < 24; i++) {       /* send 24 pulses of 1200 ns / 1300 ns via GPIO */
    gpio_set_value(pin, 1);      /* drive the GPIO high (or do something else) */
    t += 1200;                   /* now we have the absolute time of the next step */
    while (ktime_get_ns() < t)   /* wait for it */
        ;
    gpio_set_value(pin, 0);      /* do something, again */
    t += 1300;                   /* time of the next step, again */
    while (ktime_get_ns() < t)   /* wait for it, again */
        ;
}
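Note that these while loops busy-wait: the CPU is fully occupied for the whole interval, and unless you disable preemption and interrupts around the sequence, the scheduler can still interrupt it and stretch individual pulses.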

Related

Creating a Program that slowly Increases the brightness of an LED as a start-up

I want to create a program, using a for loop, that slowly increases the brightness of an LED as a "start-up" when I press a button.
I have basically no knowledge of for loops. I've tried messing around by looking at similar programs and potential solutions, but I was unable to do it.
This is my starting code; I have to use PWMperiod to achieve it.
if (SW3 == 0) {
    for (unsigned char PWMperiod = 255; PWMperiod != 0; PWMperiod--) {
        if (TonLED4 == PWMperiod) {
            TonLED4 += 1;
        }
        __delay_us(20);
    }
}
How would I start this/do it?
For pulse width modulation, you'd want to turn the LED off for a certain amount of time, then turn the LED on for a certain amount of time, where the amounts of time depend on how bright you want the LED to appear and the total period ("on time + off time") is constant.
In other words, you want a relationship like period = on_time + off_time where period is constant.
You also want the LED to increase brightness slowly. E.g. maybe go from off to max. brightness over 10 seconds. This means you'll need to loop total_time / period times.
How bright the LED should be, and therefore how long the on_time should be, will depend on how much time has passed since the start of the 10 seconds (e.g. 0 microseconds at the start of the 10 seconds and period microseconds at the end of the 10 seconds). Once you know the on_time you can calculate off_time by rearranging that "period = on_time + off_time" formula.
In C it might end up something like:
#define TOTAL_TIME 10000000 // 10 seconds, in microseconds
#define PERIOD 1000 // 1 millisecond, in microseconds
#define LOOP_COUNT (TOTAL_TIME / PERIOD)
int on_time;
int off_time;
for(int t = 0; t < LOOP_COUNT; t++) {
on_time = period * t / LOOP_COUNT;
off_time = period - on_time;
turn_LED_off();
__delay_us(off_time);
turn_LED_on();
__delay_us(on_time);
}
Note: on_time = PERIOD * t / LOOP_COUNT; is a little tricky. You can think of it as on_time = PERIOD * (t / LOOP_COUNT);, where t / LOOP_COUNT is a fraction that goes from 0.00000 to 0.99999 representing the fraction of the period that the LED should be turned on. But if you wrote it like that, the compiler would truncate the result of t / LOOP_COUNT to an integer (round it towards zero), so the result would always be zero. Written as it is, C does the multiplication first, so it behaves like on_time = (PERIOD * t) / LOOP_COUNT; and truncation (or rounding) isn't a problem.
Sadly, doing the multiplication first solves one problem while possibly causing another: PERIOD * t might be too big for an int and overflow (especially on small embedded systems, where an int can be 16 bits). You'll have to figure out how big an int is for your target and your values (changing TOTAL_TIME or PERIOD will change the maximum value that PERIOD * t can reach) and use something larger (e.g. a long, maybe) if an int isn't enough.
You should also be aware that the timing won't be exact, because it ignores time spent executing your code and ignores anything else the OS might be doing (IRQs, other programs using the CPU); so the "10 seconds" might actually be 10.5 seconds (or worse). To fix that you need something more complex than a __delay_us() function (e.g. some kind of __delay_until(absolute_time) maybe).
Also; you might find that the LED doesn't increase brightness linearly (e.g. it might slowly go from off to dull in 8 seconds then go from dull to max. brightness in 2 seconds). If that happens; you might need a lookup table and/or more complex maths to correct it.
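If you do need that correction, a small lookup table is the usual fix. Here is a minimal sketch: the table values are illustrative, precomputed offline as 1000 * (i/10)^2.2 (a common gamma curve), and the cast to long sidesteps the overflow issue described above.

// Hypothetical gamma table: maps brightness steps 0..10 to on-times
// out of PERIOD = 1000, precomputed as 1000 * (i/10)^2.2.
static const int gamma_on_time[11] = {
    0, 6, 29, 71, 133, 218, 325, 456, 612, 793, 1000
};

// inside the loop, instead of the linear ramp:
on_time = gamma_on_time[(int)(((long)t * 10) / LOOP_COUNT)]; // index 0..9
off_time = PERIOD - on_time;

A coarse 11-entry table like this steps the brightness in ten visible increments; a 256-entry table gives a smoother ramp at the cost of more flash.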

Understanding the calculation of the time (jiffies)

I want to sleep in the kernel for a specific amount of time, and I use time_before and jiffies to calculate how long I should sleep; however, I don't understand how the calculation actually works. I know that HZ is 250 and jiffies is a huge dynamic value. I know what they both are and what they are used for.
I calculate the time with jiffies + (10 * HZ):
static unsigned long j1;

static int __init sys_module_init(void)
{
    j1 = jiffies + (10 * HZ);
    while (time_before(jiffies, j1))
        schedule();
    printk("Hello World - %d - %ld\n", HZ, jiffies); // Hello World - 250 - 4296485594 (dynamic)
    return 0;
}
How does the calculation work and how many seconds will I sleep? I want to know that because in the future I'll probably want to sleep for a specific time.
HZ represents the amount of ticks in a second, and multiplying that by 10 gives the amount of ticks in 10 seconds. So the calculation jiffies + 10 * HZ yields the expected value of jiffies 10 seconds from now.
However, calling schedule() in a loop until that value is hit is not the recommended way to go. If you want to sleep in the kernel, you don't need to reinvent the wheel, there already is a set of APIs just for this purpose, documented here, which will make your life a lot easier. The simplest way to sleep in your specific case would be to use msleep(), passing the number of milliseconds you want to sleep for:
#include <linux/delay.h>

static int __init sys_module_init(void)
{
    msleep(10 * 1000);
    return 0;
}
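For whole seconds, the same linux/delay.h header also provides ssleep(), so ssleep(10) would be an equivalent, slightly more readable way to write this.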

How to implement a timer for every second with zero nanosecond with liburing?

I noticed the io_uring kernel side uses CLOCK_MONOTONIC, so for the first timer I get the time with both CLOCK_REALTIME and CLOCK_MONOTONIC, adjust the nanoseconds as below, and use the IORING_TIMEOUT_ABS flag for io_uring_prep_timeout (see iorn/clock.c at master · hnakamur/iorn):
const long sec_in_nsec = 1000000000;

static int queue_timeout(iorn_queue_t *queue) {
    iorn_timeout_op_t *op = calloc(1, sizeof(*op));
    if (op == NULL) {
        return -ENOMEM;
    }

    struct timespec rts;
    int ret = clock_gettime(CLOCK_REALTIME, &rts);
    if (ret < 0) {
        fprintf(stderr, "clock_gettime CLOCK_REALTIME error: %s\n", strerror(errno));
        return -errno;
    }
    long nsec_diff = sec_in_nsec - rts.tv_nsec;

    ret = clock_gettime(CLOCK_MONOTONIC, &op->ts);
    if (ret < 0) {
        fprintf(stderr, "clock_gettime CLOCK_MONOTONIC error: %s\n", strerror(errno));
        return -errno;
    }

    op->handler = on_timeout;
    op->ts.tv_sec++;
    op->ts.tv_nsec += nsec_diff;
    if (op->ts.tv_nsec >= sec_in_nsec) {
        op->ts.tv_sec++;
        op->ts.tv_nsec -= sec_in_nsec;
    }
    op->count = 1;
    op->flags = IORING_TIMEOUT_ABS;
    ret = iorn_prep_timeout(queue, op);
    if (ret < 0) {
        return ret;
    }
    return iorn_submit(queue);
}
From the second timer onward, I just increment the seconds part tv_sec and use the IORING_TIMEOUT_ABS flag for io_uring_prep_timeout.
Here is the output from my example program. The millisecond part is zero, but each timeout fires about 400 microseconds after the whole second.
on_timeout time=2020-05-10T14:49:42.000442
on_timeout time=2020-05-10T14:49:43.000371
on_timeout time=2020-05-10T14:49:44.000368
on_timeout time=2020-05-10T14:49:45.000372
on_timeout time=2020-05-10T14:49:46.000372
on_timeout time=2020-05-10T14:49:47.000373
on_timeout time=2020-05-10T14:49:48.000373
Could you tell me a better way than this?
Thanks for your comments! I'd like to update the current time for logging, like ngx_time_update(). I modified my example to use just CLOCK_REALTIME, but it is still about 400 microseconds late: github.com/hnakamur/iorn/commit/… Does that mean clock_gettime takes about 400 nanoseconds on my machine?
Yes, that sounds about right, sort of. But if you're on an x86 PC under Linux, 400 ns for clock_gettime overhead may be a bit high (an order of magnitude high; see below). If you're on an ARM CPU (e.g. Raspberry Pi, Nvidia Jetson), it might be okay.
I don't know how you're getting 400 microseconds. But, I've had to do a lot of realtime stuff under linux, and 400 us is similar to what I've measured as the overhead to do a context switch and/or wakeup a process/thread after a syscall suspends it.
I never use gettimeofday anymore. I now just use clock_gettime(CLOCK_REALTIME,...) because it's the same except you get nanoseconds instead of microseconds.
Just so you know, although clock_gettime is a syscall, nowadays, on most systems, it uses the VDSO layer. The kernel injects special code into the userspace app, so that it is able to access the time directly without the overhead of a syscall.
If you're interested, you could run under gdb and disassemble the code to see that it just accesses some special memory locations instead of doing a syscall.
I don't think you need to worry about this too much. Just use clock_gettime(CLOCK_MONOTONIC,...) and set flags to 0. The overhead doesn't factor into this, for the purposes of the ioring call as your iorn layer is using it.
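For reference, a minimal relative-timeout sketch with plain liburing might look like this (assuming an already-initialized struct io_uring ring, with error checking omitted; count and flags are both 0, so it completes one second after submission):

struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_timeout(sqe, &ts, 0, 0); /* relative one-second timeout, flags = 0 */
io_uring_submit(&ring);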
When I do this sort of thing, and I want/need to calculate the overhead of clock_gettime itself, I call clock_gettime in a loop (e.g. 1000 times), and try to keep the total time below a [possible] timeslice. I use the minimum diff between times in each iteration. That compensates for any [possible] timeslicing.
The minimum is the overhead of the call itself [on average].
There are additional tricks that you can do to minimize latency in userspace (e.g. raising process priority, clamping CPU affinity and I/O interrupt affinity), but they can involve a few more things, and, if you're not very careful, they can produce worse results.
Before you start taking extraordinary measures, you should have a solid methodology to measure timing/benchmarking to prove that your results can not meet your timing/throughput/latency requirements. Otherwise, you're doing complicated things for no real/measurable/necessary benefit.
Below is some code I just created, simplified, but based on code I already have/use to calibrate the overhead:
#include <stdio.h>
#include <time.h>
#define ITERMAX 10000
typedef long long tsc_t;
// tscget -- get time in nanoseconds
static inline tsc_t
tscget(void)
{
struct timespec ts;
tsc_t tsc;
clock_gettime(CLOCK_MONOTONIC,&ts);
tsc = ts.tv_sec;
tsc *= 1000000000;
tsc += ts.tv_nsec;
return tsc;
}
// tscsec -- convert nanoseconds to fractional seconds
double
tscsec(tsc_t tsc)
{
double sec;
sec = tsc;
sec /= 1e9;
return sec;
}
tsc_t
calibrate(void)
{
tsc_t tscbeg;
tsc_t tscold;
tsc_t tscnow;
tsc_t tscdif;
tsc_t tscmin;
int iter;
tscmin = 1LL << 62;
tscbeg = tscget();
tscold = tscbeg;
for (iter = ITERMAX; iter > 0; --iter) {
tscnow = tscget();
tscdif = tscnow - tscold;
if (tscdif < tscmin)
tscmin = tscdif;
tscold = tscnow;
}
tscdif = tscnow - tscbeg;
printf("MIN:%.9f TOT:%.9f AVG:%.9f\n",
tscsec(tscmin),tscsec(tscdif),tscsec(tscnow - tscbeg) / ITERMAX);
return tscmin;
}
int
main(void)
{
calibrate();
return 0;
}
On my system, a 2.67GHz Core i7, the output is:
MIN:0.000000019 TOT:0.000254999 AVG:0.000000025
So, I'm getting 25 ns overhead [and not 400 ns]. But, again, each system can be different to some extent.
UPDATE:
Note that x86 processors have "speed step". The OS can adjust the CPU frequency up or down semi-automatically. Lower speeds conserve power. Higher speeds are maximum performance.
This is done with a heuristic (e.g. if the OS detects that the process is a heavy CPU user, it will up the speed).
To force maximum speed, linux has this directory:
/sys/devices/system/cpu/cpuN/cpufreq
Where N is the cpu number (e.g. 0-7)
Under this directory, there are a number of files of interest. They should be self explanatory.
In particular, look at scaling_governor. It has either ondemand [kernel will adjust as needed] or performance [kernel will force maximum CPU speed].
To force maximum speed, as root, set this [once] to performance (e.g.):
echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Do this for all cpus.
However, I just did this on my system, and it had little effect. So, the kernel's heuristic may have improved.
As to the 400 us: when a process that has been waiting on something is "woken up", this is a two-step process.
The process is marked "runnable".
At some point, the system/CPU does a reschedule. The process will be run, based upon the scheduling policy and the process priority in effect.
For many syscalls, the reschedule [only] occurs on the next system timer/clock tick/interrupt. So, for some, there can be a delay of up to a full clock tick (e.g. for an HZ value of 1000, this can be up to 1 ms (1000 us) later).
On average, the delay is half a tick, or 500 us.
For some syscalls, when the process is marked runnable, a reschedule is done immediately. If the process has a higher priority, it will be run immediately.
When I first looked at this [circa 2004], I looked at all code paths in the kernel, and the only syscall that did the immediate reschedule was SysV IPC, for msgsnd/msgrcv. That is, when process A did msgsnd, any process B waiting for the given message would be run.
But, others did not (e.g. futex). They would wait for the timer tick. A lot has changed since then, and now, more syscalls will do the immediate reschedule. For example, I recently measured futex [invoked via pthread_mutex_*], and it seemed to do the quick reschedule.
Also, the kernel scheduler has changed. The newer scheduler can wakeup/run some things on a fraction of a clock tick.
So, for you, the 400 us, is [possibly] the alignment to the next clock tick.
But, it could just be the overhead of doing the syscall. To test that, I modified my test program to open /dev/null [and/or /dev/zero], and added read(fd,buf,1) to the test loop.
I got a MIN: value of 529 us. So, the delay you're getting could just be the amount of time it takes to do the task switch.
This is what I would call "good enough for now".
To get "razor's edge" response, you'd probably have to write a custom kernel driver and have the driver do this. This is what embedded systems would do if (e.g.) they had to toggle a GPIO pin on every interval.
But, if all you're doing is printf, the overhead of printf and the underlying write(1,...) tends to swamp the actual delay.
Also, note that when you do printf, it builds the output buffer and when the buffer in FILE *stdout is full, it flushes via write.
For best performance, it's better to do: int len = sprintf(buf, "current time is ..."); write(1, buf, len);
Also, when you do this, if the kernel buffers for TTY I/O get filled [which is quite possible given the high frequency of messages you're doing], the process will be suspended until the I/O has been sent to the TTY device.
To do this well, you'd have to watch how much space is available, and skip some messages if there isn't enough space to [wholly] contain them.
You'd need to do ioctl(1,TIOCOUTQ,...) to get the number of bytes already queued for output, and skip a message if the remaining room is less than the size of the message you want to output (e.g. the len value above).
For your usage, you're probably more interested in the latest time message, rather than outputting all messages [which would eventually produce a lag].
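A minimal sketch of that guard, assuming a Linux TTY on stdout and an illustrative OUTQ_LIMIT standing in for the driver's buffer size:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define OUTQ_LIMIT 4096 /* assumed output buffer size; tune for your device */

/* write msg to stdout only if it fits without blocking; otherwise drop it */
static void write_if_room(const char *msg)
{
    int queued = 0;
    size_t len = strlen(msg);

    /* TIOCOUTQ reports the number of bytes still queued for output */
    if (ioctl(1, TIOCOUTQ, &queued) == 0 && queued + (int)len <= OUTQ_LIMIT)
        write(1, msg, len);
}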

RMS calculation DC offset

I need to implement an RMS calculation of a sine wave in an MCU (microcontroller, resource constrained). The MCU lacks an FPU (floating point unit), so I would prefer to stay in the integer realm. Samples are captured via a 10-bit ADC.
Looking for a solution, I've found this great solution here by Edgar Bonet: https://stackoverflow.com/a/28812301/8264292
Seems like it completely suits my needs. But I have some questions.
The input is 230 VAC mains at 50 Hz. It is transformed and offset by hardware to become a 0-1 V (peak-to-peak) sine wave which I can capture with the ADC, getting readings of 0-1023. The hardware is calibrated so that a 260 VRMS input (i.e. about -368 V to +368 V peak) becomes a 0-1 V peak-to-peak output. How can I "restore" the original wave's RMS value while staying in the integer realm too? Units can vary; mV will do fine as well.
My first guess was subtracting 512 from the input sample (the DC offset) and later doing the "magic" shift as in Edgar Bonet's answer. But I realized it's wrong, because the DC offset isn't fixed. Instead, the signal is biased to start from 0 V; i.e. a 130 VAC input would produce a 0-500 mV peak-to-peak output (not 250-750 mV, which would have worked so far).
For true RMS with the DC offset removed, I need to subtract the square of the mean from the mean of the squares, as in this formula:
RMS = sqrt( sum(x²)/N − (sum(x)/N)² )
So I've ended up with the following function:
#define INITIAL 512
#define SAMPLES 1024
#define MAX_V 368UL // Maximum input peak in V ( 260*sqrt(2) )
/* K is defined based on the equation, where 64 = 2^6,
 * i.e. 6 bits to add to the 10-bit ADC to make it 16-bit,
 * then doubled for the whole range from -peak to +peak
 */
#define K (MAX_V*64*2)

uint16_t rms_filter(uint16_t sample)
{
    static int16_t rms = INITIAL;
    static uint32_t sum_squares = 1UL * SAMPLES * INITIAL * INITIAL;
    static uint32_t sum = 1UL * SAMPLES * INITIAL;

    sum_squares -= sum_squares / SAMPLES;
    sum_squares += (uint32_t) sample * sample;
    sum -= sum / SAMPLES;
    sum += sample;
    if (rms == 0) rms = 1; /* do not divide by zero */
    rms = (rms + (((sum_squares / SAMPLES) - (sum/SAMPLES)*(sum/SAMPLES)) / rms)) / 2;
    return rms;
}

...
// Somewhere in a loop
getSample(&sample);
rms = rms_filter(sample);
...
// After getting at least N samples (SAMPLES * X?)
uint16_t vrms = (uint32_t)(rms*K) >> 16;
printf("Converted Vrms = %d V\r\n", vrms);
Does it look fine? Or am I doing something wrong?
How does the SAMPLES (window size?) number relate to F (50 Hz) and my ADC capture rate (samples per second)? I.e. how many real samples do I need to feed to rms_filter() before I get a real RMS value, given my capture speed is X sps? At the least, how do I evaluate the required minimum N of samples?
I did not test your code, but it looks to me like it should work fine.
Personally, I would not have implemented the function this way. I would
instead have removed the DC part of the signal before trying to
compute the RMS value. The DC part can be estimated by sending the raw
signal through a low pass filter. In pseudo-code this would be
rms = sqrt(low_pass(square(x - low_pass(x))))
whereas what you wrote is basically
rms = sqrt(low_pass(square(x)) - square(low_pass(x)))
It shouldn't really make much of a difference though. The first formula,
however, spares you a multiplication. Also, by removing the DC component
before computing the square, you end up multiplying smaller numbers,
which may help in allocating bits for the fixed-point implementation.
In any case, I recommend you test the filter on your computer with
synthetic data before committing it to the MCU.
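For example, a host-side test can feed rms_filter() a synthetic 50 Hz sine of known RMS. This sketch assumes a 1 kHz sample rate, the 512-count bias from the question, and an illustrative 200-count amplitude (compile with -lm, with rms_filter() from above):

#include <math.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const double fs = 1000.0;  /* assumed sampling rate, sps */
    const double amp = 200.0;  /* peak amplitude, ADC counts */
    uint16_t rms = 0;

    /* 20 s of samples, plenty for the filter to settle */
    for (int n = 0; n < 20000; n++) {
        double x = 512.0 + amp * sin(2 * M_PI * 50.0 * n / fs);
        rms = rms_filter((uint16_t)(x + 0.5));
    }
    printf("filter: %u, expected: %.1f\n", rms, amp / sqrt(2.0)); /* expect ~141 */
    return 0;
}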
How does the SAMPLES (window size?) number relate to F (50 Hz) and my ADC
capture rate (samples per second)?
The constant SAMPLES controls the cut-off frequency of the low pass
filters. This cut-off should be small enough to almost completely remove
the 50 Hz part of the signal. On the other hand, if the mains
supply is not completely stable, the quantity you are measuring will
slowly vary with time, and you may want your cut-off to be high enough
to capture those variations.
The transfer function of these single-pole low-pass filters is
H(z) = z / (SAMPLES * z + 1 − SAMPLES)
where
z = exp(i 2 π f / f₀),
i is the imaginary unit,
f is the signal frequency and
f₀ is the sampling frequency
If f₀ ≫ f (which is desirable for a good sampling), you can approximate
this by the analog filter:
H(s) = 1/(1 + SAMPLES * s / f₀)
where s = i2πf and the cut-off frequency is f₀/(2π*SAMPLES). The gain
at f = 50 Hz is then
1/sqrt(1 + (2π * SAMPLES * f/f₀)²)
The relevant parameter here is (SAMPLES * f/f₀), which is the number of
periods of the 50 Hz signal that fit inside your sampling window.
If you fit one period, you are letting about 15% of the signal through
the filter. Half as much if you fit two periods, etc.
You could get perfect rejection of the 50 Hz signal if you design a
filter with a notch at that particular frequency. If you don't want
to dig into digital filter design theory, the simplest such filter may
be a simple moving average that averages over a period of exactly
20 ms. This has a non-trivial cost in RAM though, as you have to
keep a full 20 ms worth of samples in a circular buffer.
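A minimal sketch of that moving average, assuming a 1 kHz sampling rate so that exactly 20 samples span one 20 ms mains period (buffer size and types are illustrative):

#include <stdint.h>

#define PERIOD_SAMPLES 20 /* samples per 20 ms at an assumed 1 kHz rate */

/* Averaging over exactly one mains period cancels the 50 Hz component
 * and its harmonics, leaving the DC estimate. */
static uint16_t buf[PERIOD_SAMPLES];
static uint8_t idx;
static uint32_t running_sum;

uint16_t mains_average(uint16_t sample)
{
    running_sum -= buf[idx];  /* drop the sample leaving the window */
    running_sum += sample;    /* add the new one */
    buf[idx] = sample;
    idx = (idx + 1) % PERIOD_SAMPLES;
    return running_sum / PERIOD_SAMPLES;
}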

Division in C returns 0... sometimes

I'm trying to make a simple RPM meter using an ATmega328.
I have an encoder on the motor which gives 306 interrupts per rotation (the encoder has 3 spokes which interrupt on both rising and falling edges, and the motor is geared 51:1, so 6 transitions * 51 = 306 interrupts per wheel rotation). I am using a timer interrupting every 1 ms; however, in the interrupt it is set to recalculate every 1 second.
There seem to be two problems:
1) RPM never goes below 60; instead it's either 0 or RPM >= 60
2) Reducing the time interval causes it to always be 0 (as far as I can tell)
Here is the code:
int main(void)
{
    while (1) {
        int temprpm = leftRPM;
        printf("Revs: %d \n", temprpm);
        _delay_ms(50);
    }
    return 0;
}

ISR (INT0_vect)
{
    ticksM1++;
}

ISR (TIMER0_COMPA_vect)
{
    counter++;
    if (counter == 1000) {
        int tempticks = ticksM1;
        leftRPM = ((tempticks - lastM1)/306)*1*60;
        lastM1 = tempticks;
        counter = 0;
    }
}
Anything that is not declared in that code is declared globally and as an int, ticksM1 is also volatile.
The macros are AVR macros for the interrupts.
The purpose of multiplying by 1 for leftRPM is to represent time; ideally I want to use 1 ms without the if statement, in which case the 1 would become 1000.
For a speed between 60 and 120 RPM the result of ((tempticks - lastM1)/306) will be 1, and below 60 RPM it will be zero, so your output will always be a multiple of 60.
The first improvement I would suggest is not to perform expensive arithmetic in the ISR. It is unnecessary - store the speed in raw counts-per-second, and convert to RPM only for display.
Second, perform the multiply before the divide to avoid unnecessarily discarding information. Then, for example, at 60 RPM (306 CPS) you have (306 * 60) / 306 == 60. Even as low as 1 RPM you get (6 * 60) / 306 == 1. In fact it gives you a potential resolution of approximately 0.2 RPM, as opposed to 60 RPM! To allow the parameters to be easily maintained, I recommend using symbolic constants rather than magic numbers.
#define ENCODER_COUNTS_PER_REV 306
#define MILLISEC_PER_SAMPLE    1000
#define SAMPLES_PER_MINUTE     ((60 * 1000) / MILLISEC_PER_SAMPLE)

ISR (TIMER0_COMPA_vect)
{
    counter++;
    if (counter == MILLISEC_PER_SAMPLE) {
        int tempticks = ticksM1;
        leftCPS = tempticks - lastM1;
        lastM1 = tempticks;
        counter = 0;
    }
}
Then in main():
int temprpm = (leftCPS * SAMPLES_PER_MINUTE) / ENCODER_COUNTS_PER_REV ;
If you want better that 1RPM resolution you might consider
int temprpm_x10 = (leftCPS * SAMPLES_PER_MINUTE) / (ENCODER_COUNTS_PER_REV / 10) ;
then displaying:
printf( "%d.%d", temprpm / 10, temprpm % 10 ) ;
Given the potential resolution of 0.2 rpm by this method, higher resolution display is unnecessary, though you could use a moving-average to improve resolution at the expense of some "display-lag".
Alternatively now that the calculation of RPM is no longer in the ISR you might afford a floating point operation:
float temprpm = ((float)leftCPS * (float)SAMPLES_PER_MINUTE ) / (float)ENCODER_COUNTS_PER_REV ;
printf( "%f", temprpm ) ;
Another potential issue is that ticksM1++, tempticks = ticksM1, and the reading of leftRPM (or leftCPS in my solution) are not atomic operations, and can result in an incorrect value being read if interrupt nesting is supported (and even if it is not, in the case of the access from outside the interrupt context). If the maximum rate will be less than 256 cps (about 50 RPM) you might get away with an atomic 8-bit counter; alternatively you can reduce your sampling period to ensure the count is always less than 256. Failing that, the simplest solution is to disable interrupts while reading or updating non-atomic variables shared across interrupt and thread contexts.
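On AVR you can do that with the ATOMIC_BLOCK macro from avr-libc; a minimal sketch, reusing the leftCPS variable from the code above:

#include <util/atomic.h>

int cps_copy;
ATOMIC_BLOCK(ATOMIC_RESTORESTATE) /* interrupts disabled inside this block */
{
    cps_copy = leftCPS;           /* safe multi-byte read of the shared variable */
}
int temprpm = (cps_copy * SAMPLES_PER_MINUTE) / ENCODER_COUNTS_PER_REV;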
It's integer division. You would probably get better results with something like this:
leftRPM = ((tempticks - lastM1)/6);
