The meaning of period in ALSA


I'm using ALSA for an audio application on Linux. I found good docs that explain how to use it: 1 and this one. However, I have trouble understanding this part of the setup:
/* Set number of periods. Periods used to be called fragments. */
if (snd_pcm_hw_params_set_periods(pcm_handle, hwparams, periods, 0) < 0) {
fprintf(stderr, "Error setting periods.\n");
return(-1);
}
What does it mean to set a number of periods when I'm using PLAYBACK mode?
And:
/* Set buffer size (in frames). The resulting latency is given by */
/* latency = periodsize * periods / (rate * bytes_per_frame) */
if (snd_pcm_hw_params_set_buffer_size(pcm_handle, hwparams, (periodsize * periods)>>2) < 0) {
fprintf(stderr, "Error setting buffersize.\n");
return(-1);
}
And the same question here about the latency: how should I understand it?

I assume you've read and understood this section of linux-journal. You may also find that this blog clarifies things with respect to period size selection (or fragment size, in the blog's terms) in the context of ALSA. To quote:
You shouldn't misuse the fragments logic of sound devices. It's like
this:
The latency is defined by the buffer size.
The wakeup interval is defined by the fragment size.
The buffer fill level will oscillate between 'full buffer' and 'full
buffer minus 1x fragment size minus OS scheduling latency'. Setting
smaller fragment sizes will increase the CPU load and decrease battery
time since you force the CPU to wake up more often. OTOH it increases
drop out safety, since you fill up playback buffer earlier. Choosing
the fragment size is hence something which you should do balancing out
your needs between power consumption and drop-out safety. With modern
processors and a good OS scheduler like the Linux one setting the
fragment size to anything other than half the buffer size does not
make much sense.
...
(Oh, ALSA uses the term 'period' for what I call 'fragment'
above. It's synonymous)
So, typically, you would set period to 2 (as was done in the howto you referenced). Then periodsize * period is your total buffer size in bytes. Finally, the latency is the delay induced by buffering that many samples, and can be computed by dividing the buffer size by the rate at which samples are played back (i.e. according to the formula latency = periodsize * periods / (rate * bytes_per_frame) in the code comments).
For example, the parameters from the howto:
period = 2
periodsize = 8192 bytes
rate = 44100Hz
16 bits stereo data (4 bytes per frame)
correspond to a total buffer size of period * periodsize = 2 * 8192 = 16384 bytes, and a latency of 16384 / (44100 * 4) ~ 0.093 seconds.
Note also that your hardware may have some size limitations for the supported period size (see this troubleshooting guide).
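The arithmetic above can be checked with a few small helpers. This is only a sketch: the function names are made up for illustration, and it assumes, as in the howto, that the period size is given in bytes.

```c
#include <stddef.h>

/* Total buffer size in bytes: number of periods times bytes per period. */
size_t buffer_bytes(unsigned periods, size_t period_bytes)
{
    return (size_t)periods * period_bytes;
}

/* Buffer size in frames, the unit that the howto passes to
 * snd_pcm_hw_params_set_buffer_size(). */
size_t buffer_frames(unsigned periods, size_t period_bytes, unsigned bytes_per_frame)
{
    return buffer_bytes(periods, period_bytes) / bytes_per_frame;
}

/* Latency in seconds added by a completely full buffer. */
double buffer_latency_s(unsigned periods, size_t period_bytes,
                        unsigned rate_hz, unsigned bytes_per_frame)
{
    return (double)buffer_bytes(periods, period_bytes)
           / ((double)rate_hz * bytes_per_frame);
}
```

With the howto's values, buffer_latency_s(2, 8192, 44100, 4) comes out at roughly 0.093 s, and buffer_frames(2, 8192, 4) gives 4096, which is the same as (periodsize * periods)>>2 in the snippet from the question.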

When the application tries to write samples into the buffer, and the buffer is already full, the process goes to sleep. It gets woken up by the hardware through an interrupt; this interrupt is raised at the end of each period.
There should be at least two periods per buffer; otherwise, the buffer is already empty when a wakeup happens, which results in an underrun.
Increasing the number of periods (i.e., reducing the period size) increases the safety margin against underruns caused by scheduling or processing delays.
The latency is just proportional to the buffer size: when you completely fill the buffer, the last sample written is played by the hardware only after all the other samples have been played.
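A rough sketch of the timing described above, under the simplifying assumption that the application keeps the buffer full and is woken exactly at each period boundary (scheduling delay is ignored). The function names are illustrative only.

```c
/* Time between two period interrupts (wakeups), in seconds. */
double period_interval_s(unsigned long period_frames, unsigned rate_hz)
{
    return (double)period_frames / rate_hz;
}

/* Audio still queued at the moment of a wakeup, in seconds, assuming the
 * buffer was full before the period that just finished playing. This is
 * the time the application has to refill before an underrun. */
double refill_margin_s(unsigned periods, unsigned long period_frames, unsigned rate_hz)
{
    if (periods == 0)
        return 0.0;
    return (periods - 1) * period_interval_s(period_frames, rate_hz);
}
```

For the same 4096-frame buffer at 44100 Hz, two periods of 2048 frames leave about 46 ms to refill after a wakeup, four periods of 1024 frames leave about 70 ms, and a single period leaves nothing, which is the underrun case mentioned above.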

Related

Getting more precise timing control in Linux

I am trying to create a low-jitter multicast source for digital TV. The program in question should buffer the input, calculate the intended times from the PCR values in the stream and then send the packets at relatively precise intervals. However, this is not running on a RTOS, so some timing variance is expected.
This is the basic code (the relevant variables are initialized, I just omitted the code here):
while (!sendstop) {
//snip
//put 7 MPEG packets in one UDP packet buffer "outpkt"
//snip
waittime = //calculate from PCR values - value is in microseconds
//waittime is in the order of 2000 -> 2ms
sleeptime=curtime;
sleeptime.tv_nsec += waittime * 1000L;
sleeptime.tv_sec += sleeptime.tv_nsec / 1000000000;
sleeptime.tv_nsec %= 1000000000;
while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &sleeptime, NULL) && errno == EINTR) {
printf("I");
}
sendto(sck,outpkt,1316,0,res->ai_addr,res->ai_addrlen); //send the packet
clock_gettime(CLOCK_MONOTONIC,&curtime);
}
However, this results in the sending being too slow (since there is some processing that also takes time), so the buffer fills up. So, I thought that I should get the difference between "sleeptime" (the time that should have been) and "curtime" (the actual time) and then subtract it from the future "waittime". This almost works, but now it is a bit too fast and I get an empty buffer.
My next idea was to multiply the difference by some value before subtracting it, like this (just above "while..."):
difn=curtime.tv_nsec-ostime.tv_nsec;
if (difn<0) difn+=1000000000;
sleeptime.tv_nsec = sleeptime.tv_nsec-(difn*difnc)/1000; //difnc - adjustment
if (sleeptime.tv_nsec<0) {
sleeptime.tv_nsec+=1000000000;
sleeptime.tv_sec--;
}
However, different values of difnc work at different times of day, on different servers and so on. There needs to be some kind of automatic adjustment based on the operation of the program. The best I could figure out was to increment/decrement it every time the buffer is full or empty; however, this leads to slow cycles of "too fast" - "too slow". I tried to adjust the "difnc" value based on how full/empty the buffer is, but that too just leads to "slow"-"fast" cycles.
How can I properly automatically derive the "difnc" value or is there some other method of getting a more precise timing than with just the "clock_nanosleep" function but without busy waits (the server has other things to do)?
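One commonly used way to keep the processing time out of the schedule, offered here as a sketch and not as an answer from the original thread, is to advance the absolute deadline from the previous deadline instead of from the time read after sendto(). The TIMER_ABSTIME sleep then absorbs whatever time the loop body took, so no correction factor is needed. The helper name advance_deadline is made up for this example, and it assumes a non-negative wait.

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance an absolute deadline by wait_us microseconds (wait_us >= 0),
 * keeping tv_nsec normalized to [0, 1e9). Hypothetical helper, not part
 * of the original post. */
void advance_deadline(struct timespec *deadline, long wait_us)
{
    deadline->tv_sec  += wait_us / 1000000L;
    deadline->tv_nsec += (wait_us % 1000000L) * 1000L;
    if (deadline->tv_nsec >= NSEC_PER_SEC) {
        deadline->tv_nsec -= NSEC_PER_SEC;
        deadline->tv_sec++;
    }
}

/* Intended use inside the send loop, with deadline initialised once via
 * clock_gettime(CLOCK_MONOTONIC, &deadline) before the loop:
 *
 *     advance_deadline(&deadline, waittime);
 *     while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
 *                            &deadline, NULL) == EINTR)
 *         ;
 *     sendto(...);
 *
 * clock_nanosleep() returns the error number directly, so the EINTR
 * check is made on its return value.
 */
```

Because each deadline is derived from the previous one, a late wakeup shortens the next sleep automatically and the error does not accumulate. Individual packets can still be late by the scheduling latency.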

Sum reduction with parallel algorithm - Bad performances compared to CPU version

I have written a small program that does a sum reduction of a 1D array. I am comparing a sequential CPU version and an OpenCL version.
The code is available on this link1
The kernel code is available on this link2
and if you want to compile : link3 for Makefile
My issue is the poor performance of the GPU version:
for vector sizes below 1.024 * 10^9 elements (i.e. with 1024, 10240, 102400, 1024000, 10240000, 102400000 elements), the runtime of the GPU version is higher (slightly, but higher) than that of the CPU one.
As you can see, I have taken multiples of 1024 so that the number of work-items is compatible with the size of a work-group.
Concerning the number of workgroups, I have taken :
// Number of work-groups
int nWorkGroups = size/local_item_size;
But for a high number of work-items, I wonder if the value of nWorkGroups is suitable (for example, nWorkGroups = 1.024 * 10^8 / 1024 = 10^5 work-groups; isn't this too many?).
I tried changing local_item_size over the values [64, 128, 256, 512, 1024], but the performance remains bad for all of them.
I only see a real benefit for size = 1.024 * 10^9 elements; here are the runtimes:
Size of the vector
1024000000
Problem size = 1024000000
GPU Parallel Reduction : Wall Clock = 20 second 977511 micro
Final Sum Sequential = 5.2428800006710899200e+17
Sequential Reduction : Wall Clock = 337 second 459777 micro
From your experience, why do I get such bad performance? I thought that the advantage over the CPU version would be more significant.
Maybe someone can spot a major mistake in the source code because, at the moment, I can't solve this issue.
Thanks

Well I can tell you some reasons:
You don't need to write the reduction buffer. You can directly clear it in GPU memory using clEnqueueFillBuffer() or a helper kernel.
ret = clEnqueueWriteBuffer(command_queue, reductionBuffer, CL_TRUE, 0,
local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);
Don't use blocking calls, except for the last read. Otherwise you are wasting time there.
You are doing the last reduction on the CPU. Running the kernel iteratively over its own output can help.
If your kernel only reduces 128 elements per pass, your 10^9 elements only get down to about 8*10^6, and the CPU does the rest. If you add the data copy on top of that, it is not worth it at all.
However, if you run 3 passes at 512 elements per pass, you read back from the GPU only 10^9/512^3, i.e. about 8 values. The only bottlenecks would then be the first copy to the GPU and the kernel launches.
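The pass arithmetic above can be sketched on the host side as follows. The function name is invented for this example, and it assumes each work-group folds per_pass input values into one partial sum.

```c
#include <stddef.h>

/* Number of partial sums left after `passes` reduction passes, when each
 * work-group folds `per_pass` elements (per_pass >= 2) into one value.
 * Division rounds up, since a partly filled group still yields one value. */
size_t remaining_after_passes(size_t n, size_t per_pass, unsigned passes)
{
    while (passes-- > 0 && n > 1)
        n = (n + per_pass - 1) / per_pass;
    return n;
}
```

For n = 10^9, one pass at 128 elements leaves 7812500 values for the CPU, while three passes at 512 elements leave 8.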

C - Store global variables in flash?

As the title may suggest, I'm currently short on SRAM in my program and I can't find a way to reduce my global variables. Is it possible to move global variables over to flash memory? Since these variables are frequently read and written, would it be bad for the NAND flash, given that it has a limited number of read/write cycles?
If the flash cannot handle this, would EEPROM be a good alternative?
EDIT:
Sorry for the ambiguity guys. I'm working with Atmel AVR ATmega32HVB which has:
2K bytes of SRAM,
1K bytes of EEPROM
32K bytes of FLASH
Compiler: AVR C/C++
Platform: IAR Embedded AVR
The global variables that I want to get rid of are:
uint32_t capacityInCCAccumulated[TOTAL_CELL];
and
int32_t AccumulatedCCADCvalue[TOTAL_CELL];
Code snippets:
int32_t AccumulatedCCADCvalue[TOTAL_CELL];
void CCGASG_AccumulateCCADCMeasurements(int32_t ccadcMeasurement, uint16_t slowRCperiod)
{
uint8_t cellIndex;
// Sampling period dependant on configuration of CCADC sampling..
int32_t temp = ccadcMeasurement * (int32_t)slowRCperiod;
bool polChange = false;
if(temp < 0) {
temp = -temp;
polChange = true;
}
// Add 0.5*divisor to get proper rounding
temp += (1<<(CCGASG_ACC_SCALING-1));
temp >>= CCGASG_ACC_SCALING;
if(polChange) {
temp = -temp;
}
for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++)
{
AccumulatedCCADCvalue[cellIndex] += temp;
}
// If it was a charge, update the charge cycle counter
if(ccadcMeasurement <= 0) {
// If it was a discharge, AccumulatedCADCvalue can be negative, and that
// is "impossible", so set it to zero
for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++)
{
if(AccumulatedCCADCvalue[cellIndex] < 0)
{
AccumulatedCCADCvalue[cellIndex] = 0;
}
}
}
}
And this
uint32_t capacityInCCAccumulated[TOTAL_CELL];
void BATTPARAM_InitSramParameters() {
uint8_t cellIndex;
// Active current threshold in ticks
battParams_sram.activeCurrentThresholdInTicks = (uint16_t) BATTCUR_mA2Ticks(battParams.activeCurrentThreshold);
for (cellIndex = 0; cellIndex < TOTAL_CELL; cellIndex++)
{
// Full charge capacity in CC accumulated
battParams_sram.capacityInCCAccumulated[cellIndex] = (uint32_t) CCGASG_mAh2Acc(battParams.fullChargeCapacity);
}
// Terminate discharge limit in CC accumulated
battParams_sram.terminateDischargeLimit = CCGASG_mAh2Acc(battParams.terminateDischargeLimit);
// Values for remaining capacity calibration
GASG_CalculateRemainingCapacityValues();
}

would it be bad for the nand flash because they have limited number of
read/write cycle?
Yes, it's not a good idea to use flash for frequently modified data.
Reading from flash does not reduce its lifetime; erasing and writing do.
Reading and writing flash is substantially slower than conventional memory.
To write a single byte, a whole block has to be erased and rewritten.

Any kind of Flash is a bad idea to be used for frequently changing values:
limited number of erase/write cycles, see datasheet.
very slow erase/write (erase can be ~1s), see datasheet.
You need a special sequence to erase then write (no language support).
While erasing or writing, accesses to Flash are blocked at best; some devices require that the Flash not be accessed at all (undefined behaviour).
Flash cells cannot freely be written per-byte/word. Most have to be written per page (e.g. 64 bytes) and erased most times in much larger units (segments/blocks/sectors).
For NAND Flash, endurance is even more reduced compared to NOR Flash and the cells are less reliable (bits might flip occasionally or are defective), so you have to add error detection and correction. This is very likely a direction you should not go.
True EEPROM shares most issues, but they might be written byte/word-wise (internal erase).
Note that modern MCU-integrated "EEPROM" is most times also Flash. Some implementations just use slightly more reliable cells (about one decade more erase/write cycles than the program flash) and additional hardware allowing arbitrary byte/word write (automatic erase). But that is still not sufficient for frequent changes.
However, you should first verify whether your application can tolerate the lengthy write/erase times. Can you accept a process blocking that long, or rewrite your program accordingly? If the answer is "no", you should stop investigating in that direction. Otherwise you should calculate the number of updates over the expected lifetime and compare it to the information in the datasheet. There are also methods to reduce the number of erase cycles, but that leads too far.
If an external device (I2C/SPI) is an option, you could use a serial SRAM. The better (and likely cheaper) approach, though, would be to use a larger MCU or to think about a more efficient (i.e. less RAM, more code) way to store the data in SRAM.
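The lifetime estimate suggested above is a one-line calculation. The sketch below uses an invented function name, and the numbers in the note after it are placeholders; the real endurance figure has to come from the datasheet.

```c
/* Seconds until a cell reaches its rated erase/write endurance when it is
 * rewritten writes_per_second times per second. Returns a negative value
 * if the write rate is not positive. */
double endurance_lifetime_s(double rated_cycles, double writes_per_second)
{
    if (writes_per_second <= 0.0)
        return -1.0;
    return rated_cycles / writes_per_second;
}
```

With a placeholder endurance of 100000 cycles and one update per second, the result is 100000 seconds, a little over a day. That is why accumulators that change on every measurement do not belong in Flash or EEPROM.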

how to measure serial receive byte speed. eg bytes per second.

I'm receiving data byte by byte over serial at a baud rate of 115200. How do I calculate the bytes per second I'm receiving in a C program?

There are only 3 ways to measure bytes actually received per second.
The first way is to keep track of how many bytes you receive in a fixed length of time. For example, each time you receive bytes you might do counter += number_of_bytes, and then every 5 seconds you might do rate = counter/5; counter = 0;.
The second way is to keep track of how much time passed to receive a fixed number of bytes. For example, every time you receive one byte you might do temp = now(); rate = 1/(temp - previous); previous = temp;.
The third way is to combine both of the above. For example, each time you receive bytes you might do temp = now(); rate = number_of_bytes/(temp - previous); previous = temp;.
For all of the above, you end up with individual samples and not an average. To convert the samples into an average you'd need to do something like average = sum_of_samples / number_of_samples. The best way to do this (e.g. if you want nice/smooth looking graphs) would be to store a lot of samples; where you'd replace the oldest sample with a new sample and recalculate the average.
For example:
double sampleData[1024];
int nextSlot = 0;
double average;
void addSample(double value) {
double sum = 0;
sampleData[nextSlot] = value; // overwrite the oldest sample
nextSlot++;
if(nextSlot >= 1024) nextSlot = 0;
for(int i = 0; i < 1024; i++) sum += sampleData[i]; // sum all stored samples
average = sum/1024;
}
Of course the final thing (collecting the samples using one of the 3 methods, then finding the average) would need some fiddling to get the resolution how you want it.
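As a sketch of the first method (bytes counted over a fixed length of time), the meter below takes the timestamp from the caller so that any clock can be used. The struct and function names are illustrative, and it assumes a window length greater than zero.

```c
/* Fixed-window byte-rate meter. Time is passed in by the caller, in seconds. */
struct rate_meter {
    double window_s;      /* length of the measurement window in seconds */
    double window_start;  /* timestamp at which the current window began */
    unsigned long count;  /* bytes seen in the current window */
    double rate;          /* last completed sample, in bytes per second */
};

void rate_meter_init(struct rate_meter *m, double window_s, double now)
{
    m->window_s = window_s;
    m->window_start = now;
    m->count = 0;
    m->rate = 0.0;
}

/* Account for nbytes received at time `now`. Returns 1 when a window has
 * just completed and m->rate holds a fresh sample, otherwise 0. */
int rate_meter_add(struct rate_meter *m, unsigned long nbytes, double now)
{
    double elapsed = now - m->window_start;

    m->count += nbytes;
    if (elapsed < m->window_s)
        return 0;
    m->rate = m->count / elapsed;  /* divide by the time that really passed */
    m->count = 0;
    m->window_start = now;
    return 1;
}
```

Each completed window produces one sample, which can then be fed into addSample() above to smooth the result.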

Assuming you have fairly continuous input, just count the number of bytes you receive and, after some number of characters have been received, print out the elapsed time and the number of characters received over that time. You'll need a fairly good timestamp, or your error will probably be large. clock() may be one reasonable source, but the "best" option depends on what system you are on and how portable you want the code to be (serial comms tend not to be very portable anyway). Each time you print, reset the count.

To correct some odd comments in this thread about the theoretical maximum:
Around the time that 14400 Baud modems came to the pre-web world, the meaning of Baud changed from its original definition (wiki it) to match emerging digital technologies such as ISDN 64kbit. At that time, Baud came to mean bits/second.
With serial data in 8N1 format, a common shorthand notation, every byte is sent as one start bit, eight data bits, no parity bit, and one stop bit, i.e. 10 bits on the wire.
So the theoretical maximum for 8N1 serial at 115200 Baud (bits/sec) = 115200/(1+8+1) = 11520 bytes/sec.
Similar (but not the same) to watching your download speeds, the rough ball-park way to work out bytes/sec from bits/sec, without a calculator, is to divide by 10.
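The theoretical maximum can be computed for any framing with a small helper. This is a sketch with an invented function name; it counts the single start bit that every asynchronous serial frame carries in addition to the data, parity and stop bits.

```c
/* Upper bound on payload bytes per second for an asynchronous serial line.
 * A frame is 1 start bit + data bits + parity bits + stop bits. */
unsigned long uart_max_bytes_per_sec(unsigned long bits_per_sec, unsigned data_bits,
                                     unsigned parity_bits, unsigned stop_bits)
{
    unsigned frame_bits = 1 + data_bits + parity_bits + stop_bits;
    return bits_per_sec / frame_bits;
}
```

For 115200 bits/sec and 8N1 framing, uart_max_bytes_per_sec(115200, 8, 0, 1) gives 11520, which matches the divide-by-10 rule of thumb.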

Baud rate is a measurement of how many times per second a signal is able to change. In each of those cycles, depending on the modulation you are using, you can send one or more bits (if you are using no modulation, the bit rate is the same as the baud rate).
Let's say you are using QPSK modulation, so you can transmit/receive 2 bits per symbol. If you are receiving data at 115200 baud with 2 bits per symbol, you are receiving 115200 * 2 = 230400 bps.

How do I calculate network utilization for both transmit and receive

How do I calculate network utilization for both transmit and receive either using C or a shell script?
My system is embedded Linux. My current method is to record the bytes received (b1), wait 1 second, then record them again (b2). Then, knowing the link speed, I calculate the percentage of the receive bandwidth used:
receive utilization = (((b2 - b1)*8)/link_speed)*100
Is there a better method?

Check out open source programs that do something similar.
My search turned up a little tool called vnstat.
It tries to query the /proc file system, if available, and uses getifaddrs for systems that do not have it. It then fetches the correct AF_LINK interface, fetches the corresponding if_data struct and then reads out transmitted and received bytes, like this:
ifinfo.rx = ifd->ifi_ibytes;
ifinfo.tx = ifd->ifi_obytes;
Also remember that sleep() might sleep longer than exactly 1 second, so you should probably use a high resolution (wall clock) timer in your equation -- or you could delve into the if-functions and structures to see if you find anything appropriate for your task.
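To take the real elapsed time into account, the formula from the question can be written with the interval as a parameter. This is a minimal sketch: the function name is made up, and it assumes the caller measures elapsed_s with a monotonic clock (for example clock_gettime(CLOCK_MONOTONIC, ...)) around the two counter reads.

```c
#include <stdint.h>

/* Percentage of the link used between two byte-counter readings.
 * elapsed_s is the measured time between the readings in seconds;
 * link_bps is the link speed in bits per second. */
double utilization_percent(uint64_t b1, uint64_t b2, double elapsed_s, double link_bps)
{
    if (elapsed_s <= 0.0 || link_bps <= 0.0)
        return 0.0;
    /* Unsigned subtraction also gives the right delta across a 64-bit wrap. */
    double bits = (double)(b2 - b1) * 8.0;
    return bits / elapsed_s / link_bps * 100.0;
}
```

For example, 1250000 bytes received in one second on a 100 Mbit/s link come out at about 10 percent; if the sleep really took two seconds, the same byte count is about 5 percent.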

Thanks to 'csl' for pointing me in the direction of vnstat. Using the vnstat example, here is how I calculate network utilization.
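#define FP32 4294967295ULL
#define FP64 18446744073709551615ULL
/* Counter delta from a to b; if b < a the counter has wrapped (32- or 64-bit) */
#define COUNTERCALC(a,b) ( (b)>=(a) ? (b)-(a) : ( (a) > FP32 ? FP64-(a)+(b)+1 : FP32-(a)+(b)+1 ))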
int sample_time = 2; /* seconds */
int link_speed = 100; /* Mbits/s */
uint64_t rx, rx1, rx2;
float rate;
/*
* Either read:
* '/proc/net/dev'
* or
* '/sys/class/net/%s/statistics/rx_bytes'
* for bytes received counter
*/
rx1 = read_bytes_received("eth0");
sleep(sample_time); /* wait */
rx2 = read_bytes_received("eth0");
/* convert the byte delta to Mbit/s, then to a percentage of the link speed */
rx = COUNTERCALC(rx1, rx2);
rate = (rx*8)/(1000000.0f*sample_time);
float percent = (rate/(float)link_speed)*100;
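The snippet above leaves read_bytes_received() undefined. One possible implementation, given here as an assumption and not as the original author's code, reads the counter from the sysfs file mentioned in the comment. The helper read_counter_file is a name invented for this sketch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Read one unsigned decimal counter from a file such as
 * /sys/class/net/eth0/statistics/rx_bytes.
 * Returns 0 on success, -1 on error. */
int read_counter_file(const char *path, uint64_t *value)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%" SCNu64, value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical implementation of the read_bytes_received() helper used
 * above. Returns 0 if the counter cannot be read. */
uint64_t read_bytes_received(const char *ifname)
{
    char path[128];
    uint64_t value = 0;

    snprintf(path, sizeof path, "/sys/class/net/%s/statistics/rx_bytes", ifname);
    if (read_counter_file(path, &value) != 0)
        return 0;
    return value;
}
```

Returning 0 on failure keeps the signature used in the snippet, but a caller that needs to tell an unreadable counter from an idle interface should call read_counter_file() directly and check its return value.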