I am having trouble understanding the difference between the two code cases below. Case 1 works as expected and Case 2 does not.
Problem statement: I need to write a set of DWORDs to my device file and trigger a DMA. The DMA capacity is 128*4 bytes (128 DWORDs), so I want to trigger the DMA (using ioctl) after writing 128 DWORDs to the device file descriptor to make full use of that capacity. I can do this for individual DWORDs as well.
Basic difference between the two cases:
In the first case the intention is to write 128 DWORDs to the file at once; in the second case, to write the DWORDs individually and trigger the DMA after 128 DWORDs have been written.
Is the data written to the file the same in both cases? The second case is not working, so something is wrong there. Please help. By "not working" I mean that the expected result of the DMA commands does not occur, so in the second case the data in the file descriptor is apparently not the same as in the first case just before the DMA command.
Case 1 (WORKING)
int input_dwords[128] = {0xAABBCCDD, 0xBBCCDDAA, ....}; //128 DWORDS (actually this data is in
//text file just putting in array for
//illustration)
int* cmd_buf = (int*)malloc(sizeof(int) * 128); //space for 128 DWORDS
int* cur = cmd_buf;
for(int i = 0; i < 128; i++)
{
    *cur = input_dwords[i];
    cur++;
}
//Write to file in one shot
write(fd, cmd_buf, sizeof(int)*128);
//trigger DMA (ioctl)
trigger_dma(fd);
Case 2 (NOT WORKING)
int input_dwords[128] = {0xAABBCCDD, 0xBBCCDDAA, ....};
int* cur = (int*)input_dwords;
for(int i = 0; i < 128; i++)
{
    //writing to file one DWORD at a time.
    write(fd, cur, sizeof(int));
    cur++;
}
//trigger DMA (ioctl)
trigger_dma(fd);
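Neither case checks the return value of write(), so a short or failed write would go unnoticed. On a regular file or a pipe, both cases produce the same byte stream; a character device driver, however, sees each write() as a separate call and may handle it differently, which cannot be confirmed here without the driver source. The following is a minimal sketch of Case 2 with return-value checks. The helper names `write_all` and `write_dwords_individually` are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error or if write() makes no progress. */
static int write_all(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, try again */
            return -1;
        }
        if (n == 0)
            return -1;             /* no progress, give up */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Case 2 with checks: one DWORD per write() call. */
static int write_dwords_individually(int fd, const uint32_t *dwords, size_t count)
{
    assert(dwords != NULL);
    for (size_t i = 0; i < count; i++) {
        if (write_all(fd, &dwords[i], sizeof dwords[i]) != 0)
            return -1;
    }
    return 0;
}
```

If this version reports no error and the DMA still misbehaves, the difference most likely lies in how the driver treats multiple small writes, not in the user-space data.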
Our task is intended to demonstrate the benefit of using DMA to copy a large amount of data versus relying on the processor to directly handle the copying.
The processor is an STM32F407 on the ST discovery board.
In order to measure the copying time, a GPIO pin must be turned ON during copying and OFF once it has been copied.
The code appears to be functional but it is currently showing the CPU taking about 2.15ms to complete and DMA about 4.5ms, which is the opposite of what is intended. I'm not sure if there simply isn't enough data for the faster speed of DMA to offset the overhead in setting it up perhaps?
I have tried both copying elements of an array using the CPU and also using the memcpy function which seemed to yield very similar times.
The function code is shown below:
void DMASpeed(void)
{
#define elementNum 32000
int *ptr = NULL;
ptr = (int*)malloc(elementNum * sizeof(int));
int *ptr2 = NULL;
ptr2 = (int*)malloc(elementNum * sizeof(int));
for (int i = 0; i < elementNum; i++)
{
ptr[i] = 4;
}
LD5_GPIO_Port->BSRR = (uint32_t)LD5_Pin << 16U;
LD6_GPIO_Port->BSRR = (uint32_t)LD6_Pin << 16U;
// Initial value
// printf("BEFORE: dst = '%s'\n", dst);
// Transfer
printf("Initiate DMA Transfer...\n");
HAL_DMA_Start(&hdma_memtomem_dma2_stream0, (int)ptr, (int)ptr2, (elementNum * sizeof(int)));
LD5_GPIO_Port->BSRR = LD5_Pin;
printf("DMA Transfer initiated.\n");
// Poll for DMA completion
printf("Poll for DMA completion.\n");
HAL_DMA_PollForTransfer(&hdma_memtomem_dma2_stream0,
HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
LD5_GPIO_Port->BSRR = (uint32_t)LD5_Pin << 16U;
printf("DMA complete.\n");
// Print result
// printf("AFTER: dst = '%s'\n", dst);
free(ptr);
free(ptr2);
ptr = (int*)malloc(elementNum * sizeof(int));
ptr2 = (int*)malloc(elementNum * sizeof(int));
for (int i = 0; i < elementNum; i++)
{
ptr[i] = i;
}
printf("Initiate CPU Transfer...\n");
LD6_GPIO_Port->BSRR = LD6_Pin;
// for (int i = 0; i<512; i++)
// {
// ptr2[i] = ptr[i];
// }
memcpy(ptr2, ptr, (elementNum * sizeof(int)));
printf("CPU Transfer Complete.\n");
LD6_GPIO_Port->BSRR = (uint32_t)LD6_Pin << 16U;
free(ptr);
free(ptr2);
}
Thanks in advance for any assistance
You are trying to prove something that is not true. A DMA memory-to-memory transfer will always be slower than a direct CPU copy. DMA was not intended to be faster than the CPU; it is there to carry out the transfer in the background, without CPU activity. The core always has priority over the DMA.
There is another problem as well. Many STM devices have memory areas which are not accessible by the DMA (for example CCMRAM).
Remove the printf calls in the code segment below:
LD5_GPIO_Port->BSRR = LD5_Pin;
printf("DMA Transfer initiated.\n"); // <--Remove this
// Poll for DMA completion
printf("Poll for DMA completion.\n"); // <--Remove this
You are turning the pin ON and then printing a long text, which adds to your total measured time.
Remove all printf calls, or at least do not print anything between the pin toggles.
EDIT:
To be precise you are printing 50 characters in case of DMA transfer and 23 characters in case of CPU transfer.
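A minimal sketch of that structure, with all printing kept outside the measured region. It is written for a hosted C environment so it can be tried on a PC: `marker_on` and `marker_off` are hypothetical stand-ins for the GPIO set and reset writes in the question, and clock() replaces the oscilloscope.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-ins for the GPIO set/reset writes in the question. */
static clock_t t_start, t_stop;
static void marker_on(void)  { t_start = clock(); }
static void marker_off(void) { t_stop  = clock(); }

/* Time only the copy; all printing happens outside the marked region.
 * Returns the elapsed CPU time in seconds. */
static double timed_copy(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    printf("Initiate CPU transfer...\n");   /* before the marker */
    marker_on();
    memcpy(dst, src, len);                  /* the only work being measured */
    marker_off();
    printf("CPU transfer complete.\n");     /* after the marker */
    return (double)(t_stop - t_start) / CLOCKS_PER_SEC;
}
```

The same ordering applies to the DMA path: raise the pin immediately before starting the transfer, lower it right after the poll returns, and print afterwards.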
For those who search for "How to speed up DMA memory-to-memory transfer?", here is a piece of advice: force your toolchain to place all HAL code related to your DMA transfer in RAM, ideally in the RAM that is coupled exclusively to the core. The function code will then be copied to that RAM at startup, and all those functions will be called from RAM and run faster as a result. However, the same is true for copying "by hand".
In this case, it is recommended to place the following files/functions in RAM:
stm32[whatever]_hal_dma.c
DMA[N]_Stream[M]_IRQHandler(), where N and M are the numbers of your DMA and stream used for the transfer respectively.
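With GCC-style toolchains, one common way to do this per function is a section attribute. The sketch below assumes a section named `.RamFunc` that the linker script maps to RAM and that the startup code copies there; both the section name and the `RAM_FUNC` macro are assumptions to adapt to your project. `copy_words` is a made-up example of a "by hand" copy routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the linker script places ".RamFunc" in RAM and the startup
 * code copies it there. On other toolchains the macro expands to nothing. */
#if defined(__GNUC__) && defined(__ELF__)
#define RAM_FUNC __attribute__((noinline, section(".RamFunc")))
#else
#define RAM_FUNC
#endif

/* A hand-written word copy that is placed in the RAM section. */
RAM_FUNC void copy_words(uint32_t *dst, const uint32_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
}
```

Whether this helps depends on the flash wait states and caches of the specific part, so it is worth measuring before and after.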
I'm reverse engineering an ultrasound probe on the Linux side. I want to capture raw data from the probe. I'm programming in C and using the libusb API.
There are two BULK IN endpoints in the device (2 and 6). The device sends 2048 bytes of data, but it sends them as four blocks of 512 bytes.
This picture shows the data flow on the Windows side, which I want to reproduce on the Linux side. You can see four data blocks from endpoint 02 followed by four data blocks from endpoint 06.
But there is a problem with timing. The first data block of endpoint 02 and the first data block of endpoint 06 are close to each other in time, yet in the data flow they are not in sequence.
I see that the computer reads the first data blocks of endpoints 02 and 06, and after that it reads the other three data blocks of endpoint 02 and of endpoint 06. But the USB analyzer displays the data flow ordered by endpoint number, so the displayed sequence differs from the order in time.
On the Linux side, I write code like this:
int index = 0;
imageBuffer2 = (unsigned char *) malloc(2048);
imageBuffer6 = (unsigned char *) malloc(2048);
while (1) {
libusb_bulk_transfer(devh, BULK_EP_2, imageBuffer2, 2048, &actual2, 0);
libusb_bulk_transfer(devh, BULK_EP_6, imageBuffer6, 2048, &actual6, 0);
//Delay
for(index = 0; index <= 10000000; index ++)
{
}
}
The result is shown in the picture below.
In other words, in my code all the data is read in sequence, ordered by both time and endpoint number. My result differs from the data flow on the Windows side.
In brief, I have two BULK IN endpoints, and on Windows their reads start close together in time. How is that possible?
It's not clear to me whether you're using a different method for getting the data on Windows; I'm going to assume that you are.
I'm not an expert on libusb by any means, but my guess would be that you are overwriting your data with each call, since you're using the same buffer each time. Try filling your buffer with a fixed value before calling the transfer function, and then evaluate the result.
If that is the case, I believe something along the lines of the following would also work in C:
imageBuffer2 = (unsigned char *) malloc(2048);
unsigned char *imageBuffer2P = imageBuffer2;
imageBuffer6 = (unsigned char *) malloc(2048);
unsigned char *imageBuffer6P = imageBuffer6;
int dataRead2 = 0;
int dataRead6 = 0;
while(dataRead2 < 2048 || dataRead6 < 2048)
{
int actual2 = 0;
int actual6 = 0;
libusb_bulk_transfer(devh, BULK_EP_2, imageBuffer2P, 2048-dataRead2, &actual2, 200);
libusb_bulk_transfer(devh, BULK_EP_6, imageBuffer6P, 2048-dataRead6, &actual6, 200);
dataRead2 += actual2;
dataRead6 += actual6;
imageBuffer2P += actual2;
imageBuffer6P += actual6;
usleep(1);
}
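The accumulate-and-advance idea in the loop above can be tried without hardware. The sketch below is not libusb code: `bulk_in_fn` is a hypothetical callback type that mimics the buffer, length and `actual` arguments of libusb_bulk_transfer(), and `fake_bulk_in` is a stand-in device that delivers at most 512 bytes per call, like the blocks described in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transfer callback: fills buf with up to len bytes, stores
 * the byte count in *actual, returns 0 on success. */
typedef int (*bulk_in_fn)(void *ctx, unsigned char *buf, int len, int *actual);

/* Collect exactly total bytes, advancing the write position after each
 * partial transfer. Returns 0 on success, the callback's error code on
 * failure, or -1 if a transfer delivers no data. */
static int read_fully(bulk_in_fn xfer, void *ctx, unsigned char *buf, int total)
{
    int done = 0;
    assert(xfer != NULL && buf != NULL);
    while (done < total) {
        int actual = 0;
        int rc = xfer(ctx, buf + done, total - done, &actual);
        if (rc != 0)
            return rc;
        if (actual == 0)
            return -1;             /* no progress, avoid spinning forever */
        done += actual;
    }
    return 0;
}

/* Stand-in device: delivers at most 512 bytes per call, each byte set to
 * the index of the block it belongs to. ctx points to the block counter. */
static int fake_bulk_in(void *ctx, unsigned char *buf, int len, int *actual)
{
    int *block = ctx;
    int n = len < 512 ? len : 512;
    for (int i = 0; i < n; i++)
        buf[i] = (unsigned char)*block;
    (*block)++;
    *actual = n;
    return 0;
}
```

Note that a sequential loop like this still reads one endpoint at a time; it only guarantees that a 2048-byte image is assembled from its four blocks without overwriting earlier ones.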
I'm continuously sending 2D arrays of pixel values (uint32) from LabVIEW to a C program through TCP/IP at a resolution of 160x120. The purpose of the C program is to display the received pixel values as 2D arrays in the console application. I'm sending the pixels as a stream of bytes and using the recv function in Ws2_32.lib to receive the bytes in the C program. Then I'm converting the bytes to uint32 values and displaying them in the console application as 2D arrays, so every 2D array represents an image.
I have an issue with the frame rate though. I'm able to send 30 frames per second in LabVIEW, but when I open the TCP/IP connection with the C program, the frame rate goes down to 1 frame per second. It must be an issue with the C program, since I managed to send the desired frames per second with the same LabVIEW program to a corresponding C# program.
The C-code:
#define DEFAULT_BUFLEN 256
#define IMAGEX 120
#define IMAGEY 160
WSADATA wsa;
SOCKET s , new_socket;
struct sockaddr_in server , client;
int c;
int iResult;
char recvbuf[DEFAULT_BUFLEN];
int recvbuflen = DEFAULT_BUFLEN;
typedef unsigned int uint32_t;
unsigned int x=0,y=0,i,n;
uint32_t image[IMAGEX][IMAGEY];
size_t len;
uint32_t* p;
p = (uint32_t*)recvbuf;
do
{
iResult = recv(new_socket, recvbuf, recvbuflen, 0);
len = iResult/sizeof(uint32_t);
for(i=0; i < len; i++)
{
image[x][y] = p[i];
x++;
if (x >= IMAGEX)
{
x=0;
y++;
}
if (y >= IMAGEY)
{
y = 0;
x = 0;
//print image
for (n=0; n< IMAGEX*IMAGEY; n++)
{
printf("%d",image[n%IMAGEX][n/IMAGEY]);
if (n % IMAGEX)
{
printf(" ");
}
else
{
printf("\n");
}
}
}
}
} while ( iResult > 0 );
Try reducing the prints. Since you are reading and printing in the same thread, the TCP receive buffer will fill up and apply back pressure to the other end (LabVIEW), which will stop sending data until it gets the green light from your C program.
To start with you can debug by replacing this
for (n=0; n< IMAGEX*IMAGEY; n++)
{
printf("%d",image[n%IMAGEX][n/IMAGEY]);
if (n % IMAGEX)
{
printf(" ");
}
else
{
printf("\n");
}
}
with
printf("One frame recv\n");
and see if it makes any difference. I am assuming your TCP connection has ample bandwidth.
Very hard to diagnose without further information. I can give a few suggestions, however.
First of all, your recv call is using a small buffer, so you are spending a lot of time calling it. Why not read a whole frame at a time? Also, you read in the data and then copy it to the image array. Wouldn't it be simpler to just use the image array itself? Combining those two suggestions would have recv reading a full frame directly into the image array, saving a lot of time.
Another source of the problem could be the console. With the sample code you provided, you are attempting to write 30*120*160 = 576,000 integer values per second to the terminal. If the average value, with delimiter, takes up 8 characters, that's about 4.6 million characters per second. It's entirely possible that the display just can't go that fast, in which case things would back up and slow down all the way to the server writing to the socket.
There are several ways to handle this, but it's too much to go into here.
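As one illustration of the first suggestion, here is a minimal sketch that receives a whole frame directly into the image array. It uses POSIX sockets so it can be tried outside Windows; Winsock's recv() has the same call shape but takes a SOCKET and an int length, so the types would need adapting. The names `recv_all` and `recv_frame` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

#define IMAGEX 120
#define IMAGEY 160

/* Receive exactly len bytes. Returns 0 on success, -1 on error or if the
 * peer closes the connection before len bytes have arrived. */
static int recv_all(int sock, void *buf, size_t len)
{
    unsigned char *p = buf;
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = recv(sock, p, len, 0);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read one full frame straight into the image array, with no intermediate
 * buffer and no per-pixel copy. */
static int recv_frame(int sock, uint32_t image[IMAGEX][IMAGEY])
{
    return recv_all(sock, image, sizeof(uint32_t) * IMAGEX * IMAGEY);
}
```

This keeps the byte layout of the original code, which copies received bytes into uint32 values unchanged, so any byte-order handling would still have to be added separately.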
I'm developing on an AD Blackfin BF537 DSP running uClinux. I have a total of 32MB SD-RAM available. I have an ADC attached, which I can access using a simple, blocking call to read().
The most interesting part of my code is below. Running the program seems to work just fine, I get a nice data package that I can fetch from the SD-card and plot. However, if I comment out the float calculation part (as noted in the code), I get only zeroes in the ft_all.raw file. The same occurs if I change optimization level from -O3 to -O0.
I've tried countless combinations of all sorts of things, and sometimes it works, sometimes it does not - earlier (with minor modifications to below), the code would only work when optimization was disabled. It may also break if I add something else further down in the file.
My suspicion is that the data transferred by the read() function may not have been transferred fully (is that possible, even though it returns the correct number of bytes?). This is also the first time I have initialized pointers using direct memory addresses, and I have no idea how the compiler reacts to this. Perhaps I missed something here?
I've spent days on this issue now, and I'm getting desperate - I would really appreciate some help on this one! Thanks in advance.
// Clear the top 16M memory for data processing
memset((int *)0x01000000,0x0000,(size_t)SIZE_16M);
/* Prep some pointers for data processing */
int16_t *buffer;
int16_t *buf16I, *buf16Q;
buffer = (int16_t *)(0x1000000);
buf16I = (int16_t *)(0x1600000);
buf16Q = (int16_t *)(0x1680000);
/* Read data from ADC */
int rbytes = read(Sportfd, (int16_t*)buffer, 0x200000);
if (rbytes != 0x200000) {
printf("could not sample data! %X\n",rbytes);
goto end;
} else {
printf("Read %X bytes\n",rbytes);
}
FILE *outfd;
int wbytes;
/* Commenting this region results in all zeroes in ft_all.raw */
float a,b;
int c;
b = 0;
for (c = 0; c < 1000; c++) {
a = c;
b = b+pow(a,3);
}
printf("b is %.2f\n",b);
/* Only 12 LSBs of each 32-bit word is actual data.
* First 20 bits of nothing, then 12 bits I, then 20 bits
* nothing, then 12 bits Q, etc...
* Below, the I and Q parts are scaled with a factor of 16
* and extracted to buf16I and buf16Q.
* */
int32_t *buf32;
buf32 = (int32_t *)buffer;
uint32_t i = 0;
uint32_t n = 0;
while (n < 0x80000) {
buf16I[i] = buf32[n] << 4;
n++;
buf16Q[i] = buf32[n] << 4;
i++;
n++;
}
printf("Saving to /mnt/sd/d/ft_all.raw...");
outfd = fopen("/mnt/sd/d/ft_all.raw", "w+");
if (outfd == NULL) {
printf("Could not open file.\n");
}
wbytes = fwrite((int*)0x1600000, 1, 0x100000, outfd);
fclose(outfd);
if (wbytes < 0x100000) {
printf("wbytes not correct (= %d) \n", (int)wbytes);
}
printf(" done.\n");
Edit: The code seems to work perfectly well if I use read() to read data from a simple file rather than the ADC. This leads me to believe that the rather hacky-looking code when extracting the I and Q parts of the input is working as intended. Inspecting the assembly generated by the compiler confirms this.
I'm trying to get in touch with the developer of the ADC driver to see if he has an explanation of this behaviour.
The ADC is connected through a SPORT, and is opened as such:
sportfd = open("/dev/sport1", O_RDWR);
ioctl(sportfd, SPORT_IOC_CONFIG, spconf);
And here are the options used when configuring the SPORT:
spconf->int_clk = 1;
spconf->word_len = 32;
spconf->serial_clk = SPORT_CLK;
spconf->fsync_clk = SPORT_CLK/34;
spconf->fsync = 1;
spconf->late_fsync = 1;
spconf->act_low = 1;
spconf->dma_enabled = 1;
spconf->tckfe = 0;
spconf->rckfe = 1;
spconf->txse = 0;
spconf->rxse = 1;
A bfin_sport.h file from Analog Devices is also included: https://gist.github.com/tausen/5516954
Update
After a long night of debugging with the previous developer on the project, it turned out the issue was not related to the code shown above at all. As Chris suggested, it was indeed an issue with the SPORT driver and the ADC configuration.
While debugging, this error message appeared whenever the data was "broken": bfin_sport: sport ffc00900 status error: TUVF. While this doesn't make much sense in the application, it was clear from printing the data that something was out of sync: the data in buffer was of the form 0x12000000,0x34000000,... rather than 0x00000012,0x00000034,... whenever the status error was shown. It seems clear, then, why buf16I and buf16Q only contained zeroes (since I am extracting the 12 LSBs).
Putting in a few calls to usleep() between stages of ADC initialization and configuration seems to have fixed the issue - I'm hoping it stays that way!
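A cheap runtime check could catch this condition before the data is processed. The sketch below assumes, as in the example values above, that correctly aligned words have all upper 20 bits clear; the function name `count_out_of_range_words` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count words that have bits set outside the 12 least significant bits.
 * With correctly aligned samples (0x00000012, ...) the result is 0; with
 * shifted data (0x12000000, ...) every nonzero word is counted. */
static size_t count_out_of_range_words(const uint32_t *words, size_t count)
{
    size_t bad = 0;
    assert(words != NULL);
    for (size_t i = 0; i < count; i++) {
        if (words[i] & ~UINT32_C(0xFFF))
            bad++;
    }
    return bad;
}
```

Running this on the buffer right after read() and retrying the acquisition when the count is nonzero would make a recurrence visible instead of silently producing zeroes.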
The Overview
I am using the low-level calls in the libbzip2 library: BZ2_bzCompressInit(), BZ2_bzCompress() and BZ2_bzCompressEnd() to compress chunks of data to standard output.
I am migrating working code from higher-level calls, because I have a stream of bytes coming in and I want to compress those bytes in sets of discrete chunks (a discrete chunk is a set of bytes that contains a group of tokens of interest — my input is logically divided into groups of these chunks).
A complete group of chunks might contain, say, 500 chunks, which I want to compress to one bzip2 stream and write to standard output.
Within a set, using the pseudocode I outline below, if my example buffer is able to hold 101 chunks at a time, I would open a new stream, compress 500 chunks in runs of 101, 101, 101, 101, and one final run of 96 chunks that closes the stream.
The Problem
The issue is that my bz_stream structure instance, which keeps track of the number of compressed bytes in a single pass of the BZ2_bzCompress() routine, seems to claim to be writing more compressed bytes than the total bytes in the final, compressed file.
For example, the compressed output could be a file with a true size of 1234 bytes, while the number of reported compressed bytes (which I track while debugging) is somewhat higher than 1234 bytes (say 2345 bytes).
My rough pseudocode is in two parts.
The first part is a rough sketch of what I do to compress a subset of chunks (and I know that I have another subset coming after this one):
bz_stream bzStream;
unsigned char bzBuffer[BZIP2_BUFFER_MAX_LENGTH] = {0};
unsigned long bzBytesWritten = 0UL;
unsigned long long cumulativeBytesWritten = 0ULL;
unsigned char myBuffer[UNCOMPRESSED_MAX_LENGTH] = {0};
size_t myBufferLength = 0;
/* initialize bzStream */
bzStream.next_in = NULL;
bzStream.avail_in = 0U;
bzStream.avail_out = 0U;
bzStream.bzalloc = NULL;
bzStream.bzfree = NULL;
bzStream.opaque = NULL;
int bzError = BZ2_bzCompressInit(&bzStream, 9, 0, 0);
/* bzError checking... */
do
{
/* read some bytes into myBuffer... */
/* compress bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do
{
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
/* error checking... */
bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
cumulativeBytesWritten += bzBytesWritten;
/* write compressed data in bzBuffer to standard output */
fwrite(bzBuffer, 1, bzBytesWritten, stdout);
fflush(stdout);
}
while (bzError == BZ_OK);
}
while (/* while there is a non-final myBuffer full of discrete chunks left to compress... */);
Now we wrap up the output:
/* read in the final batch of bytes into myBuffer (with a total byte size of `myBufferLength`... */
/* compress remaining myBufferLength bytes in myBuffer */
bzStream.next_in = myBuffer;
bzStream.avail_in = myBufferLength;
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
do
{
bzStream.next_out = bzBuffer;
bzStream.avail_out = BZIP2_BUFFER_MAX_LENGTH;
bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);
/* bzError error checking... */
/* increment cumulativeBytesWritten by `bz_stream` struct `total_out_*` members */
bzBytesWritten = ((unsigned long) bzStream.total_out_hi32 << 32) + bzStream.total_out_lo32;
cumulativeBytesWritten += bzBytesWritten;
/* write compressed data in bzBuffer to standard output */
fwrite(bzBuffer, 1, bzBytesWritten, stdout);
fflush(stdout);
}
while (bzError != BZ_STREAM_END);
/* close stream */
bzError = BZ2_bzCompressEnd(&bzStream);
/* bzError checking... */
The Questions
Am I calculating cumulativeBytesWritten (or, specifically, bzBytesWritten) incorrectly, and how would I fix that?
I have been tracking these values in a debug build, and I do not seem to be "double counting" the bzBytesWritten value. This value is counted and used once to increment cumulativeBytesWritten after each successful BZ2_bzCompress() pass.
Alternatively, am I not understanding the correct use of the bz_stream state flags?
For example, does the following compress and keep the bzip2 stream open, so long as I keep sending some bytes?
bzError = BZ2_bzCompress(&bzStream, BZ_RUN);
Likewise, can the following statement compress data so long as at least some bytes are available from the bzStream.next_in pointer (BZ_RUN), and then wrap up the stream when there are no more bytes available (BZ_FINISH)?
bzError = BZ2_bzCompress(&bzStream, (bzStream.avail_in) ? BZ_RUN : BZ_FINISH);
Or, am I not using these low-level calls correctly at all? Should I go back to using the higher-level calls to continuously append a grouping of compressed chunks of data to one main file?
There's probably a simple solution to this, but I've been banging my head on the table for a couple days in the course of debugging what could be wrong, and I'm not making much progress. Thank you for any advice.
In answer to my own question, it appears I am miscalculating the number of bytes written. I should not use the total_out_* members. The following correction works properly:
bzBytesWritten = sizeof(bzBuffer) - bzStream.avail_out;
The rest of the calculations follow.
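To see why the original calculation overcounts: the total_out_* members hold a running total since the stream was initialized, so adding them to cumulativeBytesWritten after every call sums the running totals. The sketch below illustrates the arithmetic only. `struct out_state` is a simplified stand-in for the relevant bz_stream fields, not the real structure, and `simulate_call` is a made-up helper that mimics how those fields change after a call.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the two bz_stream fields that matter here. */
struct out_state {
    unsigned int  avail_out;   /* free space left in the output buffer */
    unsigned long total_out;   /* running total since stream init */
};

/* Mimic one compress call that emits `produced` bytes into a freshly
 * reset output buffer of buf_size bytes. */
static void simulate_call(struct out_state *s, unsigned int buf_size,
                          unsigned int produced)
{
    assert(s != NULL && produced <= buf_size);
    s->avail_out = buf_size - produced;
    s->total_out += produced;
}

/* Bytes produced by the most recent call only (the corrected formula). */
static unsigned int bytes_this_call(const struct out_state *s,
                                    unsigned int buf_size)
{
    assert(s != NULL);
    return buf_size - s->avail_out;
}
```

For three calls that emit 100, 200 and 50 bytes, summing the per-call values gives 350, which matches the file size, while summing the running totals gives 100 + 300 + 350 = 750.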