I am writing a C library that reads a binary file format. I don't control the binary format; it's produced by a proprietary data acquisition program and is relatively complicated. As it is one of my first forays into C programming and binary file parsing, I am having a bit of trouble figuring out how to structure the code for testing and portability.
For testing purposes, I thought the easiest course of action was to build the library to read an arbitrary byte stream. But I ended up implementing a stream data type that encapsulates the type of stream (memstream, filestream, etc). The interface has functions like stream_read_uint8 such that the client code doesn't have to know anything about where the bytes are coming from. My tests are against a memstream, and the filestream stuff is essentially just a wrapper around FILE* and fread, etc.
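Concretely, the interface looks roughly like this (a simplified sketch; apart from stream_read_uint8, the names here are illustrative rather than my exact code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct stream stream;   /* opaque; backed by a memstream or a filestream */

stream *stream_open_file(FILE *fp);
stream *stream_open_memory(const uint8_t *buf, size_t len);
void    stream_close(stream *s);

/* Reads one byte; returns false on end of stream or error. */
bool stream_read_uint8(stream *s, uint8_t *out);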
From an OOP perspective, I think this is a reasonable design. However, I get the feeling that I am stuffing the wrong paradigm into the language and ending up with overly abstracted, overly complicated code as a result.
So my question: is there a simpler, more idiomatic way to do binary format reading in plain C while preserving automated tests?
Note: I realize that FILE* is essentially an abstract stream interface. But the implementation of memory streams (fmemopen) is non-standard and I want Standard C for portability.
What you describe is low-level I/O functionality. Since fmemopen() is not 100% portable (off Linux, it creaks, I suspect), you need to provide yourself with a portable layer of your own that is sufficiently close that you can use your surrogate functions (only) when necessary and the native functions when possible. Of course, you should be able to force the use of your functions even in your native habitat so that you can test your code.
This code can be tested with known data to ensure that you pick up all the bytes in the input streams and can faithfully return them. If the raw data is in a specific endianness, you can ensure that your 'larger' readers (hypothetically, functions such as stream_read_uint2(), stream_read_uint4(), stream_read_string(), etc.) all behave appropriately. For this phase, you don't really need the actual data; you can manufacture data to suit yourself and your testing.
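For example, the multi-byte readers can be built on top of the single-byte reader so that the endian handling lives (and is tested) in exactly one place. A sketch, assuming the hypothetical stream interface from the question and a little-endian file format:

#include <stdbool.h>
#include <stdint.h>

/* stream and stream_read_uint8() are the hypothetical interface from the question. */
typedef struct stream stream;
bool stream_read_uint8(stream *s, uint8_t *out);

/* 2-byte little-endian read, built on the byte reader. */
bool stream_read_uint2(stream *s, uint16_t *out)
{
    uint8_t lo, hi;
    if (!stream_read_uint8(s, &lo) || !stream_read_uint8(s, &hi))
        return false;
    *out = (uint16_t)(lo | ((uint16_t)hi << 8));
    return true;
}

/* 4-byte little-endian read, built on the 2-byte reader. */
bool stream_read_uint4(stream *s, uint32_t *out)
{
    uint16_t lo, hi;
    if (!stream_read_uint2(s, &lo) || !stream_read_uint2(s, &hi))
        return false;
    *out = lo | ((uint32_t)hi << 16);
    return true;
}

A memstream over a handful of hand-written bytes is then enough to pin down the byte-order behaviour once and for all.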
Once you've got that in place, you will also need to write the code that reads the data with those larger types, and ensure that these higher-level functions can actually interpret the binary data accurately and invoke appropriate actions. For this, you finally need examples of the format as actually supplied; up until this phase you probably could get away with data you manufactured. But once you're reading the actual files, you need examples of those to work on, or you'll have to manufacture them from your understanding and test as best you can. How easy this is depends on how clearly documented the binary format is.
One of the key testing and debugging tools will be canonical 'dump' functions that can present data for you. The scheme I use is:
extern void dump_XyzType(FILE *fp, const char *tag, const XyzType *data);
The stream is self-evident; usually it is stderr, but by making it an argument, you can send the data to any open file. The tag is included in the information printed; it should be unique enough to identify the call site. The last argument is a pointer to the data type, which the function analyzes and prints. You should take the opportunity to assert every validity check you can think of, to head off problems.
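For a hypothetical XyzType, such a function might look like this (a sketch; the fields are invented purely for illustration):

#include <assert.h>
#include <stdio.h>

typedef struct XyzType { int count; double values[4]; } XyzType;   /* invented example type */

void dump_XyzType(FILE *fp, const char *tag, const XyzType *data)
{
    assert(fp != NULL && tag != NULL && data != NULL);
    assert(data->count >= 0 && data->count <= 4);        /* validity checks */

    fprintf(fp, "XyzType: %s -- address %p\n", tag, (const void *)data);
    fprintf(fp, "  count %d, values:", data->count);
    for (int i = 0; i < data->count; i++)
        fprintf(fp, " %g", data->values[i]);
    putc('\n', fp);
}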
You can extend the interface with the parameters const char *file, int line, const char *func and arrange to add __FILE__, __LINE__ and __func__ to the calls. I've never quite needed it, but if I were to do it, I'd use:
#define DUMP_XyzType(fp, tag, data) \
    dump_XyzType(fp, tag, data, __FILE__, __LINE__, __func__)
As an example, I deal with a type DATETIME, so I have a function
extern void dump_datetime(FILE *fp, const char *tag, const ifx_dtime_t *dp);
One of the tests I was using this week could be persuaded to dump a datetime value, and it gave:
DATETIME: Input value -- address 0x7FFF2F27CAF0
Qualifier: 3594 -- type DATETIME YEAR TO SECOND
DECIMAL: +20120913212219 -- address 0x7FFF2F27CAF2
E: +7, S = 1 (+), N = 7, M = 20 12 09 13 21 22 19
You might or might not be able to see the value 2012-09-13 21:22:19 in there. Interestingly, this function itself calls another function in the family, dump_decimal(), to print out the decimal value. One year, I'll upgrade the qualifier print to include the hex version, which is a lot easier to read: 3594 is 0x0E0A, which is readily understandable by those in the know as 14 digits (E), starting with YEAR (the second 0) and ending with SECOND (A), none of which is obvious from the decimal version. Of course, the same information is in the type string: DATETIME YEAR TO SECOND. (The decimal format is somewhat inscrutable to the outsider, but pretty clear to an insider who knows there's an exponent (E), a sign (S), a number of (centesimal) digits (N = 7), and the actual digits (M = ...). Yes, the name decimal is strictly a misnomer, as it uses a base-100 or centesimal representation.)
The test doesn't produce that level of detail by default, but I simply had to run it with a high-enough level of debugging set (by command line option). I'd regard that as another valuable feature.
The quietest way of running the tests produces:
test.bigintcvasc.......PASS (phases: 4 of 4 run, 4 pass, 0 fail)(tests: 92 run, 89 pass, 3 fail, 3 expected failures)
test.deccvasc..........PASS (phases: 4 of 4 run, 4 pass, 0 fail)(tests: 60 run, 60 pass, 0 fail)
test.decround..........PASS (phases: 1 of 1 run, 1 pass, 0 fail)(tests: 89 run, 89 pass, 0 fail)
test.dtcvasc...........PASS (phases: 25 of 25 run, 25 pass, 0 fail)(tests: 97 run, 97 pass, 0 fail)
test.interval..........PASS (phases: 15 of 15 run, 15 pass, 0 fail)(tests: 178 run, 178 pass, 0 fail)
test.intofmtasc........PASS (phases: 2 of 2 run, 2 pass, 0 fail)(tests: 12 run, 8 pass, 4 fail, 4 expected failures)
test.rdtaddinv.........PASS (phases: 3 of 3 run, 3 pass, 0 fail)(tests: 69 run, 69 pass, 0 fail)
test.rdtimestr.........PASS (phases: 1 of 1 run, 1 pass, 0 fail)(tests: 16 run, 16 pass, 0 fail)
test.rdtsub............PASS (phases: 1 of 1 run, 1 pass, 0 fail)(tests: 19 run, 15 pass, 4 fail, 4 expected failures)
Each program identifies itself, its status (PASS or FAIL) and summary statistics. I've been hunting and fixing one bug, and the 'expected failures' are other bugs I found coincidentally along the way; that should be a temporary state of affairs, and it allows me to claim legitimately that the tests are all passing. If I wanted more detail, I could run any of the tests, with any of the phases (sub-sets of the tests which are somewhat related, though the 'somewhat' is actually arbitrary), and see the results in full, etc. As shown, it takes less than a second to run that set of tests.
I find this helpful where there are repetitive calculations - but I've had to calculate or verify the correct answer for every single one of those tests at some point.
Related
This idea has been floating around in my head for 3 years and I am having problems applying it.
I wanted to create a compression algorithm that cuts the file size in half,
e.g. 8 MB to 4 MB.
With some searching and some programming experience, I understood the following.
Let's take a .txt file with the letters (a, b, c, d).
Using the IO.File.ReadAllBytes function, it gives the following array of bytes: (97 | 98 | 99 | 100), which according to this: https://en.wikipedia.org/wiki/ASCII#ASCII_control_code_chart are the decimal values of the letters.
What I thought about was: how to mathematically cut this 4-member array down to a 2-member array by combining each pair of members into a single member. But you can't simply combine two numbers mathematically and then reverse them back, because there are many possibilities, e.g.
80 | 90: 80+90=170, but there is no way to know that 170 was the result of 80+90 rather than 100+70 or 110+60.
And even if you could overcome that, you would be limited by the maximum value of a byte (255) for a single member of the array.
I understand that most compression algorithms work at the binary level and have been successful, but imagine cutting a file's size in half. I would like to hear your ideas on this.
Best Regards.
It's impossible to make a compression algorithm that makes every file shorter. The proof is called the "counting argument", and it's easy:
There are 256^L possible files of length L.
Let's say there are N(L) possible files with length < L.
If you do the math (N(L) = 256^0 + 256^1 + ... + 256^(L-1), a geometric series), you find that 256^L = 255*N(L) + 1.
So. You obviously cannot compress every file of length L, because there just aren't enough shorter files to hold them uniquely. If you made a compressor that always shortened a file of length L, then MANY files would have to compress to the same shorter file, and of course you could only get one of them back on decompression.
In fact, there are more than 255 times as many files of length L as there are shorter files, so you can't even compress most files of length L. Only a small proportion can actually get shorter.
This is explained pretty well (again) in the comp.compression FAQ:
http://www.faqs.org/faqs/compression-faq/part1/section-8.html
EDIT: So maybe you're now wondering what this compression stuff is all about...
Well, the vast majority of those "all possible files of length L" are random garbage. Lossless data compression works by assigning shorter representations (the output files) to the files we actually use.
For example, Huffman encoding works character by character and uses fewer bits to write the most common characters. "e" occurs in text more often than "q", for example, so it might spend only 3 bits to write "e"s, but 7 bits to write "q"s. Bytes that hardly ever occur, like character 131, may be written with 9 or 10 bits, longer than the 8-bit bytes they came from. On average you can compress simple English text by almost half this way.
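The first step of a Huffman coder is just a byte-frequency count; a minimal sketch of that step (not a complete coder):

#include <stddef.h>
#include <stdio.h>

/* Count how often each byte value occurs in a buffer.  A Huffman coder
   would then assign the shortest codes to the most frequent bytes. */
void print_byte_frequencies(const unsigned char *buf, size_t len)
{
    unsigned long freq[256] = {0};

    for (size_t i = 0; i < len; i++)
        freq[buf[i]]++;

    for (int b = 0; b < 256; b++)
        if (freq[b] != 0)
            printf("byte %3d: %lu occurrences\n", b, freq[b]);
}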
LZ and similar compressors (like PKZIP, etc) remember all the strings that occur in the file, and assign shorter encodings to strings that have already occurred, and longer encodings to strings that have not yet been seen. This works even better since it takes into account more information about the context of every character encoded. On average, it will take fewer bits to write "boy" than "boe", because "boy" occurs more often, even though "e" is more common than "y".
Since it's all about predicting the characteristics of the files you actually use, it's a bit of a black art, and different kinds of compressors work better or worse on different kinds of data -- that's why there are so many different algorithms.
I have a file on my system with 1024 rows,
student_db.txt
Name Subject-1 Subject-2 Subject-3
----- --------- --------- ---------
Alex 98 90 80
Bob 87 95 73
Mark 90 83 92
.... .. .. ..
.... .. .. ..
I have an array of structures in my C code,
typedef struct
{
    char name[10];
    int sub1;
    int sub2;
    int sub3;
} student_db;

student_db stud_db[1024];
What is an efficient way to read this file and map it into this array of structures?
If the number of entries were small, we could go for a plain fgets in a while loop with strtok, but here the number of entries is 1024.
So please suggest an efficient way to do this task.
Check the size of the file; I think it is at most about 100 KB. That is practically nothing: even a poorly written PHP script can read it in a few milliseconds. There is no slow way to load such a small quantity of data.
I assume loading this file is only the first step; the real task will be processing the list (searching, filtering, etc.). Instead of optimizing the loading speed, you should focus on the processing speed.
Premature optimization is evil. Write working, unoptimized code and see whether you're satisfied with the result and the speed. You probably will never need to optimize it.
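For reference, a working unoptimized version can be as simple as an fgets/sscanf loop over the stud_db array from the question. A sketch (skipping two header lines and the exact error handling are assumptions about your file):

#include <stdio.h>

/* Sketch: read student_db.txt into out[] (student_db as defined in the question).
   Returns the number of records actually read. */
size_t load_students(const char *path, student_db *out, size_t max)
{
    FILE *fp = fopen(path, "r");
    char line[256];
    size_t n = 0;

    if (fp == NULL)
        return 0;

    /* skip the "Name Subject-1 ..." header line and the dashed underline */
    if (fgets(line, sizeof(line), fp) == NULL || fgets(line, sizeof(line), fp) == NULL) {
        fclose(fp);
        return 0;
    }

    while (n < max && fgets(line, sizeof(line), fp) != NULL) {
        if (sscanf(line, "%9s %d %d %d",
                   out[n].name, &out[n].sub1, &out[n].sub2, &out[n].sub3) == 4)
            n++;
    }

    fclose(fp);
    return n;
}

Usage would be something like: size_t count = load_students("student_db.txt", stud_db, 1024);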
You can try to store the data in binary form. The way the file is written now is as text, so everything (numbers, strings and other things) is represented as characters.
If you store things in binary, you store numbers, strings and other data in their in-memory representation, so when you write an integer to the file you write its raw bytes. If you open the file in a text editor later, you will see the characters whose codes in, e.g., ASCII happen to match the values you wrote to the file.
You use standard functions to store things in a binary way:
fopen("test.bin","wb");
Here w stands for write and b stands for binary format. The function to write with is:
fwrite(&someInt, sizeof(someInt), 1, someFile);
where someInt is the variable you want to write (the function takes a pointer to it),
sizeof(someInt) is the size of the element,
1 is the number of elements (if the first argument is an array), and someFile is the FILE* where you want to store your data.
This way the size of the file can be reduced, so loading the file will also be faster. It is also simpler to process the data in the file.
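For example, reusing the student_db array from the question above, the whole table can be written and read back in one call each. A sketch; note that the binary layout depends on your compiler's struct padding and the machine's endianness, so such files are not portable between platforms:

#include <stdio.h>

/* Write count records of the array to a binary file. */
void save_binary(const char *path, const student_db *db, size_t count)
{
    FILE *fp = fopen(path, "wb");
    if (fp != NULL) {
        fwrite(db, sizeof(db[0]), count, fp);
        fclose(fp);
    }
}

/* Read up to max records back; returns the number actually read. */
size_t load_binary(const char *path, student_db *db, size_t max)
{
    size_t n = 0;
    FILE *fp = fopen(path, "rb");
    if (fp != NULL) {
        n = fread(db, sizeof(db[0]), max, fp);
        fclose(fp);
    }
    return n;
}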
I have a test.csv file (300 lines) as below:
10 20 100 2 5 4 5 7 9 10 ....
55 600 7000 500 25
3 10
2 5 6
....
Each line has a different number of integers (the maximum number of records is 1000) and I need to process these records line by line. I tried as below:
integer,dimension(1000)::rec
integer::i,j
open(unit=5,file="test.csv",status="old",action="read")
do i=1,300
   read(unit=5,fmt=*) (rec(j),j=1,1000)
   !do some procedure with rec
enddo
close(unit=5)
But it seems that the rec array is not being filled line by line. That is, when i=n, rec gets its numbers from lines other than the nth line. How can I solve this problem?
Thank you.
List directed formatting (as specified by the star in the read statement) reads what it needs to satisfy the list (hence it is "list directed"). As shown, your code will try and read 1000 values each iteration, consuming as many records (lines) as required each iteration in order to do that.
(List directed formatting has a number of surprising features beyond that, which may have made sense with card based input forms 40 years ago, but are probably misplaced today. Before using list directed input you should understand exactly what the rules around it say.)
A realistic and robust approach to this sort of situation is to read in the input line by line, then "manually" process each line, tokenising that line and extracting values as per whatever rules you are following.
(You should get in touch with whoever is naming files that have absolutely no commas or semicolons with an extension ".csv", and have a bit of a chat.)
(As a general rule, low value unit numbers are to be avoided. Due to historical reasons they may have been preconnected for other purposes. Reconnecting them to a different file is guaranteed to work for that specific purpose, but other statements in your program may implicitly be assuming that the low value unit is still connected as it was before the program started executing - for example, PRINT and READ statements intended to work with the console might start operating on your file instead.)
I am trying to initialize an array using a DATA statement in Fortran 90. The code is the following:
PROGRAM dt_state
  IMPLICIT NONE
  INTEGER :: a(4), b(2:2), c(10)
  DATA a/4*0/
  WRITE (6,*) a(:)
  DATA a/4,3,2,1/
  WRITE (6,*) a(:)
END PROGRAM dt_state
I expected that the results on the screen would be 0 0 0 0 and 4 3 2 1. However, what I got was 0 0 0 0 and 0 0 0 0. Does this mean the DATA statement does not overwrite the values of a?
A variable can appear in a DATA statement only once. The DATA statement is for initialization, which is done only once at program start.
In executable code, use assignment to set array values:
a = (/ 4, 3, 2, 1 /)
(in Fortran 90)
or
a = [ 4, 3, 2, 1 ]
(in Fortran 2003).
It is better to use this syntax also for initialization.
Your code is not standard-compliant. That is: from F2008 5.2.3:
A variable, or part of a variable, shall not be explicitly initialized more than once in a program.
The DATA statement (is one thing that) performs such explicit initialization (5.4.7.1), and so in particular two cannot appear for the same variable.
For your second "initialization", use assignment. [As given by #VladimirF who is a faster typist than I.] Further, while one can put a DATA statement with executable statements, as in this case, the standard goes as far as making that obsolescent (B.2.5):
The ability to position DATA statements amongst executable statements is very rarely used, unnecessary, and a potential source of error.
As the code is non-standard, and the error is not one the compiler is required to detect, the compiler is free to do whatever it likes with the code. I thought it would be interesting to see what happens with a small selection of compilers:
One refused to compile (and also pointed out the obsolescence);
Two quietly went ahead, using the first initialization;
One used the second initialization, with a warning.
Of course, one wouldn't want to rely on any of these behaviours, even if one of them were the desired one.
I'm a novice programmer and am writing a simple WAV player in C as a pet project. Part of the file-loading process requires reading specific data (sampling rate, number of channels, ...) from the file header.
Currently what I'm doing is similar to this:
Scan for a sequence of bytes and skip past it
Read 2 bytes into variable a
Check value and return on error
Skip 4 bytes
Read 4 bytes into variable b
Check value and return on error
...and so on. (For the code, see: https://github.com/qgi/Player/blob/master/Importer.c)
I've written a number of helper functions to do the scanning/skipping/reading bits. Still, I'm repeating the read, check, skip pattern several times, which seems neither very efficient nor very smart. It's not a real issue for my project, but as this seems to be quite a common task when handling binary files, I was wondering:
Is there some kind of pattern for doing this more effectively, with cleaner code?
Most often, people define structs (often with something like #pragma pack(1) to guard against padding) that match the file's structures. They then read data into an instance of that with something like fread, and use the values from the struct.
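For the WAV 'fmt ' chunk, that might look roughly like the following sketch; it assumes a little-endian machine, a compiler that supports #pragma pack(push/pop), and the usual RIFF/WAVE field layout:

#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)           /* no padding, so the struct matches the file bytes */
typedef struct {
    char     id[4];             /* "fmt " */
    uint32_t size;              /* chunk size */
    uint16_t audio_format;      /* 1 = PCM */
    uint16_t num_channels;
    uint32_t sample_rate;
    uint32_t byte_rate;
    uint16_t block_align;
    uint16_t bits_per_sample;
} FmtChunk;
#pragma pack(pop)

/* Read one fmt chunk from the current file position; returns 1 on success. */
int read_fmt_chunk(FILE *fp, FmtChunk *out)
{
    return fread(out, sizeof(*out), 1, fp) == 1;
}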
The cleanest option that I've come across is the scanf-like function unpack presented by Kernighan & Pike on page 219 of The Practice of Programming, which can be used like this:
// assume we read the file header into buf
// and the header consists of magic (4 bytes), type (2) and length (4).
// "l" == 4 bytes (long)
// "s" == 2 bytes (short)
unpack(buf, "lsl", &magic, &type, &length);
For efficiency, reading into a buffer of, say, 4096 bytes and then doing your parsing on the data in the buffer would help, and of course a single-pass parse where you only go forward is most efficient.