Best way of creating large bit arrays in Lua

I want to read a large binary file (1 MB in size) into memory using Lua. The target device is mobile, so I very much want to minimise the memory footprint.
From a quick look online it seems that Lua tables will use 16 bytes for each sequential integer index (key), plus the space to store the value, which as I am storing binary data will hopefully only use 2 bits, but let's just say 1 byte.
For 1e6 records that is 1e6 * 17 =~ 17 MB - which is huge!
From my brief reading it seems that I can use userdata to implement anything I want in C. I have not used C before, but it seems that a C bit array would use
1 bit * 1e6 = 125 kB
Shall I do this, or have I got something very wrong? Is there an easier way to do this?
Any advice, or even name-calling for crappy calculations, very much welcome :)
EDIT: Some interesting answers below about storing the data in a string (thanks!) and using bitwise ops. I just came across an example in the PIL book (3rd edition, pg. 293) that shows storing arrays of booleans in C so they use about 3% of the memory. While this is cool and useful, it may be overkill for me, as the solutions below suggest I can fit in 1 MB, which is fine for me.
EDIT: Came across this C blob impl
EDIT: Solution - I read the file contents into a string as suggested, and as I'm using 5.1 I had to use a third-party bit op lib - I went with a pure Lua implementation, LuaBit. Thanks everyone!!

You can store a big blob in a Lua string; it will work with any binary data. Now the question is what you want to do with the data. In any case, you can use string.byte to extract any individual byte, and Lua's bit32 library to get down to bits. (For Lua 5.1 and older, you'll either have to write your own C routines or use a third-party package.)

You can store the data in a string and manipulate it with the string library and Lua BitOp.
Lua 5.2's built-in bit32 library is preferred if available.

If you want to read 1 MB into memory, you won't end up with 125 kB...
If you read the file into a Lua string, you end up with 1 MB, as Lua strings are 8-bit clean byte sequences.
After that, you can process the data according to its structure, using, for example, the struct library.
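For completeness, the PIL boolean-array example the question's EDIT refers to boils down to a userdata wrapping a plain C bit array. A minimal sketch along those lines (names follow the book's example; bounds checking and a metatable are left out):

#include <lua.h>
#include <lauxlib.h>
#include <limits.h>

#define BITS_PER_WORD (CHAR_BIT * sizeof(unsigned int))
#define I_WORD(i) ((unsigned int)(i) / BITS_PER_WORD)
#define I_BIT(i)  (1u << ((unsigned int)(i) % BITS_PER_WORD))

typedef struct {
    int size;                 /* number of bits */
    unsigned int values[1];   /* variable part, sized at allocation */
} BitArray;

static int newarray(lua_State *L) {
    int n = (int)luaL_checkinteger(L, 1);
    size_t nbytes = sizeof(BitArray) + I_WORD(n - 1) * sizeof(unsigned int);
    BitArray *a = (BitArray *)lua_newuserdata(L, nbytes);
    a->size = n;
    for (unsigned int i = 0; i <= I_WORD(n - 1); i++)
        a->values[i] = 0;
    return 1;   /* the new userdata is already on the stack */
}

static int setbit(lua_State *L) {
    BitArray *a = (BitArray *)lua_touserdata(L, 1);
    int i = (int)luaL_checkinteger(L, 2) - 1;   /* 1-based from Lua */
    if (lua_toboolean(L, 3))
        a->values[I_WORD(i)] |= I_BIT(i);
    else
        a->values[I_WORD(i)] &= ~I_BIT(i);
    return 0;
}

static int getbit(lua_State *L) {
    BitArray *a = (BitArray *)lua_touserdata(L, 1);
    int i = (int)luaL_checkinteger(L, 2) - 1;
    lua_pushboolean(L, a->values[I_WORD(i)] & I_BIT(i));
    return 1;
}

Registered with lua_register or a luaL_Reg table, this stores 1e6 flags in roughly 125 kB, matching the back-of-envelope number in the question.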

Related

How to export data from C to MATLAB (on different machines)

I am generating long double float data in a C program on a Linux cluster. I need to export the data to MATLAB, which is not installed on the cluster.
What is the best way? My advisor says to export using printf statements. I assume he means writing the data to a comma-separated file (with fprintf). But it seems to me that that could be slow, use too much memory, and lose a lot of precision.
I've found this web page for reading and writing .MAT files, but I don't really understand the page or the example, which I copied to my cluster but cannot compile (because it's missing libraries which, obviously, come from MATLAB).
What is the best, or easiest, or fastest way to export data from Linux/C to Windows/MATLAB? How do I get started with that method? Be advised when you answer that I am pretty new to C and will likely need help figuring out how to obtain, install, configure, and link any libraries. But once that's done, I think I'm pretty good at learning to use them.
Why do you think you would lose precision? The only drawback with CSVs is that ASCII files require much more storage than binary files, but in this day and age, where you get terabytes of storage for the price of a good haircut, that hardly seems like a problem.
It will only be noticeably slower if you're writing gigabytes upon gigabytes, but normally calculations take so much longer that the difference between ASCII and binary is completely negligible (and if the calculations don't take so long: why do you need a cluster then?)
In any case, I'd go for ASCII -- how to write and read some binary blob needs to be documented in two places, it's easier to create bugs in both the writing end and the reading end, it's harder to solve them since no human can read the file, etc. Also, MAT file formats may change in the next Matlab release (as they have in the past).
With ASCII, you have none of these problems, the only drawback I can think of is that you have to write a small cluster-specific file reader in Matlab (which is still a lot less work than working out all the bugs and maintaining a MAT file writer).
Anyway, there's tons of tools available in Matlab for ASCII: textread, dlmread, importdata, to name a few. On the C-end, indeed just use fprintf (documentation here).
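On the precision worry specifically: ASCII only loses precision if you print too few digits. A minimal sketch of the C writing side (file name and values purely illustrative; %.21Lg prints enough digits to round-trip an x86 80-bit long double):

#include <stdio.h>

int main(void) {
    /* Stand-in data; in practice this comes from the simulation. */
    long double data[3] = { 3.141592653589793238L, 1.0L / 3.0L, 2.0L };
    FILE *fp = fopen("results.csv", "w");
    if (!fp) return 1;
    for (int i = 0; i < 3; i++)
        fprintf(fp, "%.21Lg%s", data[i], i < 2 ? "," : "\n");
    fclose(fp);
    return 0;
}

On the MATLAB side, dlmread('results.csv') then reads the row straight back in (MATLAB stores it as double, which is the best it can represent anyway).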
I once had this problem as well (well, kind of...) and used a simple binary format to do the job.
If your data format is static, meaning it will never change, you can restrict yourself to exactly what you need and hard-code the exact format in your loading program. If you want to stay flexible enough to add and remove columns, however, you should define a kind of header that carries information about the data format, and evaluate it on reading.
The trick for simple importing of data is the following:
Let the MATLAB program know how long your data records are and how they are composed.
Read the data with
rest = fread(fid, Inf, 'uchar=>uint8', 0, 'b').';
in order to have a row vector of uint8s.
Reshape the data with
rest = reshape(rest, recordlength, []).';
in order to get your data in recordlength columns and as many rows as you need.
For each data column, combine the relevant uint8 rows into a "sub-matrix", using a combination of reshape, typecast, swapbytes to group your data appropriately and convert them into the wanted format.
The most important thing here is the typecast() function, which accepts the "byte-wise" data as 1st and the wanted data type as 2nd parameter. There is a wide range of accepted data types, such as intXX and uintXX (with XX one of 8, 16, 32 and 64) as well as single and double.
For example, typecast(uint8([1, 1]), 'uint16') gives you 257, while typecast(uint8([0, 0, 96, 64]), 'single') gives you 3.5. (The uint8() wrappers matter: typecast reinterprets the raw bytes of its input, so the input must actually be single bytes.)
Once you do so, you can improve the reading speed, compared with a text file, by a factor of 20 or so. (At least, this was the case in the application I wrote this for: there were about 10 different measured values every 10 ms, one measurement could run for several minutes or even hours, and I wanted to read in such a file as fast as possible. So I recoded the stuff from text to binary and gained about a factor of 20, or maybe 15 - I don't know exactly. But it was a lot...)
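The C writing side of this scheme only needs a record layout with no padding, so that recordlength on the MATLAB side is exact. A hedged sketch (field layout and file name are purely illustrative; note that the byte order written by the machine must match what fread and swapbytes assume):

#include <stdio.h>
#include <stdint.h>

/* All fields are 4 bytes, so the struct has no padding and
   recordlength on the MATLAB side is exactly 44 bytes. */
typedef struct {
    uint32_t id;
    float    values[10];   /* e.g. 10 measurements per sample */
} record_t;

int main(void) {
    FILE *fp = fopen("data.bin", "wb");
    if (!fp) return 1;
    record_t r = { 1, { 3.5f } };
    fwrite(&r, sizeof r, 1, fp);   /* one fixed-length record */
    fclose(fp);
    return 0;
}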
I would save the workspace as a .MAT file, as you said. Then you have whatever values are contained in all the present variables saved as a workspace at that moment. However, if you are reading arrays (your data) that are gigabytes long, then you probably read them chunk by chunk (due to RAM restrictions, maybe?), and saving the workspace in that case might not help you.
I would never printf anything for transporting. In my work (on long time asymptotics, so I have huge outputs), I save everything as binary files using fwrite. Converting to text is slow and expensive, as far as I know.
I hope this helps a little bit!

Alternative to Hash Map for Small Data set in C

I am currently working on a command-line interface for a particle simulator. Its parser reads input in the following format:
[command] [argument]* (-[flag] [flag argument])
Currently, the command is sent through a conditional block, compared to various known commands, and the corresponding data packet is sent to the matching function. This, however, seems clunky, inefficient and inelegant.
I am thinking about using a hashmap instead, with a string representation of a command as the key and a function pointer as the value. The function referenced would then be sent a data packet containing arguments, flags, etc.
Is a hash map overkill in this situation? Does the extra infrastructure required to implement one outweigh the potential benefits? I am aiming for speed, elegance, function, and, since this is an open-source project, extensibility.
Thanks for the help.
You might want to consider the ternary search tree (TST). It has good performance and efficient use of storage, and you don't need a hash function or a collision strategy.
The linked Bentley/Sedgewick article is a very thorough yet readable explanation of the accompanying C source.
I've been using a TST for name-lookup in the past 3 versions of my postscript interpreter. The only changes that have been needed have been due to changes in memory management. Here's a version I modified (lightly) to use explicit pointers. I use yet another version in my postscript interpreter, any of the xpost2*.zip versions, in the file core.c, which uses byte-offsets for pointers (have to be added to the user-memory byte-pointer to yield a real pointer).
Speed gained will probably be minimal, but you could hash the command to convert it to a number and then use a switch statement. Faster than a hash map.
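Either way, the string-to-function-pointer mapping the question describes can stay very small in C. A minimal sketch using a sorted table and bsearch (command names and handlers are hypothetical):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef void (*cmd_fn)(const char *args);

static void cmd_add(const char *args)  { printf("add: %s\n", args); }
static void cmd_step(const char *args) { printf("step: %s\n", args); }

typedef struct { const char *name; cmd_fn fn; } command_t;

/* Must stay sorted by name, since lookup uses bsearch. */
static const command_t commands[] = {
    { "add",  cmd_add  },
    { "step", cmd_step },
};

static int cmp(const void *key, const void *elem) {
    return strcmp((const char *)key, ((const command_t *)elem)->name);
}

static cmd_fn lookup(const char *name) {
    const command_t *c = bsearch(name, commands,
                                 sizeof commands / sizeof commands[0],
                                 sizeof commands[0], cmp);
    return c ? c->fn : NULL;
}

int main(void) {
    cmd_fn fn = lookup("step");
    if (fn) fn("-n 100");
    return 0;
}

With a handful of commands this is as fast in practice as a hash map, and trivially extensible: contributors add one line to the table.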

something like an "extended" C string library?

I have used several dynamically typed languages and I have been avoiding C but enough is enough, it's the right tool for the job sometimes and I need to get over it.
The things I miss when working with C are associative arrays and rich string libraries. Is there a library that gives more options than string.h? Any general advice when it comes to making the transition with strings?
Thanks for reading - Patrick
You can take a look at the Better String Library. The description from the site:
The Better String Library is an abstraction of a string data type which is superior to the C library char buffer string type, or C++'s std::string. Among the features achieved are:
- Substantial mitigation of buffer overflow/overrun problems and other failures that result from erroneous usage of the common C string library functions
- Significantly simplified string manipulation
- High performance interoperability with other source/libraries which expect '\0' terminated char buffers
- Improved overall performance of common string operations
- Functional equivalency with other more modern languages

The library is totally stand alone, portable (known to work with gcc/g++, MSVC++, Intel C++, WATCOM C/C++, Turbo C, Borland C++, IBM's native CC compiler on Windows, Linux and Mac OS X), high performance, easy to use and is not part of some other collection of data structures. Even the file I/O functions are totally abstracted (so that other stream-like mechanisms, like sockets, can be used.) Nevertheless, it is adequate as a complete replacement of the C string library for string manipulation in any C program.
POSIX gives you <string.h>, <strings.h> and <regex.h>.
If you really need more of a string library than this, C is probably not the right tool for that particular job.
As for a hash table, you can't get a type-safe hash table in C without a lot of nasty macros.
If you're OK with just storing void-pointers, or with doing some manual work for each type of map, then you shouldn't be lacking for options. Coding your own hash table is a hoot and a half - just search Stackoverflow for help with the hash function. If you don't want to roll your own, strmap [LGPL] looks decent.
GLib provides many pre-made data structures and string handling functions, but it's a set of functions and types completely separated from the "usual" ones, and it's not a very lightweight dependency.
If instead C++ is a viable alternative for your task, it bundles a string class and several generic containers ready-made into the standard library (and much other related stuff can be found in Boost).
What specifically are you looking for in your extended c-string library?
One way to get better at C, is to create your own c-string library. Then make it open source, and let others help refine it.
I don't usually advocate creating your own string libraries, but w.r.t. C, it's a great way to learn C.
Much of the power of C consists of the ability to have direct control over the memory as a sequence of bytes. It is a bit against the philosophy of the language to treat strings as something higher-level than that.
I would recommend rolling your own very basic one. It will be an enlightening experience especially to learn pointer arithmetics and loops.
For example, learn about "Schlemiel the Painter's algorithm" regarding strcat and design your library to solve this problem.
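To make the Schlemiel problem concrete: strcat has to rescan the destination from the start on every call, so building a string with repeated strcat costs O(n^2). A small sketch of the trap and the fix (assumes the caller provides a large-enough buffer):

#include <string.h>

/* O(n^2): each strcat walks buf from the beginning to find the end. */
void join_slow(char *buf, const char **parts, int nparts) {
    buf[0] = '\0';
    for (int i = 0; i < nparts; i++)
        strcat(buf, parts[i]);
}

/* O(n): remember where the end is and append there directly. */
void join_fast(char *buf, const char **parts, int nparts) {
    char *end = buf;
    for (int i = 0; i < nparts; i++) {
        size_t len = strlen(parts[i]);
        memcpy(end, parts[i], len);
        end += len;
    }
    *end = '\0';
}

A string library that stores its own length (as bstring does) avoids the problem entirely.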
I've not used it myself, but you should at least review the SEI/CERT library Specifications for Managed Strings, 2nd Edition. The code can be found at CERT.
An associative array associating string keys and struct values in C consists of:
A hash function for strings
An array with a prime number of elements, inside each of which is a linked-list head.
Linked-list elements containing char * pointers to the stored keys and (optionally) a struct * pointer to the corresponding value for each key.
To store a string key in your associative array:
Hash it modulo that prime array size.
In that array bin, add it to the linked-list.
Assign the value pointer to the value you are adding.
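Put together, a minimal version of that structure might look like this (the bucket count and djb2 hash are just common choices, not the only ones; strdup is POSIX):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 101   /* a prime number of buckets */

typedef struct entry {
    char *key;
    void *value;           /* points at the stored struct */
    struct entry *next;    /* collision chain */
} entry;

static entry *buckets[NBUCKETS];

/* djb2: a simple, well-known string hash */
static unsigned long hash(const char *s) {
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

void map_put(const char *key, void *value) {
    unsigned long i = hash(key) % NBUCKETS;
    entry *e = malloc(sizeof *e);
    e->key = strdup(key);
    e->value = value;
    e->next = buckets[i];   /* push onto the bucket's chain */
    buckets[i] = e;
}

void *map_get(const char *key) {
    for (entry *e = buckets[hash(key) % NBUCKETS]; e; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}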

How can I convert (encode) RGB buffer to JPEG in plain C?

I would expect that there's some sort of library that I can use in this fashion:
int buffer[720*480]; // read in from file/memory/network stream
raw_params params;
params.depth = 16;
params.width = 720;
params.height = 480;
params.map = "rgb";
params.interleave = JPEG_RAW_INTERLEAVE_OFF;
jpeg_encode(buffer, params)
But I can't seem to find it.
Clarification
I'm looking for a simple example of how to use such a library as well. Code as robust as the sources of netpbm and magick is too complex for my level of understanding.
Yes, there is a library for this (luckily!): http://freshmeat.net/projects/libjpeg/ which appears to be the same project as http://sourceforge.net/projects/libjpeg/
The libjpeg library from the Independent JPEG Group is the reference implementation of the JPEG standard, written by folks who were and are involved in the creation of the standard. It is robust, stable, and highly portable.
It comes as a source kit, which includes extensive documentation and samples.
Its preferred representation for a source image is an array of pointers to rows of pixels, not a single monolithic array as you show. You can easily create the array of row pointers to address your image buffer, however.
It is highly configurable and extensible. Specifically, it allows both the data source and the destination to be configured through callbacks. This adds some complexity to a first use, but it isn't all that hard to deal with.
There are certainly commercial libraries available as well, and I know that there is commercially available IP for implementation of JPEG in hardware. A commercial library is likely to be faster, but will never be cheaper.
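For orientation, a minimal libjpeg compression loop looks roughly like this (error handling omitted; assumes a packed 24-bit RGB buffer, so the 16-bit-depth source in the question would need converting first):

#include <stdio.h>
#include <jpeglib.h>

void write_jpeg(const char *path, unsigned char *rgb,
                int width, int height, int quality) {
    struct jpeg_compress_struct cinfo;
    struct jpeg_error_mgr jerr;
    FILE *fp = fopen(path, "wb");

    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_compress(&cinfo);
    jpeg_stdio_dest(&cinfo, fp);

    cinfo.image_width = width;
    cinfo.image_height = height;
    cinfo.input_components = 3;      /* R, G, B */
    cinfo.in_color_space = JCS_RGB;
    jpeg_set_defaults(&cinfo);
    jpeg_set_quality(&cinfo, quality, TRUE);

    jpeg_start_compress(&cinfo, TRUE);
    while (cinfo.next_scanline < cinfo.image_height) {
        /* one row pointer at a time, into the monolithic buffer */
        JSAMPROW row = rgb + cinfo.next_scanline * width * 3;
        jpeg_write_scanlines(&cinfo, &row, 1);
    }
    jpeg_finish_compress(&cinfo);
    jpeg_destroy_compress(&cinfo);
    fclose(fp);
}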
Here's a fairly simple example that's copied in a number of places on the web:
http://oaf.mutualads.org/svn/code/renderer/tr_image_jpg.c
http://www.torquepowered.com/community/forums/viewthread/45455
And this particular page of the documentation is relevant to understanding the required structs:
http://apodeline.free.fr/DOC/libjpeg/libjpeg-2.html

Writing more to a file than just plain text

I have always been able to read and write basic text files in C++, but so far no one has discussed much more than that.
My question is this:
If developing a file type by myself for use by an application I also create, how would I go about writing the data to a file and preserve the layout, formatting, etc.? Are there any standards, or does it just depend on the creativity of the programmer?
You basically have to come up with your own file format and write binary data.
You can also serialize your object model and write the output to a file, but that's usually less efficient.
Better to use an existing database, or XML (or similar) for simple needs. If you want to write a file in a format that already exists, find a library that supports it.
You have to know the binary file format for the file you are trying to create. Consider Joel's post on this topic: the 97-2003 file format is a 349-page spec.
Nearly all the time, to do something like that, you use an API to avoid the grunt work. Be careful, however: figuring out "what works" by trial and error can result in an upgrade of the program breaking your code. Plus you have to take into account other operating systems, minor version differences, patches, etc.
There are a number of standards of course. The likely one to use is some flavor of xml since there are libraries and tools that already exist to help you work with it, but nothing is stopping you from inventing your own.
Well, you could store the data in a format you can read, but which maintains the integrity of your data (XML or JSON, for instance).
Or (shudder) you could come up with your own proprietary binary format, and use that.
You would go about it exactly the same way as you would with a text file: writing your data byte by byte, encoded in such a way that when you read the file back you know what you are reading.
For a spreadsheet application you could even use a text format (OOXML, OpenDocument) to store presentation and content information.
Or you could define binary data structures and write those directly to the file.
The choice between a text or binary format depends on the application. For a configuration file you may prefer a text file which can be modified outside your app; for a database you will most likely choose a binary format for performance reasons.
See wotsit.org for information on file formats for various file types. Example: You can figure out exactly how to write out a .BMP file and how it is composed.
Writing to a database can be done by using a wrapper class in your language, mainly passing it SQL commands.
If you create a binary file, you can write any data to it. The only drawback is that you have to know exactly where it starts and where it ends.
Use XML (something open, descriptive, and validatable), and stick with text. There are standards for this sort of thing as well, including ODF.
You can open the file as binary instead of text (how one does this depends somewhat on the platform); from there you can write the data directly out to disk. The only real caveat is endianness, which can become an issue when moving files from one architecture to another (x86 to PPC, for instance).
Writing binary data to disk is really no harder than writing text, and really, your creativity is key for how you store the data.
The general problem is usually referred to as serialization of your application state, in your case with a file as the source/target, in whatever format makes sense for you. These days the preferred input/output format is XML, and you may want to look into the existing standards in this field. The problem then becomes how to map from the state of your system to the particular schema. Boost has a serialization framework that you may want to check out.
/Allan
There are a variety of approaches you can take, but in general you'll want some sort of serialization library. Boost.Serialization or Google's Protocol Buffers are good examples. The basic idea is that you have memory structures (classes and objects) that represent your data, and you want to write that data to a file in a way that can be used to reconstruct those structures again.
If you're hesitant to use a library, you can do it all manually, but realize that you can end up writing a lot of redundant code, or developing your own library. See fopen, fread, fwrite and fclose for a starting point.
A typical binary file format for custom data is an "indexed file format" consisting of
-------
|index|
-------
|data |
-------
Where the index contains records "pointing" to the data.
The index consists of records containing an offset and a size. The offset tells you where in the file the data is stored and the size tells you the size of the data at that offset (i.e. the number of bytes to read).
typedef struct {
    size_t offset;
    size_t size;
} Index;

typedef struct {
    int ID;
    char First[20];
    char Last[20];
    char *RandomInfo;
} Data;
Suppose you want to store 50 records in the file you would create 50 indices and 50 data structures. The 50 index structures would be written to the file first, followed by the 50 data structures.
To read the file you would read in the 50 index structures, then from the data in the read-in index structures you could tell where to "seek" to read the data records.
Look up fopen, fread, fwrite, fseek, ftell, and fclose for the functions to read/write the data.
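A hedged sketch of the reading side, reusing the Index layout above (file name and record count illustrative; error checks omitted):

#include <stdio.h>
#include <stdlib.h>

/* Same layout as the Index struct in the answer above. */
typedef struct {
    size_t offset;
    size_t size;
} Index;

int main(void) {
    FILE *fp = fopen("records.bin", "rb");
    if (!fp) return 1;

    Index idx[50];
    fread(idx, sizeof(Index), 50, fp);    /* index block comes first */

    /* Jump straight to the third record and read its bytes. */
    fseek(fp, (long)idx[2].offset, SEEK_SET);
    char *buf = malloc(idx[2].size);
    fread(buf, 1, idx[2].size, fp);

    free(buf);
    fclose(fp);
    return 0;
}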
You usually use a third-party library for these things. For example, you would link in a database library, say for Oracle, that would allow you to talk to the database. Because the underlying file types (e.g., Excel spreadsheet vs. OpenOffice, Oracle vs. MySQL) differ, these libraries abstract away your need to care how the file is constructed.
Hope that helps you find what you're looking for!
1985 called, and said they have some help IFF you are willing to read up. The interchange file format is still in use today and provides some basic metadata around binary files, such as RIFF or WAV audio. (Unfortunately, TIFF is a false friend.) It allegedly even inspired PNG, so it can't be that bad.
