I'm in the process of designing an embedded C data storage module. It will be included by files/modules that want access to this "shared", system-wide data. Multiple tasks aggregate dozens of inputs (GPIO, CAN, I2C/SPI/SSP data, etc.) and store those values through the API. Other tasks can then access the data safely through the same API. The system is an embedded app running an RTOS, so mutexes are used to protect the data. These ideas will be used regardless of the implementation.
I've designed something like this in the past, and I'm trying to improve upon it. I'm currently halfway through a new implementation and I'm running into a few hiccups and would really benefit from a fresh perspective.
Quick rundown of the requirements of this module:
Ideally, there would be one interface that can access the variables (one get, one set).
I'd like to return different variable types (floats, ints, etc.), which means macros are probably needed.
I'm not pressed for code space, but it's always a concern.
Quick gets/sets are absolutely paramount (which means storing values as strings, XML/JSON style, is out).
No new variables need to be added at runtime; everything is statically defined at boot.
The question is how would you go about designing something like this? Enumerations, structures, accessors, macros, etc? I'm not looking for code here, just discussing general overall design ideas. If there's a solution on the internet that addresses these sorts of things, perhaps even just a link is sufficient.
I've been in this situation a couple of times myself. Every time, I've ended up "rolling my own", and I definitely don't suffer from Not Invented Here (NIH) syndrome. Sometimes the space, processing-turnaround, or reliability/recoverability requirements just make that the least painful path.
So, rather than writing the great American novel on the topic, I'll just throw out some thoughts here, as your question is pretty broad (but thank you for at least forming a question & providing background).
Is C++ on the table? Inline functions, templates, and some of the Boost libraries might be useful here. But I'm guessing this is straight-up C.
If you're using C99, you can at least use inline functions, which are a step above macros when it comes to type safety.
You might want to think about using several mutexes to protect different parts of the data; even though the updates are quick, you might want to break up the data into sections (e.g. configuration data, init data, error logging data, trace data, etc.) and give each its own mutex, reducing the funnel/choke points.
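To illustrate, here's a minimal sketch of the per-section locking idea, assuming FreeRTOS-style mutexes for the RTOS primitives (the section names and fields are made up, and error handling is omitted):

#include <stdint.h>
#include "FreeRTOS.h"
#include "semphr.h"

/* Hypothetical data sections, each guarded by its own mutex */
typedef struct {
    SemaphoreHandle_t lock;
    uint32_t can_speed;
    uint32_t can_rpm;
} can_section_t;

typedef struct {
    SemaphoreHandle_t lock;
    uint32_t adc_raw[8];
} adc_section_t;

static can_section_t canData;
static adc_section_t adcData;

void sections_init(void) {
    canData.lock = xSemaphoreCreateMutex();
    adcData.lock = xSemaphoreCreateMutex();
}

uint32_t can_get_speed(void) {
    uint32_t v;
    xSemaphoreTake(canData.lock, portMAX_DELAY);  /* only blocks writers/readers of CAN data */
    v = canData.can_speed;
    xSemaphoreGive(canData.lock);
    return v;
}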
You could also consider making all access to the data go through a server task. All reads and writes go through an API which communicates with the server task. The server task pulls read and write requests from its queue in order, handles them quickly by operating on a RAM mirror, sends responses where needed (at least for read requests), and then buffers data to NVM in the background, if necessary. It sounds heavyweight compared to simple mutexes, but it has its advantages in certain use cases. I don't know enough about your application to know if this is a possibility.
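As a rough sketch of that server-task pattern, again assuming FreeRTOS-style queues (the request struct, key range, and RAM-mirror layout are all hypothetical):

#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"

typedef enum { REQ_GET, REQ_SET } req_op_t;

typedef struct {
    req_op_t op;
    uint16_t key;              /* which variable */
    uint32_t value;            /* value to write (REQ_SET) */
    QueueHandle_t reply;       /* where to send the result (REQ_GET) */
} data_req_t;

static QueueHandle_t reqQueue;
static uint32_t ramMirror[64]; /* hypothetical backing store */

void data_server_task(void *arg) {
    data_req_t req;
    (void)arg;
    for (;;) {
        if (xQueueReceive(reqQueue, &req, portMAX_DELAY) == pdTRUE) {
            if (req.op == REQ_SET) {
                ramMirror[req.key] = req.value;
                /* optionally mark the key dirty and flush to NVM later */
            } else {
                uint32_t v = ramMirror[req.key];
                xQueueSend(req.reply, &v, portMAX_DELAY);
            }
        }
    }
}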
One thing I will say is that the idea of getting/setting by a tag (e.g. a list of enums like CONFIG_DATA, ADDRESS_DATA, etc.) is a huge step forward from directly addressing data (e.g. "give me the 256 bytes at address 0x42000"). I've seen many shops suffer great pain when the whole physical-addressing scheme finally breaks down and they need to refactor/redesign. Try to keep the "what" decoupled from the "how": clients shouldn't have to know or care where stuff is stored, how big it is, etc. (You might already know all this, sorry if so, I just see it all the time...)
One last thing. You mentioned mutexes. Beware of priority inversion... it can make those "quick accesses" take a very long time in some cases. Most kernel mutex implementations let you account for this (e.g. with priority inheritance), but it's often not enabled by default. Again, sorry if this is old news...
Here are a few approaches I've had experience with, each of which I've found good for its own needs. I'm just writing down my thoughts on this issue; hope this gives you some ideas you can go with...
For more complex data with lots of dependencies and constraints, I've found that a higher-level module is usually preferable, even when it comes at the expense of speed. It saves you a lot of headache, and it's usually the way to go in my experience, unless you have really tough constraints. The idea is to keep most of your "static" data, the stuff that doesn't change much, in this type of module, and benefit from the ease of use and the more "complex" features (e.g. data validation, referential integrity, querying, transactions, privileges, etc.). However, if speed is crucial in some areas, you might want to keep that data out of this module altogether, or implement a simple "mirroring" scheme (where you periodically synchronize back and forth between lightweight mirrors and the main configuration DB from some idle task). Having some experience with this issue on embedded systems, I have found sqlite to be the easiest to work with. Seeing your comments, I can testify that the porting is not that big of a deal if the C standard library is supported on your platform. I ported it to a PowerPC architecture in about a week, on a system with purely in-house proprietary code; it was mostly a matter of getting the OS abstraction layer figured out, and it was smooth sailing from there. I highly recommend sqlite for this.
If the above is still too heavy (either in terms of performance, or perhaps it is just overkill), you can get most of the important benefits from a basic key-value system. The idea is similar to what was mentioned here before: define your keys via enums, store your data in a table of same-sized (e.g. 64-bit) values, and let each variable reside in one cell. I find this works for simpler projects without THAT many types of configuration entries, while you keep your flexibility and the system stays easy to develop and use. A few pointers for this:
Divide your enums into groups or categories, each with its own value range. A scheme I like is to define the groups themselves in one enum, and use those values as the starting offset for the first entry of each group (EntryInGroupID = (groupid << 16) | serial_entry_id); see the sketch after this list.
It's easy to implement "tables" using this scheme (where you have more than one entry for each need or configuration, e.g. routing tables). You can generalize the design by making everything a "table", and if you want to define a single value, just make it a table with one entry. In embedded systems I would recommend allocating everything up-front, according to the maximum number of entries possible for each table, and "using" only part of it. That way you've covered the worst case and remain deterministic.
Another thing I found very useful is using hooks for everything. These hooks can be implemented with function pointers (one per behavior/policy/operation), supplied through a "descriptor" struct that defines the data entry. Usually your entries will use the "default" hook functions for read/write (e.g. direct memory access, or file access), and NULL out the other possible hooks (e.g. "validate", "default value", "notify on change" and such). Once in a while, when you have a more complicated case, you can take advantage of the hook system you planted, with little development cost. The hook system also inherently provides a "hardware abstraction layer" for the storage of the values themselves, and can access data from memory, registers, or files with the same ease. A "notify on change" register/unregister facility can also really help for notifying tasks about configuration changes, and is not that hard to add.
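To make the grouped-key and descriptor/hook ideas above concrete, here is a minimal sketch; every group, entry, and hook name in it is hypothetical:

#include <stdint.h>
#include <stddef.h>

/* Groups occupy the upper 16 bits of a key; entries the lower 16 */
typedef enum { GROUP_CONFIG = 1, GROUP_ROUTING = 2 } group_id_t;

#define MAKE_KEY(group, entry)  (((uint32_t)(group) << 16) | (uint32_t)(entry))

typedef enum {
    KEY_CONFIG_BAUDRATE = MAKE_KEY(GROUP_CONFIG, 0),
    KEY_CONFIG_NODE_ID  = MAKE_KEY(GROUP_CONFIG, 1),
    KEY_ROUTING_FIRST   = MAKE_KEY(GROUP_ROUTING, 0)
} entry_key_t;

/* Descriptor with optional hooks; NULL hooks fall back to default behavior */
typedef struct entry_desc {
    entry_key_t key;
    uint64_t    value;                                   /* same-sized cell */
    int  (*validate)(const struct entry_desc *d, uint64_t newval);
    void (*notify)(const struct entry_desc *d);
} entry_desc_t;

static int baud_validate(const entry_desc_t *d, uint64_t v) {
    (void)d;
    return (v == 125000u || v == 250000u || v == 500000u) ? 0 : -1;
}

static entry_desc_t entries[] = {
    { KEY_CONFIG_BAUDRATE, 250000u, baud_validate, NULL },
    { KEY_CONFIG_NODE_ID,  1u,      NULL,          NULL },
};

int entry_set(entry_key_t key, uint64_t value) {
    for (size_t i = 0; i < sizeof(entries) / sizeof(entries[0]); i++) {
        if (entries[i].key == key) {
            if (entries[i].validate && entries[i].validate(&entries[i], value) != 0)
                return -1;                               /* rejected by the validate hook */
            entries[i].value = value;
            if (entries[i].notify)
                entries[i].notify(&entries[i]);
            return 0;
        }
    }
    return -1;                                           /* unknown key */
}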
For ultra-real-time needs, I would just recommend you keep it simple and save your data in structs. It's pretty much the most lightweight way you can go, and it usually handles all your needs. However, it is hard to handle concurrency in this system, so you would either have to wrap all access with mutexes like you suggested, or keep a mirror for every "task". If this is indeed ultra-real-time, I would go for the mirroring approach with periodic syncs throughout the system. Both this system and the one suggested above make it super easy to serialize your data (as it is always stored in binary format), which will ease things down the road and just keep things simple (all you need for the serialization is a memcpy with a cast or two). Also, using unions from time to time might help, and some optimizers handle them well, but it really depends on the architecture and toolchain you're working with.
If you're really worried about having to upgrade your data for new schemas, you might want to consider an "upgrader module", which can be configured with several schemas, and will be used to convert old formats to the new when invoked. You can use code generation techniques from XML schemas to generate the database itself (e.g. into structs and enums), and take advantage of XML tools to perform the "conversion code", by writing XSLT or just plain python-like scripts. You might want to perform the conversion off-target (e.g. at the host), with a more high-level script, rather than cramming that logic into the target itself.
For Data Type Safety you can use lots of template and meta-programming wizardry. While tough to maintain as the framework coder, it eases things up for the users of the system. If coded right (which is tricky!), it can save you lots of cycles and not impose a lot of code overhead (with inlining, explicit template instantiation and much care). If your compiler isn't all that good (which is usually the case with embedded platform compilers), just use wrapper macros for conversion.
About the concurrency issue, I generally find myself resorting to one of the following methods, depending on the overall design of the system:
Like you suggested, wrap the whole thing with a mutex and hope for the best. You can get finer granularity by adding per-table or even per-entry locks, but this has its own set of problems, and I would recommend just keeping it simple and sticking to a global mutex. If you have replication for the hard real-time stuff, this probably won't cripple your performance, since you only do the sync once in a while.
Design your database as a "single task", and wrap all access to it via messages (MQs, sockets, whatever). This works better for the non-critical-path data (again), and is an easy solution. If there aren't too many simultaneous accesses going on, this might be the way to go. Note that if you're in the same memory range and have shared memory, reads/writes aren't that expensive (since you aren't copying data), and if you set your tasks' priorities right, it can be a simple solution that will solve most of your needs (by avoiding the problem of concurrent access altogether).
Most systems I've worked with required some type of configuration module, usually over both volatile and non-volatile memory. It's very easy (and even tempting) to over-design this, or to bring in an overkill solution where it's not needed. I have found that trying to be "future-proof" on these issues is in a lot of cases just a waste of time; usually the simplest way to go is the best (in real-time and embedded systems anyway). Also, performance issues tend to creep up when you scale the system, and it's sometimes better to stick with the simpler and faster solution up front.
Therefore, given the information you've supplied and the above analysis, I'd recommend you go with the key-value or direct-structs approach. I'm personally a fan of the "single task" approach for my systems, but there is really no one absolute answer here; it requires deeper analysis. For these solutions, having looked for "generic", off-the-shelf options, I always end up implementing it myself in 1-2 weeks and saving myself a lot of headaches.
Hope the answer wasn't overkill :-)
I usually go for a simple dictionary-like API using an int as the key and a fixed-size value. This executes quickly, uses a very small amount of program RAM and has predictable data RAM usage. In other words, the lowest-level API looks like:
void data_set(uint16 key, uint32 value);
uint32 data_get(uint16 key);
Keys become a list of constants:
#define KEY_BOGOMIPS 1
#define KEY_NERDS_PER_HOUR 2
You handle different data types by casting. Sucks, but you can write macros to make the code a little cleaner:
#define data_get_float(key) (float)data_get(key)
Achieving type safety is difficult without writing a separate macro or accessor function for each item. On one project, I needed validation of input data, and that became the type-safety mechanism.
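One caveat with the cast-based macro above: a plain cast converts the numeric value rather than preserving the bit pattern, so if the intent is to round-trip a float through the 32-bit slot, a memcpy-based pun is one option. A minimal sketch, reusing the data_get/data_set API and uint16/uint32 types from above (the _bits name is only to avoid clashing with the macro):

#include <string.h>

static inline void data_set_float(uint16 key, float f) {
    uint32 bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the float's bits, no value conversion */
    data_set(key, bits);
}

static inline float data_get_float_bits(uint16 key) {
    uint32 bits = data_get(key);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}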
The way you structure the physical storage of the data depends on how much data memory, program memory, how many cycles, and how many separate keys you have. If you've got lots of program space, hash the key to get a smaller key that you can look up directly in an array. Usually, I make the underlying storage look like:
struct data_item_t {
    uint16 key;
    uint32 value;
};

struct data_item_t items[NUM_ITEMS];
and iterate through. For me, this has been fast enough even on very small (8-bit) microcontrollers, though it might not fit for you if you've got a lot of items.
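For reference, the linear-search lookup over that table might look like this (a sketch; what to do for an unknown key is an open choice, and returning 0 here is just an assumption):

uint32 data_get(uint16 key) {
    for (int i = 0; i < NUM_ITEMS; i++) {
        if (items[i].key == key) {
            return items[i].value;
        }
    }
    return 0;  /* unknown key; could also assert or return an error code */
}

void data_set(uint16 key, uint32 value) {
    for (int i = 0; i < NUM_ITEMS; i++) {
        if (items[i].key == key) {
            items[i].value = value;
            return;
        }
    }
}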
Remember that your compiler will probably inline or optimise the writes nicely, so cycles per access may be lower than you'd expect.
Figured I'd update one of my only unaccepted questions. Here's my final implementation. Been using this for over a year and it works fantastic. Very easy to add variables to and very little overhead for the benefit it gives us.
lib_data.h:
#ifndef __LIB_DATA_H
#define __LIB_DATA_H
#include <type.h>
/****************************************************************************************
* Constant Definitions
***************************************************************************************/
/* Varname, default value (uint32_t) */
#define DATA_LIST   \
    DM(D_VAR1, 0)   \
    DM(D_VAR2, 1)   \
    DM(D_VAR3, 43)
#define DM(y, z) y,
/* create data structure from the macro */
typedef enum {
    DATA_LIST
    NUM_DATA_VARIABLES
} dataNames_t;

typedef struct {
    dataNames_t name;
    uint32_t    value;
} dataPair_t;
/* The macro has to be undefined to allow the data list to be reused without being
 * defined multiple times.
 *
 * This requires:
 * a file-specific lint option to suppress the rule for use of #undef
 */
#undef DM
/****************************************************************************************
* Data Prototypes
***************************************************************************************/
/****************************************************************************************
* External Function Prototypes
***************************************************************************************/
/**
* Fetch a machine parameter
*
* \param dataName The variable from DATA_LIST that you want to fetch
*
* \return The value of the requested parameter
*
*/
uint32_t lib_data_Get(dataNames_t dataName);
/**
* Set a machine parameter
*
* \param dataName The variable from DATA_LIST that you want to set
* \param dataVal The value you want to set the variable to
*
* \return void
*
*/
void lib_data_Set(dataNames_t dataName, uint32_t dataVal);
#endif /* __LIB_DATA_H */
lib_data.c:
#include <type.h>
#include "lib_data.h"
/****************************************************************************************
* Variable Declarations
***************************************************************************************/
/* Used to initialize the data array with defaults. ##U appends a 'U' to the bare
 * integer specified in the DM macro */
#define DM(y, z)                    \
    dataArray[y].name  = y;         \
    dataArray[y].value = z##U;
static bool_t dataInitialized = FALSE;
static dataPair_t dataArray[NUM_DATA_VARIABLES];
/****************************************************************************************
* Private Function Prototypes
***************************************************************************************/
static void lib_data_Init(void);
/****************************************************************************************
* Public Functions
***************************************************************************************/
uint32_t lib_data_Get(dataNames_t dataName) {
    if (!dataInitialized) {
        lib_data_Init();
    }

    /* Should only be used on systems that do word-sized asm reads/writes.
     * If the lib gets extended to multi-word storage capabilities, a mutex
     * is necessary to protect against multi-threaded access */
    return dataArray[dataName].value;
}
void lib_data_Set(dataNames_t dataName, uint32_t dataVal) {
    if (!dataInitialized) {
        lib_data_Init();
    }

    /* Should only be used on systems that do word-sized asm reads/writes.
     * If the lib gets extended to multi-word storage capabilities, a mutex
     * is necessary to protect against multi-threaded access */
    dataArray[dataName].value = dataVal;
}
/****************************************************************************************
* Private Functions
***************************************************************************************/
/**
* initialize the machine data tables
*
* \param none
*
* \return none
*
*/
static void lib_data_Init(void) {
    /* Invoke the macro to initialize dataArray */
    DATA_LIST

    dataInitialized = TRUE;
}
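For what it's worth, usage from a task then looks like this (a minimal sketch; the values come from the defaults in DATA_LIST above):

#include "lib_data.h"

void example_task(void) {
    uint32_t v;

    v = lib_data_Get(D_VAR3);      /* returns 43, the default given in DATA_LIST */
    lib_data_Set(D_VAR1, 100U);
    v = lib_data_Get(D_VAR1);      /* now returns 100 */
    (void)v;
}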
I can think of a mixture of concepts from AUTOSAR modules, like NvM or even Com. For most of them, the actual type is irrelevant to the operation of the module; it's just a number of bytes.
NvM has a static configuration, generated by a tool according to an input definition. Each block is given an ID which is also configurable (e.g. typedef uint16 NvM_BlockIdType).
The config also contains the data size, an optional CRC (8, 16, or 32 bit), some info such as whether the block takes part in NvM_ReadAll or NvM_WriteAll (or not), possible init callbacks or init blocks for initializing the RAM block, and the NvM block type (native, redundant, or DataSet)...
Blocks are handled by NvM usually with one or two queues (a standard job queue and an immediate job queue).
From the application side, it's just a call to:
Std_ReturnType NvM_ReadBlock(NvM_BlockIdType BlockId, uint8* data);
Std_ReturnType NvM_WriteBlock(NvM_BlockIdType BlockId, uint8* data);
Std_ReturnType NvM_ReadAll(void);
Std_ReturnType NvM_WriteAll(void);
Std_ReturnType NvM_GetErrorStatus(NvM_BlockIdType BlockId, NvM_RequestResultType* res);
Usually Std_ReturnType will be E_OK for an accepted request and E_NOT_OK on some failure, e.g. ID not found, NvM not up...
Regarding signal handling, i.e. flags or signals which are not primitive types like uint8 or uint16 but maybe uint11 or uint4: Com receives IPDUs and stores them in buffers, and for transmission it likewise copies the signal data into an IPDU buffer to be sent.
On the higher layers, you have Com_SendSignal(uint16 ID, uint8* data) or Com_ReceiveSignal(uint16 ID, uint8* data).
Some implementations just create one big buffer sized for all IPDUs. They then have a SignalGroup and Signal configuration, keyed by Signal ID and SignalGroup ID, which stores the offset into that array, plus the start bit and bit size, to finally pack/unpack the signal data from/to the data pointer passed into those functions.
Similar pack/unpack handling can be seen in the SOME/IP transformer.
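As a simplified illustration of that offset/start-bit/bit-size pack/unpack idea (this is not the real AUTOSAR Com API, just a sketch of LSB-first packing into one big byte buffer):

#include <stdint.h>

typedef struct {
    uint16_t byte_offset;  /* offset of the IPDU within the big buffer */
    uint16_t start_bit;    /* bit position of the signal inside the IPDU */
    uint8_t  bit_size;     /* e.g. 4, 11, 16 */
} signal_cfg_t;

static uint8_t ipdu_buffer[256];

/* Pack a signal value into the buffer, least-significant bit first */
void signal_pack(const signal_cfg_t *cfg, uint32_t value) {
    for (uint8_t i = 0; i < cfg->bit_size; i++) {
        uint32_t bitpos = (uint32_t)cfg->start_bit + i;
        uint32_t byte   = cfg->byte_offset + (bitpos / 8u);
        uint8_t  mask   = (uint8_t)(1u << (bitpos % 8u));
        if ((value >> i) & 1u) {
            ipdu_buffer[byte] |= mask;
        } else {
            ipdu_buffer[byte] &= (uint8_t)~mask;
        }
    }
}

uint32_t signal_unpack(const signal_cfg_t *cfg) {
    uint32_t value = 0;
    for (uint8_t i = 0; i < cfg->bit_size; i++) {
        uint32_t bitpos = (uint32_t)cfg->start_bit + i;
        uint32_t byte   = cfg->byte_offset + (bitpos / 8u);
        if (ipdu_buffer[byte] & (uint8_t)(1u << (bitpos % 8u))) {
            value |= (1u << i);
        }
    }
    return value;
}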
Related
I'm planning on storing a bunch of records in a file, where each record is then signed with libsodium. However, I would like future versions of my program to be able to check signatures the current version has made, and ideally vice-versa.
For the current version of Sodium, signatures are made using the Ed25519 algorithm. I imagine that the default primitive can change in new versions of Sodium (otherwise libsodium wouldn't expose a way to choose a particular one, I think).
Should I...
1. Always use the default primitive (i.e. crypto_sign)
2. Use a specific primitive (i.e. crypto_sign_ed25519)
3. Do (1), but store the value of sodium_library_version_major() in the file (either in a dedicated 'sodium version' field or a general 'file format revision' field) and quit if the currently running version is lower
4. Do (3), but also store crypto_sign_primitive()
5. Do (4), but also store crypto_sign_bytes() and friends
...or should I do something else entirely?
My program will be written in C.
Let's first identify the set of possible problems and then try to solve them. We have some data (a record) and a signature. The signature can be computed with different algorithms. The program can evolve and change its behaviour, and libsodium can also (independently) evolve and change its behaviour. On the signature generation front we have:
crypto_sign(), which uses some default algorithm to produce signatures (at the time of writing it just invokes crypto_sign_ed25519())
crypto_sign_ed25519(), which produces signatures based on the specific Ed25519 algorithm
I assume that for one particular algorithm given the same input data and the same key we'll always get the same result, as it's math and any deviation from this rule would make the library completely unusable.
Let's take a look at the two main options:
Using crypto_sign_ed25519() all the time and never changing this. Not that bad an option: it's simple, and as long as crypto_sign_ed25519() exists in libsodium and is stable in its output, you have nothing to worry about, with a stable fixed-size signature and zero management overhead. Of course, in the future someone could discover some horrible problem with this algorithm, and if you're not prepared to change algorithms, that could mean a horrible problem for you.
Using crypto_sign(). With this we suddenly have a lot of problems, because the algorithm can change, so you must store some metadata along with the signature, which opens up a set of questions:
what to store?
should this metadata be record-level or file-level?
What do we have among the mentioned functions for the second approach?
sodium_library_version_major() is a function to tell us the library API version. It's not directly related to changes in supported/default algorithms so it's of little use for our problems.
crypto_sign_primitive() is a function that returns a string identifying the algorithm used in crypto_sign(). That's a perfect match for what we need, because supposedly its output will change at exactly the time when the algorithm would change.
crypto_sign_bytes() is a function that returns the size of signature produced by crypto_sign() in bytes. That's useful for determining the amount of storage needed for the signature, but it can easily stay the same if algorithm changes, so it's not the metadata we need to store explicitly.
Now that we know what to store, there is the question of processing that stored data. You need to get the algorithm name and use it to invoke the matching verification function. Unfortunately, from what I can see, libsodium itself doesn't provide any simple way to get the proper function given the algorithm name (like EVP_get_cipherbyname() or EVP_get_digestbyname() in OpenSSL), so you need to make one yourself (which of course should fail for an unknown name). And if you have to make one yourself, maybe it would be even easier to store some numeric identifier instead of the name from the library (more code though).
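A minimal sketch of storing the primitive name next to the signature and dispatching on it at verification time (the record layout and helper names are hypothetical; crypto_sign_detached(), crypto_sign_ed25519_verify_detached(), and crypto_sign_primitive() are real libsodium calls, and sodium_init() is assumed to have been called already):

#include <string.h>
#include <sodium.h>

/* Hypothetical per-record metadata: algorithm name stored next to the signature */
typedef struct {
    char          algo[16];                  /* e.g. "ed25519", from crypto_sign_primitive() */
    unsigned char sig[crypto_sign_BYTES];
} record_sig_t;

int sign_record(record_sig_t *out, const unsigned char *msg, unsigned long long msglen,
                const unsigned char *sk) {
    strncpy(out->algo, crypto_sign_primitive(), sizeof out->algo - 1);
    out->algo[sizeof out->algo - 1] = '\0';
    return crypto_sign_detached(out->sig, NULL, msg, msglen, sk);
}

int verify_record(const record_sig_t *in, const unsigned char *msg, unsigned long long msglen,
                  const unsigned char *pk) {
    /* Dispatch on the stored algorithm name; fail for anything unknown */
    if (strcmp(in->algo, "ed25519") == 0) {
        return crypto_sign_ed25519_verify_detached(in->sig, msg, msglen, pk);
    }
    return -1;  /* unknown algorithm */
}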
Now let's get back to file-level vs. record-level. To decide, there are two more questions to ask: can you generate new signatures for old records at any given time (is that technically possible, is it allowed by policy), and do you need to append new records to old files?
If you can't generate new signatures for old records or you need to append new records and don't want the performance penalty of signature regeneration, then you don't have much choice and you need to:
have dynamic-size field for your signature
store the algorithm (dynamic string field or internal (for your application) ID) used to generate the signature along with the signature itself
If you can generate new signatures, or especially if you don't need to append new records, then you can get away with the simpler file-level approach, where you store the algorithm used in a special file-level field and, if the signature algorithm changes, regenerate all signatures when saving the file (or keep using the old one when appending new records; that's also more of a compatibility-policy question).
Other options? Well, what's so special about crypto_sign()? It's that its behaviour is not under your control; the libsodium developers choose the algorithm for you (no doubt they choose a good one). But if you have any versioning information in your file structure (not signature-specific, I mean), nothing prevents you from making your own choice and using one algorithm with one file version and another with the next (with conversion code when needed, of course). Again, this also assumes that you can generate new signatures and that it's allowed by policy.
Which brings us back to the original two choices, and the question of whether it's worth the trouble of doing all that compared to just using crypto_sign_ed25519(). That mostly depends on your program's life span. I'd probably say (just as an opinion) that if it's less than 5 years, it's easier to just use one particular algorithm. If it can easily be more than 10 years, then no, you really need to be able to survive algorithm (and probably even whole-crypto-library) changes.
Just use the high-level API.
Functions from the high-level API are not going to use a different algorithm without the major version of the library being bumped.
The only breaking change one can expect in libsodium 1.x.y is the removal of deprecated/undocumented functions (that don't even exist in current releases compiled with the --enable-minimal switch). Everything else will remain backward compatible.
New algorithms might be introduced in 1.x.y versions without high-level wrappers, and will be stabilized and exposed via a new high-level API in libsodium 2.
Therefore, do not bother calling crypto_sign_ed25519(). Just use crypto_sign().
I'm developing a system of many processes which have to be aware of many configurations, options, and data of the system. To do that, I implemented a shared object that uses a pointer to a shared memory block of parameters and their data. The data for each parameter includes its type, value, default value, get/set functions, etc. Basically the data sits in a kind of look-up table.
This shared object has functions to get/set these parameters, so all the processes in the system can access them. I have many defines for the parameter codes and many possibilities for each parameter; for example, one code can be a float value, and another an array of ints. You can only imagine the complexity of the code with all the switches and cases...
My questions are:
Is this practice correct for handling system-wide parameters and configuration? For speed and efficiency I don't want to use a DB file; I have to keep the data close, in RAM. I thought about moving the look-up table to an in-memory DB, but the processing time is critical and I don't want to waste time building SQL statements and compiling them. Any ideas of what the best way to do this is?
Your program design sounds fine, given that the parameters are properly encapsulated in a separate file, declared as static, and only accessible through set/get functions. Then the code for accessing the data, as well as any potential thread-safety code, can be placed in the same file and hidden from the caller.
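One way to tame the switch/case explosion the question mentions is a descriptor table that records where each parameter lives and how big it is, so a single generic get/set can memcpy the right number of bytes. A minimal sketch with hypothetical parameter names (process-shared locking omitted):

#include <stdint.h>
#include <string.h>

typedef enum { PARAM_MAX_SPEED, PARAM_GAIN, PARAM_ROUTE_IDS, PARAM_COUNT } param_id_t;

typedef struct {
    void   *storage;   /* where the value lives (could point into shared memory) */
    size_t  size;      /* size in bytes */
} param_desc_t;

static int32_t maxSpeed    = 100;
static float   gain        = 1.5f;
static int32_t routeIds[8] = { 0 };

static const param_desc_t paramTable[PARAM_COUNT] = {
    [PARAM_MAX_SPEED] = { &maxSpeed, sizeof maxSpeed },
    [PARAM_GAIN]      = { &gain,     sizeof gain     },
    [PARAM_ROUTE_IDS] = { routeIds,  sizeof routeIds },
};

int param_get(param_id_t id, void *out, size_t outSize) {
    if (id >= PARAM_COUNT || outSize < paramTable[id].size)
        return -1;
    memcpy(out, paramTable[id].storage, paramTable[id].size);
    return 0;
}

int param_set(param_id_t id, const void *in, size_t inSize) {
    if (id >= PARAM_COUNT || inSize != paramTable[id].size)
        return -1;
    memcpy(paramTable[id].storage, in, paramTable[id].size);
    return 0;
}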
Whether it makes more sense to keep the parameters in RAM or in a DB really just depends on how quickly you need the data available. It sounds like a DB isn't an option for you, because naturally it will be slower to access. A DB makes more sense if you have multiple clients that need to access the data, but that doesn't seem to be the case here.
We are currently in the process of changing a lot of code as we move from a 'single system' application to one that can farm out tasks to distributed processing nodes. The existing code is in a mix of unmanaged and now managed C++ code, but is also using C# code to provide a WCF interface between node and controller.
As a result of this move, a common code pattern I'm seeing which is likely to remain for the foreseeable future is a basic conversion of integer ID values from an MFC CArray to a managed List to enable serialisation over WCF. The current pattern is:
List<int>^ fixtures = gcnew List<int>(aFixtureIds.GetCount());
for (int i = 0; i < aFixtureIds.GetCount(); i++) // aFixtureIds is of MFC type CArray<int,int>
{
    fixtures->Add(aFixtureIds[i]);
}
We also use something similar in reverse, where if a List is returned we may convert it to a CIntArray for the calling function by iterating through it in a loop and calling Add.
I appreciate that the above doesn't look very intensive but it does get called a lot - is there a better pattern for performing this basic List<->CArray conversion that would use up less processing time? Is this the kind of code that can be optimised effectively by the compiler (I suspect not but am willing to be corrected)? List sizes vary but will typically be anything from 1 to tens of thousands of items, potentially more.
A few suggestions although a lot of it will depend on your application details:
Measure: I suppose it gets old that every SO question on performance has people just saying "measure first", but this is usually (always?) good advice, particularly in this case. I fear that you're drastically overestimating how much time such a conversion takes on a modern desktop computer. For example, on my pokey 4-year-old desktop a quick test shows I can add 100 million integers to a CArray in just 250 ms. I would expect roughly the same performance in C# for the List. This means your 10,000-element list would take around 25 microseconds to run. Now, obviously, if you are doing this thousands of times a second or for billions of elements it becomes an issue, although the solution in those cases would not be a faster algorithm but a better one.
Update When Needed: Only update the arrays when you actually need the information. If you are updating frequently just because you 'might' need it you will likely be wasting conversions that are never used.
Synchronize Updates: Instead of updating the entire array at a time update the elements in both copies of lists at the same time.
Just Use One List: While it sounds like you can't do this at least consider just using one list instead of two copies of the same information. Hide the element access in a function/method/class so you don't need to know whether the data is stored in a CArray<> or a List<>.
Better Design/Algorithm: If you measure and you do find that a lot time is being spent in this conversion then you're probably going to get a far better return by improving the code design to reduce or eliminate the conversions needed. Off-hand, there's probably not a whole lot of improvement possible in just copying a CArray<> to a List<>.
As I am writing a simple Minecraft server application in Erlang, I am now concerned with the question of how to efficiently store and modify chunk data.
For those who don't know about Minecraft's internals: I need to store a lot of binaries (100-1000) of up to 32kB size in memory. Until this point Erlang's builtin binaries are sufficient. But the server has to read and change some bytes (by their id) in these binaries quite often and I don't want to copy them around all the time.
A nice to have feature would be import and export from/to Erlang's standard binaries.
Is there any Erlang extension or database or whatever I could use for this?
Since binaries are read-only, I can think of the following approaches (assuming you expect a high rate of changes):
Use a tree-like structure with relatively small immutable binaries in the leaves. In that case, when you modify your data, you only need to re-create the small leaf binary plus all nodes up to the root. Assuming that changes are "local" to some position, I think you can start with an octree.
Use "big" binaries + list of changes (that could be as simple list of functions). When you need to modify world, just add new function to the list. When someone asks for the world state, take base binary and apply all changes from the list. From time to time "squash" all changes and prepare a new baseline state binary. This could be combined with previous approach (tree with pairs of binary/changes in the leafs).
Move mutable world management to external code. You can use NIFs or ports. I think that would be the fastest way, and it should also be relatively easy to implement. The first version of the API could be as simple as: world:new(X, Y, Z) -> ref(); world:get(Ref, X, Y, Z); world:set(Ref, X, Y, Z, Value).
My suggestion is to use a "rope" structure. It's basically a tree-like structure, usually a splay or finger tree, in which you only change parts of the tree. Doing this the right way gives you the best of both worlds: immutability together with fast updates.
I don't know of a rope implementation, but this is what you want to do.
An alternative is to use ets as a mutable array. It is quite fast for that kind of thing.
The third option is to use a spatial tree structure to represent your world. Octrees or a BSP-like structure are what I'd grab for, blindly.
1000 32 kB binaries is not a big deal. You say you change some bytes? Maybe it would be better to make a record of the different parts the binary can contain, or of the parts that are often modified; that way you can modify parts of the binary without copying the others. With ets you don't have to worry about a bunch of binary copies waiting to be GC'ed. It still makes a copy, and it's still GC'ed, but not in the same way as in a process. You can also use the fullsweep_after option to make a process clean up more often.
There's been a discussion between me and some colleagues who are taking the same class as me (and thus have the same project) about saving data to files and reading from those files only when we need that specific data.
For instance, the project is something about managing a social network. I'm not going into specifics because it doesn't matter, but the idea is to use the best data structures to manipulate this data.
Let's say I'm using a hash table to save the users' profile data. Some of them argue that only some specific information should be saved in the data structures, like an ID that represents a user. Everything else should be put in files. We should access the files to get the data we want, when we want it.
I don't think this is practical... It could be if we were using some database library like SQLite or something, but we are not, and I don't think we are supposed to. We are only supposed to code everything ourselves and use C functions, like these. Nor do I think we are supposed to do perfect memory management. The requirements of the project are not for us to code a database, or even a pseudo-database. What this project demands of us are the best data structures (as long as we know how to justify why we picked those instead of others) to store the type of data, and all the data, specified for the project.
I should let you know that we had two classes before this, whose knowledge is to be applied in this project. One of them dealt with the basics of C: functions, structures, arrays, strings, file I/O, recursion, pointers, and simple data structures like binary trees and linked lists, stuff like that. The other one was about more complex data structures: hash tables, AVL trees, heaps, graphs, etc. It also covered time complexity, big-O notation, and stuff like that.
For instance, let's say all I have in memory is the IDs of the users, and then I need to find all friends of a specific user. I'll have to process the whole file (or files) to find the friends of that user. It would be much easier if I could have all that data in memory already.
It makes no sense to me that we need to pick (and justify) the data structures we see best fit for the project and then only use them to look up an ID. We would then need to do a second lookup to get the real data we need, which will take its time, won't it? Why did we bother with the data structures in the first place if we still need to search a bunch of files on the hard drive?
How could it be possible, using standard C functions, coding everything manually and still simulate some kind of database? Is this practical at all?
Am I missing something here?
It sounds like the project might be more about how you design the relationships between your data "entities", and not as much about how you store them. I don't think storing data in files would be a good solution: file I/O will be much slower than accessing things in memory. If you had the need to persist data on disk, you'd probably want to just use a database rather than files (I know it's an academic course though, so who knows).
I think you should focus more on how you design your data types, and their relationships, to maximize the speed of lookups, searches, etc. For example, you could store all the users in a linked list, or in a tree, or a graph, but each choice has its implications for how fast you can find users, etc. Depending on what features you want in your social networking site, different designs will allow different types of behavior to perform better than they would in other designs.
From what you're saying I doubt that you need to store anything on disk.
One thing that I would ask the teacher is if you're optimizing for time or space complexity (there will be a trade off between these two depending on what you're trying to achieve).
That can certainly be done. The resource forks in Mac System 5-8 files were stored as binary indexed databases (in the general sense of the term; don't think SQL!). (I think that interface was actually written in assembly, but I could do it in C.)
The only thing is: it's a pain in the butt. Such files typically need to start with some kind of index or header, and then hold a bunch of records at predictable locations. (OK, sometimes the first index just points at more indexes. How many layers of indirection do you care to manage?)
If you're going to do it, just remember: binary mode access.
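For what it's worth, a minimal sketch of that fixed-size-record, binary-mode approach in plain C (the header and record layouts are made up for illustration):

#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint32_t magic;        /* identifies the file format */
    uint32_t record_count;
} file_header_t;

typedef struct {
    uint32_t user_id;
    char     name[32];
    uint32_t friend_ids[16];
} user_record_t;

/* Read record number `index` from an already-opened binary file */
int read_user(FILE *fp, uint32_t index, user_record_t *out) {
    long offset = (long)sizeof(file_header_t) + (long)index * (long)sizeof(user_record_t);
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    return fread(out, sizeof *out, 1, fp) == 1 ? 0 : -1;
}

/* Usage: FILE *fp = fopen("users.dat", "rb");  -- binary mode, as noted above */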
Hmm... what about persistent storage?
If your project requires you to remember friend data between two restarts of the app, then don't you think file storage (or whatever else you choose) becomes an issue?
I'm having a very hard time figuring out what you are trying to ask here.
But there is a general rule that may apply:
If all of your data will fit in memory at once, it is usually best to load all of it into memory at once and keep it there. You write out to a file only to save, to exit, or for backup.
There are lots of exceptions to this rule, but for a class project where this is going to be the only major application running on the machine, you may as well store everything in memory. After all, you have already paid for the memory; you don't want it just sitting there idle.
I may have completely misunderstood the question you are trying to ask...