Safely storing and accessing EEPROM - c

I've recently established the need to store infrequently-updated configuration variables in the EEPROM of a microcontroller. Adding state to the program immediately forces one to worry about
detection of uninitialized data in EEPROM (i.e. first boot),
converting or invalidating data from old firmware versions, and
addressing of multiple structures, each of which which may grow in firmware updates.
Extensive Googling has only turned up one article that addresses keeping your EEPROM data valid through firmware updates. Has anyone used the approach discussed in that article? Is there a better alternative approach?

Personally, I prefer a "tagged table" format.
In this format, your data is split up into a series of "tables". Each table has a header that follows a predictable format and a body that can change as you need it to.
Here's an example of what one of the tables would look like:
Byte 0: Table Length (in 16-bit words)
Byte 1: Table ID (used by firmware to determine what this data is)
Byte 2: Format Version (incremented every time the format of this table changes)
Byte 3: Checksum (simple sum-to-zero checksum)
Byte 4: Start of body
...
Byte N: End of body
I wasn't storing a lot of data, so I used a single byte for each field in the header. You can use whatever size you need, so long as you never change it. The data tables are written one after another into the EEPROM.
When your firmware needs to read the data out of the EEPROM, it starts reading at the first table. If the firmware recognizes the table ID and supports the listed table version, it loads the data out of the body of the table (after validating the checksum, of course). If the ID, version, or checksum don't check out, the table is simply skipped. The length field is used to locate the next table in the chain. When firmware sees a table with a length of zero, it knows that it has reached the end of the data and that there are no more tables to process.
I find this format flexible (I can add any type of data into the body of a table) and robust (keep the header format constant and the data tables will be both forward- and backwards-compatible).
There are a couple of caveats, though they are not too burdensome. First, you need to ensure that your firmware can handle the case where important data either isn't in the table or is using an unsupported format version. You will also need to initialize the first byte of the EEPROM storage area to zero (so that on the first boot, you don't start loading in garbage thinking that it's data). Since each table knows its length it is possible to expand or shrink a table; however, you have to move the rest of the table storage area around in order to ensure that there are no "holes" (if the entire chain of tables can't fit in your device's memory, then this process can be annoying). Personally, I don't find any of these to be that big of a problem, and it is well worth the trouble I save over using some other methods of data storage.

Nigel Jones has covered some of the basics in your reference. There are plenty of alternatives.
One alternative, of you have lots of room, is storing key-value pairs instead of structures. Then you can update one value (by appending it) without erasing everything. This is most useful in devices that have a limited number of erase cycles. Your read routine will need to scan from the beginning, updating values each time the key is encountered. Of course your update routine will need to have a "garbage collector" that kicks in when the memory is full.
To handle device errors and power-downs in the middle of updates, we usually store multiple copies of the data. The simplest approach is to pingpong between to halves of the device using sequence number to determine which is newer. A CRC on each section is used to validate it. This also addresses the uninitialized data issue.
For the key-value version you'd need to append the new CRC after each write.

Related

Why does SQLite store hundreds of null bytes?

In a database I'm creating, I was curious why the size was so much larger than the contents, and checked out the hex code. In a 4 kB file (single row as a test), there are two major chunks that are roughly 900 and 1000 bytes, along with a couple smaller ones that are all null bytes 0x0
I can't think of any logical reason it would be advantageous to store thousands of null bytes, increasing the size of the database significantly.
Can someone explain this to me? I've tried searching, and haven't been able to find anything.
The structure of a SQLite database file (`*.sqlite) is described in this page:
https://www.sqlite.org/fileformat.html
SQLite files are partitioned into "pages" which are between 512 and 65536 bytes long - in your case I imagine the page size is probably 1KiB. If you're storing data that's smaller than 1KiB (as you are with your single test row, which I imagine is maybe 100 bytes long?) then that leaves 900 bytes left - and unused (deallocated) space is usually zeroed-out before (and after) use.
It's the same way computer working memory (RAM) works - as RAM also uses paging.
I imagine you expected the file to be very compact with a terse internal representation; this is the case with some file formats - such as old-school OLE-based Office documents but others (and especially database files) require a different file layout that is optimized simultaneously for quick access, quick insertion of new data, and is also arranged to help prevent internal fragmentation - this comes at the cost of some wasted space.
A quick thought-experiment will demonstrate why mutable (i.e. non-read-only) databases cannot use a compact internal file structure:
Think of a single database table as being like a CSV file (and CSVs themselves are compact enough with very little wasted space).
You can INSERT new rows by appending to the end of the file.
You can DELETE an existing row by simply overwriting the row's space in the file with zeroes. Note that you cannot actually "delete" the space by "moving" data (like using the Backspace key in Notepad) because that means copying all of the data in the file around - this is largely a bad idea.
You can UPDATE a row by checking to see if the new row's width will fit in the current space (and overwrite the remaining space with zeros), or if not, then append a new row at the end and overwrite the existing row (a-la INSERT then DELETE)
But what if you have two database tables (with different columns) and need to store them in the same file? One approach is to simply mix each table's rows in the same flat file - but for other reasons that's a bad idea. So instead, inside your entire *.sqlite file, you create "sub-files", that have a known, fixed size (e.g. 4KiB) that store only rows for a single table until the sub-file is full; they also store a pointer (like a linked-list) to the next sub-file that contains the rest of the data, if any. Then you simply create new sub-files as you need more space inside the file and set-up their next-file pointers. These sub-files are what a "page" is in a database file, and is how you can have multiple read/write database tables contained within the same parent filesystem file.
Then in addition to these pages to store table data, you also need to store the indexes (which is what allows you to locate a table row near-instantly without needing to scan the entire table or file) and other metadata, such as the column-definitions themselves - and often they're stored in pages too. Relational (tabular) database files can be considered filesystems in their own right (just encapsulated in a parent filesystem... which could be inside a *.vhd file... which could be buried inside a varbinary database column... inside another filesystem), and even the database systems themselves have been compared to operating-systems (as they offer an environment for programs (stored procedures) to run, they offer IO services, and so on - it's almost circular if you look at the old COBOL-based mainframes from the 1970s when all of your IO operations were restricted to just computer record management operations (insert, update, delete).

How to instantly query a 64Go database

Ok everyone, I have an excellent challenge for you. Here is the format of my data :
ID-1 COL-11 COL-12 ... COL-1P
...
ID-N COL-N1 COL-N2 ... COL-NP
ID is my primary key and index. I just use ID to query my database. The datamodel is very simple.
My problem is as follow:
I have 64Go+ of data as defined above and in a real-time application, I need to query my database and retrieve the data instantly. I was thinking about 2 solutions but impossible to set up.
First use sqlite or mysql. One table is needed with one index on ID column. The problem is that the database will be too large to have good performance, especially for sqlite.
Second is to store everything in memory into a huge hashtable. RAM is the limit.
Do you have another suggestion? How about to serialize everything on the filesystem and then, at each query, store queried data into a cache system?
When I say real-time, I mean about 100-200 query/second.
A thorough answer would take into account data access patterns. Since we don't have these, we just have to assume equal probably distribution that a row will be accessed next.
I would first try using a real RDBMS, either embedded or local server, and measure the performance. If this this gives 100-200 queries/sec then you're done.
Otherwise, if the format is simple, then you could create a memory mapped file and handle the reading yourself using a binary search on the sorted ID column. The OS will manage pulling pages from disk into memory, and so you get free use of caching for frequently accessed pages.
Cache use can be optimized more by creating a separate index, and grouping the rows by access pattern, such that rows that are often read are grouped together (e.g. placed first), and rows that are often read in succession are placed close to each other (e.g. in succession.) This will ensure that you get the most back for a cache miss.
Given the way the data is used, you should do the following:
Create a record structure (fixed size) that is large enough to contain one full row of data
Export the original data to a flat file that follows the format defined in step 1, ordering the data by ID (incremental)
Do a direct access on the file and leave caching to the OS. To get record number N (0-based), you multiply N by the size of a record (in byte) and read the record directly from that offset in the file.
Since you're in read-only mode and assuming you're storing your file in a random access media, this scales very well and it doesn't dependent on the size of the data: each fetch is a single read in the file. You could try some fancy caching system but I doubt this would gain you much in terms of performance unless you have a lot of requests for the same data row (and the OS you're using is doing poor caching). make sure you open the file in read-only mode, though, as this should help the OS figure out the optimal caching mechanism.

detecting when data has changed

Ok, so the story is like this:
-- I am having lots of files (pretty big, around 25GB) that are in a particular format and needs to be imported in a datastore
-- these files are continuously updated with data, sometimes new, sometimes the same data
-- I am trying to figure out an algorithm on how could I detect if something has changed for a particular line in a file, in order to minimize the time spent updating the database
-- the way it currently works now is that I'm dropping all the data in the database each time and then reimport it, but this won't work anymore since I'll need a timestamp for when an item has changed.
-- the files contains strings and numbers (titles, orders, prices etc.)
The only solutions I could think of are:
-- compute a hash for each row from the database, that it's compared against the hash of the row from the file and if they're different the update the database
-- keep 2 copies of the files, the previous ones and the current ones and make diffs on it (which probably are faster than updating the db) and based on those update the db.
Since the amount of data is very big to huge, I am kind of out of options for now. On the long run, I'll get rid of the files and data will be pushed straight into the database, but the problem still remains.
Any advice, will be appreciated.
Problem definition as understood.
Let’s say your file contains
ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40
As you stated Row can be added / updated , hence the file becomes
ID,Name,Age
1,Jim,20 -- to be discarded
2,Tim,35 -- to be updated
3,Kim,40 -- to be discarded
4,Zim,30 -- to be inserted
Now the requirement is to update the database by inserting / updating only above 2 records in two sql queries or 1 batch query containing two sql statements.
I am making following assumptions here
You cannot modify the existing process to create files.
You are using some batch processing [Reading from file - Processing in Memory- Writing in DB]
to upload the data in the database.
Store the hash values of Record [Name,Age] against ID in an in-memory Map where ID is the key and Value is hash [If you require scalability use hazelcast ].
Your Batch Framework to load the data [Again assuming treats one line of file as one record], needs to check the computed hash value against the ID in in-memory Map.First time creation can also be done using your batch framework for reading files.
If (ID present)
--- compare hash
---found same then discard it
—found different create an update sql
In case ID not present in in-memory hash,create an insert sql and insert the hashvalue
You might go for parallel processing , chunk processing and in-memory data partitioning using spring-batch and hazelcast.
http://www.hazelcast.com/
http://static.springframework.org/spring-batch/
Hope this helps.
Instead of computing the hash for each row from the database on demand, why don't you store the hash value instead?
Then you could just compute the hash value of the file in question and compare it against the database stored ones.
Update:
Another option that came to my mind is to store the Last Modified date/time information on the database and then compare it against that of the file in question. This should work, provided the information cannot be changed either intentionally or by accident.
Well regardless what you use your worst case is going to be O(n), which on n ~ 25GB of data is not so pretty.
Unless you can modify the process that writes to the files.
Since you are not updating all of the 25GBs all of the time, that is your biggest potential for saving cycles.
1. Don't write randomly
Why don't you make the process that writes the data append only? This way you'll have more data, but you'll have full history and you can track which data you already processed (what you already put in the datastore).
2. Keep a list of changes if you must write randomly
Alternatively if you really must do the random writes you could keep a list of updated rows. This list can be then processed as in #1, and the you can track which changes you processed. If you want to save some space you can keep a list of blocks in which the data changed (where block is a unit that you define).
Furthermore you can keep checksums/hashes of changed block/lines. However this might not be very interesting - it is not so cheap to compute and direct comparison might be cheaper (if you have free CPU cycles during writing it might save you some reading time later, YMMV).
Note(s)
Both #1 and #2 are interesting only if you can make adjustment to the process that writes the data to the disk
If you can not modify the process that writes in the 25GB data then I don't see how checksums/hashes can help - you have to read all the data anyway to compute the hashes (since you don't know what changed) so you can directly compare while you read and come up with a list of rows to update/add (or update/add directly)
Using diff algorithms might be suboptimal, diff algorithm will not only look for the lines that changed, but also check for the minimal edit distance between two text files given certain formatting options. (in diff, this can be controlled with -H or --minimal to work slower or faster, ie search for exact minimal solution or use heuristic algorithm for which if iirc this algorithm becomes O(n log n); which is not bad, but still slower then O(n) which is available to you if you do direct comparison line by line)
practically it's kind of problem that has to be solved by backup software, so why not use some of their standard solutions?
the best one would be to hook the WriteFile calls so that you'll receive callbacks on each update. This would work pretty well with binary records.
Something that I cannot understand: the files are actually text files that are not just appended, but updated? this is highly ineffective ( together with idea of keeping 2 copies of files, because it will make the file caching work even worse).

How do databases deal with data tables that cannot fit in memory?

Suppose you have a really large table, say a few billion unordered rows, and now you want to index it for fast lookups. Or maybe you are going to bulk load it and order it on the disk with a clustered index. Obviously, when you get to a quantity of data this size you have to stop assuming that you can do things like sorting in memory (well, not without going to virtual memory and taking a massive performance hit).
Can anyone give me some clues about how databases handle large quantities of data like this under the hood? I'm guessing there are algorithms that use some form of smart disk caching to handle all the data but I don't know where to start. References would be especially welcome. Maybe an advanced databases textbook?
Multiway Merge Sort is a keyword for sorting huge amounts of memory
As far as I know most indexes use some form of B-trees, which do not need to have stuff in memory. You can simply put nodes of the tree in a file, and then jump to varios position in the file. This can also be used for sorting.
Are you building a database engine?
Edit: I built a disc based database system back in the mid '90's.
Fixed size records are the easiest to work with because your file offset for locating a record can be easily calculated as a multiple of the record size. I also had some with variable record sizes.
My system needed to be optimized for reading. The data was actually stored on CD-ROM, so it was read-only. I created binary search tree files for each column I wanted to search on. I took an open source in-memory binary search tree implementation and converted it to do random access of a disc file. Sorted reads from each index file were easy and then reading each data record from the main data file according to the indexed order was also easy. I didn't need to do any in-memory sorting and the system was way faster than any of the available RDBMS systems that would run on a client machine at the time.
For fixed record size data, the index can just keep track of the record number. For variable length data records, the index just needs to store the offset within the file where the record starts and each record needs to begin with a structure that specifies it's length.
You would have to partition your data set in some way. Spread out each partition on a separate server's RAM. If I had a billion 32-bit int's - thats 32 GB of RAM right there. And thats only your index.
For low cardinality data, such as Gender (has only 2 bits - Male, Female) - you can represent each index-entry in less than a byte. Oracle uses a bit-map index in such cases.
Hmm... Interesting question.
I think that most used database management systems using operating system mechanism for memory management, and when physical memory ends up, memory tables goes to swap.

Alarm history stack or queue?

I'm trying to develop an alarm history structure to be stored in non-volatile flash memory. Flash memory has a limited number of write cycles so I need a way to add records to the structure without rewriting all of the flash pages in the structure each time or writing out updated pointers to the head/tail of the queue.
Additionally once the available flash memory space has been used I want to begin overwriting records previously stored in flash starting with the first record added first-in-first-out. This makes me think a circular buffer would work best for adding items. However when viewing records I want the structure to work like a stack. E.g. The records would be displayed in reverse chronological order last-in-first-out.
Structure size, head, tail, indexes can not be stored unless they are stored in the record itself since if they were written out each time to a fixed location it would exceed the maximum write cycles on the page where they were stored.
So should I use a stack, a queue, or some hybrid structure? How should I store the head, tail, size information in flash so that it can be re-initialized after power-up?
See a related question Circular buffer in Flash.
Lookup ring-buffer
Assuming you can work out which is the last entry (from a time stamp etc so don't need to write a marker) this also has the best wear leveling performance.
Edit: Doesn't apply to the OP's flash controller: You shouldn't have to worry about wear leveling in your code. The flash memory controller should handle this behind the scenes.
However, if you still want to go ahead an do this, just use a regular circular buffer, and keep pointers to the head and tail of the stack.
You could also consider using a Least Recently Used cache to manage where on flash to store data.
You definitely want a ring buffer. But you're right, the meta information is a bit...interesting.
Map your entries on several sections. When the sections are full, overwrite starting with the first section. Add a sequence-number (nbr sequence numbers > 2 * entries), so on reboot you know what the first entry is.
You could do a version of the ring-buffer, where the first element stored in the page is number of times that page has been written. This allows you to determine where you should write next by finding the first page where the number is lower than the previous page. If they're all the same, you start from the beginning with the next number.

Resources