I'm thinking of writing a small piece of software that runs ML experiments on a cluster (or an arbitrary abstracted executor) and then saves the results so that I can view them in real time efficiently. The executor software will have write access to the database and will push metrics live. I have not worked much with databases, so I'm not sure what the correct approach is. Here is a description of what the system should store:
Each experiment will consist of a single piece of code (or an archive of code) that can be executed on the remote machine; for now we will assume all dependencies etc. are installed there. The code will accept command line arguments, and the experiment will also include a YAML schema defining those arguments. The code itself will specify what gets logged (e.g. I will provide a library in the language for registering channels). In terms of logging, you can log numerical values, arrays, text, and so on, so quite a few types. Each channel will have a single specification (e.g. 2 columns: first an int iteration, second a float error). The code will also produce a final copy of the parameters at the end of the experiment.
When someone submits an experiment, they will need to provide a unique group name plus parameters for execution. This will launch the experiment and log everything.
For me, the easiest implementation would be a flat file system. Each project gets a unique name. Each new experiment gets a unique id and a folder inside the project, where I can store the code. Each channel gets its own file, which for simplicity can be a CSV, with a schema file describing what types of values are stored there so I can load them back later. The final parameters can also be copied into the folder.
However, because of the variety of ways I could do this, and the fact that this might require a separate "table" for each experiment, I have no idea whether this is possible in any database system. Also, maybe I'm overlooking something very obvious, maybe not; if you have any experience with this, any suggestions/advice are most welcome. The main goal is to be able to serve this to a web interface in the end. Maybe NoSQL could accommodate this, maybe not (I don't know exactly how those work)?
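To make that concrete, the layout I have in mind would be something like this (all names are placeholders):

projects/
    my_project/
        experiment_0001/
            code.tar.gz        (the submitted code archive)
            args.yaml          (the YAML schema for the command line arguments)
            params_final.yaml  (the final copy of the parameters)
            channels/
                error.csv      (channel data, e.g. iteration,error rows)
                error.schema   (column types: int iteration, float error)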
The data for ML would primarily be unstructured. That kind of data will not fit naturally into an RDBMS; essentially, a document database like MongoDB is far better suited for such cases.
I am doing an information retrieval project in C++. What are the advantages of using a database to store terms, as opposed to storing them in a data structure such as a vector? More generally, when is it a good idea to use a database rather than a data structure?
(Shawn): Whenever you want to keep the data beyond the length of the instance of the program. (persistence across time)
(Michael Kjörling): Whenever you want many instances of your program, either on the same computer or on many computers, such as across a network or the Internet, to access and manipulate (share) the same data. (persistence across network space)
Whenever you have a very big amount of data that does not fit into memory.
Whenever you have very complex data structures and you prefer not to rewrite the code to manipulate them (e.g. to search or update them), when the DB programmers have already written such code, probably much faster than the code you (or I) would write.
In addition to Shawn pointing out persistence (keeping the data beyond the lifetime of a single program instance): whenever you want multiple instances of the program to be able to easily share data. In-memory data structures are great, but they are not a replacement for persistence.
It really depends on the scope. For example if you're going to have multiple applications accessing the data then a database is better because you won't have to worry about file locks, etc. Also, you'd use a database when you need to do things like joining other data, sorting, etc... unless you like to implement Quicksort.
I'm in the process of designing an embedded C data storage module. It will be included by files/modules that want access to this "shared", system-wide data. Multiple tasks aggregate dozens of inputs (GPIO, CAN, I2C/SPI/SSP data, etc.) and store those values using the API. Other tasks can then access the data safely through the API. The system is an embedded app with an RTOS, so mutexes are used to protect the data. These ideas will be used regardless of the implementation.
I've designed something like this in the past, and I'm trying to improve upon it. I'm currently halfway through a new implementation and I'm running into a few hiccups and would really benefit from a fresh perspective.
Quick rundown of the requirements of this module:
Ideally, there would be one interface that can access the variables (one get, one set).
I'd like to return different variable types (floats, ints, etc). This means macros are probably needed.
I'm not pressed for code space, but it's always a concern
Quick gets/sets are absolutely paramount (which means storing in strings ala xml/json is out)
No new variables need to be added during runtime. Everything is statically defined at boot
The question is how would you go about designing something like this? Enumerations, structures, accessors, macros, etc? I'm not looking for code here, just discussing general overall design ideas. If there's a solution on the internet that addresses these sorts of things, perhaps even just a link is sufficient.
I've been in this situation a couple of times myself. Every time, I've ended up "rolling my own", and I definitely don't suffer from Not Invented Here (NIH) syndrome. Sometimes the space, processing-turnaround-time, or reliability/recoverability requirements just make that the least painful path.
So, rather than writing the great American novel on the topic, I'll just throw out some thoughts here, as your question is pretty broad (but thank you for at least forming a question & providing background).
Is C++ on the table? Inline functions, templates, and some of the Boost libraries might be useful here. But I'm guessing this is straight-up C.
If you're using C99, you can at least use inline functions, which are a step above macros when it comes to type safety.
You might want to think about using several mutexes to protect different parts of the data; even though the updates are quick, you might want to break up the data into sections (e.g. configuration data, init data, error logging data, trace data, etc.) and give each its own mutex, reducing the funnel/choke points.
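For instance, here is a minimal sketch of that idea, using pthread_mutex_t as a stand-in for whatever mutex type your RTOS actually provides (all names here are illustrative, not a real API):

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SEC_CONFIG, SEC_INIT, SEC_ERR_LOG, SEC_TRACE, NUM_SECTIONS } section_t;

static pthread_mutex_t sectionLock[NUM_SECTIONS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static uint32_t sectionData[NUM_SECTIONS][64];  /* backing store, one array per section */

uint32_t section_get(section_t sec, size_t idx) {
    pthread_mutex_lock(&sectionLock[sec]);      /* contention is confined to one section */
    uint32_t v = sectionData[sec][idx];
    pthread_mutex_unlock(&sectionLock[sec]);
    return v;
}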
You could also consider making all access to the data go through a server task. All reads and writes go through an API which communicates with that task. The server task pulls read and write requests in order from its queue, handles them quickly against a RAM mirror, sends responses where needed (at least for read requests), and buffers data to NVM in the background, if necessary. It sounds heavyweight compared to simple mutexes, but it has its advantages in certain use cases. I don't know enough about your application to say whether it's a possibility.
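A rough sketch of that server-task pattern; the queue_receive and reply_send calls are hypothetical placeholders for your RTOS primitives:

#include <stdint.h>

typedef enum { REQ_READ, REQ_WRITE } reqKind_t;

typedef struct {
    reqKind_t kind;
    uint16_t  key;
    uint32_t  value;        /* payload for writes */
    void     *replyHandle;  /* e.g. a per-client semaphore or reply queue */
} dataReq_t;

/* The server task is the single owner of the RAM mirror, so there are no data races. */
void data_server_task(void) {
    static uint32_t ramMirror[256];
    dataReq_t req;
    for (;;) {
        queue_receive(&req);                    /* hypothetical blocking RTOS call */
        if (req.kind == REQ_WRITE)
            ramMirror[req.key] = req.value;     /* NVM write-back can be batched later */
        else
            reply_send(req.replyHandle, ramMirror[req.key]);  /* hypothetical reply path */
    }
}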
One thing I will say is that the idea of getting/setting by a tag (e.g. a list of enums like CONFIG_DATA, ADDRESS_DATA, etc.) is a huge step forward from directly addressing data (e.g. "give me the 256 bytes at address 0x42000"). I've seen many, many shops suffer great pain when the whole physical-addressing scheme finally breaks down and they need to re-factor / re-design. Try to keep the "what" decoupled from the "how": clients shouldn't have to know or care where stuff is stored, how big it is, etc. (You might already know all this; sorry if so, I just see it all the time...)
One last thing. You mentioned mutexes. Beware of priority inversion... that can make those "quick accesses" take a very long time in some cases. Most kernel mutex implementations allow you to account for this, but often it's not enabled by default. Again, sorry if this is old news...
A few approaches I've had experience with, each of which I've found good for its own needs. I'm just writing down my thoughts on this issue; hope this gives you some ideas you can go with...
For more complex data with lots of dependencies and constraints, I've found a higher-level module is usually preferable, even when it comes at the expense of speed. It saves you a lot of headache, and in my experience it's usually the way to go unless you have really tough constraints. The idea is to keep most of your "static" data, which doesn't change much, in this kind of module, and benefit from the ease of use and the more "complex" features (e.g. data validation, referential integrity, querying, transactions, privileges, etc.). However, if speed is crucial in some areas, you might want to keep that data out of this module altogether, or implement a simple "mirroring" scheme (where you could, for example, periodically synchronize back and forth between your lightweight mirrors and the main configuration DB on some idle task).
Having some experience with this issue on embedded systems, I have found sqlite the easiest to work with. Seeing your comments, I can testify that porting is not that big a deal if the C standard library is supported on your platform. I ported the system to a PowerPC architecture in about a week, on pure in-house proprietary code; it was mostly a matter of getting the OS abstraction layer figured out, and it was smooth sailing from there. I highly recommend sqlite for this.
If the above is still too heavy (either in terms of performance, or perhaps it is just overkill), you can get most of the important benefits from a basic key-value system. The idea is similar to what was mentioned here before: define your keys via enums, store your data in a table of same-sized (e.g. 64-bit) values, and let each variable reside in one cell. I find this works for simpler projects, with not THAT many types of configuration entries, while you keep your flexibility and the system stays easy to develop and use. A few pointers for this:
Divide your enums into groups or categories, each with its own value range. A scheme I like is to define the groups themselves in one enum, and use those values to derive the first value of each group (EntryInGroupID = (groupid << 16) | serial_entry_id); see the sketch after these pointers.
It's easy to implement "tables" using this scheme (where you have more than one entry for a given need or configuration, e.g. routing tables). You can generalize the design by making everything a "table"; if you want to define a single value, just make it a table with one entry. In embedded systems I would recommend allocating everything up front, according to the maximum number of entries possible for that table, and "using" only part of it. That way you've covered the worst case and remain deterministic.
Another thing I found very useful was using hooks for everything. These hooks can be implemented with function pointers (one per behavior / policy / operation), supplied through a "descriptor" struct that defines the data entry. Usually, your entries will use the "default" hook functions for read/write (e.g. direct memory access, or file access), and NULL out the other possible hooks (e.g. "validate", "default value", "notify on change" and such). Once in a while, when you have a more complicated case, you can take advantage of the hooks system you planted, at little development cost. This hooks system inherently provides a "hardware abstraction layer" for the storage of the values themselves, and can access data from memory, registers, or files with the same ease. A register/unregister facility for "notify on change" can also really help with telling tasks about configuration changes, and is not that hard to add.
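Here is a sketch of the grouped-ID and descriptor/hook ideas above; all names are illustrative, not a real API:

#include <stdint.h>

/* Groups defined in one enum; each group's keys derive their range from it. */
typedef enum { GROUP_CONFIG, GROUP_ROUTING, GROUP_STATS } groupId_t;

#define ENTRY_IN_GROUP_ID(groupid, serial) (((uint32_t)(groupid) << 16) | (uint32_t)(serial))

enum {
    KEY_CFG_BAUD  = ENTRY_IN_GROUP_ID(GROUP_CONFIG, 0),
    KEY_CFG_ADDR  = ENTRY_IN_GROUP_ID(GROUP_CONFIG, 1),
    KEY_ROUTE_TBL = ENTRY_IN_GROUP_ID(GROUP_ROUTING, 0)  /* a one-entry "table" works too */
};

/* Per-entry descriptor: default read/write hooks, optional hooks left NULL. */
typedef struct {
    uint32_t id;
    uint32_t (*read)(uint32_t id);                  /* default: direct memory access */
    void     (*write)(uint32_t id, uint32_t value);
    int      (*validate)(uint32_t value);           /* NULL when unused */
    void     (*notifyChange)(uint32_t id);          /* NULL when unused */
} entryDesc_t;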
For ultra-real-time needs, I would just recommend keeping it simple and saving your data in structs. It's pretty much the most lightweight way you can go, and it usually handles all your needs. However, it is hard to handle concurrency this way, so you would either have to wrap all access with mutexes like you suggested, or keep a mirror for every "task". If this is indeed ultra-real-time, I would go for the mirroring approach, with periodic syncs throughout the system. Both this system and the one suggested above make it super easy to serialize your data (as it is always stored in binary format), which will ease things down the road and keep things simple (all you need for serialization is a memcpy with a cast or two). Using unions from time to time might also help, and some optimizers handle them well, but it really depends on the architecture and toolchain you're working with.
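For example, a minimal sketch of the struct-plus-mirror idea (the types are invented for illustration):

#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t errorCount;
    float    temperature;
    uint8_t  mode;
} sharedData_t;

static sharedData_t master;      /* owned by the sync task */
static sharedData_t taskMirror;  /* a task's private, lock-free copy */

/* Called from an idle/periodic task; serialization is just the binary image. */
void sync_mirror(void) {
    memcpy(&taskMirror, &master, sizeof master);  /* guard with a lock if preemptible */
}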
If you're really worried about having to upgrade your data to new schemas, you might want to consider an "upgrader module", which can be configured with several schemas and is invoked to convert old formats to the new one. You can use code generation from XML schemas to produce the database itself (e.g. the structs and enums), and take advantage of XML tooling for the "conversion code", writing XSLT or plain Python-like scripts. You might want to perform the conversion off-target (i.e. on the host) with a higher-level script, rather than cramming that logic into the target itself.
For data type safety you can use lots of template metaprogramming wizardry. While tough to maintain for the framework coder, it eases things for the users of the system. If coded right (which is tricky!), it can save you lots of cycles without imposing much code overhead (with inlining, explicit template instantiation, and much care). If your compiler isn't all that good (which is usually the case with embedded platform compilers), just use wrapper macros for conversion.
About the concurrency issue, I generally find myself resorting to one of the following methods, depending on the overall design of the system:
Like you suggested, wrap the whole thing with a mutex and hope for the best. You can get more granularity by adding per-table or even per-entry locks, but this has its own set of problems, and I would recommend just keeping it simple and sticking to a global mutex. If you have replications for the hard real-time stuff, this probably won't cripple your performance, since you only sync once in a while.
Design your database as a "single task", and wrap all access to it in messages (MQs, sockets, whatever). This works better for non-critical-path data (again), and it is an easy solution. If there aren't too many simultaneous accesses going on, this might be the way to go. Note that if you're in the same memory range and have shared memory, reads/writes aren't that expensive (since you aren't copying data), and if you set your tasks' priorities right, it can be a simple solution that solves most of your needs (by avoiding the problem of concurrent access altogether).
Most systems I've worked with required some type of configuration module, usually over both volatile and non-volatile memory. It's very easy (and even tempting) to over-design this, or to bring in an overkill solution where it's not needed. I have found that trying to be "future-proof" on these issues is in many cases just a waste of time; usually the simplest way to go is the best (in real-time and embedded systems, anyway). Also, performance issues tend to creep up when you scale the system, and it's sometimes better to stick with the simpler and faster solution up front.
Therefore, given the information you've supplied and the above analysis, I'd recommend you go with the key-value or direct-structs approach. I'm personally a fan of the "single task" approach for my systems, but there is really no single absolute answer for this, and it requires deeper analysis. Whenever I've looked for a "generic", off-the-shelf solution for these, I've always ended up implementing it myself in 1-2 weeks, saving myself a lot of headaches.
Hope the answer wasn't overkill :-)
I usually go for a simple dictionary-like API using an int as the key and a fixed-size value. This executes quickly, uses a very small amount of program RAM and has predictable data RAM usage. In other words, the lowest-level API looks like:
void data_set(uint16 key, uint32 value);
uint32 data_get(uint16 key);
Keys become a list of constants:
#define KEY_BOGOMIPS 1
#define KEY_NERDS_PER_HOUR 2
You handle different data types by casting. Sucks, but you can write macros to make the code a little cleaner:
#define data_get_float(key) (float)data_get(key)
Achieving type safety is difficult without writing a separate macro or accessor function for each item. On one project, I needed validation of the input data, and that became the type-safety mechanism.
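One caveat with the (float) cast above: it converts the integer's numeric value rather than reinterpreting its bits, so if you stored a float's bit pattern you won't get the float back. A hedged sketch of bit-exact float storage through a union (assuming the data_set/data_get above, and that uint16/uint32 are the usual fixed-width typedefs; function names are illustrative):

extern void data_set(uint16 key, uint32 value);
extern uint32 data_get(uint16 key);

typedef union { uint32 u; float f; } pun32_t;

void data_set_float2(uint16 key, float f) {
    pun32_t p;
    p.f = f;
    data_set(key, p.u);      /* store the raw bit pattern, not a converted value */
}

float data_get_float2(uint16 key) {
    pun32_t p;
    p.u = data_get(key);     /* recover the exact bit pattern */
    return p.f;
}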
The way you structure the physical storage of the data depends on how much data memory, program memory, how many cycles, and how many separate keys you have. If you've got lots of program space, hash the key to get a smaller key that you can look up directly in an array. Usually, I make the underlying storage look like:
struct data_item_t {
    uint16 key;
    uint32 value;
};
struct data_item_t items[NUM_ITEMS];
and iterate through. For me, this has been fast enough even on very small (8-bit) microcontrollers, though it might not fit for you if you've got a lot of items.
Remember that your compiler will probably inline or optimise the writes nicely, so cycles per access may be lower than you'd expect.
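For completeness, the iterate-through lookup might look like this (a sketch; the policy for missing keys is up to you):

uint32 data_get(uint16 key) {
    int i;
    for (i = 0; i < NUM_ITEMS; i++) {
        if (items[i].key == key)
            return items[i].value;
    }
    return 0;  /* key not found; pick whatever failure policy fits your system */
}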
Figured I'd update one of my only unaccepted questions. Here's my final implementation. I've been using this for over a year and it works fantastically. It's very easy to add variables to, with very little overhead for the benefit it gives us.
lib_data.h:
#ifndef LIB_DATA_H
#define LIB_DATA_H
#include <type.h>
/****************************************************************************************
* Constant Definitions
***************************************************************************************/
/* Varname, default value (uint32_t) */
#define DATA_LIST \
DM(D_VAR1, 0) \
DM(D_VAR2, 1) \
DM(D_VAR3, 43)
#define DM(y, z) y,
/* create data structure from the macro */
typedef enum {
DATA_LIST
NUM_DATA_VARIABLES
} dataNames_t;
typedef struct {
dataNames_t name;
uint32_t value;
} dataPair_t;
/* the macro has to be undefined to allow the data list to be reused without being
 * defined multiple times
*
* this requires:
* a file specific lint option to suppress the rule for use of #undef
*/
#undef DM
/****************************************************************************************
* Data Prototypes
***************************************************************************************/
/****************************************************************************************
* External Function Prototypes
***************************************************************************************/
/**
* Fetch a machine parameter
*
* \param dataName The variable from DATA_LIST that you want to fetch
*
* \return The value of the requested parameter
*
*/
uint32_t lib_data_Get(dataNames_t dataName);
/**
* Set a machine parameter
*
* \param dataName The variable from DATA_LIST that you want to set
* \param dataVal The value you want to set the variable to
*
* \return void
*
*/
void lib_data_Set(dataNames_t dataName, uint32_t dataVal);
#endif /* LIB_DATA_H */
lib_data.c:
#include <type.h>
#include "lib_data.h"
/****************************************************************************************
* Variable Declarations
***************************************************************************************/
/* Used to initialize the data array with defaults ##U appends a 'U' to the bare
* integer specified in the DM macro */
#define DM(y, z) \
dataArray[y].name = y; \
dataArray[y].value = z##U;
static bool_t dataInitialized = FALSE;
static dataPair_t dataArray[NUM_DATA_VARIABLES];
/****************************************************************************************
* Private Function Prototypes
***************************************************************************************/
static void lib_data_Init(void);
/****************************************************************************************
* Public Functions
***************************************************************************************/
uint32_t lib_data_Get(dataNames_t dataName) {
if(!dataInitialized) {
lib_data_Init();
}
/* Should only be used on systems that do word-sized asm reads/writes.
* If the lib gets extended to multi-word storage capabilities, a mutex
* is necessary to protect against multi-threaded access */
return dataArray[dataName].value;
}
void lib_data_Set(dataNames_t dataName, uint32_t dataVal) {
if(!dataInitialized) {
lib_data_Init();
}
/* Should only be used on systems that do word-sized asm reads/writes.
* If the lib gets extended to multi-word storage capabilities, a mutex
* is necessary to protect against multi-threaded access */
dataArray[dataName].value = dataVal;
}
/****************************************************************************************
* Private Functions
***************************************************************************************/
/**
* initialize the machine data tables
*
* \param none
*
* \return none
*
*/
static void lib_data_Init(void) {
/* Invoke the macro to initialize dataArray */
DATA_LIST
dataInitialized = TRUE;
}
I can think of some concepts mixed in from AUTOSAR modules, like NvM or even Com. For most of them, the actual type is irrelevant to the operation of the module; it's just a number of bytes.
NvM has a statically defined configuration, generated by a tool from an input definition. Each block is given an ID, which is also configurable (e.g. typedef uint16 NvM_BlockIdType).
The config also contains the data size, an optional CRC (8, 16 or 32 bit), some info such as whether the block is used with NvM_ReadAll or NvM_WriteAll (or not), possible init callbacks or init blocks for initializing the RAM block, and the NvM block type: native, redundant, or DataSet...
A block is handled by NvM usually through one or two queues (a standard job queue and an immediate job queue).
From the application side, it's just a call to:
Std_ReturnType NvM_ReadBlock(NvM_BlockIdType BlockId, uint8* data);
Std_ReturnType NvM_WriteBlock(NvM_BlockIdType BlockId, uint8* data);
Std_ReturnType NvM_ReadAll(void);
Std_ReturnType NvM_WriteAll(void);
Std_ReturnType NvM_GetErrorStatus(NvM_BlockIdType BlockId, NvM_RequestResultType* res);
Usually Std_ReturnType will return E_OK for an accepted request, and E_NOT_OK on some failure, e.g. ID not found or NvM not up...
Regarding signal handling, for flags or for signals that are not primitive types like uint8 or uint16 but e.g. uint11 or uint4: Com receives IPDUs and stores them in buffers, and likewise copies out an IPDU for transmission.
On the higher layers, you have Com_SendSignal(uint16 ID, uint8* data) and Com_ReceiveSignal(uint16 ID, uint8* data).
Some implementations just create one big buffer sized for all IPDUs. They then have a signal-group and signal configuration, keyed by Signal-ID and SignalGroup-ID, which stores the offset into that array as an index, plus a start bit and bit size, used to pack/unpack the signal data from/to the data pointer passed into those functions.
Similar pack/unpack handling to Com can be seen in the SomeIP transformer.
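A rough sketch of that offset/startbit/bitsize unpacking (illustrative names, little-endian bit order assumed; not the real Com internals):

#include <stdint.h>

typedef struct {
    uint16_t ipduOffset;  /* byte offset of the IPDU inside the big buffer */
    uint16_t startBit;    /* bit position of the signal within the IPDU */
    uint8_t  bitSize;     /* e.g. 11 for a uint11 signal */
} signalCfg_t;

static uint32_t unpack_signal(const uint8_t *buf, const signalCfg_t *cfg) {
    uint32_t raw = 0;
    uint8_t i;
    for (i = 0; i < cfg->bitSize; i++) {
        uint32_t bit = (uint32_t)cfg->startBit + i;   /* absolute bit within the IPDU */
        raw |= (uint32_t)((buf[cfg->ipduOffset + bit / 8u] >> (bit % 8u)) & 1u) << i;
    }
    return raw;
}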
There's been a discussion between me and some colleagues who are taking the same class as me (and thus have the same project) about saving data to files and reading from those files only when we need that specific data.
For instance, the project is something about managing a social network. I'm not going into specifics because it doesn't matter, but the idea is to use the best data structures to manipulate this data.
Let's say I'm using a hash table to save the users' profile data. Some of them argue that only some specific information should be saved in the data structures, like an ID that represents a user. Everything else should be put in files, and we should access those files to get the data we want, when we want it.
I don't think this is practical... It could be if we were using some library for a database, like SQLite or something, but we are not, and I don't think we are supposed to. We are only supposed to code everything ourselves and use C functions, like these. Nor do I think we are supposed to do perfect memory management. The requirements of the project are not for us to code a database, or even a pseudo-database. What this project demands of us are the best data structures (as long as we know how to justify why we picked them instead of others) to store the types of data, and all the data, specified for the project.
I should let you know that we had 2 classes before this one whose knowledge is to be applied to this project. One of them dealt with the basics of C: functions, structures, arrays, strings, file I/O, recursion, pointers, and simple data structures like binary trees and linked lists, stuff like that. The other one was about more complex data structures: hash tables, AVL trees, heaps, graphs, etc. It also covered time complexity, big O notation, and stuff like that.
For instance, let's say all I have in memory is the IDs of the users, and then I need to find all friends of a specific user. I'll have to process the whole file (or files) to find the friends of that user. It would be much easier if I could already have all that data in memory.
It makes no sense to me that we need to pick (and justify) the data structures we see best fit for the project and then only use them to look up an ID. We would then need a second lookup to get the real data we need, which will take its time, won't it? Why did we bother with the data structures in the first place if we still need to search a bunch of files on the hard drive?
How would it be possible, using standard C functions and coding everything manually, to simulate some kind of database? Is this practical at all?
Am I missing something here?
It sounds like the project might be more about how you design the relationships between your data "entities", and not as much about how you store them. I don't think storing the data off in files would be a good solution: file I/O will be much slower than accessing things in memory. If you had the need to persist data on disk, you'd probably want to use a database rather than files (I know it's an academic course, though, so who knows).
I think you should focus more on how you design your data types, and their relationships, to maximize the speed of lookups, searches, etc. For example, you could store all the users in a linked list, or store them in a tree, or a graph, but each will have its implications on how fast you can find users, etc. Depending on what features you want in your social networking site, there will be different designs that will allow different types of behavior to perform better than it would in other designs.
From what you're saying I doubt that you need to store anything on disk.
One thing I would ask the teacher is whether you're optimizing for time or space complexity (there will be a trade-off between these two depending on what you're trying to achieve).
That can certainly be done. The resource forks in Mac System 5-8 files were stored as binary indexed databases (in the general sense of the term; don't think SQL!). (I think the interface was actually written in assembly, but I could do it in C.)
The only thing is: it's a pain in the butt. Such files typically need to start with some kind of index or header, and then hold a bunch of records at predictable locations. (OK, sometimes the first index just points at some more indexes. How many layers of indirection do you care to manage?)
If you're going to do it, just remember: binary mode access.
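A minimal sketch of that fixed-size-record idea in standard C (names invented for illustration; note the binary "rb" mode when opening):

#include <stdio.h>

typedef struct {
    unsigned id;
    char     name[32];
} userRec_t;

/* Record n lives at byte offset n * sizeof(userRec_t), so lookup is a single seek. */
int read_user(FILE *f, long n, userRec_t *out) {
    if (fseek(f, n * (long)sizeof *out, SEEK_SET) != 0)
        return -1;
    return fread(out, sizeof *out, 1, f) == 1 ? 0 : -1;
}

/* usage: FILE *f = fopen("users.dat", "rb"); then read_user(f, 42, &rec); */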
Hmm... what about persistent storage?
If your project requires you to be able to remember friend data between two restarts of the app, then don't you think file storage (or whatever else you use for persistence) becomes an issue you have to deal with?
I'm having a very hard time figuring out what you are trying to ask here.
But there is a general rule that may apply:
If all of your data will fit in memory at once, it is usually best to load all of it into memory at once and keep it there. You write out to a file only to save, to exit, or for backup.
There are lots of exceptions to this rule, but for a class project where this is going to be the only major application running on the machine, you may as well store everything in memory. After all, you have already paid for the memory; you don't want it just sitting there idle.
I may have completely misunderstood the question you are trying to ask...