I'm planning to write a small piece of software that runs ML experiments on a cluster (or an arbitrary abstracted executor) and saves the results so that I can view them efficiently in real time. The executor will have write access to the database and will push metrics live. I have not worked much with databases, so I'm not sure what the correct approach is. Here is a description of what the system should store:
Each experiment will consist of a single piece of code, or an archive of code, that can be executed on the remote machine. For now we will assume that all dependencies etc. are installed there. The code will accept command-line arguments. The experiment will also include a YAML schema defining those command-line arguments. The code itself will specify what is logged (e.g. I will provide a library in the language for registering channels). In terms of logging, you can log numerical values, arrays, text, etc., so quite a few types. Each channel will be allowed a single specification (e.g. 2 columns: first an int iteration, second a float error). The code will also provide a special copy of the parameters at the end of the experiment.
When someone submits an experiment, they will need to provide its unique group name plus the parameters for execution. This will launch the experiment and log everything.
For me, this is easiest to implement with a flat file system. Each project will have a unique name. Each new experiment gets a unique id and a folder inside the project. I can store the code there. Each channel gets a file, which for simplicity can be a delimited CSV file, with a special schema file describing what types of values are stored there so I can load them back. The final parameters can also be copied into the folder.
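To make the flat-file layout concrete, here is a minimal sketch of what it might look like. Everything in it is an assumption for illustration: the function names, the `<channel>.schema.json` naming, and the choice of JSON for the schema file are not part of any existing library.

```python
import csv
import json
import tempfile
import uuid
from pathlib import Path


def create_experiment(root: Path, project: str) -> Path:
    """Create <root>/<project>/<unique id>/ and return the new folder."""
    exp_dir = root / project / uuid.uuid4().hex
    exp_dir.mkdir(parents=True)
    return exp_dir


def register_channel(exp_dir: Path, name: str, columns: dict) -> None:
    """Write the schema file for a channel, e.g. {"iteration": "int", "error": "float"}."""
    (exp_dir / f"{name}.schema.json").write_text(json.dumps(columns))
    # Start the CSV with a header row so the file is self-describing.
    with open(exp_dir / f"{name}.csv", "w", newline="") as f:
        csv.writer(f).writerow(columns.keys())


def log_row(exp_dir: Path, name: str, row: list) -> None:
    """Append one row to a channel file."""
    with open(exp_dir / f"{name}.csv", "a", newline="") as f:
        csv.writer(f).writerow(row)


def load_channel(exp_dir: Path, name: str) -> list:
    """Read a channel back, casting each column according to the schema file."""
    casts = {"int": int, "float": float, "str": str}
    schema = json.loads((exp_dir / f"{name}.schema.json").read_text())
    with open(exp_dir / f"{name}.csv", newline="") as f:
        return [
            {col: casts[typ](rec[col]) for col, typ in schema.items()}
            for rec in csv.DictReader(f)
        ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        exp = create_experiment(Path(tmp), "my_project")
        register_channel(exp, "train_error", {"iteration": "int", "error": "float"})
        log_row(exp, "train_error", [1, 0.5])
        log_row(exp, "train_error", [2, 0.25])
        print(load_channel(exp, "train_error"))
```

Appending to a file per channel keeps writes cheap, but a web interface would have to re-read or tail the files to show live updates, which is part of what the question about databases is getting at.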
However, because of the variety of ways I can do this, and the fact that it might require a separate "table" for each experiment, I have no idea whether this is possible in any database system. Maybe I'm overlooking something very obvious; if you have any experience with this, any suggestions or advice are most welcome. The main goal in the end is to be able to serve this to a web interface. Maybe NoSQL could accommodate this, maybe not (I don't know exactly how those systems work).
The data for ML would primarily be unstructured. That kind of data does not naturally fit into an RDBMS. A document database like MongoDB is far better suited to such cases.
I'm developing a distributed application using Ada's Distributed System Annex and PolyORB and I have a somewhat peculiar request.
Let's say I have an RCI (Remote Call Interface) unit called U and a main program called MAIN that uses it.
What I want to know is:
Can I configure DSA to create multiple copies of the partition U?
If the answer is yes, can I then call a specific one of these partitions from my code in MAIN?
I can't find any information on this online. Right now the only solution I can think of is to have a pre-processor generate multiple "copies" of U from a generic template and patch the DSA configuration file accordingly. Is there a less "hacky" way to do this?
After reading all the documentation I could find, I conclude that this is not possible, i.e. it's not contemplated in DSA.
I still think the hacky solution I proposed in my original question would work, however it doesn't make much sense from a practical point of view.
For my application, I decided to switch to SOAP, using the AWS web server and ditching DSA altogether.
EDIT: in response to shark8's comment:
My idea, in detail, was this. Let's assume I have some kind of configuration file in which I describe how many copies of the partitions I want and how each one is configured, here's how I'd do it:
1. Have a pre-processor read the configuration file and generate source files for all the needed partitions from a template (basically this would just create a copy of the files and alter the package name and file name to avoid collisions).
2. Have the pre-processor patch the DSA configuration file accordingly, to generate those partitions.
3. I would then have one final package that acts as a "gateway", or a "selector" if you want, to the other packages. It would contain "generic" procedures that would act as selectors, something like:
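    procedure Some_Procedure (partitionID : Integer; params : whatever) is
    begin
       case partitionID is
          when 1 =>
             Partition1.Some_Procedure (params);
          when 2 =>
             Partition2.Some_Procedure (params);
          --  etc. (one generated branch per partition)
          when others =>
             raise Constraint_Error;  --  unknown partition ID
       end case;
    end Some_Procedure;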
Obviously, like the partitions themselves, the branches of the case statement will have to be generated by the pre-processor. Also, the main program that calls this procedure needs to pass the partition ID, but this is no problem since it will have read the same configuration as the pre-processor did in step 1, so it would know the IDs and which one it wants to call, depending on its internal logic.
As I said, it's a tortuous method, but I think it should work.
The DSA (PolyORB) is very flexible, and you can have multiple unique RCI units. Why would you want to do this? Is it to maximise processing speed? Having used the DSA quite a bit, I would question your motives for doing it this way and perhaps suggest that you change your design.
If it's processing speed, I would suggest you do your processing in multiple client partitions and use the asynchronous messaging mechanism in the DSA. If you need more processing speed, then do your calculation/processing in multiple stages across client partitions.
There's a great example of this messaging mechanism in the banking example application included in the DSA examples.
My goal is to achieve the highest performance available for copying a block of data from the database into a C-function to be processed and returned as the result of a query.
I am new to PostgreSQL and I am currently researching possible ways to move the data. Specifically, I am looking for nuances or keywords related specifically to PostgreSQL to move big data fast.
NOTE:
My ultimate goal is speed, so I am willing to accept answers outside the exact question I have posed as long as they give big performance gains. For example, I have come across the COPY keyword (PostgreSQL only), which moves data from tables to files quickly, and vice versa. I am trying to stay away from processing that is external to the database, but if it provides a performance improvement that outweighs the obvious drawback of external processing, then so be it.
It sounds like you probably want to use the server programming interface (SPI) to implement a stored procedure as a C language function running inside the PostgreSQL back-end.
Use SPI_connect to set up the SPI.
Now SPI_prepare_cursor a query, then SPI_cursor_open it. SPI_cursor_fetch rows from it and SPI_cursor_close it when done. Note that SPI_cursor_fetch allows you to fetch batches of rows.
SPI_finish to clean up when done.
You can return the result rows into a tuplestore as you generate them, avoiding the need to build the whole table in memory. See examples in any of the set-returning functions in the PostgreSQL source code. You might also want to look at the SPI_returntuple helper function.
See also: C language functions and extending SQL.
If maximum speed is of interest, your client may want to use the libpq binary protocol via libpqtypes so it receives the data produced by your server-side SPI-using procedure with minimal overhead.
Here is the scenario:
I am handling a SQL Server database with a stored procedure which takes care of returning headers for Web feed items (RSS/Atom) I am serving as feeds through a web application.
This stored procedure should, when called by the service broker task running at a given interval, verify whether there has been a significant change in the underlying data. In that case, it will trigger the resource-intensive activity of formatting the feed item header through a call to the web application, which will retrieve the data, format it and return it to the SQL database.
There the header would be stored ready for a request for RSS feed update from the client.
Now, trying to design this to be as efficient as possible, I still have a couple of open points I'd like to get your suggestions about.
My tentative approach at the stored procedure would be:
get the data together in an in-memory table,
create a subquery with the signature columns, which change with the information,
convert them to XML with FOR XML AUTO,
hash the result with MD5 (with HASHBYTES or fn_repl_hash_binary, depending on the size of the result),
verify whether the hash matches the one stored in the table where I keep the HTML waiting for feed requests,
if the hash matches, do nothing; otherwise proceed with the updates.
The first doubt is the best way to check if the base data have changed.
Converting to XML significantly inflates the data, which slows hashing, and I am potentially not using the result for anything apart from hashing. Is there a better way to perform the check, or to pack all the data together for hashing (something CSV-like)?
The query merges and aggregates data from multiple tables, so I would not rely on table timestamps, as their change is not necessarily related to a change in the result set.
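To illustrate the "something CSV-like" idea, here is a rough sketch of hashing a result set row by row with a compact encoding instead of serializing it to XML first. It is written in Python purely to show the technique, not as a drop-in replacement for the T-SQL; `result_set_digest` is a hypothetical helper, and it uses SHA-256 where the question uses MD5 (swap in `hashlib.md5` if the digest has to match an existing one). It assumes the rows arrive in a deterministic order, i.e. the query has an ORDER BY.

```python
import hashlib


def result_set_digest(rows) -> str:
    """Hash rows incrementally using a compact, unambiguous encoding."""
    h = hashlib.sha256()
    for row in rows:
        for value in row:
            # NULL gets its own marker so it never collides with an empty string.
            data = b"\x00" if value is None else b"\x01" + str(value).encode("utf-8")
            # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
            h.update(len(data).to_bytes(4, "big"))
            h.update(data)
        h.update(b"\xff\xff\xff\xff")  # row separator
    return h.hexdigest()


if __name__ == "__main__":
    old = result_set_digest([(1, "Title", None), (2, "Other", "x")])
    new = result_set_digest([(1, "Title", None), (2, "Other", "y")])
    print("changed" if old != new else "unchanged")
```

Because the hash is updated incrementally, the full serialized result never has to exist in memory, which is the main cost the XML route adds.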
The second point is: what is the best way to serve the data to the webapp for reformatting?
- I might push the data through a CLR function to the web application to get it formatted (but this is synchronous, and for multiple feed items it would create an unsustainable delay)
or
- I might instead save the result set and trigger multiple asynchronous calls through the service broker. The web app could then retrieve the stored data instead of re-running the expensive query that produced it.
Since I have different formats depending on the feed item category, I cannot use the same table format - so storing to a table is going to be hard.
I might serialize to XML instead.
But is this going to provide any significant gain compared to re-running the query?
For the efficient caching bit, have a look at query notifications. The tricky bit in implementing this in your case is you've stated "significant change" whereas query notifications will trigger on any change. But the basic idea is that your application subscribes to a query. When the results of that query change, a message is sent to the application and it does whatever it is programmed to do (typically refreshing cached data).
As for serving the data to your app, there's a saying in the business: "don't go borrowing trouble". Which is to say if the default method of serving data (i.e. a result set w/o fancy formatting) isn't causing you a problem, don't change it. Change it only if and when it's causing you a significant enough headache that your time is best spent there.
What is people's preferred method of storing application configuration data in a database? Having done this in the past myself, I've used two ways of doing it.
You can create a table where you store key/value pairs, where key is the name of the config option and value is its value. The pros of this are that adding new values is easy and you can use the same routines to set/get data. The downside is that the values are untyped.
Alternatively, you can hardcode a configuration table, with each column being the name of the value and its datatype. The downside to this is more maintenance setting up new values, but it allows you to have typed data.
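For concreteness, the two layouts might look roughly like this. This is only a sketch using SQLite through Python's standard `sqlite3` module; the table names (`config_kv`, `config_typed`) and the settings themselves are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Option 1: generic key/value table; every value is stored as text.
conn.execute("CREATE TABLE config_kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
conn.execute("INSERT INTO config_kv VALUES ('max_users', '50')")
conn.execute("INSERT INTO config_kv VALUES ('site_name', 'Example')")

# Option 2: one hardcoded row with a typed column per setting.
conn.execute(
    "CREATE TABLE config_typed ("
    " max_users INTEGER NOT NULL,"
    " site_name TEXT NOT NULL)"
)
conn.execute("INSERT INTO config_typed VALUES (50, 'Example')")

kv_value = conn.execute(
    "SELECT value FROM config_kv WHERE key = 'max_users'"
).fetchone()[0]
typed_value = conn.execute("SELECT max_users FROM config_typed").fetchone()[0]

print(repr(kv_value))     # a string; the application must convert it itself
print(repr(typed_value))  # already an integer
```

Adding a setting under option 1 is a single INSERT, while under option 2 it is a schema change (ALTER TABLE), which is the maintenance cost mentioned above.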
Having used both, my preference lies with the first option, as it's quicker to set things up; however, it's also riskier and can reduce performance (slightly) when looking up data. Does anyone have any alternative methods?
Update
It's necessary to store the information in a database because as noted below, there may be multiple instances of the program that require configuring the same way, as well as stored procedures potentially using the same values.
You can expand option 1 to have a third column giving a data type. Your application can then use this data-type column to cast the value.
But yeah, I would go with option 1, if config files are not an option. Another advantage of option 1 is you can read it into a Dictionary object (or equivalent) for use in your application really easily.
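A minimal sketch of that idea, assuming a `data_type` column holding a short type name and a hypothetical `CASTS` mapping on the application side (neither is prescribed by anything above), again using SQLite via Python's `sqlite3`:

```python
import sqlite3

# Hypothetical mapping from the data_type column to a Python conversion.
CASTS = {
    "int": int,
    "float": float,
    "bool": lambda s: s.lower() in ("1", "true", "yes"),
    "str": str,
}


def load_config(conn: sqlite3.Connection) -> dict:
    """Read every row and cast each value according to its data_type column."""
    rows = conn.execute("SELECT key, value, data_type FROM config")
    return {key: CASTS[data_type](value) for key, value, data_type in rows}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE config ("
    " key TEXT PRIMARY KEY,"
    " value TEXT NOT NULL,"
    " data_type TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO config VALUES (?, ?, ?)",
    [
        ("max_users", "50", "int"),
        ("timeout_seconds", "2.5", "float"),
        ("debug", "true", "bool"),
        ("site_name", "Example", "str"),
    ],
)

settings = load_config(conn)
print(settings["max_users"] + 1)
```

The whole table ends up in one dictionary with properly typed values, so the rest of the application never touches the raw strings.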
Since configuration typically can be stored in a text file, the string data type should be more than enough to store the configuration values. If you're using a managed language, it's the code that knows what the data type should be, not the database.
More importantly, consider these things with configuration:
Hierarchy: Obviously, configuration will benefit from a hierarchy.
Versioning: Consider the benefit of being able to roll back to the configuration that was in effect at a certain date.
Distribution: At some point, it might be nice to be able to cluster an application. Some properties should probably be local to each node in a cluster.
Documentation: Depending on whether you have a web tool or something similar, it is probably nice to store the documentation about a property close to the code that uses it. (Code annotations are very nice for this.)
Notification: How is the code going to know that a change has been made somewhere in the configuration repository?
Personally, I like an inverted way of handling configuration, where the configuration properties are injected into the modules, which don't know where the values came from. This way, the configuration management system can be very complex or very simple depending on your (current) needs.
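As a rough illustration of the inverted approach, assuming a made-up `Mailer` module and `MailerSettings` class (hypothetical names, not from any framework), the module only sees a settings object and one small function knows where the values come from:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MailerSettings:
    smtp_host: str
    port: int = 25  # the default lives in code


class Mailer:
    """Receives its settings; has no idea whether they came from a file or a database."""

    def __init__(self, settings: MailerSettings) -> None:
        self.settings = settings

    def describe(self) -> str:
        return f"sending via {self.settings.smtp_host}:{self.settings.port}"


def settings_from_mapping(values: dict) -> MailerSettings:
    """The only place that knows about the configuration source."""
    return MailerSettings(**values)


mailer = Mailer(settings_from_mapping({"smtp_host": "mail.example.com"}))
print(mailer.describe())
```

Swapping a text file for a database table later only changes what produces the mapping passed to `settings_from_mapping`; `Mailer` itself stays untouched.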
I use option 1.
My project uses a database table with four columns:
ID [pk]
Scope (default 'Application')
Setting
Value
Settings with a Scope of 'Application' are global settings, such as Maximum number of simultaneous users.
Each module has its own scope, so our ResultsLoader and UserLoader have different scopes, but both have a Setting named 'inputPath'.
Defaults are either provided in the source code or are injected via our IoC container. If no value is injected or provided in the database, the default from the code is used (if one exists). Therefore, defaults are never stored in the database.
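A minimal sketch of such a table and lookup, using SQLite via Python's `sqlite3`. The `get_setting` helper, the UNIQUE constraint on (scope, setting) and the sample values are assumptions added for illustration; only the four columns come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE settings ("
    " id INTEGER PRIMARY KEY,"
    " scope TEXT NOT NULL DEFAULT 'Application',"
    " setting TEXT NOT NULL,"
    " value TEXT NOT NULL,"
    " UNIQUE (scope, setting))"  # assumption: one value per scope/setting pair
)
conn.executemany(
    "INSERT INTO settings (scope, setting, value) VALUES (?, ?, ?)",
    [
        ("ResultsLoader", "inputPath", "/data/results"),
        ("UserLoader", "inputPath", "/data/users"),
    ],
)
# No scope given, so the column default 'Application' applies.
conn.execute("INSERT INTO settings (setting, value) VALUES ('maxUsers', '50')")


def get_setting(scope: str, setting: str, default=None):
    """Return the stored value, or the code-supplied default if no row exists."""
    row = conn.execute(
        "SELECT value FROM settings WHERE scope = ? AND setting = ?",
        (scope, setting),
    ).fetchone()
    return row[0] if row else default


print(get_setting("ResultsLoader", "inputPath"))
print(get_setting("UserLoader", "batchSize", default=100))
```

Because the default is passed in by the caller, nothing has to be stored in the database for settings that are never overridden.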
This works out quite well for us. Each time we back up the database we get a copy of the configuration, which is quite handy. The two are always in sync.
It seems overkill to use the DB for config data.
EDIT (sorry too long for comment box):
Of course there are no strict rules on how you implement any part of your program. For the sake of argument, slotted screwdrivers work on some Phillips screws! I guess I judged too early, before knowing what your scenario is.
Relational databases excel at storing massive amounts of data with quick inserts, updates, and retrieval, so if your config data is updated and read constantly, then by all means use a db.
Another scenario where a db may make sense is when you have a server farm and want your database to store your central config, but then you can do the same with a shared network drive that holds the XML config file.
An XML file is better when your config is hierarchically structured. You can easily organize, locate, and update what you need, and as a bonus you can version control the config file along with your source code!
All in all, it all depends on how the config data is used.
That concludes my opinion with limited knowledge of your application. I am sure you can make the right decision.
I guess this is more of a poll, so I'll say the column approach (option 2). However it will depend on how often your config changes, how dynamic it is, and how much data there is, etc.
I'd certainly use this approach for user configurations / preferences, etc.
Go with option 2.
Option 1 is really a way of implementing a database on top of a database, and that is a well-known antipattern, which is just going to give you trouble in the long run.
I can think of at least two more ways:
(a) Create a table with key, string-value, date-value, int-value, real-value columns. Leave unused types NULL.
(b) Use a serialization format like XML, YAML or JSON and store it all in a blob.
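Variant (a) could look something like the following sketch (SQLite via Python's `sqlite3`; the `get_value` helper and the sample rows are made up for illustration). SQLite has no dedicated date type, so the date column is assumed to hold ISO 8601 text here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE config (
        key          TEXT PRIMARY KEY,
        string_value TEXT,
        date_value   TEXT,    -- assumed ISO 8601 text, e.g. '2024-01-31'
        int_value    INTEGER,
        real_value   REAL
    )
    """
)
conn.execute("INSERT INTO config (key, int_value) VALUES ('max_users', 50)")
conn.execute("INSERT INTO config (key, real_value) VALUES ('timeout_seconds', 2.5)")
conn.execute("INSERT INTO config (key, string_value) VALUES ('site_name', 'Example')")


def get_value(key: str):
    """Return whichever typed column is populated for this key (others are NULL)."""
    row = conn.execute(
        "SELECT string_value, date_value, int_value, real_value"
        " FROM config WHERE key = ?",
        (key,),
    ).fetchone()
    if row is None:
        return None
    return next((v for v in row if v is not None), None)


print(get_value("max_users"), get_value("timeout_seconds"), get_value("site_name"))
```

The values keep their database types, at the cost of mostly empty columns and nothing stopping a row from filling in two of them unless a CHECK constraint is added.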
Where do you store the configuration settings your app needs to connect to the database?
Why not store the other config info there too?
I'd go with option 1, unless the number of config options were VERY small (seven or less)
At my company, we're working on using option one (a simple dictionary-like table) with a twist. We're allowing for string substitution using tokens which contain the name of the config variable to be substituted.
For example, the table might contain rows ('database connection string', 'jdbc://%host%...') and ('host', 'foobar'). Encapsulating that with a simple service or stored procedure layer allows for an extremely simple, but flexible, recursive configuration. It supports our need to have multiple isolated environments (dev, test, prod, etc).
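The substitution step itself can be quite small. Here is a sketch of one way to do it, with the table modelled as a plain dictionary; the `resolve` function, the `db_url`/`port` entries and the rule that token names consist of word characters only are all assumptions made for the example, not a description of our actual service layer.

```python
import re

# Assumption: tokens look like %name% and names contain only word characters.
TOKEN = re.compile(r"%(\w+)%")


def resolve(key: str, table: dict, _seen: tuple = ()) -> str:
    """Return table[key] with every %name% token replaced, recursively."""
    if key in _seen:
        raise ValueError("circular reference: " + " -> ".join(_seen + (key,)))
    return TOKEN.sub(
        lambda m: resolve(m.group(1), table, _seen + (key,)),
        table[key],
    )


# Hypothetical rows, as they might be read from the key/value table.
config = {
    "db_url": "jdbc://%host%:%port%/app",
    "host": "foobar",
    "port": "5432",
}

print(resolve("db_url", config))
```

Each environment (dev, test, prod) then only needs to override the leaf values such as `host`, and the composite values follow automatically. A token that refers to a missing key raises `KeyError` in this sketch.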
I've used both 1 and 2 in the past, and I think they're both terrible solutions. I think Option 2 is better because it allows typing, but it's a lot more ugly than option 1. The biggest problem I have with either is versioning the config file. You can version SQL reasonably well using standard version control systems, but merging changes is usually problematic. Given an opportunity to do this "right", I'd probably create a bunch of tables, one for each type of configuration parameter (not necessarily for each parameter itself), thus getting the benefit of typing and the benefit of the key/value paradigm where appropriate. You can also implement more advanced structures this way, such as lists and hierarchies, which will then be directly queryable by the app instead of having to load the config and then transform it somehow in memory.
I vote for option 2. Easy to understand and maintain.
Option 1 is good for an easily expandable, central storage location. In addition to some of the great column suggestions by folks like RB, Hugo, and elliott, you might also consider:
Include a Global/User setting flag with a user field or even a user/machine field (for machine-specific UI type settings).
Those can, of course, be stored in a local file, but since you are using the database anyway, that makes these available for aliasing a user when debugging, which can be important if the bug is setting-related. It also allows an admin to manage settings when necessary.
I use a mix of option 2 and XML columns in SQL server.
You may also want to add a check constraint to keep the table at one row.
    CREATE TABLE [dbo].[MyOption] (
        [GUID] uniqueidentifier CONSTRAINT [dfMyOptions_GUID] DEFAULT newsequentialid() ROWGUIDCOL NOT NULL,
        [Logo] varbinary(max) NULL,
        [X] char(1) CONSTRAINT [dfMyOptions_X] DEFAULT 'X' NOT NULL,
        CONSTRAINT [MyOptions_pk] PRIMARY KEY CLUSTERED ([GUID]),
        -- The CHECK pins [X] to a single value; UNIQUE on [X] then allows only one row.
        CONSTRAINT [MyOptions_ck] CHECK ([X]='X'),
        CONSTRAINT [MyOptions_uq] UNIQUE ([X])
    )
For settings that have no relation to any db tables, I'd probably go for the EAV approach if you need the db to work with the values. Otherwise a serialized field value is good if it's really just a store for app code.
But what about a format for a single field to store multiple config settings to be used by the db?
Like one field per user that contains all their settings related to their messageboard view (default sort order, blocked topics, etc.), and maybe another with all their settings for their theme (text color, bg color, etc.).
Storing hierarchies and documents in a relational DB is madness. Either you have to shred them, only to recombine them at some later stage, or they're bunged inside a BLOB, which is even more stupid.
Don't use a relational db for non-relational data; the tool does not fit. Consider something like MongoDB or CouchDB for this: schema-less, non-relational data stores. Store it as JSON if it's coming down the wire in any way to a client; use XML for the server side.
CouchDB gives you versioning out of the box.
Don't store configuration data in a database unless you have a very good reason to. If you do have a very good reason, and are absolutely certain you are going to do it, you should probably store it in a data serialization format like JSON or YAML (not XML, unless you actually need a markup language to configure your app -- trust me, you don't) as a string. Then you can just read the string, and use tools in whatever language you work in to read and modify it. Store the strings with timestamps, and you have a simple versioning scheme with the ability to store hierarchical data in a very simple system. Even if you don't need hierarchical config data, at least now if you need it in the future you won't have to change your config interface to get it. Of course you lose the ability to do relational queries on your config data, but if you're storing that much config data, then you're probably doing something very wrong anyway.
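If you do go down that road, the timestamped-string scheme could be as small as the following sketch (SQLite via Python's `sqlite3` and the standard `json` module; the table name, the function names and the sample settings are illustrative assumptions). Timestamps are assumed to be ISO 8601 strings so that they sort correctly as text.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE config_versions (saved_at TEXT PRIMARY KEY, body TEXT NOT NULL)"
)


def save_config(config: dict, saved_at: str) -> None:
    """Append a new version; older versions are never overwritten."""
    conn.execute(
        "INSERT INTO config_versions VALUES (?, ?)",
        (saved_at, json.dumps(config)),
    )


def load_config(as_of: str = "9999") -> dict:
    """Return the newest version saved at or before the given ISO 8601 timestamp."""
    row = conn.execute(
        "SELECT body FROM config_versions WHERE saved_at <= ?"
        " ORDER BY saved_at DESC LIMIT 1",
        (as_of,),
    ).fetchone()
    return json.loads(row[0]) if row else {}


save_config({"cache": {"size_mb": 64}}, "2024-01-01T00:00:00")
save_config({"cache": {"size_mb": 128}}, "2024-06-01T00:00:00")

print(load_config())                       # latest version
print(load_config("2024-03-01T00:00:00"))  # what was in effect on that date
```

Rolling back is just a matter of reading an older row, and hierarchical settings come for free because the body is ordinary JSON.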
Companies tend to store lots of configuration data for their systems in a database. I'm not sure why; I don't think much thought goes into these decisions. I don't see this kind of thing done too often in the OSS world. Even large OSS programs that need lots of configuration, like Apache, don't need a connection to a database containing an apache_config table to work. Having a huge amount of configuration to deal with in your apps is a bad code smell, and storing that data in a database just causes more problems (as this thread illustrates).