So I'm thinking to write some small piece of software, which to run/execute ML experiments on a cluster or arbitrary abstracted executor and then save them such that I can view them in real time efficiently. The executor software will have access for writing to the database and will push metrics live. Now, I have not worked too much with databases, thus I'm not sure what is the correct approach for this. Here is a description of what the system should store:
Each experiment will consist of a single piece of code/archive of code such that it can be executed on the remote machine. For now we will assume allow dependencies and etc are installed there. The code will accept command line arguments. The experiment also will consists of a YAML scheme defining the command line arguments. In the code byitself will specify what will be logged in (e.g. I will provide a library in the language for registering channels). Now in terms of logging, you can log numerical values, arrays, text, etc so quite a few types. Each channel will be allowed a single specification (e.g. 2 columns, first int iteration, second float error). The code will also provide special copy of parameters at the end of the experiments.
When one submit an experiments, it will need to provide its unique group name + parameters for execution. This will launch the experiment and log everything.
Implementing this for me is easiest to do with a flat file system. Each project will have a unique name. Each new experiment gets a unique id and folder inside the project. I can store the code there. Each channel gets a file, which for simplicity can be an csv delimeter, with a special schema file describing what type of values are stored there so I can load them there. The final parameters can also be copied in the folder.
However, because of the variety of ways I can do this, and the fact that this might require a separate "table" for each experiment, I have no idea if this is possible in any database systems? Additionally, maybe I'm overseeing something very obvious or maybe not, if you had any experience with this any suggestions/advices are most welcome. The main goal is at the end to be able to serve this to a web interface. Maybe noSQL could accommodate this maybe not (I don't know exactly how those work)?
The data for ML primarily would be unstructured data. That kind of data will not naturally fit into a RDBMS. Essentially a document database like mongodb is far better suited....for such cases.
Related: How to store lightweight formatting (Textile, Markdown) in database?
I want to store comment formatting in some markup language in our DB. However, we want to allow multiple formatting languages (markdown, textile, restructuredText). It seems we should store a superset of their features, so that we can convert between them.
Will this work?
Is there such a superset?
Are there libraries to switch between them?
Is there a more structured format we should keep comments in, in the DB?
(Python/Google App Engine if it matters)
Have you considered something simpler: storing the comments in their original form, together with an extra column saying which format it is stored in (markdown, textile, etc...)?
I would think that any superset is either going to result in some loss of information by storing only one of the many possible different ways the syntax can be written in a specific markup or else it will be too complicated as it tries to allow for all the possible encodings of a specific syntax in all the allowable markups.
I work for a billing service that uses some complicated mainframe-based billing software for it's core services. We have all kinds of codes we set up that are used for tracking things: payment codes, provider codes, write-off codes, etc... Each type of code has a completely different set of data items that control what the code does and how it behaves.
I am tasked with building a new system for tracking changes made to these codes. We want to know who requested what code, who/when it was reviewed, approved, and implemented, and what the exact setup looked like for that code. The current process only tracks two of the different types of code. This project will add immediate support for a third, with the goal of also making it easy to add additional code types into the same process at a later date. My design conundrum is that each code type has a different set of data that needs to be configured with it, of varying complexity. So I have a few choices available:
I could give each code type it's own table(s) and build them independently. Considering we only have three codes I'm concerned about at the moment, this would be simplest. However, this concept has already failed or I wouldn't be building a new system in the first place. It's also weak in that the code involved in writing generic source code at the presentation level to display request data for any code type (even those not yet implemented) is not trivial.
Build a db schema capable of storing the data points associated with each code type: not only values, but what type they are and how they should be displayed (dropdown list from an enum of some kind). I have a decent db schema for this started, but it just feels wrong: overly complicated to query and maintain, and it ultimately requires a custom query to view full data in nice tabular for for each code type anyway.
Storing the data points for each code request as xml. This greatly simplifies the database design and will hopefully make it easier to build the interface: just set up a schema for each code type. Then have code that validates requests to their schema, transforms a schema into display widgets and maps an actual request item onto the display. What this item lacks is how to handle changes to the schema.
My questions are: how would you do it? Am I missing any big design options? Any other pros/cons to those choices?
My current inclination is to go with the xml option. Given the schema updates are expected but extremely infrequent (probably less than one per code type per 18 months), should I just build it to assume the schema never changes, but so that I can easily add support for a changing schema later? What would that look like in SQL Server 2000 (we're moving to SQL Server 2005, but that won't be ready until after this project is supposed to be completed)?
[Update]:
One reason I'm thinking xml is that some of the data will be complex: nested/conditional data, enumerated drop down lists, etc. But I really don't need to query any of it. So I was thinking it would be easier to define this data in xml schemas.
However, le dorfier's point about introducing a whole new technology hit very close to home. We currently use very little xml anywhere. That's slowly changing, but at the moment this would look a little out of place.
I'm also not entirely sure how to build an input form from a schema, and then merge a record that matches that schema into the form in an elegant way. It will be very common to only store a partially-completed record and so I don't want to build the form from the record itself. That's a topic for a different question, though.
Based on all the comments so far Xml is still the leading candidate. Separate tables may be as good or better, but I have the feeling that my manager would see that as not different or generic enough compared to what we're currently doing.
There is no simple, generic solution to a complex, meticulous problem. You can't have both simple storage and simple app logic at the same time. Either the database structure must be complex, or else your app must be complex as it interprets the data.
I outline five solution to this general problem in "product table, many kind of product, each product have many parameters."
For your situation, I would lean toward Concrete Table Inheritance or Serialized LOB (the XML solution).
The reason that XML might be a good solution is that:
You don't need to use SQL to pick out individual fields; you're always going to display the whole form.
Your XML can annotate fields for data type, user interface control, etc.
But of course you need to add code to parse and validate the XML. You should use an XML schema to help with this. In which case you're just replacing one technology for enforcing data organization (RDBMS) with another (XML schema).
You could also use an RDF solution instead of an RDBMS. In RDF, metadata is queriable and extensible, and you can model entities with "facts" about them. For example:
Payment code XYZ contains attribute TradeCredit (Net-30, Net-60, etc.)
Attribute TradeCredit is of type CalendarInterval
Type CalendarInterval is displayed as a drop-down
.. and so on
Re your comments: Yeah, I am wary of any solution that uses XML. To paraphrase Jamie Zawinski:
Some people, when confronted with a problem, think "I know, I'll use XML." Now they have two problems.
Another solution would be to invent a little Domain-Specific Language to describe your forms. Use that to generate the user-interface. Then use the database only to store the values for form data instances.
Why do you say "this concept has already failed or I wouldn't be building a new system in the first place"? Is it because you suspect there must be a scheme for handling them in common?
Else I'd say to continue the existing philosophy, and establish additional tables. At least it would be sharing an existing pattern and maintaining some consistency in that respect.
Do a web search on "generalized specialized relational modeling". You'll find articles on how to set up tables that store the attributes of each kind of code, and the attributes common to all codes.
If you’re interested in object modeling, just search on “generalized specialized object modeling”.
What is people's prefered method of storing application configuration data in a database. From having done this in the past myself, I've utilised two ways of doing it.
You can create a table where you store key/value pairs, where key is the name of the config option and value is its value. Pro's of this is adding new values is easy and you can use the same routines to set/get data. Downsides are you have untyped data as the value.
Alternatively, you can hardcode a configuration table, with each column being the name of the value and its datatype. The downside to this is more maintenance setting up new values, but it allows you to have typed data.
Having used both, my preferences lie with the first option as its quicker to set things up, however its also riskier and can reduce performance (slightly) when looking up data. Does anyone have any alternative methods?
Update
It's necessary to store the information in a database because as noted below, there may be multiple instances of the program that require configuring the same way, as well as stored procedures potentially using the same values.
You can expand option 1 to have a 3rd column, giving a data-type. Your application can than use this data-type column to cast the value.
But yeah, I would go with option 1, if config files are not an option. Another advantage of option 1 is you can read it into a Dictionary object (or equivalent) for use in your application really easily.
Since configuration typically can be stored in a text file, the string data type should be more than enough to store the configuration values. If you're using a managed language, it's the code that knows what the data type should be, not the database.
More importantly, consider these things with configuration:
Hierarchy: Obviously, configuration will benefit from a
hierarchy
Versioning: Consider the benefit of being able to roll back to the configuration that was in effect at a certain date.
Distribution: Some time, it might be nice to be able to cluster an application. Some properties should probably be local to each node in a cluster.
Documentation: Depending on if you have a web tool or something, it is probably nice to store the documentation about a property close to the code that uses it. (Code annotations is very nice for this.)
Notification: How is the code going to know that a change has been made somewhere in the configuration repository?
Personally, i like an inverted way of handling configuration, where the configuration properties is injected into the modules which don't know where the values came from. This way, the configuration management system can be very complex or very simple depending on your (current) needs.
I use option 1.
My project uses a database table with four columns:
ID [pk]
Scope (default 'Application')
Setting
Value
Settings with a Scope of 'Application' are global settings, such as Maximum number of simultaneous users.
Each module has its own scope based; so our ResultsLoader and UserLoader have different scopes, but both have a Setting named 'inputPath'.
Defaults are either provided in the source code or are injected via our IoC container. If no value is injected or provided in the database, the default from the code is used (if one exists). Therefore, defaults are never stored in the database.
This works out quite well for us. Each time we backup the database we get a copy of the Configuration which is quite handy. The two are always in sync.
It seems overkill to use the DB for config data.
EDIT (sorry too long for comment box):
Of course there's no strict rules on how you implement any part of your program. For the sake of argument, slotted screwdrivers work on some philips screws! I guess I judged too early before knowing what your scenario is.
Relational database excels in massive data store that gives you quick storing, updating, and retrieval, so if your config data is updated and read constantly, then by all means use db.
Another scenario where db may make sense is when you have a server farm where you want your database to store your central config, but then you can do the same with a shared networked drive that point to the xml config file.
XML file is better when your config is hierarchically structured. You can easily organize, locate, and update what you need, and for bonus benefit you can version control the config file along with your source code!
All in all, it all depends on how the config data is used.
That concludes my opinion with limited knowledge of your application. I am sure you can make the right decision.
I guess this is more of a poll, so I'll say the column approach (option 2). However it will depend on how often your config changes, how dynamic it is, and how much data there is, etc.
I'd certainly use this approach for user configurations / preferences, etc.
Go with option 2.
Option 1 is really a way of implenting a database on top of a database, and that is a well-known antipattern, which is just going to give you trouble in the long run.
I can think of at least two more ways:
(a) Create a table with key, string-value, date-value, int-value, real-value columns. Leave unused types NULL.
(b) Use a serialization format like XML, YAML or JSON and store it all in a blob.
Where do you you store the configuration settings your app needs to connect to the database?
Why not store the other config info there too?
I'd go with option 1, unless the number of config options were VERY small (seven or less)
At my company, we're working on using option one (a simple dictionary-like table) with a twist. We're allowing for string substitution using tokens which contain the name of the config variable to be substituted.
For example, the table might contain rows ('database connection string', 'jdbc://%host%...') and ('host', 'foobar'). Encapsulating that with a simple service or stored procedure layer allows for an extremely simple, but flexible, recursive configuration. It supports our need to have multiple isolated environments (dev, test, prod, etc).
I've used both 1 and 2 in the past, and I think they're both terrible solutions. I think Option 2 is better because it allows typing, but it's a lot more ugly than option 1. The biggest problem I have with either is versioning the config file. You can version SQL reasonably well using standard version control systems, but merging changes is usually problematic. Given an opportunity to do this "right", I'd probably create a bunch of tables, one for each type of configuration parameter (not necessarily for each parameter itself), thus getting the benefit of typing and the benefit of the key/value paradigm where appropriate. You can also implement more advanced structures this way, such as lists and hierarchies, which will then be directly queryable by the app instead of having to load the config and then transform it somehow in memory.
I vote for option 2. Easy to understand and maintain.
Option 1 is good for an easily expandable, central storage location. In addition to some of the great column suggestions by folks like RB, Hugo, and elliott, you might also consider:
Include a Global/User setting flag with a user field or even a user/machine field (for machine-specific UI type settings).
Those can, of course, be stored in a local file, but since you are using the database anyway, that makes these available for aliasing a user when debugging - which can be important if the bug is setting related. It also allows an admin to manage setings when necessary.
I use a mix of option 2 and XML columns in SQL server.
You may also wan't to add a check constraint to keep the table at one row.
CREATE TABLE [dbo].[MyOption] (
[GUID] uniqueidentifier CONSTRAINT [dfMyOptions_GUID] DEFAULT newsequentialid() ROWGUIDCOL NOT NULL,
[Logo] varbinary(max) NULL,
[X] char(1) CONSTRAINT [dfMyOptions_X] DEFAULT 'X' NOT NULL,
CONSTRAINT [MyOptions_pk] PRIMARY KEY CLUSTERED ([GUID]),
CONSTRAINT [MyOptions_ck] CHECK ([X]='X')
)
for settings that have no relation to any db tables, i'd probably go for the EAV approach if you need the db to work with the values. otherwise a serialized field value is good if it's really just a store for app code.
but what about a format for a single field to store multiple config settings to be used by the db?
like one field per user that contains all their settings related to their messageboard view (like default sort order, blocked topics, etc.), and maybe another with all their settings for their theme (like text color, bg color, etc.)
Storing hierarchy and documents in a relational DB is madness. Firstly you either have to shred them, only to recombine them at some later stage. Or there bunged inside a BLOB, even more stupid.
Don't use use a relational db for non-relational data, the tool does not fit. Consider something like MongoDB or CouchDB for this. Schema-less no-relational data stores. Store it as JSON if it's coming down the wire in any way to a client, use XML for serverside.
CouchDB gives you versioning out of the box.
Don't store configuration data in a database unless you have a very good reason to. If you do have a very good reason, and are absolutely certain you are going to do it, you should probably store it in a data serialization format like JSON or YAML (not XML, unless you actually need a markup language to configure your app -- trust me, you don't) as a string. Then you can just read the string, and use tools in whatever language you work in to read and modify it. Store the strings with timestamps, and you have a simple versioning scheme with the ability to store hierarchical data in a very simple system. Even if you don't need hierarchical config data, at least now if you need it in the future you won't have to change your config interface to get it. Of course you lose the ability to do relational queries on your config data, but if you're storing that much config data, then you're probably doing something very wrong anyway.
Companies tend to store lots configuration data for their systems in a database, I'm not sure why, I don't think much thought goes into these decisions. I don't see this kind of thing done too often in the OSS world. Even large OSS programs that need lots of configuration like Apache don't need a connection to a database containing an apache_config table to work. Having a huge amount of configuration to deal with in your apps is a bad code smell, storing that data in a database just causes more problems (as this thread illustrates).
I have the following data in my db:
ID Name Region
100 Sam North
101 Dam South
102 Wesson East
...
...
...
Now, the region will be different in different languages. I want the Sorting to happen right, based on the display value rather than the internal value.
Any Ideas? (And yeah, sorting in memory using Java is not an option!)
Unfortunately, if you want to do that in the database, you will have to store the internationalized versions as well in the database (or at least their order). Otherwise, how could the database engine possibly know how to sort?
You will have to create a table with three columns: the English version, the language code and the translated version (with the English version and the language code together being the primary key). Then join to this table in in you query by using the English word and a language code and then sort on the internationalized version.
The best approach to use internationalization is to remove the internationalization from your application and keep it in separated i18n databases. In your application you keep a key that can be used to access those separated databases, normally xml or yml.
As rule of thumb i would suggest:
keep all database in one format, one place
extract internationalization strings from your application
lets your application to pull i18n strings from your i18n from your internationalization database.
You can check the RAILS approach to i18n. It's simply, clean and easy to use.
What do you mean "Display Value"? I take it you are somehow converting that into a different language in java itself, and don't store that value (the localised one, I assume) in the db? If so, you're kind of screwed. You need to get that data into the DB to be able to sort with it there.