What markup language to store in a DB? - database

Related: How to store lightweight formatting (Textile, Markdown) in database?
I want to store comment formatting in some markup language in our DB. However, we want to allow multiple formatting languages (Markdown, Textile, reStructuredText). It seems we should store a superset of their features, so that we can convert between them.
Will this work?
Is there such a superset?
Are there libraries to switch between them?
Is there a more structured format we should keep comments in, in the DB?
(Python/Google App Engine if it matters)

Have you considered something simpler: storing the comments in their original form, together with an extra column saying which format it is stored in (markdown, textile, etc...)?
I would think that any superset will either lose information, by storing only one of the many ways a construct can be written in a specific markup, or become too complicated as it tries to allow for every possible encoding of a construct in all the allowed markups.
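A minimal sketch of that "original text plus a format column" idea, using the standard-library sqlite3 module purely for illustration (the question is about App Engine, so the table name, column names and the save_comment/load_comment helpers are all assumptions, not anyone's actual schema):

```python
import sqlite3

# Hypothetical schema: the raw comment text plus a column naming its markup.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    " id INTEGER PRIMARY KEY,"
    " body TEXT NOT NULL,"
    " markup TEXT NOT NULL CHECK (markup IN ('markdown', 'textile', 'rst'))"
    ")"
)


def save_comment(body, markup):
    """Store the comment exactly as the user typed it, tagged with its format."""
    cur = conn.execute(
        "INSERT INTO comments (body, markup) VALUES (?, ?)", (body, markup)
    )
    return cur.lastrowid


def load_comment(comment_id):
    """Return (body, markup); the caller picks the renderer for that markup."""
    return conn.execute(
        "SELECT body, markup FROM comments WHERE id = ?", (comment_id,)
    ).fetchone()


cid = save_comment("*hello* world", "markdown")
print(load_comment(cid))
```

Nothing is converted on the way in, so no information is lost; rendering is chosen per row at display time.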


How to store "meta" source code in a database

I would like to store a computer program in a database instead of a number of text files. It should contain the structure and all objects, methods, dependencies etc. of the program. I do not want to store a specific language in the database but some kind of "meta" programming language. In a second step I would like to transform/export this structure in the database into either source code of a classic language (C#, Java, etc.) or compile it directly for CLR/JVM.
I think I am not the first person with this idea. I searched the internet and I think what I am looking for is called "source code in a database (SCID)" - unfortunately I could not find an implementation of this idea.
So my question is:
Is there any program that stores "meta" program code inside a database and lets you generate traditional text source code from it that can be compiled/executed?
Short remarks:
- It can also be a noSQL database
- I currently don't care how the program is imported/entered into the database
It sounds like you're looking for some kind of common markup language that adequately describes the common semantics of each target language - e.g. objects, functions, inputs, return values, etc.
This is less about storing in a database, and more about having a single (I imagine, XML-like) structure that can subsequently be parsed and eval'd by the target language to produce native source/bytecode. If there was such a thing, storing it in a database would be trivial -- that's not the hard part. Even a key/value database could handle that.
The hard part will be finding something that can abstract away the nuances of multiple languages and attempt to describe them in a common format.
Similar questions have already been asked, without satisfying solutions.
It may be that you don't need the full source, but instead just a description of the runtime data-- formats like XML and JSON are intended exactly for this purpose and provide a simplified description of Objects that can be parsed and mapped to native equivalents, with your source code built around the dynamic parsing of that data.
It may be possible to go further in certain languages. For example, if your language of choice converts to bytecode first, you might technically be able to store the binary bytecode in a BLOB and then run it directly. Languages that offer reflection and dynamic evaluation can probably handle this -- then your DB is simply a wrapper for storing that data on compilation, and retrieving it prior to running it. That'd depend on your target language and how compilation is handled.
Of course, if you're only working with interpreted languages, you can simply store the full source and eval it (in whatever manner is preferred by the target language).
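To make the last two paragraphs concrete, here is a rough Python sketch of storing both the source text and its compiled bytecode in a BLOB, then running it straight from the database. The programs table and the store_program/run_program helpers are invented for the example. Note that the marshal format is specific to the Python version that wrote it, which is why the source is kept alongside, and that this should only ever be done with code you trust.

```python
import marshal
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE programs (name TEXT PRIMARY KEY, source TEXT, bytecode BLOB)"
)


def store_program(name, source):
    """Compile the source once and keep both the text and the bytecode."""
    code = compile(source, name, "exec")
    conn.execute(
        "INSERT INTO programs VALUES (?, ?, ?)",
        (name, source, marshal.dumps(code)),
    )


def run_program(name):
    """Load the stored bytecode and execute it in a fresh namespace."""
    (blob,) = conn.execute(
        "SELECT bytecode FROM programs WHERE name = ?", (name,)
    ).fetchone()
    namespace = {}
    exec(marshal.loads(blob), namespace)  # trusted code only
    return namespace


store_program("greet", "def greet(who):\n    return 'hello ' + who\n")
ns = run_program("greet")
print(ns["greet"]("world"))  # prints: hello world
```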
If you give more info on your intended use case, I'm sure you'll get some decent suggestions on how to handle it without having to invent a sourcecode Babelfish.

The best/recommended way to translate Django database values

I'm trying to figure out the best way to translate actual database values (textual strings, not date formats or anything complicated) when internationalizing my Django application. The most logical ways I could come up with were to either:
hold a database column for every language (e.g. description_en, description_de, description_fr, ...) or
have a different database for every language (e.g. schema_en, schema_fr, schema_de, ...).
Are these the best options, or is there something else I'm missing? Thanks.
I was reading up on my django extensions, and found the django-modeltranslation plugin. It seems to do exactly what you want it to do.
I also found this small project, whose purpose is to synchronize localized strings into standard message files for the fields of registered models.
Example:
import vinaigrette
vinaigrette.register(YourModel, ['name', 'description'])
The standard command
$ manage.py makemessages
would then maintain messages for each distinct value found in the registered fields.
I have not had occasion to try it yet, but it seems to me to be the simplest way to translate data from the database.

Internationalization on database level

Could anyone point me to some patterns addressing internationalization on the database level tasks?
The simplest way would be to add a text column per language for every text column, but that is somewhat smelly; what I really want is the ability to add supported languages dynamically.
The solution I'm arriving at is to keep one main language in the model, plus a dictionary entity that is queried for translations and that new translations are saved to.
All I want is to hear from other people who have done this.
You could create a table that has three columns: target language code, original string, translated string. The index on the table would be on the first two columns, but I wouldn't bind this table to other tables with foreign keys. You'd need to add a join (possibly a left join to account for missing translations) for each of the terms you need to translate in each query you run. However, this will make all your queries very hairy and possibly kill performance as well.
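As a rough illustration of that three-column table and the left join, here is a sketch using sqlite3. The products table and its data are made up for the example, and COALESCE is my own addition to fall back to the original string when no translation exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE translations (
    lang TEXT NOT NULL,        -- target language code
    original TEXT NOT NULL,    -- original string
    translated TEXT NOT NULL,  -- translated string
    PRIMARY KEY (lang, original)
);
INSERT INTO products (name) VALUES ('Chair'), ('Table');
INSERT INTO translations VALUES ('de', 'Chair', 'Stuhl');
""")


def product_names(lang):
    """Return product names in `lang`, falling back to the original string."""
    rows = conn.execute(
        "SELECT COALESCE(t.translated, p.name)"
        " FROM products p"
        " LEFT JOIN translations t"
        "   ON t.original = p.name AND t.lang = ?"
        " ORDER BY p.id",
        (lang,),
    )
    return [name for (name,) in rows]


print(product_names("de"))  # prints: ['Stuhl', 'Table']
```

Every additional translatable column needs another join like this one, which is where the "hairy queries" concern comes from.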
Another thing you need to be aware of is actually translating the terms and maintaining an up-to-date translation table. This is very inconvenient to do directly against the database and is often done by non-technical people.
Usually when localizing an application you'd use something like gettext. The idea behind this suite of tools is to parse source code to extract strings for translation and then create translation files from them. Since this suite has been around for a long time, there are a lot of different utilities based on it that help with the translation task, one of which is Poedit, a nice GUI editor for translating strings into different languages. It might be simpler to generate the unique list of terms as they appear in the database in a format gettext can parse, and do the translation in the application code. This way you'd be able to translate the hard-coded strings in the application and the database values using the same technique.
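One way to picture the "generate the unique list of terms" step is a small exporter that writes the distinct database values as PO-style msgid entries. This is only a sketch: export_pot and po_escape are hypothetical helpers, the products table is invented, and a real template would also carry the usual PO header entry, which is left out here.

```python
import sqlite3


def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def export_pot(conn, table, column):
    """Build PO-style entries for each distinct non-empty value of a column.

    `table` and `column` are interpolated into the SQL, so they must be
    trusted identifiers, never user input.
    """
    rows = conn.execute(
        f"SELECT DISTINCT {column} FROM {table} ORDER BY {column}"
    )
    entries = [
        f'msgid "{po_escape(value)}"\nmsgstr ""\n' for (value,) in rows if value
    ]
    return "\n".join(entries)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?)", [("Chair",), ("Table",), ("Chair",)]
)
pot = export_pot(conn, "products", "name")
print(pot)
```

The resulting text can be handed to translators in the same tools they already use for the application's own strings.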

Format data, before or after inserting into database?

I can never decide if it's better to format data before inserting it into the DB, or when pulling it out.
I'm not talking about data sanitization; we all know to protect against SQL injection. I'm talking about if the user gives you a URL, and it doesn't have http:// in front of it, should you add that before inserting it into the DB or when pulling it out? What about more complex things, like formatting a big wad of text. Do I want to mark it up with HTML (or strip it down) before or after? What if I change my mind later and want to format it differently? I can't do this if I've already formatted it, but I can if I store it unformatted... but then I'm doing extra work every time I pull a piece of data out of the DB, which I could have done once and been done with it.
What are your thoughts?
From the answers, there seems to be a general consensus that things like URLs, phone numbers, and emails (anything with a well-defined format) should be normalized first to a consistent format. Things like text should generally be left raw or in a manipulable format for maximum flexibility. If speed is an issue, both formats may be stored.
I think it's best to make sure data in the database is in the most consistent format possible. You might have multiple apps using this data, so if you can make sure it's all the same format, you won't have to worry about reformatting different formats in every application.
Normalising URLs to a canonical form prior to insertion is probably okay; performing any kind of extensive formatting, e.g. HTML conversion/parsing etc. strikes me as a bad idea - always have the "rawest" data possible in your database, especially if you want to change the presentation format later.
In terms of avoiding unnecessary post-processing on every query, you might look into adopting object caching or similar techniques for the more expensive operations.
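A small sketch of the caching idea, assuming the raw text is what lives in the database and the expensive formatting happens on the way out. The render_comment function is a stand-in I made up (it only escapes HTML and wraps paragraphs), and functools.lru_cache is just one in-process option among many.

```python
import html
from functools import lru_cache


@lru_cache(maxsize=1024)
def render_comment(raw_text):
    """Stand-in for an expensive conversion of raw text to HTML."""
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    return "".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)


first = render_comment("Hello <world>\n\nSecond paragraph")
second = render_comment("Hello <world>\n\nSecond paragraph")  # served from cache
print(first)
print(render_comment.cache_info().hits)
```

Because the stored data stays raw, changing the presentation later only means changing render_comment and dropping the cache.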
You're asking two questions here.
Normalization should always be performed prior to the database insertion, e.g. if a column only has URLs then they should always be normalized first.
Regarding formatting, that's a view problem and not a model (in this case DB) problem.
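For the URL case from the question, normalizing before insertion might look roughly like this. The normalize_url helper and the choice of http as the default scheme are assumptions for the sketch, and "canonical" here only means a scheme is present, the host is lowercased and an empty path becomes "/".

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw, default_scheme="http"):
    """Return a consistent form of a user-supplied URL before storing it."""
    url = raw.strip()
    if "://" not in url:
        # The user left the scheme off, e.g. "example.com/page".
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)
    # Lowercase only the host[:port] part; any user:password@ prefix is kept.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path or "/", parts.query, parts.fragment)
    )


print(normalize_url("Example.com/Docs"))     # prints: http://example.com/Docs
print(normalize_url("HTTPS://Example.com"))  # prints: https://example.com/
```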
In my opinion, it should be formatted first. If you choose to do it at the time of retrieval instead of insertion, this can cause problems down the road when other applications/scripts want to use data out of the same database. They will all need to know how to clean up the data when they pull it out.
It depends.
If you are dealing with well-defined items (SSN, zip code, phone number), store them formatted. This does not necessarily mean including dashes, dots, etc.; it may mean removing them so everything is consistent.
You have to be very careful if you change data before you store it. You could always run into a situation where you need to echo back to the original user the exact text that they gave you.
My inclination is usually to store data in the most flexible form possible. For instance, numbers should be stored using integer or floating-point types, not strings, because you can do math with numeric types but not with strings (although it's easy enough to parse a string into a number that this is not a big deal). Perhaps a more practical example: dates/times should be stored using the database's actual date/time data type instead of strings. Also, maybe it's easier to convert HTML into plain text than vice versa, in which case you'd want to store your text as HTML. Or maybe even use a format like Markdown, which can be easily converted into either HTML or plain text.
It's the same reason vector graphics formats (SVG, EPS, etc.) exist: an SVG file is essentially a sequence of instructions specifying how to draw the image. It's easy to convert that into a bitmap image of any size, whereas if you only had a bitmap image to start with, you'd have a hard time changing its size (e.g. to create a thumbnail) without losing quality.
It is possible you might want to store both the formatted and unformatted versions of the data. For instance, let's use American phone numbers as an example. If you store one column with just the numbers and one column with the most frequently needed format, such as (111) 111-1111, then you can easily format to client specifications for the special cases or pull the most common one out quickly without lots of casting. This takes very little extra time at the time of insert (and can be accomplished with a calculated column so it always happens no matter where the data came from).
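Here is a sketch of the two-column idea for American phone numbers. The formatting is done in application code at insert time rather than with a calculated column, purely to keep the example portable; the contacts table and the helper names are invented.

```python
import re
import sqlite3


def phone_digits(raw):
    """Strip everything but digits and drop a leading US country code."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"not a 10-digit US number: {raw!r}")
    return digits


def phone_display(digits):
    """Format ten digits in the most commonly needed style."""
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (phone_digits TEXT NOT NULL, phone_display TEXT NOT NULL)"
)


def add_contact(raw_phone):
    """Store both the bare digits and the pre-formatted version."""
    digits = phone_digits(raw_phone)
    conn.execute("INSERT INTO contacts VALUES (?, ?)", (digits, phone_display(digits)))


add_contact("111.111.1111")
print(conn.execute("SELECT * FROM contacts").fetchall())
# prints: [('1111111111', '(111) 111-1111')]
```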
Data should be scrubbed before being put in the database so that invalid dates, nonnumeric data, etc. are never placed in the field. Email is one field that people often put junk into for some reason. If it doesn't have an @ sign, it shouldn't be stored. This is especially true if you actually send emails through your application(s) using that field. It is a waste of time to try to send an email to 'contact his secretary' or 'aol.com', if you see what I mean.
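A coarse check along those lines could be as small as the function below. It is deliberately not a full address validator, just a filter for obvious junk, and looks_like_email is a name I made up.

```python
def looks_like_email(value):
    """Reject values that cannot plausibly be an email address."""
    value = value.strip()
    if " " in value or value.count("@") != 1:
        return False
    local, domain = value.split("@")
    # Require something before the @ and a dot inside the domain.
    return bool(local) and "." in domain.strip(".")


for candidate in ["jane@example.com", "contact his secretary", "aol.com"]:
    print(candidate, looks_like_email(candidate))
```

Anything that fails the check gets rejected at the form or API layer, before it ever reaches the column.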
If the format will be consistently needed, it is better to convert the data to that format once on insert or update and not have to convert it ever again. If the standard format changes, you will need to update the column for all existing records at that time, then use the new format going forward. If you have frequent changes of format and large tables, or if different applications use different formats, it might be best to store the data unformatted.

Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

We are evaluating technologies that we'll use to store data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20 MB per TU.
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding on this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are strictly hierarchical, so a unique path describes each one's location (in filesystems without hard links, at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute (or anywhere else as a string; you could have lookup tables galore if you wanted).
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know a project that used to write both to an Oracle database and HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.
