I'm working on an application for which I need to be able to store EXIF metadata in a relational database. In the future I'd like to also support XMP and IPTC metadata, but at the moment my focus is on EXIF.
There are a few questions on Stack Overflow about what the table structure should look like when storing EXIF metadata. However, none of them really address my concern. The problem I have is that different EXIF tags have values in different formats, and there isn't really one column type which conveniently stores them all.
The most common type is a "rational", which is an array of two four-byte integers representing a fraction. But there are also non-fractional short and long integers, ASCII strings, byte arrays, and "undefined" (an 8-bit type which must be interpreted according to a priori knowledge of the specific tag). I'd like to support all of these types, and I want to do so in a convenient, efficient, lossless (i.e. without converting the rationals to floats), extensible and searchable manner.
Here's what I've considered so far:
My current solution is to store everything as a string. This makes it pretty easy to store all of the different types, and is also convenient for searching and debugging. However, it's kind of clunky and inefficient because when I want to actually use the data, I have to do a bunch of string manipulation to convert the rational values into their fractional equivalents, e.g. fraction = float(value.split('/')[0]) / float(value.split('/')[1]). (It's not actually a messy one-liner like that in my real code, but this demonstrates the problem.)
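To make that concrete, here's roughly the kind of helper I end up writing (a simplified sketch, not my real code; it assumes the stored string always looks like "numerator/denominator" and uses Python's fractions module so the value stays exact):

from fractions import Fraction

def parse_rational(value):
    # stored string like "1/250" -> Fraction(1, 250), no float rounding
    numerator, denominator = value.split('/')
    return Fraction(int(numerator), int(denominator))

exposure_time = parse_rational("1/250")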
I could grab the raw EXIF bytes for each value from the file and store them in a blob column, but then I'd have to reinterpret the raw bytes every time. This could be marginally more CPU-efficient than the string solution, but it's much, much worse in every other way - on the whole, not worth it.
I could have a different table for each different EXIF datatype. Using this pattern I can maintain my foreign key relationships while storing my values in several different tables. However, this will make my most common query, which is to select all EXIF metadata for a given photo, kind of nasty. It will also become unwieldy very quickly when I add support for other metadata formats.
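To sketch what I mean (simplified, with made-up table and column names, and sqlite3 only for illustration), the per-type layout and the resulting "all metadata for one photo" query would look something like this:

import sqlite3

conn = sqlite3.connect("photos.db")
conn.executescript("""
CREATE TABLE exif_rational (photo_id INTEGER, tag TEXT, numerator INTEGER, denominator INTEGER);
CREATE TABLE exif_integer  (photo_id INTEGER, tag TEXT, value INTEGER);
CREATE TABLE exif_text     (photo_id INTEGER, tag TEXT, value TEXT);
""")

# fetching everything for one photo already needs a UNION over the type tables
rows = conn.execute("""
SELECT tag, numerator || '/' || denominator AS value FROM exif_rational WHERE photo_id = ?
UNION ALL SELECT tag, CAST(value AS TEXT) FROM exif_integer WHERE photo_id = ?
UNION ALL SELECT tag, value FROM exif_text WHERE photo_id = ?
""", (1, 1, 1)).fetchall()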
I'm not a database expert by any means, so is there some pattern or magic union-style column type I'm missing that can make this problem go away? Or am I stuck picking my poison from among the three options above?
This is probably a very cheap solution, but I would personally just store the JSON or something like that within the database.
There is a cool way to extract EXIF data and parse it to JSON.
Here is the link: Img2JSON
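If you don't want an extra tool for it, something roughly like this with Pillow produces a JSON string you can drop into a text column (just a sketch; the default=str fallback stringifies rationals and bytes, so you'd want to special-case those if you need the exact values):

import json
from PIL import Image, ExifTags

def exif_to_json(path):
    exif = Image.open(path).getexif()
    # translate numeric tag ids to readable names where Pillow knows them
    named = {ExifTags.TAGS.get(tag_id, str(tag_id)): value for tag_id, value in exif.items()}
    return json.dumps(named, default=str)

payload = exif_to_json("photo.jpg")  # store this string in the database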
I hope this kind of helps you!
Suppose I have a list of R objects which are themselves lists. Each list has a defined structure: data, a model which fits the data, and some attributes for identifying the data. One example would be time series of certain economic indicators in particular countries. So my list object has the following elements:
data - the historical time series for economic indicator
country - the name of the country, USA for example
name - the indicator name, GDP for example
model - ARIMA orders found by auto.arima, in a suitable format; this again may be a list.
This is just an example. As I said, suppose I have a number of such objects combined into a list. I would like to save it in some suitable format. The obvious solution is simply to use save, but this does not scale very well for a large number of objects. For example, if I only wanted to inspect a subset of objects, I would need to load all of them into memory.
If my data were a data.frame I could save it to a database. If I wanted to work with a particular subset of the data I would use SELECT and rely on the database to deliver the required subset. SQLite has served me well in this regard. Is it possible to replicate this for my described list object with some fancy database like MongoDB? Or should I simply think about how to convert my list into several related tables?
My motivation for this is to be able to easily generate various reports on the fitted models. I can write a bunch of functions which produce some report on a given object and then just use lapply on my list of objects. Ideally I would like to parallelise this process, but that is another problem.
I think I explained the basics of this somewhere once before---the gist of it is that
R has complete serialization and deserialization support built in, so you can in fact take any existing R object and turn it into either a binary or textual serialization. My digest package uses that to turn the serialization into a hash using different hash functions.
R has all the db connectivity you need.
Now, what a suitable format and db schema is ... will depend on your specifics. But there is (as usual) nothing in R stopping you :)
This question has been inactive for a long time. Since I had a similar concern recently, I want to add the pieces of information that I've found out. I recognise these three demands in the question:
to have the data stored in a suitable structure
scalability in terms of size and access time
the possibility to efficiently read only subsets of the data
Besides the option to use a relational database, one can also use the HDF5 file format, which is designed to store a large number of possibly large objects. The choice depends on the type of data and the intended way to access it.
Relational databases should be favoured if:
the atomic data items are small-sized
the different data items possess the same structure
it is not known in advance in which subsets the data will be read out
convenient transfer of the data from one computer to another is not an issue, or the computers where the data is needed have access to the database.
The HDF5 format should be preferred if:
the atomic data items are themselves large objects (e.g. matrices)
the data items are heterogeneous; it is not possible to combine them into a table-like representation
most of the time the data is read out in groups which are known in advance
moving the data from one computer to another should not require much effort
Furthermore, one can distinguish between relational and hierarchical relationships, where the latter is contained in the former. Within an HDF5 file, the information chunks can be arranged in a hierarchical way, e.g.:
/Germany/GDP/model/...
/Germany/GNP/data
/Austria/GNP/model/...
/Austria/GDP/data
The rhdf5 package for handling HDF5 files is available on Bioconductor. General information on the HDF5 format is available here.
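To illustrate the layout above, here is the same idea written with Python's h5py (the rhdf5 calls in R are analogous; the group names and numbers are only placeholders):

import h5py
import numpy as np

with h5py.File("indicators.h5", "w") as f:
    # intermediate groups (/Germany, /Germany/GDP, ...) are created automatically
    f.create_dataset("Germany/GDP/data", data=np.array([3.1, 2.9, 3.4]))
    f.create_dataset("Austria/GDP/data", data=np.array([0.9, 1.1, 1.0]))

with h5py.File("indicators.h5", "r") as f:
    # read just the subset you need, without loading everything into memory
    austria_gdp = f["Austria/GDP/data"][:]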
Not sure if it is the same, but I had some good experience with time series objects with:
str()
Maybe you can look into that.
We're in a bit of internal conflict on this issue, and can't seem to come to a happy conclusion.
We'll only be storing latitudes and longitudes, and possibly simple polygons. All we need it for is computing the distance between two points (and possibly checking whether a point is within a polygon), and the entirety of the data is in such close proximity that planar estimations are acceptable.
Since our requirements are so relaxed, half of the dev team suggests using SqlGeometry types, which are apparently simpler. I'm having trouble accepting this, though, since we're storing geographic data, so storing it in SqlGeography seems like the right thing to do. Also, I'm not finding any substantive evidence that the SqlGeometry data type is that much easier to work with than the SqlGeography type.
Does anyone have advice as to which type would be more appropriate for this relatively simple scenario?
It's not a question of comparing features, or accuracy, or simplicity - the two spatial datatypes are for working with different sorts of data.
As an analogy, suppose you were choosing the best datatype for a column that contained a unique identifier for each row. If that UID only contained integer values, you'd use int, whereas if it was a 6-character alphanumeric value you'd use char(6). And if it had variable-length unicode values, you'd use nvarchar instead, right?
The same logic goes for spatial data - you choose the appropriate datatype based on the values that that column contains; if you're working with geographic (i.e. latitude/longitude) coordinates, use the SqlGeography datatype. It's that simple.
You can use SqlGeometry to store latitude/longitude values, but it would be like using nvarchar(max) to store an integer... and I promise you it will lead to further problems down the line (when all your area calculations come out measured in degrees squared, for example).
The SqlGeography type has fewer methods available than SqlGeometry (especially in Sql 2008).
SqlGeography reference
SqlGeometry reference
For example, suppose you want to get the centroid of a polygon in Sql2008. You have a native method for that in geometry, but not in geography.
Also, it has the following limitations:
You can't have a geography exceeding one hemisphere
The ring-order matters when creating the polygon
Also, most APIs and libraries available (that I know of) handle geometries better than geographies.
That said, if the distance calculation has to be precise, the distances are large, and the coordinates span the whole world, geography would probably be a better fit. Otherwise, and according to your description of the problem, you would be well served by the geometry type.
Regarding your question "is it that much easier to work with?": it depends. As a rule of thumb, for simple scenarios I typically opt for SqlGeometry.
Anyway, IMHO you shouldn't worry too much about that decision. It's relatively easy to create a new column with the other type and migrate the data if necessary.
Four years later, it has become apparent that we should have stored the data in SqlGeometry instead of SqlGeography.
Why?
We were importing information from legislative district maps, and their data was stored in SqlGeometry. When determining if a particular lat/long was within a certain legislative district boundary, we'd get inconsistent results when the point was close to two boundaries.
This required us to do additional work to identify locations that were "close" to a boundary, and manually verify that they were assigned to the proper district. Not ideal.
Moral of the story: if you're relying on any data, consider what type it's stored in to help guide your decision.
My site requires users to input a great deal of mathematical and scientific notation. Naturally, I need a way to store this server-side for future use. This is my ideal course of action: a user inputs a chunk of text, with or without mathematical symbols; the input is submitted; it is then turned into a format that is suitable for a database; and finally, on every subsequent viewing of the inputted math, it gets converted to MathML. The MathML is converted to a readable format by MathJax.
I have never worked with storing math in a database before, and the fact that I want to allow users to be able to insert math inline with text creates a few implications. Any ideas?
The desire to normalize this kind of thing is understandable. It is also wrong.
The only thing you should "expand" into more semantic types for the use of a relational database is anything that would appear in a where clause.
If it's reasonable for you to want to run a query that looks like (pseudocode:)
SELECT * FROM FORMULAE
WHERE "/theta" in free_vars(FORMULAE.FORMULA)
AND "bob" in FORMULAE.USERS;
Then you want to use a data type that makes such a query efficient. If, on the other hand, the database will not be doing any kind of processing on the stored formula, then don't try to convert it to or from anything except the most useful external representation, the markup that properly represents the formula.
Store the MathML as text. If it does not fit in a VARCHAR, then use a CLOB.
I would recommend storing it as MathML text in the database. That way, at insertion time, you convert it to MathML once, then each time it is accessed, there's no need to convert it (ie: faster).
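A minimal sketch of that convert-once flow (SQLite and made-up column names, purely for illustration):

import sqlite3

conn = sqlite3.connect("formulae.db")
conn.execute("CREATE TABLE IF NOT EXISTS formulae (id INTEGER PRIMARY KEY, mathml TEXT)")

def save_formula(mathml_markup):
    # the user's input is converted to MathML once, before this call
    conn.execute("INSERT INTO formulae (mathml) VALUES (?)", (mathml_markup,))
    conn.commit()

def load_formula(formula_id):
    # the stored markup goes straight to MathJax on the client, no re-conversion
    return conn.execute("SELECT mathml FROM formulae WHERE id = ?", (formula_id,)).fetchone()[0]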
I need to find a relatively robust method of storing data of varying types in a single column of a database table. The data may represent a single value or multiple values and may contain any of a long list of characters (too long to enumerate easily). I'm wondering what approaches might work here. I'd toyed with the idea of adding some form of separator, but I'm worried that any simple separator or combination might occur naturally in the data. I'd also like to avoid XML or similar snippets, since the data could itself be XML. Arguably I could encode the XML, but that still seems fragile.
I realize this is naturally a bit of an opinion question, but I lack the mojo to make it community.
Edit for Clarification:
Background for the problem: the column will hold data that is then used to make an evaluation based on another column. Functionally, it's the test criteria for a decision engine. Other columns hold the evaluation's nature and the source of the value to test. The data doesn't need to be searchable.
Does the data need to be searchable? If not, slap it in a varbinary(MAX) and have a field to assist in deserialization.
Incidentally, though; using the right XML API, there should be no trouble storing XML inside an XML node.
But my guess is there has to be a better way to do this... it seems... ugh!
JSON format, though I agree with djacobson, your question is like asking for the best way to saw a 2x4 in half with a teaspoon.
EDIT: The order in which data are stored in the JSON string is irrelevant; each datum is stored as a key-value pair.
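For example, a rough sketch of what could go into the column (the "type" key is the deserialization hint; the field names and values are made up):

import json

# the criteria column holds one of these strings; "type" tells the
# decision engine how to interpret "value" for that row
single = json.dumps({"type": "number", "value": 42})
multi = json.dumps({"type": "list", "value": ["A", "B", "C"]})
xmlish = json.dumps({"type": "xml", "value": "<threshold max='10'/>"})

criterion = json.loads(multi)
if criterion["type"] == "list":
    matched = "B" in criterion["value"]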
There's not a "good" way to do this. There is a reason that data types exist in SQL.
The only conceivable way I can think of to make it close is to make your column a lookup column, which refers to a GUID or ID in another table, which itself has additional columns indicating which table and row have your data.
I can never decide if it's better to format data before inserting it into the DB, or when pulling it out.
I'm not talking about data sanitization; we all know to protect against SQL injection. I'm talking about things like: if the user gives you a URL and it doesn't have http:// in front of it, should you add that before inserting it into the DB or when pulling it out? What about more complex things, like formatting a big wad of text? Do I want to mark it up with HTML (or strip it down) before or after? What if I change my mind later and want to format it differently? I can't do this if I've already formatted it, but I can if I store it unformatted... but then I'm doing extra work every time I pull a piece of data out of the DB, work which I could have done once and been done with.
What are your thoughts?
From the answers, there seems to be a general consensus that things like URLs, phone numbers, and emails (anything with a well-defined format) should be normalized first to a consistent format. Things like text should generally be left raw or in a manipulable format for maximum flexibility. If speed is an issue, both formats may be stored.
I think it's best to make sure data in the database is in the most consistent format possible. You might have multiple apps using this data, so if you can make sure it's all the same format, you won't have to worry about reformatting different formats in every application.
Normalising URLs to a canonical form prior to insertion is probably okay; performing any kind of extensive formatting, e.g. HTML conversion/parsing etc. strikes me as a bad idea - always have the "rawest" data possible in your database, especially if you want to change the presentation format later.
In terms of avoiding unnecessary post-processing on every query, you might look into adopting object caching or similar techniques for the more expensive operations.
You're asking two questions here.
Normalization should always be performed prior to the database insertion, e.g. if a column only has URLs then they should always be normalized first.
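For the URL example from the question, that pre-insert normalization might be as small as (a sketch):

from urllib.parse import urlparse

def normalize_url(raw):
    raw = raw.strip()
    # add the scheme the user left off so every stored URL has one
    if not urlparse(raw).scheme:
        raw = "http://" + raw
    return raw

normalize_url("example.com/page")  # -> "http://example.com/page"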
Regarding formatting, that's a view problem and not a model (in this case, DB) problem.
In my opinion, it should be formatted first. If you choose to do it at the time of retrieval instead of insertion, this can cause problems down the road when other applications/scripts want to use data out of the same database. They will all need to know how to clean up the data when they pull it out.
It depends.
If you are dealing with well-defined items such as SSNs, zip codes, or phone numbers, store them formatted (this does not necessarily mean including dashes or dots, etc.; it may mean removing them so everything is consistent).
You have to be very careful if you change data before you store it. You could always run into a situation where you need to echo back to the original user the exact text that they gave you.
My inclination is usually to store data in the most flexible form possible. For instance, numbers should be stored using integer or floating-point types, not strings, because you can do math with numeric types but not with strings (although it's easy enough to parse a number out of a string that this is not a big deal). Perhaps a more practical example: dates/times should be stored using the database's actual date/time data type instead of strings. Also, maybe it's easier to convert HTML into plain text than vice versa, in which case you'd want to store your text as HTML. Or maybe even use a format like Markdown, which can be easily converted into either HTML or plain text.
It's the same reason vector graphics formats (SVG, EPS, etc.) exist: an SVG file is essentially a sequence of instructions specifying how to draw the image. It's easy to convert that into a bitmap image of any size, whereas if you only had a bitmap image to start with, you'd have a hard time changing its size (e.g. to create a thumbnail) without losing quality.
It is possible you might want to store both the formatted and unformatted versions of the data. For instance, let's use American phone numbers as an example. If you store one column with just the numbers and one column with the most frequently needed format, such as (111) 111-1111, then you can easily format to client specifications for the special cases or pull the most common one out quickly without lots of casting. This takes very little extra time at the time of insert (and can be accomplished with a calculated column so it always happens no matter where the data came from).
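A sketch of that dual representation for the phone number example (it assumes a plain 10-digit US number; the display format is whatever your most common case happens to be):

import re

def phone_columns(raw):
    digits = re.sub(r"\D", "", raw)  # strip everything but digits
    formatted = "({}) {}-{}".format(digits[:3], digits[3:6], digits[6:])
    return digits, formatted  # store both columns

phone_columns("111.111.1111")  # -> ("1111111111", "(111) 111-1111")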
Data should be scrubbed before being put in the database so that invalid dates, nonnumeric data, etc. aren't ever placed in the field. Email is one field that people often put junk into for some reason. If it doesn't have an @ sign, it shouldn't be stored. This is especially true if you actually send emails through your application(s) using that field. It is a waste of time to try to send an email to 'contact his secretary' or 'aol.com', if you see what I mean.
If the format will be consistently needed, it is better to convert the data to that format once, on insert or update, and not have to convert it ever again. If the standard format changes, you will need to update the column for all existing records at that time, then use the new format going forward. If you have frequent changes of format and large tables, or if different applications use different formats, it might be best to store the data unformatted.