I can never decide if it's better to format data before inserting it into the DB, or when pulling it out.
I'm not talking about data sanitization; we all know to protect against SQL injection. I'm talking about if the user gives you a URL, and it doesn't have http:// in front of it, should you add that before inserting it into the DB or when pulling it out? What about more complex things, like formatting a big wad of text. Do I want to mark it up with HTML (or strip it down) before or after? What if I change my mind later and want to format it differently? I can't do this if I've already formatted it, but I can if I store it unformatted... but then I'm doing extra work every time I pull a piece of data out of the DB, which I could have done once and been done with it.
What are your thoughts?
From the answers, there seems to be a general consensus that things like URLs, phone numbers, and emails (anything with a well-defined format) should be normalized first to a consistent format. Things like text should generally be left raw or in a manipulable format for maximum flexibility. If speed is an issue, both formats may be stored.
I think it's best to make sure data in the database is in the most consistent format possible. You might have multiple apps using this data, so if you can make sure it's all the same format, you won't have to worry about reformatting different formats in every application.
Normalising URLs to a canonical form prior to insertion is probably okay; performing any kind of extensive formatting, e.g. HTML conversion/parsing etc. strikes me as a bad idea - always have the "rawest" data possible in your database, especially if you want to change the presentation format later.
In terms of avoiding unnecessary post-processing on every query, you might look into adopting object caching or similar techniques for the more expensive operations.
You're asking two questions here.
Normalization should always be performed prior to the database insertion, e.g. if a column only has URLs then they should always be normalized first.
Regarding formating, that's a view problem and not a model (in this case DB) problem.
In my opinion, it should be formatted first. If you choose to do it at the time of retrieval instead of insertion, this can cause problems down the road when other applications/scripts want to use data out of the same database. They will all need to know how to clean up the data when they pull it out.
depends
if you are doing well defined items, SSN, zip code, phone number, store it formatted (this does not necessarily mean to include dashes or dots, etc. it may mean removing them so everyhting is consistent.
You have to be very careful if you change data before you store it. You could always run into a situation where you need to echo back to the original user the exact text that they gave you.
My inclination is usually to store data in the most flexible form possible. For instance, numbers should be stored using integer or floating-point types, not strings, because you can do math with numeric types but not with strings (although it's easy enough to parse a number into a string that this is not a big deal). Perhaps a more practical example: dates/times should be stored using the database's actual date/time data type instead of strings. Also, maybe it's easier to convert HTML into plain text than vice versa, in which case you'd want to store your text as HTML. Or maybe even using a format like Markdown which can be easily converted into either HTML or plain text.
It's the same reason vector graphics formats (SVG, EPS, etc.) exist: an SVG file is essentially a sequence of instructions specifying how to draw the image. It's easy to convert that into a bitmap image of any size, whereas if you only had a bitmap image to start with, you'd have a hard time changing its size (e.g. to create a thumbnail) without losing quality.
It is possible you might want to store both the formatted and unformatted versions of the data. For instance, let's use American phone numbers as an example. If you store one column with just the numbers and one column with the most frequently needed format, such as (111) 111-1111, then you can easily format to client specifications for the special cases or pull the most common one out quickly without lots of casting. This takes very little extra time at the time of insert (and can be accomplished with a calculated column so it always happens no matter where the data came from).
Data should be scrubbed before being put in the database so that invalid dates or nonnumeric data etc aren't ever placed in the field. Email is one field that people often put junk into for some reason. If it doesn't have an # sign, it shouldn't be stored. This is especially true if you actually send emails thorugh your application(s) using that field. It is a waste of time to try to send an email to 'contact his secretary' or 'aol.com' if you see what I mean.
If the format will be consistently needed, it is better to convert the data to that format once on insert or update and not have to convert it ever again. If the standard format changes, you will need to update the column for all existing records at that time, then use the new format going forth. If you have frequent changes of format and large tables or if differnt applications use different formats, it might be best to store unformatted.
Related
I'm working on an application for which I need to be able to store EXIF metadata in a relational database. In the future I'd like to also support XMP and IPTC metadata, but at the moment my focus is on EXIF.
There are a few questions on Stack Overflow about what the table structure should look like when storing EXIF metadata. However, none of them really address my concern. The problem I have is that different EXIF tags have values in different formats, and there isn't really one column type which conveniently stores them all.
The most common type is a "rational" which is an array of two four-byte integers representing a fraction. But there are also non-fractional short and long integers, ASCII strings, byte arrays, and "undefined" (an 8-bit type which must be interpreted according to a priori knowledge of the specific tag.) I'd like to support all of these types, and I want to do so in a convenient, efficient, lossless (i.e. without converting the rationals to floats), extensible and searchable manner.
Here's what I've considered so far:
My current solution is to store everything as a string. This makes it pretty easy to store all of the different types, and is also convenient for searching and debugging. However, it's kind of clunky and inefficient because when I want to actually use the data, I have to do a bunch of string manipulation to convert the rational values into their fractional equivalents, e.g. fraction = float(value.split('/')[0]) / float(value.split('/')[1]). (It's not actually a messy one-liner like that in my real code, but this demonstrates the problem.)
I could grab the raw EXIF bytes for each value from the file and store them in a blob column, but then I'd have to reinterpret the raw bytes every time. This could be marginally more CPU-efficient than the string solution, but it's much, much worse in every other way - on the whole, not worth it.
I could have a different table for each different EXIF datatype. Using this pattern I can maintain my foreign key relationships while storing my values in several different tables. However, this will make my most common query, which is to select all EXIF metadata for a given photo, kind of nasty. It will also become unwieldy very quickly when I add support for other metadata formats.
I'm not a database expert by any means, so there some pattern or magic union-style column type I'm missing that can make this problem go away? Or am I stuck picking my poison from among the three options above?
This is probably a very cheap solution, but I would personally just store the json or something like that within the database.
There is a cool way to extract EXIF data and parse it to json.
Here is the link: Img2JSON
I hope this kind of helps you!
My site requires users to input a great deal of mathematical and scientific notation. Naturally, I need a way to store this server-side for future use. This is my ideal course of action: a user inputs a chunk of text, potentially with or without mathematical symbols, the input is submitted, it is then turned into a format that is suitable for a database, and finally, on every subsequent viewing of the inputted math, it should get converted to MathML. The MathML is converted to a readable format by MathJax.
I have never worked with storing math in a database before, and the fact that I want to allow users to be able to insert math inline with text creates a few implications. Any ideas?
The desire to normalize this kind of thing is understandable. It is also wrong.
The only thing you should "expand" into more semantic types for the use of a relational database is anything that would appear in a where clause.
If it's reasonable for you to want to run a query that looks like (pseudocode:)
SELECT * FROM FORMULAE
WHERE "/theta" in free_vars(FORMULAE.FORMULA)
AND "bob" in FORMULAE.USERS;
Then you want to use a data type that makes such a query efficient. If, on the other hand, the database will not be doing any kind of processing on the stored formula, then don't try to convert it to or from anything except the most useful external representation, the markup that properly represents the formula.
store the MathML as text. if it does not fit in a VARCHAR, then use a CLOB
I would recommend storing it as MathML text in the database. That way, at insertion time, you convert it to MathML once, then each time it is accessed, there's no need to convert it (ie: faster).
I need to find a relatively robust method of storing variable types data in a single column of a database table. The data may represent a single value or multiple values and may any of a long list of characters (too long to enumerate easily). I'm wondering what approaches might work in this process. I'd toyed with the ideas of adding some form of separator, but I'm worried that any simple separator or combination might occur naturally in the data. I'd also like to avoid XML or snippets since in fact the data could be XML. Arguably I could encode the XML, but that still seems fragile.
I realize this is naturally a bit of an opinion question, but I lack the mojo to make it community.
Edit for Clarification:
Background for the problem: the column will hold data that is then used to make a evaluation based on another column. Functionally it's the test criteria for a decision engine. Other columns hold the evaluation's nature and the source of the value to test. The data doesn't need to be searchable.
Does the data need to be searchable? If not, slap it in a varbinary(MAX) and have a field to assist in deserialization.
Incidentally, though; using the right XML API, there should be no trouble storing XML inside an XML node.
But my guess is there has to be a better way to do this... it seems... ugh!
JSON format, though I agree with djacobson, your question is like asking for the best way to saw a 2x4 in half with a teaspoon.
EDIT: The order in which data are stored in the JSON string is irrelevant; each datum is stored as a key-value pair.
There's not a "good" way to do this. There is a reason that data types exist in SQL.
The only conceivable way I can think of to make it close is to make your column a lookup column, which refers to a GUID or ID in another table, which itself has additional columns indicating which table and row have your data.
The question title is probably not correct because part of my question is to try and get some more understanding on the problem.
I am looking for the advantages of making sure data that is imported to a database (simple example: Excel table to Access database) should be given using the same schema and should also be valid to the business requirements.
I have an Excel table containing none normalised data and an Access database with normalised tables.
The Excel table comes from multiple third parties, none of which stick to the same format as each other or the database.
Some of the sources also do not supply all the relevant data.
Example of what could be supplied
contact_key, date, contact_title, reject_name, reject_cost, count_of_unique_contact
count_of_unique_contact is derived from distinct contact_title's and should not be imported.
contact_key is sometimes not supplied.
title is sometimes unknown and passed in as such "n/a", "name = ??1342", "#N/A" etc. rather random.
reject_name is often miss spelled.
the fields are sometimes not even supplied, e.g. date and contact_key are missing.
I am trying to find information to help explain the issues with the above.
Issues only related to incorrect data or fields making it difficult to have useful data in the database such as not being able to report a trend on reject costs in a month when the date is not supplied. Normalising the excel file is not an option available to me.
Requesting the values and fields in the Excel files to match the business requirements and the format to be the same for every third party that sends them is what I want to do but the request is falling on deaf ears.
I want to explain to the client that inputting fake data and checking for invalid/existing rejects/contacts all the time is wrong and doing it is going to fail or at the best be difficult without constant maintenance of a poor system.
Does anyone have any information on this problem?
Thanks
This is a common problem; this gets referred to in data processing circles as "garbage in, garbage out". Essentially, what you're running up against is that the data as given is of poor quality; you're correct to recognize that the problem is that it will be hard (if not impossible) to use this data to extract any useful information.
To some extent, this is a problem that should be fixed at the source; whatever your source of your data is, they need to be convinced that the data quality must improve. In the short term, you can sanitize your data; the term refers to removing or cleaning the bad entries to make the remainder of the data (the "good" data) importable into your database. Depending on just what percentage of your data is bad, you may or may not be able to do useful things with the sanitized data once you import it.
At some point, since you're not getting traction with management about the quality of the data, you will simply have to show them that the system is not working as intended because the quality of the data is bad. They'll need to improve their processes at that point to improve the quality of the data you get in at that point. Until then, though, keep pressing for better data; investigate the process of sanitizing the data and see what you can do with the remaining data. Good luck!
I have a variety of time-series data stored on a more-or-less georeferenced grid, e.g. one value per 0.2 degrees of latitude and longitude. Currently the data are stored in text files, so at day-of-year 251 you might see:
251
12.76 12.55 12.55 12.34 [etc., 200 more values...]
13.02 12.95 12.70 12.40 [etc., 200 more values...]
[etc., 250 more lines]
252
[etc., etc.]
I'd like to raise the level of abstraction, improve performance, and reduce fragility (for example, the current code can't insert a day between two existing ones!). We'd messed around with BLOB-y RDBMS hacks and even replicating each line of the text file format as a row in a table (one row per timestamp/latitude pair, one column per longitude increment -- yecch!).
We could go to a "real" geodatabase, but the overhead of tagging each individual value with a lat and long seems prohibitive. The size and resolution of the data haven't changed in ten years and are unlikely to do so.
I've been noodling around with putting everything in NetCDF files, but think we need to get past the file mindset entirely -- I hate that all my software has to figure out filenames from dates, deal with multiple files for multiple years, etc.. The alternative, putting all ten years' (and counting) data into a single file, doesn't seem workable either.
Any bright ideas or products?
I've assembled your comments here:
I'd like to do all this "w/o writing my own file I/O code"
I need access from "Java Ruby MATLAB" and "FORTRAN routines"
When you add these up, you definitely don't want a new file format. Stick with the one you've got.
If we can get you to relax your first requirement - ie, if you'd be willing to write your own file I/O code, then there are some interesting options for you. I'd write C++ classes, and I'd use something like SWIG to make your new classes available to the multiple languages you need. (But I'm not sure you'd be able to use SWIG to give you access from Java, Ruby, MATLAB and FORTRAN. You might need something else. Not really sure how to do it, myself.)
You also said, "Actually, if I have to have files, I prefer text because then I can just go in and hand-edit when necessary."
My belief is that this is a misguided statement. If you'd be willing to make your own file I/O routines then there are very clever things you could do... And as an ultimate fallback, you could give yourself a tool that converts from the new file format to the same old text format you're used to... And another tool that converts back. I'll come back to this at the end of my post...
You said something that I want to address:
"leverage 40 yrs of DB optimization"
Databases are meant for relational data, not raster data. You will not leverage anyone's DB optimizations with this kind of data. You might be able to cram your data into a DB, but that's hardly the same thing.
Here's the most useful thing I can tell you, based on everything you've told us. You said this:
"I am more interested in optimizing my time than the CPU's, though exec speed is good!"
This is frankly going to require TOOLS. Stop thinking of it as a text file. Start thinking of the common tasks you do, and write small tools - in WHATEVER LANGAUGE(S) - to make those things TRIVIAL to do.
And if your tools turn out to have lousy performance? Guess what - it's because your flat text file is a cruddy format. But that's just my opinion. :)
I'd definitely change from text to binary but keep each day in a separate file still. You could name them in such a way that insertions in between don't cause any strangeness with indices, such as by including the date and possible time in the filename. You could also consider the file structure if you have several fields per location for example. Is it common to look for a small tile from a large number of timesteps? In that case you might want to store them as tiles containing data from several days. You didn't mention how the data is accessed which plays a big role in how to organise it efficiently.
Clarifications:
I'm surprised you added "database" as one of the tags, and considered it as an option. Why did you do this?
Essentially, you have a 2D, single component floating point image at every time step. Would you agree with this way of viewing your data?
You also mentioned the desire to insert a day between two existing ones - which seems to be a very odd thing to do. Why would you need to do that? Is there a new day between May 4 and May 5 that I don't know about?
Is "compression" one of the things you care about, or are you just sick of flat files?
Would a float or a double be sufficient to store your data, or do you feel you need more arbitrary precision?
Also, what programming language(s) do you want to access this data with?
your answer on how to store the data depends entirely on what you're going to do with the data. for example, if you only ever need to retrieve by specifying the date or a date range, then storing in a database as a BLOB makes some sense. but if you need to find records that have certain values, you'll need to do something different.
please describe how you need to be able to access the data/
Matt, thanks very much, and likewise longneck and jirv.
This post was partly an experiment, testing the quality of stackoverflow discourse. If you guys/gals/alien lifeforms are representative, I'm sold.
And on point, you've clarified my thinking considerably. Mind, I still might not necessarily implement your advice, but know that I will be thinking about it very seriously. >;-)
I may very well leave the file format the same, add to the extant C and/or Ruby routines to tack on the few low-level features I lack (e.g. inserting missing timesteps), and hang an HTTP front end on the whole thing so that the data can be consumed by whatever box needs it, in whatever language is currently hoopy. While it's mostly unchanging legacy software that construct these data, we're always coming up with new consumers for it, so the multi-language/multi-computer requirement (gee, did I forget that one?) applies to the reading side, not the writing side. That also obviates a whole slew of security issues.
Thanks again, folks.