Technological solutions for extremely long term data archiving? - sql-server

Are there any good technical solutions for extremely long term archiving of data, for example for 25 to 100 years?
Somehow I just don't have a lot of confidence that a SQL 2000 backup file will be usable in court cases or for historians in 25 to 100 years.
This is a customer requirement, not just speculation.
This is comparable to trying to do something useful with a backup from an ENIAC or reading Atari Writer word-processing files. The hardware doesn't necessarily exist anymore, the storage media is likely corrupt, the professionals who knew how to use the technology probably don't exist anymore, etc.

Actually, printing on acid-free paper is probably a much better solution than any more advanced technological one. It is much more likely that the IT of 100 years from now will be able to high-speed scan and load printed pages than read digital data that depends on 100-year-old media-access hardware, 100-year-old disk and file format standards, and 100-year-old data encoding standards.
Disagree? I've got a whole attic full of vinyl records, 8-tracks, cassette tapes, and floppy disks (4 different densities!) that argue otherwise. And they are only 20 years old! (OK, the 8-tracks are closer to 30.)
The fact is that there is only one data storage and archiving technology that has ever withstood the test of time over 100 years or more and still been cost-effectively retrievable, and that's writing/printing on physical media.
My advice? Don't trust any archival strategy until it's been tested, and there's only one that has passed the 100-year test so far.

You'll need to convert to text - perhaps XML.
Then upload it to the cloud, make archival copies etc.
I think you need to pick a multi-modal approach.
If you have the budget: http://www.archives.gov/era/papers/thic-04.html

<joke>Print it.</joke>

Script the data into flat files (either one file per table, or summarize multiple tables into a file) and write them to high-end archival CDs. In 100 years they will have to load this data into whatever "database" they have, so some manual conversion will be necessary; a schema script dumped into a single file would help the poor person trying to read these files and make the proper joins.
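For what it's worth, here is a minimal sketch of that kind of dump, using Python's sqlite3 module as a stand-in for whatever RDBMS is actually in use; the database name and output file names are just placeholders:

# Dump every table to its own plain-text CSV file, plus one schema file,
# so a future reader only needs a text editor to make sense of the archive.
# Swap sqlite3 for pyodbc (or similar) to talk to SQL Server instead.
import csv
import sqlite3

conn = sqlite3.connect("archive_source.db")   # hypothetical source database
cur = conn.cursor()

# One schema dump for the whole database (CREATE TABLE statements).
with open("schema.sql", "w", encoding="utf-8") as schema_file:
    for (ddl,) in cur.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ):
        schema_file.write(ddl + ";\n\n")

# One CSV per table, with a header row naming the columns.
table_names = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)]
for table in table_names:
    rows = cur.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in rows.description]
    with open(f"{table}.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(columns)
        writer.writerows(rows)

conn.close()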
EDIT
Offer the client a service contract where you make sure they are up to date with the latest archival technology on a yearly basis. This could be a good thing ($).

I suggest you consult a specialist company in this field.
You might also be interested in this article:
Strategies for long-term data retention
It might help to speak to one of those companies/organisations.

I don't know if anyone still reads this thread, but there is a really good solution for this.
There is a new company called Millenniata; they have a product called M-Disc. The M-Disc is essentially a DVD made out of rock-like materials that give it an estimated shelf life of 1,000+ years. You have to have a special DVD burner to burn the discs, but it is not that expensive, and any normal DVD reader can read them. I have a professor at BYU who helped form this company; it is some pretty cool technology. Good luck.
Link to M-Disc Website

Related

What is a good relational database design for stock market data?

Suppose there are two types of messages, QUOTE and TRADE. Both have different fields. For example, TRADE has only a single price, while QUOTE has both a bid and an ask price. I want to process messages in time order to do something like the following:
if (QUOTE) {
...
}
if (TRADE) {
...
}
My problem is the two messages are in different formats so I can't get them into the same database table. If I can't get them into the same database table how do I process sequentially? Any ideas for a suitable design?
The answer depends entirely on what you're doing and on where your app plugs into the data streams.
At one extreme, you might merely be answering customer quotes that you're pulling from an API, and basically implementing a cache. In this case two tables are fine.
At the other extreme, you might be monitoring real-time quotes for a high-frequency trading platform, in which case the throughput will probably rule out using a database at all (things built around Lisp, such as AllegroGraph, might be more appropriate), except to periodically collect aggregate statistics.
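To make the two-table case concrete, here is a minimal sketch (all table and column names are invented) showing that keeping QUOTE and TRADE in separate tables doesn't prevent sequential processing; you can merge the two already-time-ordered streams as you read them:

# Merge two time-ordered result sets into one stream without a shared table.
import heapq
import sqlite3

conn = sqlite3.connect("ticks.db")   # hypothetical database

quotes = conn.execute(
    "SELECT ts, 'QUOTE' AS kind, bid, ask, NULL AS price FROM quotes ORDER BY ts"
)
trades = conn.execute(
    "SELECT ts, 'TRADE' AS kind, NULL, NULL, price FROM trades ORDER BY ts"
)

# heapq.merge lazily interleaves the two already-sorted cursors by timestamp.
for ts, kind, bid, ask, price in heapq.merge(quotes, trades, key=lambda row: row[0]):
    if kind == "QUOTE":
        pass  # ... handle bid/ask ...
    elif kind == "TRADE":
        pass  # ... handle price ...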
The short answer is, "not really." For stock market and other time-series data, a key-value store like Berkeley DB or Mongo is pretty good. Also, a data format like NetCDF (http://en.wikipedia.org/wiki/NetCDF) will likely serve you better in the long run. It also depends on what kind of access you want and how long a history you want to store.
You didn't indicate what you were doing with the data, which should inform your choices of storage more than anything. For example, a high-speed trading application will have different storage tradeoffs than a historical batch processing system (where Hadoop + NetCDF would be great). YMMV
Kdb+/q is a very good option for tick data; it is used by major banks.
Here is the info about it.
You can install a trial version and play with it.

Data quality database model

Need an example of a database model to be attached to a database for data quality. The best form of the answer would, at the very least, be DDL that's executable in MySQL; other RDBMS DDLs are okay, I'll just post another question asking for a port of the code.
A good explanation would be a huge plus.
Questions, comments, feedback, etc. -- just comment, thanks!!
The biggest problem is identifying meaningful measures of quality. That's so highly application-dependent, I doubt that anybody will be able to help you very much. (At least not without a lot more information--perhaps more than you're allowed to give.)
But let's say your application records observations of birds by individuals. (I'm just throwing this together off the top of my head. Read it for the gist, and expect the details to crumble under scrutiny.) Under average field conditions,
some species are hard for even a beginner to get wrong
some species are hard for an expert to get right
a specific individual's ability varies irregularly over time (good days, bad days)
individuals usually become more skilled over time
you might be highly skilled at identifying hawks, and totally suck at identifying gulls
individuals are prone to suggestion (who they're with makes a difference in their reliability)
So, to take a shot at assessing the quality of an identification, you might try to record a lot of information besides the observation "3 red-tailed hawks at Cape May on 05-Feb-2011 at 4:30 pm". You might try to record
weather
lighting
temperature (some birders suck in the cold)
hours afield (some birders suck after 3 hours, or after 20 cold minutes)
names of others present
average difficulty of correctly identifying red-tailed hawks
probability that this individual could correctly identify red-tails under these field conditions
alcohol intake
Although this might be "meta" to field birders, to the database designer it's just data. And you'd design the tables just like you'd design them for any other application. (That's what I did, anyway.)
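If it helps to see the gist as actual tables, here is a minimal sketch of that bird-observation example, using sqlite3 for portability; every table and column name is invented, and the DDL is plain enough to port to MySQL:

# Observation metadata is just ordinary columns; a "quality" score can be
# computed later from whatever combination of them proves meaningful.
import sqlite3

conn = sqlite3.connect("birding.db")
conn.executescript("""
CREATE TABLE observer (
    observer_id   INTEGER PRIMARY KEY,
    name          TEXT NOT NULL
);

CREATE TABLE species (
    species_id    INTEGER PRIMARY KEY,
    common_name   TEXT NOT NULL,
    id_difficulty REAL            -- average difficulty of a correct identification
);

CREATE TABLE observation (
    observation_id INTEGER PRIMARY KEY,
    observer_id    INTEGER NOT NULL REFERENCES observer(observer_id),
    species_id     INTEGER NOT NULL REFERENCES species(species_id),
    observed_at    TEXT NOT NULL,  -- ISO-8601 timestamp
    location       TEXT,
    bird_count     INTEGER,
    -- the "meta" that feeds a quality assessment: just more data
    weather        TEXT,
    lighting       TEXT,
    temperature_c  REAL,
    hours_afield   REAL,
    companions     TEXT
);
""")
conn.close()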

EMR (Electronic Medical Record) standard record format?

A few associates and myself are starting an EMR project (Electronic Medical Records). I have heard talk in the past - and more so lately - about a standard record format - to facilitate the transferring of records when appropriate (HIPAA) from one facility to another. Has anyone seen any information on this?
You can look to HL7 for interoperability between systems (http://www.hl7.org/). Patient demographic information and textual notes can be passed. I've been out of the EMR space too long to know if any standards groups have done anything interesting of late. A standard format that maintains semantic meaning is a really, really difficult problem. See SnoMed (http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html) for one long-running ontology effort -- barely the start of a rich interchange format.
A word of warning from someone who spent several years with an upstart EMR vendor...This is a very hard business to be in. Sales cycles for large health systems literally can take years, and the amount of hand-holding required for smaller medical practices can quickly erode margins. Integration with existing practice management systems is non-standard, even if those vendors claim otherwise. More and more issues abound. I'm not sure that it's a wise space for an unfunded start-up to enter.
I think it's an error to consider HL7 to be a standard in the sense you seem to mean. It is heavily customized and can be quite different from one customer to the next. It's one of those standards with too much flexibility.
I recommend you read the standard (which should take you a while), then try to find a community of developers working with the standard. Ask them for horror stories, and be prepared for what you'll hear.
A month late, but...
The standard to shoot for is definitely HL7. It is used in many fields, so it is highly customizable, but there is a well-defined standard for healthcare. Each message (ACK, DSR, MCF), segment (PID, PV1, OBR, MSH, etc.), sequence and event type (A08, A12, A36) has a specific meaning regardless of your system of choice.
We haven't had a problem interfacing MiSYS, Statlan, Oacis, Epic, MUSE, GE Centricity/Lastword and others sending DICOM, ADT, PACS information between the systems we have in use. Most of these systems will be set up with an interface engine to tweak messages where needed, so adding a way to filter HL7 messages as they come through to your system, and as they go out to the downstreams, would be a must.
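To give a feel for why that kind of filtering is workable, here is a rough sketch of pulling segments and the event type out of an HL7 v2 message; the sample message is made up and heavily trimmed, and a real system should use a proper HL7 library plus site-specific mappings:

# HL7 v2 messages are lines of segments (MSH, PID, PV1, ...) with pipe-delimited fields.
sample = "\r".join([
    "MSH|^~\\&|SENDING_APP|SENDING_FAC|RECEIVING_APP|RECEIVING_FAC|202401011230||ADT^A08|MSG00001|P|2.3",
    "PID|1||123456^^^HOSP^MR||DOE^JANE||19800101|F",
    "PV1|1|I|WARD^101^A",
])

segments = {}
for line in sample.split("\r"):
    fields = line.split("|")
    segments.setdefault(fields[0], []).append(fields)

# Route or filter on the event type found in MSH-9.
msh = segments["MSH"][0]
event = msh[8]                                        # "ADT^A08" in this made-up message
if event.endswith("A08"):
    print("patient update:", segments["PID"][0][5])   # "DOE^JANE"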
Even if there were a new "presidential standard" for interoperability - and I would hazard a guess that it would be HL7 anyway - I would build the system with HL7 messaging, as this is currently the industry-accepted standard.
While solving interoperability, you shouldn't care only about the interchange format; the local storage formats should be standardized as well, to simplify the transformation to the interchange format and vice versa.
openEHR is a great format for storage, it is more expressive than HL7 v2, v3 and CDA, so it can be transformed easily to any of those. The specs are open and here: http://openehr.org/programs/specification/releases/1.0.2
For the interchange format, any of HL7 v2, v3 and CDA are good. Also consider CCR and CCD.
http://www.aafp.org/practice-management/health-it/astm.html
If you want to go outside HL7 thinking and are looking for an comprehensive EMR or EHR with a specified record format rather than a record extract message interchange format, then have a look at openEHR, http://www.openehr.org/. The ISO 13606 extract standard is (almost) a subset of openEHR. You will also find open source reference libraries and openEHR implementations of different maturity available in Java, .NET, Ruby, Python, Groovy etc.
Some organisations are also producing HL7 artifacts like CDA as output from openEHR based EHR/EMR systems.
Have a look at the Continuity of Care Record--IIRC, that's what Google Health uses for input. It's not an HL7-family standard (there's a competing HL7-family standard--I don't recall what it's called off the top of my head).
There likely will not be a standard medical record format until the government dictates the format of one and requires its use by force of law.
That almost assuredly will not happen without socialized national health care. So in reality zero chance.
That's a correct answer, but I think something should be added about the meaningful use of EMRs: Officials Announce ‘Meaningful Use,’ EHR Certification Criteria
Last week, CMS released proposed regulations defining the “meaningful use” of electronic health records, Reuters reports (Wutkowski/Heavey, Reuters, 12/31/09).
In addition, the Office of the National Coordinator for Health IT released an interim final rule describing the required certification standards for EHR technology (Simmons, HealthLeaders Media, 12/31/09).
Under the 2009 federal economic stimulus package, health care providers who demonstrate meaningful use of certified EHRs will qualify for incentive payments through Medicaid and Medicare.
Officials will offer a 60-day public comment period after both regulations are published in the Federal Register on Jan. 13. The interim final rule on EHR certification is scheduled to take effect 30 days after publication (Goedert, Health Data Management, 12/30/09). http://www.myemrstimulus.com/
This is a very hard problem because data collection starts with an MD and the only coding they know (ICD and CPT) is all about billing, not anything likely to be of use between providers (esp. in a form where the MD can be held legally liable). And they hate even that much paperwork.
Add to that the fact that HIPAA dictates that the patient, not the provider, owns the data. Not that patients could understand it or do anything useful with it if they had it.
Good luck. Whatever happens will result from coercion by the govt and be a long long time coming IMHO.
Interestingly the one source of solid medical info turns out to be the VA (because they don't have the adversarial issues of payment and legal liability.) Go figure. That might be a good place to start for a standard with any existing data and some momentum, though. Here's another question with some info.

How do games handle saved content?

I don't see an answer to this question here on SO which makes me afraid that it's incredibly simple and I'm just missing something but here goes.
Background, feel free to skip: I need a single course for my bachelor's degree that I skipped out on years ago. Theoretically it's Computer Graphics, but since I left it has become more Game Development. And that's great because to me it's more interesting than the fill algorithms and translations and whatnot that it used to be. It's a 4th year course only offered every other year, but I've managed to talk the department into letting me take a 4th year independent study on the same topic and call that good enough.
The prof "running" the independent study doesn't teach the actual Computer Graphics course so while he's a smart guy this isn't really his field. So most of my questions are left to me, a text book and the internet. You know...like an independent study should be. :)
/Background
I've got a buddy that likes to develop game systems for fun. I plan to take one of his table top games and make it into a computer game using XNA.
I don't foresee any insurmountable challenges with the game mechanics but one thing I'm curious about is how do most games save their content? I mean that in a couple of ways and hopefully I can express them clearly.
Take the case of any RPG you've ever played. You can hit the "Save" button and save the world, your character's information and whatever other information is necessary. Then later on you can hit the "Load" button and bring it back.
Or the case of NPC dialogue. When I bump into Merchant #853 he randomly spits out one of 3 different greetings.
There are others that I can think of but they're really just variations on the same theme. Even with those two examples it seems to me the same mechanic could be used, but what is that mechanic?
I've been doing web development for years so my mind automatically jumps to "databases!". A database is the solution to any problem. And I can see how it could work here but the overhead seems pretty steep. "Here's my 6mb compiled game...oh and 68mb MySQL installation." Or even worse since I'm using XNA, maybe I'd need to find a way to bundle SQL Server. :)
I thought maybe XML but that doesn't feel right to me either. How would it work if I wanted to run on the XBox? Or Zune? (Those aren't necessary for what I'm doing, but there must be a solution somewhere that takes them into account.)
Anyone know the secret? Or have some ideas anyway?
Thanks
Jeff
There are two main ways games are saved, a simple one and a complex one. The first way is to simply store the current level, the current score and a handful of other stats. This is seen in games such as Super Mario Galaxy and most earlier cartridge-based console games. The save game doesn't restore your exact position, but just which levels you have completed. These save games are generally very simple and require very little memory.
The second way not only stores your overall progress, but stores each and every little detail, such as enemy positions, their current animation frame and so on, so that loading a save game will place you at the exact spot where you stopped, with all the enemies right in place, instead of back at the start of a level. These savegames tend to get much bigger than the other version and thus are mostly seen on PC games.
Databases are used in neither of these schemes, as the purpose of a database is to provide the ability to dynamically query data structures. What the game needs, however, isn't a way to query individual pieces, but just a way to statically store them. When a savegame is loaded, it is loaded completely into memory and from there on the game engine does its thing with the data. There are a handful of exceptions, such as MMORPGs, which might work on a database, but single-player games generally don't.
How the data is actually stored depends on the game. Most common seem to be simple binary data formats, as they are much better in terms of disk space than XML. In older games those binary formats were often raw dumps of pieces of memory of the game's process, so they didn't have any well-thought-out structure and often broke when a patch or a different version of the game got released; in some modern games that's still the case. XML can be used too, as well as any other text-based file format.
In large part this is more a game design issue than a programming one, as the way a game can be saved can drastically change how it's played. The simple way, where you just save the level number and some stats, is however a lot easier to implement, as it's just a few lines of code. The second one requires serialization of most of your classes, which for a complex game can be quite a tricky issue and lead to many subtle bugs.
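As a concrete illustration of the "simple" style described above, here is a minimal sketch of a fixed-layout binary save record; the fields are invented, and an XNA/C# game would do the same thing with BinaryWriter/BinaryReader:

# A tiny fixed-layout binary save: level, score and position, nothing more.
import struct

SAVE_FORMAT = "<IifFF".lower()   # level (uint), score (int), x, y, z floats, little-endian

def save_game(path, level, score, x, y, z):
    with open(path, "wb") as f:
        f.write(struct.pack(SAVE_FORMAT, level, score, x, y, z))

def load_game(path):
    with open(path, "rb") as f:
        level, score, x, y, z = struct.unpack(
            SAVE_FORMAT, f.read(struct.calcsize(SAVE_FORMAT))
        )
    return {"level": level, "score": score, "position": (x, y, z)}

save_game("slot1.sav", level=3, score=1250, x=10.5, y=0.0, z=-4.25)
print(load_game("slot1.sav"))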
One approach is to use .net serialization.
Make sure the state of your game is a fully connected graph and that each class in that graph is marked as Serializable (with the SerializableAttribute); then for saving (and loading) you can use normal .NET serialization.
You can look at the codebase for Project Xenocide (open source XNA game) to see how it was done there.
You could use an SQLite database, with the SQLite.NET wrapper. I've used this, and found it quite simple. The whole DLL is only 850KB, and the database itself sits in a single file (with temp files created as needed). So your users shouldn't have an issue.
But you could also use a simple XML file, or a home-grown binary format. It all depends on how you're going to be querying the data, and how much data is involved. There is no one answer.
As others have noted, serialization is the way to go. And Gamasutra just published an article on data baking.
From my limited experience developing games, save games really don't use much storage. As tvanfosson said, you normally store most things in memory while playing the game, so saving state to disk isn't a problem.
Here's a short example. Assuming a single person RPG, if you needed to save your character's location only, you'd have perhaps a level number, xyz coordinates and maybe the direction you're facing. That's just a few bytes.
Now assume you need to save the state/location of things like health packs, crates, enemies, character's health and picked up items, etc. You could have a few hundred of these at most which would easily be less than 10KB.
Obviously things can get very complicated with more complex games. The trick is to only store what is truly necessary to recreate the player's experience. A lot of games only let you save at certain places, like the end of a level. In this case you only need to store the new level number plus the outcome of previous levels (e.g. health remaining, picked up items).
Even if you allow arbitrary save points you can ignore the state of any places/levels that you cannot return to. And you probably wouldn't want the user to be able to save mid-jump.
EDIT: With regard to file format... use whatever is convenient for the data type! XML is quite a nice way of doing things. I'm not sure how effective a database would be, since for an RPG each fragment of data can be very different; you might end up with a bunch of tables with one row each.
Most games use their own, binary, file formats. Firstly this reduces the storage amount dramatically. Secondly, it helps prevent users cheating by editing the save game manually - if you have XML like <health value="10"/> it's very easy to edit the file to read <health value="100"/>. The downside of binary is that it's much more difficult for debugging.
While the game is running, I'd try to keep everything relating to the current context in memory. Your initialization can be kept in some suitable serialized format and read in on start up. XML would work, but it's somewhat verbose. A custom compact binary format is probably more appropriate. The same is true of the saved state. Whatever objects need to be reinitialized when the saved game is loaded should be serialized to a custom binary format and then reconstituted on load. If you run into memory problems, a small custom database optimized for speed would be another alternative. It could be pre-populated on installation.

What's a good way to store raster data?

I have a variety of time-series data stored on a more-or-less georeferenced grid, e.g. one value per 0.2 degrees of latitude and longitude. Currently the data are stored in text files, so at day-of-year 251 you might see:
251
12.76 12.55 12.55 12.34 [etc., 200 more values...]
13.02 12.95 12.70 12.40 [etc., 200 more values...]
[etc., 250 more lines]
252
[etc., etc.]
I'd like to raise the level of abstraction, improve performance, and reduce fragility (for example, the current code can't insert a day between two existing ones!). We'd messed around with BLOB-y RDBMS hacks and even replicating each line of the text file format as a row in a table (one row per timestamp/latitude pair, one column per longitude increment -- yecch!).
We could go to a "real" geodatabase, but the overhead of tagging each individual value with a lat and long seems prohibitive. The size and resolution of the data haven't changed in ten years and are unlikely to do so.
I've been noodling around with putting everything in NetCDF files, but think we need to get past the file mindset entirely -- I hate that all my software has to figure out filenames from dates, deal with multiple files for multiple years, etc.. The alternative, putting all ten years' (and counting) data into a single file, doesn't seem workable either.
Any bright ideas or products?
I've assembled your comments here:
I'd like to do all this "w/o writing my own file I/O code"
I need access from "Java Ruby MATLAB" and "FORTRAN routines"
When you add these up, you definitely don't want a new file format. Stick with the one you've got.
If we can get you to relax your first requirement - ie, if you'd be willing to write your own file I/O code, then there are some interesting options for you. I'd write C++ classes, and I'd use something like SWIG to make your new classes available to the multiple languages you need. (But I'm not sure you'd be able to use SWIG to give you access from Java, Ruby, MATLAB and FORTRAN. You might need something else. Not really sure how to do it, myself.)
You also said, "Actually, if I have to have files, I prefer text because then I can just go in and hand-edit when necessary."
My belief is that this is a misguided statement. If you'd be willing to make your own file I/O routines then there are very clever things you could do... And as an ultimate fallback, you could give yourself a tool that converts from the new file format to the same old text format you're used to... And another tool that converts back. I'll come back to this at the end of my post...
You said something that I want to address:
"leverage 40 yrs of DB optimization"
Databases are meant for relational data, not raster data. You will not leverage anyone's DB optimizations with this kind of data. You might be able to cram your data into a DB, but that's hardly the same thing.
Here's the most useful thing I can tell you, based on everything you've told us. You said this:
"I am more interested in optimizing my time than the CPU's, though exec speed is good!"
This is frankly going to require TOOLS. Stop thinking of it as a text file. Start thinking of the common tasks you do, and write small tools - in WHATEVER LANGUAGE(S) - to make those things TRIVIAL to do.
And if your tools turn out to have lousy performance? Guess what - it's because your flat text file is a cruddy format. But that's just my opinion. :)
I'd definitely change from text to binary but keep each day in a separate file still. You could name them in such a way that insertions in between don't cause any strangeness with indices, such as by including the date and possible time in the filename. You could also consider the file structure if you have several fields per location for example. Is it common to look for a small tile from a large number of timesteps? In that case you might want to store them as tiles containing data from several days. You didn't mention how the data is accessed which plays a big role in how to organise it efficiently.
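As a rough sketch of that "one binary file per day" idea, assuming the roughly 252 x 204 float grid implied by the sample above (names, shapes and file extension are placeholders):

# One float32 grid per day, with the date in the filename so inserting a
# missing day never disturbs its neighbours.
import numpy as np

GRID_SHAPE = (252, 204)   # rows x columns, guessed from the sample text files

def write_day(date_str, grid):
    np.asarray(grid, dtype=np.float32).tofile(f"grid_{date_str}.f32")

def read_day(date_str):
    return np.fromfile(f"grid_{date_str}.f32", dtype=np.float32).reshape(GRID_SHAPE)

write_day("2011-09-08", np.random.rand(*GRID_SHAPE))
grid = read_day("2011-09-08")
print(grid.shape, grid.dtype)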
Clarifications:
I'm surprised you added "database" as one of the tags, and considered it as an option. Why did you do this?
Essentially, you have a 2D, single component floating point image at every time step. Would you agree with this way of viewing your data?
You also mentioned the desire to insert a day between two existing ones - which seems to be a very odd thing to do. Why would you need to do that? Is there a new day between May 4 and May 5 that I don't know about?
Is "compression" one of the things you care about, or are you just sick of flat files?
Would a float or a double be sufficient to store your data, or do you feel you need more arbitrary precision?
Also, what programming language(s) do you want to access this data with?
Your answer on how to store the data depends entirely on what you're going to do with the data. For example, if you only ever need to retrieve by specifying a date or a date range, then storing in a database as a BLOB makes some sense. But if you need to find records that have certain values, you'll need to do something different.
Please describe how you need to be able to access the data.
Matt, thanks very much, and likewise longneck and jirv.
This post was partly an experiment, testing the quality of stackoverflow discourse. If you guys/gals/alien lifeforms are representative, I'm sold.
And on point, you've clarified my thinking considerably. Mind, I still might not necessarily implement your advice, but know that I will be thinking about it very seriously. >;-)
I may very well leave the file format the same, add to the extant C and/or Ruby routines to tack on the few low-level features I lack (e.g. inserting missing timesteps), and hang an HTTP front end on the whole thing so that the data can be consumed by whatever box needs it, in whatever language is currently hoopy. While it's mostly unchanging legacy software that constructs these data, we're always coming up with new consumers for it, so the multi-language/multi-computer requirement (gee, did I forget that one?) applies to the reading side, not the writing side. That also obviates a whole slew of security issues.
Thanks again, folks.
