Explaining benefits of an array to a lay person?

I develop code in our proprietary system using a scripting language that is unique to that system.
Our director has allowed us to request enhancements to this language, which currently lacks user-definable arrays.
I need to write a concept brief on why we need arrays and how they can benefit us; however, I need to explain it in a way that someone with no understanding of code will understand.
I'm a programmer, therefore I suck at documentation and explaining things in a non-technical manner. I tried banging my head on the desk to see if anything useful would come out, but it hasn't. Can anyone help?

I love analogies.
It's much easier to have a 100-DVD holder that sits neatly on your floor and holds 100 DVDs in order than to have 100 individual DVDs scattered around your house wherever you last used them.
Especially relevant when you need to move the collection from one place to another or share it with a friend.

What's your application area? To speak the users' language you need to know that. Suppose it's stock trading: then what to you is an array may to the users be a portfolio -- get the quotes for several stocks at once rather than having to do it repeatedly, one at a time. If your application area is CRM, then the array will let the users check on a group of customers at once, rather than one at a time. And so on, and so forth.
In every application area there will be cases in which users may want to deal with a bunch of things at once, it being easier than dealing with one thing at a time. Phrase it in the appropriate vocabulary, and you have the case for arrays!

You might want to see if you can move the business away from your custom scripting environment and into a standard scripting environment like Lua or Python. You might be surprised at how much easier it is to get Lua up and running than it is to:
Support an in-house system
Create tools for it (do you have an IDE?)
Train new programmers in it
Live without modern features that you lack the time/skills to implement.
Key to getting that to happen would be to make Lua interoperable with your standard scripting system, or to write a translator from your old scripts to Lua scripts.

The benefit is that it makes the code shorter, and thus less money is spent coding and debugging. You can then present some example code that could have been shorter had the language supported arrays.
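For instance, here's a hypothetical before-and-after sketch (in Python, since the proprietary language isn't shown; the values and variable names are invented, but the shape of the argument is what matters):

    # Without arrays: one variable and one line per value.
    total_q1 = 1200.0
    total_q2 = 950.5
    total_q3 = 1100.25
    total_q4 = 1340.0
    grand_total = total_q1 + total_q2 + total_q3 + total_q4

    # With arrays: one variable and one operation, however many values there are.
    quarterly_totals = [1200.0, 950.5, 1100.25, 1340.0]
    grand_total = sum(quarterly_totals)

The with-arrays version doesn't grow when a fifth value shows up; the without-arrays version does, and that growth is exactly the coding and debugging cost you're asking the director to eliminate.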

Sounds like you've been asked to create code in the past (or anticipate having to create code in the future), where your job would have been faster/easier/cheaper if the system that you used had arrays.
That's the issue: you want to do more for your director and you need arrays to help you.
Your director will understand the business benefits of you having a better toolkit--you'll be able to do more for him or her. And that's how you increase business efficiency.
Tell your director: I want to improve my productivity for you and our team. To do so, arrays would be very helpful.

I like Alex's answer - it has to be put in terms of the user's problems. What problem (that they care about) can they solve with it that they cannot solve without it?
I used to teach introductory programming in college, and arrays are simply not something that comes easily to non-programmers. They need to understand some other basics first, like the sequential nature of programs, the Lego-block way programs are constructed, the idea of run-time (as opposed to write-time), and, really importantly, the concept of a variable as a container of a value: how that is different from its name, and how its contents change with time while its name does not.
I found a useful way to get into this area is to let them program a very simple, decimal, simulated computer, in "machine language". They get the notion of memory address vs. memory contents, and that address is just a number. That makes it a lot easier to introduce arrays in a more "real" language.
Another approach is to have them work on a kind of problem where they really start wishing they could invent variables on the fly. Like they don't want to just have a variable A, but they feel a need for A1, A2, etc., and then they would really like to say Ai where i is another variable. Once they feel the need for that, then they will grasp arrays. (For example, they could take a simple program that asks for their name and has a simple conversation with them, and then extend it to talk to two people at once, then three, and so on.)
Then, a useful next step is "parallel arrays", which can serve as rudimentary arrays of structures, e.g. N$(i) can be the name of student i, while A(i) can be the age of student i. This makes the idea useful.
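A rough sketch of the parallel-arrays idea in Python (the student data is invented):

    # Element i of each list describes student i.
    names = ["Alice", "Bob", "Carol"]   # N$(i) in the BASIC-style notation above
    ages = [20, 22, 19]                 # A(i)

    for i in range(len(names)):
        print(names[i], "is", ages[i], "years old")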
Only then would I dare to start to introduce algorithms like sorting, merging, table lookup, and so on.

I think to fully realize the potential of arrays, you must somehow mention two things:
1) Array Algorithms
Sort, find, etc. All the basics. Present this in your business brief as structured data that can organize itself. No extra query language. No variable naming conventions. All you need is good standards.
2) Multi-Dimensional Arrays
The power of arrays seems fully realized to me with matrices. With these you can hold practically limitless data.
Plus, depending on the power of the proprietary language you are using, arrays can store objects.

The power of an array is that it allows you to put a group of things together so that you can perform the same operation on all of them with less code.
Sorting is one example of an array operation, and is like having a box of index cards that you are putting in order.
Or if you had a collection of letters that need to go out, being able to write a loop that stamps each letter and then mails it is better than writing out:
Take first letter, stamp it.
Mail it.
Take second letter, stamp it.
Mail it.
Anything you'd refer to as the first, second, third, fifth, and so on, is basically like an array.
And then indexed/hashed arrays are like having an index in a book - you know the author describes the Defenestration of Prague somewhere in the volume, but looking in the index shows that it's on page 255.
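Here's what those two analogies might look like as code, sketched in Python with made-up stamp/mail stand-ins:

    def stamp(letter):
        print("stamping", letter)

    def mail(letter):
        print("mailing", letter)

    # One loop replaces "take first letter, stamp it, mail it; take second..."
    letters = ["letter to Ann", "letter to Bob", "letter to Cathy"]
    for letter in letters:
        stamp(letter)
        mail(letter)

    # An indexed/hashed array is like the book's index: jump straight to the page.
    book_index = {"Defenestration of Prague": 255}
    print(book_index["Defenestration of Prague"])  # 255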

Here's the easiest-to-understand benefit: It lets you refer to things by a number. Try to emphasize the importance of this.

Related

When to use an array vs database

I'm a student and starting to relearn again the basics of programming.
The problem I stated above started when I read some Facebook posts saying that most programmers use arrays in their applications and that arrays are useful. I realized that I had never used arrays in my programs.
I read some books, but they only show the syntax of arrays and don't discuss when to apply them in creating real-world applications. I tried to research this on the Internet but couldn't find anything. Do you have circumstances in which you use arrays? Please share them with me so I can get an idea.
Also, to clear my doubts, can you please explain why arrays are good for storing information when a database can also store information? When is the right time for me to use a database, and when should I use arrays?
I hope to get a clear answer, because I have one remaining semester before my internship and I want to clear my head on this. I am not asking about any specific programming language, because I know most programming languages have arrays.
I hope to get an answer that I can easily understand.
When is the right time for me to use a database, and when should I use arrays?
I can see how databases and arrays may seem like competing solutions to the same problem, but you're comparing apples and oranges. Arrays are a way to represent structured data in memory. Databases are a tool to store data on disk until you need to retrieve it.
The question you pose is kind of like asking: "When is the right time to use an integer to hold a value, vs a piece of paper?" One of them is a structural representation in memory; the other is a storage tool.
Do you have circumstances in which you use arrays?
In most applications, databases and arrays work together. Applications often retrieve data from a database, and hold it in an array for easy processing. Here is a simple example:
Google allows you to receive an alert when something of interest is mentioned on the news. Let's call it the event. Many people can be interested in the event, so Google needs to keep a list of people to alert. How? Probably in a database.
When the event occurs, what does Google do? Well it needs to:
Retrieve the list of interested users from the DB and place it in an array
Loop through the array and send a notification to each user.
In this example, arrays work really well because users form a collection of similarly shaped data structures that needs to be put through a similar process. That's exactly what arrays are for!
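In code, the flow might look like this Python sketch, where fetch_interested_users and send_notification are hypothetical stand-ins for the real database query and mailer:

    def fetch_interested_users(event_id):
        # Stand-in for the real database query.
        return ["ann@example.com", "bob@example.com"]

    def send_notification(address, event_id):
        print("notifying", address, "about event", event_id)

    event_id = 42
    users = fetch_interested_users(event_id)  # step 1: DB -> array
    for user in users:                        # step 2: loop and notify each one
        send_notification(user, event_id)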
Some other common uses of arrays
A bank wants to send invoice and payment-due reminders at the end of the day. So it retrieves the users with past-due payments from the DB, and loops through that array of users sending notifications.
An IT admin panel wants to check whether all critical websites in a list are still online. So it loops through the array of domains, pings each one, and records the results in a log.
An educational program wants to perform statistical functions on student test results. So it puts the results in an array to easily perform operations such as average, sum, standardDev...
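For that last example, once the results are in an array the aggregate operations become one-liners (Python sketch, invented scores):

    import statistics

    scores = [88, 92, 79, 85, 94]
    print(sum(scores))               # total
    print(statistics.mean(scores))   # average
    print(statistics.stdev(scores))  # standard deviation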
Arrays are also awesome at keeping things in a predictable order. You can be certain that as you loop forward through an array, you encounter values in the order you put them in. If you're trying to simulate a checkout line at the store, the customers in a queue are a perfect candidate to represent in an array because:
They are similarly shaped data: each customer has a name, cart contents, wait time, and position in line
They will be put through a similar process: each customer needs methods for enter queue, request checkout, approve payment, reject payment, exit queue
Their order should be consistent: When your program executes next(), you should expect that the next customer in line will be the one at the register, not some customer from the back of the line.
Trying to store the checkout queue in a database doesn't make sense because we want to actively work with the queue while we run our simulation, so we need data in memory. The database can hold a historical record of all customers and their checkout outcomes, perhaps for another program to retrieve and use in another way (maybe build customized statistical reports)
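A minimal Python sketch of that checkout line; a deque is used here because it's an array-like structure with cheap removal from the front, and the customer fields are the ones listed above:

    from collections import deque

    # Each customer is a similarly shaped record; the deque keeps line order.
    checkout_line = deque([
        {"name": "Ann", "cart": ["milk"], "wait_time": 0},
        {"name": "Bob", "cart": ["eggs", "jam"], "wait_time": 0},
    ])

    while checkout_line:
        customer = checkout_line.popleft()  # next() is always the front of the line
        print("checking out", customer["name"],
              "with", len(customer["cart"]), "items")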
These are two different things. Let me try to explain them simply:
Array: a container object that holds a fixed number of values. The array is stored in memory, so it depends on your requirements, but when you need something fixed-size and fast, just use an array.
Database: use one when you have relational data, or when you would like to store data somewhere without worrying about the size of the objects. You can store 10, 100, or 1000 records in your DB. It's also flexible: you can select/query/update the data as you like. Simply put: if you have relational data, in large amounts, that you want to access flexibly, use a database.
Hope this helps.
There are a number of ways to store data when you have multiple instances of the same type of data. (For example, say you want to keep information on all the people in your city. There would be some sort of object to hold the information on each person, and you might want to have a data structure that holds the information on every person.)
Java has two main ways to store multiple instances of data in memory: arrays and Collections.
Databases are something different. The difference between a database and an array or collection, as I see it, are:
databases are persistent, i.e. the data will stay around after your program has finished running;
databases can be shared between programs, often programs running in all different parts of the world;
databases can be extremely large, much, much larger than could fit in your computer's memory.
Arrays and collections, however, are intended only for use by one program as it runs. Your program may want to keep track of some information in order to do its calculations. But the data will be in your computer's memory, and therefore other programs on other computers won't be able to access it. And when your program is done running, the data is gone. However, since the data is in memory, it's much faster to use it than data in a database, which is stored on some sort of external device. (This is really an overgeneralization, and doesn't consider things like virtual memory and caching. But it's good enough for someone learning the basics.)
The Java run time gives you three basic kinds of collections: sets, lists, and maps. A set is an unordered collection of unique elements; you use that when the data doesn't belong in any particular order, and the main operations you want are to see if something is in the set, or return all the data in the set without caring about the order. A list is ordered, though; the data has a particular order, and provides operations like "get the Nth element" for some number N, and adding to the ends of the list or inserting in a particular place in the list. A map is unordered like a set, but it also attaches keys to the data, so that you can look for data by giving the key. (Again, this is an overgeneralization. Some sets do have order, like SortedSet. And I haven't even gotten into queues, trees, multisets, etc., some of which are in third-party libraries.)
Java provides a List type for ordered lists, but there are several ways to implement it. One is ArrayList. Like all lists, it provides the capability to get the Nth item in the list. But an ArrayList provides this capability faster; under the hood, it's able to go directly to the Nth item. Some other list implementations don't do that--they have to go through the first, second, etc., items, until they get to the Nth.
An array is similar to an ArrayList, but it has a different syntax. For an array x, you can get the Nth element by referring to x[n], while for an ArrayList you'd say x.get(n). As far as functionality goes, the biggest difference is that for an array, you have to know how big it is before you create it, while an ArrayList can grow. So you'd want to use an ArrayList if you don't know beforehand how big your list will be. If you do know, then an array is a little more efficient, I think. Although you can probably get by mostly with just ArrayList, arrays are still fundamental structures in every computer language. The implementation of ArrayList depends on arrays under the hood, for instance.
Think of an array as a book, and a database as a library. You can't share the book with others at the same time, but you can share a library. You can't put the entire library in one book, but you can check out one book at a time.

Feature selection and unsupervised learning for multilingual data + machine learning algorithm selection

Questions
I want to classify/categorize/cluster/group together a set of several thousand websites. There's data that we can train on, so we can do supervised learning, but it's not data that we've gathered and we're not adamant about using it -- so we're also considering unsupervised learning.
What features can I use in a machine learning algorithm to deal with multilingual data? Note that some of these languages might not have been dealt with in the Natural Language Processing field.
If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language differently? Different languages might have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which might affect the decision to partition.
I was thinking of using decision trees, or maybe Support Vector Machines (SVMs) to allow for more features (from my understanding of them). This post suggests random forests instead of SVMs. Any thoughts?
Pragmatic approaches are welcome! (Theoretical ones, too, but those might be saved for later fun.)
Some context
We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (maybe up to 10, but we're not sure).
We have training data in the form of hundreds of websites already classified. However, we may choose not to use that data set -- if other categories make more sense, we're open to not using the training data we have, since it is not something we gathered in the first place. We are in the final stages of scraping data/text from websites.
Now we must decide on the issues above. I have done some work with the Brown Corpus and the Brill tagger, but this will not work because of the multiple-languages issue.
We intend to use the Orange machine learning package.
According to the context you have provided, this is a supervised learning problem.
Therefore, you are doing classification, not clustering. If I misunderstood, please update your question to say so.
I would start with the simplest features, namely tokenize the unicode text of the pages, and use a dictionary to translate every new token to a number, and simply consider the existence of a token as a feature.
Next, I would use the simplest algorithm I can - I tend to go with Naive Bayes, but if you have an easy way to run SVM this is also nice.
Compare your results with some baseline - say assigning the most frequent class to all the pages.
Is the simplest approach good enough? If not, start iterating over algorithms and features.
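A tiny end-to-end sketch of that plan, using scikit-learn for illustration (the question mentions Orange, so treat this as illustrative only; the pages and labels here are invented):

    from sklearn.dummy import DummyClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    pages = ["acheter des livres en ligne", "buy books online",
             "dernieres nouvelles du sport", "latest sports news"]
    labels = ["shop", "shop", "news", "news"]

    # Tokenize and map each token to a feature index.
    X = CountVectorizer().fit_transform(pages)

    # Simplest algorithm first: Naive Bayes...
    clf = MultinomialNB().fit(X, labels)

    # ...compared against a most-frequent-class baseline.
    baseline = DummyClassifier(strategy="most_frequent").fit(X, labels)
    print(clf.score(X, labels), baseline.score(X, labels))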
If you go the supervised route, then the fact that the web pages are in multiple languages shouldn't make a difference. If you go with, say, lexical features (bag-o'-words style), then each language will end up yielding disjoint sets of features, but that's okay. All of the standard algorithms will likely give comparable results, so just pick one and go with it. I agree with Yuval that Naive Bayes is a good place to start, and only if that doesn't meet your needs should you try something like SVMs or random forests.
If you go the unsupervised route, though, the fact that the texts aren't all in the same language might be a big problem. Any reasonable clustering algorithm will first group the texts by language, and then within each language cluster by something like topic (if you're using content words as features). Whether that's a bug or a feature will depend entirely on why you want to classify these texts. If the point is to group documents by topic, irrespective of language, then it's no good. But if you're okay with having different categories for each language, then yeah, you've just got as many separate classification problems as you have languages.
If you do want a unified set of classes, then you'll need some way to link similar documents across languages. Are there any documents in more than one language? If so, you could use them as a kind of statistical Rosetta Stone, to link words in different languages. Then, using something like Latent Semantic Analysis, you could extend that to second-order relations: words in different languages that don't ever occur in the same document, but which tend to co-occur with words that do. Or maybe you could use something like anchor text or properties of the URLs to assign a rough classification to documents in a language-independent manner and use that as a way to get started.
But, honestly, it seems strange to go into a classification problem without a clear idea of what the classes are (or at least what would count as a good classification). Coming up with the classes is the hard part, and it's the part that'll determine whether the project is a success or failure. The actual algorithmic part is fairly rote.
The main answer is: try different approaches. Without actual testing it's very hard to predict which method will give the best results. So I'll just suggest some methods I would try first and describe their pros and cons.
First of all, I would recommend supervised learning. Even if the data classification is not very accurate, it may still give better results than unsupervised clustering. One reason is the number of random factors involved in clustering. For example, the k-means algorithm relies on randomly selected starting points, which can lead to very different results on different runs (though the x-means modification seems to normalize this behavior). Clustering will give good results only if the underlying elements form well-separated areas in the feature space.
One approach to handling multilingual data is to use multilingual resources as support points. For example, you can index some Wikipedia articles and create "bridges" between the same topics in different languages. Alternatively, you can create a multilingual association dictionary like the one this paper describes.
As for methods, the first thing that comes to mind is instance-based semantic methods like LSI. These use the vector space model to calculate distances between words and/or documents. In contrast to other methods, LSI can efficiently handle synonymy and polysemy. Its disadvantages are computational inefficiency and a lack of implementations. One of the phases of LSI uses a very big co-occurrence matrix, which for a large corpus of documents will require distributed computing and other special treatment. There's a modification of LSA called Random Indexing which does not construct the full co-occurrence matrix, but you'll be hard pressed to find a good implementation of it. Some time ago I created a library in Clojure for this method, but it is pre-alpha now, so I can't recommend using it. Nevertheless, if you decide to give it a try, you can find the project 'Clinch' of a user 'faithlessfriend' on GitHub (I won't post a direct link to avoid unnecessary advertisement).
Beyond special semantic methods, the rule "simplicity first" applies, and from that standpoint Naive Bayes is the right place to start. The one note here is that the multinomial version of Naive Bayes is preferable: my experience is that word counts really do matter.
SVM is a technique for classifying linearly separable data, and text data is almost never linearly separable (at least several common words appear in any pair of documents). That doesn't mean SVM cannot be used for text classification -- you should still try it -- but results may be much worse than on other machine learning tasks.
I don't have much experience with decision trees, but using them for efficient text classification seems strange to me. I have seen some examples where they gave excellent results, but when I tried the C4.5 algorithm on this task, the results were terrible. I believe you should get some software where decision trees are implemented and test them yourself. It is always better to know than to guess.
There's much more to say on each topic, so feel free to ask more questions.

Practical to save thousands of data structures in a file and do specific lookups?

There's been a discussion between me and some colleagues who are taking the same class as me (and thus have the same project) about saving data to files and reading from those files only when we need that specific data.
For instance, the project is something about managing a social network. I'm not going into specifics because it doesn't matter, but the idea is to use the best data structures to manipulate this data.
Let's say I'm using a hash table to save the users' profile data. Some of them argue that only some specific information should be saved in the data structures, like an ID that represents a user. Everything else should be put in files, and we should access the files to get the data we want when we want it.
I don't think this is practical... It could be if we were using some library for a database like SQLite or something, but we are not, and I don't think we are supposed to. We are only supposed to code everything ourselves using C functions, like these. Nor do I think we are supposed to do perfect memory management. The requirements of the project do not ask us to code a database, or even a pseudo-database. What this project demands of us are the best data structures (as long as we know how to justify why we picked them instead of others) to store the type of data, and all the data, specified for the project.
I should let you know that we had 2 classes before this whose knowledge is meant to be applied to this project. One of them dealt with the basics of C: functions, structures, arrays, strings, file I/O, recursion, pointers and simple data structures like binary trees and linked lists, stuff like that. The other was about more complex data structures: hash tables, AVL trees, heaps, graphs, etc. It also covered time complexity, big O notation and things like that.
For instance, let's say all I have in memory is the IDs of the users, and then I need to find all friends of a specific user. I'll have to process the whole file (or files) to find the friends of that user. It would be much easier if I could have all that data in memory already.
It makes no sense to me that we need to pick (and justify) the data structures that best fit the project and then only use them to look up an ID. We would then need to do a second lookup to get the real data we need, which will take its time, won't it? Why did we bother with the data structures in the first place if we still need to search through a bunch of files on the hard drive?
How could it be possible, using standard C functions, coding everything manually and still simulate some kind of database? Is this practical at all?
Am I missing something here?
It sounds like the project might be more about how you design the relationships between your data "entities," and not as much about how you store them. I don't think storing data off in files would be a good solution - file IO will be much slower than accessing things in memory. If you had the need to persist data on the disk, you'd probably want to just use a database, rather than files (I know it's an academic course though, so who knows).
I think you should focus more on how you design your data types, and their relationships, to maximize the speed of lookups, searches, etc. For example, you could store all the users in a linked list, or store them in a tree, or a graph, but each will have its implications on how fast you can find users, etc. Depending on what features you want in your social networking site, there will be different designs that will allow different types of behavior to perform better than it would in other designs.
From what you're saying I doubt that you need to store anything on disk.
One thing that I would ask the teacher is if you're optimizing for time or space complexity (there will be a trade off between these two depending on what you're trying to achieve).
That can certainly be done. The resource forks in Mac System 5-8 files were stored as binary indexed databases (general use of the term, don't think SQL!). (I think the interface was actually written in assembly, but I could do it in C.)
The only thing is: it's a pain in the butt. Such files typically need to start with some kind of index or header, and then hold a bunch of records at predictable locations. (OK, sometimes the first index just points at some more indexes. How many layers of indirection do you care to manage?)
If you're going to do it, just remember: binary mode access.
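Just as a sketch of the header-plus-fixed-records layout (in Python's struct module for brevity rather than the C your course requires; with fopen/fseek/fread the shape is the same):

    import struct

    HEADER = struct.Struct("<i")     # header: record count
    RECORD = struct.Struct("<i32s")  # fixed size: int id + 32-byte name

    def write_records(path, records):
        with open(path, "wb") as f:  # binary mode, as noted above
            f.write(HEADER.pack(len(records)))
            for uid, name in records:
                f.write(RECORD.pack(uid, name.encode()))

    def read_record(path, i):
        with open(path, "rb") as f:
            f.seek(HEADER.size + i * RECORD.size)  # jump straight to record i
            uid, raw = RECORD.unpack(f.read(RECORD.size))
            return uid, raw.rstrip(b"\0").decode()

    write_records("users.dat", [(1, "ann"), (2, "bob")])
    print(read_record("users.dat", 1))  # (2, 'bob')

The fixed record size is what buys you the seek; variable-length records are where the extra index layers come in.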
Hmm... what about persistent storage?
If your project requires you to be able to remember friend data between two restarts of the app, then don't you think file storage (or whatever else) becomes an issue?
I'm having a very hard time figuring out what you are trying to ask here.
But there is a general rule that may apply:
If all of your data will fit in memory at once, it is usually best to load all of it into memory at once and keep it there. You write out to a file only to save, to exit, or for backup.
There are lots of exceptions to this rule, but for a class project where this is going to be the only major application running on the machine, you may as well store everything in memory. After all, you have already paid for the memory; you don't want it just sitting there idle.
I may have completely misunderstood the question you are trying to ask...

How do games handle saved content?

I don't see an answer to this question here on SO, which makes me afraid that it's incredibly simple and I'm just missing something, but here goes.
Background, feel free to skip: I need a single course for my bachelor's degree that I skipped out on years ago. Theoretically it's Computer Graphics, but since I left it has become more Game Development. And that's great because to me it's more interesting than the fill algorithms and translations and whatnot that it used to be. It's a 4th year course only offered every other year, but I've managed to talk the department into letting me take a 4th year independent study on the same topic and call that good enough.
The prof "running" the independent study doesn't teach the actual Computer Graphics course so while he's a smart guy this isn't really his field. So most of my questions are left to me, a text book and the internet. You know...like an independent study should be. :)
/Background
I've got a buddy that likes to develop game systems for fun. I plan to take one of his table top games and make it into a computer game using XNA.
I don't foresee any insurmountable challenges with the game mechanics but one thing I'm curious about is how do most games save their content? I mean that in a couple of ways and hopefully I can express them clearly.
Take the case of any RPG you've ever played. You can hit the "Save" button and save the world, your character's information and whatever other information is necessary. Then later on you can hit the "Load" button and bring it back.
Or the case of NPC dialogue. When I bump into Merchant #853 he randomly spits out one of 3 different greetings.
There are others that I can think of but they're really just variations on the same theme. Even with those two examples it seems to me the same mechanic could be used, but what is that mechanic?
I've been doing web development for years so my mind automatically jumps to "databases!". A database is the solution to any problem. And I can see how it could work here but the overhead seems pretty steep. "Here's my 6mb compiled game...oh and 68mb MySQL installation." Or even worse since I'm using XNA, maybe I'd need to find a way to bundle SQL Server. :)
I thought maybe XML but that doesn't feel right to me either. How would it work if I wanted to run on the XBox? Or Zune? (Those aren't necessary for what I'm doing, but there must be a solution somewhere that takes them into account.)
Anyone know the secret? Or have some ideas anyway?
Thanks
Jeff
There are two main ways games are saved: a simple one and a complex one. The first way is to simply store the current level, the current score and a handful of other stats. This is seen in games such as Super Mario Galaxy and most earlier cartridge-based console games. The save game doesn't restore your exact position, just which levels you have completed. These save games are generally very simple and require very little memory.
The second way not only stores your overall progress, but stores each and every little detail, such as enemy positions, their current animation frame and so on, so that loading a save game places you at the exact spot where you stopped, with all the enemies right in place, instead of back at the start of a level. These savegames tend to get much bigger than the other kind and thus are mostly seen in PC games.
Databases are used in neither of these schemes, as the purpose of a database is to provide the ability to dynamically query data structures. What the game needs, however, isn't a way to query individual pieces, but just a way to store them statically. When a savegame is loaded, it is loaded completely into memory, and from there on the game engine does its thing with the data. There are a handful of exceptions, such as MMORPGs, which might work on a database, but single-player games generally don't.
How the data is actually stored depends on the game. Most common seem to be simple binary formats, as they are much better in terms of disk space than XML. In older games those binary formats were often raw dumps of a piece of memory of the game's process, so they didn't have any well-thought-out structure and often broke when a patch or a different version of the game was released; in some modern games that's still the case. XML can be used too, as can any other text-based file format.
In large part this is more a game design issue than a programming one, as the way a game can be saved can drastically change how it's played. The simple way, where you just save the level number and some stats, is a lot easier to implement, as it's just a few lines of code. The second one requires serialization of most of your classes, which for a complex game can be quite tricky and lead to many subtle bugs.
One approach is to use .NET serialization.
Make sure the state of your game is a fully connected graph and that each class in that graph is marked as serializable (with the SerializableAttribute); then for saving (and loading) you can use normal .NET serialization.
You can look at the codebase for Project Xenocide (open source XNA game) to see how it was done there.
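For a rough analogue outside .NET: Python's pickle does the same serialize-the-whole-object-graph trick. This is purely illustrative (the game state here is invented), not the XNA way:

    import pickle

    # Hypothetical game state; everything reachable from it gets saved.
    game_state = {
        "level": 3,
        "player": {"name": "Jeff", "hp": 72, "position": (10.5, 0.0, -3.2)},
        "inventory": ["sword", "health pack"],
    }

    with open("savegame.dat", "wb") as f:
        pickle.dump(game_state, f)   # serialize the whole graph

    with open("savegame.dat", "rb") as f:
        loaded = pickle.load(f)      # reconstitute it on load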
You could use an SQLite database, with the SQLite.NET wrapper. I've used this, and found it quite simple. The whole DLL is only 850KB, and the database itself sits in a single file (with temp files created as needed). So your users shouldn't have an issue.
But you could also use a simple XML file, or a home-grown binary format. It all depends on how you're going to be querying the data, and how much data is involved. There is no one answer.
As others have noted, serialization is the way to go. And Gamasutra just published an article on data baking.
From my limited experience developing games, save games really don't use much storage. As tvanfosson said, you normally store most things in memory while playing the game, so saving state to disk isn't a problem.
Here's a short example. Assuming a single-player RPG, if you needed to save only your character's location, you'd have perhaps a level number, xyz coordinates and maybe the direction you're facing. That's just a few bytes.
Now assume you need to save the state/location of things like health packs, crates, enemies, character's health and picked up items, etc. You could have a few hundred of these at most which would easily be less than 10KB.
Obviously things can get very complicated with more complex games. The trick is to only store what is truly necessary to recreate the player's experience. A lot of games only let you save at certain places, like the end of a level. In this case you only need to store the new level number plus the outcome of previous levels (e.g. health remaining, picked up items).
Even if you allow arbitrary save points you can ignore the state of any places/levels that you cannot return to. And you probably wouldn't want the user to be able to save mid-jump.
EDIT: With regard to file format... use whatever is convenient for the data type! XML is quite a nice way of doing things. Not sure how effective a database would be, since for an RPG each fragment of data can be very different; you might end up with a bunch of tables with one row each.
Most games use their own binary file formats. Firstly, this reduces the storage required dramatically. Secondly, it helps prevent users from cheating by editing the save game manually -- if you have XML like <health value="10"/>, it's very easy to edit the file to read <health value="100"/>. The downside of binary is that it's much harder to debug.
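As a sketch of how compact a custom binary format can be, here's the level-plus-position save described a few answers up, packed with Python's struct (the field layout is invented for illustration):

    import struct

    SAVE = struct.Struct("<i3fb")  # level, x, y, z, facing: 17 bytes total

    def save_game(path, level, x, y, z, facing):
        with open(path, "wb") as f:
            f.write(SAVE.pack(level, x, y, z, facing))

    def load_game(path):
        with open(path, "rb") as f:
            return SAVE.unpack(f.read(SAVE.size))

    save_game("slot1.sav", 4, 10.5, 0.0, -3.25, 2)
    print(load_game("slot1.sav"))  # (4, 10.5, 0.0, -3.25, 2)

Seventeen opaque bytes are a lot harder to hand-edit than <health value="10"/>, though a determined cheater with a hex editor will still manage.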
While the game is running, I'd try to keep everything relating to the current context in memory. Your initialization can be kept in some suitable serialized format and read in on start up. XML would work, but it's somewhat verbose. A custom compact binary format is probably more appropriate. The same is true of the saved state. Whatever objects need to be reinitialized when the saved game is loaded should be serialized to a custom binary format and then reconstituted on load. If you run into memory problems, a small custom database optimized for speed would be another alternative. It could be pre-populated on installation.

What's a good way to store raster data?

I have a variety of time-series data stored on a more-or-less georeferenced grid, e.g. one value per 0.2 degrees of latitude and longitude. Currently the data are stored in text files, so at day-of-year 251 you might see:
251
12.76 12.55 12.55 12.34 [etc., 200 more values...]
13.02 12.95 12.70 12.40 [etc., 200 more values...]
[etc., 250 more lines]
252
[etc., etc.]
I'd like to raise the level of abstraction, improve performance, and reduce fragility (for example, the current code can't insert a day between two existing ones!). We'd messed around with BLOB-y RDBMS hacks and even replicating each line of the text file format as a row in a table (one row per timestamp/latitude pair, one column per longitude increment -- yecch!).
We could go to a "real" geodatabase, but the overhead of tagging each individual value with a lat and long seems prohibitive. The size and resolution of the data haven't changed in ten years and are unlikely to do so.
I've been noodling around with putting everything in NetCDF files, but I think we need to get past the file mindset entirely -- I hate that all my software has to figure out filenames from dates, deal with multiple files for multiple years, etc. The alternative, putting all ten years' (and counting) data into a single file, doesn't seem workable either.
Any bright ideas or products?
I've assembled your comments here:
I'd like to do all this "w/o writing my own file I/O code"
I need access from "Java Ruby MATLAB" and "FORTRAN routines"
When you add these up, you definitely don't want a new file format. Stick with the one you've got.
If we can get you to relax your first requirement - ie, if you'd be willing to write your own file I/O code, then there are some interesting options for you. I'd write C++ classes, and I'd use something like SWIG to make your new classes available to the multiple languages you need. (But I'm not sure you'd be able to use SWIG to give you access from Java, Ruby, MATLAB and FORTRAN. You might need something else. Not really sure how to do it, myself.)
You also said, "Actually, if I have to have files, I prefer text because then I can just go in and hand-edit when necessary."
My belief is that this is a misguided statement. If you'd be willing to make your own file I/O routines then there are very clever things you could do... And as an ultimate fallback, you could give yourself a tool that converts from the new file format to the same old text format you're used to... And another tool that converts back. I'll come back to this at the end of my post...
You said something that I want to address:
"leverage 40 yrs of DB optimization"
Databases are meant for relational data, not raster data. You will not leverage anyone's DB optimizations with this kind of data. You might be able to cram your data into a DB, but that's hardly the same thing.
Here's the most useful thing I can tell you, based on everything you've told us. You said this:
"I am more interested in optimizing my time than the CPU's, though exec speed is good!"
This is frankly going to require TOOLS. Stop thinking of it as a text file. Start thinking of the common tasks you do, and write small tools - in WHATEVER LANGUAGE(S) - to make those things TRIVIAL to do.
And if your tools turn out to have lousy performance? Guess what - it's because your flat text file is a cruddy format. But that's just my opinion. :)
I'd definitely change from text to binary, but keep each day in a separate file. You could name the files in such a way that insertions in between don't cause any strangeness with indices, for example by including the date (and possibly time) in the filename. You could also consider the file structure if you have several fields per location, for example. Is it common to look for a small tile across a large number of timesteps? In that case you might want to store the data as tiles containing several days each. You didn't mention how the data is accessed, which plays a big role in how to organise it efficiently.
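A minimal sketch of the binary-file-per-day idea in Python with numpy; the grid dimensions are guessed from the sample data above, so adjust to taste:

    import datetime
    import numpy as np

    GRID_SHAPE = (252, 204)  # lines x values per line, inferred from the sample

    def day_path(date):
        # Date-stamped names sort correctly and leave room for insertions.
        return "grid-" + date.isoformat() + ".npy"

    def write_day(date, grid):
        np.save(day_path(date), grid.astype(np.float32))

    def read_day(date):
        return np.load(day_path(date))

    d = datetime.date(2009, 9, 8)
    write_day(d, np.zeros(GRID_SHAPE))
    print(read_day(d).shape)  # (252, 204)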
Clarifications:
I'm surprised you added "database" as one of the tags, and considered it as an option. Why did you do this?
Essentially, you have a 2D, single component floating point image at every time step. Would you agree with this way of viewing your data?
You also mentioned the desire to insert a day between two existing ones - which seems to be a very odd thing to do. Why would you need to do that? Is there a new day between May 4 and May 5 that I don't know about?
Is "compression" one of the things you care about, or are you just sick of flat files?
Would a float or a double be sufficient to store your data, or do you feel you need more arbitrary precision?
Also, what programming language(s) do you want to access this data with?
Your answer on how to store the data depends entirely on what you're going to do with it. For example, if you only ever need to retrieve by specifying a date or a date range, then storing it in a database as a BLOB makes some sense. But if you need to find records that have certain values, you'll need to do something different.
Please describe how you need to be able to access the data.
Matt, thanks very much, and likewise longneck and jirv.
This post was partly an experiment, testing the quality of stackoverflow discourse. If you guys/gals/alien lifeforms are representative, I'm sold.
And on point, you've clarified my thinking considerably. Mind, I still might not necessarily implement your advice, but know that I will be thinking about it very seriously. >;-)
I may very well leave the file format the same, add to the extant C and/or Ruby routines to tack on the few low-level features I lack (e.g. inserting missing timesteps), and hang an HTTP front end on the whole thing so that the data can be consumed by whatever box needs it, in whatever language is currently hoopy. While it's mostly unchanging legacy software that constructs these data, we're always coming up with new consumers, so the multi-language/multi-computer requirement (gee, did I forget that one?) applies to the reading side, not the writing side. That also obviates a whole slew of security issues.
Thanks again, folks.

Resources