Sorry if this question is too silly or neurotic... but I can't figure it out by myself, so I want to see how others deal with it.
My question is:
I want to write a program that shows the progress of some task, so I need to record which state it is currently in so that someone can check it at any time. There are two methods:
Use two fields to represent the progress state: step and is_finished.
Use just one field: step. For example, if the task needs 5 steps, then 6 means finished (and 0 means not started?).
Comparing the two methods:
Two fields:
Seems clearer. And most importantly, logically speaking, aren't step and finished-or-not two separate concepts? I'm not sure about this.
When the task is finished, we set the is_finished field to true (or 1 if you like). But what do we do with the step field then? Increment it, or leave it alone because it no longer has any meaning?
One field:
Simple and space-saving, but not very intuitive. For example, we don't know what 6 really means just by looking at this field, because it may represent either "finished" or a middle step; we need other information (e.g. the total number of steps) to tell. And this meaning is potentially unstable if the total number of steps changes (the is_finished field in the two-field method would not be affected by this).
So how would you deal with it? Thanks!
UPDATE:
I forgot some points that may be useful in the original post:
The story is: we provide a web-based service for customers (the service has a time limit, e.g. a 1-year term). After a customer purchases it, our deployment program prepares the hardware (a virtual machine) and deploys some software, which takes some time to finish. We want to show progress information to the customer, and when deployment is finished the customer should be informed.
Database design:
It needs a usage-state field to represent running normally, running but overdue (expired), and stopped. What confuses me is whether it should also include the not-deployed-yet and deploying states.
The progress info should include some other information, e.g. the start time, so we can tell how much time has elapsed since the start. But this info does not need to be persistent, because we won't care about it once deployment is finished. So I decided to store the progress info in a separate (temporary) table. Then I think we need another field in a more persistent table to tell whether things are done. Can we combine that into the usage-state field mentioned above?
I like the one-field approach better, for the following reasons:
(Assuming you want to search on steps) you can "cover" all steps using only one simple index.
Should you ever want to attach some additional information to each of the steps, the one-field approach can easily accommodate a FOREIGN KEY towards a new table containing that information (see the sketch at the end of this answer).
Requires slightly less storage space. Storage is cheap these days, but that's not the point - caching and network performance is.
Two-field approach:
(Assuming you want to search on steps) might require a "fatter" composite index or even two indexes (which takes space, lowers the cache effectiveness and incurs maintenance cost for INSERT/UPDATE/DELETE operations).
Requires a CHECK to defend the database from "impossible" combinations. Funnily enough, some DBMSes don't enforce CHECKs (I'm looking at you, MySQL).
Requires slightly more storage space (and therefore slightly less of it fits into cache, takes up slightly more network bandwidth etc.).
NOTE: Should you choose to use NULLs, that could have "interesting" consequences under certain DBMSes (for example, Oracle doesn't index NULLs).
For example, we don't know what 6 really means
That doesn't really matter, as long as the client application knows what it means.
Design the database for applications, not humans.
And potentially this meaning is not very stable if the total steps will change
True, but you have the same problem with the two-field approach as well, if a new step is added in the "middle" of the existing steps.
Either UPDATE the table accordingly,
or never change the step values. For example, if step 5 is the last one, then a newly added step 6 is considered earlier despite having a greater value - your application (or the additional table I mentioned) will know the order of steps, even if their values are not ordered. If you really want "order by value" without resorting to UPDATE, make the steps 10, 20, 30, etc., so you can insert new steps in the gaps (the old BASIC line number trick).
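To make that concrete, here is a minimal sketch of the one-field approach with a lookup table for per-step information, using SQLite through Python purely for illustration. All table and column names (task, step_def, sort_order) are made up, and the gapped sort values follow the BASIC-line-number trick above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- Lookup table: one row per step, with gapped sort values (10, 20, 30, ...)
-- so a new step can be squeezed in later without renumbering anything.
CREATE TABLE step_def (
    step        INTEGER PRIMARY KEY,
    sort_order  INTEGER NOT NULL UNIQUE,
    description TEXT    NOT NULL
);

-- The task itself carries only the single progress column.
CREATE TABLE task (
    task_id INTEGER PRIMARY KEY,
    step    INTEGER NOT NULL REFERENCES step_def(step)
);

INSERT INTO step_def VALUES
    (0, 0,  'not started'),
    (1, 10, 'hardware prepared'),
    (2, 20, 'software deployed'),
    (3, 30, 'finished');

INSERT INTO task (task_id, step) VALUES (42, 1);
""")

# Progress report: join against the lookup table for a human-readable state.
row = conn.execute("""
    SELECT t.task_id, s.description
    FROM task AS t JOIN step_def AS s ON s.step = t.step
    WHERE t.task_id = 42
""").fetchone()
print(row)   # (42, 'hardware prepared')
```

Whether a task is finished is then just a question of which step_def row its step points at, so no separate is_finished column (and no CHECK constraint to keep the two in sync) is needed.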
It remains a matter of taste, but I would suggest the second option of a single int field, step. On inserting a new record, initialize the value of step to 0, which would indicate "not started yet". Any positive integer value would obviously denote the current step. As soon as the trajectory is completed, I would set step to NULL. As you correctly stated, this method does require solid documentation, but I think it is not too confusing.
I am collecting a large amount of data which is most likely going to be in a format as follows:
User 1: (a,o,x,y,z,t,h,u)
Where all the variables dynamically change with respect to time, except u, which is used to store the user name. What I am trying to understand, since my background is not very strong in "big data", is this: the array I end up with will be very large, something like 108000 x 3500, and since I will be performing analysis on each timestep and graphing it, I am trying to determine what an appropriate database to manage this in would be. Since this is for scientific research I was looking at CDF and HDF5, and based on what I read here (NASA) I think I will want to use CDF. But is this the correct way to manage such data for speed and efficiency?
The final data set will have all the users as columns, and the rows will be timestamped, so my analysis program would read row by row to interpret the data and make entries into the dataset. Maybe I should be looking at things like CouchDB or an RDBMS; I just don't know a good place to start. Advice would be appreciated.
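For concreteness, here is a rough sketch of what that layout (timesteps as rows, users as columns) could look like in HDF5, one of the formats under consideration, using h5py. The file name, chunk size, and dtype are placeholders, not recommendations.

```python
import numpy as np
import h5py

n_timesteps, n_users = 108000, 3500          # rough shape from the question

with h5py.File("sessions.h5", "w") as f:     # hypothetical file name
    # One 2D dataset: rows are timesteps, columns are users.
    dset = f.create_dataset(
        "values",
        shape=(n_timesteps, n_users),
        dtype="f8",
        chunks=(1000, n_users),              # chunk by blocks of rows for row-wise reads
        compression="gzip",
    )
    # Store the user names once, in a small companion dataset.
    f.create_dataset("users", data=np.array([f"user{i}" for i in range(n_users)], dtype="S16"))
    # One timestamp per row (e.g. seconds since the start of the experiment).
    f.create_dataset("time", data=np.arange(n_timesteps, dtype="f8"))

    # Write a single timestep (row) at a time as data arrives.
    dset[0, :] = np.random.rand(n_users)

# Reading back row by row for analysis:
with h5py.File("sessions.h5", "r") as f:
    row = f["values"][0, :]                  # one timestep across all users
```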
This is an extended comment rather than a comprehensive answer ...
With respect, a dataset of size 108000*3500 doesn't really qualify as big data these days, not unless you've omitted a unit such as GB. If it's 108000*3500 eight-byte values, that's only 3GB plus change. Any of the technologies you mention will cope with that with ease. I think you ought to make your choice on the basis of which approach will speed your development rather than speeding your execution.
But if you want further suggestions to consider, I suggest:
SciDB
Rasdaman, and
MonetDB
all of which have some traction in the academic big data community and are beginning to be used outside that community too.
I have been using CDF for some similarly sized data and I think it should work nicely. You will need to keep a few things in mind though. Considering I don't really know the details of your project, this may or may not be helpful...
3GB of data is right around the file size limit for the older version of CDF, so make sure you are using an up-to-date library.
While 3GB isn't that much data, depending on how you read and write it, things may be slow going. Make sure you use the hyper read/write functions whenever possible.
CDF supports meta-data (called global/variable attributes) that can hold information such as username and data descriptions.
It is easy to break data up into multiple files. I would recommend using one file per user. This will mean that you can write the user name just once for the whole file as an attribute, rather than in each record.
You will need to create an extra variable called epoch. This is a well-defined timestamp for each record. I am not sure if the timestamp you have now would be appropriate, or if you will need to process it some, but it is something you need to think about. Also, the epoch variable needs to have a specific type assigned to it (epoch, epoch16, or TT2000). TT2000 is the most recent version, which gives nanosecond precision and handles leap seconds, but most CDF readers that I have run into don't handle it well yet. If you don't need that kind of precision, I recommend epoch16, as that has been the standard for a while.
Hope this helps, if you go with CDF, feel free to bug me with any issues you hit.
Each item is an array of 17 32-bit integers. I can probably produce 120-bit unique hashes for them.
I have an algorithm that produces 9,731,643,264 of these items, and want to see how many of these are unique. I speculate that at most 1/36th of these will be unique but can't be sure.
At this size, I can't really do this in memory (as I only have 4 gigs), so I need a way to persist a list of these, do membership tests, and add each new one if it's not already there.
I am working in C (gcc) on Linux, so it would be good if the solution can work from there.
Any ideas?
This reminds me of some of the problems I faced working on a solution to "Knight's Tour" many years ago. (A math problem which is now solved, but not by me.)
Even your hash isn't that much help . . . at nearly the size of a GUID, they could easily be unique across all the known universe.
It will take approximately .75 terabytes just to hold the list on disk . . . 4 gigs of memory or not, you'd still need a huge disk just to hold them. And you'd need double that much disk or more to do the sort/merge solutions I talk about below.
If you could SORT that list, then you could just go through the list one item at a time looking for unique copies next to each other. Of course sorting that much data would require a custom sort routine (that you wrote) since it is binary (converting to hex would double the size of your data, but would allow you to use standard routines) . . . though likely even there they would probably choke on that much data . . . so you are back to your own custom routines.
Some things to think about:
Sorting that much data will take weeks, months or perhaps years. While you can do a nice heap sort or whatever in memory, because you only have so much disk space, you will likely be doing a "bubble" sort of the files regardless of what you do in memory.
Depending on what your generation algorithm looks like, you could generate "one memory load" worth of data, sort it in place, then write it out to disk in a file (sorted). Once that was done, you would just have to "merge" all those individual sorted files, which is a much easier task (even though there would be thousands of files, it would still be relatively easy). A sketch of this sort-then-merge approach follows this answer.
If your generator can tell you ANYTHING about your data, use that to your advantage. For instance, in my case, as I processed the Knight's Moves, I knew my output values were constantly getting bigger (because I was always adding one bit per move); that small piece of knowledge allowed me to optimize my sort in some unique ways. Look at your data and see if you know anything similar.
Making the data smaller is always good, of course. For instance, you talk about a 120-bit hash, but is that hash reversible? If so, sort the hashes, since they are smaller. If not, the hash might not be that much help (at least for my sorting solutions).
I am interested in the mechanics of issues like this and I'd be happy to exchange emails on this subject just to bang around ideas and possible solutions.
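To make the sort-then-merge idea concrete, here is a minimal sketch in Python (the question is in C, but the structure carries over directly): sort one memory load at a time, spill each sorted batch to its own file, then stream-merge the sorted files and count items that differ from their predecessor. The generate_items callable, the batch size, and the temp-file handling are placeholders.

```python
import heapq
import os
import struct
import tempfile

ITEM = struct.Struct("<17i")          # one item = 17 little-endian 32-bit ints (68 bytes)

def write_batch(batch):
    """Write one already-sorted batch of 17-tuples to its own temp file."""
    fd, path = tempfile.mkstemp(suffix=".sorted")
    with os.fdopen(fd, "wb") as f:
        for item in batch:
            f.write(ITEM.pack(*item))
    return path

def spill_sorted_batches(generate_items, batch_size=1_000_000):
    """Sort one memory load at a time and spill each sorted batch to disk."""
    paths, batch = [], []
    for item in generate_items():      # generate_items yields 17-tuples of ints
        batch.append(item)
        if len(batch) >= batch_size:
            paths.append(write_batch(sorted(batch)))
            batch = []
    if batch:
        paths.append(write_batch(sorted(batch)))
    return paths

def read_batch(path):
    with open(path, "rb") as f:
        while chunk := f.read(ITEM.size):
            yield ITEM.unpack(chunk)

def count_unique(paths):
    """Stream-merge the sorted batch files and count distinct items."""
    unique, previous = 0, None
    for item in heapq.merge(*(read_batch(p) for p in paths)):
        if item != previous:
            unique += 1
            previous = item
    return unique
```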
You can probably make your life a lot easier if you can place some restrictions on your input data: Even assuming only 120 significant bits, the high number of duplicate values suggests an uneven distribution, as an even distribution would make duplicates unlikely for a given sample size of 10^10:
2^120 = (2^10)^12 > (10^3)^12 = 10^36 >> 10^10
If you have continuous clusters (instead of sparse, but repeated values), you can gain a lot by operating on ranges instead of atomic values.
What I would do:
fill a buffer with a batch of generated values
sort the buffer in-memory
write ranges to disk, i.e. each entry in the file consists of the start and end value of a contiguous group of values (sketched below)
Then, you need to merge the individual files, which can be done online - i.e. as the files become available - the same way a stack-based mergesort operates: associate with each file a counter equal to the number of ranges in the file and push each new file onto a stack. When the file on top of the stack has a counter greater than or equal to that of the previous file, merge the two files into a new file whose counter is the number of ranges in the merged file.
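Here is a minimal sketch of the range representation described above: after sorting a batch in memory, runs of consecutive values collapse into (start, end) pairs, and duplicates disappear in the process. Values are treated as plain Python integers (e.g. the 120-bit hashes); the function name is made up.

```python
def collapse_to_ranges(sorted_values):
    """Turn a sorted, possibly duplicated list of integers into (start, end)
    ranges covering runs of consecutive values. Duplicates vanish for free."""
    ranges = []
    for v in sorted_values:
        if ranges and v <= ranges[-1][1] + 1:
            # Extends (or repeats a value within) the current run.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], v))
        else:
            ranges.append((v, v))
    return ranges

# Example: nine input values shrink to two ranges covering six distinct values.
rs = collapse_to_ranges([3, 3, 4, 5, 5, 9, 10, 10, 11])
print(rs)                                         # [(3, 5), (9, 11)]
print(sum(end - start + 1 for start, end in rs))  # 6 distinct values
```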
I can never decide if it's better to format data before inserting it into the DB, or when pulling it out.
I'm not talking about data sanitization; we all know to protect against SQL injection. I'm talking about: if the user gives you a URL and it doesn't have http:// in front of it, should you add that before inserting it into the DB or when pulling it out? What about more complex things, like formatting a big wad of text? Do I want to mark it up with HTML (or strip it down) before or after? What if I change my mind later and want to format it differently? I can't do this if I've already formatted it, but I can if I store it unformatted... but then I'm doing extra work every time I pull a piece of data out of the DB, which I could have done once and been done with it.
What are your thoughts?
From the answers, there seems to be a general consensus that things like URLs, phone numbers, and emails (anything with a well-defined format) should be normalized first to a consistent format. Things like text should generally be left raw or in a manipulable format for maximum flexibility. If speed is an issue, both formats may be stored.
I think it's best to make sure data in the database is in the most consistent format possible. You might have multiple apps using this data, so if you can make sure it's all the same format, you won't have to worry about reformatting different formats in every application.
Normalising URLs to a canonical form prior to insertion is probably okay; performing any kind of extensive formatting, e.g. HTML conversion/parsing etc. strikes me as a bad idea - always have the "rawest" data possible in your database, especially if you want to change the presentation format later.
In terms of avoiding unnecessary post-processing on every query, you might look into adopting object caching or similar techniques for the more expensive operations.
You're asking two questions here.
Normalization should always be performed prior to the database insertion, e.g. if a column only has URLs then they should always be normalized first.
Regarding formatting, that's a view problem and not a model (in this case DB) problem.
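For example, URL normalization before insertion might look like the following sketch, using Python's urllib.parse; the specific rules chosen here (default http:// scheme, lowercased host, trailing slash trimmed) are just one possible convention, not a standard.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(raw):
    """Normalize a user-supplied URL to one canonical form before storing it."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "http://" + raw            # assume a default scheme if none given
    scheme, netloc, path, query, fragment = urlsplit(raw)
    netloc = netloc.lower()              # host names are case-insensitive
    path = path.rstrip("/") or "/"       # treat "example.com/about/" like "example.com/about"
    return urlunsplit((scheme.lower(), netloc, path, query, fragment))

print(normalize_url("Example.COM/About/"))    # http://example.com/About
print(normalize_url("https://example.com"))   # https://example.com/
```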
In my opinion, it should be formatted first. If you choose to do it at the time of retrieval instead of insertion, this can cause problems down the road when other applications/scripts want to use data out of the same database. They will all need to know how to clean up the data when they pull it out.
Depends.
If you are dealing with well-defined items (SSN, zip code, phone number), store them formatted. (This does not necessarily mean including dashes or dots, etc.; it may mean removing them so everything is consistent.)
You have to be very careful if you change data before you store it. You could always run into a situation where you need to echo back to the original user the exact text that they gave you.
My inclination is usually to store data in the most flexible form possible. For instance, numbers should be stored using integer or floating-point types, not strings, because you can do math with numeric types but not with strings (although it's easy enough to convert between a string and a number that this is not a big deal). Perhaps a more practical example: dates/times should be stored using the database's actual date/time data type instead of strings. Also, maybe it's easier to convert HTML into plain text than vice versa, in which case you'd want to store your text as HTML. Or maybe even use a format like Markdown, which can be easily converted into either HTML or plain text.
It's the same reason vector graphics formats (SVG, EPS, etc.) exist: an SVG file is essentially a sequence of instructions specifying how to draw the image. It's easy to convert that into a bitmap image of any size, whereas if you only had a bitmap image to start with, you'd have a hard time changing its size (e.g. to create a thumbnail) without losing quality.
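As a sketch of the Markdown suggestion: store the raw Markdown in the database and convert it at display time. This uses the third-party markdown package; the function names are made up.

```python
import markdown                      # third-party "Markdown" package

def render_for_web(stored_markdown):
    """Convert the raw Markdown kept in the database into HTML at display time."""
    return markdown.markdown(stored_markdown)

def render_for_plaintext(stored_markdown):
    """The raw form is already readable as plain text, so just return it."""
    return stored_markdown

raw = "This is **important** text with a [link](http://example.com)."
print(render_for_web(raw))
# <p>This is <strong>important</strong> text with a <a href="http://example.com">link</a>.</p>
```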
It is possible you might want to store both the formatted and unformatted versions of the data. For instance, let's use American phone numbers as an example. If you store one column with just the numbers and one column with the most frequently needed format, such as (111) 111-1111, then you can easily format to client specifications for the special cases or pull the most common one out quickly without lots of casting. This takes very little extra time at the time of insert (and can be accomplished with a calculated column so it always happens no matter where the data came from).
Data should be scrubbed before being put in the database so that invalid dates, non-numeric data, etc. aren't ever placed in the field. Email is one field that people often put junk into for some reason. If it doesn't have an @ sign, it shouldn't be stored. This is especially true if you actually send emails through your application(s) using that field. It is a waste of time to try to send an email to 'contact his secretary' or 'aol.com', if you see what I mean.
If the format will be consistently needed, it is better to convert the data to that format once on insert or update and not have to convert it ever again. If the standard format changes, you will need to update the column for all existing records at that time, then use the new format going forward. If you have frequent changes of format and large tables, or if different applications use different formats, it might be best to store the data unformatted.
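A small sketch of the store-both-columns idea for US phone numbers, keeping the bare digits alongside the most frequently needed display format; the validation rules and names here are illustrative only.

```python
import re

def phone_columns(raw):
    """Return (digits_only, display) for a US phone number, or raise if invalid."""
    digits = re.sub(r"\D", "", raw)          # strip everything but digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # drop a leading country code
    if len(digits) != 10:
        raise ValueError(f"not a 10-digit US phone number: {raw!r}")
    display = f"({digits[0:3]}) {digits[3:6]}-{digits[6:]}"
    return digits, display

print(phone_columns("1-800-555-0123"))   # ('8005550123', '(800) 555-0123')
print(phone_columns("800.555.0123"))     # ('8005550123', '(800) 555-0123')
```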
I have a variety of time-series data stored on a more-or-less georeferenced grid, e.g. one value per 0.2 degrees of latitude and longitude. Currently the data are stored in text files, so at day-of-year 251 you might see:
251
12.76 12.55 12.55 12.34 [etc., 200 more values...]
13.02 12.95 12.70 12.40 [etc., 200 more values...]
[etc., 250 more lines]
252
[etc., etc.]
I'd like to raise the level of abstraction, improve performance, and reduce fragility (for example, the current code can't insert a day between two existing ones!). We'd messed around with BLOB-y RDBMS hacks and even replicating each line of the text file format as a row in a table (one row per timestamp/latitude pair, one column per longitude increment -- yecch!).
We could go to a "real" geodatabase, but the overhead of tagging each individual value with a lat and long seems prohibitive. The size and resolution of the data haven't changed in ten years and are unlikely to do so.
I've been noodling around with putting everything in NetCDF files, but think we need to get past the file mindset entirely -- I hate that all my software has to figure out filenames from dates, deal with multiple files for multiple years, etc. The alternative, putting all ten years' (and counting) data into a single file, doesn't seem workable either.
Any bright ideas or products?
I've assembled your comments here:
I'd like to do all this "w/o writing my own file I/O code"
I need access from "Java Ruby MATLAB" and "FORTRAN routines"
When you add these up, you definitely don't want a new file format. Stick with the one you've got.
If we can get you to relax your first requirement - ie, if you'd be willing to write your own file I/O code, then there are some interesting options for you. I'd write C++ classes, and I'd use something like SWIG to make your new classes available to the multiple languages you need. (But I'm not sure you'd be able to use SWIG to give you access from Java, Ruby, MATLAB and FORTRAN. You might need something else. Not really sure how to do it, myself.)
You also said, "Actually, if I have to have files, I prefer text because then I can just go in and hand-edit when necessary."
My belief is that this is a misguided statement. If you'd be willing to make your own file I/O routines then there are very clever things you could do... And as an ultimate fallback, you could give yourself a tool that converts from the new file format to the same old text format you're used to... And another tool that converts back. I'll come back to this at the end of my post...
You said something that I want to address:
"leverage 40 yrs of DB optimization"
Databases are meant for relational data, not raster data. You will not leverage anyone's DB optimizations with this kind of data. You might be able to cram your data into a DB, but that's hardly the same thing.
Here's the most useful thing I can tell you, based on everything you've told us. You said this:
"I am more interested in optimizing my time than the CPU's, though exec speed is good!"
This is frankly going to require TOOLS. Stop thinking of it as a text file. Start thinking of the common tasks you do, and write small tools - in WHATEVER LANGUAGE(S) - to make those things TRIVIAL to do.
And if your tools turn out to have lousy performance? Guess what - it's because your flat text file is a cruddy format. But that's just my opinion. :)
I'd definitely change from text to binary but still keep each day in a separate file. You could name them in such a way that insertions in between don't cause any strangeness with indices, such as by including the date and possibly the time in the filename. You could also consider the file structure if you have several fields per location, for example. Is it common to look for a small tile from a large number of timesteps? In that case you might want to store them as tiles containing data from several days. You didn't mention how the data is accessed, which plays a big role in how to organise it efficiently.
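For instance, a binary-file-per-day layout along those lines could look like the following NumPy sketch; the directory name, grid shape, and date-based file naming are assumptions, not part of the original setup.

```python
import datetime
import pathlib

import numpy as np

GRID_SHAPE = (252, 204)            # assumed rows (latitude) x columns (longitude)
DATA_DIR = pathlib.Path("grids")   # hypothetical directory, one .npy file per day

def write_day(date, values):
    """Store one day's grid as a binary .npy file named by its date,
    so a missing day can be inserted later without renumbering anything."""
    assert values.shape == GRID_SHAPE
    DATA_DIR.mkdir(exist_ok=True)
    np.save(DATA_DIR / f"{date.isoformat()}.npy", values.astype(np.float32))

def read_day(date):
    return np.load(DATA_DIR / f"{date.isoformat()}.npy")

day = datetime.date(2009, 9, 8)
write_day(day, np.random.rand(*GRID_SHAPE))
grid = read_day(day)
print(grid.shape, grid.dtype)      # (252, 204) float32
```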
Clarifications:
I'm surprised you added "database" as one of the tags, and considered it as an option. Why did you do this?
Essentially, you have a 2D, single component floating point image at every time step. Would you agree with this way of viewing your data?
You also mentioned the desire to insert a day between two existing ones - which seems to be a very odd thing to do. Why would you need to do that? Is there a new day between May 4 and May 5 that I don't know about?
Is "compression" one of the things you care about, or are you just sick of flat files?
Would a float or a double be sufficient to store your data, or do you feel you need more arbitrary precision?
Also, what programming language(s) do you want to access this data with?
Your answer on how to store the data depends entirely on what you're going to do with it. For example, if you only ever need to retrieve by specifying the date or a date range, then storing it in a database as a BLOB makes some sense. But if you need to find records that have certain values, you'll need to do something different.
Please describe how you need to be able to access the data.
Matt, thanks very much, and likewise longneck and jirv.
This post was partly an experiment, testing the quality of stackoverflow discourse. If you guys/gals/alien lifeforms are representative, I'm sold.
And on point, you've clarified my thinking considerably. Mind, I still might not necessarily implement your advice, but know that I will be thinking about it very seriously. >;-)
I may very well leave the file format the same, add to the extant C and/or Ruby routines to tack on the few low-level features I lack (e.g. inserting missing timesteps), and hang an HTTP front end on the whole thing so that the data can be consumed by whatever box needs it, in whatever language is currently hoopy. While it's mostly unchanging legacy software that constructs these data, we're always coming up with new consumers for them, so the multi-language/multi-computer requirement (gee, did I forget that one?) applies to the reading side, not the writing side. That also obviates a whole slew of security issues.
Thanks again, folks.