Need a good way to store loads of data - embedded-database

I'm generating data at a rate of 4096 bytes every 16.66ms. This data needs to be stored constantly and will be read randomly. It would be nice to have it in a relational database, but I think doing so many inserts would create too much overhead for the processor I'm working with (ARM11). And I don't need all the features that something like SQLite offers.
In fact, just writing this stuff to a file seems tempting because while most of the time I'll just be writing lots of data, when I actually do need to read data, I can just seek to the block I need. However, I just know I'm going to run into some problem along the way. Especially when I leave this thing running for a day and end up with gigabytes of data.
This just seems like a very naive solution to my problem and I need someone else to tell me so I can start thinking about a better solution. Thanks.

You should add some more details to get better answers: what are your use cases, do you need ACID, what storage are you writing to, and so on.
What is your OS, and do you only write fixed-size records? Just saying "I will do random access and this is my write rate" is much too unspecific.
You are writing about 240 KB/s, which is roughly 20 GB per day.
If you have just fixed size records, only append data and use Linux, then a plain file is great. Perhaps think about using some fsync calls, if your storage is fast enough.
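For what it's worth, here is a minimal sketch of the plain-file approach, assuming fixed 4096-byte records as described above. It is written in C# purely for illustration (the same pattern applies in C on embedded Linux), and the flush-on-every-write policy is an assumption you would tune for your hardware:

    using System;
    using System.IO;

    class BlockLog : IDisposable
    {
        const int BlockSize = 4096;   // one record, as described above
        readonly FileStream _file;

        public BlockLog(string path)
        {
            // Open (or create) the file; keep it seekable so reads can be random.
            _file = new FileStream(path, FileMode.OpenOrCreate,
                                   FileAccess.ReadWrite, FileShare.Read);
            _file.Seek(0, SeekOrigin.End);
        }

        // Append one fixed-size block to the end of the file.
        public void Append(byte[] block)
        {
            if (block.Length != BlockSize)
                throw new ArgumentException("expected a 4096-byte block");
            _file.Seek(0, SeekOrigin.End);
            _file.Write(block, 0, BlockSize);
            _file.Flush(true);   // push through to the device, roughly like fsync
        }

        // Random read: block N lives at byte offset N * BlockSize.
        public byte[] ReadBlock(long blockIndex)
        {
            var buffer = new byte[BlockSize];
            _file.Seek(blockIndex * BlockSize, SeekOrigin.Begin);
            _file.Read(buffer, 0, BlockSize);
            return buffer;
        }

        public void Dispose() => _file.Dispose();
    }

Flushing every 4096-byte block every 16.66 ms may be too aggressive for slow flash, so batching the flushes is a reasonable variation.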

Related

Best Memory Usage

First of all, thanks for stopping by; here is my question.
I'm working on a project where I use an array that contains a lot of information (about 300 variables, each about 25 characters). So my question is: what is the best way to store it?
I have two possible ways; please tell me which is better.
First way: make a normal local array where I can store all the needed information, which of course will be held in RAM (as far as I know).
Second way: store it in a file, and whenever I need the array, simply read the data from the file and rebuild the array.
Note that the array is used occasionally, not every time.
My second question is:
Can the hard drive be damaged if the program writes and reads so many times in a short period? And if so, what is the minimum interval at which I can write and read safely?
Thanks in advance.
Reading and writing files are very slow operations compared to RAM access. 300 strings of 25 characters each won't consume much space with modern RAM. If you need access to this data only rarely (once every 10 minutes or once per hour), you could probably keep it on the HDD, but in my opinion it is simpler to keep it in RAM the whole time.
You can find an answer to your second question here.
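As a rough back-of-the-envelope check of the "wouldn't consume too much space" point above (the question doesn't name a platform, so the 2-bytes-per-character figure is an assumption matching .NET/Java UTF-16 strings):

    using System;

    class MemoryEstimate
    {
        static void Main()
        {
            // ~300 strings of ~25 characters: the raw character data alone is tiny.
            const int strings = 300;
            const int charsPerString = 25;
            const int bytesPerChar = 2;   // UTF-16, as in .NET strings (assumption)
            long rawBytes = strings * charsPerString * bytesPerChar;   // 15,000 bytes
            Console.WriteLine($"~{rawBytes / 1024.0:F1} KB of character data");   // ~14.6 KB
            // Even with per-object overhead, the whole array stays well under 100 KB.
        }
    }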
The data you have mentioned can easily be stored in memory without any memory problem. If you need to store a much larger amount of data, NSKeyedArchiver (used to store objects in NSData form on disk) can be used, or the CoreData framework; CoreData also supports caching and is faster: http://nshipster.com/nscoding/. Hope this helps.

Derby/JavaDB char encoding - size on disk

I want to calculate the size my JavaDB database will need and therefore I need to know how char is encoded to know how many bytes it will use on disk.
Does anyone know which character encoding Derby uses? I have read that Derby uses Unicode, but couldn't find any information about the encoding (neither in the reference manual, nor any other pages).
I believe it's UTF8.
However, whenever I'm trying to figure out how much disk space my database will use, I always use a benchmarking approach: write a little test program that generates a sample database, using sample test data, taking care to try to make that data as representative of your application as you can. Generate enough data to be realistic (e.g., at least 10,000 rows).
Generate some nice round number of rows, like 10,000, 100,000, or 1,000,000. Then look at the actual database files that Derby created on disk, see how large they are, do a little bit of math, and you can figure out your result.
The nice thing about an approach like this is that it will help you catch mistakes like: forgetting an entire secondary index, or not understanding that your UNIQUE constraint added a bunch of overhead behind the scenes, or forgetting to count some of the columns that you thought wouldn't matter, but turned out to take a substantial amount of space.
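The sample data itself would be generated through Derby's JDBC driver in Java, but the measuring step is just "sum the file sizes and divide". A sketch of that step, shown in C# for consistency with the other examples on this page; the database directory and row count are placeholders for whatever your test program used:

    using System;
    using System.IO;
    using System.Linq;

    class DerbySizeCheck
    {
        static void Main()
        {
            // Point this at the directory Derby created for your sample database.
            const string dbDir = @"C:\data\sampledb";   // hypothetical path
            const long rowsInserted = 100000;           // however many rows the test wrote

            long totalBytes = new DirectoryInfo(dbDir)
                .EnumerateFiles("*", SearchOption.AllDirectories)
                .Sum(f => f.Length);

            Console.WriteLine($"Database size: {totalBytes / (1024.0 * 1024.0):F1} MB");
            Console.WriteLine($"Approx. {totalBytes / (double)rowsInserted:F0} bytes per row");
        }
    }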

Keep Big Object List in memory or Read From File .NET

I need to read a big list of objects (each object contains 2 strings and 1 Int32) that I extract from an XML web page; it should contain around 10,000 objects.
I need to take about 20 records from this list each minute.
I would like to know whether, in terms of performance and memory safety, it's better to keep this list in memory and take those 20 records every minute, or to download the XML from the web page, read it from the local disk each minute, and find those 20 records there.
Any other solutions would be accepted too :)
Update: forgot to say I'm talking about a WinForms C# application.
The first rule of optimisation is to measure before optimising.
When you load all the objects into memory, how much memory does it consume? How much more memory do you need for the rest of your app? How much memory does the machine have? Are you running in a 32- or 64-bit address space? How much memory do any other required apps need at the same time?
Once you've answered these questions you can then start to break down your approach to optimisation. In this case you need 20 records each minute, any 20 records? Do you need to iterate through all 10,000 to find the 20? How often does the XML file change?
P.S. Look at XmlReader vs XmlDocument for parsing the XML file.
For performance, you'll be much better off to avoid disk I/O.
For memory, both solutions will be the same unless you process the file a line at a time (which is going to give horrible performance, since you won't be able to hash/index the results).
If there is a "key" I would keep them in memory in a dictionary.
If in doubt, profile.
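To make the "keep it in memory, keyed by a dictionary" suggestion concrete, here is a rough C# sketch. The element and attribute names (item, id, name, value) are invented, since the question doesn't show the actual XML, and the choice of which 20 keys to look up is left to the caller:

    using System.Collections.Generic;
    using System.Linq;
    using System.Xml.Linq;

    class Record
    {
        public string Id { get; set; }
        public string Name { get; set; }
        public int Value { get; set; }
    }

    class RecordCache
    {
        // Keyed lookup: ~10,000 entries is trivial to hold in memory.
        readonly Dictionary<string, Record> _records;

        public RecordCache(string xml)
        {
            _records = XDocument.Parse(xml)
                .Descendants("item")                    // hypothetical element name
                .Select(x => new Record
                {
                    Id = (string)x.Attribute("id"),     // hypothetical attribute names
                    Name = (string)x.Attribute("name"),
                    Value = (int)x.Attribute("value")
                })
                .ToDictionary(r => r.Id);
        }

        // Called once a minute: 20 key lookups, no disk or network I/O involved.
        public List<Record> Take(IEnumerable<string> keys)
        {
            return keys.Take(20).Select(k => _records[k]).ToList();
        }
    }

If the feed changes, re-downloading and rebuilding the dictionary on a timer is still far cheaper than re-reading the file from disk every minute.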

Why do I have to set the max length of every single text column in the database?

Why is it that every RDBMS insists that you tell it what the max length of a text field is going to be... why can't it just infer this information from the data that's put into the database?
I've mostly worked with MS SQL Server, but every other database I know also demands that you set these arbitrary limits on your data schema. The reality is that this is not particularly helpful or friendly to work with, because the business requirements change all the time and almost every day some end-user is trying to put a lot of text into that column.
Does anyone with some knowledge of the inner workings of an RDBMS know why we don't just infer the limits from the data that's put into the storage? I'm not talking about guessing the type information, but guessing the limits of a particular text column.
I mean, there's a reason why I don't use nvarchar(max) on every text column in the database.
Because computers (and databases) are stupid. Computers don't guess very well and, unless you tell them, they can't tell that a column is going to be used for a phone number or a copy of War and Peace. Obviously, the DB could be designed so that every column could contain an infinite amount of data -- or at least as much as disk space allows -- but that would be a very inefficient design. In order to get efficiency, then, we make a trade-off and make the designer tell the database how much we expect to put in the column. Presumably, there could be a default so that if you don't specify a length, the default is used. Unfortunately, any default would probably be inappropriate for the vast majority of people from an efficiency perspective.
This post not only answers your question about whether to use nvarchar(max) everywhere, but it also gives some insight into why databases historically didn't allow this.
It has to do with speed. If the max size of a string is specified, you can optimize the way the information is stored for faster I/O on it. When speed is key, the last thing you want is a sudden shuffling of all your data just because you changed a state abbreviation to the full name.
With the max size set, the database can allocate the maximum space to every entry in that column, and no matter how the value changes, no addresses need to change.
This is like saying, why can't we just tell the database we want a table and let it infer what type and how many columns we need from the data we give it.
Simply, we know better than the database will. Suppose you have a one-in-a-million chance of putting a 2,000-character string into the database; most of the time it's 100 characters. The database would probably blow up or refuse the 2,000-character string. It simply cannot know that you're going to need a 2,000-character column if for the first three years you've only entered 100-character strings.
Also, the column lengths are used to optimize row placement so that rows can be read or skipped faster.
I think it is because the RDBMS uses random data access. To do random data access, it must know which address on disk to jump to in order to read the data quickly. If every row of a single column had a different data length, it could not compute the starting address to jump to directly; the only way would be to load all the data and scan through it.
If the RDBMS changed the fixed length of a column (for example, to the max length of all rows) every time you add, update, or delete, it would be extremely time-consuming.
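The "jump straight to an address" point is easiest to see with a little arithmetic: with fixed-size rows, the offset of any row is a single multiplication. A toy illustration in C# (the column widths are made up):

    using System;

    class FixedWidthRows
    {
        // Made-up column widths for illustration.
        const int IdBytes = 4;       // INT
        const int NameBytes = 100;   // CHAR(50), 2 bytes per character
        const int PhoneBytes = 20;   // CHAR(10)
        const int RowSize = IdBytes + NameBytes + PhoneBytes;   // 124 bytes per row

        // With fixed-size rows, locating row N is a single multiplication.
        static long RowOffset(long rowIndex) => rowIndex * RowSize;

        static void Main()
        {
            Console.WriteLine(RowOffset(1_000_000));   // 124000000: seek straight there
            // If widths varied per row, finding row N would mean reading every
            // preceding row (or maintaining a separate offset index).
        }
    }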
What would the DB base its guess on? If the business requirements change regularly, it's going to be just as surprised as you. If there's a reason you don't use nvarchar(max), there's probably a reason it doesn't default to that as well...
Check this thread: http://www.sqlservercentral.com/Forums/Topic295948-146-1.aspx
For the sake of an example, I'm going to step into some quicksand and suggest you compare it with applications allocating memory (RAM). Why don't programmers ask for/allocate all the memory they need when the program starts up? Because often they don't know how much they'll need. This can lead to apps grabbing more and more memory as they run, and perhaps also releasing memory. And you have multiple apps running at the same time, and new apps starting, and old apps closing. And apps always want contiguous blocks of memory, they work poorly (if at all) if their memory is scattered all over the address space. Over time, this leads to fragmented memory, and all those garbage collection issues that people have been tearing their hair out over for decades.
Jump back to databases. Do you want that to happen to your hard drives? (Remember, hard drive performance is very, very slow when compared with memory operations...)
Sounds like your business rule is: Enter as much information as you want in any text box so you don't get mad at the DBA.
You don't allow users to enter 5000 character addresses since they won't fit on the envelope.
That's why Twitter has a text limit and saves everyone the trouble of reading through a bunch of mindless drivel that just goes on and on and never gets to the point, but only manages to infuriate the reader, making them wonder why you have such disregard for their time by choosing a self-centered and inhumane lifestyle focused on promoting the act of copying and pasting as much data as the memory buffer gods will allow...

How to store and compress data for real time data logging?

When developing software that records input signals (numbers) in real time, how can this data best be stored and compressed? Would an SQL engine be good for this, permitting fast data mining in the future, or are there other data formats that would be suitable or compressed enough for up to 1,000 data samples per second?
I don't mind building in VC++ but ideas applicable to C# would be ideal.
It is hard to say without more info, such as what the source is, whether you will need to query the stored data, and so on.
But for 1,000 samples/sec, you should probably look at holding a few seconds of data in memory and then writing it out in bulk to persistent storage on another thread. (A multi-processor machine is recommended.)
If you decide to do it via a managed language, keep the same data structure around for keeping the samples - so that the GC does not need to collect memory too often. You can get marginally better performance by using pointers and the unsafe keyword (provides direct access to the memory structure and eliminates bounds checking code for arrays).
I don't know how much CPU time is needed for you to collect each sample, or how time-critical it is to read each sample at a specified time (will they be buffered in the device you are reading from?). If the sampling is time-critical, you have 1 ms per sample, and then you probably cannot afford the risk of the garbage collector kicking in, as it will block your thread for some time. In this case, I would go for an unmanaged approach.
SQL Server would easily be able to hold your data, or you could write it to a file. It mostly depends on what you need to do with the data at a later time. I don't know how much data each sample is, but let's assume it is 8 bytes. Then you have 8,000 bytes per second of raw data to write; perhaps you have some overhead, so it could be 10 kB/s. Most storage mechanisms I can think of will be able to write data at this speed. Just make sure to write on another thread than the one that is doing the sampling.
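A minimal sketch of the "buffer in memory, write on another thread" idea in C#; the sample type (a double), the file layout, and the use of GZipStream for compression are assumptions, not something the question prescribes:

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.IO.Compression;
    using System.Threading.Tasks;

    class SampleLogger : IDisposable
    {
        readonly BlockingCollection<double> _queue = new BlockingCollection<double>();
        readonly Task _writer;

        public SampleLogger(string path)
        {
            _writer = Task.Run(() =>
            {
                using (var file = new FileStream(path, FileMode.Create))
                using (var gzip = new GZipStream(file, CompressionMode.Compress))
                using (var writer = new BinaryWriter(gzip))
                {
                    // Drain the queue on a background thread; the sampling thread
                    // only ever does a cheap Add() and never touches the disk.
                    foreach (double sample in _queue.GetConsumingEnumerable())
                        writer.Write(sample);
                }
            });
        }

        // Called from the sampling thread, up to 1,000 times per second.
        public void Log(double sample) => _queue.Add(sample);

        public void Dispose()
        {
            _queue.CompleteAdding();   // write out remaining samples and close the file
            _writer.Wait();
        }
    }

The BlockingCollection is the in-memory buffer, and the stream buffering inside GZipStream/BinaryWriter turns the per-sample writes into larger batched writes to disk.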
You may want to look at time-series databases, rather than relational. These will be optimised to deal with the sort of data and usage you're considering.
Kx is a popular choice, as is Fame.
