I need to choose a database management system (DBMS) that uses the least amount of main memory since we are severely constrained. Since a DBMS will use more and more memory to hold the index in main memory, how exactly do I tell which DBMS has the smallest memory footprint?
Right now I just have a memory monitor program open while I perform a series of queries we'll call X. Then I run the same set of queries X on a different DBMS, see how much memory it uses over its lifetime, and compare the footprints.
Is this a not-dumb way of going about it? Is there a better way?
Thanks,
Jbu
Just use SQLite. In a single process. With C++, preferably.
What you can do in the application is manage how you fetch data. If you fetch all rows from a given query, the driver may try to build a Collection of the entire result set in your application, which can consume memory very quickly if you're not careful. This is probably the most likely cause of memory exhaustion.
To solve this, open a cursor to a query and fetch the rows one by one, discarding the row objects as you iterate through the result set. That way you only store one row at a time, and you can predict the "high-water mark" more easily.
Depending on the JDBC driver (i.e. the brand of database you're connecting to), it may be tricky to convince the JDBC driver not to do a fetchall. For instance, some drivers fetch the whole result set to allow you to scroll through it backwards as well as forwards. Even though JDBC is a standard interface, configuring it to do row-at-a-time instead of fetchall may involve proprietary options.
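For illustration, here is a minimal sketch of the row-at-a-time approach. The connection URL, table and column names are placeholders, and the exact settings needed to avoid a fetchall are driver-specific (for example, MySQL's Connector/J only streams when the fetch size is Integer.MIN_VALUE, and PostgreSQL only uses a cursor when autocommit is off):

    import java.sql.*;

    public class RowAtATime {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL: substitute the JDBC URL for your own database.
            try (Connection conn = DriverManager.getConnection("jdbc:yourdb://host/dbname");
                 Statement stmt = conn.createStatement(
                         ResultSet.TYPE_FORWARD_ONLY,    // no backward scrolling,
                         ResultSet.CONCUR_READ_ONLY)) {  // so the driver need not cache the whole result
                stmt.setFetchSize(100);                  // hint: pull rows in small batches
                try (ResultSet rs = stmt.executeQuery("SELECT id, payload FROM big_table")) {
                    while (rs.next()) {
                        process(rs.getLong("id"), rs.getString("payload"));
                        // each row is released as you advance; only one batch is live at a time
                    }
                }
            }
        }

        private static void process(long id, String payload) {
            // placeholder for whatever per-row work you need
        }
    }

This also keeps the memory-footprint comparison above honest: the high-water mark then reflects the DBMS rather than your own result buffering.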
On the database server side, you should be able to manage the amount of memory it allocates to index cache and so on, but the specific things you can configure are different in each brand of database. There's no shortcut for educating yourself about how to tune each server.
Ultimately, this kind of optimization is probably answering the wrong question.
Most likely the answers you gather through this sort of testing are going to be misleading, because the DBMS will react differently under "live" circumstances than during your testing. Furthermore, you're locking yourself into a particular architecture: it's difficult to change DBMS down the road once you've got code written against it. You'd be far better served finding which DBMS will fill your needs and simplify your development process, and then making sure you're optimizing your SQL queries and indices to fit the needs of your application.
I was wondering what threshold of data volume may determine whether to use a database or simple file I/O, assuming that fresh data needs to be handled quite frequently.
Edit: There is no multi-threading in my application. Data needs to be stored and then retrieved sequentially and at this point I am not really worried about anyone else accessing the data/data safety.
Given this backdrop is there still any advantage to using databases over files?
It depends and you probably should consider other factors as well.
If you use a database, there is overhead for transactions, security, index management etc. on the one hand. On the other hand you can get caching (which could significantly speed up your application) and better performance for random access if you have a lot of data. In a multithreaded environment I suggest using a database because of its properly implemented locking mechanism.
Flat files are OK for really simple and small data. Do you really need to open and close them so often?
If your tables are indexed correctly, then I think a database will give you better performance than the file system. Even if the data grows to millions of records, performance will hold up much better than file-based storage with the same amount of data.
A database is probably preferable, and in this case I'd suggest using an SQLite database instead of SQL Server or MySQL, since the data is small.
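For what it's worth, here is a rough sketch of how little ceremony an embedded SQLite database needs from Java, assuming the sqlite-jdbc driver is on the classpath (the table and columns are invented for the example):

    import java.sql.*;

    public class TinySqliteStore {
        public static void main(String[] args) throws SQLException {
            // Everything lives in a single local file; there is no server to install or manage.
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS readings (id INTEGER PRIMARY KEY, ts TEXT, value REAL)");
                st.execute("CREATE INDEX IF NOT EXISTS idx_readings_ts ON readings(ts)");  // index for fast lookups
                st.execute("INSERT INTO readings (ts, value) VALUES ('2012-01-01T00:00:00', 42.0)");
                try (ResultSet rs = st.executeQuery("SELECT value FROM readings ORDER BY ts DESC LIMIT 1")) {
                    while (rs.next()) {
                        System.out.println(rs.getDouble("value"));  // most recent reading
                    }
                }
            }
        }
    }

You still get indexes, transactions and SQL, but the data stays in a single file you can copy around like any other.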
In this case I would say DB. You are writing and reading, and that's what DBs are good at.
On the flip side, if you are holding a tiny amount of data, that's a lot of overhead for not much data.
It also depends on licensing etc.; a file will be a lot quicker.
There are two servers, A and B, that serve web pages and communicate with each other via an internal network when they need to share data. Sometimes server A needs to produce a webpage with a chart that requires intensive calculations involving a large quantity of data on server B. Let's assume both servers are equally powerful and the network is fast. I'm trying to figure out a good way to do this. Ideas so far:
Server B could do a lot of the calculations then pass back a small set of results to server A. Less flexible, but fairly efficient. Unfortunately this is a somewhat complex transaction. I'd compare it to a method with a lot of parameters, side-effects and results.
Server B could simply expose its raw data to server A and let A handle all of the calculations itself. More "open" and flexible, but less efficient, since there is a lot of data involved in the calculation and server A would have to pull all of it over the network.
Server B could produce and return the chart, or a link to it. Perhaps the most efficient, but least flexible, and also creates a messy relationship where server B is partially responsible for producing server A's webpages. However, at least server A doesn't have to worry about getting back a complex object and know what to do with it.
Is there a best practice here that results in a balance of performance and maintainable encapsulation? Or is it simply a case-by-case situation? I'm leaning towards the first option. I hope this question is "answerable" enough and not a discussion issue. I've tried to concentrate the general problem into a specific scenario.
I cannot answer the best-practice part of your question, but I will address the performance aspect. What you have described is two servers with no real technical difference between them, and presumably you have equal access to update both of them, neither is starved of resources, etc.
To truly gauge performance, a few numbers are needed:
A) How many requests per second?
B) For your first two options, how much traffic will flow between the machines?
C) How much latency can you withstand?
Your first option will be reasonable for performance if the data flowing is small enough. Your second option will generate the most load on your infrastructure, and if this is approaching hardware limits then ditch that option now. If all the numbers are small, then this isn't really a performance-based decision.
If you are driving this all from web pages, then can't the first page load simply return <img> tags etc. that refer to server B? I.e. I request a page from server A, but the HTML stream includes resources served by server B. This is essentially your third option, but it will perhaps give higher perceived performance for the user, as the resources from the different servers can download in parallel (assuming there are several to be accessed).
From a design standpoint, I struggle to see why where a calculation is performed is relevant in this question. You control both servers, so the question becomes what 'contract' server B has to provide information to server A. Is there some reason that B cannot provide filtered results only and never the underlying raw data? Or, put another way, why should A have to process B's data? Whether the logic is in A or B doesn't matter when you control both ends.
If you do not control both servers, then the decision is more about who should bear the cost (build and support) of processing the raw data into results.
Summary/opinion: the third option if possible and suitable, otherwise the first, absent some overriding reason to go with the second.
I'm writing a database-style thing in C (i.e. it will store and operate on about 500,000 records). I'm going to be running it in a memory-constrained environment (VPS) so I don't want memory usage to balloon. I'm not going to be handling huge amounts of data - perhaps up to 200MB in total, but I want the memory footprint to remain in the region of 30MB (pulling these numbers out of the air).
My instinct is doing my own page handling (real databases do this), but I have received advice saying that I should just allocate it all and allow the OS to do the VM paging for me. My numbers will never rise above this order of magnitude. Which is the best choice in this case?
Assuming the second choice, at what point would it be sensible for a program to do its own paging? Obviously RDBMSes that can handle gigabytes must do this, but there must be a point along the scale at which the question is worth asking.
Thanks!
Use malloc until it's running. Then, and only then, start profiling. If you run into the same performance issues as the proprietary and mainstream "real databases", you will naturally begin to perform cache/page/alignment optimizations. These things can easily be slotted in after you have a working database, and they are orthogonal to getting it working in the first place.
The database management systems that perform their own paging also benefit from the investment of huge research efforts to make sure their paging algorithms function well under varying system and load conditions. Unless you have a similar set of resources at your disposal I'd recommend against taking that approach.
The OS paging system you have at your disposal has already benefited from the tuning efforts of many people.
There are, however, some things you can do to tune your OS to benefit database type access (large sequential I/O operations) vs. the typical desktop tuning (mix of seq. and random I/O).
In short, if you are a one man team or a small team, you probably should make use of existing tools rather than trying to roll your own in that particular area.
I have to develop a database for a unique environment. I don't have experience with database design and could use everybody's wisdom.
My group is designing a database for a piece of physics hardware and a data acquisition system. We need a system that will store all the hardware configuration parameters and track the changes to these parameters as they are changed by the user.
The setup:
We have nearly 200 detectors and roughly 40 parameters associated with each detector. Of these 40 parameters, we expect only a few to change during the course of the experiment. Most parameters associated with a single detector are static.
We collect data for this experiment in timed runs. During these runs, the parameters loaded into the hardware must not change, although we should be able to edit the database at any time to prepare for the next run. The current plan:
The database will provide the difference between the current parameters and the parameters used during the last run.
At the start of a new run, the most recent database changes will be loaded into the hardware.
The settings used for the upcoming run must be tagged with a run number and the current date and time. This is essential. I need a run-by-run history of the experimental setup.
There will be several different clients that both read and write to the database. Although changes to the database will be infrequent, I cannot guarantee that the changes won't happen concurrently.
Must be robust and non-corruptible. The configuration of the experimental system depends on the hardware. Any breakdown of the database would prevent data acquisition, and our time is expensive. Database backups?
My current plan is to implement the above requirements using an SQLite database, although I am unsure whether it can support all my requirements. Is there any other technology I should look into? Has anybody done something similar? I am willing to learn any technology, as long as it's mature.
Tips and advice are welcome.
Thank you,
Sean
Update 1:
Database access:
There are three lightweight applications that can write to and read from the database, and one application that can only read.
The applications with write access are responsible for setting a non-overlapping subset of the hardware parameters. To be specific, we have one application (of which there may be multiple copies) which sets the high voltage, one application which sets the remainder of the hardware parameters which may change during the experiment, and one GUI which sets the remainder of the parameters which are nearly static and are only essential for the proper reconstruction of the data.
The program with read access only is our data analysis software. It needs access to nearly all of the parameters in the database to properly format the incoming data into something we can analyze properly. The number of connections to the database should be >10.
Backups:
Another setup at our lab dumps an xml file every run. Even though I don't think xml is appropriate, I was planning to back up the system every run, just in case.
Some basic things about the design: make sure that you don't delete data from any tables. Keep track of the most recent data (probably best with a last-updated datetime column), but when a value changes, don't delete the old data. When a run is initiated, tag the records used in every table with the run ID (in another column); this way you maintain a full historical record of every setting and can pin down exactly what state was used for a given run.
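A hypothetical sketch of that shape, assuming SQLite (as you plan) accessed through JDBC; every table and column name here is invented and only illustrates the insert-only, run-tagged idea:

    import java.sql.*;

    public class ConfigSchema {
        // Parameter values are never updated in place: each change is a new row,
        // and each run records exactly which rows were loaded into the hardware.
        static final String[] DDL = {
            "CREATE TABLE IF NOT EXISTS detector (" +
            "  detector_id INTEGER PRIMARY KEY," +
            "  name        TEXT NOT NULL)",

            "CREATE TABLE IF NOT EXISTS parameter_value (" +
            "  value_id    INTEGER PRIMARY KEY," +
            "  detector_id INTEGER NOT NULL REFERENCES detector(detector_id)," +
            "  param_name  TEXT NOT NULL," +
            "  param_value TEXT NOT NULL," +
            "  changed_at  TEXT NOT NULL DEFAULT (datetime('now')))",

            "CREATE TABLE IF NOT EXISTS run_setting (" +
            "  run_id     INTEGER NOT NULL," +
            "  value_id   INTEGER NOT NULL REFERENCES parameter_value(value_id)," +
            "  started_at TEXT NOT NULL," +
            "  PRIMARY KEY (run_id, value_id))"
        };

        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:experiment.db");
                 Statement st = conn.createStatement()) {
                for (String ddl : DDL) st.execute(ddl);  // rows are only ever added to these tables
            }
        }
    }

The diff between two runs then falls out of a join on run_setting, and the "current" value of a parameter is simply the newest parameter_value row for that detector and name.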
Ask around among your colleagues.
You don't say what kind of physics you're doing, or how big the working group is, but in my discipline (particle physics) there is a deep repository of experience in putting up and running just this type of system (we call it "slow controls" and the like). There is a pretty good chance that someone you work with has either done this or knows someone who has. There may be a detailed description of the last time out in someone's thesis.
I don't personally have much to do with this, but I do know this: one common feature is a no-delete, no-overwrite design. You can only add data, never remove it. This preserves your chances of figuring out what really happened in case of trouble.
Perhaps I should explain a little more. While this is an important task and has to be done right, it is not really related to physics, so you can't look it up on SPIRES or arXiv.org. No one writes papers on the design and implementation of medium-sized slow-controls databases. But they do sometimes put it in their dissertations. The easiest way to find a pointer really is to ask a bunch of people around the lab.
This is not a particularly large database by the sounds of things. So you might be able to get away with using Oracle's free database which will give you all kinds of great flexibility with journaling (not sure if that is an actual word) and administration.
Your mentioning of 'non-corruptible' right after you say "There will be several different clients that both read and write to the database" raises a red flag for me. Are you planning on creating some sort of application that has an interface for this? Or were you planning on direct access to the db via a tool like TOAD?
In order to preserve your data integrity you will need to get really strict on your permissions. I would only allow one (and a backup) person to have admin rights with the ability to do the data manipulation outside the GUI (which will make your life easier).
Backups? Yes, absolutely! Not only should you do daily, weekly and monthly backups you should do full and incremental. Also, test your backup images often to confirm they are in fact working.
As for the data structure, I would need much greater detail on what you are trying to store and how you would access it. But from what you have put here I would say you need the following tables (to begin with):
Detectors
Parameters
Detector_Parameters
Some additional notes:
Since you will be making so many changes, I recommend using a version control system like SVN to keep track of all your DDL scripts etc. I would also recommend using something like Bugzilla for bug tracking (if needed) and Google Docs for team document management.
Hope that helps.
I need to give data to a data-processing windows service (one-way, loosely coupled). I want to ensure that the service being down etc. doesn't result in 'lost' data, that restarting the windows service simply causes it to pick up work where it left off, and I need the system to be really easy to troubleshoot, which is why I'm not using MSMQ.
So I came up with one of two solutions - either:
I drop text files with the processing data into a drop directory, and the windows service waits for file change notifications, then processes and deletes the file
or
I insert data in a special table in the local MS SQL database, and the windows service polls the database for changes/new items and then erases them as they are processed
The MSSQL database is local on the system, not over the network, but later on I may want to move it to a different server.
Which, from a performance (or other standpoint) is the better solution here?
From a performance perspective, it's likely the filesystem will be fastest - perhaps by a large margin.
However, there are other factors to consider.
It doesn't matter how fast it is, generally, only whether it's sufficiently fast. Storing and retrieving small blobs is a simple task and quite possibly this will never be your bottleneck.
NTFS is journalled, but only the metadata. If the server should crash mid-write, a file may contain gibberish. If you use a filesystem backend, you'll need to be robust against arbitrary data in the files. Depending on the caching layer and the way the file system reuses old space, that gibberish could contain segments of other messages, so you'd best be robust even against an old message being repeated.
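One way to get that robustness, as a sketch: have the producer prefix each drop file with the payload length and a CRC32, and have the service reject anything that doesn't check out. The class and file layout here are invented for the example, not an existing API:

    import java.io.*;
    import java.nio.file.*;
    import java.util.zip.CRC32;

    public class DropFileReader {
        /** Returns the payload, or null if the file is truncated or corrupt. */
        static byte[] readChecked(Path file) throws IOException {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(Files.newInputStream(file)))) {
                int length = in.readInt();            // declared payload length
                long expectedCrc = in.readLong();     // checksum written by the producer
                if (length < 0 || length > 16 * 1024 * 1024) {
                    return null;                      // implausible length: the header is gibberish
                }
                byte[] payload = new byte[length];
                in.readFully(payload);                // throws EOFException if the file was cut short
                CRC32 crc = new CRC32();
                crc.update(payload);
                return crc.getValue() == expectedCrc ? payload : null;
            } catch (EOFException truncated) {
                return null;                          // partial write: skip or quarantine the file
            }
        }
    }

Adding a unique message id to the header also lets you detect the repeated-old-message case mentioned above.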
If you ever want to add new features involving a richer message model, a database is more easily extended (say, some sort of caching layer).
The filesystem is more "open" - meaning it may be easier to debug with really simple tools (notepad), but also that you may encounter more tricky issues with local indexing services, virus scanners, poorly set permissions, or whatever else happens to live on the system.
Most APIs can't deal with files whose paths are longer than 260 characters, and they perform poorly when faced with huge numbers of files. If your storage directory ever becomes too large, things like .GetFiles() will become slow, whereas a DB can be indexed on the timestamp and the newest messages retrieved irrespective of old clutter. You can work around this, but it's an extra hurdle.
MS SQL isn't free and/or isn't installed on every system. There's a bit of extra system administration necessary for each new server and more patches when you use it. Particularly if your software should be trivially installable by third parties, the filesystem has an advantage.
I don't know what you're building, but don't prematurely optimize. Both solutions are quite similar in terms of performance, and it's likely not to matter, so pick whatever is easiest for you. If performance is ever really an issue, direct communication (whether via IPC or IP or whatnot) is going to be several orders of magnitude more performant, so don't waste time micro-optimizing.
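That said, if you do go the table route, a minimal sketch of the poll-and-delete loop looks like this (table and column names are invented; the point is that rows are deleted and committed only after processing, so a crash leaves them queued for the next poll):

    import java.sql.*;

    public class QueuePoller {
        // One polling pass: read pending rows oldest-first, delete each one after
        // it has been processed, and commit at the end of the batch.
        public static void pollOnce(Connection conn) throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement select = conn.prepareStatement(
                     "SELECT id, payload FROM work_queue ORDER BY created_at");
                 PreparedStatement delete = conn.prepareStatement(
                     "DELETE FROM work_queue WHERE id = ?")) {
                try (ResultSet rs = select.executeQuery()) {
                    while (rs.next()) {
                        process(rs.getString("payload"));
                        delete.setLong(1, rs.getLong("id"));
                        delete.executeUpdate();
                    }
                }
                conn.commit();   // nothing is removed until the whole batch succeeded
            } catch (SQLException | RuntimeException e) {
                conn.rollback(); // an error leaves the rows in place to be retried
                throw e;
            }
        }

        private static void process(String payload) {
            // placeholder for the real processing step
        }
    }

This gives at-least-once behaviour, so the processing step should tolerate seeing the same row twice after a crash.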
My experience with SQL Server 2005 and lower is that it's much slower with the database, especially with larger files; that really messes up SQL Server's memory when doing table scans.
However, the newer SQL Server 2008 has better file support in the engine.