I'm not sure whether there is a DBMS that already does this, or whether it is indeed a useful feature, but:
There are a lot of suggestions on how to speed up DB operations by tuning buffer sizes. One example is importing OpenStreetMap data (the planet file) into a Postgres instance. There is a tool called osm2pgsql (http://wiki.openstreetmap.org/wiki/Osm2pgsql) for this purpose, and also a guide that suggests adapting specific buffer parameters for this task.
In the final step of the import, the database creates indexes and (as I understand the docs) would benefit from a huge maintenance_work_mem, whereas during normal operation this wouldn't be very useful.
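For illustration, the kind of per-session tuning such guides suggest might look roughly like this (the connection details and the table/index names are made up; only the maintenance_work_mem setting is the point):

```python
import psycopg2  # assumed Postgres driver; connection details are placeholders

conn = psycopg2.connect(dbname="osm", user="osm")
cur = conn.cursor()

# Raise maintenance_work_mem for this session only, so the large index build
# gets the extra memory without changing the server-wide default.
cur.execute("SET maintenance_work_mem = '2GB'")
cur.execute("CREATE INDEX planet_osm_point_osm_id_idx ON planet_osm_point (osm_id)")
conn.commit()
```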
This thread http://www.mail-archive.com/pgsql-general#postgresql.org/msg119245.html suggests, on the contrary, that a large maintenance_work_mem would not make much sense during final index creation.
Ideally (IMO), the DBMS should know best which combination of buffer sizes it would profit from most, given a limited amount of total buffer memory.
So, are there some good reasons why there isn't a built-in heuristic that is able to adapt the buffer sizes automatically according to the current task?
The problem is the same as with any forecasting software: just because something happened historically doesn't mean it will happen again. Also, you need to complete a task before you can fully analyze how you should have done it more efficiently. The problem is that the next task is not necessarily anything like the previously completed one. So if your import routine needed 8 GB of memory to complete, would it make sense to assign each read-only user 8 GB of memory? The other way around wouldn't work well either.
By leaving this decision to humans, the database will exhibit performance characteristics that aren't optimal for all cases, but in return it lets us (the humans) optimize each case individually (if we care to).
Another important aspect is that most people/companies value reliable and stable performance levels over varying but potentially better ones. Having a high cost isn't as big a deal as having large variations in cost. This is, of course, not always true; entire companies are built around occasionally hitting that 1%.
Modern databases already make some effort to adapt themselves to the tasks presented to them, for example through increasingly sophisticated query optimizers. Oracle, at least, has the option to keep track of some of the measures that influence the optimizer's decisions (e.g. the cost of a single-block read, which varies with the current load).
My guess would be that it is awfully hard to get the knobs right by adaptive means. First you would have to query the machine for a lot of unknowns, like how much RAM it has available - but also the unknown "what else do you expect to run on this machine?".
Barring that, even if you only set a max_mem_usage parameter, the problem is how to build a system that
adapts well to most typical loads,
doesn't have odd pathological problems with some loads, and
is somewhat comprehensible code without errors.
For PostgreSQL, however, the answer could also be:
Nobody has written it yet because other things are seen as more important.
You haven't written it yet.
I have some performance tests for an index structure on some data. I will be comparing two indexes side by side (I still haven't decided whether I will use two VMs). I need the results to be as neutral as possible, of course, so I have these kinds of questions, which I would appreciate any input on: How can I ensure/control what is influencing the test? For example, caching effects and the order of arrival from one test to another will influence the results. How can I measure these influences? How do I create a suitable warm-up? And what kind of statistical techniques can I use to nullify such influences (I don't think just taking averages is enough)?
Before you start:
Make sure your tables and indices have just been freshly created and populated. This avoids issues with regard to fragmentation. Otherwise, if the data in one test is heavily fragmented, and the other is not, you might not be comparing apples to apples.
Make sure your tables are properly ANALYZEd. This makes sure that the query planner has proper statistics in all cases.
If you just want a comparison, and not a test under realistic use, I'd just do:
Cold-start your (virtual) machine. Wait a reasonable but fixed time (let's say 5 min, or whatever is reasonable for your system) so that all startup processes have taken place and do not interfere with the DB execution.
Perform the test with index1 and measure the time (this is the timing where nothing is cached yet by either the database or the OS).
If you're interested in results when there are cache effects: perform the test again 10 times (or any reasonably large number of times). Measure each run, to account for variability due to other processes running on the VM and other contingencies.
Reboot your machine and repeat the whole process for test2. There are methods to clear the OS cache, but they're very system dependent, and you don't have a way to clear the database cache. See the question "See and clear Postgres caches/buffers?".
If you are really (or mostly) interested in performance when there are no cache effects, you should perform the whole process several times. It's slow and tedious. If you're only interested in the case where there's (most probably) a cache effect, you don't need to restart again.
Perform an ANOVA (or any other statistical hypothesis test you might think more suited) to decide if your average time is statistically different or not.
You can see an example of performing several tests in the answer to a question about NOT NULL versus CHECK(xx NOT NULL).
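As a rough sketch of the measure-and-test step (the driver, connection details and test query are assumptions - adapt them to your setup):

```python
import time
import statistics

import psycopg2            # assumed Postgres driver; use whatever fits your DBMS
from scipy import stats    # for the hypothesis test

QUERY = "SELECT count(*) FROM test_table"   # hypothetical query under test

def run_timed(conn, query, repetitions=10):
    """Run the query repeatedly and return a list of wall-clock timings in seconds."""
    cur = conn.cursor()
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        cur.execute(query)
        cur.fetchall()
        timings.append(time.perf_counter() - start)
    return timings

# One list of timings per index variant, collected after the cold-start procedure above.
timings_index1 = run_timed(psycopg2.connect(dbname="test_index1"), QUERY)
timings_index2 = run_timed(psycopg2.connect(dbname="test_index2"), QUERY)

print(statistics.mean(timings_index1), statistics.mean(timings_index2))

# One-way ANOVA on the two groups (with two groups this is equivalent to a t-test);
# a small p-value means the difference in mean runtime is statistically significant.
f_stat, p_value = stats.f_oneway(timings_index1, timings_index2)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```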
To be as neutral as possible: create two databases on the same instance of your database management system, then create the same tablespaces with the same data, using indexes in one database but not in the other.
The challenge with a VM is that you have arbitrated access to your disk resources (unless you have each VM pinned to a specific interface and disk set). Because of this, your arbitration model could vary from one test to the next. The most neutral course, which removes the arbitration, is physical hardware - and the same hardware in both cases.
How would a single BLOB column in SQL Server compare (performance-wise) to ~20 REAL columns (20 x 32-bit floats)?
I remember Martin Fowler recommending BLOBs for persisting large object graphs (in Patterns of Enterprise Application Architecture) to remove multiple joins from queries, but does it make sense to do something like this for a table with 20 fixed columns (which are never used in queries)?
This table is updated really often, around 100 times per second, and the INSERT statements get rather large with all the columns specified in the query.
I presume the first answer is going to be "profile it yourself", but I'd like to know if someone already has experience with this stuff.
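For concreteness, here is a minimal sketch of what the BLOB variant could look like on the client side (the values are made up): 20 x 32-bit floats pack into an 80-byte value that could be bound to a single VARBINARY(80) parameter instead of 20 REAL parameters.

```python
import struct

# 20 sample readings as 32-bit floats (illustrative values only).
values = [i * 0.5 for i in range(20)]

# Pack them into one 80-byte blob: '<20f' = 20 little-endian 4-byte floats.
blob = struct.pack("<20f", *values)
assert len(blob) == 80

# Reading the row back is the mirror operation.
restored = struct.unpack("<20f", blob)
```

The trade-off, as the answers below note, is that the packed values are opaque to SQL and any change to the layout means rewriting existing rows.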
Typically you should not, unless you have found that this is critical to meeting your performance requirements.
If you store it in one blob, you need to recalculate your whole database whenever you make any change to the object structure (like adding or removing a column). If you keep multiple columns, your future database refactorings and deployments will be much easier.
I can't fully speak to the performance of the SELECT - you'll need to test that - but I highly doubt it will cause any performance issues there, because you wouldn't be reading any more data than before. In regard to the INSERT, however, you should see a performance gain (of what size I'm unsure), because there will likely not be any statistics to update. Of course, that depends on a lot of settings; I'm just throwing my opinion out there. This question is pretty subjective, and not nearly enough information is available to truly tell you whether you will see performance issues from the change.
Now, in practice I'm going to say: leave it be unless you're seeing real performance issues. And if you are seeing real performance issues, analyze those before choosing this type of solution; there are probably other ways to fix them.
I'm currently in the preparatory phase for a project which will involve (amongst other things) writing lots of data to a database, very fast (i.e. images (and associated meta-data) from 6 cameras, recording 40+ times a second).
Searching around the web, it seems that 'Big Data' more often refers to a higher rate of much smaller 'bits' (e.g. market data).
So...
Is there a more scientific way to proceed than "try it and see what happens"?
Is "just throw hardware at it" the best approach?
Is there some technology/white papers/search term that I ought to check out?
Is there a compelling reason to consider some other database (or just saving to disk)?
Sorry, this is a fairly open-ended question (maybe better for Programmers?)
Is there a more scientific way to proceed than "try it and see what happens"?
No, given your requirements are very unusual.
Is "just throw hardware at it" the best approach?
No, but at some point it is the only approach. You won't get a 400-horsepower racing engine just by tuning a Fiat Panda, and you won't get high throughput out of any database without appropriate hardware.
Is there some technology/white papers/search term that I ought to check out?
Not a valid question in the context of this question - you ask specifically about SQL Server.
Is there a compelling reason to consider some other database (or just saving to disk)?
No. As long as you stick with a relational database, pretty much the same rules apply - another one may be faster, but not by a wide margin.
Your main problem will be disc IO and network bandwidth, depending on the size of the images. Size the equipment properly and you should be fine. In the end this works out to fewer than 300 images per second. Are you sure you want the images themselves in the database? I normally like that, but this is like storing a movie as individual pictures, and that may be stretching it.
Whatever you do, that is a lot of disc IO and storage, so hardware is the only way to go if you need the IOPS etc.
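As a back-of-the-envelope check of those numbers (the per-image size is purely an assumption - plug in your real figure):

```python
# 6 cameras recording 40+ frames per second.
cameras = 6
fps = 40
images_per_second = cameras * fps                  # 240 images/s, i.e. "fewer than 300"

assumed_image_size_mb = 1.0                        # assumption: ~1 MB per image

write_rate_mb_s = images_per_second * assumed_image_size_mb       # ~240 MB/s sustained
daily_volume_tb = write_rate_mb_s * 86_400 / 1_000_000            # ~20.7 TB per day

print(f"{images_per_second} images/s, ~{write_rate_mb_s:.0f} MB/s, ~{daily_volume_tb:.1f} TB/day")
```

Numbers like these make it clear why sizing the disks and the network, rather than the choice of DBMS, dominates here.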
What I want to create is a huge index over an even bigger collection of data. The data is a huge collection of images (and I mean millions of photos!) and I want to build an index on all unique images.
So I calculate a hash value of every image and combine it with the width, height and file size of the image. This gives a practically unique key for every image, which would be stored together with the location of the image, or locations in case of duplicates.
Technically speaking, this would fit perfectly in a single database table. A unique index on file name plus an additional non-unique index on hash-width-height-size would be enough. However, I could use an existing database system to solve this, or just write my own, optimized version. It will be a single-user application anyway, and the main purpose is to detect when I add a duplicate image to the collection, so it can warn me that I already have it and show the locations of the other copies. I can then decide to add the duplicate anyway or to discard it.
I've written hash-table implementations before and it's not that difficult once you know what you have to be aware of. So I could just implement my own file format for this data. It's unlikely that I'll ever need to add more information to these images and I'm not interested in similar images, just exact images. I'm not storing the original images in this file either, just the hash, size and location.
From experience, I know this could run extremely fast. I've done it before, and I've been doing similar things for nearly three decades, so it's likely that I will choose this solution.
But I do wonder: doing the same with an existing database system like SQL Server, Oracle, InterBase or MySQL, would performance still be high enough? There would be about 750 TB of images indexed in this database, which roughly translates to around 30 million records in a single, small table. Is it even worth considering the use of a regular database?
I have doubts about the usefulness of a database for this project. The amount of data is huge, yet the structure is really simple. I don't need multi-user support or most of the other features that databases provide, so I don't see a need for one. But I'm interested in the opinions of other programmers about this. (Although I expect most will agree with me here.)
The project itself, which is still just an idea in my head, is supposed to be some tool or add-on for Explorer or whatever. Basically, it builds an index for any external hard disk that I attach to the system, and when I copy an image to this disk somewhere, it's supposed to tell me if the image already exists on this disk. That will allow me to avoid filling up my backup disks with duplicates, although I sometimes do want to add duplicates (e.g. because they're part of a series). Since I like to create my own rendered artwork I have plenty of images. Plus, I've been taking pictures with digital cameras since 1996, so I also have a huge collection of photos. Add some other large collections to this and you'll soon realise that the amount of data will be huge. (And yes, there are already plenty of duplicates in my collection...)
Since it's a single-user application that you are considering, I'd probably have a look at SQLite. It ought to fit your other requirements rather nicely, I'd say.
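As a rough sketch of what that could look like with SQLite (the schema, the SHA-256 choice and the function shape are assumptions, not a prescription):

```python
import hashlib
import os
import sqlite3

conn = sqlite3.connect("image_index.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS images (
        path   TEXT PRIMARY KEY,     -- unique index on file name
        hash   TEXT NOT NULL,
        width  INTEGER,
        height INTEGER,
        size   INTEGER
    )
""")
# Non-unique index on hash-width-height-size, as described in the question.
conn.execute("CREATE INDEX IF NOT EXISTS idx_dedup ON images (hash, width, height, size)")

def register(path, width, height):
    """Record an image and return the paths of any already-known copies.

    width/height are supplied by the caller (e.g. read with an imaging library).
    """
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    size = os.path.getsize(path)
    duplicates = [row[0] for row in conn.execute(
        "SELECT path FROM images WHERE hash=? AND width=? AND height=? AND size=?",
        (digest, width, height, size))]
    conn.execute(
        "INSERT OR REPLACE INTO images (path, hash, width, height, size) VALUES (?, ?, ?, ?, ?)",
        (path, digest, width, height, size))
    conn.commit()
    return duplicates
```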
I just tested the performance of PostgreSQL on my laptop (Core 2 Duo T5800 2.0 GHz, 3.0 GiB RAM). I have a table with slightly more than 100M records, 5 columns and some indexes. I performed a range query on one indexed column (not the primary key) and returned all columns. An average query returned 75 rows and executed in 750 ms. You have to decide whether that is fast enough.
I would avoid DIY-ing it unless you know all the repercussions of what you're doing.
Transactional consistency, for example, is not trivial.
I would suggest designing your code in such a way that the backend can easily be replaced later, then running with something sane (SQLite is a good starting choice), developing it in the most sane and rational way possible, and then trying to slot in the alternative backing store.
Then profile the differences, and run regression tests against it to make sure your database is not worse than SQLite.
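A minimal sketch of that "replaceable backend" idea (the interface and class names are made up; the point is only that SQLite and a hand-rolled store sit behind the same two methods):

```python
from abc import ABC, abstractmethod
from typing import List

class ImageIndexBackend(ABC):
    """Storage interface the rest of the application codes against."""

    @abstractmethod
    def add(self, path: str, key: bytes) -> None:
        """Record an image under its dedup key."""

    @abstractmethod
    def find_duplicates(self, key: bytes) -> List[str]:
        """Return the paths already stored under the same key."""

class InMemoryBackend(ImageIndexBackend):
    """Trivial reference implementation; a SQLite-backed or custom
    file-format backend would implement the same two methods."""

    def __init__(self):
        self._index = {}

    def add(self, path, key):
        self._index.setdefault(key, []).append(path)

    def find_duplicates(self, key):
        return list(self._index.get(key, []))
```

With that in place, swapping SQLite for the hand-rolled format later is a one-class change, and both can be run against the same regression tests.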
Existing database solutions tend to win because they've had years of improvement and fine-tuning to get their benefits, and a naïve attempt will likely be slower, buggier, and do less, all the while increasing your development load to purely MONUMENTAL proportions.
http://fetter.org/optimization.html
The first rule of Optimization is, you do not talk about Optimization.
The second rule of Optimization is, you DO NOT talk about Optimization.
If your app is running faster than the underlying transport protocol, the optimization is over.
One factor at a time.
No marketroids, no marketroid schedules.
Testing will go on as long as it has to.
If this is your first night at Optimization Club, you have to write a test case.
Also, with databases, there is one thing you utterly MUST get ingrained.
Speed is unimportant
Your data being there when you need it, that is important.
When you have the assuredness that your data will always be there, then you may worry about trivial concerns like speed.
Hashes
You also mention that you'll be using image SHAs/MD5s etc. to deduplicate images. This is a fallacious notion of its own: hashes of files can only tell you that files are different, not that they're the same.
The logic is akin to asking 30 people to flip a coin; you see the first one get heads, and then decide to delete every other person who gets heads, because they're obviously the same person.
https://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer
Although you may think it unlikely that you'd have two different files with the same hash, your odds are about as good as winning the lotto. The chances of you winning the lotto are low, but somebody wins the lotto every day. Don't let it be you.
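A sketch of the usual compromise - chunked hashing as a cheap pre-filter, then a byte-for-byte comparison to confirm (the helper names are made up):

```python
import filecmp
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large images never need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def are_duplicates(path_a, path_b):
    """Hashes only prove files are different; equal hashes are confirmed byte-for-byte."""
    if file_digest(path_a) != file_digest(path_b):
        return False
    return filecmp.cmp(path_a, path_b, shallow=False)
```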
Usually when I'm creating indexes on tables, I pick the Fill Factor based on an educated guess about how the table will be used (many reads or many writes).
Is there a more scientific way to determine a more accurate Fill Factor value?
You could try running a big list of realistic operations and looking at IO queues for the different actions.
There are a lot of variables that govern it, such as the size of each row and the number of writes vs reads.
Basically: a high fill factor = quicker reads, a low one = quicker writes.
However, it's not quite that simple, as almost all writes will be to a subset of rows that need to be looked up first.
For instance: set a fill factor of 10% and each single-row update will take 10 times as long to find the row it's changing, even though a page split would then be very unlikely.
Generally you see fill factors from 70% (very write-heavy) to 95% (very read-heavy).
It's a bit of an art form.
I find that a good way of thinking of fill factors is as pages in an address book - the more tightly you pack the addresses the harder it is to change them, but the slimmer the book. I think I explained it better on my blog.
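For reference, here is roughly how an explicit fill factor is applied when rebuilding an index (the connection string and the index/table names are placeholders, not a recommendation for this value):

```python
import pyodbc  # assumed SQL Server ODBC driver; connection string is a placeholder

conn = pyodbc.connect("DSN=mydb;Trusted_Connection=yes")
cursor = conn.cursor()

# FILLFACTOR = 90 packs leaf pages to ~90%, leaving ~10% free space for future
# inserts/updates and thereby reducing page splits on a write-heavy table.
cursor.execute("""
    ALTER INDEX IX_Orders_CustomerId ON dbo.Orders
    REBUILD WITH (FILLFACTOR = 90);
""")
conn.commit()
```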
I tend to be of the opinion that if you're after performance improvements, your time is much better spent elsewhere: tweaking your schema, optimising your queries and ensuring good index coverage. Fill factor is one of those things that you only need to worry about once you know that everything else in your system is optimal - and I don't know anyone who can say that.