I'm trying to estimate a few things. For example I have a specific function that executes after another and I was wondering if it's CPU bound so it'd be ok to move it farther away from the function or if its cache/memory bound meaning I shouldn't move it farther and I may want to split some work.
Some things I want to know is written below. My question is what events might I want to look at in order to estimate the ratio of memory to cpu stalls? What gotchas may I want to know about for the suggested events? I'm reading through perf_event_open and it isn't easy to understand what the data is measuring. Here's my starting point
Memory VS CPU bound (is there an easy way to know?)
Backend stalls (PERF_COUNT_HW_STALLED_CYCLES_BACKEND, does this report per cpu or process??)
Unique L1 data lines accessed (Is this available/possible to get?)
Instruction count (PERF_COUNT_HW_INSTRUCTIONS)
Cycle count (PERF_COUNT_HW_CPU_CYCLES or __rdtscp)
Related
I have some performance tests for an index structure on some data. I will be comparing 2 indexes side-by-side (still not decided if I will be using 2 VMs). I require results to be as neutral as possible of course, so I have these kinds of questions which I would appreciate any input about... How can I ensure/control what is influencing the test? For example, caching effects/order of arrival from one test to another will influence the result. How can I measure these influences? How do I create a suitable warm-up? Or what kind of statistical techniques can I use to nullify such influences (I don't think just averages is enough)?
Before you start:
Make sure your tables and indices have just been freshly created and populated. This avoids issues with regard to fragmentation. Otherwise, if the data in one test is heavily fragmented, and the other is not, you might not be comparing apples to apples.
Make sure your tables are properly ANALYZEd. This makes sure that the query planner has proper statistics in all cases.
If you just want a comparison, and not a test under realistic use, I'd just do:
Cold-start your (virtual) machine. Wait a reasonable but fixed time (let's say 5 min, or whatever is reasonable for your system) so that all startup processes have taken place and do not interfere with the DB execution.
Perform test with index1, and measure time (this is timing where you don't have anything cached by either the database nor the OS).
If you're interested in results when there are cache effects: Perform test again 10 times (or any number of times as big as reasonable). Measure each time, to account for variability due to other processes running on the VM, and other contingencies.
Reboot your machine, and repeat the whole process for test2. There are methods to clean the OS cache; but they're very system dependent, and you don't have a way to clean the database cache. Check See and clear Postgres caches/buffers?.
If you are really (or mostly) interested in performance when there are no cache effects, you should perform the whole process several times. It's slow and tedious. If you're only interested in the case where there's (most probably) a cache effect, you don't need to restart again.
Perform an ANOVA (or any other statistical hypothesis test you might think more suited) to decide if your average time is statistically different or not.
You can see an example of performing several tests in the answer to a question about NOT NULL versus CHECK(xx NOT NULL).
As neutral as possible, then create two databases on the same instance of your database management system, then create the same tablespaces with data, using indexes on one instance but not the other.
The challenge with a VM is you have arbitrated access to your disk resources ( unless you have each VM pinned to a specific interface and disk set ). Because of this, your arbitration model could vary from one test to the next. The most neutral course, which removes the arbitration, is on physical hardware....and the same hardware in both cases.
I have a problem where I am going to have a bunch of nbodies - the movements of each is predescribed by existing data, however when a body is in the range of another one certain properties about it change. For the sake of this question we'll just assume you have a counter per body that counts the time you were around other bodies. So basically you start with t = 0, you spend 5 seconds around body 2, so your t is now 5. I am wondering what's the best way I should go about this, I don't have the data yet, but I was just wondering if it's appropriate for me to explore something like CUDA/OpenCL or should I stick with optimizing this across a multi-core cpu machine. Because the amount of data that this will be simulated across is about 500 bodies, which each have movements described down to the second over a 30 day period, so that's 43200 points of data per body.
Brute force nbody is definitely suited to GPUs, because it is "embarrassingly parallel". Each body-to-body interaction computation is completely independent of any other. Your variation that includes keeping track of time spent in the "presence" of other bodies would be a straightforward addition to the existing body-to-body force computation, since everything is done on a timestep basis anyway.
Here's some sample CUDA code for nbody.
Not sure whether there isn't a DBS that does and whether this is indeed a useful feature, but:
There are a lot of suggestions on how to speed up DB operations by tuning buffer sizes. One example is importing Open Street Map data (the planet file) into a Postgres instance. There is a tool called osm2pgsql (http://wiki.openstreetmap.org/wiki/Osm2pgsql) for this purpose and also a guide that suggests to adapt specific buffer parameters for this purpose.
In the final step of the import, the database is creating indexes and (according to my understanding when reading the docs) would benefit from a huge maintenance_work_mem whereas during normal operation, this wouldn't be too useful.
This thread http://www.mail-archive.com/pgsql-general#postgresql.org/msg119245.html in the contrary suggests a large maintenance_work_mem would not make too much sense during final index creation.
Ideally (imo), the DBS should know best what buffers size combination it could profit most given a limited size of total buffer memory.
So, are there some good reasons why there isn't a built-in heuristic that is able to adapt the buffer sizes automatically according to the current task?
The problem is the same as with any forecasting software. Just because something happened historically doesn't mean it will happen again. Also, you need to complete a task in order to fully analyze how you should have done it more efficient. Problem is that the next task is not necessarily anything like the previously completed task. So if your import routine needed 8gb of memory to complete, would it make sense to assign each read-only user 8gb of memory? The other way around wouldn't work well either.
In leaving this decision to humans, the database will exhibit performance characteristics that aren't optimal for all cases, but in return, let's us (the humans) optimize each case individually (if like to).
Another important aspect is that most people/companies value reliable and stable levels over varying but potentially better levels. Having a high cost isn't as big a deal as having large variations in cost. This is of course not true all the times as entire companies are based around the fact the once in a while hit that 1%.
Modern databases already make some effort into adapting itself to the tasks presented, such as increasingly more sofisticated query optimizers. At least Oracle have the option to keep track of some of the measures that are influencing the optimizer decisions (cost of single block read which will vary with the current load).
My guess would be it is awfully hard to get the knobs right by adaptive means. First you will have to query the machine for a lot of unknowns like how much RAM it has available - but also the unknown "what do you expect to run on the machine in addition".
Barring that, by setting a max_mem_usage parameter only, the problem is how to make a system which
adapts well to most typical loads.
Don't have odd pathological problems with some loads.
is somewhat comprehensible code without error.
For postgresql however the answer could also be
Nobody wrote it yet because other stuff is seen as more important.
You didn't write it yet.
This is just an out of curiosity question. Let's say you have a database table with 1m rows in it, and you want to often do queries like looking for either male or female, US or non-US, voter or non-voter etc, it's clearly very efficient to define a bitmap index for the table in which each bit represents one either-or condition.
However, to execute the query, you still have to scan through (probably) all of the index doing a bitand to select matching rows.
My question is is there some kind of bitmap-optimized storage such that the bit 'channels' are pre-created in the hardware? I'm envisaging something similar to knitting needles lifting punched cards out of an old library catalog system. In other words, rather than going row by row through memory locations, the chip can just pull out the matching rows electronically because there are hardware connections for each bit channel? I've a feeling the brain must work something like this. If I think of 'all blue objects', and then restrict that to 'all long blue objects' and then 'all long blue heavy objects', my brain does it effortlessly and I'm sure it's not scanning through all the objects I know about every time. It seems like perhaps there is some neurons that provide pathways for different dimensions for quick retrieval. I'm just wondering if there's anything like this in the hardware world?
Thanks!
Why invent something that's already there?
Content-addressable memory
You could certainly wire up some logic to perform this (e.g. using programmable logic devices) but you'll need a large number of logic elements and connections, making such circuits probably expensive to build for large databases.
For example, one would have to build matching logic (is this bit being selected on ? what is the required value ?) into each 'row' giving you one signal (selected/not selected) per row.
You would then have a logic circuit with one million output lines (telling you which records were selected) which you probably at some point have to 'serialize' anyway, e.g. when you interface with the PCI bus inside a computer (i.e. first transmit the result for record 0 then 1 etc. or transmit the numbers of the selected records).
As bitwise operations in modern CPUs are fast (should only take one clock cycle for logicl operations such as bitwise and, or and 'xor') you're probably not gaining much using such a custom circuit compared to optimized software (not mentioning the 'hardware' development and testing effort) unless you have a very special use case.
Usually when I'm creating indexes on tables, I generally guess what the Fill Factor should be based on an educated guess of how the table will be used (many reads or many writes).
Is there a more scientific way to determine a more accurate Fill Factor value?
You could try running a big list of realistic operations and looking at IO queues for the different actions.
There are a lot of variables that govern it, such as the size of each row and the number of writes vs reads.
Basically: high fill factor = quicker read, low = quicker write.
However it's not quite that simple, as almost all writes will be to a subset of rows that need to be looked up first.
For instance: set a fill factor to 10% and each single-row update will take 10 times as long to find the row it's changing, even though a page split would then be very unlikely.
Generally you see fill factors 70% (very high write) to 95% (very high read).
It's a bit of an art form.
I find that a good way of thinking of fill factors is as pages in an address book - the more tightly you pack the addresses the harder it is to change them, but the slimmer the book. I think I explained it better on my blog.
I would tend to be of the opinion that if you're after performance improvements, your time is much better spent elsewhere, tweaking your schema, optimising your queries and ensuring good index coverage. Fill factor is one of those things that you only need to worry about when you know that everything else in your system is optimal. I don't know anyone that can say that.