When I am searching for rows satisfying a certain condition:
SELECT something FROM table WHERE type = 5;
Is the difference in execution time linear when I run this query on a table containing 10K rows versus one containing 10M rows?
In other words - is this kind of query roughly 1000 times faster on a 10K-row table than on a 10M-row table?
My table has a column type containing numbers from 1 to 10, and the query above will be the most frequent one against this table. If performance really degrades like that, I will have to create 10 tables, one per type, to get better performance. If it is not really an issue, I will keep two tables - one for the types, and a second one for the data with a type_id column.
EDIT:
There are multiple rows with the same type value.
(Answer originally tagged postgresql and this answer is in those terms. Other DBMSes will vary.)
Like with most super broad questions, "it depends".
If there's no index present, then time is probably roughly linear, though with a nearly fixed startup cost plus some breakpoints, e.g. the point where the table stops fitting in RAM. All sorts of effects can come into play - memory banking and NUMA, disk readahead, parallelism in the underlying disk subsystem, fragmentation on the file system, MVCC bloat in the tables, etc - that make this far from simple.
If there's a b-tree index on the attribute in question, time is going to increase at a less than linear rate - probably around O(log n). How much less will vary based on whether the index fits in RAM, whether the table fits in RAM, etc. However, PostgreSQL usually then has to do a heap lookup for each index pointer, which adds random I/O cost rather unpredictably depending on the data distribution/clustering, caching and readahead, etc. If vacuum runs often enough, it might be able to do an index-only scan instead, in which case this secondary lookup is avoided.
So ... in extremely simplified terms, no index = O(n), with index ~= O(log n). Very, very approximately.
I think the underlying intent of the question is along the lines of: Is it faster to have 1000 tables of 1000 rows, or 1 table of 1,000,000 rows?. If so: In the great majority of cases the single bigger table will be the better choice for performance and administration.
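For the concrete case in the original question, a rough sketch of what "one big table plus an index" usually means in practice (names mirror the question; nothing beyond that is assumed): keep the single table and index the column you filter on.

-- A plain b-tree index on the filtered column keeps the lookup roughly O(log n)
CREATE INDEX idx_table_type ON "table" (type);

-- Optionally, a partial index if one type value dominates the workload
CREATE INDEX idx_table_type_5 ON "table" (type) WHERE type = 5;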
Related
I'm working with a few large row count tables with a commonly named column (table_id). I now intend to add an index to this column in each table. One or two of these tables use up 10x more space than the others, but let's say for simplicity that all tables have the same row count. Basically, the 10x added space is because some tables have many more columns than others. I have two inter-related questions:
I'm curious about whether overall indexing time is a function of the table's disk usage or just the row count?
Additionally, would duplicate values in a column speed up the indexing time at all, or would it actually slow down indexing?
If indexing time is only dependent on row count, then all my tables should get indexed at the same speed. I'm sure someone could do benchmark tests to answer all these questions, but my disks are currently tied up indexing those tables.
The speed of indexing depends on the following factors:
The row count: this is one of the factors that most affects indexing speed.
The column type (int, text, bigint, json) is also one of the factors that influences indexing.
Duplicate data mainly affects index size, not indexing speed; it may have a very slight effect on speed. So if a column has a lot of duplicate data, the index on that column is smaller.
Indexing speed can also be affected by the disk, for example: if you create the index in a different tablespace, and that tablespace is configured to sit on another disk, the build speed depends on that disk. If the disk on which the index is created is an SSD while the other is a regular HDD, then of course index creation will be faster.
Also, PostgreSQL server configuration has memory-usage and other parameters that can affect index creation speed; if parameters such as the maintenance memory are set high, indexing speed will increase.
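As a rough illustration of the tablespace point above (the path and the tablespace/index names are hypothetical), the index can be placed on a faster disk like this:

-- Tablespace on an SSD mount point (path is hypothetical)
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pgdata';

-- Build the index in that tablespace so the creation I/O goes to the faster disk
CREATE INDEX idx_big_table_id ON big_table (table_id) TABLESPACE fast_ssd;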
The speed of CREATE INDEX depends on several factors:
the kind of index
the number of rows
the speed of your disks
the setting of maintenance_work_mem and max_parallel_maintenance_workers
The effort to sort the data grows with O(n * log(n)), where n is the number of rows. Reading and writing should grow linearly with the number of rows.
I am not certain, but I'd say that duplicate rows should slow down indexing a little bit from v13 on, where B-tree index deduplication was introduced.
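A small sketch of the settings mentioned above (values and object names are illustrative only; tune them to your hardware):

-- More sort memory and more parallel workers generally speed up CREATE INDEX
SET maintenance_work_mem = '2GB';
SET max_parallel_maintenance_workers = 4;

CREATE INDEX idx_big_table_id ON big_table (table_id);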
I have been searching for good information about index benchmarking on PostgreSQL and have found nothing really good.
I need to understand how PostgreSQL behaves while handling a huge amount of records.
Let's say 2000M records on a single non-partitioned table.
Theoretically, B-trees are O(log(n)) for reads and writes, but in practice I think that's kind of an ideal scenario that doesn't consider things like HUGE indexes not fitting entirely in memory (swapping?) and maybe other things I am not seeing at this point.
There are no JOIN operations, which is fine, but note this is not an analytical database, and response times below 150ms (the less the better) are required. All searches are expected to be done using indexes, of course. We have 2-3 indexes:
UUID PK index
timestamp index
VARCHAR(20) index (non unique but high cardinality)
My concern is how writes and reads will perform once the table reaches its expected top capacity (2500M records)
... so specific questions might be:
May "adding more iron" achieve reasonable performance in such scenario?
NOTE this is non-clustered DB so this is vertical scaling.
What would be the main sources of time consumption either for reads and writes?
What would be the amount of records on a table that we can consider "too much" for this standard setup on PostgreSql (no cluster, no partitions, no sharding)?
I know this amount of records suggests taking some alternative (sharding, partitioning, etc) but this question is about learning and understanding PostgreSQL capabilities more than other thing.
There should be no performance degradation inserting into or selecting from a table, even if it is large. The expense of index access grows with the logarithm of the table size, but the base of the logarithm is large, and the depth of the index cannot be more than 5 or 6. The upper levels of the index are probably cached, so you end up with a handful of blocks read when accessing a single table row per index scan. Note that you don't have to cache the whole index.
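One way to see this for yourself (table name and the looked-up value below are purely illustrative) is the Buffers line of an EXPLAIN: a single-row index lookup on a huge table should still touch only a handful of blocks.

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   big_table
WHERE  id = '3f2b6d1e-0c4a-4e8f-9a7b-1d2c3e4f5a6b';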
I am unable to understand the role equi-depth histograms play in query optimization. Can someone please give me some pointers to good resources, or could someone explain? I have read a few research papers, but I still could not convince myself of the need for and use of equi-depth histograms. So, can someone please explain equi-depth histograms with an example?
Also can we merge the buckets of the histograms so that the histogram becomes small enough and fits in 1 page on disk?
Also what are bucket boundaries in equi-depth histograms?
Caveat: I'm not an expert on database internals, so this is a general, not a specific answer.
Query compilers convert the query, usually given in SQL, to a plan for obtaining the result. Plans consist of low level "instructions" to the database engine: scan table T looking for value V in column C; use index X on table T to locate value V; etc.
Query optimization is about the compiler deciding which of a (potentially huge) set of alternative query plans have minimum cost. Costs include wall clock time, IO bandwidth, intermediate result storage space, CPU time, etc. Conceptually, the optimizer is searching the alternative plan space, evaluating the cost of each to guide the search, ultimately choosing the cheapest it can find.
The costs mentioned above depend on estimates of how many records will be read and/or written, whether the records can be located by indexes, what columns of those records will be used, and the size of the data and/or how many disk pages they occupy.
These quantities in turn often depend on the exact data values stored in the tables. Consider for example select * from data where pay > 100 where pay is an indexed column. If the pay column has no values over 100, then the query is extremely cheap. A single probe of the index answers it. Conversely the result set could contain the entire table.
This is where histograms help. (Equi-depth histograms are just one way of maintaining histograms.) In the preceding query a histogram will, in O(1) time, provide an estimate of the fraction of rows that will be produced by the query without knowing exactly what those rows will contain.
In effect, the optimizer is "executing" the query on an abstraction of the data. The histogram is that abstraction. (Others are possible.) The histogram is useful for estimating costs and result sizes for query plan operations: join result size and page hits during mass insertions and deletions (which may lead to the generation of a temporary index), for example.
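As one concrete instance (not required by the general argument above): PostgreSQL keeps an equi-depth histogram per column in the pg_stats view, where each bucket between consecutive histogram_bounds entries covers roughly the same number of rows. Using the data/pay example from the earlier query:

SELECT n_distinct, most_common_vals, histogram_bounds
FROM   pg_stats
WHERE  tablename = 'data' AND attname = 'pay';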
For a simple inner join example, suppose we know how integer-valued join columns of two tables are distributed:
Bins (25% each)

Table A     Table B
0-100       151-300
101-150     301-500
151-175     601-700
176-300     1001-1100
It's easy to see that 50% of Table A and 25% of Table B reflect the possible participation. If these are unique-valued columns, then a useful join size estimate is max(.5 * |A|, .25 * |B|). This is a very simple example. In many (most?) cases, the analysis requires much more mathematical sophistication. For joins, it's usual to compute an estimated histogram of the results by "joining" the histograms of the operands. This is what makes the literature so diverse, complicated, and interesting.
PhD dissertations often have surveys that cover big bodies of technical literature like this in a concise form that isn't too difficult to read. (After all, the candidate is trying to convince a committee he/she knows how to do a literature search.) Here is one such example.
We are designing a table for ad-hoc analysis that will capture umpteen value fields over time for claims received. The table structure is essentially (pseudo-ish-code):
table_huge (
claim_key int not null,
valuation_date_key int not null,
value_1 some_number_type,
value_2 some_number_type,
[etc...],
constraint pk_huge primary key (claim_key, valuation_date_key)
);
All value fields are numeric. The requirements are: The table shall capture a minimum of 12 recent years (hopefully more) of incepted claims. Each claim shall have a valuation date for each month-end occurring between claim inception and the current date. Typical claim inception volumes range from 50k-100k per year.
Adding all this up I project a table with a row count on the order of 100 million, and could grow to as much as 500 million over years depending on the business's needs. The table will be rebuilt each month. Consumers will select only. Other than a monthly refresh, no updates, inserts or deletes will occur.
I am coming at this from the business (consumer) side, but I have an interest in mitigating the IT cost while preserving the analytical value of this table. We are not overwhelmingly concerned about quick returns from the Table, but will occasionally need to throw a couple dozen queries at it and get all results in a day or three.
For argument's sake, let's assume the technology stack is, I dunno, in the 80th percentile of modern hardware.
The questions I have are:
Is there a point at which the cost-to-benefit of indices becomes excessive, considering a low frequency of queries against high-volume tables?
Does the SO community have experience with 100M+ row tables and can it offer tips on how to manage them?
Do I leave the database technology problem to IT to solve, or should I seriously consider curbing the business requirements (and why)?
I know these are somewhat soft questions, and I hope readers appreciate this is not a proposition I can test before building.
Please let me know if any clarifications are needed. Thanks for reading!
First of all: Expect this to "just work" if you leave the tech problem to IT - especially if your budget allows for an "80% current" hardware level.
I do have experience with 200M+ rows in MySQL on entry-level and outdated hardware, and I was always positively surprised.
Some Hints:
On the monthly refresh, load the table without non-primary indices, then create them afterwards (see the sketch after this list). Look for the sweet spot in how many index creations work best in parallel. In a project with much less data (ca. 10M rows) this reduced load time by 70% compared to the naive "create table, then load data" approach.
Try to get a grip on the number and complexity of concurrent queries: this influences your hardware decisions (less concurrency = less IO, more CPU).
Assuming you have 20 numeric fields of 64 bits each, times 200M rows: if I calculate correctly, this is a payload of 32 GB. Trade cheap disks against 64 GB of RAM and never ever have an IO bottleneck.
Make sure you set the tablespace to read-only.
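A minimal sketch of the load-then-index pattern from the first hint (PostgreSQL COPY is used here purely as an example; the file path and index names are assumptions, while the table and column names come from the question):

-- Monthly refresh: clear the table, bulk-load it, then build secondary indexes
TRUNCATE table_huge;

COPY table_huge FROM '/staging/claims_monthly.csv' WITH (FORMAT csv);

-- Non-primary indexes are created only after the load, as the hint suggests
CREATE INDEX idx_huge_valuation ON table_huge (valuation_date_key);
CREATE INDEX idx_huge_claim ON table_huge (claim_key);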
You could consider anchor modeling approach to store changes only.
Considering that there are so many expected repeated rows (~95%), bringing the row count from 100M down to only 5M removes most of your concerns. At that point it is mostly a cache consideration: if the whole table can somehow fit into cache, things happen fairly fast.
For "low" data volumes, the following structure is slower to query than a plain table; at one point (as data volume grows) it becomes faster. That point depends on several factors, but it may be easy to test. Take a look at this white-paper about anchor modeling -- see graphs on page 10.
In anchor-modeling terms, it is equivalent to an anchor for the claim valuation with one attribute per value column.
The modeling tool has automatic code generation, but it seems that it currently fully supports only MS SQL Server, though there is Oracle in the drop-down too. It can still be used as a code helper.
In terms of supporting code, you will need (at minimum):
Latest perspective view (auto-generated)
Point in time function (auto-generated)
Staging table from which this structure will be loaded (see tutorial for data-warehouse-loading)
Loading function, from staging table to the structure
Pruning functions for each attribute, to remove any repeating values
It is easy to create all this by following auto-generated-code patterns.
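As a very rough sketch of what one anchor plus one historized attribute could look like in plain SQL (all names and types here are assumptions, not output of the tool):

-- The anchor: one row per claim
CREATE TABLE claim_anchor (
    claim_id int PRIMARY KEY
);

-- One attribute table per value column; a row is stored only when the value changes
CREATE TABLE claim_value_1 (
    claim_id       int     NOT NULL REFERENCES claim_anchor (claim_id),
    valuation_date date    NOT NULL,
    value_1        numeric NOT NULL,
    PRIMARY KEY (claim_id, valuation_date)
);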
With no ongoing updates/inserts, an index NEVER has negative performance consequences, only positive (by MANY orders of magnitude for tables of this size).
More critically, the schema is seriously flawed. What you want is
Claim
    claim_key
    valuation_date

ClaimValue
    claim_key (fk -> Claim.claim_key)
    value_key
    value
This is much more space-efficient as it stores only the values you actually have, and does not require schema changes when the number of values for a single row exceeds the number of columns you have allocated.
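One possible reading of that layout in DDL form (types are assumed, and the valuation date is carried into ClaimValue so the foreign key has a unique target; the outline above leaves that detail open):

CREATE TABLE Claim (
    claim_key      int  NOT NULL,
    valuation_date date NOT NULL,
    PRIMARY KEY (claim_key, valuation_date)
);

CREATE TABLE ClaimValue (
    claim_key      int     NOT NULL,
    valuation_date date    NOT NULL,
    value_key      int     NOT NULL,
    value          numeric NOT NULL,
    PRIMARY KEY (claim_key, valuation_date, value_key),
    FOREIGN KEY (claim_key, valuation_date) REFERENCES Claim (claim_key, valuation_date)
);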
Using partitioning, and applying the partition key in every query you run, will give you significant performance improvements.
In our company we solved a huge number of performance issues with partitioning.
One more design suggestion: if we know the table is going to be very big, try not to apply too many constraints to it (handle that logic before you insert), and don't put too many columns on the table, to avoid row-chaining issues.
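A hedged sketch of the idea using PostgreSQL declarative partitioning (names, types and ranges are illustrative; other databases have equivalent syntax):

-- Partition by the valuation month key; queries that filter on it touch only one partition
CREATE TABLE claim_values (
    claim_key          int NOT NULL,
    valuation_date_key int NOT NULL,
    value_1            numeric
) PARTITION BY RANGE (valuation_date_key);

CREATE TABLE claim_values_2024 PARTITION OF claim_values
    FOR VALUES FROM (20240101) TO (20250101);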
I have a large table (more than 10 million records). This table is heavily used for searches in the application, so I had to create indexes on it. However, I experience slow performance when a record is inserted or updated in the table, most likely because of the recalculation of the indexes.
Is there a way to improve this?
Thanks in advance
You could try reducing the number of page splits (in your indexes) by reducing the fill factor on your indexes. By default, the fill factor is 0 (same as 100 percent). So, when you rebuild your indexes, the pages are completely filled. This works great for tables that are not modified (insert/update/delete). However, when data is modified, the indexes need to change. With a Fill Factor of 0, you are guaranteed to get page splits.
By modifying the fill factor, you should get better performance on inserts and updates because the page won't ALWAYS have to split. I recommend you rebuild your indexes with a Fill Factor = 90. This will leave 10% of the page empty, which would cause less page splits and therefore less I/O. It's possible that 90 is not the optimal value to use, so there may be some 'trial and error' involved here.
By using a different value for fill factor, your select queries may become slightly slower, but with a 90% fill factor, you probably won't notice it too much.
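For example, in SQL Server the rebuild would look roughly like this (index and table names are placeholders):

-- Leave 10% free space per page to absorb inserts/updates without page splits
ALTER INDEX IX_Search_Main ON dbo.SearchTable
    REBUILD WITH (FILLFACTOR = 90);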
There are a number of solutions you could choose from:
a) You could partition the table
b) Consider performing updates in batches at off-peak hours (like at night)
c) Since engineering is a balancing act of trade-offs, you have to decide which operation is more important (select or update/insert/delete). Assuming you don't need the results of an insert in real time, you can use SQL Server Service Broker to perform the "less important" operations asynchronously
http://msdn.microsoft.com/en-us/library/ms166043(SQL.90).aspx
We'd need to see your indexes, but likely yes.
Some things to keep in mind are that you don't want to just put an index on every column, and you don't generally want just one column per index.
The best thing to do is if you can log some actual searches, track what users are actually searching for, and create indexes targeted at those specific types of searches.
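For instance, if the logged searches turn out to mostly filter on a customer and a date range, a single composite index covering that pattern is usually better than two single-column indexes (all names here are hypothetical):

CREATE INDEX IX_SearchTable_Customer_Date
    ON dbo.SearchTable (customer_id, search_date);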
This is a classic engineering trade-off... you can make the shovel lighter or stronger, never both... (until a breakthrough in materials science.)
More indexes mean more DML maintenance, but faster queries.
Fewer indexes mean less DML maintenance, but slower queries.
It could be possible that some of your indexes are redundant and could be combined.
Besides what Joel wrote, you also need to define the SLA for DML. Is it OK that it's slower? You noticed that it slowed down, but does it really matter vs. the query improvement you've achieved? IOW, is it OK to have a light shovel that's that weak?
If you have a clustered index that is not on an identity numeric field, rearranging the pages can slow you down. If this is the case for you, see if you can improve speed by making it a non-clustered index (faster than no index, but it tends to be a bit slower than a clustered index, so your selects may slow down but inserts improve).
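A rough SQL Server sketch of that change (constraint, table and column names are placeholders): the key is kept, but it no longer dictates the physical row order, so inserts stop rearranging data pages.

-- Drop the existing clustered primary key and recreate it as nonclustered
ALTER TABLE dbo.SearchTable DROP CONSTRAINT PK_SearchTable;

ALTER TABLE dbo.SearchTable
    ADD CONSTRAINT PK_SearchTable PRIMARY KEY NONCLUSTERED (search_id);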
I have found users more willing to tolerate a slightly slower insert or update than a slower select. That is a general rule, but if it becomes unacceptably slow (or, worse, times out), no one can deal with that well.