H2 database performance strangeness --- or how to efficiently `count(*)` - database

The setup could not be simpler:
H2 version 1.3.176
One table, 10 columns of which two are a bit lengthy with 300 and 3500 characters a typical value length
Simple query: select count(*) from requestrepository where request_type = 'ADD'
Index is on the queried column.
Queried column is just varchar(20) (i.e. not one of the longer ones)
Queried column contains just two different values, with one appearing 200k times and the other appearing 12 million times.
DB runs off an SSD, current server hardware, current Java 8 (varied a bit but no change in result)
What I do: (0) run analyze, (1) delete one row by a key field, (2) insert one row for the key just deleted, (3) run the query cited above, count to 10 and repeat.
What I see: The query cited above takes between 3 and 5 seconds each time and explain analyze says:
SELECT
COUNT(*)
FROM PUBLIC.REQUESTREPOSITORY
/* PUBLIC.IX_REQUESTS: REQUEST_TYPE = 'ADD' */
/* scanCount: 12098748 */
WHERE REQUEST_TYPE = 'ADD'
/*
REQUESTREPOSITORY.IX_REQUESTS read: 126700
*/
I tried the same DB on different machines, hardware/linux/ssd, VM/Windows/netapp, but the tendency is always the same: the count(*) takes too(?) long.
And this is what I am not sure about. Is it to be expected that this takes long? I would have thought that at least for the second round, caches are filled and this should be much faster, but the explain analyze always lists 126700 reads.
Any hints about H2 parameters or settings how this may be improve are appreciated.
EDIT (not sure if this should rather go as an answer)
Meanwhile we tried a wide range of things, including mvstore, 1.4.x, parallel threads, computers with different disks, Linux, Windows. The situation is always the same. Take over 10 or 12 million rows, a varchar column with three status values, something like PROCESSING, ADD, DELETE, an index on the column and one status grossly overrepresented: Then something like count(*) where colname='ADD' takes between 1 and many seconds after each update of the table.
To prevent this from creating a problem, we finally fixed our own code, which did three count(*), one for each status, instead of one with a group by and was run every 5 seconds instead of just on demand. Certainly not the greatest design we had.
The only excuse I have is that it I am still surprised that a count(*) takes that long in such a setup. My hunch is that the count must be computed on the index by really counting after an update, whereas I expected that the count can be just read off the data structure somewhere. (No critique, I for myself would certainly not be able to implement a DB.)

Not sure about H2, but have you tried COUNT(request_type) instead of COUNT(*)?
SQL standard's COUNT(*) tends to take long time to compute, as it requires a full table scan to filter out rows that consist of NULL values only.
Using COUNT() on a single indexed column can speed things up. This way no table row need to be read, as the index is sufficient to decide whether the column's value is NULL.

Related

How to increase SQL Server select query performance?

I have a table with 102 columns and 43200 rows. Id column is an identity column and 2 columns have an unique index.
When I just execute
Select *
from MyTable
it takes almost 8 minutes+ over the network.
This table has a Status column which contains 1 or 0. If I select with where Status = 1, then I'm getting 31565 rows and the select is taking 6 minutes+. For your information status 1 completed and will not change ever anymore. But 0 status is working in progress and the rows are changing different columns value by different user stage.
When I select with Status = 0, it takes 1.43 minutes and returns 11568 rows.
How can I increase performance for completed and WIP status query separately or cumulatively? Can I somehow use caching?
The SQL server takes care of caching. At least as long as there is enough free RAM. When it take so long to get the data at first you need to find the bottleneck.
RAM: Is there enough to hold the full table? And is the SQL server configured to use it?
Is there an upper limit to RAM usage? If not SQL server assumes unlimited RAM and this will often end caching in page file, which causes massive slow downs
You said "8+ minutes through network". How long does it take on local execution? Maybe the network is slow
Hard drive: When the table is too big to be held in RAM it gets read from hard drive. HDDs are somewhat slow. Maybe defragmenting the indices could help here (at least somewhat)
If none helps, the SQL profiler might help to show you where the bottleneck actually is to find
This is an interesting question, but it's a little open-ended, more info is needed. I totally agree with allmhuran's comment that maybe you shouldn't be using "select * ..." for a large table. (It could in fact be posted as an answer, it deserves upvotes).
I suspect there may be design issues - Are you using BLOB's? Is the data at least partially normalized? ref https://en.wikipedia.org/wiki/Database_normalization
I Suggest create a non clustered index on "Status" Column. It improves your queries with Where Clause that uses this column.

SELECT with WHERE drops performance

In my application I use queries like
SELECT column1, column2
FROM table
WHERE myDate >= date1 AND myDate <= date2
My application crashes and returns a timeout exception. I copy the query and run it in SSMS. The results pane displays ~ 40 seconds of execution time. Then I remove the WHERE part of the query and run. This time, the returned rows appear immediately in the results table, although the query continues to print more rows (there are 5 million rows in the table).
My question is: how can a WHERE clause affect query performance?
Note: I don't change the CommandTimeOut property in the application. Left by default.
Without a WHERE clause, SQL Server is told to just start returning rows, so that's what it does, starting from the first row it can find efficiently (which may be the "first" row in the clustered index, or in a covering non-clustered index).
When you limit it with a where clause, SQL Server first has to go find those rows. That's what you're waiting on, because you don't have an index on myDate (or date1/date2, which I'm not sure are columns or variables), it needs to examine every single row.
Another way to look at it is to think of a phone book, which is an outdated analogy but gets the job done. Without a WHERE clause, it's like you're asking me to read you off all of the names and numbers in the book. If you add a WHERE clause that is not supported by an index, like read me off the names and numbers of every person with the first name 'John', it's going to take me a lot longer to start returning rows because I can't even start until I find the first John.
Or a slightly different analogy is to think of the index in a book. If you ask me to read off the page numbers for all the terms that are indexed, I can do that from the index, just starting from the beginning and reading through until the end. If you ask me to read off all the page numbers for all the terms that aren't in the index, or a specific unindexed term (like "the"), or even all the page numbers for indexed terms that contain the letter a, I'm going to have a much harder time.

Database design for IoT application

Our application shows near-real-time IoT data (up to 5 minute intervals) for our customers' remote equipment.
The original pilot project stores every device reading for all time, in a simple "Measurements" table on a SQL Server 2008 database.
The table looks something like this:
Measurements: (DeviceId, Property, Value, DateTime).
Within a year or two, there will be maybe 100,000 records in the table per device, with the queries typically falling into two categories:
"Device latest value" (95% of queries): looking at the latest value only
"Device daily snapshot" (5% of queries): looking at a single representative value for each day
We are now expanding to 5000 devices. The Measurements table is small now, but will quickly get to half a billion records or so, for just those 5000 devices.
The application is very read-intensive, with frequently-run queries looking at the "Device latest values" in particular.
[EDIT #1: To make it less opinion-based]
What database design techniques can we use to optimise for fast reads of the "latest" IoT values, given a big table with years worth of "historic" IoT values?
One suggestion from our team was to store MeasurementLatest and MeasurementHistory as two separate tables.
[EDIT #2: In response to feedback]
In our test database, seeded with 50 million records, and with the following index applied:
CREATE NONCLUSTERED INDEX [IX_Measurement_DeviceId_DateTime] ON Measurement (DeviceId ASC, DateTime DESC)
a typical "get device latest values" query (e.g. below) still takes more than 4,000 ms to execute, which is way too slow for our needs:
SELECT DeviceId, Property, Value, DateTime
FROM Measurements m
WHERE m.DateTime = (
SELECT MAX(DateTime)
FROM Measurements m2
WHERE m2.DeviceId = m.DeviceId)
This is a very broad question - and as such, it's unlikely you'll get a definitive answer.
However, I have been in a similar situation, and I'll run through my thinking and eventual approach. In summary though - I did option B but in a way to mirror option A: I used a filtered index to 'mimic' the separate smaller table.
My original thinking was to have two tables - one with the 'latest data only' for most reporting, then a table with all historical values. An alternate was to have two tables - one with all records, and one with just the latest.
When inserting a new row, it would typically need to therefore update at least two rows, if not more (depending on how it's stored).
Instead, I went for a slightly different route
Put all the data into one table
On that one table, add a new column 'Latest_Flag' (bit, NOT NULL, DEFAULT 1). If it's 1 then it's the latest value; otherwise it's historical
Have a filtered index on the table that has all columns (with appropriate column order) and filter of Latest_Flag = 1
This filtered index is similar to a second copy of the table with just the latest rows only
The insert process therefore has two steps in a transaction
'Unflag' the last Latest_Flag for that device, etc
Insert the new row
It still makes the writes a bit slower (as it needs to do several row updates as well as index updates) but fundamentally it does the pre-calculation for later reads.
When reading from the table, however, you need to then specify WHERE Latest_Flag = 1. Alternatively, you may want to put it into a view or similar.
For the filtered index, it may be something like
CREATE INDEX ix_measurements_deviceproperty_latest
ON Measurements (DeviceId, Property)
INCLUDE (Value, DateTime, Latest_Flag)
WHERE (Latest_Flag = 1)
Note - another version of this can be done in a trigger e.g., when inserting a new row, it invalidates (sets Latest_Flag = 0) any previous rows. It means you don't need to do the two-step inserts; but you do then rely on business/processing logic being within triggers.

Selecting rows between x and y from database

I've got a query which returns 30 rows. I'm writing code that will paginate those 30 rows into 5 records per page via an AJAX call.
Is there any reason to return just those 5 records up the presentation layer? Would there be any benefits in terms of speed or does it just get all the rows under the hood anyways?
If so, how do I actually do it in Sybase? I know Oracle has Rownum and MS Sql has something similar, but I can't seem to find a similar function in Sybase.
Unless your record length is huge, the difference between 5 and 30 rows should be completely unnoticeable to the user. In fact there's a significant potential the multiple DB calls will harm performance more than help. Just return all 30 rows either to your middle tier or your presentation, whatever makes more sense.
Some info here:
Selecting rows N to M without Oracle's rownum?
I've never worked with Sybase, but here's a link that explains how to do something similar:
http://www.dbforums.com/sybase/1616373-sybases-rownum-function.html
Since the solution involves a temp table, you can also use it for pagination. On your initial query, put the 30 rows into a temporary table, and add a column for page number (the first five rows would be page 1, the next five page 2 and so on). On subsequent page requests, you query the temp table by page number.
Not sure how you go about cleaning up the temp table, though. Perhaps when the user's session times out?
For 30 records, it's probably not even worth bothering with pagination at all.
I think in sybase you can use
select top 5 * from table
where order-by-field > (last record of previous calls order-by-field)
order by order-by-field
just make sure you use the same order by each time.
As for benefit I guess it depends on how many rows we are talking and how big the table is etc.
I agree completely with jmgant, however, if you want to do it anyway, the process goes something like this:
Select top 10 items and store in X
Select top 5 items and store in Y
X-Y
This entire process can happen in 1 SQL statement.

TSQL "LIKE" or Regular Expressions?

I have a bunch (750K) of records in one table that I have to see they're in another table. The second table has millions of records, and the data is something like this:
Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits
Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.
Currently I'm running a .net app that loads the source records, then runs through and searches on a like (using a tsql function) to determine if there are any records. If yes, the source table is updated with a positive. If not, the record is left alone.
the app processes about 1000 records an hour. When I did this as a cursor sproc on sql server, I pretty much got the same speed.
Any ideas if regular expressions or any other methodology would make it go faster?
What about doing it all in the DB, rather than pulling records into your .Net app:
UPDATE source_table s SET some_field = true WHERE EXISTS
(
SELECT target_join_field FROM target_table t
WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)
This will reduce the total number of queries from 750k update queries down to 1 update.
First I would redesign if at all possible. Better to add a column that contains the correct value and be able to join on it. If you still need the long one. you can use a trigger to extract the data into the column at the time it is inserted.
If you have data you can match on you need neither like '%somestuff%' which can't use indexes or a cursor both of which are performance killers. This should bea set-based task if you have designed properly. If the design is bad and can't be changed to a good design, I see no good way to get good performance using t-SQl and I would attempt the regular expression route. Not knowing how many different prharses and the structure of each, I cannot say if the regular expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.
BTW if you are working with tables that large, I would resolve to never write another cursor. They are extremely bad for performance especially when you start taking about that size of record. Learn to think in sets not record by record processing.
One thing to be aware of with using a single update (mbeckish's answer) is that the transaction log (enabling a rollback if the query becomes cancelled) will be huge. This will drastically slow down your query. As such it is probably better to proces them in blocks of 1,000 rows or such like.
Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.
The gut feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, not the ones you're interested in.
First Idea
For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be just to strip off any text from the last '(' onwards. (This assumes all the values follow that pattern) This would simplify the query condition to (b.field like '%' + a.field)
Still, an index wouldn't help here either though as the important characters are at the end. So, bizarrely, it could well be worth while storing the characters of both tables in reverse order. The index on you temporary table would then come in to use.
It may seem very wastefull to spent that much time, but in this case a small benefit would yield a greate reward. (A few hours work to halve the comparisons from 750billion to 375billion, for example. And if you can get the index in to play you could reduce this a thousand fold thanks to index being tree searches, not just ordered tables...)
Second Idea
Assuming you do copy the target table into a temp table, you may benefit extra from processing them in blocks of 1000 by also deleting the matching records from the target table. (This would only be worthwhile where you delete a meaningful amount from the target table. Such that after all 750,000 records have been checked, the target table is now [for example] half the size that it started at.)
EDIT:
Modified Second Idea
Put the whole target table in to a temp table.
Pre-process the values as much as possible to make the string comparison faster, or even bring indexes in to play.
Loop through each record from the source table one at a time. Use the following logic in your loop...
DELETE target WHERE field LIKE '%' + #source_field + '%'
IF (##row_count = 0)
[no matches]
ELSE
[matches]
The continuous deleting makes the query faster on each loop, and you're only using one query on the data (instead of one to find matches, and a second to delete the matches)
Try this --
update SourceTable
set ContainsBit = 1
from SourceTable t1
join (select TargetField from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
First thing is to make sure you have an index for that column on the searched table. Second is to do the LIKE without a % sign on the left side. Check the execution plan to see if you are not doing a table scan on every row.
As le dorfier correctly pointed out, there is little hope if you are using a UDF.
There are lots of ways to skin the cat - I would think that first it would be important to know if this is a one-time operation, or a regular task that needs to be completed regularly.
Not knowing all the details of you problem, if it was me, at this was a one-time (or infrequent operation, which it sounds like it is), I'd probably extract out just the pertinent fields from the two tables including the primary key from the source table and export them down to a local machine as text files. The files sizes will likely be significantly smaller than the full tables in your database.
I'd run it locally on a fast machine using a routine written in something like 'C'/C++ or another "lightweight" language that has raw processing power, and write out a table of primary keys that "match", which I would then load back into the sql server and use it as a basis of an update query (i.e. update source table where id in select id from temp table).
You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in sql.
By the sounds of you sql, you may be trying to do 750,000 table scans against a multi-million records table.
Tell us more about the problem.
Holy smoke, what great responses!
system is on disconnected network, so I can't copy paste, but here's the retype
Current UDF:
Create function CountInTrim
(#caseno varchar255)
returns int
as
Begin
declare #reccount int
select #reccount = count(recId) from targettable where title like '%' + #caseNo +'%'
return #reccount
end
Basically, if there's a record count, then there's a match, and the .net app updates the record. The cursor based sproc had the same logic.
Also, this is a one time process, determining which entries in a legacy record/case management system migrated successfully into the new system, so I can't redesign anything. Of course, developers of either system are no longer available, and while I have some sql experience, I am by no means an expert.
I parsed the case numbers from the crazy way the old system had to make the source table, and that's the only thing in common with the new system, the case number format. I COULD attempt to parse out the case number in the new system, then run matches against the two sets, but with a possible set of data like:
DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)
Parsing that became a bit more complex so I thought I'd keep it simpler.
I'm going to try the single update statement, then fall back to regex in the clr if needed.
I'll update the results. And, since I've already processed more than half the records, that should help.
Try either Dan R's update query from above:
update SourceTable
set ContainsBit = 1
from SourceTable t1
join (select TargetField
from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
Alternatively, if the timeliness of this is important and this is sql 2005 or later, then this would be a classic use for a calculated column using SQL CLR code with Regular Expressions - no need for a standalone app.

Resources