I have a table with 12,000 records and a query where this table is joined with a few other tables, plus paging. I measure time using SET STATISTICS TIME ON/OFF. The first pages are very fast, but the closer I get to the last page, the more time it takes. Is this normal?
This is normal because SQL Server has no way to directly seek to a given page of a logical query. It scans through a stream of results until it has arrived at the page you wanted.
If you want constant-time paging you need to provide some kind of seek key on an index. For example, if you can guarantee that your ID int column has consecutive values starting at 1, you can get any page in constant time simply by saying WHERE ID >= ... AND ID < ....
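For illustration, here's a rough sketch of that idea in T-SQL; the Orders table, its columns, and the page size of 50 are all made-up placeholders, and it only works if ID really is gap-free:

DECLARE @PageNumber int = 200;
DECLARE @PageSize int = 50;

SELECT ID, Column1, Column2
FROM Orders
WHERE ID >= (@PageNumber - 1) * @PageSize + 1
  AND ID <  @PageNumber * @PageSize + 1
ORDER BY ID;

Because the WHERE clause can seek directly on the ID index, page 200 costs roughly the same as page 1.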
I'm sure you'll find other approaches on the web but there's nothing built into the product.
In my application I use queries like
SELECT column1, column2
FROM table
WHERE myDate >= date1 AND myDate <= date2
My application crashes and returns a timeout exception. I copy the query and run it in SSMS. The results pane displays ~ 40 seconds of execution time. Then I remove the WHERE part of the query and run. This time, the returned rows appear immediately in the results table, although the query continues to print more rows (there are 5 million rows in the table).
My question is: how can a WHERE clause affect query performance?
Note: I don't change the CommandTimeOut property in the application; it's left at the default.
Without a WHERE clause, SQL Server is told to just start returning rows, so that's what it does, starting from the first row it can find efficiently (which may be the "first" row in the clustered index, or in a covering non-clustered index).
When you limit it with a WHERE clause, SQL Server first has to go find those rows, and that's what you're waiting on. Because you don't have an index on myDate (or date1/date2, which I'm not sure are columns or variables), it needs to examine every single row.
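As a sketch only (the index name is made up; the table and column names are taken from your query), an index like this would let SQL Server seek straight to the requested date range instead of examining every row:

CREATE NONCLUSTERED INDEX IX_table_myDate
ON [table] (myDate)
INCLUDE (column1, column2);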
Another way to look at it is to think of a phone book, which is an outdated analogy but gets the job done. Without a WHERE clause, it's like you're asking me to read you off all of the names and numbers in the book. If you add a WHERE clause that is not supported by an index, like read me off the names and numbers of every person with the first name 'John', it's going to take me a lot longer to start returning rows because I can't even start until I find the first John.
Or a slightly different analogy is to think of the index in a book. If you ask me to read off the page numbers for all the terms that are indexed, I can do that from the index, just starting from the beginning and reading through until the end. If you ask me to read off all the page numbers for all the terms that aren't in the index, or a specific unindexed term (like "the"), or even all the page numbers for indexed terms that contain the letter a, I'm going to have a much harder time.
Our application shows near-real-time IoT data (up to 5 minute intervals) for our customers' remote equipment.
The original pilot project stores every device reading for all time, in a simple "Measurements" table on a SQL Server 2008 database.
The table looks something like this:
Measurements: (DeviceId, Property, Value, DateTime).
Within a year or two, there will be maybe 100,000 records in the table per device, with the queries typically falling into two categories:
"Device latest value" (95% of queries): looking at the latest value only
"Device daily snapshot" (5% of queries): looking at a single representative value for each day
We are now expanding to 5000 devices. The Measurements table is small now, but will quickly get to half a billion records or so, for just those 5000 devices.
The application is very read-intensive, with frequently-run queries looking at the "Device latest values" in particular.
[EDIT #1: To make it less opinion-based]
What database design techniques can we use to optimise for fast reads of the "latest" IoT values, given a big table with years worth of "historic" IoT values?
One suggestion from our team was to store MeasurementLatest and MeasurementHistory as two separate tables.
[EDIT #2: In response to feedback]
In our test database, seeded with 50 million records, and with the following index applied:
CREATE NONCLUSTERED INDEX [IX_Measurement_DeviceId_DateTime] ON Measurement (DeviceId ASC, DateTime DESC)
a typical "get device latest values" query (e.g. below) still takes more than 4,000 ms to execute, which is way too slow for our needs:
SELECT DeviceId, Property, Value, DateTime
FROM Measurements m
WHERE m.DateTime = (
SELECT MAX(DateTime)
FROM Measurements m2
WHERE m2.DeviceId = m.DeviceId)
This is a very broad question - and as such, it's unlikely you'll get a definitive answer.
However, I have been in a similar situation, and I'll run through my thinking and eventual approach. In summary, though: I kept everything in one table, but used a filtered index to 'mimic' the separate, smaller 'latest values' table.
My original thinking was to have two tables - one with the 'latest data only' for most reporting, then a table with all historical values. An alternate was to have two tables - one with all records, and one with just the latest.
Either way, inserting a new reading would therefore typically need to update at least two rows, if not more (depending on how it's stored).
Instead, I went for a slightly different route
Put all the data into one table
On that one table, add a new column 'Latest_Flag' (bit, NOT NULL, DEFAULT 1). If it's 1 then it's the latest value; otherwise it's historical
Have a filtered index on the table that has all columns (with appropriate column order) and filter of Latest_Flag = 1
This filtered index is similar to a second copy of the table with just the latest rows only
The insert process therefore has two steps in a transaction (a rough sketch follows below)
'Unflag' the last Latest_Flag for that device, etc
Insert the new row
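As a rough sketch (T-SQL, with made-up parameter names; the columns come from the question), the two steps might look like this:

BEGIN TRANSACTION;

    -- Step 1: unflag the previous 'latest' row for this device/property
    UPDATE Measurements
    SET Latest_Flag = 0
    WHERE DeviceId = @DeviceId
      AND Property = @Property
      AND Latest_Flag = 1;

    -- Step 2: insert the new reading as the latest value
    INSERT INTO Measurements (DeviceId, Property, Value, DateTime, Latest_Flag)
    VALUES (@DeviceId, @Property, @Value, @DateTime, 1);

COMMIT TRANSACTION;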
It still makes the writes a bit slower (as it needs to do several row updates as well as index updates) but fundamentally it does the pre-calculation for later reads.
When reading from the table, however, you then need to specify WHERE Latest_Flag = 1. Alternatively, you may want to put that into a view or similar.
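For example, a hypothetical view (the name is made up) so readers don't have to remember the flag:

CREATE VIEW MeasurementsLatest
AS
SELECT DeviceId, Property, Value, DateTime
FROM Measurements
WHERE Latest_Flag = 1;

The 'device latest values' query then becomes a plain SELECT ... FROM MeasurementsLatest WHERE DeviceId = @DeviceId.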
For the filtered index, it may be something like
CREATE INDEX ix_measurements_deviceproperty_latest
ON Measurements (DeviceId, Property)
INCLUDE (Value, DateTime, Latest_Flag)
WHERE (Latest_Flag = 1)
Note - another version of this can be done with a trigger: when inserting a new row, the trigger invalidates (sets Latest_Flag = 0) any previous rows. It means you don't need to do the two-step inserts, but you do then rely on business/processing logic living inside triggers.
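If you do go the trigger route, a very rough sketch might look like the following (the trigger name is made up, and it assumes readings always arrive in chronological order; multi-row inserts and late-arriving data would need more care):

CREATE TRIGGER trg_Measurements_UnflagPrevious
ON Measurements
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Unflag older rows for every device/property combination just inserted
    UPDATE m
    SET m.Latest_Flag = 0
    FROM Measurements m
    JOIN inserted i
      ON i.DeviceId = m.DeviceId
     AND i.Property = m.Property
    WHERE m.Latest_Flag = 1
      AND m.DateTime < i.DateTime;
END;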
I have to fetch n records from the database. The number of records varies from manager to manager: one manager may have 100+ records while another may have only 50.
If it were just fetching data from a single table, it would be super easy.
In my case, the main pain point is that I only get my result after many joins, temp tables, functions, maths on some columns, date filters using case expressions, and more; and yes, each table has 100k+ records with proper indexing.
I have added pagination on the UI side so that I show only 20 records at a time on screen. When a page number is clicked, I offset the records accordingly to get the next 20. For example, if page 3 is clicked, the database should return only records 41-60.
The UI part is not a big deal; the point is how to optimise the query so that each call fetches only 20 records.
My current implementation calls the same procedure every time with an index value to offset the data. Is it the correct way to run the same complex query, with all its functions, CTEs, case expressions in filters, and inner/left joins, again and again just to fetch one small piece of the recordset?
I have one simple query that selects many columns (more than 1,000).
When I run it with a single column, it gives me the result in 2 seconds with a proper index seek; logical reads, CPU, and everything else are under thresholds.
But when I select more than 1,000 columns, it takes 11 minutes to return the result and the plan shows a key lookup.
Have you folks faced this type of issue?
Any suggestions?
Normally, I would suggest adding those columns to the INCLUDE list of your non-clustered index. Adding them to the INCLUDE removes the lookup from the execution plan. But as with everything in SQL Server, it depends. Depending on how the table is used, i.e. if you're updating the table more than just plain SELECTing from it, then the lookup might be OK.
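For example, a hedged sketch with made-up table and column names (in your case, INCLUDE-ing anywhere near 1,000 columns would effectively duplicate the table, so this really only makes sense for a handful of columns):

CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_Covering
ON Orders (CustomerId)
INCLUDE (OrderDate, Status, Total);  -- the non-key columns the query selects, so no key lookup is needed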
If this query is run once per year, the overhead of an additional index is probably not worth it. If you need a quick response time for that single time of the year when it needs to be run, look into 'pre-executing' it and just presenting the result to the user.
The difference in your query plan might be because of join elimination (if your query contains JOINs with multiple tables) or just that the additional columns you are requesting do not exist in your currently existing indexes...
I'm new to pagination, so I'm not sure I fully understand how it works. But here's what I want to do.
Basically, I'm creating a search engine of sorts that generates results from a database (MySQL). These results are merged together algorithmically, and then returned to the user.
My question is this: When the results are merged on the backend, do I need to create a temporary view with the results that is then used by the PHP pagination? Or do I create a table? I don't want a bunch of views and/or tables floating around for each and every query. Also, if I do use temporary tables, when are they destroyed? What if the user hits the "Back" button on his/her browser?
I hope this makes sense. Please ask for clarification if you don't understand. I've provided a little bit more information below.
MORE EXPLANATION: The database contains English words and phrases, each of which is mapped to a concept (Example: "apple" is 0.67 semantically-related to the concept of "cooking"). The user can enter in a bunch of keywords, and find the closest matching concept to each of those keywords. So I am mathematically combining the raw relational scores to find a ranked list of the most semantically-related concepts for the set of words the user enters. So it's not as simple as building a SQL query like "SELECT * FROM words WHERE blah blah..."
It depends on your database engine (i.e. what kind of SQL), but nearly every SQL flavor has support for paginating a query.
For example, MySQL has LIMIT and MS SQL has ROW_NUMBER.
So you build your SQL as usual, and then you just add the database engine-specific pagination stuff and the server automatically returns only, say, row 10 to 20 of the query result.
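For example, in MySQL (the results table, the score column, and the page size of 10 are just placeholders):

SELECT *
FROM results
ORDER BY score DESC
LIMIT 10, 10;   -- skip the first 10 rows, return rows 11 to 20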
EDIT:
So the final query (which selects the data that is returned to the user) selects data from some tables (temporary or not), as I expected.
It's a SELECT query, which you can page with LIMIT in MySQL.
Your description sounds to me as if the actual calculation is way more resource-hogging than the final query which returns the results to the user.
So I would do the following:
get the individual results tables for the entered words, and save them in a table in a way that lets you get the data for this specific query later (for example, with an additional column like SessionID or QueryID). No pagination here.
query these result tables again for the final query that is returned to the user.
Here you can do paging by using LIMIT.
So you have to do the actual calculation (the resource-hogging queries) only once when the user "starts" the query. Then you can return paginated results to the user by just selecting from the already populated results table.
EDIT 2:
I just saw that you accepted my answer, but still, here's more detail about my usage of "temporary" tables.
Of course this is only one possible way to do it. If the expected result is not too large, returning the whole resultset to the client, keeping it in memory and doing the paging client side (as you suggested) is possible as well.
But if we are talking about real huge amounts of data of which the user will only view a few (think Google search results), and/or low bandwidth, then you only want to transfer as little data as possible to the client.
That's what I was thinking about when I wrote this answer.
So: I don't mean a "real" temporary table, I'm talking about a "normal" table used for saving temporary data.
I'm way more proficient in MS SQL than in MySQL, so I don't know much about temp tables in MySQL.
I can tell you how I would do it in MS SQL, but maybe there's a better way to do this in MySQL that I don't know.
When I have to page a resource-intensive query, I want to do the actual calculation once, save it in a table, and then query that table several times from the client (to avoid doing the calculation again for each page).
The problem is: in MS SQL, a temp table only exists in the scope of the query where it is created.
So I can't use a temp table for that because it would be gone when I want to query it the second time.
So I use "real" tables for things like that.
I'm not sure whether I understood your algorithm example correctly, so I'll simplify the example a bit. I hope that I can make my point clear anyway:
This is the table (roughly MySQL syntax, just to show the concept):
create table AlgorithmTempTable
(
    QueryID char(36),   -- the GUID, stored as text (MySQL has no guid type)
    Rank float,         -- note: RANK is a reserved word in MySQL 8+, so it may need backticks there
    Value float
);
As I said before - it's not literally a "temporary" table, it's actually a real permanent table that is just used for temporary data.
Now the user opens your application, enters his search words and presses the "Search" button.
Then you start your resource-heavy algorithm to calculate the result once, and store it in the table:
insert into AlgorithmTempTable (QueryID, Rank, Value)
select '12345678-9012-3456789', foo, bar
from Whatever
insert into AlgorithmTempTable (QueryID, Rank, Value)
select '12345678-9012-3456789', foo2, bar2
from SomewhereElse
The Guid must be known to the client. Maybe you can use the client's SessionID for that (if he has one and if he can't start more than one query at once...or you generate a new Guid on the client each time the user presses the "Search" button, or whatever).
Now all the calculation is done, and the ranked list of results is saved in the table.
Now you can query the table, filtering by the QueryID:
select Rank, Value
from AlgorithmTempTable
where QueryID = '12345678-9012-3456789'
order by Rank
limit 0, 10
Because of the QueryID, multiple users can do this at the same time without interfering with each other's queries. If you create a new QueryID for each search, the same user can even run multiple queries at once.
Now there's only one thing left to do: delete the temporary data when it's not needed anymore (only the data! The table is never dropped).
So, if the user closes the query screen:
delete
from AlgorithmTempTable
where QueryID = '12345678-9012-3456789'
This is not ideal in some cases, though. If the application crashes, the data stays in the table forever.
There are several better ways. Which one is the best for you depends on your application. Some possibilities:
You can add a datetime column with the current time as its default value, and then run a nightly (or weekly) job that deletes everything older than X (a sketch follows after this list)
Same as above, but instead of a weekly job you can delete everything older than X every time someone starts a new query
If you have a session per user, you can save the SessionID in an additional column in the table. When the user logs out or the session expires, you can delete everything with that SessionID in the table
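For the first option, a sketch in MySQL (assuming a CreatedAt datetime column that defaults to the current timestamp, and a retention period of one day):

DELETE FROM AlgorithmTempTable
WHERE CreatedAt < NOW() - INTERVAL 1 DAY;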
Paging results can be very tricky. The way I have done this is as follows. Set an upper-bound limit for any query that may be run, for example 5,000. If a query returns more than 5,000 rows, limit the results to 5,000.
This is best done using a stored procedure (a sketch follows below).
Store the results of the query into a temp table.
Select page X's worth of data from the temp table.
Also return the current page and total number of pages.
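A rough T-SQL sketch of such a procedure (the procedure name, source table/view, and column names are all made up; the cap is 5,000 rows and the page size is 20):

CREATE PROCEDURE dbo.GetPagedResults
    @PageNumber int,
    @PageSize   int = 20
AS
BEGIN
    SET NOCOUNT ON;

    -- Cap the result set at 5,000 rows and store it in a temp table
    SELECT TOP (5000)
           ROW_NUMBER() OVER (ORDER BY SomeColumn) AS RowNum,
           SomeColumn, OtherColumn
    INTO #Results
    FROM dbo.SomeBigQueryOrView
    ORDER BY SomeColumn;

    DECLARE @TotalRows int = (SELECT COUNT(*) FROM #Results);

    -- Return page X's worth of data from the temp table
    SELECT SomeColumn, OtherColumn
    FROM #Results
    WHERE RowNum BETWEEN (@PageNumber - 1) * @PageSize + 1
                     AND @PageNumber * @PageSize
    ORDER BY RowNum;

    -- Also return the current page and the total number of pages
    SELECT @PageNumber AS CurrentPage,
           CEILING(@TotalRows * 1.0 / @PageSize) AS TotalPages;
END;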