I want to start building a tool that more or less shows you the data lineage of a query by parsing its execution plan, so that you get information of the form:
Column A of Table XY was computed by taking Column B of Table XZ and adding Column C of Table PL
You get the idea :)
Now, when I tried some queries and looked at the corresponding execution plans, I ran into the issue that a random expression was present without any definition of how it is computed.
It appeared in the OuterReferences section of a Nested Loops join. I queried just one table, and the plan seemingly performed an index scan followed by a key lookup. Where those two, the index scan and the key lookup, are "joined", the query plan XML just showed:
<ColumnReference Column="Expr1020" />
I tried searching the XML file for another occurrence of Expr1020, but there were none.
Now, my question is: can anybody explain why this happens or what exactly happens in the query plan?
I figured every expression used should have a definition that is in some way based on the columns used, but this one is never referenced again :/
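For reference, a minimal sketch of running that kind of search against the plan cache rather than a saved plan file; Expr1020 is just the label from this particular plan and will differ elsewhere:

-- search cached plan XML for a given expression label
SELECT cp.plan_handle, qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE CAST(qp.query_plan AS nvarchar(max)) LIKE N'%Expr1020%';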
This is probably for nested loops prefetch.
See this article for more details:
The output also shows that the mystery node uses an expression
labelled [Expr1004] with a type of binary(24). This expression is
shown in regular showplan output as an outer reference (correlated
parameter) of the nested loops join: ...
No definition is provided for this expression, and no other explicit
reference is made to it anywhere else in the plan
...
The prefetch operator can issue multiple asynchronous reads based on
outer-input values. It uses a small memory buffer to keep track, and
to save index leaf page information in particular.
A reference to this information was seen earlier as the outer
reference binary (24) expression label Expr1004. My understanding is
that the leaf page information is used on the inner side of the join
in hints to the storage engine to prevent duplicated effort by the
normal read-ahead mechanisms. It may also be used for other purposes.
The XML execution plan doesn't show every detail of the actual compiled plan, which can contain additional properties and even entire operator nodes that the XML plan does not display.
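If you want to peek at some of those hidden details, one option (undocumented and unsupported, for investigation on a test box only) is trace flag 8666, which makes showplan include internal properties that the normal XML hides. A rough sketch, with sys.objects standing in for the query under investigation:

DBCC TRACEON (8666);   -- undocumented: expose internal showplan properties
GO
-- capture the plan (e.g. "Include Actual Execution Plan" in SSMS)
SELECT name FROM sys.objects OPTION (RECOMPILE);
GO
DBCC TRACEOFF (8666);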
In my GAE app I'm doing a query which has to be ordered by date. The query has to contain an IN filter, but this results in the following error:
BadArgumentError: _MultiQuery with cursors requires __key__ order
Now I've read through other SO questions (like this one), which suggest changing the sort to be by key (as the error also points out). The problem, however, is that the query then becomes useless for its purpose: it needs to be sorted by date. What would be suggested ways to achieve this?
The Cloud Datastore server doesn't support IN. The NDB client library effectively fakes this functionality by splitting a query with IN into multiple single queries with equality operators. It then merges the results on the client side.
Since the same entity could be returned in 1 or more of these single queries, merging these values becomes computationally silly*, unless you are ordering by the Key**.
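To illustrate in GQL-like terms (the kind and property names are hypothetical), a query such as:

SELECT * FROM Issue WHERE state IN ('new', 'open', 'assigned') ORDER BY date

is effectively executed as three separate equality queries:

SELECT * FROM Issue WHERE state = 'new' ORDER BY date
SELECT * FROM Issue WHERE state = 'open' ORDER BY date
SELECT * FROM Issue WHERE state = 'assigned' ORDER BY date

whose result streams are then merged and de-duplicated on the client.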
Relatedly, you should read up on the underlying caveats/limitations of cursors to get a better understanding:
Because the NOT_EQUAL and IN operators are implemented with multiple queries, queries that use them do not support cursors, nor do composite queries constructed with the CompositeFilterOperator.or method.
Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values. The de-duplication logic for such multiple-valued properties does not persist between retrievals, possibly causing the same result to be returned more than once.
If the list of values used in IN is static rather than determined at runtime, a workaround is to compute this as an indexed Boolean field when you write the Entity. This allows you to use a single equality filter. For example, if you have a bug tracker and you want to see a list of open issues, you might use an IN('new', 'open', 'assigned') restriction on your query. Alternatively, you could set a property called is_open to True instead, so you no longer need the IN condition.
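In the same hypothetical GQL terms, the workaround turns the multi-query IN into a single query with one equality filter, which is compatible with both cursors and the date sort:

SELECT * FROM Issue WHERE is_open = TRUE ORDER BY date

where is_open is computed and stored whenever the entity is written (true if state is 'new', 'open' or 'assigned').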
* Computationally silly: Requires doing a linear scan over an unbounded number of preceding values to determine if the current retrieved Entity is a duplicate or not. Also known as conceptually not compatible with Cursors.
** Key works because we can alternate between the different single queries, retrieving the next set of values, without having to worry about doing a linear scan over the entire preceding result set. This gives us a bounded data set to work with.
I am trying to get a vendor to create an index on a Progress 10.2B database to aid in migrating data from said database; however, the vendor is reluctant to create an index, saying it could impact data integrity. Their response is below. Is there any truth/merit in what is being said?
There are a number of reasons we will not add indices but the main
reason is, as you have outlined, Progress selects the index it uses
based on the parameters in the query. So for example if we had code
that does the following:
Find first record where a = 1 and b = 2
As the existing index stands this would find the record using index
‘M’ and it would find record ‘X’
If we add a new index to the table there is a chance that this code
could decide to use the new index to find the record and return record
‘Y’ instead.
Sure creating indices is a core part of any database, but proper
development practices would require heaps of testing before applying
an index change to a production system. Without testing, the integrity
of the system cannot be guaranteed.
So my thoughts on this are:
Progress selects the index it uses based on the parameters in the query
Isn't this how any database usually selects the index? Based on the required columns/where clause, it can decide the appropriate index (if any) available.
If we add a new index to the table there is a chance that this code
could decide to use the new index to find the record and return record
‘Y’ instead.
To me, it almost sounds like they have programmed their program to rely on "grabbing the first record out of the database". If it was to use an index, then sure, it might order the results differently if no order by has been specified. If this is the case, then that is just poor programming.
I pretty much agree with you:
To me, it almost sounds like they have programmed their program to rely on "grabbing the first record out of the database". If it was to use an index, then sure, it might order the results differently if no order by has been specified. If this is the case, then that is just poor programming.
If they wrote their query correctly, an index doesn't change the result, just the speed. If they left out the ORDER BY and just rely on the index happening to return rows in the right order anyway, another index could cause problems.
However, to emphasize this: the bug is then in the query.
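The vendor's code is Progress ABL, but the problem is easiest to show in generic SQL terms (table and column names hypothetical). Without an explicit sort, "first" just means whichever row the chosen index reaches first:

SELECT TOP 1 * FROM t WHERE a = 1 AND b = 2;              -- row depends on the index chosen
SELECT TOP 1 * FROM t WHERE a = 1 AND b = 2 ORDER BY id;  -- row is deterministic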
According to the documentation, "The order of the concatenated elements is arbitrary." Is there any way to get sorted data (that is, data in the order of the data source) rather than an arbitrary choice?
Some other databases allow an ORDER BY clause in there, but SQLite just uses whatever order the records happen to be read from the table/index/subquery.
If you are using one fixed version of SQLite, and if your database schema does not change, and if you never re-execute ANALYZE, and if your SQL query stays the same, then the order will stay the same.
However, these conditions are hard to guarantee.
Usually, it would be a better idea to not aggregate that field and to use an ORDER BY clause instead, or to use a separator and sort the values in your code.
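A commonly used (but, as above, not guaranteed) sketch is to sort in a subquery and rely on group_concat reading the rows in that order; the table and column names are hypothetical:

SELECT group_concat(name, ', ')
FROM (SELECT name FROM items ORDER BY name);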
In a SQL Server 2008 database we have a table with 10 columns. In a web application, the user interface is designed to allow the user to specify search criteria on some or all of the columns. The web application calls a stored procedure which dynamically creates a SQL statement with only the specified options in the where clause, then executes the query using sp_executesql.
What is the best way to index these columns? We currently have 10 indexes, each one with a different column. Should we have 1 index with all 10, or some other combination?
The bible on optimizing dynamic search queries was written by SQL Server MVP Erland Sommarskog:
http://www.sommarskog.se/dyn-search.html
For SQL Server 2008 specifically:
http://www.sommarskog.se/dyn-search-2008.html
There is a lot of information to digest here, and what you ultimately decide will depend on how the queries are formed. Are there certain parameters that are always searched on? Are there certain combinations of parameters that are usually requested together? Can you really afford to create an index on every column (remember that not all of them will necessarily be used even if multiple columns are mentioned in the where clause, and additional indexes are not "free" - you pay for them in maintenance)?
A compound index can only be used when the leftmost key is specified in the search condition. If you have an index on (A, B, C), it can be used to search WHERE A = @a, WHERE A = @a AND B = @b, WHERE A = @a AND C = @c, or WHERE A = @a AND B = @b AND C = @c. But it cannot be used if the leftmost key is not specified: WHERE B = @b or WHERE C = @c cannot use this index. Therefore ten single-column indexes can each serve one specific user criterion, but one index on all ten columns will only be useful if the user includes a criterion on the first column, and useless in all other cases. At least this is the 10,000 ft answer; there are more details if you start to dig into it.
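A minimal sketch of that rule (table, index, and data are hypothetical):

CREATE TABLE t (A int, B int, C int);
CREATE INDEX IX_t_ABC ON t (A, B, C);

DECLARE @a int = 1, @b int = 2;
SELECT * FROM t WHERE A = @a;             -- can seek: leftmost key specified
SELECT * FROM t WHERE A = @a AND B = @b;  -- can seek on (A, B)
SELECT * FROM t WHERE B = @b;             -- leftmost key missing: scan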
For a comprehensive discussion of your problem and possible solutions, see Dynamic Search Conditions in T-SQL.
It completely depends on what the data are: how well they index (e.g. an index on a column with only two values isn't going to help you much), how likely they are to be searched on, and how likely they are to be searched on together.
In particular, if column A is queried a lot, and column B tends only to be queried when also querying column A, a compound index over (A, B) will make queries that search for particular values of both columns very fast, and also give you the benefits of a single index on A (but not B) for free.
It is possible that one index per column makes sense for your data, but more likely not. There will probably be a better trade-off taking into account the nature of your data and schema.
Personally I would not bother using a stored procedure to create dynamic SQL. There are no performance benefits compared to doing it in whatever server-side scripting language you're using in the webapp itself, and the language you're writing the webapp in will almost always have more flexible, readable and secure string handling functions than SQL does. Generating SQL strings in SQL itself is an exercise in pain; you will almost certainly get some escaping wrong somewhere and give yourself an SQL-injection security hole.
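For what it's worth, if the dynamic SQL does stay in T-SQL, the pattern Erland's articles describe keeps it parameterized through sp_executesql so user values are never concatenated into the string (table and parameter names are placeholders):

DECLARE @sql nvarchar(max) = N'SELECT * FROM t WHERE 1 = 1';
DECLARE @a int = 1, @b int = NULL;   -- values from the search form

IF @a IS NOT NULL SET @sql += N' AND A = @a';
IF @b IS NOT NULL SET @sql += N' AND B = @b';

EXEC sp_executesql @sql, N'@a int, @b int', @a = @a, @b = @b;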
One index per column. The problem is that you have no clue about the queries, and this is the most generic approach.
In my experience, having combined indexes does make the queries faster, but you can't cover every possible combination.
I would suggest doing some usage testing to determine which combinations are used most frequently. Then focus on indexes that combine those columns. If the most frequent combinations are:
C1, C2, C3
C1, C2, C5
... then make a combined index on C1 and C2.
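In DDL terms that advice amounts to something like the following, using the placeholder names from the list above:

CREATE TABLE t2 (C1 int, C2 int, C3 int, C4 int, C5 int);
CREATE INDEX IX_t2_C1_C2 ON t2 (C1, C2);

Both frequent combinations can then seek on the shared (C1, C2) prefix and apply the remaining predicate (C3 or C5) as a residual filter.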
This has been bugging me for a while and I'm hoping that one of the SQL Server experts can shed some light on it.
The question is:
When you index a SQL Server column containing a UDT (CLR type), how does SQL Server determine what index operation to perform for a given query?
Specifically I am thinking of the hierarchyid (AKA SqlHierarchyID) type. The way Microsoft recommends that you use it - and the way I do use it - is:
1. Create an index on the hierarchyid column itself (let's call it ID). This enables a depth-first search, so that when you write WHERE ID.IsDescendantOf(@ParentID) = 1, it can perform an index seek.
2. Create a persisted computed Level column and create an index on (Level, ID). This enables a breadth-first search, so that when you write WHERE ID.GetAncestor(1) = @ParentID, it can perform an index seek (on the second index) for this expression.
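For concreteness, that setup looks roughly like this (table and index names are hypothetical):

CREATE TABLE OrgNode
(
    ID      hierarchyid NOT NULL PRIMARY KEY,  -- clustered: depth-first order
    [Level] AS ID.GetLevel() PERSISTED         -- computed depth
);
CREATE INDEX IX_OrgNode_Breadth ON OrgNode ([Level], ID);  -- breadth-first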
But what I don't understand is how is this possible? It seems to violate the normal query plan rules - the calls to GetAncestor and IsDescendantOf don't appear to be sargable, so this should result in a full index scan, but it doesn't. Not that I am complaining, obviously, but I am trying to understand if it's possible to replicate this functionality on my own UDTs.
Is hierarchyid simply a "magical" type that SQL Server has a special awareness of, and automatically alters the execution plan if it finds a certain combination of query elements and indexes? Or does the SqlHierarchyID CLR type simply define special attributes/methods (similar to the way IsDeterministic works for persisted computed columns) that are understood by the SQL Server engine?
I can't seem to find any information about this. All I've been able to locate is a paragraph stating that the IsByteOrdered property makes things like indexes and check constraints possible by guaranteeing one unique representation per instance; while this is somewhat interesting, it doesn't explain how SQL Server is able to perform a seek with certain instance methods.
So the question again - how do the index operations work for types like hierarchyid, and is it possible to get the same behaviour in a new UDT?
The query optimizer team is trying to handle scenarios that don't change the order of things. For example, cast(someDateTime as date) is still sargable. I'm hoping that as time continues, they fix up a bunch of old ones, such as dateadd/datediff with a constant.
So... handling GetAncestor is effectively like using the LIKE operator with the start of a string. It doesn't change the order, so you can still get away with an index seek.
You are correct - HierarchyId and Geometry/Geography are both "magical" types that the Query Optimizer is able to recognize and rewrite the plans for in order to produce optimized queries - it's not as simple as just recognizing sargable operators. There is no way to simulate equivalent behavior with other UDTs.
For HierarchyId, the binary serialization of the type is special in order to represent the hierarchical structure in a binary-ordered fashion. It is similar to the mechanism used by the SQL XML type, described in the research paper ORDPATHs: Insert-Friendly XML Node Labels. So while the QO rules that translate queries using IsDescendantOf and GetAncestor are special, the actual underlying index is a regular relational index on the binary hierarchyid data, and you could achieve the same behavior if you were willing to write your original queries to do range seeks instead of calling the simple methods.
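To make that last point concrete, here is a hedged sketch against the hypothetical OrgNode table from the question's setup. Both queries get index seeks; for IsDescendantOf the seek predicate the optimizer builds is a range over the binary-ordered keys whose upper bound comes from an internal descendant-limit function that user T-SQL cannot call, which is why a custom UDT would require you to compute and write such range bounds yourself:

DECLARE @Parent hierarchyid = hierarchyid::Parse('/1/2/');

SELECT ID FROM OrgNode
WHERE ID.IsDescendantOf(@Parent) = 1;   -- depth-first: range seek on the PK

SELECT ID FROM OrgNode
WHERE ID.GetAncestor(1) = @Parent;      -- breadth-first: seek on ([Level], ID)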