I have recently encountered a problem I'm having a hard time working around. I have to tune a rather nasty query that uses XML extensively - in many forms and on several layers. The problem is that while it is easy to spot the slow part of the whole process - the Table Valued Function [XML Reader] operator - it is rather hard to map it to any specific part of the query. The properties do not indicate which XML column this particular object works on, and I'm really puzzled as to why that is. Generally it is relatively easy to map a plan operator to a part of a query, but these XML objects are giving me pause.
Thanks in advance.
I have a big XML file, about 800 MB, with many tags and attributes. I need to pull different values from this file, so I have used many SORT and JOIN transformations. All of them work well and do not take too much time, except for the last SORT transformation, shown in a red oval in the picture below. That one takes forever.
If I use a smaller XML file, it goes through without taking too much time, so I assume the problem is the size of the dataset it's dealing with. I was wondering if you know of any way to handle this situation - any property that needs to be changed to improve the performance of this particular case. I'm using Visual Studio 2015. Thanks!
You can't really do much to speed up the Sort transformation in SSIS. The best solution is to find a way to not have to use the Sort transformation at all. This usually means putting the data into an indexed database table, and doing the sort in a SELECT...ORDER BY query.
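For example (a minimal sketch, not the actual package - the dbo.StagingXmlValues table and its columns are placeholder names), the pattern is to land the rows in an indexed table and let the database do the ordering:

-- Staging table with an index that matches the desired sort order.
CREATE TABLE dbo.StagingXmlValues
(
    CustomerId INT            NOT NULL,
    OrderDate  DATETIME       NOT NULL,
    Amount     DECIMAL(18, 2) NULL
);

CREATE INDEX IX_StagingXmlValues_CustomerId_OrderDate
    ON dbo.StagingXmlValues (CustomerId, OrderDate);

-- After the data flow writes the rows, sort on the server instead of in SSIS;
-- the index lets SQL Server avoid a full in-memory sort of the dataset.
SELECT CustomerId, OrderDate, Amount
FROM dbo.StagingXmlValues
ORDER BY CustomerId, OrderDate;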
We have a table in our database that stores XML in one of the columns. The XML is always in one of 3 fixed formats, received via web service responses. We need to look up information in this table (and inside the XML field) very frequently. Is this a poor use of the XML datatype?
My suggestion is to create separate tables for each different XML structure, as we are only talking about 3, with a growth rate of maybe one new table a year.
I suppose ultimately this is a matter of preference, but here are some reasons I prefer not to store data like that in an XML field:
Writing queries against XML in TSQL is slow. Might not be too bad for a small amount of data, but you'll definitely notice it with a decent amount of data.
Sometimes there is special logic needed to work with an XML blob. If you store the XML directly in SQL, then you find yourself duplicating that logic all over. I've seen this before at a job where the guy that wrote the XML to a field was long gone and everyone was left wondering how exactly to work with it. Sometimes elements were there, sometimes not, etc.
Similar to (2), in my opinion it breaks the purity of the database. In the same way that a lot of people would advise against storing HTML in a field, I would advise against storing raw XML.
But despite these three points ... it can work and TSQL definitely supports queries against it.
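For what it's worth, a query against such a field looks roughly like this (just a sketch - dbo.WebServiceResponses, the ResponsePayload XML column, and the element names are all invented for illustration):

SELECT
    r.ResponseId,
    x.Item.value('(CustomerId)[1]', 'INT')         AS CustomerId,
    x.Item.value('(Status)[1]',     'VARCHAR(20)') AS Status
FROM dbo.WebServiceResponses AS r
-- Shred one row per <Order> element out of the XML blob.
CROSS APPLY r.ResponsePayload.nodes('/Response/Order') AS x(Item)
WHERE r.ResponsePayload.exist('/Response/Order[Status="Shipped"]') = 1;

Every .value()/.nodes() call has to parse the XML at query time, which is the main reason point (1) above bites as the data grows.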
Are you reading the field more than you are writing it?
You want to do the conversion on whichever step you do least often or the step that doesn't involve the user.
I have a table where I am storing a startingDate value in a DateTime column.
Once I have the startingDate value, I am supposed to calculate the
number_of_days,
number_of_weeks,
number_of_months, and
number_of_years
all from the startingDate to the current date.
If you are going to use these values in two or more places in the application, and you care about the application's response time, would you rather make the calculations in a view, or create computed columns for each so you can query the table directly?
Computed columns are easy to maintain and provide an ideal solution to your problem – I have used such a solution recently. However, be aware the values are calculated when requested (when they are SELECTed), not when the row is INSERTed into the table – so performance might still be an issue. This might be acceptable if you can off-load work from the application server to the database server. Views also don’t exist until they are requested (unless they are materialised) so, again, there will be an overhead at runtime, but, again it’s on the database server, not the application server.
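As a concrete sketch of the computed-column route (assuming a hypothetical table name, dbo.Memberships, with the startingDate column from the question):

ALTER TABLE dbo.Memberships
    ADD number_of_days   AS DATEDIFF(DAY,   startingDate, GETDATE()),
        number_of_weeks  AS DATEDIFF(WEEK,  startingDate, GETDATE()),
        number_of_months AS DATEDIFF(MONTH, startingDate, GETDATE()),
        number_of_years  AS DATEDIFF(YEAR,  startingDate, GETDATE());

-- Because GETDATE() is non-deterministic, these columns cannot be PERSISTED or
-- indexed; they are evaluated each time the row is SELECTed, as noted above.
SELECT startingDate, number_of_days, number_of_weeks, number_of_months, number_of_years
FROM dbo.Memberships;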
Like nearly everything: It depends.
As #RedX suggests, there's probably not much of a performance difference either way, so it becomes a question of how you will use them. To me this is more of a feel thing.
Using them more than once wouldn't necessarily drive me immediately to either a view or computed columns. If I only use them in a few places or in low-volume code paths, I might calculate them inline in those places or use a CTE. But if they are in widespread or heavy use, I would agree with a view or computed column.
You would also want them in a view or computed column if you want them available via ORM tools.
Am I using those "computed columns" individually in places, or am I using them in sets? If using them in sets, I probably want a view of the table that includes them all.
When I need them, do I usually want them associated with data from a particular other table? If so, that would suggest a view.
Am I basing updates to the original table on those computed values? If so, then I want computed columns, to avoid joining the view in that case.
Calculated columns may seem an easy solution at first, but I have seen companies have trouble with them: when they try to do ETL with real-time Change Data Capture (CDC) using tools like Attunity, the tool will not recognize the calculated columns, since the values are not stored permanently. So there are some issues. Also, if the columns will be retrieved many, many times by users, you will save time in the long run by putting that logic in the ETL tool or procedure and writing it once to the database instead of calculating it many times for each request.
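The "write it once" alternative might look something like this (again a sketch against the hypothetical dbo.Memberships table, this time with a real days_since_start column populated by an ETL procedure; the staging schema is made up):

-- The value is materialised as of load time, so CDC tools see an ordinary column.
INSERT INTO dbo.Memberships (startingDate, days_since_start)
SELECT s.startingDate,
       DATEDIFF(DAY, s.startingDate, GETDATE())
FROM staging.Memberships AS s;

The trade-off is that the stored value is a snapshot from the load, so it has to be refreshed by the ETL process rather than recalculated on every read.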
I have a table that is rather large at about 10,000,000 rows. I need to page through this table from my C# application. I'm using NHibernate. I have tried to use this code example:
return session.CreateCriteria(typeof(T))
    .SetFirstResult(startId)
    .SetMaxResults(pageSize)
    .List<T>();
When I execute it the operation eventually times out if my startId is greater than 7,000,000. The pageSize I'm using is 200. I have used this method on much smaller tables, of less than 1000 rows, and it works and performs quickly.
Question is, on such a large table is there a better way to accomplish this using NHibernate?
You're trying to page through 10 million rows 200 at a time? Why? No human being is going to page through that much data.
You need to filter the dataset first and then apply TSQL-style paging to the smaller data set. Here are some methods that will work. Just modify them so that you're getting down to fewer than 10 million rows through some kind of filtering (a WHERE clause, CTE, or derived table).
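For example, filter first and then page the smaller set with ROW_NUMBER() (a sketch only - dbo.Orders, its columns, and the filter condition are placeholders):

DECLARE @PageSize  INT = 200,
        @PageIndex INT = 0;

WITH Filtered AS
(
    SELECT o.OrderId, o.CustomerId, o.OrderDate,
           ROW_NUMBER() OVER (ORDER BY o.OrderId) AS RowNum
    FROM dbo.Orders AS o
    WHERE o.OrderDate >= '2015-01-01'   -- the filter that shrinks the set
)
SELECT OrderId, CustomerId, OrderDate
FROM Filtered
WHERE RowNum BETWEEN (@PageIndex * @PageSize) + 1 AND (@PageIndex + 1) * @PageSize
ORDER BY RowNum;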
Funny you should bring this up, as I am having the same issue. My issue isn't related to paging using NHibernate, but more with just using straight T-SQL.
It seems as though there are a few options. The one I found quite useful in my instance was this answer to a question regarding paging. It discusses using a "..keyset driven solution" rather than return ranked results through the use of ROW_NUMBER(). I'm not sure what NHibernate would use in this instance or if it's possible to see the SQL it generates based on the query you issue (I know you could in Hibernate, but I've not used NHibernate).
If you aren't aware of using SQL Server to return ranked results based on ROW_NUMBER, then it's well worth looking into. A lot of people seem to refer to this article for how to go about paging. I've seen some subsequent posts discourage the use of SET ROWCOUNT, though, in favour of using TOP with a dynamic parameter - SELECT TOP(@NumOfResults).
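For comparison, a keyset-driven page boils down to something like this (a sketch against the same hypothetical dbo.Orders table - the idea is to remember the last key of the previous page and seek past it rather than numbering and skipping rows):

DECLARE @NumOfResults INT    = 200,
        @LastOrderId  BIGINT = 7000000;   -- last key returned by the previous page

SELECT TOP (@NumOfResults)
       OrderId, CustomerId, OrderDate
FROM dbo.Orders
WHERE OrderId > @LastOrderId   -- index seek instead of scanning the skipped rows
ORDER BY OrderId;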
There are lots of posts here on SO regarding this, but no definitive answer on the best way to go about it as far as I can see. I'll be keeping an eye on this post to see what others suggest also.
It could be an isolation level problem.
I had a similar issue.
If the table you're reading from is constantly updated, the updater locks parts of the table, causing a timeout when reading from it.
Add SetIsolationLayer(ReadUncommitted), but note that the data might be a little dirty.
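On the SQL Server side, what this amounts to is roughly the following (a sketch against the hypothetical dbo.Orders table again): the reader ignores the locks held by the updater, at the cost of possibly seeing uncommitted rows.

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT OrderId, CustomerId, OrderDate
FROM dbo.Orders
ORDER BY OrderId;

-- Equivalent per-table form, often seen as a quick fix:
-- SELECT OrderId, CustomerId, OrderDate FROM dbo.Orders WITH (NOLOCK);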
I'm reviewing my code and realize I spend a tremendous amount of time
taking rows from a database,
formatting as XML,
AJAX GET to browser, and then
converting back into a hashed javascript object as my local datastore.
On updates, I have to reverse the process (except using POST instead of GET).
Having just started looking at Redis, I'm thinking I can save a tremendous amount of time keeping the objects in a key-value store on the server and just using JSON to transfer directly to JS client. But my feeble mind can't anticipate what I'm giving up by leaving a SQL DB (i.e. I'm scared to give up the GROUP BY/HAVING queries)
For my data, I have:
many-many relationships, i.e. obj-tags, obj-groups, etc.
query objects by a combination of such, i.e. WHERE tag IN ('a', 'b','c') AND group in ('x','y')
self joins, i.e. ALL the tags for each object WHERE tag='a' (sql group_concat())
a lot of outer joins, i.e. OUTER JOIN rating ON o.id = rating.obj_id
and feeds, which seem to be a strong point in REDIS
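To make the relational side of that concrete, the tag/group part of those queries is roughly the following shape in T-SQL (a sketch with made-up tables obj, obj_tags(obj_id, tag), and obj_groups(obj_id, grp); STRING_AGG, available in SQL Server 2017+, stands in for MySQL's group_concat()):

SELECT o.id,
       o.name,
       STRING_AGG(t.tag, ',') AS all_tags   -- all the tags for each matching object
FROM obj AS o
JOIN obj_tags AS t ON t.obj_id = o.id
WHERE EXISTS (SELECT 1 FROM obj_tags   WHERE obj_id = o.id AND tag IN ('a', 'b', 'c'))
  AND EXISTS (SELECT 1 FROM obj_groups WHERE obj_id = o.id AND grp IN ('x', 'y'))
GROUP BY o.id, o.name;

This is the kind of set-based filtering and aggregation I'm worried about losing if I move everything into a key-value store.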
How do you successfully mix key-value & SQL DBs?
For example, is it practical to join a large list of obj.Ids from a REDIS set with SQL data using a SQL RANGE query (i.e. WHERE obj.id IN (1,4,6,7,8,34,876,9879,567,345, ...)), or vice versa?
ideas/suggestions welcome.
You may want to take a look at MongoDB. It works with JSON-style objects, and comes with SQL-like indexing & querying. Redis is more suitable for storing data structures like lists & sets, when you want a simple lookup instead of a complex query.
Now that the actual problem is more defined (i.e. you spend a lot of time writing repetitive conversion code to move from one layer/representation to the next), maybe you could consider writing (or googling for) something that automates this?
Google returns plenty of results for "convert table to XML" (and the reverse); would this help? Would something going directly from table to key/value pairs be better? Have you tried tackling this problem in a generalized way?
When you say "I spend a tremendous amount of time" do you mean this is a lot of development time, or are you referring to computing time?
Personally I'd be wary of mixing a RDBMS with a non-RDBMS solution, because this will probably create problems when the two different paradigms clash.