Creating an efficient indexer for my Azure Search Service - azure-cognitive-search

My data source cannot be a single table, as I need data that spans six tables. For that I have created a view that joins these tables.
When I use this view as the data source, indexing takes a very long time and times out. I tried increasing the timeout to 40 minutes and one more suggested change:
"disableOrderByHighWaterMarkColumn" : true
It still timed out. I also set the batch size to 1000. This time it populated the index, but it failed after a few hours saying "connection lost", and "thanks" to disableOrderByHighWaterMarkColumn, if I rerun the indexer it will process all the rows again.
My question is: what is the best way to approach a solution to this problem?
Follow-up question: Since I am relying on a view, I cannot have automatic change tracking. I am using a high water mark column (LastUpdatedTime) to track changes in my view. I only want to keep six months of data in my index, and I am not sure how I can do that when I am using a view. I already have a "where CreateDateTime > dateadd(month, -6, getdate())" clause in my view, but this will not enable the indexer to delete out-of-time-window rows (documents) from the index. How can I achieve my goals here?
Should I write a processor task to periodically query all documents using the C# SDK and delete documents based on date?

Sorry to hear the Azure SQL Database indexer is giving you trouble. I noticed a couple of things in your question that might be worth thinking about in terms of SQL performance:
My data source cannot be a single table, as I need data that spans six tables. For that I have created a view that joins these tables. When I use this view as the data source, indexing takes a very long time and times out.
It's worth taking a look at the query performance troubleshooting guide and figuring out what exactly is happening in your Azure SQL database that is causing problems. Assuming you want to use change tracking support, the default query the indexer uses against the SQL database looks like this:
SELECT * FROM c WHERE hwm_column > #hwmvalue ORDER BY hwm_column
We frequently see performance issues here when there isn't an index on the hwm_column, or when hwm_column is computed. The documentation on high water mark change detection covers these issues in more detail.
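For example, assuming the view's LastUpdatedTime comes straight from a column on one of the underlying tables (the table and index names below are hypothetical), a supporting index on that column can make the WHERE/ORDER BY on the high water mark much cheaper:

-- Sketch only: adjust to whichever base table actually supplies LastUpdatedTime.
CREATE NONCLUSTERED INDEX IX_ProcessData_LastUpdatedTime
    ON dbo.ProcessData (LastUpdatedTime);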
I tried increasing the timeout to 40 minutes and one more suggested change: "disableOrderByHighWaterMarkColumn" : true. It still timed out. I also set the batch size to 1000. This time it populated the index, but it failed after a few hours saying "connection lost", and "thanks" to disableOrderByHighWaterMarkColumn, if I rerun the indexer it will process all the rows again.
disableOrderByHighWaterMarkColumn doesn't seem like it will work for your scenario, so I agree that you shouldn't set it. Decreasing the batch size seems to have had a positive effect; I would consider measuring the performance gain here using the troubleshooting guide referenced above.
Follow-up question: Since I am relying on a view, I cannot have automatic change tracking. I am using a high water mark column (LastUpdatedTime) to track changes in my view. I only want to keep six months of data in my index, and I am not sure how I can do that when I am using a view. I already have a "where CreateDateTime > dateadd(month, -6, getdate())" clause in my view, but this will not enable the indexer to delete out-of-time-window rows (documents) from the index. How can I achieve my goals here? Should I write a processor task to periodically query all documents using the C# SDK and delete documents based on date?
Instead of filtering out data that is more than six months old, I would consider adding a soft delete policy. The challenge here is that the indexer needs to pick up the rows that should be deleted. The easiest way to accomplish this might be updating your application logic to add a new column to your view indicating that the row should be deleted. Once the value of this column changes, the LastUpdatedTime should also be updated so the row shows up in the next indexer query.
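As a hedged sketch of that idea (all table and column names below are assumptions), a scheduled job could soft-delete rows older than six months and bump LastUpdatedTime in the same statement, with the view exposing IsDeleted instead of filtering the old rows out; the data source's soft delete policy would then point at IsDeleted:

-- Marks aged-out rows and advances the high water mark so the indexer sees them.
UPDATE dbo.MainTable
SET    IsDeleted       = 1,
       LastUpdatedTime = GETDATE()
WHERE  CreateDateTime <= DATEADD(MONTH, -6, GETDATE())
  AND  IsDeleted = 0;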
You can write your own processor task, but querying all documents in Azure Cognitive Search and paging through them may have a negative impact on your search performance. I would recommend trying to get this working with your indexer first.

Related

MAX on a VIEW doing a TableScan instead of using metadata lookup. Why doesn't SF use metadata?

I noticed that even the simplest 'SELECT MAX(TIMESTAMP) FROM MYVIEW' is somewhat slow (taking minutes) in my environment, and found it's doing a TableScan of 100+ GB across 80K micro-partitions.
My expectation was that this would finish in milliseconds using the MIN/MAX/COUNT metadata in each micro-partition. In fact, I do see Snowflake finishing the job in milliseconds using metadata for an almost identical MIN/MAX lookup in the following article:
http://cloudsqale.com/2019/05/03/performance-of-min-max-functions-metadata-operations-and-partition-pruning-in-snowflake/
Is there any limitation in how Snowflake decides to use metadata? Could it be because I'm querying through a view, instead of querying a table directly?
=== Added for clarity ===
Thanks for answering! Regarding how the view is defined, it adds a WHERE clause for additional filtering using a cluster key. So I believe it should still be possible to fully use the micro-partition metadata. But as posted, a TableScan is being done in the profiler output.
I'm a bit concerned about your comment on secure views. The view I'm querying is indeed a SECURE VIEW - does that affect how the optimizer handles my query? Could that be the reason a TableScan is done?
It looks like you're running the query against a view. The metadata you're referring to is used when you're running a simple MIN/MAX etc. on the table itself; however, if you have logic inside your view which requires filtering or joining of data, then Snowflake cannot return results based on the metadata alone.
So yes, you are correct when you say the following because your view is probably doing something other than a simple MAX on the table:
...Could it be because I'm querying through a view, instead of querying a table directly?
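If you have access to the base table behind the view, a quick sanity check (MYTABLE is a hypothetical name here) is to run the same aggregate against it directly; if that comes back in milliseconds from metadata while the view version scans, the view's extra logic (or its SECURE property) is what is blocking the metadata-only plan:

-- Direct aggregate on the assumed base table, bypassing the view's logic.
SELECT MAX(TIMESTAMP) FROM MYTABLE;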

SQL Server transactional replication filter

I have a very large table from which I need to extract a subset of records, the last 30 days of records, and replicate those records to a second database for reporting purposes. Currently I am using transactional replication, where I added a filter to the published articles to isolate the 30-day window, in order to get a near-real-time replication environment.
The issue I have is that the replication is incremental: the most recent records are added to the replica, but the older records are not removed, so the replica keeps growing.
Also, when a record that fell outside the filtering criteria is updated and re-enters the filtering criteria, replication crashes with a "duplicate primary key" error.
How can I make this work so that the replica contains only the last 30 days of data?
Is the behaviour described above something I should expect to see?
Many thanks,
Well, the simplest way is not to use MSSQL's filter. The simplest way is to replace the stored procedures used for update and delete with custom procedures, so that you do not get errors when deleting or updating rows that are absent from the replica. This is done from the article's advanced properties. For the delete case you can just use a MERGE and apply your filtering criteria there (a rough sketch follows below).
Also have a job that deletes from the replica tables whatever needs to be removed.
Of course you will need to be very careful when doing structure updates, but it is doable.
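A rough sketch of one way to implement this, heavily hedged: replication generates procedures named like sp_MSupd_dboMyTable / sp_MSdel_dboMyTable, and a custom replacement must keep the exact parameter list the Distribution Agent passes; the procedure names, parameters and table below are simplified, hypothetical stand-ins.

-- Custom delete proc: deleting a row that is absent from the replica is not an error.
CREATE PROCEDURE dbo.custom_del_MyTable
    @pkid INT
AS
BEGIN
    DELETE FROM dbo.MyTable WHERE Id = @pkid;
END
GO

-- Custom update proc: MERGE so a row re-entering the 30-day window is inserted
-- instead of raising a duplicate-key or missing-row error.
CREATE PROCEDURE dbo.custom_upd_MyTable
    @pkid INT,
    @SomeColumn NVARCHAR(100),
    @CreatedAt DATETIME
AS
BEGIN
    MERGE dbo.MyTable AS t
    USING (SELECT @pkid AS Id, @SomeColumn AS SomeColumn, @CreatedAt AS CreatedAt) AS s
        ON t.Id = s.Id
    WHEN MATCHED THEN
        UPDATE SET SomeColumn = s.SomeColumn, CreatedAt = s.CreatedAt
    WHEN NOT MATCHED AND s.CreatedAt > DATEADD(DAY, -30, GETDATE()) THEN
        INSERT (Id, SomeColumn, CreatedAt) VALUES (s.Id, s.SomeColumn, s.CreatedAt);
END
GO

-- Scheduled job on the subscriber to purge rows that age out of the window.
DELETE FROM dbo.MyTable WHERE CreatedAt <= DATEADD(DAY, -30, GETDATE());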
Another, uglier way is to keep SQL's generated stored procedures and just ignore the errors (via the Distribution Agent's -SkipErrors 2601:2627:20598 parameter). This will again require a job to delete old rows, and it will not bring back into scope old rows that are merely updated. All in all, the first solution should be the better one.
Hope it helps.

Add DATE column to store when last read

We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive. (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own application layer.
In any case you will need an application server there. You are not going to update the tracking field in the same transaction as the SELECT, are you? What about rollbacks? So you need some manager that first runs the SELECT and then writes the tracking information. And what is the point of saving the tracking information together with the entity data by sending it back to the database? Save it to a file on the application server.
You could update the column in the table as you suggested, but if it were me I'd log the event to another table: the id of the record, a datetime, the userid (maybe IP address, browser version, etc.), and just about anything else I could capture that was even possibly relevant. (For example, six months from now your manager decides s/he wants to know not only which records were used the most, but also which users are using the most records, or what time of day that usage happens, etc.)
This type of information can be useful for things you've never even thought of down the road, and if the table starts to grow large you can always roll it up and prune it to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never regret having it available down the road, and it will be impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to 'select' the data from within a stored procedure that also issues the logging command, so that the client is not doing two round trips (one for the select, one for the update/insert).
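A minimal sketch of that approach, with assumed names (dbo.MyTable, dbo.AccessLog and the column names are hypothetical): the procedure records the access and returns the row in a single round trip.

CREATE TABLE dbo.AccessLog (
    RecordId   INT      NOT NULL,
    AccessedAt DATETIME NOT NULL DEFAULT GETDATE(),
    UserName   SYSNAME  NOT NULL DEFAULT SUSER_SNAME()
);
GO

CREATE PROCEDURE dbo.GetMyTableRow
    @Id INT
AS
BEGIN
    SET NOCOUNT ON;

    -- Log the read.
    INSERT INTO dbo.AccessLog (RecordId) VALUES (@Id);

    -- Return the requested row.
    SELECT Id, Stuff1, Stuff2
    FROM dbo.MyTable
    WHERE Id = @Id;
END
GO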
Alternatively, if this is a web application, you could use an async AJAX call to issue the logging action, which wouldn't slow down the user's experience at all.
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and database performance is one of the major concerns in database server administration.
Instead, you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
For more information, search for database auditing for SELECT statements.
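If your server is on SQL Server 2008 or later, a minimal sketch of auditing SELECTs on one table looks roughly like this (the audit, specification, table and file-path names are hypothetical; on older versions you would have to fall back to SQL Trace instead):

-- Server-level audit target (run in master; file path is hypothetical).
CREATE SERVER AUDIT SelectAudit
    TO FILE (FILEPATH = 'C:\AuditLogs\');
GO
ALTER SERVER AUDIT SelectAudit WITH (STATE = ON);
GO
-- Database-level specification capturing SELECTs on the table of interest
-- (run in the user database).
CREATE DATABASE AUDIT SPECIFICATION SelectAuditSpec
    FOR SERVER AUDIT SelectAudit
    ADD (SELECT ON OBJECT::dbo.MyTable BY public)
    WITH (STATE = ON);
GO
-- Read the captured events back.
SELECT event_time, server_principal_name, statement
FROM sys.fn_get_audit_file('C:\AuditLogs\*.sqlaudit', DEFAULT, DEFAULT);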
Use another table as a key/value pair with two columns (e.g. id_selected, times) for storing the ids of the records you select in your standard table, and increment the times value by 1 every time the records are selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query into the counting table. As a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1 = 'somevalue';

INSERT INTO countTable (id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1 = 'somevalue'  # or just build a list of ids as VALUES from your last result
ON DUPLICATE KEY UPDATE times = times + 1;
The ON DUPLICATE KEY syntax is off the top of my head in MySQL. For conditionally inserting or updating in MSSQL you would need to use MERGE instead.
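A hedged SQL Server equivalent of the counting upsert above, reusing the hypothetical myTable/countTable names:

-- Increment the counter for ids that already exist, insert new ids with times = 1.
MERGE dbo.countTable AS t
USING (SELECT id FROM dbo.myTable WHERE stuff1 = 'somevalue') AS s
    ON t.id_selected = s.id
WHEN MATCHED THEN
    UPDATE SET t.times = t.times + 1
WHEN NOT MATCHED THEN
    INSERT (id_selected, times) VALUES (s.id, 1);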

How to improve performance in SQL Server table with image fields?

I'm having a very particular performance problem at work!
In the system we're using there's a table that holds information about the current workflow process. One of the fields holds a spreadsheet that contains metadata about the process (don't ask me why!! and NO I CAN'T CHANGE IT!!)
The problem is that this spreadsheet is stored in an IMAGE field in an SQL Server 2005 (within a database set with SQL 2000 compatibility).
This table currently has 22K+ lines and even a simple query like this:
SELECT TOP 100 *
FROM OFFENDING_TABLE
Takes 30 seconds to retrieve the data in Query Analyser.
I'm thinking about updating the compatibility level to SQL 2005 (since I was informed that the app can handle it).
The second thing I'm considering is changing the data type of the column to varbinary(max), but I don't know if doing this will affect the application.
Another thing I'm considering is using sp_tableoption to set the 'large value types out of row' option to 1, as it's currently 0, but I have no information on whether doing this will improve performance.
Does anyone know how to improve performance in such scenario?
Edited to clarify
My problem is that I have no control over what the application asks of the SQL Server. I did some reflection on it (the app is a .NET 1.1 website), and it uses the offending field for some internal logic whose purpose I can't determine.
I need to improve the overall performance of this table.
I'd recommend you look into the health of the offending table's layout:
select * from sys.dm_db_index_physical_stats(
    db_id(), object_id('offending_table'), null, null, 'detailed');
Things to look for are avg_fragmentation_in_percent, page_count, avg_page_space_used_in_percent, record_count and ghost_record_count. Cues like high fragmentation, a high number of ghost records, or a low page-used percentage indicate problems, and things can be improved quite a bit just by rebuilding the index (i.e. the table) from scratch:
ALTER INDEX ALL ON offending_table REBUILD;
I'm saying this considering that you cannot change the table nor the app. If you were able to change the table and the app, the advice you already got is good advice (don't use '*', don't select without a condition, use the newer varbinary(max) type, etc.).
I'd also look at the page life expectancy performance counter to understand whether the system is memory starved. From your description of the symptoms the system looks IO bound, which leads me to think there is little page caching going on, and more RAM could help, as well as a faster IO subsystem. On a SQL 2008 system I would also suggest turning page compression on, but on 2005 you can't.
And, just to be sure, make sure the queries are not blocked by contention from the app itself, i.e. that the query doesn't spend 90% of those 30 seconds waiting for a row lock. Look at sys.dm_exec_requests while the query is running and check the wait_time, wait_type and wait_resource. Is it PAGEIOLATCH_XX? Or is it a lock? Also, what does sys.dm_os_wait_stats look like on your server, and what are the top wait reasons?
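A quick sketch of those checks (the session id below is a hypothetical placeholder for the spid running the slow SELECT):

-- What is the slow request waiting on right now?
SELECT session_id, status, wait_time, wait_type, wait_resource
FROM sys.dm_exec_requests
WHERE session_id = 53;

-- Top cumulative waits on the server:
SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;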
First of all - don't ever do a SELECT * in production code - reporting or not.
You have three basic choices:
move that blob field out into a separate table if it's not always needed; probably not practical since you mention you cannot change the schema
be more careful with your SELECT statements to select only those fields that you really need - and omit the blob field
see if you can limit your query to include a WHERE clause and find a way to optimize the query plan by e.g. adding a suitable index to the table (if you can)
There's no magic "make this faster" switch - but you can optimize your query or optimize your table layout. Both help. If you can't change anything - not the table layout, not the indexes, not the queries - you'll have a hard time optimizing anything, I'm afraid.
Just changing the field to VARBINARY(MAX) won't change anything at all - no performance improvement to be expected just from changing the data type.
A short answer is to only do SELECTs against multiple rows when the fields returned do not include the offending image field, i.e. no SELECT *. If you want the value of the image field, retrieve it on a case-by-case basis.
Setting the 'large value types out of row' option should definitely help performance: the row size will be significantly smaller, so SQL Server can do a lot fewer physical reads to get through the table.
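If you go that route, a minimal sketch follows. Note the option only applies to varbinary(max)/varchar(max)/nvarchar(max)/xml columns; the legacy image type is governed by the separate 'text in row' option and is stored off-row by default, so converting the column first may be a prerequisite. The column name below is hypothetical.

EXEC sp_tableoption 'dbo.OFFENDING_TABLE', 'large value types out of row', 1;

-- Existing rows keep their current storage until the value is rewritten, e.g.:
UPDATE dbo.OFFENDING_TABLE SET SpreadsheetData = SpreadsheetData;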

Have you ever encountered a query that SQL Server could not execute because it referenced too many tables?

Have you ever seen any of these error messages?
-- SQL Server 2000
Could not allocate ancillary table for view or function resolution.
The maximum number of tables in a query (256) was exceeded.
-- SQL Server 2005
Too many table names in the query. The maximum allowable is 256.
If yes, what have you done?
Given up? Convinced the customer to simplify their demands? Denormalized the database?
#(everyone wanting me to post the query):
I'm not sure if I can paste 70 kilobytes of code in the answer editing window.
Even if I could, it wouldn't help, since those 70 kilobytes of code reference 20 or 30 views that I would also have to post, since otherwise the code will be meaningless.
I don't want to sound like I am boasting here but the problem is not in the queries. The queries are optimal (or at least almost optimal). I have spent countless hours optimizing them, looking for every single column and every single table that can be removed. Imagine a report that has 200 or 300 columns that has to be filled with a single SELECT statement (because that's how it was designed a few years ago when it was still a small report).
For SQL Server 2005, I'd recommend using table variables and partially building the data as you go.
To do this, create a table variable that represents your final result set you want to send to the user.
Then find your primary table (say the orders table in your example above) and pull that data, plus a bit of supplementary data that is only one join away (customer name, product name). You can do an INSERT ... SELECT to put this straight into your table variable.
From there, iterate through the table and, for each row, do a bunch of small SELECT queries that retrieve all the supplemental data you need for your result set. Insert these into each column as you go.
Once complete, you can then do a simple SELECT * from your table variable and return this result set to the user.
I don't have any hard numbers for this, but there have been three distinct instances that I have worked on to date where doing these smaller queries has actually worked faster than doing one massive select query with a bunch of joins.
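A minimal sketch of the staged-build pattern (all table and column names are hypothetical, and set-based updates are used here for brevity instead of a literal row-by-row loop):

DECLARE @Result TABLE (
    OrderId      INT PRIMARY KEY,
    CustomerName NVARCHAR(100),
    ProductName  NVARCHAR(100),
    ShipRegion   NVARCHAR(50)     -- supplemental column filled in afterwards
);

-- Seed the table variable from the primary table plus one cheap join.
INSERT INTO @Result (OrderId, CustomerName, ProductName)
SELECT o.OrderId, c.Name, p.Name
FROM dbo.Orders o
JOIN dbo.Customers c ON c.CustomerId = o.CustomerId
JOIN dbo.Products  p ON p.ProductId  = o.ProductId;

-- Fill in the supplemental data with small targeted queries instead of one huge join.
UPDATE r
SET    ShipRegion = s.Region
FROM   @Result r
JOIN   dbo.Shipments s ON s.OrderId = r.OrderId;

SELECT * FROM @Result;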
#chopeen You could change the way you're calculating these statistics, and instead keep a separate table of all per-product stats.. when an order is placed, loop through the products and update the appropriate records in the stats table. This would shift a lot of the calculation load to the checkout page rather than running everything in one huge query when running a report. Of course there are some stats that aren't going to work as well this way, e.g. tracking customers' next purchases after purchasing a particular product.
This would happen all the time when writing Reporting Services Reports for Dynamics CRM installations running on SQL Server 2000. CRM has a nicely normalised data schema which results in a lot of joins. There's actually a hotfix around that will up the limit from 256 to a whopping 260: http://support.microsoft.com/kb/818406 (we always thought this a great joke on the part of the SQL Server team).
The solution, as Dillie-O alludes to, is to identify appropriate "sub-joins" (preferably ones that are used multiple times) and factor them out into temp-table variables that you then use in your main joins. It's a major PIA and often kills performance. I'm sorry for you.
#Kevin, love that tee -- says it all :-).
I have never come across this kind of situation, and to be honest the idea of referencing > 256 tables in a query fills me with a mortal dread.
Your first question should probably be "Why so many?", closely followed by "Which bits of information do I NOT need?" I'd be worried that the amount of data being returned from such a query would begin to severely impact the performance of the application, too.
I'd like to see that query, but I imagine it's a problem with some sort of iterator, and while I can't think of any situations where it's possible, I bet it's from a bad while/case/cursor or a ton of poorly implemented views.
Post the query :D
Also I feel like one of the possible problems could be having a ton (read 200+) of name/value tables which could condensed into a single lookup table.
I had this same problem... my development box runs SQL Server 2008 (where the view worked fine), but on production (with SQL Server 2005) the view didn't. I ended up creating intermediate views to get around this limitation, using the new views as part of the query in the view that threw the error.
Kind of silly considering the logical execution is the same...
Had the same issue in SQL Server 2005 (worked in 2008) when I wanted to create a view. I resolved the issue by creating a stored procedure instead of a view.
