Any way to invalidate the high watermark value on a datasource? - azure-cognitive-search

I've realized that a lot of data was missing from a view I'm using as a data source to populate an index, because the view used an INNER JOIN instead of a LEFT JOIN. But since change tracking is based on ROWVERSION, those rows fall before the current high watermark value of the data source/indexer.
So rather than rebuild the entire index, I'd like to invalidate the current high watermark value so that the indexer pulls in all of the data from this data source again (the index is populated from multiple data sources).
I tried deleting and recreating the underlying data sources, but this didn't seem to pull in the new data.
Is this possible? Or will I need to rebuild any indexes that pull data from these data sources?

To reset the indexer's high-water-mark state, use the Reset Indexer API, or its equivalent in the .NET SDK.
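For example, with the Azure.Search.Documents .NET SDK the call looks roughly like this (a sketch; the service URL, admin key, and indexer name are placeholders):

    using System;
    using Azure;
    using Azure.Search.Documents.Indexes;

    var indexerClient = new SearchIndexerClient(
        new Uri("https://<your-service>.search.windows.net"),
        new AzureKeyCredential("<admin-api-key>"));

    // Reset clears the indexer's change-tracking (high watermark) state...
    indexerClient.ResetIndexer("<your-indexer>");

    // ...so the next run re-reads all rows from the data source.
    indexerClient.RunIndexer("<your-indexer>");

The REST equivalent is a POST to /indexers/<indexer name>/reset on your search service, followed by a run.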

Related

MAX on a VIEW doing a TableScan instead of using metadata lookup. Why doesn't SF use metadata?

I noticed even the simplest 'SELECT MAX(TIMESTAMP) FROM MYVIEW' is somewhat slow (taking minutes) in my environment, and found it's doing a TableScan of 100+GB across 80K micropartitions.
My expectation was that this would finish in milliseconds using the MIN/MAX/COUNT metadata in each micro-partition. In fact, I do see Snowflake finishing the job in milliseconds using metadata for an almost identical MIN/MAX lookup in the following article:
http://cloudsqale.com/2019/05/03/performance-of-min-max-functions-metadata-operations-and-partition-pruning-in-snowflake/
Is there any limitation in how Snowflake decides to use metadata? Could it be because I'm querying through a view, instead of querying a table directly?
=== Added for clarity ===
Thanks for answering! Regarding how the view is defined, it seems to add a WHERE clause for additional filtering on a cluster key. So I believe it should still be possible to fully use the micro-partition metadata. But as posted, a TableScan shows up in the profiler output.
I'm a bit concerned about your comment on secure views. The view I'm querying is indeed a SECURE VIEW - does that affect how the optimizer handles my query? Could that be the reason a TableScan is done?
It looks like you're running the query on a view. The metadata you're referring to will be used when you run a simple MIN/MAX etc. directly on the table; however, if there is logic inside your view that requires filtering or joining of data, then Snowflake cannot return results based on the metadata alone.
So yes, you are correct when you say the following because your view is probably doing something other than a simple MAX on the table:
...Could it be because I'm querying through a view, instead of querying a table directly?

Lookup transformation in SSIS bug: doesn't prevent insertion of already existing record

I have an SSIS package that reads from a source and performs a Lookup transformation to check whether the record exists in the destination: if it exists, the row is redirected to the match output and updated; otherwise it goes to the no-match output and is inserted. The problem is that sometimes it inserts a record that should have been redirected for update. This happens when the package runs via a job; if I execute the package manually, everything is fine. The Lookup component is set up correctly with the matching column.
I can't find out why this happens, and the silliest thing is that I cannot debug it because manually everything is fine.
Any ideas?
There are two scenarios where you get inserts that should have been updates.
Duplicate source values
The first is that you have duplicate keys in your source data and nothing in the target table.
Source data
Key|Value
A |abc
B |bcd
A |cde
Destination data
Key|Value
C |yza
B |zab
In this situation, assuming the default behaviour of the Lookup component (Full Cache), before the package begins execution SSIS will run the source query for the Lookup's reference table. Only once all the lookup table data has been cached will the data flow begin flowing data.
The first row, A:abc hits the lookup. Nope, no data and off to the Insert path.
The second row B:bcd hits the lookup. Nope, no data and off to the Insert path.
The third row A:cde hits the lookup. Nope, no data and off to the Insert path (and hopefully a primary/unique key violation)
When the package started, it only knew about data in the destination table. During the run you added the same key value to the table but never asked the lookup component to check for updates.
In this situation, there are two resolutions. The first is to change the cache mode from Full to None (or Partial). This will have the Lookup component issue a query against the target table for every row that flows through the data flow, which can get expensive for large row volumes. It also won't be foolproof, because the data flow has the concept of buffers, and in a situation like our sample 3-row load, all of it would fit in one buffer. All the rows in the buffer hit the Lookup at approximately the same time, so the target table will still not contain an A value when the third row flows through the component. You can put the brakes on the data flow and force it to process one row at a time by adjusting the buffer size to 1, but that's generally not going to be a good solution.
The other resolution would be to dedupe/handle survivorship. Which A row should make it to the database in the event our source has different values for the same business key? First, last, pick one? If you can't eliminate the duplicates before they hit the Data Flow, then you'll need to deduplicate the data, for example with an Aggregate component, to roll up your data as best you can.
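If it helps to see the survivorship rule outside of SSIS, here is a minimal C# sketch (hypothetical row type; "keep the last value per key" is just one possible rule) of the rollup the dedupe step needs to perform:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class SurvivorshipSketch
    {
        // Hypothetical source row: a business key plus a value.
        record SourceRow(string Key, string Value);

        static void Main()
        {
            var source = new List<SourceRow>
            {
                new("A", "abc"),
                new("B", "bcd"),
                new("A", "cde"),   // duplicate business key
            };

            // Survivorship rule: keep the last row seen for each key
            // (first, last, or "pick one" is a business decision).
            var deduped = source
                .GroupBy(r => r.Key)
                .Select(g => g.Last())
                .ToList();

            foreach (var row in deduped)
                Console.WriteLine($"{row.Key}|{row.Value}");   // A|cde, B|bcd
        }
    }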
Case sensitive lookups
Source data
Key|Value
A |abc
B |bcd
a |cde
Destination data
Key|Value
C |yza
B |zab
The other scenario where the Lookup component bites you is that in the default Full Cache mode, matching is based on .NET string-comparison rules. Thus AAA is not equal to AaA. If your lookup is doing string matching, then even if your database is case insensitive, the SSIS Lookup will not be.
In situations where I need to match alpha data, I usually make an extra/duplicate column in my source data which is the key data all in upper or lower case. If I am querying the data, I add it to my query. If I am loading from a flat file, then I use a Derived Column Component to add my column to the data flow.
I then ensure the data in my reference table is equally cased when I use the Lookup component.
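A minimal C# illustration of why the cached match is case sensitive and what the extra upper-cased column buys you (values are made up):

    using System;

    class CaseSensitivityDemo
    {
        static void Main()
        {
            string sourceKey = "a";
            string referenceKey = "A";

            // Ordinal comparison, which is effectively what the cached lookup does: no match.
            Console.WriteLine(string.Equals(sourceKey, referenceKey));              // False

            // Normalizing both sides, as the duplicate upper-cased column does: match.
            Console.WriteLine(string.Equals(sourceKey.ToUpperInvariant(),
                                            referenceKey.ToUpperInvariant()));      // True
        }
    }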
Lookup Component caveats
Full Cache mode:
- insensitive to changes to the reference data
- Case sensitive matches
- generally faster overall
- delays data flow until the lookup data has been cached
- NULL matches empty string
- Cached data requires RAM
No Cache mode:
- sensitive to changes in the reference data
- Case sensitivity of the match is based on the rules of the lookup system (if the DB is case sensitive, you're getting a case-sensitive match)
- Speed: it depends (100 rows of source data against 1,000 rows of reference data: no one will notice; 1B rows of source data against 10B rows of reference data: someone will notice. Are there indexes to support the lookups, etc.?)
- NULL matches nothing
- No appreciable SSIS memory overhead
Partial Cache:
The Partial Cache mode is mostly like the No Cache option, except that once it gets a match against the reference table it caches that value until execution is over or until the value gets pushed out due to memory pressure.
See also: Lookup Cache NULL answer.

Data migration in Informatica

A large amount of data is coming from source to target. After a successful insertion into the target, we have to change the status of every row to "committed". But how will we know whether all the data has arrived in the target without directly querying the source?
For example, suppose 10 records have been migrated from source to target.
We cannot change the status of all the records to "committed" before all the records have been successfully inserted into the target.
So before changing the status of all the records, how will we know whether an 11th record is still coming?
Is there anything that will give me the total number of records in the source?
I need a real-time answer.
We had the same scenario and this is what we did.
First of all, to check whether the data has been loaded into the target you can join the source and target tables. The update will lock the rows, so the commit must be fired at the database level on the target table (so that the lock for the update can happen).
After joining, update the loaded data based on the join with the target column.
A few things:
You have to stop your session (we used pmcmd to stop the session in a command task).
Update the data in your source table and restart the session.
Keep the load at around 20k-30k rows per batch so the update goes smoothly.

Entity Framework Algorithm For Combining Data

This pertains to a project I am inheriting and cannot change table structure or data access model. I have been asked to optimize the algorithm being used to insert data into the database.
We have a dataset in table T. From that, we pull a set we will call A. We also query an XML feed and get a set we will call X.
If a value from X is in A, record in A must be updated to reflect data for X.record
If a value from X is not in A, X.record must be inserted into A
If a value from A is not in X, A.record must be preserved in A
X must be fully iterated through for all records, and A must be updated
All these changes need to be inserted back into the database.
The algo as set up does the following:
Query XML into a LIST
foreach over the XML LIST
    look up foreach.item in A via LINQ, i.e.
        query = from record in A
                where record.GUID == foreach.item.GUID
                select record
    if query.Count() == 0
        insert into A (via context.AddToTableName(newTableNameObject))
    else
        var currentRecord = query.First()
        set all properties on currentRecord = properties from foreach.item
    context.SaveChanges()
I know this to be suboptimal. I tried to get A into an object (call it queryA) outside of the foreach loop in an effort to move the query into memory and avoid hitting the disk, but after thinking that through, I realized the database was already in memory.
Having added timer objects into the algo, it's clear that what is costing the most time is the SaveChanges() call. In some cases it's 20ms, and in others, inexplicably, it will jump to 100ms.
I would prefer to call SaveChanges() only one time. I can't figure out how to do that given my depth of knowledge of EF (which is thin at best) and the constraints of not being able to change the table structures and having to preserve the data from A which is not in X.
Suggestions?
I don't think that you will improve the performance much while using Entity Framework:
The query
Loading each record with a separate query is not good.
You can improve the performance by loading multiple records in the same query. For example, you can load a small batch of records by using either || in the condition or Contains (like IN in SQL). Contains is supported only in .NET 4.0.
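A hedged sketch of the Contains approach (the context type, entity set, and property names below are assumptions, not your actual model):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Sketch only: "MyEntities", "Record", "Records", and "XmlItem" are placeholder names.
    static Dictionary<Guid, Record> LoadExistingBatch(MyEntities context, List<XmlItem> xmlBatch)
    {
        var batchGuids = xmlBatch.Select(x => x.GUID).ToList();

        // One round trip loads every existing record for the batch;
        // Contains is translated into an IN (...) clause.
        var existing = context.Records
                              .Where(r => batchGuids.Contains(r.GUID))
                              .ToList();

        // In-memory lookup instead of one database query per XML item.
        return existing.ToDictionary(r => r.GUID);
    }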
Another improvement can be to replace the query with a stored procedure and a table-valued parameter: pass all the GUIDs to SQL Server, join A with X's GUIDs, and get the results. Table-valued parameters are only supported on SQL Server 2008 and newer.
The data modification
You should not call SaveChanges after each modification. You can call it once after the foreach loop and it will still work: it passes all the modifications in a single transaction, and according to this answer it can give you a significant performance boost.
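A sketch of that restructuring, using the same placeholder names as the batch-loading sketch above (MapToRecord and AddToRecords are hypothetical stand-ins for your mapping code and generated AddTo* method):

    // Load the whole batch up front, then track changes and save once.
    var existingByGuid = LoadExistingBatch(context, xmlItems);

    foreach (var item in xmlItems)
    {
        if (existingByGuid.TryGetValue(item.GUID, out var current))
        {
            // Update path: copy the properties from the XML item onto the tracked entity.
            current.Name = item.Name;              // ...and the remaining properties
        }
        else
        {
            // Insert path.
            context.AddToRecords(MapToRecord(item));
        }
    }

    // One call: every insert/update goes out in a single transaction,
    // although EF still issues one statement per row.
    context.SaveChanges();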
EF doesn't support command batching, and because of that each update or insert always takes a separate round trip to the database. There is no way around this when using EF to modify data, short of implementing a whole new EF ADO.NET provider (which is like starting a new project).
Again, the solution is to reduce the round trips by using a stored procedure with a table-valued parameter.
If your DB also uses that GUID as the primary key and clustered index, you take another performance hit from reordering the index after each insert (i.e., modifying data on disk).
The problem is not in the algorithm but in the way you process the data and the technology used to process it. Entity Framework is not a good choice for data pumps. You should take this information to your boss, because improving performance means a more complicated change to your application. It is not your fault, and it is not the fault of the programmer who wrote the application. This is a characteristic of EF which is not very well known and, as far as I know, it is not clearly stated in any MS best practices.

Synchronize DataSet

What is the best approach to synchronizing a DataSet with data in a database? Here are the parameters:
We can't simply reload the data because it's bound to a UI control which a user may have configured (it's a tree grid that they may expand/collapse)
We can't use a change flag (like an UpdatedTimeStamp column) in the database because changes don't always flow through the application (e.g. a DBA could update a field with a SQL statement)
We cannot use an update trigger in the database because it's a multi-user system
We are using ADO.NET DataSets
Multiple fields of a given row can change
I've looked at the DataSet's Merge capability, but this doesn't seem to keep the notion of an "ID" column. I've looked at the DiffGram capability, but the issue there is that those seem to be generated from changes within the same DataSet rather than changes that occurred on some external data source.
I've been running from this solution for a while, but the approach I know would work (with a lot of inefficiency) is to build a separate DataSet and then iterate over all rows, applying changes field by field to the DataSet the UI is bound to.
Has anyone had a similar scenario? What did you do to solve the problem? Even if you haven't run into a similar problem, any recommendation for a solution is appreciated.
Thanks
DataSet.Merge works well for this if you have a primary key defined for each DataTable; the DataSet will raise change events to any data-bound GUI controls.
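A minimal sketch of that pattern (table and column names are made up for illustration):

    using System.Data;

    // The bound DataTable needs a primary key so Merge can match rows by ID.
    var bound = new DataTable("Items");
    bound.Columns.Add("ID", typeof(int));
    bound.Columns.Add("Name", typeof(string));
    bound.PrimaryKey = new[] { bound.Columns["ID"] };
    bound.Rows.Add(1, "old name");

    // A freshly loaded copy of the same table (e.g. from a DataAdapter.Fill).
    var fresh = bound.Clone();              // same schema, including the key
    fresh.Rows.Add(1, "new name");          // changed row
    fresh.Rows.Add(2, "added row");         // new row

    // Rows are matched on the primary key: ID 1 is updated in place, ID 2 is added,
    // and the bound table raises its row-changed events for the grid to react to.
    bound.Merge(fresh, false, MissingSchemaAction.Ignore);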
If your table is small you can just re-read all of the rows and merge periodically; otherwise, limiting the set to be read with a timestamp is a good idea - just tell the DBAs to follow the rules and update the timestamp ;-)
Another option - which is a bit of work - is to keep a changed-row queue (timestamp, row ID) using a trigger or stored procedure, and base the refresh queries off of the timestamp in the queue; this will be more efficient if the base table has a lot of rows in it, allowing you (via an inner join on the queue record) to pull only the changed rows since the last poll time.
I think it would be easier to store a list of the nodes that the user has expanded (assuming you can uniquely identify each one), then re-load the data and re-bind it to the tree view, and then expand all the nodes previously expanded.
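A sketch of that approach, with a WinForms TreeView standing in for the tree grid (the actual control and the way you key nodes will differ):

    using System.Collections.Generic;
    using System.Windows.Forms;

    static class TreeStateHelper
    {
        // Remember which nodes are expanded, keyed by TreeNode.Name (assumed unique).
        public static HashSet<string> SaveExpandedNodes(TreeNodeCollection nodes)
        {
            var expanded = new HashSet<string>();
            foreach (TreeNode node in nodes)
            {
                if (node.IsExpanded)
                    expanded.Add(node.Name);
                expanded.UnionWith(SaveExpandedNodes(node.Nodes));
            }
            return expanded;
        }

        // After reloading and re-binding the data, re-expand the remembered nodes.
        public static void RestoreExpandedNodes(TreeNodeCollection nodes, HashSet<string> expanded)
        {
            foreach (TreeNode node in nodes)
            {
                if (expanded.Contains(node.Name))
                    node.Expand();
                RestoreExpandedNodes(node.Nodes, expanded);
            }
        }
    }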
