I have a small Azure Search index (28K documents / 50 MB) with around 600 updates a day coming from one Azure SQL Server data source, and I need an efficient, near "real-time" search solution (meaning that each time a row is created or updated in the DB, I would like the change to show up in my search results within one or two seconds at most). I would also like to avoid modifying all our code to push the data to the index each time we update our DB.
Is there a way to have some automation within Azure to update the index each time the Azure SQL Server DB is updated ... WITHOUT pushing the data?
From a Logic App checking every second or two for new or updated entries, and:
running the indexer when needed with a custom connector?
Or pushing the new rows to the index with a custom connector?
From a view with a timestamp column (but it seems that the minimum schedule interval for an indexer is 5 minutes)?
From a table with a SQL Integrated Change Tracking Policy (same issue ... 5 minutes seems to be the minimum schedule interval)?
Is there another way (without pushing data)?
It is possible to run an indexer on-demand using the Run Indexer API. This can work well for occasional updates. However, if you're constantly adding new rows to the SQL table, you may want to consider batching to improve indexing performance.
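For reference, running an indexer on demand is a single REST call. Below is a minimal sketch in Python; the service name, indexer name, API version and the environment variable holding the admin key are placeholders, so adjust them to your own setup.

```python
# Minimal sketch: trigger an Azure Search indexer on demand via the Run Indexer REST API.
import os
import requests

SERVICE = "my-search-service"   # placeholder: your search service name
INDEXER = "sql-indexer"         # placeholder: the indexer created for the SQL data source
API_VERSION = "2020-06-30"      # any recent stable api-version should work

def run_indexer() -> None:
    url = (f"https://{SERVICE}.search.windows.net"
           f"/indexers/{INDEXER}/run?api-version={API_VERSION}")
    resp = requests.post(url, headers={"api-key": os.environ["SEARCH_ADMIN_KEY"]})
    resp.raise_for_status()     # 202 Accepted means the run was queued

if __name__ == "__main__":
    run_indexer()
```

A Logic App (or an Azure Function) that polls for new or updated rows could make the same call through a custom connector.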
Correct, 5 minutes is currently the minimum supported schedule interval.
We have an Azure Search instance (S1, 2 replicas, 2 partitions) created in 2016, and when I tried to upload 50 million rows to this instance, we found out that the old instance still has a limit of 30 million records.
No problem; I created a new Azure Search instance (S1, 1 replica, 1 partition) and started to upload the same data. To my surprise, the upload speed is much better on the new instance compared to the old one (almost double the update speed).
I am wondering what the reason could be? The index I was uploading to is a new index, so no one is querying it. These are the differences I can see between the new and old search instances:
There is no query traffic on the new search instance; the old search instance does get traffic from the production environment, but on other search indexes.
The new search instance has 1 replica and 1 partition; the old one has 2 replicas and 2 partitions.
I'm just very curious why I see such a speed difference. If I run a search query, the performance is actually very similar between the old and new instances; only the index update speed is much, much better.
Query traffic is a factor, but it could also be the replica count. Every replica adds work to the indexing process, while every partition adds to the parallelism available for indexing. If you added a partition to your new service and indexing sped up further, that wouldn't be a surprising result.
All that said, the most likely explanation in your case is that your new service is running on faster hardware than the old one. This is how we were able to remove the document limit for new services.
I have 7 reports which are downloaded daily, late at night.
These reports can be downloaded in CSV or XML. I am downloading them in CSV format as it is more memory efficient.
This process runs in the background and is managed by Hangfire.
After they are downloaded, I use Dapper to run a stored procedure which inserts/updates/deletes data using MERGE statements. This stored procedure has seven table-valued parameters.
Instead of deleting, I set the record's IsActive column to false.
Note that 2 reports have more than 1 million records.
I am getting timeout exceptions only in Azure SQL; in SQL Server it works fine. As a workaround, I have increased the timeout to 1000 for this query.
The app is running in Azure (S2).
I have pondered the option of sending XML, but I have found that SQL Server is slow at processing XML, which is counterproductive.
I also cannot use SqlBulkCopy, as I have to update based on some conditions.
Also note that more reports will be added in the future.
Also, when a new report is added, there is a large number of inserts; if a previously added report is run again, it is mostly updates.
These tables currently do not have any indexes, only a clustered integer primary key.
Each row has a unique code. This code is used to decide whether to insert/update/delete.
Can you recommend a way to increase performance?
Is your source sending the whole data set, whether the rows are updated or new? I assume that by mentioning the unique code (insert/update/delete) you are only processing changes (the delta). If not, that is one area to address. Another is to consider parallelism; for that you would need a separate stored procedure for each table, and tables that do not depend on each other could be processed together.
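To illustrate the parallelism idea (not the asker's actual C#/Dapper code), here is a rough Python sketch that runs per-table procedures concurrently. It assumes the single procedure has been split into one procedure per independent table; procedure names and the connection string are placeholders, and how each procedure receives its data is left out.

```python
# Rough sketch: run independent per-table merge procedures in parallel.
from concurrent.futures import ThreadPoolExecutor
import pyodbc

CONN_STR = "Driver={ODBC Driver 17 for SQL Server};Server=...;Database=...;..."

# Procedures for tables that do not depend on each other can run together.
INDEPENDENT_PROCS = ["dbo.MergeReportA", "dbo.MergeReportB", "dbo.MergeReportC"]

def run_proc(proc_name: str) -> None:
    # One connection per worker; each proc merges only its own table.
    conn = pyodbc.connect(CONN_STR)
    try:
        conn.execute(f"EXEC {proc_name}")
        conn.commit()
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=len(INDEPENDENT_PROCS)) as pool:
    futures = [pool.submit(run_proc, p) for p in INDEPENDENT_PROCS]
    for future in futures:
        future.result()   # re-raises any exception from the worker
```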
We have an ETL process which ingests data every 5 minutes from different source systems (AS400, Oracle, SAP, etc.) into our SQL Server database, and from there we ingest the data into an Elastic index every 5 minutes so that both are in sync.
I want to tighten the timeframe to seconds rather than 5 minutes, and I want to make sure they are both in sync at all times.
I am using a control log table to make sure the Elastic ingestion and the SSIS ETL are not running at the same time, so that we do not go out of sync. This is a very poor solution and does not let me achieve near-real-time data capture.
I am looking for a better solution to sync the SQL Server database and the Elastic index in near real time rather than doing it manually.
Note: I am currently using Python scripts to pump the data from SQL to the Elastic index.
One approach would be to have an event stream coming out of your database, or even directly out of the SSIS package run (which might actually be simpler to implement), that feeds directly into your Elasticsearch index. ELK handles streaming log files, so it should handle an event stream pretty well.
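Since the data pump is already in Python, here is a minimal sketch of the Elasticsearch side: take whatever batch of changed rows the event stream (or a change feed query) delivers and bulk-index it, so each batch lands in the index within seconds. The host, index name and the "id" column are assumptions.

```python
# Sketch: bulk-index a batch of changed rows into Elasticsearch.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])   # adjust to your cluster
INDEX = "sales"                                  # assumption: target index name

def index_changes(rows):
    """rows: dicts produced by your SQL-side change feed, each with an 'id' key."""
    actions = (
        {
            "_op_type": "index",   # overwrite/update the document with the same _id
            "_index": INDEX,
            "_id": row["id"],
            "_source": row,
        }
        for row in rows
    )
    helpers.bulk(es, actions)
```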
In an on-premises SQL Server database, I have a number of tables into which various sales data for a chain of stores is inserted during the day. I would like to "harvest" this data to Azure every, say, 15 minutes via Data Factory and an on-premises data management gateway. Clearly, I am not interested in copying all table data every 15 minutes, but only the rows that have been inserted since the last fetch.
As far as I can see, the documentation suggests using data "slices" for this purpose. However, these slices seem to require a timestamp column (e.g. a datetime) to exist on the tables that data is fetched from.
Can I perform a "delta" fetch (i.e. only fetch the rows inserted since the last fetch) without having such a timestamp column? Could I use a sequential integer column instead? Or even have no incrementally increasing column at all?
Assume that the last slice fetched had a window from 08:15 to 08:30. Now, if the clock on the database server is a bit behind the Azure clock, it might add some rows with the timestamp being set to 08:29 after that slice was fetched, and these rows will not be included when the next slice (08:30 to 08:45) is fetched. Is there a smart way to avoid this problem? Shifting the slice window a few minutes into the past could minimize the risk, but not totally eliminate it.
Take Azure Data Factory out of the equation. How do you arrange for transfer of deltas to a target system? I think you have a few options:
Add date created / date changed columns to the source tables and write parameterised queries to pick up only new or modified values. ADF supports this scenario with time slices and system variables. Regarding an identity column, you could do that with a stored procedure (as per here) and a table tracking the last ID sent (see the sketch after these options).
Enable Change Data Capture (CDC) on the source system. This will allow you to access deltas via the CDC functions. Wrap them in a proc and call it with the system variables, similar to the above example.
Always transfer all data, e.g. to staging tables on the target, and use delta code (EXCEPT and MERGE) to work out which records have changed. Obviously not ideal for large volumes, but this works for small volumes.
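As referenced in the first option, here is a rough sketch of the watermark pattern outside ADF: keep the last value sent in a small tracking table and pull only rows above it. It works with an identity column as well as with a datetime column; table and column names are illustrative.

```python
# Rough sketch: pull only the rows above the last transferred ID (the watermark).
import pyodbc

conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};Server=...;Database=...;...")

def fetch_delta():
    cur = conn.cursor()
    # Last ID that was successfully transferred (single-row tracking table).
    last_id = cur.execute("SELECT LastId FROM dbo.TransferWatermark").fetchval()

    rows = cur.execute(
        "SELECT SaleId, StoreId, Amount, SoldAt "
        "FROM dbo.Sales WHERE SaleId > ? ORDER BY SaleId",
        last_id,
    ).fetchall()

    if rows:
        # ... hand the rows to the copy process, then advance the watermark ...
        cur.execute("UPDATE dbo.TransferWatermark SET LastId = ?", rows[-1].SaleId)
        conn.commit()
    return rows
```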
HTH
We are planning to add this capability to ADF. It may start with support for a sequential integer column instead of a timestamp. Could you please let me know whether a sequential integer column would help?
By enabling "Change Tracking" on SQL Server, you can leverage on the "SYS_CHANGE_VERSION " to incrementally load data from On-premise SQL Server or Azure SQL Database via Azure Data Factory.
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-tracking-feature-portal
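For reference, a minimal sketch of the query pattern the linked tutorial automates, assuming change tracking is already enabled on the database and on the table (table and column names are illustrative):

```python
# Sketch: pull everything that changed since the last synced change tracking version.
import pyodbc

conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};Server=...;Database=...;...")
cur = conn.cursor()

last_synced_version = 0  # in practice, persisted between runs

# Capture the current version first, so the next run starts exactly where this one ends.
current_version = cur.execute("SELECT CHANGE_TRACKING_CURRENT_VERSION()").fetchval()

changes = cur.execute(
    """
    SELECT ct.SaleId, ct.SYS_CHANGE_OPERATION, s.StoreId, s.Amount
    FROM CHANGETABLE(CHANGES dbo.Sales, ?) AS ct
    LEFT JOIN dbo.Sales AS s ON s.SaleId = ct.SaleId  -- NULL columns => the row was deleted
    """,
    last_synced_version,
).fetchall()

# ... copy `changes` to the target, then persist current_version as the new watermark ...
```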
If using SQL Server 2016, see https://msdn.microsoft.com/en-us/library/mt631669.aspx#Enabling-system-versioning-on-a-new-table-for-data-audit. Otherwise, you can implement the same thing using triggers.
And use NTP to synchronize your server time.
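For the SQL Server 2016 route, here is a rough sketch of enabling system versioning on an existing table, following the pattern in the linked article; table, column and history-table names are illustrative.

```python
# Rough sketch: turn an existing table into a system-versioned (temporal) table,
# so every change is captured with valid-from/valid-to timestamps a delta copy can filter on.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};Server=...;Database=...;...",
    autocommit=True,
)
cur = conn.cursor()

cur.execute("""
    ALTER TABLE dbo.Sales ADD
        ValidFrom datetime2 GENERATED ALWAYS AS ROW START HIDDEN
            CONSTRAINT DF_Sales_ValidFrom DEFAULT SYSUTCDATETIME(),
        ValidTo datetime2 GENERATED ALWAYS AS ROW END HIDDEN
            CONSTRAINT DF_Sales_ValidTo DEFAULT CONVERT(datetime2, '9999-12-31 23:59:59.9999999'),
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
""")

cur.execute("""
    ALTER TABLE dbo.Sales
        SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.SalesHistory))
""")
```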
I'm currently on a POS project. Users require that this application work both online and offline, which means they need a local database. I decided to use SQL Server replication between each shop and the head office. Each shop needs to install SQL Server Express, and the head office already has SQL Server Enterprise Edition. Replication runs every 30 minutes on a schedule, and I chose merge replication because data can change at both the shop and the head office.
While doing a POC, I found that this solution does not work properly; sometimes the job fails and I need to re-initialize it. It also takes a very long time, which is obviously unacceptable to the users.
I want to know: are there any solutions better than the one I'm using now?
Update 1:
Constraints of the system are:
Almost all transactions can occur at both the shop and the head office.
Some transactions need to work in real-time mode; that is, after a user saves data at their local shop, that data should also be updated at the head office (if they're currently online).
Users can keep working even when their shop is disconnected from the head office database.
Our estimate of the data volume is at most 2,000 rows per day.
Windows 2003 is the OS of the server at the head office, and Windows XP is the OS of all clients.
Update 2:
Currently there are about 15 clients, but this number will grow at a fairly slow rate.
The data size is about 100 to 200 rows per replication; I think it is not more than 5 MB.
Clients connect to the server over a leased line at 128 kbps.
I'm in a situation where replication takes a very long time (about 55 minutes, while we only have 5 minutes or so), and most of the time I need to re-initialize the job to start replication again; if I don't re-initialize it, it can't replicate at all. In my POC, I found that it always takes a very long time to replicate after re-initializing, and the amount of time doesn't depend on the amount of data. By the way, re-initializing is the only fix I have found for this problem.
Given the above, I conclude that replication may not be suitable for my problem, and I think there may be another, better solution that can meet the requirements in Update 1.
Sounds like you may need to roll your own bi-directional replication engine.
Part of the reason things take so long is that over such a narrow link (128 kbps), the two databases have to be consistent (so they need to check all rows) before replication can start. As you can imagine, this can (and does) take a long time. Even 5 MB would take over five minutes to transfer over this link (128 kbps is roughly 16 KB per second).
When writing your own engine, you need to decide what needs to be replicated (e.g. using timestamps to detect which items have changed), figure out conflict resolution (what happens if the same record changed in both places between replication runs), and more. This is not easy.
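As a rough sketch of the conflict-resolution piece, here is a last-writer-wins rule based on the change timestamps mentioned above; purely illustrative, since real POS data may need per-column or business-rule-based resolution instead.

```python
# Sketch: last-writer-wins conflict resolution based on each row's change timestamp.
from datetime import datetime

def resolve_conflict(shop_row: dict, office_row: dict) -> dict:
    """Both sides changed the same record since the last sync; keep the newer one."""
    shop_ts: datetime = shop_row["modified_at"]
    office_ts: datetime = office_row["modified_at"]
    return shop_row if shop_ts >= office_ts else office_row
```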
My suggestion is to use MS Access locally and push updates to the server at a certain interval. Add an Updated column to every table; when a record is added or updated, set the Updated column. For deletions you need a separate table where you store the primary key value and the table name. When synchronizing, fetch all local records whose Updated flag is set, upsert them (modify or insert) on the central server, and clear the flag; then apply the deletions recorded in the local deleted-records table, and you are done!
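Here is a rough sketch of that flag-based sync, assuming each table has an Updated column and a DeletedRecords table as described; connection strings, table and column names are illustrative, and the local side is shown via plain ODBC just to show the flow.

```python
# Sketch: push locally flagged rows to the central server, clear the flag, replay deletions.
import pyodbc

local = pyodbc.connect("DSN=LocalShopDb")      # e.g. the local DB via an ODBC DSN
central = pyodbc.connect("DSN=HeadOfficeDb")   # central SQL Server

def sync_table(table: str, key: str, columns: list) -> None:
    col_list = [key] + columns
    changed = local.execute(
        f"SELECT {', '.join(col_list)} FROM {table} WHERE Updated <> 0").fetchall()

    for row in changed:
        values = list(row)                     # [key value, other column values...]
        # Upsert on the central server: try UPDATE first, INSERT if nothing matched.
        set_clause = ", ".join(f"{c} = ?" for c in columns)
        updated = central.execute(
            f"UPDATE {table} SET {set_clause} WHERE {key} = ?",
            values[1:] + [values[0]]).rowcount
        if updated == 0:
            placeholders = ", ".join("?" for _ in col_list)
            central.execute(
                f"INSERT INTO {table} ({', '.join(col_list)}) VALUES ({placeholders})",
                values)
        # Clear the flag locally so the row is not sent again next time.
        local.execute(f"UPDATE {table} SET Updated = 0 WHERE {key} = ?", values[0])

    central.commit()
    local.commit()

def sync_deletes() -> None:
    # DeletedRecords holds (TableName, KeyValue); assumes the key column is called Id.
    for table, key_value in local.execute(
            "SELECT TableName, KeyValue FROM DeletedRecords").fetchall():
        central.execute(f"DELETE FROM {table} WHERE Id = ?", key_value)
    local.execute("DELETE FROM DeletedRecords")
    central.commit()
    local.commit()
```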
I assume that your central server is only for collecting data.
I currently do exactly what you describe using SQL Server Merge Replication configured for Web Synchronization. I have my agents run on a 1-minute schedule and have had success.
What kind of error messages are you seeing?