Do Azure Search replicas/partitions impact index update speed? - azure-cognitive-search

We have an Azure Search instance (S1, 2 replicas, 2 partitions) created in 2016. When I tried to upload 50 million rows to it, we found out that the old instance still has a limit of 30 million records.
No problem: I created a new Azure Search instance (S1, 1 replica, 1 partition) and started uploading the same data. To my surprise, the upload speed is much better on the new instance than on the old one (almost double the update speed).
I am wondering what the reason could be. The index I was uploading to is a new index, so no one queries it. These are the differences I can see between the new and old search instances:
There is no query traffic on the new search instance; the old search instance does get traffic from the production environment, but it is on other search indexes.
The new search instance has 1 replica and 1 partition; the old one has 2 replicas and 2 partitions.
I am just very curious why I see such a speed difference. If I run a search query, the performance is very similar between old and new. Only the index update speed is much, much better.

Query traffic is a factor, but it could also be the replica count. Every replica adds work to the indexing process, while every partition adds to the parallelism available for indexing. If you added a partition to your new service and indexing sped up further, that wouldn't be a surprising result.
All that said, the most likely explanation in your case is that your new service is running on faster hardware than the old one. This is how we were able to remove the document limit for new services.

Related

Long loading time after creating Availability Groups and migrating in SQL

I have this issue. Our client uses MS SQL databases. Two months ago they migrated their databases to SQL Server 2019 Enterprise from an earlier version and Standard edition.
Their main reason was to secure high availability through the Availability Groups feature in MS SQL.
After that, our application slowed down considerably. Put simply, the customer starts the app and selects a workspace, and then it takes about 15 seconds to load the data.
The first step just sends a request to the database to select data: no inserts, deletes, or other heavy processing.
The app works with geographical and geometry data; every geo object is saved in the database as the geometry data type. The first large select is what causes the slowness.
When I looked at Activity Monitor, the only thing under wait categories that seemed suspicious to me was the type Other.
In the database I don't see any high-cost queries, and the availability group mode is set to synchronous.
If I understand this correctly, synchronous mode should not be the cause of this problem, because this database is used only for reading data, not, as I mentioned, for modifying it.
I changed some instance parameters: I set Optimize for Ad hoc Workloads to True and raised the cost threshold for parallelism from 5 to 20.
Another thing I tried was creating a new app source database and a database containing the geo data inside that SQL instance, without adding them to the availability groups.
For testing purposes, the application connects to that one instance with the new test databases.
Neither of these changes worked. If you have any ideas or experience with this, please help me.
Here is a screenshot of the top 10 waits from the sys DMVs.
1 - Stats recompute...
When you go from one SQL Server version to a higher one, you must first change the compatibility level (to get some performance benefits) and then recompute all statistics in the database with a FULLSCAN. Why? Because each version of SQL Server comes with a new optimizer that has new operators, new algorithms and many improvements. To keep pace with the new optimizer, the method of computing statistics and the form of the results are rethought with each modification of the engine, so much so that using old statistics with a new engine is like using the 1930 population census to plan the construction of roads, schools and hospitals for the current population.
2 - SQL Server Editions...
When upscaling SQL Server from Standard to Enterprise, you need to increase the "hardware" (even if it is a VM), because many of the features that run under Enterprise edition and do not exist in Standard need more computational resources. As an example, using AUTO_UPDATE_STATISTICS_ASYNC will automatically use one more thread, to the detriment of other processes. By comparison, driving a Rolls-Royce or a Hummer instead of a Volkswagen is arguably more comfortable and faster, but requires more fuel and more expensive insurance!
3 - Synchronous AG...
Synchronous Always On availability groups must have a very fast and faultless network. If this is not the case, the replication of update requests can drag performance down, especially if you are using pessimistic locking (the default mode).
4 - Transaction logs...
One common cause of a global lack of performance is the latency of writing the transaction log.
5 - Tempdb files...
Another common cause of a global lack of performance is the latency of accessing the tempdb files.
For those two file problems, use Glenn Berry's file latency query, which will give you an indication. Good values are under 7 ms for reads and 15 ms for writes.
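To make the thresholds concrete, here is a minimal sketch of the arithmetic behind a per-file latency check. It is not Glenn Berry's query; it only assumes you have already read the cumulative counters `num_of_reads`, `io_stall_read_ms`, `num_of_writes` and `io_stall_write_ms` for each file (these are columns of `sys.dm_io_virtual_file_stats`). The class name, function names and sample figures are made up for illustration.

```python
from dataclasses import dataclass

# Thresholds quoted in the answer above, in milliseconds.
GOOD_READ_MS = 7.0
GOOD_WRITE_MS = 15.0


@dataclass
class FileIoStats:
    """Cumulative I/O counters for one database file (hypothetical container)."""
    file_name: str
    num_of_reads: int
    io_stall_read_ms: int
    num_of_writes: int
    io_stall_write_ms: int


def average_latency_ms(stall_ms: int, operations: int) -> float:
    """Average stall per operation; 0.0 if the file was never touched."""
    return stall_ms / operations if operations else 0.0


def assess(stats: FileIoStats) -> dict:
    """Compare a file's average read/write latency with the thresholds."""
    read_ms = average_latency_ms(stats.io_stall_read_ms, stats.num_of_reads)
    write_ms = average_latency_ms(stats.io_stall_write_ms, stats.num_of_writes)
    return {
        "file": stats.file_name,
        "read_ms": read_ms,
        "write_ms": write_ms,
        "read_ok": read_ms < GOOD_READ_MS,
        "write_ok": write_ms < GOOD_WRITE_MS,
    }
```

Because the counters are cumulative since the instance started, the result is a long-run average; a recent slowdown may need two samples and a difference between them.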
CONCLUSION
Many other factors can contribute to slowing down your system. But without more information, we cannot help you further.

Why is my MongoDB database size doubled after sharding?

After my MongoDB database grew quite a bit, I decided to shard the collections.
So I created a new sharded cluster and imported my old data into the cluster using mongorestore.
The sh.status() command shows that everything works fine, as shown below.
However, my db size doubled. It seems like, instead of balancing, the entire db was cloned to both shards.
The images show the result of running db.stats() on both the old unsharded db and the new sharded one. No new data was inserted into the new db after the restore.
Is this a bug in the MongoDB balancer, or am I missing something?
Statistics in databases are approximate and frequently delayed. When a chunk is moved from one shard to another, it is shown as existing on both shards for some time. This is probably because it does in fact exist in two places for a while.
To find out the optimal database size, insert your documents in such a way that no balancing is needed (each document is written into its final shard up front), then measure disk space usage.
To find out actual disk usage, look at disk usage instead of statistics.
Note that all databases have overhead when storing data on disk to attain better performance. In case of MongoDB this can be significant and there are various options for tuning various aspects between the server itself and WiredTiger.

How to update an Azure Search index every second?

I have a small azure search index (28K doc / 50 Mb) with around 600 updates a day from one Azure SQL server data source and need to have an efficient search solution near "real time" (meaning that each time a row is created or updated in the DB I would like to have the updates in my search results within one or 2 seconds max). I also would like to avoid modifying all our code to push the data to the index each time we update our DB.
Is there a way to have some automation within Azure to update the index each time the Azure SQL server DB is updated ... WITHOUT pushing the data?
From a logic app checking every second or two for new or updated entries:
and running the indexer when needed with a custom connector?
or pushing new rows to the index with a custom connector?
From a view with a timestamp column (but it seems that the minimum indexer autorun delay is 5 minutes)?
From a table with the SQL Integrated Change Tracking Policy (same: 5 minutes seems to be the minimum update interval)?
Is there another way (without pushing data)?
It is possible to run an indexer on-demand using the Run Indexer API. This can work well for occasional updates. However, if you're constantly adding new rows to the SQL table, you may want to consider batching to improve indexing performance.
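One way to read the batching advice is to coalesce bursts of row changes into a single on-demand run instead of triggering one run per row. The sketch below only plans when runs would fire; it does not call the Run Indexer API. The function name and the `window` parameter are assumptions for illustration.

```python
from typing import List, Sequence


def plan_indexer_runs(change_times: Sequence[float], window: float) -> List[float]:
    """Plan on-demand indexer runs for a series of row changes.

    A run is scheduled `window` seconds after the first change that is
    not yet covered; every change at or before that run time is picked
    up by the same run.
    """
    runs: List[float] = []
    for t in sorted(change_times):
        if not runs or t > runs[-1]:
            runs.append(t + window)
    return runs
```

With changes at 0, 0.5, 1.5 and 10 seconds and a 2-second window, this plans two runs instead of four. A smaller window gives fresher search results at the cost of more indexer runs.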
Correct, 5 minutes is currently the minimal supported schedule interval.

How to administer storage of a ClickHouse server in a cluster when disks get full

I'm setting up a ClickHouse server in a cluster, but one thing that doesn't appear in the documentation is how to manage very large amounts of data. It says that it can handle up to petabytes of data, but you can't store that much data on a single server; you will usually have a few terabytes on each.
So my question is: how can I store data on one node of the cluster and then, when it requires more space, add another? Will it handle the distribution to the new server automatically, or will I have to play with the weights in the shard distribution?
When you have more than 1 disk in one server, how can it use them all to store the data?
Is there a way to store very old data in the cloud and download it if needed? For example, all data older than 2 years could be stored in Amazon S3, as it will hardly ever be requested; if it is, it will take longer to retrieve the data, but that wouldn't be a problem.
What solution would you suggest for this, to handle an ever-expanding database and avoid disk space issues in the future?
Thanks
I will assume that you use standard configuration for the ClickHouse cluster: several shards consisting of 2-3 replica nodes, and on each of these nodes a ReplicatedMergeTree table containing data for its respective shard. There are also Distributed tables created on one or more nodes that are configured to query the nodes of the cluster (relevant section in the docs).
When you add a new shard, old data is not moved to it automatically. The recommended approach is indeed to "play with the weights", as you have put it, i.e. increase the weight of the new node until the volume of data is even. But if you want to rebalance the data immediately, you can use the ALTER TABLE RESHARD command. Read the docs carefully and keep in mind the various limitations of this command, e.g. it is not atomic.
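To get a feel for how the weights play out, here is a small sketch of the arithmetic. It assumes inserts through the Distributed table are spread across shards in proportion to their weights and that the sharding key spreads rows evenly. The function names and the two-shard simplification are mine.

```python
from fractions import Fraction
from typing import Dict


def insert_share(weights: Dict[str, int]) -> Dict[str, Fraction]:
    """Fraction of newly inserted rows that each shard receives."""
    total = sum(weights.values())
    return {shard: Fraction(weight, total) for shard, weight in weights.items()}


def rows_until_even(existing_rows: int, new_weight: int, old_weight: int = 1) -> Fraction:
    """New rows needed before an empty shard holds as much as the old one.

    Two-shard case: solves existing + N * old / total == N * new / total for N.
    """
    if new_weight <= old_weight:
        raise ValueError("the new shard never catches up unless its weight is higher")
    total = old_weight + new_weight
    return Fraction(existing_rows * total, new_weight - old_weight)
```

For example, with 100 rows on the old shard and weights 1 and 3, the new shard catches up after 200 more rows (50 land on the old shard, 150 on the new one). After that point the weights would need to be set back to equal values.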
When you have more than 1 disk in one server, how can it use them all to store the data?
Please read the section on configuring RAID in the administration tips.
Is there a way to store very old data in the cloud and download it if needed? For example, all data older than 2 years could be stored in Amazon S3, as it will hardly ever be requested; if it is, it will take longer to retrieve the data, but that wouldn't be a problem.
MergeTree tables in ClickHouse are partitioned by month. You can use the ALTER TABLE DETACH/ATTACH PARTITION commands to manipulate partitions. For example, at the start of each month you can detach the partition for some older month and back it up to Amazon S3. Or you can set up a cluster of cheaper machines with ample disk space and manually move old partitions there. If your queries always include a filter on date, irrelevant partitions will be skipped automatically; otherwise you can set up two Distributed tables: table_recent and table_all (with the cluster config including the nodes with old partitions).
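The monthly detach step could be scripted roughly as follows. This is only a sketch: it assumes monthly partitions identified as YYYYMM, a hypothetical table name, and that the list of existing partitions has been obtained elsewhere. It only builds the statements; running them and copying the detached data to S3 is left out.

```python
from datetime import date
from typing import List


def partitions_older_than(existing: List[str], today: date, keep_months: int) -> List[str]:
    """Return YYYYMM partition IDs older than `keep_months` months before today's month."""
    # Count months since year 0 so the subtraction handles year boundaries.
    cutoff_index = today.year * 12 + (today.month - 1) - keep_months
    cutoff = f"{cutoff_index // 12:04d}{cutoff_index % 12 + 1:02d}"
    # Zero-padded YYYYMM strings sort chronologically.
    return sorted(p for p in existing if p < cutoff)


def detach_statements(table: str, partitions: List[str]) -> List[str]:
    """Build one DETACH PARTITION statement per partition ID."""
    return [f"ALTER TABLE {table} DETACH PARTITION {p}" for p in partitions]
```

For instance, on 2020-03-15 with a 24-month retention, partitions `201801` and `201802` would be selected while `201803` is kept.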
Version 19.15 introduced multi-disk storage configuration. Version 20.1 introduced time-based data rearrangement.

What is the best solution for POS application?

I'm currently working on a POS project. The users require that this application work both online and offline, which means they need a local database. I decided to use SQL Server replication between each shop and the head office. Each shop needs to install SQL Server Express, and the head office already has SQL Server Enterprise Edition. Replication will run every 30 minutes on a schedule, and I chose Merge Replication because data can change at both the shop and the head office.
When doing a POC, I found that this solution does not work properly: sometimes the job fails and I need to re-initialize it. This solution also takes a very long time, which is obviously unacceptable to the users.
I want to know: are there any solutions better than the one I'm using now?
Update 1:
The constraints of the system are:
Almost all transactions can occur at both the shop and the head office.
Some transactions need to work in real-time mode; that is, after a user saves data at their local shop, that data should also be updated at the head office (if they're currently online).
Users can keep working even when their shop is disconnected from the head office database.
Our estimate of the amount of data is at most 2,000 rows per day.
Windows 2003 is the OS of the server at the head office, and Windows XP is the OS of all clients.
Update 2:
Currently there are about 15 clients, but this number will grow at a fairly slow rate.
The data size is about 100 to 200 rows per replication; I think it should be no more than 5 MB.
Clients connect to the server over a leased-line connection at 128 kbps.
I'm in a situation where replication takes a very long time (about 55 minutes, while we only have 5 minutes or so), and most of the time I need to re-initialize the job to start replicating again; if I don't re-initialize the job, it can't replicate at all. In my POC, I find that it always takes a very long time to replicate after re-initializing, and the amount of time doesn't depend on the amount of data. By the way, re-initializing is the only solution I have found that works for my problem.
Given the above, I conclude that replication may not be suitable for my problem, and I think there may be a better solution that can serve what I need in Update 1:
Sounds like you may need to roll your own bi-directional replication engine.
Part of the reason things take so long is that, over such a narrow link (128 kbps), the two databases have to be checked for consistency (so all rows need to be compared) before replication can start. As you can imagine, this can (and does) take a long time. Even 5 Mb (megabits) would take around 40 seconds to transfer over this link, and 5 MB (megabytes) more than five minutes.
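The link arithmetic is easy to check, and the result depends a lot on whether the payload is read as 5 megabits or 5 megabytes. A minimal sketch, assuming decimal units (1 kbps = 1,000 bits per second) and ignoring protocol overhead:

```python
def transfer_seconds(payload_bits: float, link_bits_per_second: float) -> float:
    """Ideal transfer time: payload size divided by link speed."""
    return payload_bits / link_bits_per_second


LINK_BPS = 128_000                   # 128 kbps leased line
FIVE_MEGABITS = 5 * 1_000_000        # 5 Mb
FIVE_MEGABYTES = 5 * 1_000_000 * 8   # 5 MB expressed in bits

print(transfer_seconds(FIVE_MEGABITS, LINK_BPS))   # 39.0625 seconds
print(transfer_seconds(FIVE_MEGABYTES, LINK_BPS))  # 312.5 seconds, a bit over 5 minutes
```

Either way, the raw transfer is a small part of a 55-minute run, which is consistent with the consistency check dominating.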
When writing your own engine, decide what needs to be replicated (using timestamps for when items changed), figure out conflict resolution (what happens if the same record changed in both places between replication periods) and more. This is not easy.
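As a rough illustration of those two decisions, the sketch below detects changes by timestamp and resolves conflicts with a "last writer wins" rule. That rule is just one possible policy, chosen here for brevity. All names are hypothetical, and the sketch assumes shop and head-office clocks are synchronized well enough to compare timestamps.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Row:
    value: str
    modified_at: float  # timestamp of the last change to this row


def changed_since(table: Dict[int, Row], last_sync: float) -> Dict[int, Row]:
    """Rows modified after the previous synchronization."""
    return {key: row for key, row in table.items() if row.modified_at > last_sync}


def merge_last_writer_wins(shop: Dict[int, Row], head_office: Dict[int, Row],
                           last_sync: float) -> List[int]:
    """Exchange changes in both directions; return the keys that conflicted.

    A conflict is a key changed on both sides since last_sync. The row
    with the newer modified_at wins; on an exact tie the shop's row is kept.
    """
    shop_changes = changed_since(shop, last_sync)
    office_changes = changed_since(head_office, last_sync)
    conflicts = sorted(shop_changes.keys() & office_changes.keys())
    for key in shop_changes.keys() | office_changes.keys():
        candidates = [row for row in (shop_changes.get(key), office_changes.get(key))
                      if row is not None]
        winner = max(candidates, key=lambda row: row.modified_at)
        shop[key] = winner
        head_office[key] = winner
    return conflicts
```

Only the changed rows cross the link, which matters at 128 kbps. Deletions are not covered here and would need their own log, and a real system may prefer to report conflicts to a person rather than overwrite silently.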
My suggestion is to use MS Access locally and keep pushing data to the server at a certain interval. Add an updated column to every table. When a record is added or updated, set the updated column. For deletions you need a separate table where you can put the primary key value and table name. When synchronizing, fetch all local records whose updated field is set and apply them (modify or insert) to the central server, then clear the field. Delete all records listed in the local deleted table, and you are done!
I assume that your central server is only for collecting data.
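Here is a minimal sketch of that scheme, treating a set `updated` flag as "needs to be sent". It uses Python's built-in `sqlite3` with in-memory databases as a stand-in for the local MS Access file and the central server. The `sales` and `deleted_rows` tables and all function names are made up for illustration.

```python
import sqlite3


def make_local() -> sqlite3.Connection:
    """Local shop database: data table with a dirty flag, plus a deletion log."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL,
                            updated INTEGER DEFAULT 1);
        CREATE TABLE deleted_rows (table_name TEXT, pk INTEGER);
    """)
    return db


def make_central() -> sqlite3.Connection:
    """Central database that only collects data."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
    return db


def delete_local(local: sqlite3.Connection, pk: int) -> None:
    """Delete a local row and log the deletion for the next synchronization."""
    local.execute("DELETE FROM sales WHERE id = ?", (pk,))
    local.execute("INSERT INTO deleted_rows VALUES ('sales', ?)", (pk,))


def synchronize(local: sqlite3.Connection, central: sqlite3.Connection):
    """Push flagged rows and logged deletions; return (rows sent, rows deleted)."""
    # 1. Send every row that was added or changed since the last run.
    dirty = local.execute("SELECT id, amount FROM sales WHERE updated = 1").fetchall()
    central.executemany(
        "INSERT OR REPLACE INTO sales (id, amount) VALUES (?, ?)", dirty)
    # 2. Replay the deletions recorded locally.
    gone = local.execute(
        "SELECT pk FROM deleted_rows WHERE table_name = 'sales'").fetchall()
    central.executemany("DELETE FROM sales WHERE id = ?", gone)
    # 3. Clear the flags and the log so nothing is sent twice.
    local.execute("UPDATE sales SET updated = 0")
    local.execute("DELETE FROM deleted_rows WHERE table_name = 'sales'")
    central.commit()
    local.commit()
    return len(dirty), len(gone)
```

Note that every local update has to set `updated = 1` itself (or via a trigger), and that this sketch is one-directional: it does not handle changes made at the head office.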
I currently do exactly what you describe using SQL Server Merge Replication configured for Web Synchronization. I have my agents run on a 1-minute schedule and have had success.
What kind of error messages are you seeing?

Resources