If I have an IBM Cloudant database with around one million documents in it and I set up a replication process to copy this data to another region, how can I tell how far the replication job has progressed? I know when it starts and when it has finished, but nothing in between. Is there a way to track progress?
If the target database was empty at the start of the replication, and there are no other changes being written to the target, then it's a case of waiting until the sequence token of the target database matches the sequence token of the source.
You can find the current sequence token of a database by using the GET /<database name> endpoint on both the source and the target, e.g.
curl $URL/sourcedb
{
"update_seq": "23600-g1AAAARXeJyd0",
"db_name": "sourcedb",
...
}
In the above example, there are 23600 changes that the replicator needs to work through. The same command can be run against the target database to see the progress of replication.
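If you want to automate that check, here is a minimal sketch that polls both databases and compares the numeric prefix of their update_seq values; the URLs and credentials are placeholders, it assumes nothing else is writing to the target, and the prefix should be treated as an approximate change count rather than an exact figure.
# Sketch: estimate replication progress by comparing the numeric prefix of
# the source and target sequence tokens. URLs and credentials are placeholders.
import requests

SOURCE_DB = "https://source.cloudant.com/sourcedb"   # placeholder
TARGET_DB = "https://target.cloudant.com/targetdb"   # placeholder
AUTH = ("apikey-user", "apikey-password")            # placeholder credentials

def seq_prefix(db_url):
    """Return the integer prefix of the database's update_seq token."""
    info = requests.get(db_url, auth=AUTH).json()
    return int(info["update_seq"].split("-")[0])

source_seq = seq_prefix(SOURCE_DB)
target_seq = seq_prefix(TARGET_DB)
print(f"~{target_seq} of ~{source_seq} changes applied "
      f"({100.0 * target_seq / max(source_seq, 1):.1f}%)")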
Alternatively, there is an API endpoint that allows you to view the replication job's progress: GET _scheduler/docs/_replicator/<replication id> where the replication id is the _id of the document in the _replicator database that was created to initiate the replication.
It returns an object like this:
{
"database": "_replicator",
"doc_id": "e0330b1936f6194da22af8fa663c5be8",
"id": null,
"source": "https://source.cloudant.com/sourcedb/",
"target": "https://target.cloudant.com/targetdb/",
"state": "completed",
"error_count": 0,
"info": {
"revisions_checked": 1005,
"missing_revisions_found": 1005,
"docs_read": 1005,
"docs_written": 1005,
"changes_pending": 376,
"doc_write_failures": 0,
"checkpointed_source_seq": "1011-g1AAAAfLeJy91FFKwzAYwPGigo_uBvqq0JmkbZqCsomoj3oDzZcvZYxtFbc96w30BnoDvYHeQG-gN9AbzCYpbntbhfQlhdJ"
},
"start_time": "2020-11-17T09:55:01Z",
"last_updated": "2020-11-17T09:55:58Z"
}
This object includes the status of the replication, the number of documents processed and the last checkpointed sequence token, which should be enough to estimate the progress of the replication job.
The full details of the API call are in the Cloudant documentation.
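For a rough progress percentage you can poll that endpoint and combine docs_written with changes_pending; a small sketch below, where the account URL and credentials are placeholders and the document _id is the one from the example above.
# Sketch: poll the scheduler document for a replication started via the
# _replicator database. Account URL and credentials are placeholders.
import requests

ACCOUNT = "https://myaccount.cloudant.com"            # placeholder
AUTH = ("apikey-user", "apikey-password")             # placeholder credentials
REPLICATION_ID = "e0330b1936f6194da22af8fa663c5be8"   # _id of the _replicator doc

status = requests.get(
    f"{ACCOUNT}/_scheduler/docs/_replicator/{REPLICATION_ID}", auth=AUTH
).json()

info = status.get("info") or {}
written = info.get("docs_written") or 0
pending = info.get("changes_pending") or 0
print(f"state={status.get('state')} docs_written={written} changes_pending={pending}")
if written + pending:
    print(f"estimated progress: {100.0 * written / (written + pending):.1f}%")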
I have recently started using Debezium to capture change data in real time and sink it to a target database.
Instead of Kafka, I use Azure Event Hubs with Kafka Connect to connect to SQL Server, and the Confluent JDBC sink connector to write the changed data to the target database, which is also SQL Server.
I understand Debezium works asynchronously to reduce the impact on database performance, but is there any way I can increase the streaming throughput?
I recently spun up an Event Hubs namespace with a minimum of 10 throughput units and auto-inflate up to 20, so I expect Debezium + Kafka Connect + Event Hubs to stream 10-20 MB/second, with egress of 20-40 MB/second.
However, the actual performance is far worse. I manually imported 10k records (less than 6 MB) into the source database, expecting Debezium and the sink connector to capture the changes and write them to the target database within a few seconds.
Instead of the data arriving all at once, the sink connector updates the target database only gradually.
The following is my config. Please let me know if I need to change the configuration to improve performance. Any help would be very much appreciated.
Kafka Connect:
bootstrap.servers=sqldbcdc.servicebus.windows.net:9093
group.id=connect-cluster-group
# connect internal topic names, auto-created if not exists
config.storage.topic=connect-cluster-configs
offset.storage.topic=connect-cluster-offsets
status.storage.topic=connect-cluster-status
# internal topic replication factors - auto 3x replication in Azure Storage
config.storage.replication.factor=1
offset.storage.replication.factor=1
status.storage.replication.factor=1
rest.advertised.host.name=connect
offset.flush.interval.ms=10000
connections.max.idle.ms=180000
metadata.max.age.ms=180000
auto.register.schemas=false
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
# required EH Kafka security settings
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://sqldbcdc.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=**************************=";
producer.security.protocol=SASL_SSL
producer.sasl.mechanism=PLAIN
producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://sqldbcdc.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=**************************=";
consumer.security.protocol=SASL_SSL
consumer.sasl.mechanism=PLAIN
consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://sqldbcdc.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=**************************=";
plugin.path=C:\kafka\libs
SQL Connector:
{
"name": "sql-server-connection",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max": "1",
"database.hostname": "localhost",
"database.port": "1433",
"database.user": "sa",
"database.password": "******",
"database.dbname": "demodb",
"database.server.name": "dbservername",
"table.whitelist": "dbo.portfolios",
"database.history":"io.debezium.relational.history.MemoryDatabaseHistory",
"transforms": "route",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3"
}
}
Sink Connector:
{
"name": "jdbc-sink",
"config":{
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"topics": "portfolios",
"connection.url": "jdbc:sqlserver://localhost:1433;instance=NEWMSSQLSERVER;databaseName=demodb",
"connection.user":"sa",
"connection.password":"*****",
"batch.size":2000,
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": false,
"transforms.unwrap.delete.handling.mode": "none",
"auto.create": "true",
"insert.mode": "upsert",
"delete.enabled":true,
"pk.fields": "portfolio_id",
"pk.mode": "record_key",
"table.name.format": "replicated_portfolios"
}
}
I am trying to create a logic app that finds all the err$ tables in an Oracle database (err$_table_name is the default name of the rejected-rows table created by the LOG ERRORS option). The problem I am stuck on is that when I use the Oracle Get rows action, the dollar sign in the table name causes a JSON error.
Error message - BadRequest. Http request failed: the content was not a valid JSON.
In the "Inputs" sections the table name is correct, in this case the table name is "CHEETAH.ERR$_ALL_D_MARKET_HIER"
Under the raw inputs though I see this and I can see the $ was switched to %2524
{
"method": "get",
"path": "/datasets/default/tables/CHEETAH.ERR%2524_ALL_D_MARKET_HIER/items",
"host": {
"connection": {
"name": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Web/connections/oracle-3"
}
}
}
Here is the code view of the Get rows action:
"method": "get",
"path": "/datasets/default/tables/#{encodeURIComponent(encodeURIComponent(concat(variables('Owner'), '.', variables('Table') )))}/items"
I get this JSON error whether I enter the table name directly or pass it in via a variable.
Does anyone have any thoughts on how to get this to work? The only workaround I can think of is to use a stored procedure to create views without the $ in their names.
I tried the suggestion of adding a slash. It changed the error, at least.
Looking at the response below, it took the single slash I added and replaced it with two slashes.
{
"status": 400,
"message": "The specified item 'CHEETAH.ERR\\$_ALL_D_PROD_HIER' is not found.\r\n inner exception: The specified item 'CHEETAH.ERR\\$_ALL_D_PROD_HIER' is not found.\r\nclientRequestId: b9038635-4007-48f5-aebd-ce94e1faf90a",
"error": {
"message": "The specified item 'CHEETAH.ERR\\$_ALL_D_PROD_HIER' is not found.\r\n inner exception: The specified item 'CHEETAH.ERR\\$_ALL_D_PROD_HIER' is not found."
},
"source": "oracle-cc.azconn-cc.p.azurewebsites.net"
}
I figured this out and it had nothing to do with the $ in the table name. It was the data being returned in one of the columns.
The problem column was of data type "UROWID"
I am trying to use Microsoft Graph to capture the products which we have licenses for.
While I can get the skuPartNumber, that value is not exactly display-friendly.
I have come across DisplayName as a datapoint in almost all the API calls that give out an object with an id.
I was wondering if there was a DisplayName for the skus, and where I could go to get them via the graph.
For reference, the call I made was on the https://graph.microsoft.com/v1.0/subscribedSkus endpoint following the doc https://learn.microsoft.com/en-us/graph/api/subscribedsku-list?view=graph-rest-1.0
The following is what's returned (after filtering out things I don't need), and as mentioned before, while I have a unique identifier I can use via the skuPartNumber, it is not exactly presentable.
You might notice that for some of the SKUs it is difficult to figure out which product they refer to from the names shown on the Licenses page.
[
{
"capabilityStatus": "Enabled",
"consumedUnits": 0,
"id": "aca06701-ea7e-42b5-81e7-6ecaee2811ad_2b9c8e7c-319c-43a2-a2a0-48c5c6161de7",
"skuId": "2b9c8e7c-319c-43a2-a2a0-48c5c6161de7",
"skuPartNumber": "AAD_BASIC"
},
{
"capabilityStatus": "Enabled",
"consumedUnits": 0,
"id": "aca06701-ea7e-42b5-81e7-6ecaee2811ad_df845ce7-05f9-4894-b5f2-11bbfbcfd2b6",
"skuId": "df845ce7-05f9-4894-b5f2-11bbfbcfd2b6",
"skuPartNumber": "ADALLOM_STANDALONE"
},
{
"capabilityStatus": "Enabled",
"consumedUnits": 96,
"id": "aca06701-ea7e-42b5-81e7-6ecaee2811ad_0c266dff-15dd-4b49-8397-2bb16070ed52",
"skuId": "0c266dff-15dd-4b49-8397-2bb16070ed52",
"skuPartNumber": "MCOMEETADV"
}
]
Edit:
I am aware that I can get "friendly names" of SKUs in the following link
https://learn.microsoft.com/en-us/azure/active-directory/users-groups-roles/licensing-service-plan-reference
The problem is that it contains ONLY the 70ish most COMMON SKUs (in the last financial quarter), NOT ALL.
My organization alone has 5 SKUs not present on that page, and some of the clients we act as an MSP for also have a few. In that context, the link really does not solve the problem, since it is neither reliable nor updated fast enough for new SKUs.
You can see a mapping list in Product names and service plan identifiers for licensing.
Please note that:
the table lists the most commonly used Microsoft online service products and provides their various ID values. These tables are for reference purposes and are accurate only as of the date when this article was last updated. Microsoft does not plan to update them for newly added services periodically.
Here is an extra list which may be helpful.
There is a CSV download available of the data on the "Product names and service plan identifiers for licensing" page now.
For example, the current CSV (as of the time of posting this answer) is located at https://download.microsoft.com/download/e/3/e/e3e9faf2-f28b-490a-9ada-c6089a1fc5b0/Product%20names%20and%20service%20plan%20identifiers%20for%20licensing%20v9_22_2021.csv. This can be downloaded, cached and parsed in your application to resolve the product display name.
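A minimal sketch of that approach, assuming the CSV keeps the column headers it had at the time of writing (String_Id for the SKU part number and Product_Display_Name for the friendly name); those header names are assumptions, and the URL is the versioned file above, which will change with future revisions.
# Sketch: build a skuPartNumber -> friendly name lookup from the licensing CSV.
# Column names (String_Id, Product_Display_Name) are assumptions based on the
# file published at the time of writing and may change in later revisions.
import csv
import io
import requests

CSV_URL = ("https://download.microsoft.com/download/e/3/e/"
           "e3e9faf2-f28b-490a-9ada-c6089a1fc5b0/"
           "Product%20names%20and%20service%20plan%20identifiers"
           "%20for%20licensing%20v9_22_2021.csv")

response = requests.get(CSV_URL)
response.raise_for_status()

# The file has one row per (product, service plan) pair, so the same String_Id
# appears repeatedly; the product display name is the same for each repeat.
friendly_names = {}
for row in csv.DictReader(io.StringIO(response.content.decode("utf-8-sig"))):
    friendly_names[row["String_Id"]] = row["Product_Display_Name"]

# Resolve a skuPartNumber returned by /subscribedSkus, falling back to itself.
print(friendly_names.get("MCOMEETADV", "MCOMEETADV"))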
This is just a CSV format of the same table that is displayed on the webpage, which is not comprehensive, but it should have many of the products listed. If you find one that is missing, you can use the "Submit feedback for this page" button on the bottom of the page to create a GitHub issue. The documentation team usually responds in a few weeks.
Microsoft may provide an API for this data in the future, but it's only in their backlog. (source)
I am trying to update my collections in my mongodb instance hosted on mlab.
I am running the following code:
...
db.collectionOne.insert(someArrayOfJson)
db.collectionTwo.insert(someArrayOfJson)
The first collection gets updated and the second doesn't.
Using the same/different valid Json arrays produce the same outcome. Only the first gets updated.
I have seen this question (duplicate document - same collection) and I can understand why that wouldn't work, but my problem is across two separate collections.
When inserting the data manually on mlab the document goes into the second collection fine, so I am led to believe it allows duplicate data across separate collections.
I am new to mongo - am I missing something simple?
Update:
The response is:
22:01:53.224 [main] DEBUG org.mongodb.driver.protocol.insert - Inserting 20 documents into namespace db.collectionTwo on connection [connectionId{localValue:2, serverValue:41122}] to server ds141043.mlab.com:41043
22:01:53.386 [main] DEBUG org.mongodb.driver.protocol.insert - Insert completed
22:01:53.403 [main] DEBUG org.mongodb.driver.protocol.insert - Inserting 20 documents into namespace db.collectionOne on connection [connectionId{localValue:2, serverValue:41122}] to server ds141043.mlab.com:41043
22:01:55.297 [main] DEBUG org.mongodb.driver.protocol.insert - Insert completed
But there is nothing entered into the db for the second dataset.
Update v2:
If I make a call after the two inserts such as:
db.createCollection("log", { capped : true, size : 5242880, max : 5000 } )
The data collections get updated!
What is the total data size?
Here is sample code that works for me:
db.collectionOne.insert([{"name1":"John","age1":30,"cars1":[ "Ford", "BMW", "Fiat"]},{"name1":"John","age1":30,"cars1":[ "Ford", "BMW", "Fiat" ]}]); db.collectionTwo.insert([{"name2":"John","age2":30,"cars2":[ "Ford", "BMW", "Fiat"]},{"name2":"John","age2":30,"cars2":[ "Ford", "BMW", "Fiat" ]}])
If the data is larger, you can use MongoDB bulk write operations, and you can also refer to the MongoDB limits and thresholds documentation:
https://docs.mongodb.com/manual/reference/limits
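As a rough sketch of the bulk write API mentioned above, using PyMongo; the connection string, database and collection names are placeholders.
# Sketch: the same kind of inserts expressed as a bulk write.
from pymongo import InsertOne, MongoClient

client = MongoClient("mongodb://user:password@ds141043.mlab.com:41043/db")  # placeholder
db = client["db"]

docs = [{"name1": "John", "age1": 30}, {"name1": "Jane", "age1": 31}]

# ordered=False lets the remaining operations proceed even if one fails.
result = db.collectionTwo.bulk_write([InsertOne(d) for d in docs], ordered=False)
print("inserted:", result.inserted_count)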
How did you determine that the second collection did not get updated?
I believe you are simply seeing the difference between NoSQL and SQL databases. A SQL database will guarantee you that a read after a successful write will read the data you just wrote. A NoSQL database does not guarantee that you can immediately read data you just wrote. See this answer for more details.
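One way to answer that question without relying on an immediate read-back is to check what the driver itself reports for each insert; a minimal PyMongo sketch, with placeholder connection details and documents.
# Sketch: insert into both collections and inspect the acknowledged results
# instead of reading the data straight back.
from pymongo import MongoClient

client = MongoClient("mongodb://user:password@ds141043.mlab.com:41043/db")  # placeholder
db = client["db"]

some_array_of_json = [{"name": "John", "age": 30}, {"name": "Jane", "age": 31}]

# Insert copies so the _id added by the first insert is not reused by the second.
result_one = db.collectionOne.insert_many([dict(d) for d in some_array_of_json])
result_two = db.collectionTwo.insert_many([dict(d) for d in some_array_of_json])

print("collectionOne acknowledged:", result_one.acknowledged,
      "inserted:", len(result_one.inserted_ids))
print("collectionTwo acknowledged:", result_two.acknowledged,
      "inserted:", len(result_two.inserted_ids))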
I'm using Azure Search Indexer to index a view from Azure SQL DB.
I've created a data source (a view) and set the following settings in the connection string:
(...)Trusted_Connection=False;Encrypt=True;Connection Timeout=1200;" },
The indexer still returns timeouts, and I can see from the Azure SQL DB logs that the indexer's query gets cancelled after 30 seconds:
ActionStatus: Cancellation
Statement: SET FMTONLY OFF; SET NO_BROWSETABLE ON;SELECT * FROM
[dbo].[v_XXX] ORDER BY [rowVersion] SET NO_BROWSETABLE OFF
ServerDuration: 00:00:30.3559524
The same statement takes ~2 minutes when run through SQL Server Mgmt Studio and gets completed.
I wonder if there are any other settings (server or DB) that override my connection timeout preference. If so, why is there no timeout when I query the DB through SSMS, but there is one when the indexer tries to index the view?
The timeout that cancels the operation is the command timeout, not the connection timeout. The default command timeout used to be 30 seconds, and currently there is no way to change it. We have increased the default command timeout to a much larger value (5 minutes) to mitigate this in the short term. Longer term, we will add the ability to specify a command timeout in the data source definition.
Now there is a setting on the indexer with which you can configure the queryTimeout. I think it is in minutes. My indexer now runs for longer than 20 minutes without error.
"startTime": "2016-01-01T00:00:00Z"
},
"parameters": {
"batchSize": null,
"maxFailedItems": 0,
"maxFailedItemsPerBatch": 0,
"base64EncodeKeys": false,
"configuration": {
"queryTimeout": "360"
}
},
"fieldMappings": [
{
Update: at the moment it cannot be set through the Azure portal. You can set it via the REST API:
PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=[api-version]
Content-Type: application/json
api-key: [admin key]
Use the REST API link https://[SERVICE].search.windows.net/indexers/[Indexer]?api-version=2016-09-01 to get the indexer definition, and then use a PUT to the same address to update it.
Ref MSDN
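A rough sketch of that get-then-update round trip in Python; the service name, indexer name and admin key are placeholders, and the queryTimeout value mirrors the fragment shown earlier.
# Sketch: fetch the indexer definition, set parameters.configuration.queryTimeout,
# and PUT the updated definition back. Names and key are placeholders.
import requests

SERVICE = "myservice"      # placeholder search service name
INDEXER = "myindexer"      # placeholder indexer name
ADMIN_KEY = "..."          # placeholder admin api-key
API_VERSION = "2016-09-01"

url = (f"https://{SERVICE}.search.windows.net/indexers/{INDEXER}"
       f"?api-version={API_VERSION}")
headers = {"api-key": ADMIN_KEY, "Content-Type": "application/json"}

indexer = requests.get(url, headers=headers).json()

# Drop OData metadata returned by the GET before re-submitting, as a precaution.
indexer = {k: v for k, v in indexer.items() if not k.startswith("@odata")}

parameters = indexer.get("parameters") or {}
configuration = parameters.get("configuration") or {}
configuration["queryTimeout"] = "360"   # same value as in the fragment above
parameters["configuration"] = configuration
indexer["parameters"] = parameters

response = requests.put(url, headers=headers, json=indexer)
response.raise_for_status()
print(response.status_code)  # expect a 2xx status on success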