How to mark particular records or ranges as deleted in Timestream - data-modeling

I have a stream processing application that constantly ingests data into AWS Timestream.
I am trying to come up with an approach for when a particular range of data is processed incorrectly: I need to re-ingest those records and mark the ones that were already processed as deleted.
What is the best approach to do that?
Thanks in advance.

You can use the Version field while re-ingesting data.
When a Version is provided, Timestream overwrites/upserts the existing record if the new record has a higher Version.
Ref: Timestream Developer Guide
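For example, with boto3 (the database, table, dimensions, and measure below are placeholders; adjust them to your schema), re-writing the corrected records with a higher Version makes Timestream upsert them over the bad ones:
import time
import boto3

client = boto3.client("timestream-write")

client.write_records(
    DatabaseName="my_db",            # placeholder names
    TableName="my_table",
    CommonAttributes={
        "Dimensions": [{"Name": "device_id", "Value": "sensor-42"}],
        "MeasureName": "temperature",
        "MeasureValueType": "DOUBLE",
        # Any value higher than the original ingest's Version triggers an upsert;
        # a current timestamp is a simple monotonically increasing choice.
        "Version": int(time.time()),
    },
    Records=[
        {"Time": "1700000000000", "TimeUnit": "MILLISECONDS", "MeasureValue": "21.5"},
    ],
)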

Related

How to handle multiple instances trying to update a document in Elasticsearch?

I am new to Elasticsearch and trying to explore some use cases for my business requirements.
What happens if multiple instances try to update a document at the same time?
Is there any error handling in place, or does the document get locked?
Please advise.
Elasticsearch is using optimistic concurrency control to ensure that an older version of a document never overwrites a newer version.
When documents are created, updated, or deleted, the new version of the document has to be replicated to other nodes in the cluster. Elasticsearch is also asynchronous and concurrent, meaning that these replication requests are sent in parallel, and may arrive at their destination out of sequence.
For more information, see the Elasticsearch documentation on optimistic concurrency control.
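As a rough sketch of what this looks like from a client (assuming the Python elasticsearch client, version 8.x, and a hypothetical orders index), you pass the document's current _seq_no and _primary_term back with the write; if another instance changed the document in between, the write fails with a conflict instead of silently overwriting:
from elasticsearch import Elasticsearch, ConflictError

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

current = es.get(index="orders", id="1")
try:
    es.index(
        index="orders",
        id="1",
        document={"status": "shipped"},
        # Only apply the write if the document has not changed since we read it.
        if_seq_no=current["_seq_no"],
        if_primary_term=current["_primary_term"],
    )
except ConflictError:
    # Another instance updated the document first; re-read and retry.
    pass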

Does QuickBooks have any kind of audit log?

QuickBooks allows users to change posted periods. How can I tell if a user does this?
I actually don't need an audit log, but just the ability to see recently added/edited data that has a transaction date that's over a month in the past.
In a meeting today it was suggested that we may need to refresh data for all our users going back as far as a year on a regular basis. This would be pretty time consuming, and I think unnecessary when the majority of the data isn't changing. But I need to find out how I can see if data (such as an expense) has been added to a prior period, so I know when to pull it again.
Is there a way to query for data (in any object or report) based not on the date of the transaction, but based on the date it was entered/edited?
I'm asking this in regard to using the QBO api, however if you know how to find this information from the web portal that may also be helpful.
QuickBooks has a ChangeDataCapture endpoint which is designed for exactly the purpose you are describing. It's documented here:
https://developer.intuit.com/app/developer/qbo/docs/api/accounting/all-entities/changedatacapture
The TLDR summary is this:
The change data capture (cdc) operation returns a list of objects that have changed since a specified time.
In other words, you can continually query this endpoint, and you'll only get back the data that has actually changed since the last time you hit it.
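A minimal sketch of polling it (the realm ID, token, and entity list below are placeholders; a real call needs a valid OAuth2 access token):
import requests
from datetime import datetime, timedelta, timezone

REALM_ID = "1234567890"   # placeholder company/realm ID
ACCESS_TOKEN = "..."      # placeholder OAuth2 token
changed_since = (datetime.now(timezone.utc) - timedelta(days=35)).isoformat(timespec="seconds")

resp = requests.get(
    f"https://quickbooks.api.intuit.com/v3/company/{REALM_ID}/cdc",
    params={"entities": "Purchase,Invoice", "changedSince": changed_since},
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}", "Accept": "application/json"},
)
resp.raise_for_status()
# Each entity section lists the objects created, updated, or deleted since changedSince.
print(resp.json())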

Stream live data into Google cloud pub/sub

Is it possible to connect Oracle database and get live data stream into Google cloud pub/sub?
The short answer to your question is yes, but the longer, more detailed answer requires some assumptions. When you say stream, do you literally mean streaming, or do you mean batch updates every minute?
I ask because there are huge implications depending on the answer. If you require a true streaming solution, the only option is to bolt an Oracle product called Oracle GoldenGate on top of your database. This product is costly in both dollars and engineering effort.
If a near-real-time solution is suitable for you, then you can use any of the following:
NiFi
Airflow
Luigi
with either plain SQL or a streaming framework like Beam or Spark.
Or any other orchestration platform that can run queries on a timer. At the end of the day, all you need is something that can do select * from table where last_update > now() - threshold, generate an event for each delta, and then publish all the deltas to PubSub.
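As a rough sketch of that polling-and-publishing loop (the connection details, table, and topic below are placeholders, and it assumes the python-oracledb and google-cloud-pubsub libraries):
import json
import oracledb
from google.cloud import pubsub_v1

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/ORCLPDB1")  # placeholder DSN
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "oracle-deltas")           # placeholder topic

THRESHOLD_MINUTES = 5  # how far back each scheduled run looks

with conn.cursor() as cur:
    cur.execute(
        "SELECT id, payload, last_update FROM my_table "
        "WHERE last_update > SYSTIMESTAMP - NUMTODSINTERVAL(:mins, 'MINUTE')",
        mins=THRESHOLD_MINUTES,
    )
    for row_id, payload, last_update in cur:
        # One Pub/Sub message per delta row.
        publisher.publish(
            topic_path,
            json.dumps({"id": row_id, "payload": payload,
                        "last_update": str(last_update)}).encode("utf-8"),
        )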
Yes, you can see a provided template at https://cloud.google.com/dataflow/docs/templates/provided-templates#gcstexttocloudpubsub that reads from Google Cloud Storage Text to Cloud Pub/Sub. You should be able to change the code that reads from storage to read from your database instead.
Yes. I tried this as part of a POC. Use triggers to capture the changed records from Oracle, use a cursor to convert them into a .txt file with the data in JSON format, then prepare a batch script that reads the data and includes a publish command to push it through Cloud Pub/Sub. That is the overall flow.
You can consider using Change Data Capture (CDC) tools like Debezium that detect your DB changes in real time.
Docs: https://debezium.io/documentation/reference/operations/debezium-server.html
With Spring boot: https://www.baeldung.com/debezium-intro

How can I poll the Salesforce API to find records that meet criteria and have not been seen by my app before?

I am working on a Salesforce integration for a high-traffic app where we want to automate the process of importing records from Salesforce into our app. To be clear, I am not working on the Salesforce side (i.e. Apex), but rather using the Salesforce REST API from within the other app.
The first idea was to use a cutoff time based on when the record was created, increasing that time on each poll based on the creation time of the newest applicant in the last poll. It was quickly realized this wouldn't work. There can be other filters in the query, such as a status field in Salesforce where the record should only be imported after a certain status is set. That makes checking creation time (or anything like it) unreliable, since an older record could later become relevant to our auto-importing.
My next idea was to poll the Salesforce API to find records every few hours. To avoid importing the same record twice, the only way I could think of was to keep track of the IDs we have already attempted to import and use them in a NOT IN condition:
SELECT #{columns} FROM #{sobject_name}
WHERE Id NOT IN #{ids_we_already_imported} AND #{other_filters}
My big concern at this point was whether or not Salesforce had a limitation on the length of the WHERE clause. Through some research I see there are actually several limitations:
https://developer.salesforce.com/docs/atlas.en-us.salesforce_app_limits_cheatsheet.meta/salesforce_app_limits_cheatsheet/salesforce_app_limits_platform_soslsoql.htm
The next thing I considered was querying for all of the IDs in Salesforce that meet the other filter conditions, without checking the ID itself. We could then take that list of IDs, remove the ones we already track on our end, and build a smaller IN condition to fetch the full data for the records we actually need.
This still doesn't seem completely reliable, though. A single query can only return 2,000 rows, and OFFSET only goes up to 2,000. If we have already imported 2,000 records, the first query might not contain any rows we want to import, and we can't offset past them because of these limitations.
With these limitations I can't figure out a reliable way to find the relevant records to import as the number of already-imported records grows. I feel like this must be a common use case for a Salesforce integration, but I can't find anything on it. How can I do this without running into issues at high volume?
Not sure what all of your requirements are or if the solution needs to be generic, but you could do a few things:
Flag records that have been imported. That means making a call back to Salesforce to update the records, but the update can be bulkified to reduce the number of calls, and you can modify your query to exclude the flag (see the sketch after this list).
Reverse the direction so you push instead of pull: have Salesforce push records to your app whenever a record meets the criteria, using workflow rules and outbound messages.
Use the Streaming API to set up a PushTopic that your app subscribes to, so it is notified whenever a record meets the criteria.
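Here is a rough sketch of the first option using the simple-salesforce library; Imported__c is a hypothetical custom checkbox field and the filters are placeholders:
from simple_salesforce import Salesforce

sf = Salesforce(username="user@example.com", password="...", security_token="...")

# Pull only records that match the business filters and are not yet flagged.
result = sf.query_all(
    "SELECT Id, FirstName, LastName, Status "
    "FROM Lead "
    "WHERE Imported__c = false AND Status = 'Qualified'"
)
records = result["records"]

# ... import the records into your own app here ...

# Bulk-update the flag so the next poll excludes these records.
if records:
    sf.bulk.Lead.update([{"Id": r["Id"], "Imported__c": True} for r in records])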

Indexing SQLServer data with SOLR

What is the best way of syncing database changes with Solr incremental indexing? What is the best way of getting MS SQL Server data indexed by Solr?
Thanks so much in advance.
Solr works with plugins. You will need to create your own data importer plugin that is called periodically (based on notifications, elapsed time, etc.). You point your Solr configuration to the class that should be called on update.
Regarding your second question, I used a text file that holds a timestamp. Each time Solr started, it looked at that file and retrieved from the DB the relevant data that had changed since that point (the file is updated whenever the index is updated).
I would suggest reading a good Solr/Lucene book or guide such as lucidworks-solr-refguide-1.4 before getting started, so you can be sure your architectural solution is correct.
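A minimal external sketch of that timestamp-file approach (names are placeholders; it assumes pyodbc for SQL Server and posts the delta to Solr's JSON update endpoint rather than using a plugin):
import json
import pyodbc
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"  # placeholder core
STATE_FILE = "last_index_time.txt"  # holds the timestamp of the previous run

with open(STATE_FILE) as f:
    last_run = f.read().strip()  # e.g. "2024-01-01T00:00:00"

conn = pyodbc.connect("DSN=mssql;UID=app;PWD=secret")  # placeholder connection string
cur = conn.cursor()
cur.execute(
    "SELECT id, title, body FROM documents WHERE modified_at > ?",
    last_run,
)
docs = [{"id": str(r.id), "title": r.title, "body": r.body} for r in cur.fetchall()]

if docs:
    # Solr's JSON update endpoint accepts a list of documents to add/overwrite.
    requests.post(
        SOLR_UPDATE_URL,
        data=json.dumps(docs),
        headers={"Content-Type": "application/json"},
    ).raise_for_status()
# After a successful run, overwrite STATE_FILE with the current timestamp.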
