max attempts of 10 was reached for request with last error being: RECEIPT_NOT_FOUND - hedera-hashgraph

I've upgraded my SDK to v2.20.0-beta.4, and I am receiving this error when submitting transactions.
I expect the transactions to succeed. They do if I downgrade the SDK to a stable release, so I am guessing this is a bug in the beta, which happens, but I'm more interested in what the error actually means.

It means the SDK tried to get a receipt for your transaction 10 times and eventually gave up. This can happen for a number of reasons:
The transactionId you're requesting a receipt for doesn't actually exist
The node the receipt query goes to has never seen that transaction
The communication between you and the node you're asking the receipt from is broken
You're asking for a receipt for a transaction that's more than 3 minutes old
Given earlier versions of the SDK work fine, it's probably a bug; I'd encourage you to file an issue on the SDK repository with your findings.
While asking for a receipt is usually successful (even if the transaction itself failed), there are edge cases where a successful transaction will not be followed by a successful receipt request, for the reasons above. Those are not necessarily Hedera's fault: you could send a transaction from a mobile device, lose network connectivity, and then fail to fetch the receipt by the time you regain connectivity.
The belt-and-braces approach is to log transaction IDs in a persisted list, remove each one from the list when its receipt is obtained, and, whenever a receipt cannot be obtained, check with a mirror node whether the transactions still in the list succeeded or not.
If a transaction in the list is more than 3 minutes old and there is no record of it on a mirror node, it hasn't been processed and never will be.
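A minimal sketch of that reconciliation loop in Python, assuming the public mirror node REST API's /api/v1/transactions/{id} endpoint (check your mirror node's docs; note it expects transaction IDs in the 0.0.x-seconds-nanos form rather than the SDK's 0.0.x@seconds.nanos form). The file name and record shape are made up:

import json
import time
import requests

PENDING_FILE = "pending_txns.json"  # illustrative persisted list
MIRROR_URL = "https://mainnet-public.mirrornode.hedera.com/api/v1/transactions/"

def load_pending():
    # Each entry: {"id": "0.0.1234-1669862053-587350301", "submitted_at": 1669862053}
    try:
        with open(PENDING_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return []

def save_pending(pending):
    with open(PENDING_FILE, "w") as f:
        json.dump(pending, f)

def reconcile():
    still_pending = []
    for txn in load_pending():
        resp = requests.get(MIRROR_URL + txn["id"])
        if resp.status_code == 200:
            continue  # the mirror node has a record: the transaction reached consensus
        if time.time() - txn["submitted_at"] > 180:
            continue  # >3 minutes old and unknown to the mirror node: never processed
        still_pending.append(txn)  # too soon to tell, keep it for the next pass
    save_pending(still_pending)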

Related

How to clean up after a network failure causes a database update exception

I have created a VB program for a small local business. They have 2 locations. The Access database resides on a computer at one of the locations, and data is transmitted across a VPN through the local internet provider (a major carrier). This VPN is not in my control, and the transmission method can't be changed.
Everything works well, as long as the provider’s network doesn’t fail. Unfortunately, it sometimes does. The program requires frequent database updates, for instance: a new customer is entered into the database, and after an update, the customer number (auto-generated) is retrieved. This customer number is then used as part of the order table row, which is then updated and its auto-number is retrieved. The order number is then used as part of the order-item table rows, which lists the items in the order. You get the picture… lots of updates, which must occur to get the auto numbers for the next step. Inventory, payment info, etc.
If the network fails at the wrong time, say after the order has been updated but before the order-items have been updated, the database becomes a mess and is out of sync. Because the network is down, a database correction or rollback is not possible. I have try/catch blocks in place to find out whether an update was successful or not. My question is, how do I:
Possibly try again, since the network might have just hiccupped? I know I can loop here, but are there better methods? (See the sketch after this question.)
If that fails, and I must consider the update a no-go and the network truly down, how do I clean up the mess? Unluckily for me, a major 2-day failure occurred an hour after my program went live for the first time, so this is not an unlikely possibility.
Thank you for any suggestions
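For illustration, the standard remedy for the mid-sequence failure described above is to make each customer/order/order-items sequence a single atomic database transaction and retry the whole unit a bounded number of times; a partial failure then rolls back instead of leaving the tables out of sync. A minimal Python sketch of the pattern (table and column names are made up; the same idea applies to ADO.NET transactions in VB):

import time
import sqlite3  # stand-in for any DB-API driver

def place_order(conn, customer_name, items, retries=3):
    for attempt in range(retries):
        try:
            with conn:  # one transaction: commits on success, rolls back on any error
                cur = conn.cursor()
                cur.execute("INSERT INTO customers (name) VALUES (?)", (customer_name,))
                customer_id = cur.lastrowid  # auto-generated customer number
                cur.execute("INSERT INTO orders (customer_id) VALUES (?)", (customer_id,))
                order_id = cur.lastrowid     # auto-generated order number
                cur.executemany(
                    "INSERT INTO order_items (order_id, sku) VALUES (?, ?)",
                    [(order_id, sku) for sku in items],
                )
            return order_id  # every row committed together
        except sqlite3.OperationalError:
            time.sleep(2 ** attempt)  # brief backoff: the network may have just hiccupped
    raise RuntimeError("network still down: nothing was committed, safe to retry later")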

SQL HA Cluster TempDB Version Store blocking on secondary Replica due to open transaction?

I am currently investigating a recurring error which occurs on the secondary replica of our 2-node Always On High Availability cluster. The replica is set up as read-intent only because we use a separate backup solution (Dell Networker).
Tempdb keeps growing on the secondary replica because the version store never gets cleared.
I can fix it temporarily when I fail over the availability groups, but after a couple of hours the error appears again on the replica node. The error seems to follow one specific availability group; every node where it is currently replicating gets the error after some time. So I guess it has to be an issue caused by a transaction and not by the system itself.
I tried every suggestion I could find, but even if I recklessly kill all sessions whose last_batch falls in the timeframe I get from the Perfmon "longest running transaction time" counter (as advised here: https://www.sqlservercentral.com/articles/tempdb-growth-due-to-version-store-on-alwayson-secondary-server ), the version store does not start cleaning up.
The elapsed seconds shown there also match the output of this query on the secondary node:
select * from sys.dm_tran_active_snapshot_database_transactions
The details are sadly not useful:
[screenshot of the query output]
It shows transaction ID 0 and session ID 823, but that session is long gone and its ID keeps getting reused by other processes. So I am stuck here.
I tried to match the transaction_sequence_num with anything, but no luck so far.
The primary node shows no open transactions of any kind.
Any help finding the cause of this open snapshot transaction is appreciated.
I have already followed these guides to find the issue:
https://sqlundercover.com/2018/03/21/tempdb-filling-up-on-secondary-replicas/
https://sqlgeekspro.com/tempdb-growth-due-to-version-store-in-alwayson/
https://learn.microsoft.com/en-us/archive/blogs/docast/rapid-growth-of-tempdb-on-alwayson-secondary-replica-due-to-version-store
https://www.sqlshack.com/how-to-detect-and-prevent-unexpected-growth-of-the-tempdb-database/
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/635a2fc6-550b-4e08-b232-0652bd6ea17d/version-store-space-not-being-released?forum=sqldatabaseengine
https://www.sqlservercentral.com/articles/tempdb-growth-due-to-version-store-on-alwayson-secondary-server
Update:
To support my claim that the session is long gone, here are pictures showing the output of sys.dm_tran_active_snapshot_database_transactions, sys.sysprocesses and sys.dm_exec_sessions:
The first picture shows 2 currently "open" snapshot database transactions (normally it was always one in the past, but maybe the more the better), together with the sessions running under those IDs at that moment.
Then I proceeded to kill sessions 899 and 823 and checked again:
Here you can see that sys.dm_tran_active_snapshot_database_transactions still shows the 2 session IDs, while sysprocesses and dm_exec_sessions show the 2 IDs are now in use by a different program, user, database etc., because I killed them and the ID numbers immediately got reused. If I check through the day, sometimes they are not in use at all.
Going by the elapsed time and the Perfmon longest running transaction counter, I would be looking for a session with a login time or last batch at around 2023-02-03 00:00:56. But if I check all sleeping sessions, or sessions with a last batch in that range, and even kill all of them (as described in all of the links above), the "transaction" still shows up in sys.dm_tran_active_snapshot_database_transactions with ever-growing numbers.
Update 2:
In the meantime we had to resolve the issue with a failover because tempdb ran out of space. Now the new "stuck" session ID shown in sys.dm_tran_active_transactions is 47, currently at around 30,000 seconds and rising, so the problem started at around 2023-02-11 00:00:20.
Here is the output of sys.dm_tran_active_transactions:
[screenshot of the query output]
There are many different assumptions going on here which need to be addressed.
First, availability groups work by shipping log blocks; they don't send individual transactions. Log blocks are written to the log on the secondary, and eventually the information inside them is redone.
Second, the version store is only used on readable secondary replicas, so looking at the primary for sessions that are using items on the secondary is not going to help. The version store can only be cleaned up by removing the oldest unused versions until it hits a version that is still in use; it cannot skip versions. Thus, if version 3 is needed but versions 4-10 aren't, anything below 3 can be cleaned up, but nothing from 3 upward (including 3) can.
Third, if a session is closed, then any outstanding items are cleaned up, whether that is freeing memory, killing transactions, etc. No evidence was given that the session is actually disconnected on your secondary replica.
I wrote this in lieu of adding comments. If the OP decides to add more data from the secondary, I'll address it. The replica can also be changed to not readable, which will solve the problem, since the issue is queries on the secondary.
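One way to check whether a live session really owns the stuck snapshot transaction is to join the snapshot DMV against sys.dm_exec_sessions on the secondary. A sketch using Python and pyodbc (the connection string is illustrative; the same T-SQL can be run directly in SSMS):

import pyodbc

# Illustrative connection string: point it at the readable secondary.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=secondary-node;DATABASE=master;Trusted_Connection=yes;"
    "ApplicationIntent=ReadOnly;"
)

sql = """
SELECT snap.session_id,
       snap.transaction_id,
       snap.transaction_sequence_num,
       snap.elapsed_time_seconds,
       s.login_time,          -- NULL if no live session owns this ID
       s.program_name,
       s.status
FROM sys.dm_tran_active_snapshot_database_transactions AS snap
LEFT JOIN sys.dm_exec_sessions AS s
       ON s.session_id = snap.session_id;
"""

for row in conn.cursor().execute(sql):
    # A NULL login_time, or one far newer than elapsed_time_seconds implies,
    # means the session ID was released or reused while the version-store
    # transaction lingers on.
    print(row)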

Appengine ndb - Transactions can report failure but succeed

I have some concerns over using Appengine after finding this:
https://cloud.google.com/appengine/docs/python/datastore/transactions
Note: If your application receives an exception when committing a transaction, it does not always mean that the transaction failed. You can receive Timeout, TransactionFailedError, or InternalError exceptions in cases where transactions have been committed and eventually will be applied successfully. Whenever possible, make your Cloud Datastore transactions idempotent so that if you repeat a transaction, the end result will be the same.
Ndb transactions:
https://cloud.google.com/appengine/docs/python/ndb/transactions
This post has touched on the subject before but not answered it:
app engine datastore transaction exception
I have searched and read about the issue but cannot find anything more specific.
I have a game where I will do the following:
signup
login
update progress
update password/email
delete account
As it is, only one entity is modified at a time (updating progress, pw, etc.). In the future I might add multiplayer, which might complicate things since it could mean updates to more than one entity at a time.
If we start with what I have, I see some problems. I need to understand exactly how things work.
What does it mean if a failure is reported but it actually succeeds later?
Let's say the progress has been saved successfully before and is level 12. Now the user posts that he/she progressed to level 15. It reports failure but succeeds later. Let's say the user's brother plays with the same profile on another device which is only on level 12 and has not seen the other update yet; he completes level 13 and saves. Now what can happen?
The transaction consists of reading the saved level and, if the new one is bigger, writing it; otherwise keeping the old one.
Will transaction number 2 read the old value, decide that it should write 13, and then be enqueued after transaction 1, so that transaction 1 eventually completes and writes 15 but transaction 2 then overwrites it with 13 because it looked at the old value?
Or will transaction number 1 read the value before the actual true write and write 15, and then transaction 2 read the true value, which is now 15, and therefore not write 13?
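For reference, the compare-and-write described above would look something like this as an ndb transaction (a sketch; the entity and property names are made up). Since Datastore transactions on an entity group are serializable, the stale-read overwrite in the first scenario cannot commit: transaction 2 either sees the committed 15 and skips the write, or it conflicts, is retried, and re-reads the new value.

from google.appengine.ext import ndb

class Profile(ndb.Model):
    level = ndb.IntegerProperty(default=0)

@ndb.transactional(retries=3)
def save_progress(profile_key, new_level):
    profile = profile_key.get()          # read inside the transaction
    if new_level > profile.level:        # only move forward, never back
        profile.level = new_level
        profile.put()
    return profile.level                 # idempotent: safe to repeat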
What does it mean that the transaction has been committed?
For signup it could mean the signup is reported as a failure but is written to the database, and then the user tries to sign up again but the server says that the email already exists. This will annoy people.
Updating emails and passwords: the user updates the password or email and it is reported to have failed, so they log in with the old password just to see that it does not work, because the new password was actually written although it said it failed. The user can work around this with "forgot password", but it is still very annoying.
And when are exceptions like TransactionFailedError raised? After each try, or only after all retries have failed?
Is it possible to give some sort of estimate of how likely all this is? I mean, if I have 10 people using it, I guess it never happens and everything seems to work, but then I get a million users and things start to fail everywhere...
If these problems occur to one out of one million users once a year, then it is not something to worry about, but if they happen to every user once a week, it is a disaster.

Syncing a database with an external payment service

Are there any "design patterns" related to processing important financial operations, so that there's no way a local database can become out of sync because of some errors?
Example:
A financial transaction record is created in a local db, then a request is sent to a remote payment API endpoint to charge a customer. Pseudocode:
record = TransactionRecord.create(timestamp=DateTime.now, amount=billed_amount, status=Processing)
response = Request.post(url=remote_url, data=record.post_data)
if response.ok:
    record.mark_as_ok()
else:
    record.mark_failed()
Now, even if I handle errors that can be returned by the remote payment service, a lot of other bad things can still happen: the DB server can go down, the network connection can go down, etc., at arbitrary points in time.
In the above code the DB server can become inaccessible right after the transaction record is created, so it might not be possible to mark that record as OK even if the financial transaction itself has been performed successfully by the remote service.
In other words: the customer is charged but we don't have that booked.
This can be worked around in a number of ways: by periodically syncing with the remote service, or by investigating TransactionRecords which are still marked as processing but are older than e.g. 10 minutes or an hour.
But my question is whether there are some well-established patterns for handling such situations (where money is involved, so everything should work properly "all the time")?
PS. I'm not sure what tags should I use for this question, feel free to re-tag it.
I don't think there is any 'design pattern' to address cases such as the database connection going down or the network connection going down, as happens in your scenario. Either of those two scenarios is a major fault event and would most likely require manual intervention.
There is not much coding you can do to address them other than being defensive: do proper error checking, provide proper notifications to support, and automatically disable functionality which does not work (if the application detects that the payment service is down, then the 'Submit payment' button should be disabled).
You will be able to cut down significantly on support if you do proper error handling and state management. In your case, the transaction record would have to change its state from Pending -> Submitted -> Processed or Rejected, or something like this.
Also, not every service provides functionality for syncing up.
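A sketch of the Pending -> Submitted -> Processed/Rejected progression mentioned above, in Python, with an idempotency key so that a retried or reconciled charge cannot double-bill (the endpoint, the idempotency-key parameter, and the schema are all illustrative; most payment APIs offer something equivalent):

import sqlite3
import uuid
import requests

PAYMENT_URL = "https://payments.example.com/charge"  # illustrative endpoint

def charge(conn, amount):
    # Record intent first, with a unique idempotency key, so a crash after
    # this point leaves evidence that a charge may be in flight.
    idem_key = str(uuid.uuid4())
    with conn:
        cur = conn.execute(
            "INSERT INTO transactions (idem_key, amount, status) "
            "VALUES (?, ?, 'Submitted')",
            (idem_key, amount),
        )
        record_id = cur.lastrowid
    try:
        resp = requests.post(
            PAYMENT_URL,
            json={"amount": amount, "idempotency_key": idem_key},
            timeout=10,
        )
        status = "Processed" if resp.ok else "Rejected"
    except requests.RequestException:
        # Outcome unknown: leave the record as 'Submitted'. A periodic
        # reconciliation job can query the provider by idem_key and settle it.
        return record_id
    with conn:
        conn.execute("UPDATE transactions SET status = ? WHERE id = ?",
                     (status, record_id))
    return record_id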

Are PeopleSoft Integration Broker asynchronous messages fired serially on the receiving end?

I have a strange problem on a PeopleSoft application. It appears that Integration Broker messages are being processed out of order. There is another possibility: that the commit is being fired asynchronously, allowing the transactions to complete out of order.
There are many inserts of detail records, followed by a trailer record which performs an update on the rows just inserted. Some of the rows are not receiving the update. This problem is sporadic, occurring about once every 6 months, but it causes statistically significant financial reporting errors.
I am hoping that someone has had enough dealings with the internals of PeopleTools to know what it is up to, so that perhaps I can find a work around to the problem.
You don't mention whether you've set this or not, but you have a choice with Integration Broker. All messages flow through message channels, and a channel can be either ordered or unordered. If a channel is ordered then, if a message errors, all subsequent messages queue up behind it and will not be processed until it succeeds.
Whether a channel is ordered or not depends upon the checkbox on the message channel properties in Application Designer. From memory, channels are ordered by default, but you can uncheck the box to increase throughput.
Hope this helps.
PS. As of Tools 8.49 the setup changed slightly: Channels became Queues, Messages became Service Operations, etc.
I heard back from GSC. We had two domains on the sending end as well as two domains on the receiving end, all active. According to them, when you have multiple domains it is possible for each of the servers to pick up some of the messages in the group and therefore process them asynchronously, rather than truly serially.
We are going to reduce the active servers to one and see if it happens again, but it is so sporadic that we may never know for sure.
A few things changed in the PeopleSoft 9 Integration Broker, so please let me know the version of your application. Async services can work with sync now. Message channel properties need to be set properly. I found a similar kind of problem described on www.itwisesolutions.com/PsftTraining.html, but that was more related to the implementation itself.
thanks
