I have a pipe that seems to be set up fine, but it just isn't working.
I ran
select system$pipe_status('"MY_DB"."MY_SCHEMA".MY_PIPE_NAME');
I'm getting back a growing value for numOutstandingMessagesOnChannel.
Can someone please explain what that means?
Is that the number of rows that will be processed? Should I expect this number to go down? Is there a point at which it's too high?
Is there some way to track why or when it goes up?
The documentation merely says:
numOutstandingMessagesOnChannel
Number of messages in the queue that have been queued but not received yet.
numOutstandingMessagesOnChannel
=> Number of messages in the external cloud provider queue that have been queued but not yet received by Snowflake. (This is not real-time; it's an approximate value.)
numOutstandingMessagesOnChannel will fluctuate; if you are continuously ingesting at a very high rate and Snowflake is not able to keep up, throttling will occur.
Best to open a case with Snowflake Support, and we can look into the pipe status in more detail for you.
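If you want to track when the value goes up, one option is to poll the pipe status on a schedule and record the result somewhere. SYSTEM$PIPE_STATUS returns a JSON string, so the counter can be pulled out with PARSE_JSON. A minimal sketch, using the pipe name from your example (the ::number cast assumes the field is present and numeric):

select
    current_timestamp() as checked_at,
    parse_json(system$pipe_status('"MY_DB"."MY_SCHEMA".MY_PIPE_NAME')):numOutstandingMessagesOnChannel::number as outstanding_messages;

Running this every few minutes (for example from a scheduled task or an external monitoring job) and storing the rows gives you a rough history of when the backlog grows, keeping in mind that the value is approximate and not real-time.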
As far as I can tell, when deserializing objects using KafkaDeserializationSchema[T], my 3 options are to return T, return null (ignore the record) or throw an exception (shut down the task manager) [from: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#the-deserializationschema]. I have a requirement to stop processing subsequent messages on the topic where a poison message fails deserialization, but only until a human intervenes and makes a decision whether to ignore the message or replace it with a corrected one.
Has anyone had to deal with a similar requirement?
I was thinking about introducing a separate process function for converting an array of bytes to T, connecting a broadcast stream to it, and reacting to commands from a human operator in all instances of that operator. The problem here is that I can't figure out a way to pause reading from Kafka while the system waits for a human to make a decision. I could throw exceptions and restart indefinitely, or I could keep reading from the topic and hold the incoming messages in state, but I'm worried about additional CPU usage and ballooning state for options 1 and 2 respectively.
Any thoughts anyone? Thanks!
I am trying to get the emails of a bunch of users on our service. I first get a list of messages, and if a message is not in the Datastore, we fetch it. However, I'm using the deferred library to avoid the DeadlineExceeded error. The current algorithm is:
Put each user task on a queue
For each user, get the list of messages
For every 10 messages in this list, enqueue a task to fetch those messages, 10 at a time.
However, I realized that this also exceeds the rate limit since I could be doing more than 10 queries/sec. When I tried to do only 1 message at a time instead of 10, and included getting the list of messages (which makes 1 network request for each page of emails), I got an error saying I was using too much memory and my process was shut down.
What is the best algorithm so I can ensure I am always under 10 qps to GMail and yet not run out of memory?
I don't think hitting the rate limit is a big deal; just make sure you handle the error and slow down a little in that case. Fetching messages in batches of 10 seems fine.
If you run out of memory in the scenario that you described, that means you have a memory leak or an infinite loop in your code. 10 queries can be easily processed on the smallest instance possible.
I have an error log which reports a deadlock:
Transaction (Process ID 55) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
I am trying to reproduce this error, but my standard deadlock SQL code produces a different error:
Transaction (Process ID 54) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
I want to be very clear that I am not asking what a deadlock is. I do understand the basics.
My question is: what is the meaning of lock | communication buffer resources in this context? What are "communication buffer resources"? Does the lock | signify anything?
My best guess is that a communication buffer is used when parallel threads combine their results. Can anyone confirm or deny this?
My ultimate goal is to somehow trigger the first error to occur again.
I would interpret the message as a deadlock on some combination of Lock resources or Communication Buffer resources. "Lock resources" are ordinary object locks, and "Communication Buffer resources" are exchangeEvents used for combining results of parallel queries. These are described further in https://blogs.msdn.microsoft.com/bartd/2008/09/24/todays-annoyingly-unwieldy-term-intra-query-parallel-thread-deadlocks/ where the relevant paragraph is:
An "exchangeEvent" resource indicates the presence of parallelism operators in a query plan. The idea is that the work for an operation like a large scan, sort, or join is divided up so that it can be executed on multiple child threads. There are "producer" threads that do the grunt work and feed sets of rows to "consumers". Intra-query parallel requires signaling between these worker threads: the consumers may have to wait on producers to hand them more data, and the producers may have to wait for consumers to finish processing the last batch of data. Parallelism-related waits show up in SQL DMVs as CXPACKET or EXCHANGE wait types (note that the presence of these wait types is normal and simply indicates the presence of parallel query execution -- by themselves, these waits don't indicate that this type or any other type of deadlock is occurring).
The deadlock graph for one of these I've seen included a set of processes with only one SPID and a graph of objectlocks and exchangeEvents. I guess the message "Transaction (Process ID 55) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction" appears instead of "Intra-query parallelism caused your server command (process ID #51) to deadlock. Rerun the query without intra-query parallelism by using the query hint option (maxdop 1)" because of the combination of objectlocks and exchangeevents, or else the message has been changed in SQL Server since the article was written.
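If you want to confirm that parallelism is in play on the server at all, a quick (if coarse) check is to look for the wait types mentioned in the quoted paragraph in the wait-stats DMV. This is server-wide cumulative data, not tied to the specific deadlock, so treat it only as a hint:

SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('CXPACKET', 'EXCHANGE');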
Your issue is parallelism-related, and the error message has "no meaning" in the sense that it does not reflect your actual problem. And no, do not go and change the MAXDOP setting. To get to the cause of the error you need to use trace flag 1204; have a look at how to use the trace flag and what information it gives you.
When you do this you'll get the answer as to why, where, and which statement caused the lock. From that point you should be able to google the rest yourself; if not, post the output and you'll get the answer you need.
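For reference, the trace flag can be turned on globally like this (1204 writes deadlock details to the SQL Server error log; the newer trace flag 1222 produces more verbose output if you prefer that):

-- enable deadlock reporting to the error log, server-wide
DBCC TRACEON (1204, -1);

-- check which trace flags are currently active
DBCC TRACESTATUS (-1);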
You can use MAXDOP 1 as a query hint - i.e. run that query on one CPU - without affecting the rest of the server.
This will avoid the error for that query - it doesn't tell you why it's failing, but it does provide a work-around if you have to get it working fast :-)
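For example, with a purely hypothetical query (the table and column names below are placeholders, not from the original question), the hint goes at the end of the statement and affects only that statement:

SELECT CustomerId, SUM(Amount) AS Total    -- placeholder columns
FROM dbo.Orders                            -- placeholder table
GROUP BY CustomerId
OPTION (MAXDOP 1);                         -- disable parallelism for this query only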
I have a query like below in PostgreSQL:
UPDATE
    queue
SET
    status = 'PROCESSING'
WHERE
    queue.status = 'WAITING' AND
    queue.id = (SELECT id FROM queue WHERE status = 'WAITING' LIMIT 1)
RETURNING
    queue.id
and many workers try to process one piece of work at a time (that's why I have the sub-query with LIMIT 1). After this update, each worker grabs the returned id and processes the work, but sometimes two workers grab the same work and process it twice or more. The isolation level is Read Committed.
My question is: how can I guarantee each piece of work is processed only once? I know there are many posts out there about this, but I have tried most of their suggestions and they didn't help:
I have tried SELECT FOR UPDATE, but it caused a deadlock situation.
I have tried pg_try_advisory_xact_lock, but it caused an "out of shared memory" error.
I tried adding AND pg_try_advisory_xact_lock(queue.id) to the outer query's WHERE clause, but that didn't help either.
Any help would be appreciated.
A lost update won't occur in the situation you describe, but it won't work properly either.
What will happen in the example you've given above is that given (say) 10 workers started simultaneously, all 10 of them will execute the subquery and get the same ID. They will all attempt to lock that ID. One of them will succeed; the others will block on the first one's lock. Once the first backend commits or rolls back, the 9 others will race for the lock. One will get it, re-check the WHERE clause and see that the queue.status test no longer matches, and return without modifying any rows. The same will happen with the other 8. So you used 10 queries to do the work of one query.
If you fail to explicitly check the UPDATE result and see that zero rows were updated you might think you were getting lost updates, but you aren't. You just have a concurrency bug in your application caused by a misunderstanding of the order-of-execution and isolation rules. All that's really happening is that you're effectively serializing your backends so that only one at a time actually makes forward progress.
The only way PostgreSQL could avoid having them all get the same queue item ID would be to serialize them, so it didn't start executing query #2 until query #1 finished. If you want to you can do this by LOCKing the queue table ... but again, you might as well just have one worker then.
You can't get around this with advisory locks, not easily anyway. Hacks where you iterated down the queue using non-blocking lock attempts until you got the first lockable item would work, but would be slow and clumsy.
You are attempting to implement a work queue using the RDBMS. This will not work well. It will be slow, it will be painful, and getting it both correct and fast will be very very hard. Don't roll your own. Instead, use a well established, well tested system for reliable task queueing. Look at RabbitMQ, ZeroMQ, Apache ActiveMQ, Celery, etc. There's also PGQ from Skytools, a PostgreSQL-based solution.
Related:
In PostgreSQL, do multiple UPDATES to different rows in the same table having a locking conflict?
Can multiple threads cause duplicate updates on constrained set?
Why do we need message brokers like rabbitmq over a database like postgres?
SKIP LOCKED can be used to implement a queue in PostgreSQL.
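A minimal sketch of the queue-pop query using the table and columns from the question (FOR UPDATE SKIP LOCKED requires PostgreSQL 9.5 or later). Each worker locks and claims a different row instead of all racing for the same one:

UPDATE queue
SET status = 'PROCESSING'
WHERE id = (
    SELECT id
    FROM queue
    WHERE status = 'WAITING'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;

If no unlocked row with status = 'WAITING' exists, the UPDATE simply affects zero rows, which a worker can treat as "queue currently empty".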
In PostgreSQL, a lost update can happen under READ COMMITTED and READ UNCOMMITTED, but if you use SELECT FOR UPDATE under READ COMMITTED or READ UNCOMMITTED, the lost update doesn't happen.
In addition, a lost update doesn't happen under REPEATABLE READ or SERIALIZABLE whether or not you use SELECT FOR UPDATE; instead, an error is raised if a lost-update condition is detected.
I have a large table, 1B+ records, that I need to pull down and run an algorithm on every record. How can I use ADO.NET to execute a "select * from table" query asynchronously and start reading the rows one by one while ADO.NET is still receiving the data?
I also need to dispose of the records after I read them to save on memory. So I am looking for a way to pull a table down record by record and basically shove each record into a queue for processing.
My data sources are Oracle and MSSQL. I have to do this for several data sources.
You should use SSIS for this.
You need a bit of background on how the ADO.NET data providers work to understand what you can and can't do. Let's take the SqlClient provider as an example. It is true that it is possible to execute queries asynchronously with BeginExecuteReader, but this asynchronous execution only lasts until the query starts returning results. At the wire level the SQL text is sent to the server, the server starts churning through the query execution and eventually starts pushing result rows back to the client. As soon as the first packet comes back to the client, the asynchronous execution is done and the completion callback is executed. After that the client uses the SqlDataReader.Read() method to advance the result set. There are no asynchronous methods in the SqlDataReader. This pattern works wonders for complex queries that return few results after some serious processing is done. While the server is busy producing the result, the client is idle with no threads blocked. However, things are completely different for simple queries that produce large result sets (as seems to be the case for you): the server will immediately produce results and will continue to push them back to the client. The asynchronous callback will fire almost instantaneously and the bulk of the time will be spent by the client iterating over the SqlDataReader.
You say you're thinking of placing the records into an in-memory queue first. What is the purpose of that queue? If your algorithm's processing is slower than the throughput of the DataReader result set iteration, then this queue will start to build up. It will consume live memory and eventually will exhaust the memory on the client. To prevent this you would have to build in a flow control mechanism, i.e. if the queue size is bigger than N, don't put any more records into it. But to achieve this you would have to suspend the data reader iteration, and if you do that you push flow control back to the server, which will suspend the query until the communication pipe is available again (until you start reading from the reader). Ultimately the flow control has to be propagated all the way to the server, which is always the case in any producer-consumer relationship: the producer has to stop, otherwise intermediate queues fill up. Your in-memory queue serves no purpose at all, other than complicating things. You can simply process items from the reader one by one, and if your rate of processing is too slow, the data reader will cause flow control to be applied to the query running on the server. This happens automatically, simply because you don't call the DataReader.Read method.
To summarise: for processing a large result set you cannot do asynchronous processing, and there is no need for a queue.
Now the difficult part.
Is your processing doing any sort of update back in the database? If yes, then you have much bigger problems:
You cannot use the same connection to write back the results, because it is busy with the data reader. SqlClient for SQL Server supports MARS, but that only solves the problem on SQL 2005/2008.
If you enroll the read and the update in a transaction and your updates occur on a different connection (see above), then this means using distributed transactions (even when the two connections involved point back to the same server). Distributed transactions are slow.
You will need to split the processing into several batches because it is very bad to process 1B+ records in a single transaction. This also means that you are going to have to be able to resume processing of an aborted batch, which means you must be able to identify records that were already processed (unless processing is idempotent).
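A rough sketch of the resumable-batch idea in T-SQL, with placeholder table and column names (your schema will differ): stamp each record as it is processed, so that a restarted batch only picks up rows that were never marked.

-- placeholder schema: a nullable ProcessedAt column identifies finished rows
DECLARE @CurrentId bigint = 42;            -- id of the record the algorithm just handled

UPDATE dbo.BigTable
SET ProcessedAt = SYSUTCDATETIME()
WHERE Id = @CurrentId;

-- a restarted run claims only rows that were never marked
SELECT TOP (10000) Id
FROM dbo.BigTable
WHERE ProcessedAt IS NULL
ORDER BY Id;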
A combination of a DataReader and an iterator block (a.k.a. generator) should be a good fit for this problem. The default DataReaders provided by Microsoft pull data one record at a time from a datasource.
Here's an example in C#:
static IEnumerable<User> RetrieveUsers(DbDataReader reader)
{
    // Read() advances one row at a time; each row is yielded as it arrives,
    // so the full result set is never materialized in memory.
    while (reader.Read())
    {
        User user = new User
        {
            Name = reader.GetString(0),
            Surname = reader.GetString(1)
        };
        yield return user;
    }
}
A good approach would be to pull back the data in blocks, iterate through them adding to your queue, then call again. This is going to be better than hitting the DB for each row. If you are pulling them back via a numeric PK then this will be easy; if you need to order by something else you can use ROW_NUMBER() to do this, as sketched below.
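A sketch of both options in T-SQL with placeholder table, column, and variable names (Oracle needs slightly different syntax for the TOP part): keyset paging on a numeric PK, and ROW_NUMBER() when you have to page on some other ordering.

DECLARE @LastSeenId bigint = 0, @StartRow bigint = 1, @EndRow bigint = 10000;

-- keyset paging on a numeric primary key: remember the last Id you pulled
SELECT TOP (10000) Id, Payload
FROM dbo.BigTable
WHERE Id > @LastSeenId
ORDER BY Id;

-- paging on an arbitrary ordering with ROW_NUMBER()
SELECT Id, Payload
FROM (
    SELECT Id, Payload,
           ROW_NUMBER() OVER (ORDER BY SomeColumn) AS rn
    FROM dbo.BigTable
) AS numbered
WHERE rn BETWEEN @StartRow AND @EndRow;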
Just use the DbDataReader (just like Richard Nienaber said). It is a forward-only way of scrolling through the retrieved data. You don't have to dispose of your data because a DbDataReader is forward only.
When you use the DbDataReader it seems that the records are retrieved one by one from the database.
It is however slightly more complicated:
Oracle (and probably MySQL) will fetch a few hundred rows at a time to decrease the number of round trips to the database. You can configure the fetch size of the DataReader. Most of the time it will not matter whether you fetch 100 rows or 1000 rows per round trip. However, a very low value like 1 or 2 rows slows things down, because retrieving the data then requires too many round trips.
You probably don't have to set the fetch size manually; the default will be just fine.
edit1: See here for an Oracle example: http://www.oracle.com/technology/oramag/oracle/06-jul/o46odp.html