Interpretation of Steps (1/2) failed in Snowflake - snowflake-cloud-data-platform

I'm running a SQL query on Snowflake that has been running for more than 24 hours.
When I go to 'Profile Overview', I see Steps (1/2, Failed).
However, under the 'Details' tab, I see that the status of the query is still 'Running'.
Can someone please explain whether the query is still running or whether there has been an error?

This must be retrying after failing on a first attempt. For example, it might have failed on a memory error and would retry internally twice (total three tries) before either successfully completing or failing with an error (in case all tries fail).
You may reach out to Snowflake Support to look into this.

Your query is likely resource (memory) intensive. Things to check in your query include:
- a large sort, or a large number of sorts?
- a large GROUP BY (or many aggregates)?
- many window functions with ORDER BY?
For a query that has been running for many hours, it would likely be best to dig deeper, either with the help of Support (open a case) or by reviewing the query pattern for opportunities to simplify it.


Index on array element attribute extremely slow

I'm new to mongodb but not new to databases. I created a collection of documents that look like this:
{
  _id: ObjectId('5e0d86e06a24490c4041bd7e'),
  match: [
    {
      _id: ObjectId('5e0c35606a24490c4041bd71'),
      ts: 1234456,
      ...
    },
    ...
  ]
}
So there is a list of objects on the documents and within the list there might be many objects with the same _id field. I have a handful of documents in this collection and my query that selects on selected match._id's is horribly slow. I mean unnaturally slow.
Query is simply this: {match: {$elemMatch: {_id:match._id }}} and literally hangs the system for like 15 seconds returning 15 matching documents out of 25 total!
I put an index on the collection like this:
collection.createIndex({"match._id" : 1}) but that didn't help.
Explain says execution time is 0 and says it's using the index but it still takes 15 seconds or longer to complete.
I'm getting the same slowness in nodejs and in compass.
Explain Output:
{"explainVersion":"1","queryPlanner":{"namespace":"hp-test-39282b3a-9c0f-4e1f-b953-0a14e00ec2ef.lead","indexFilterSet":false,"parsedQuery":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"maxIndexedOrSolutionsReached":false,"maxIndexedAndSolutionsReached":false,"maxScansToExplodeReached":false,"winningPlan":{"stage":"FETCH","filter":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"inputStage":{"stage":"IXSCAN","keyPattern":{"match._id":1},"indexName":"match._id_1","isMultiKey":true,"multiKeyPaths":{"match._id":["match"]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"match._id":["[ObjectId('5e0c3560e5a9e0cbd994fa52'), ObjectId('5e0c3560e5a9e0cbd994fa52')]"]}}},"rejectedPlans":[]},"executionStats":{"executionSuccess":true,"nReturned":15,"executionTimeMillis":0,"totalKeysExamined":15,"totalDocsExamined":15,"executionStages":{"stage":"FETCH","filter":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"nReturned":15,"executionTimeMillisEstimate":0,"works":16,"advanced":15,"needTime":0,"needYield":0,"saveState":0,"restoreState":0,"isEOF":1,"docsExamined":15,"alreadyHasObj":0,"inputStage":{"stage":"IXSCAN","nReturned":15,"executionTimeMillisEstimate":0,"works":16,"advanced":15,"needTime":0,"needYield":0,"saveState":0,"restoreState":0,"isEOF":1,"keyPattern":{"match._id":1},"indexName":"match._id_1","isMultiKey":true,"multiKeyPaths":{"match._id":["match"]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"match._id":["[ObjectId('5e0c3560e5a9e0cbd994fa52'), 
ObjectId('5e0c3560e5a9e0cbd994fa52')]"]},"keysExamined":15,"seeks":1,"dupsTested":15,"dupsDropped":0}},"allPlansExecution":[]},"command":{"find":"lead","filter":{"match":{"$elemMatch":{"_id":"5e0c3560e5a9e0cbd994fa52"}}},"skip":0,"limit":0,"maxTimeMS":60000,"$db":"hp-test-39282b3a-9c0f-4e1f-b953-0a14e00ec2ef"},"serverInfo":{"host":"Dans-MacBook-Pro.local","port":27017,"version":"5.0.9","gitVersion":"6f7dae919422dcd7f4892c10ff20cdc721ad00e6"},"serverParameters":{"internalQueryFacetBufferSizeBytes":104857600,"internalQueryFacetMaxOutputDocSizeBytes":104857600,"internalLookupStageIntermediateDocumentMaxSizeBytes":104857600,"internalDocumentSourceGroupMaxMemoryBytes":104857600,"internalQueryMaxBlockingSortMemoryUsageBytes":104857600,"internalQueryProhibitBlockingMergeOnMongoS":0,"internalQueryMaxAddToSetBytes":104857600,"internalDocumentSourceSetWindowFieldsMaxMemoryBytes":104857600},"ok":1}
The explain output confirms that the operation that was explained is perfectly efficient. In particular we see:
- The expected index being used with tight indexBounds
- Efficient access of the data (totalKeysExamined == totalDocsExamined == nReturned)
- No meaningful duration ("executionTimeMillis":0 which implies that the operation took less than 0.5ms for the database to execute)
Therefore the slowness that you're experiencing for that particular operation is not related to the efficiency of the plan itself. This doesn't always rule out the database (or its underlying server) as the source of the slowness completely, but it is usually a pretty strong indicator that either the problem is elsewhere or that there are multiple factors at play.
I would suggest the following as potential next steps:
- Check the mongod log file (you can confirm its location by running db.adminCommand("getCmdLineOpts") via the shell connected to the instance). By default any operation slower than 100ms is captured. This will help in a variety of ways:
  - If there is a log entry (with a meaningful duration) then it confirms that the slowness is being introduced while the database is processing the operation. It could also give some helpful hints as to why that might be the case (waiting for locks or server resources such as storage for example).
  - If an associated entry cannot be found, then that would be significantly stronger evidence that we are looking in the wrong place for the source of the slowness.
- Is the operation that you gathered explain for the exact one that the application and Compass are observing as being slow? Were you connected to the same server and namespace? Is the explained operation simplified in some way, such as the original operation containing sort, projection, collation, etc?
  - As a relevant example that combines these two, I notice that there are skip and limit parameters applied to the command explained on a mongod seemingly running on a laptop. Are those parameters non-zero when running the application and does the application run against a different database with a larger data set?
- The explain command doesn't include everything that an application would. Notably absent is the actual time it takes to send the results across the network. If you had particularly large documents that could be a factor, though it seems unlikely to be the culprit in this particular situation.
- How exactly are you measuring the full execution time? Does it potentially include the time to connect to the database? In this case you mentioned that Compass itself also demonstrates the slowness, so that may rule out most of this.
- What else is running on the server hosting the database? Is there a container or VM involved? Would the database or the underlying server be experiencing resource contention due to concurrency?
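To act on the log-file suggestion above, here is a minimal Python sketch for pulling slow-operation entries out of a mongod log. It assumes the structured JSON log format (one JSON object per line, with the duration under attr.durationMillis and the namespace under attr.ns). The function name slow_operations, the sample lines and the namespace test.lead are made up for illustration, so adjust the field names if your log looks different.

```python
import json


def slow_operations(log_lines, namespace=None, min_millis=100):
    """Yield (duration_ms, namespace, message) for slow entries in a JSON log."""
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line; skip it
        attr = entry.get("attr") if isinstance(entry, dict) else None
        if not isinstance(attr, dict):
            continue
        duration = attr.get("durationMillis")  # assumed field name
        if not isinstance(duration, (int, float)) or duration < min_millis:
            continue
        if namespace is not None and attr.get("ns") != namespace:
            continue
        yield duration, attr.get("ns"), entry.get("msg")


# Hypothetical sample lines, shaped like structured mongod log entries.
SAMPLE_LOG = [
    '{"s":"I","c":"COMMAND","msg":"Slow query","attr":{"ns":"test.lead","durationMillis":15234}}',
    '{"s":"I","c":"NETWORK","msg":"Connection accepted","attr":{"remote":"127.0.0.1:50000"}}',
    'this line is not JSON',
]

for duration, ns, msg in slow_operations(SAMPLE_LOG, namespace="test.lead"):
    print(f"{duration} ms  {ns}  {msg}")
```

With a real log you would pass an open file object instead of SAMPLE_LOG. If nothing comes back for the namespace in question, that supports the idea that the time is being lost outside the database.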
Two additional minor asides:
- 25 total documents in a collection is extremely small. I would expect even the smallest hardware to be able to process such a request without an index unless there was some complicating factor.
- Assuming that match is always an array then the $elemMatch operator is not strictly necessary for this particular query. You can read more about that here. I would not expect this to have a performance impact for your situation.
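On the question of how the full execution time is being measured: one way to see where the 15 seconds go is to time the connect, query and fetch phases separately. The sketch below is only an illustration. The helpers phase and measure are hypothetical names, and the fake_* functions are stand-ins so that it runs without a database; in a real test you would pass callables that wrap your driver.

```python
import time
from contextlib import contextmanager


@contextmanager
def phase(timings, name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


def measure(connect, run_query, fetch_all):
    """Time connecting, issuing the query, and draining the results separately."""
    timings = {}
    with phase(timings, "connect"):
        conn = connect()
    with phase(timings, "query"):
        cursor = run_query(conn)
    with phase(timings, "fetch"):
        docs = fetch_all(cursor)
    return docs, timings


# Stand-in callables so the sketch runs without a database.
def fake_connect():
    time.sleep(0.05)  # pretend the connection handshake is the slow part
    return "connection"


def fake_query(conn):
    return iter(range(15))


def fake_fetch(cursor):
    return list(cursor)


docs, timings = measure(fake_connect, fake_query, fake_fetch)
for name, seconds in timings.items():
    print(f"{name:8s} {seconds * 1000:8.2f} ms")
```

Keep in mind that some drivers are lazy (a find call may only build a cursor), so the server round trip can show up in the fetch phase rather than the query phase.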

Concurrent queries in PostgreSQL - what is actually happening?

Let us say we have two users running a query against the same table in PostgreSQL. So,
User 1: SELECT * FROM table WHERE year = '2020' and
User 2: SELECT * FROM table WHERE year = '2019'
Are they going to be executed at the same time as opposed to executing one after the other?
I would expect that if I have 2 processors, I can run both at the same time. But I am thinking that matters become far more complicated depending on where the data is located (e.g. disk) given that it is the same table, whether there is partitioning, configurations, transactions, etc. Can someone help me understand how I can ensure that I get my desired behaviour as far as PostgreSQL is concerned? Under which circumstances will I get my desired behaviour and under which circumstances will I not?
EDIT: I have found this other question which is very close to what I was asking. It is a bit old and doesn't have many answers, so I would appreciate a fresh outlook on it.
If the two users have two independent connections and they don't go out of their way to block each other, then the queries will execute at the same time. If they need to access the same buffer at the same time, or read the same disk page into a buffer at the same time, they will use very fast locking/coordination methods (LWLocks, spin locks, or atomic operations like CAS) to coordinate that. The exact techniques vary from version to version, as better methods become widely available on supported platforms and as people find the time to change the implementation to use those better methods.
"... how I can ensure that I get my desired behaviour as far as PostgreSQL is concerned?"
You should always get the correct answer to your query (or possibly some kind of ERROR indicating a serialization failure if you are using the highest, non-default isolation level, though that doesn't seem to be a risk if each of those queries runs in a single-statement transaction).
I think you are overthinking this. The point of using a database management system is that you don't need to micromanage it.
Also, "parallel-query" refers to a single query using multiple CPUs, not to different queries running at the same time.

Postgresql inserts stop at random number of records

I am developing a test application that requires me to insert 1 million records in a Postgresql database but at random points the insert stops and if I try to restart the insertion process, the application refuses to populate the table with more records. I've read that databases have a size cap, which is around 4 Gb, but I'm sure my database didn't even come close to this value.
So, what other reasons could be for why insertion stopped?
It happened a few times, once capping at 170872 records, another time at 25730 records.
I know the question might sound silly but I can't find any other reasons for why it stops inserting.
Thanks in advance!
Indeed, the problem isn't the database cap. Here are the official figures for PostgreSQL:
- Maximum Database Size: Unlimited
- Maximum Table Size: 32 TB
- Maximum Row Size: 1.6 TB
- Maximum Field Size: 1 GB
- Maximum Rows per Table: Unlimited
- Maximum Columns per Table: 250 - 1600 depending on column types
- Maximum Indexes per Table: Unlimited
Error in log file:
2012-03-26 12:30:12 EEST WARNING: there is no transaction in progress
So I'm looking for an answer that fits this issue. If you can give any hints I would be very grateful.
"I've read that databases have a size cap, which is around 4 Gb"
I rather doubt that. It's certainly not true about PostgreSQL.
"[...] at random points the insert stops and if I try to restart the insertion process, the application refuses to populate the table with more records"
Again, I'm afraid I doubt this. Unless your application has become self-aware, it isn't refusing to do anything. It might be crashing, or locking, or waiting for something to happen though.
"I know the question might sound silly but I can't find any other reasons for why it stops inserting."
I don't think you've looked hard enough. Obvious things to check:
- Are you getting any errors in the PostgreSQL logs?
- If not, are you sure you're logging errors? Issue a bad query to check.
- Are you getting any errors in the application?
- If not, are you sure you're checking? Again, test it.
- What is/are the computer(s) up to? How much CPU/RAM/Disk IO is in use? Any unusual activity?
- Are any unusual locks being taken? (Check the pg_locks view.)
If you asked the question having checked the above then there's someone who'll be able to help. Probably though, you'll figure it out yourself once you've got the facts in front of you.
OK - if you're getting "no transaction in progress" that means you're issuing a commit/rollback but outside of an explicit transaction. If you don't issue a "BEGIN" then each statement gets its own transaction.
This is unlikely to be the cause of the problem.
Something is causing the inserts to stop, and you've still not told us what. You said earlier you weren't getting any errors inside the application. That shouldn't be possible: if PostgreSQL is returning an error, you should be picking it up in the application.
It's difficult to be more helpful without more accurate information. Every statement you send to PostgreSQL will return a status code. If you get an error inside a multi-statement transaction then all the statements in that transaction will be rolled back. You've either got some confused transaction control in the application or it is falling down for some other reason.
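To make the point about status codes and rolled-back transactions concrete, here is a small sketch of a batch insert that surfaces the error instead of stopping silently. Note the assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server, and the records table and the insert_batch helper are made up for illustration. The details differ between drivers, but the principle (one transaction per batch, roll back and report on failure) is the same.

```python
import sqlite3


def insert_batch(conn, rows):
    """Insert rows in one transaction; return (ok, error)."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            conn.executemany(
                "INSERT INTO records (id, payload) VALUES (?, ?)", rows
            )
    except sqlite3.Error as exc:
        return False, exc  # surface the error to the caller
    return True, None


conn = sqlite3.connect(":memory:")  # stand-in database, not PostgreSQL
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
)

ok1, err1 = insert_batch(conn, [(1, "a"), (2, "b")])
# The second batch reuses id 1, so the whole batch is rolled back,
# including the otherwise valid row with id 3.
ok2, err2 = insert_batch(conn, [(3, "c"), (1, "duplicate id")])

count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(ok1, ok2, type(err2).__name__, count)
```

If the application logs the returned error for every failed batch, "the insert stops at a random point" turns into a specific message you can act on.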
One possibility is that the OP is using SSL and the ssl_renegotiation_limit is reached. In any case, set log_connections and log_disconnections to "on" and check the log file.
I found out what was the problem with my insert command, and although it might seem funny it's one of those things you never thought could go wrong.
My application is developed in Django and has a command that simply calls for the file that does the insert operations into the tables.
i.e. in the command line terminal I just write:
time python populate_sql
The reason I use the time command is that I want to see how long the insertion takes to execute. Well, the problem was here. That time command reported an error, an Out of memory error, which stopped the insertion into the database. I found this out while running the command with the --verbose option, which lets you see all the details of the command.
I would like to thank you all for your answers, for the things that I have learned from them and for the time you used trying to help me.
If you have a Django application that performs a lot of database operations, my advice is to set the DEBUG setting to False, because with DEBUG enabled Django keeps a record of the SQL queries it runs, which eats up a lot of memory over time.
DEBUG = False
And in the end, thank you again for the support Richard Huxton!

Solr indexing - Master/Slave replication, how to handle huge index and high traffic?

I'm currently facing an issue with Solr (more exactly with slave replication), and after having spent quite some time reading online I find myself having to ask for some enlightenment.
- Does Solr have some limitation in size for its index?
When dealing with a single master, when is it the right moment to decide to use multi cores or multi indexes?
Is there any indications on when reaching a certain size of index, partitioning is recommended?
- Is there any max size when replicating segments from master to slave?
When replicating, is there a segment size limit when the slave won't be able to download the content and index it? What is the threshold to which a slave won't be able to replicate when there's a lot of traffic to retrieve info and lots of new documents to replicate.
To be more factual, here is the context that led me to these questions:
We want to index a fair amount of documents, but when the amount reaches more than a dozen million, the slaves can't handle it and start failing to replicate with a SnapPull error.
The documents are composed with a few text fields (name, type, description, ... about 10 other fields of let's say 20 characters max).
We have one master, and 2 slaves which replicate data from the master.
This is my first time working with Solr (I work usually on webapps using spring, hibernate... but no use of Solr), so I'm not sure how to tackle this issue.
Our idea for the moment is to add multiple cores to the master, and have a slave replicating from each of these cores.
Is it the right way to go?
If it is, how can we determine the number of cores needed? Right now we're just going to try and see how it behaves and adjust if necessary, but I was wondering if there were any best practices or benchmarks that have been done on this specific topic.
For this amount of documents with this average size, x cores or indexes are needed ...
Thanks for any help in how I could deal with a huge amount of documents of average size!
Here is a copy of the error that is being thrown when a slave is trying to replicate:
ERROR [org.apache.solr.handler.ReplicationHandler] - <SnapPull failed >
org.apache.solr.common.SolrException: Index fetch failed :
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(
at org.apache.solr.handler.ReplicationHandler.doFetch(
at org.apache.solr.handler.SnapPuller$
at java.util.concurrent.Executors$
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(
at java.util.concurrent.FutureTask.runAndReset(
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(
at java.util.concurrent.ScheduledThreadPoolExecutor$
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: java.lang.RuntimeException: read past EOF
at org.apache.solr.core.SolrCore.getSearcher(
at org.apache.solr.update.DirectUpdateHandler2.commit(
at org.apache.solr.handler.SnapPuller.doCommit(
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(
... 11 more
Caused by: read past EOF
at org.apache.lucene.index.SegmentInfos$2.doBody(
at org.apache.lucene.index.SegmentInfos$
at org.apache.lucene.index.SegmentInfos$
at org.apache.lucene.index.SegmentInfos.readCurrentVersion(
at org.apache.lucene.index.DirectoryReader.isCurrent(
at org.apache.lucene.index.DirectoryReader.doReopen(
at org.apache.lucene.index.DirectoryReader.reopen(
at org.apache.solr.core.SolrCore.getSearcher(
... 14 more
After Mauricio's answer, the solr libraries have been updated to 1.4.1 but this error was still raised.
I increased the commitReserveDuration, and even though the "SnapPull Failed" error seems to have disappeared, another one started being raised. I'm not sure why, as I can't seem to find many answers on the web:
ERROR [org.apache.solr.servlet.SolrDispatchFilter] - <ClientAbortException:
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(
at org.apache.tomcat.util.buf.ByteChunk.append(
at org.apache.catalina.connector.OutputBuffer.writeBytes(
at org.apache.catalina.connector.OutputBuffer.write(
at org.apache.catalina.connector.CoyoteOutputStream.write(
at org.apache.solr.common.util.FastOutputStream.flushBuffer(
at org.apache.solr.common.util.JavaBinCodec.marshal(
at org.apache.solr.request.BinaryResponseWriter.write(
at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(
at org.apache.catalina.core.ApplicationFilterChain.doFilter(
at org.apache.catalina.core.StandardWrapperValve.invoke(
at org.apache.catalina.core.StandardContextValve.invoke(
at org.apache.catalina.core.StandardHostValve.invoke(
at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(
at org.apache.catalina.valves.ErrorReportValve.invoke(
at org.apache.catalina.core.StandardEngineValve.invoke(
at org.apache.catalina.connector.CoyoteAdapter.service(
at org.apache.coyote.http11.Http11AprProcessor.process(
at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(
Caused by:
at org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer(
at org.apache.coyote.http11.InternalAprOutputBuffer$SocketOutputBuffer.doWrite(
at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(
at org.apache.coyote.http11.InternalAprOutputBuffer.doWrite(
at org.apache.coyote.Response.doWrite(
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(
... 22 more
ERROR [org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/].[SolrServer]] - <Servlet.service() for servlet SolrServer threw exception>
at org.apache.catalina.connector.ResponseFacade.sendError(
at org.apache.solr.servlet.SolrDispatchFilter.sendError(
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(
at org.apache.catalina.core.ApplicationFilterChain.doFilter(
at org.apache.catalina.core.StandardWrapperValve.invoke(
at org.apache.catalina.core.StandardContextValve.invoke(
at org.apache.catalina.core.StandardHostValve.invoke(
at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(
at org.apache.catalina.valves.ErrorReportValve.invoke(
at org.apache.catalina.core.StandardEngineValve.invoke(
at org.apache.catalina.connector.CoyoteAdapter.service(
at org.apache.coyote.http11.Http11AprProcessor.process(
at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(
I still wonder what are the best practices to handle a big index (more than 20G) containing a lot of documents with solr. Am I missing some obvious links somewhere? Tutorials, documentations?
Cores are a tool primarily used to have different schemas in a single Solr instance. They are also used as on-deck indexes. Sharding and replication are orthogonal issues.
You mention "a lot of traffic". That's a highly subjective measure. Instead, try to determine how many QPS (queries per second) you need from Solr. Also, does a single Solr instance answer your queries fast enough? Only then can you determine if you need to scale out. A single Solr instance can handle a lot of traffic; maybe you don't even need to scale.
Make sure you run Solr on a server with plenty of memory (and make sure Java has access to it). Solr is quite memory-hungry; if you put it on a memory-constrained server, performance will suffer.
As the Solr wiki explains, use sharding if a single query takes too long to run, and replication if a single Solr instance can't handle the traffic. "Too long" and "traffic" depend on your particular application. Measure them.
Solr has lots of settings that affect performance: cache auto-warming, stored fields, merge factor, etc. Check out SolrPerformanceFactors.
There are no hard rules here. Every application has different search needs. Simulate and measure for your particular scenario.
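As a starting point for the "measure" advice, here is a rough sketch of how you could turn request timestamps (for example, taken from your access log) into average and peak QPS. The function name qps_profile and the sample timestamps are made up for illustration, and how you extract timestamps from your own logs is left open.

```python
import math
from collections import Counter


def qps_profile(timestamps):
    """Return (average_qps, peak_qps) for request timestamps given in seconds."""
    if not timestamps:
        return 0.0, 0
    # Bucket requests by the whole second in which they arrived.
    per_second = Counter(math.floor(t) for t in timestamps)
    # Number of one-second buckets between the first and last request, inclusive.
    span = math.floor(max(timestamps)) - math.floor(min(timestamps)) + 1
    return len(timestamps) / span, max(per_second.values())


# Hypothetical arrival times: three requests in second 0, one in second 1,
# none in second 2, one in second 3.
average, peak = qps_profile([0.1, 0.5, 0.9, 1.2, 3.7])
print(f"average {average:.2f} QPS, peak {peak} QPS")
```

The peak figure is usually the more useful one for deciding whether a single instance is enough, since averages hide bursts.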
About the replication error, make sure you're running 1.4.1 since 1.4.0 had a bug with replication.

What does the sys.dm_exec_query_optimizer_info "timeout" record indicate?

During an investigation of some client machines losing their connection with SQL Server 2005, I ran into the following line of code on the web:
Select * FROM sys.dm_exec_query_optimizer_info WHERE counter = 'timeout'
When I run this query on our server - we are getting the following results:
counter - occurrence - value
timeout - 9100 - 1
As far as I can determine, this means that the query optimizer has timed out 9100 times while trying to optimize queries run against our server. We are, however, not seeing any timeout errors in the SQL Server error log, and our end-users have not reported any timeout specific errors.
Can anyone tell me what this number of “occurrences” means? Is this an issue we should be concerned about?
This counter has nothing to do with your connection issues.
SQL Server won't spend forever trying to compile the best possible plan (at least without using trace flags).
It calculates two values at the beginning of the optimisation process.
- Cost of a good enough plan
- Maximum time to spend on query optimisation (this is measured in number of transformation tasks carried out rather than clock time).
If a plan with a cost lower than the threshold is found, it needn't continue optimising. Likewise, if it exceeds the number of tasks budgeted, optimisation ends and it returns the best plan found so far.
The reason that optimisation finished early shows up in the execution plan in the StatementOptmEarlyAbortReason attribute. There are actually three possible values.
- Good enough plan found
- Timeout
- Memory limit exceeded
A timeout will increment the counter you ask about in sys.dm_exec_query_optimizer_info.
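The stopping rule described above can be pictured with a toy model. To be clear, this is not how SQL Server is implemented: the function optimize, the list of candidate costs and the task budget are all invented for illustration, and the memory-limit case is not modelled. Only the two reason strings are meant to mirror values of StatementOptmEarlyAbortReason.

```python
def optimize(candidate_costs, good_enough_cost, task_budget):
    """Toy plan search: return (best_cost, early_abort_reason)."""
    best = None
    for tasks_done, cost in enumerate(candidate_costs, start=1):
        if best is None or cost < best:
            best = cost  # keep the cheapest plan seen so far
        if best < good_enough_cost:
            return best, "GoodEnoughPlanFound"
        if tasks_done >= task_budget:
            # Budget is counted in tasks, not clock time.
            return best, "TimeOut"
    return best, None  # search space exhausted without an early abort


print(optimize([90, 70, 40, 10], good_enough_cost=50, task_budget=10))
print(optimize([90, 70, 60, 10], good_enough_cost=50, task_budget=3))
```

In the second call the search stops after three tasks with the best plan found so far, even though a cheaper plan existed further along. That is what a "timeout" means here: a plan is still produced, it just may not be the cheapest one available.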
Further Reading
Reason for Early Termination of Statement
Microsoft SQL Server 2014 Query Tuning & Optimization
The occurrence column will tell you the number of times that counter has been incremented and the value column is an internal column for this counter.
See here
Sorry, the documentation says this is internal only.
Based on the other link, I suspect this is for internal engine timeouts (e.g. SET QUERY_GOVERNOR_COST_LIMIT).
A client timeout will also not be logged in SQL because the client aborts the batch, thus stopping SQL processing.
Do you have more details?
