Optimizing array overlap query using && operator - arrays

I have a relatively simple query that attempts to calculate the count of rows I'll have to deal with in a later operation. It looks like:
SELECT COUNT(*)
FROM my_table AS t1
WHERE t1.array_of_ids && ARRAY[cast('1' as bigint)];
The tricky piece is that the ARRAY[] portion is determined by the code that invokes the query, so instead of having one element as in this example it could have hundreds or thousands. This makes the query take a decent amount of time to run when a user is actively waiting for the calculation to complete.
Is there anything obvious I'm doing wrong or any obvious improvement that could be made?
Thanks!
Edit:
There are not any indexes on the table. I tried to create one with
CREATE INDEX my_index on my_table(array_of_ids);
and it came back with
ERROR: index row requires 8416 bytes, maximum size is 8191
I'm not very experienced here unfortunately. Maybe there are simply too many rows for an index to be useful?
I ran an explain on the query and the output essentially looks like:
QUERY PLAN | Filter: ((array_of_ids && '{1, 2, 3, 4, 5, 6, 7 ... n}'::bigint[])
so I guess it is automatically doing the ::bigint[] cast. I tried adding the cast explicitly as well and the query takes the same time to execute, which I guess makes sense.
I realize I'm only pasting a portion of the output of the explain (analyze, buffers, format text), but I'm doing this in psql and my system often runs out of memory. There are tons of --- in the output, and I am not sure if there is a way to stop psql from doing that.
The plan looks pretty simple, so is this basically saying there is no way to optimize it? I have two huge arrays and it just takes time to determine the overlap? I'm not sure if there is a JOIN solution here; I tried to unnest and JOIN on equivalent entries, but the query never returned, so I don't know whether I got it wrong or whether it is just a far slower approach.
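One thing that may be worth trying (not something confirmed in this thread, just a common PostgreSQL pattern): the CREATE INDEX above builds a B-tree index, which cannot serve the && operator anyway and also hits the 8191-byte row limit, whereas a GIN index on the array column is designed for exactly this kind of overlap/containment search.
-- A GIN index supports the array operators &&, @>, <@ and =, and it indexes
-- individual array elements rather than whole rows, so it avoids the
-- 8191-byte B-tree limit seen above.
CREATE INDEX my_index ON my_table USING GIN (array_of_ids);

-- Re-check the plan afterwards; if the filter is selective enough, the
-- sequential scan should be replaced by a bitmap index scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM my_table AS t1
WHERE t1.array_of_ids && ARRAY[CAST('1' AS bigint)];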

Related

Excel calculate smallest of X columns within Y columns, ignoring zeros

I'm trying to calculate the sum of best segments in a run. For example, each Km gives a list as such:
5:40 6:00 5:45 5:55 6:21 6:30
I'm trying to gather the best segments of 2km/3km/4km etc and would like a simple code to do it. At the moment, I'm using the formula
=Min(If(B1=0,9:9:9,sum(A1:B1),If(C1=0,9:9:9,sum(B1:C1))
but this goes all the way to 50km, meaning a very long formula that I then have to repeat slightly differently at 3km, then 4km, then 5km, etc. Surely there must be a way of
generating an array of the sums of every n consecutive columns, then iterating over that to find the min while ignoring values of 0?
I can do it manually for now, but what if I want to go over 50km? I might want to incorporate bike rides/car drives in the future just for some data analysis, so I figured it best to find an ideal formula now.
It's frustrating, as I could easily code it; ideally I want to avoid VBA and stick to formulas in Excel.
Here is a draft of the case where there aren't any zeroes, just for groups of 2km. I decided the simplest approach initially was to add a couple of helper rows containing the running totals of times (and, for later use, counts) and use a formula like this to subtract them in pairs:
=MIN(INDEX(A2:J2,SEQUENCE(1,9,2))-IF(SEQUENCE(1,9,0)=0,0,INDEX(A2:J2,SEQUENCE(1,9,0))))
but if you have access to recent additions to Excel 365 like SCAN, you can do it without helper rows.
Here is a more realistic scenario with a couple of zeroes thrown in
=LET(runningSum,Y$4:AP$4,runningCount,Y$5:AP$5,cols,COLUMNS(runningSum),leg,X7,
seqEnd,SEQUENCE(1,cols-leg+1,leg),seqStart,SEQUENCE(1,cols-leg+1,0),
times,INDEX(runningSum,seqEnd)-IF(seqStart=0,0,INDEX(runningSum,seqStart)),
counts,INDEX(runningCount,seqEnd)-IF(seqStart=0,0,INDEX(runningCount,seqStart)),
MIN(IF(counts=leg,times)))
Note that there are no runs of more than seven consecutive legs that don't contain a zero, so 8, 9, 10 etc. just work out to 0.
As mentioned, you could dispense with the helper rows by using SCAN, but not everyone has access to it, so I will add that version separately:
=LET(data,Y$3:AP$3,runningSum,SCAN(0,data,LAMBDA(a,b,a+b)),
runningCount,SCAN(0,data,LAMBDA(a,b,a+(b>0))),leg,X7,cols,COLUMNS(data),
seqEnd,SEQUENCE(1,cols-leg+1,leg),seqStart,SEQUENCE(1,cols-leg+1,0),
times,INDEX(runningSum,seqEnd)-IF(seqStart=0,0,INDEX(runningSum,seqStart)),
counts,INDEX(runningCount,seqEnd)-IF(seqStart=0,0,INDEX(runningCount,seqStart)),
MIN(IF(counts=leg,times)))
Tom, that worked! I learnt a few things along the way too, and using the indexing method alongside SEQUENCE and COLUMNS is something I had not thought of. I'd never heard of the LET function before, and I can already see that it is going to really help with some of the bigger calculations in the future.
Thank you so much. I'd like to show you how it now looks: row 3087 is my old formula, and row 3088 is a copy of the same data using the new formula. As you can see, I've gotten exactly the same results, so it's clear that it works perfectly and can easily be duplicated.

SQL Server "shortcut" processing

Here is a simplified version of the problem I'm trying to solve.
I have a temporary table #MyData with 2 columns: description and value.
I also have tables MyRuleSequenceCollection, MyRuleSequence, and MyRule, which I want to use to assess the data in the temporary table. MyRuleSequence is an ordered list of MyRule records, and MyRuleSequenceCollection is an unordered collection of MyRuleSequence records.
One of the sequences looks for records in the temporary table with descriptions "A" and "B", then attempts to divide A by B. The first rule tests for the presence of A; if it's not there, the process should stop. The second rule tests for the presence of B; if it's not there, the process should stop. The third rule tests that B is not 0; if it is 0, the process should stop. Finally, the fourth rule divides A by B and tests whether the result is more than 1.
Temporary table contains:
A 20
B 5
Result: all 4 rules assessed, final result true
Temporary table contains:
A 20
B 0
Result: only first 3 rules run, no divide by 0 error, final result false
Temporary table contains:
B 20
C 5
Result: only first rule runs, final result false
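For concreteness, the first scenario's temporary table might be set up like this (a sketch only; the question doesn't specify the column types, so they are assumptions):
-- Hypothetical setup for the first scenario; column types are assumed.
CREATE TABLE #MyData (description varchar(50), value decimal(18, 4));

INSERT INTO #MyData (description, value)
VALUES ('A', 20), ('B', 5);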
The only way I can see to design this is either with cursors, or worse, with dynamic SQL.
So I'm looking for design suggestions. Considering the above as just an example (many cases are far more complex), can this process be designed to avoid cursors or dynamic SQL? Could recursion be a solution?
Update: several days with no suggestions or input. Does anyone have an opinion about using a CTE for this? Or is that just a cursor with the DEALLOCATE handled for you?
Sometimes, there really is no valid choice other than using a database cursor.
So that's what I've done. It works, and performance and resource usage are reasonable. It's a firehose cursor: forward-only, local, static, no locking, and everything else I could think of to speed it up. If I have performance problems later, I can subset a couple of the joined/aliased tables (3 tables are left-joined/aliased 10 times) into a table variable and use that in the cursor to speed things up further.
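For reference, a cursor declared with those options looks roughly like the sketch below. The rule tables' real columns aren't shown in the thread, so the column names and the EvaluateRule procedure are hypothetical placeholders.
-- Sketch of a LOCAL FORWARD_ONLY STATIC READ_ONLY cursor over the rules of
-- one sequence; table/column names and dbo.EvaluateRule are hypothetical.
DECLARE @RuleId int, @RulePassed bit;

DECLARE rule_cursor CURSOR LOCAL FORWARD_ONLY STATIC READ_ONLY FOR
    SELECT r.RuleId
    FROM MyRuleSequence AS s
    JOIN MyRule AS r ON r.SequenceId = s.SequenceId
    ORDER BY r.RuleOrder;

OPEN rule_cursor;
FETCH NEXT FROM rule_cursor INTO @RuleId;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Evaluate the current rule against #MyData; stopping on the first
    -- failed rule gives the short-circuit behaviour the question describes.
    EXEC dbo.EvaluateRule @RuleId = @RuleId, @Passed = @RulePassed OUTPUT;

    IF @RulePassed = 0 BREAK;

    FETCH NEXT FROM rule_cursor INTO @RuleId;
END;

CLOSE rule_cursor;
DEALLOCATE rule_cursor;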
I'm left wondering why the automatic aversion to cursors is so strong, even in a case where there isn't a pragmatic alternative.

SqlDataReader does not return all records (3rd attempt)

I've tried to find a solution to this problem twice before, but unfortunately those answers haven't provided a permanent fix, so here I am, giving it another try.
I have a SQL Server stored procedure that returns a list of 1.5 million integer IDs. I am calling this SP from ASP.NET/VB.NET code and executing a SqlDataReader:
m_dbSel.CommandType = CommandType.StoredProcedure
m_dbSel.CommandText = CstSearch.SQL.SP_RS_SEARCH_EX
oResult = m_dbSel.ExecuteReader(CommandBehavior.CloseConnection)
Then I am passing that reader to a class constructor to build a generic List(Of Integer). The code is very basic:
Public Sub New(i_oDataReader As Data.SqlClient.SqlDataReader)
    m_aFullIDList = New Generic.List(Of Integer)
    While i_oDataReader.Read
        m_aFullIDList.Add(i_oDataReader.GetInt32(0))
    End While
    m_iTotalNumberOfRecords = m_aFullIDList.Count
End Sub
The problem is, this doesn't read all 1.5 million records; the final count is inconsistent and could be 500K, 1 million, etc. (most often the "magic" number of 524289 records is returned). I've tried using the CommandBehavior.SequentialAccess setting when executing the command, but the results turned out to be inconsistent as well.
When I run the SP in SSMS, it returns a certain number of records almost right away and displays them, but then continues to run for a few seconds more until all 1.5 million records are done. Does that have anything to do with this?
UPDATE
After a while I found that, on very rare occasions, the loop code above does throw an exception:
System.NullReferenceException: Object reference not set to an instance
of an object. at
System.Data.SqlClient.SqlDataReader.ReadColumnHeader(Int32 i)
So some internal glitch does happen. Also it looks like if I replace
While i_oDataReader.Read
    m_aFullIDList.Add(i_oDataReader.GetInt32(0))
End While
that deals in Integers with
While i_oDataReader.Read
    m_aFullIDList.Add(i_oDataReader(0))
End While
that deals in Objects, the code seems to run without a glitch and returns all records.
Go figure.
Basically, as we've flogged out in the comments(*), the problem isn't with SqlDataReader, the stored procedure, or SQL at all. Rather, your List.Add is failing because it cannot allocate the additional memory for 2^(n+1) items needed to extend the List and copy your existing 2^n items into it. Most of the time your n=19 (so 524289 items), but sometimes it could be higher.
There are three basic things that you could do about this:
Pre-Allocate: As you've discovered, by pre-allocating you should be able to get anywhere from 1.5 to 3 times as many items. This works best if you know ahead of time how many items you'll have, so I'd recommend either executing a SELECT COUNT(*).. ahead of time, or adding a COUNT(*) OVER(PARTITION BY 1) column and picking it out of the first row returned to pre-allocate the List (see the sketch after this list). The problem with this approach is that you're still pretty close to your limit and could easily run out of memory in the near future...
Re-Configure: Right now you are only getting at most 2^22 bytes of memory for this, when in theory you should be able to get around 2^29 to 2^30. That means that something on your machine is preventing you from extending your writeable virtual memory limit that high. Likely causes include the size of your pagefile and competition from other processes (but there are other possibilities). Fix that and you should have more than enough headroom for this.
Streaming: Do you really need all 1.5 million items in memory at the same time? If not and you can determine which you don't need (or extract the info that you do need) on the fly, then you can solve this problem the same way that SqlDataReader does, with streaming. Just read a row, use it, then lose it and go on to the next row.
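As a small illustration of the pre-allocation idea from the first point (the query below is only a placeholder shape, not the poster's actual stored procedure), the total can be returned alongside every row:
-- Hypothetical result-set shape: TotalRows repeats the full count on every
-- row, so the caller can read it from the first row and size its List
-- before adding the 1.5 million IDs.
SELECT r.ID,
       COUNT(*) OVER (PARTITION BY 1) AS TotalRows
FROM dbo.SearchResults AS r;  -- placeholder table name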
Hopefully this helps.
(* -- Thanks, obviously, to #granadaCoder and #MartinSmith)
If you really think that the problem rests solely with the List data structure (and not that you are just running out of memory), then there are some other ways to work around the List structure's allocation behavior. One way would be to implement an alternative List class (as IList(of Integer)).
Through the interface it would appear the same as List but internally it would have a different allocation scheme, by storing the data in a nested List(of List(of Integer)). Every 1000 items, it would create a new List(of Integer), add it to the parent nested list and then use it to add in the next 1000 items.
The reason that I didn't suggest this before is because, like pre-allocation, this may allow you to get closer to your memory limit, but, if that's the problem, you are still going to run out eventually (just as with pre-allocating) because this limit is too close to the actual number of items that you need (1.5 million).
Basically, you read all the records from the SqlDataReader with a SELECT query. I suggest you add an ORDER BY to your query; it sorts all records in ascending order, and they are then also read in ascending order by the SqlDataReader.
I also faced this problem in my last project. I had to read more than 2 million records from the database with a unique id, serialNo, but the records did not come back in sequence; after 1,000 records it jumped to the 2,100,263rd record, and all records came back in the wrong order.
Then I used ORDER BY serialNo in the query and my problem was solved. You don't need to do anything extra; just put ORDER BY in your SELECT query and it will work for you.
I hope this helps.

Is it better to maintain 3 small columns or 1 large column in a Table?

Three small number columns [Number(1)] >>
OptionA | 0/1
OptionB | 0/1
OptionC | 0/1
or one larger string column [Varchar2(29)] >>
Options | OptionA=0/1|OptionB=0/1|OptionC=0/1
I'm not sure how the database handles tables, but I think that maintaining three columns as Number(1) is better than one column as Varchar2(29)!
-EDIT-
Let me explain the situation a bit more:
I am working on a common framework where all incoming/outgoing requests/responses are tracked; these interactions can be channeled to a DB/file/JMS. All the configuration is loaded from a table which has a column that corresponds to the output type. Currently I'm using "DB=1|FILE=1|JMS=0" as the value of that column so that later, if anyone wants to add this for their module, they can easily understand what is going on. In my code I've written simple logic which splits the string by "|", and then I use the exclusive-or operator to switch between choices using a switch case.
Everything is already done, but I am not convinced that one large column is better than three small ones; besides, switching would remove the string splitting I'm doing.
-EDIT-
I finally got it clarified: there may be a situation where we have to add more options. In that case, if we add the data column-wise, it will mean modifying the table, changing the entity, and adding more ifs. Instead, I ended up making an enum out of it with simple bit-wise logic to switch between options; this way I only need to modify the enum and add a handler for the new option, and then we are good to go.
Using a single column to store multiple pieces of data is probably the worst thing you can do in a database.
Violating first normal form has at least the following disadvantages:
More difficult to query. OptionA = 1 and OptionB = 1 and OptionC = 0 versus substr(options, 9, 1) = '1' and substr(options, 19, 1) = '1' and substr(options, 29, 1) = '0'.
Less flexible. What happens when you need to add another option? Adding a new column is easy. Adding a new format could mess up old queries. For example, if someone tries to read OptionC with substr(options, -1, 1). (Although this is a good reason to use a third option: a separate table.)
No type safety. This can be a very subtle and tricky problem. Let's say you write substr(options, 9, 1) = 1 instead of substr(options, 9, 1) = '1'. If anyone ever gets the format wrong, a single value could ruin lots of queries. Or worse, it only intermittently crashes a small number of queries, because the access paths keep changing. (Although you can prevent this with a check constraint.)
Slower queries. Normally the amount of work done in an expression or condition isn't a significant cost for a query. But adding a lot of unnecessary string manipulation can make a difference.
Less optimization. Oracle can only build efficient query plans if it can understand your data. For example, let's say that OptionA is "0" 99.9% of the time. When you filter on OptionA = 0, Oracle can use a histogram to make a very accurate prediction about the number of rows returned. But for substr(options, 9, 1) = '1' you'll only get a wild guess. If you have complicated queries using this column, you may spend a lot of time trying to "fix" the cardinality estimates. (Although maybe expression statistics could help with this?)
There are times when denormalizing is a good idea. For example, if you have terabytes of data, and compress the table, the single column may take up less space. (But if you're trying to save space, why not use a format like "000" instead?).
If there really is a good reason for this, it definitely needs to be documented. Perhaps add a comment on the column.
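To make the recommended layout concrete, here is a minimal sketch of the three-column design with check constraints (Oracle syntax; the table name and id column are placeholders, since the question doesn't give them):
-- Sketch of the normalized three-column layout; the CHECK constraints give
-- the type safety discussed above. Table name and id column are placeholders.
CREATE TABLE options_table (
    id      NUMBER PRIMARY KEY,
    OptionA NUMBER(1) NOT NULL CHECK (OptionA IN (0, 1)),
    OptionB NUMBER(1) NOT NULL CHECK (OptionB IN (0, 1)),
    OptionC NUMBER(1) NOT NULL CHECK (OptionC IN (0, 1))
);

-- Queries stay simple, readable and optimizer-friendly:
SELECT COUNT(*)
FROM options_table
WHERE OptionA = 1 AND OptionB = 1 AND OptionC = 0;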
For a start, if I am reading your question right, you want each of the options to have one of just two possible values, correct?
If so then you could:
have a separate integer (or boolean) column for each option
have an options column that is a string of 1's and 0's, one digit for each option, e.g. "001"
use an 'options' column that is an integer and use a bit value for each option, e.g. optionA == options & 1, optionB == options & 2, etc. (see the sketch below)
some databases have a bit vector data type which you could use. For MySQL there is the BIT data type, which can store bit strings up to 64 bits long.
There will be a trade-off between code complexity and efficiency for each of these. Ask yourself, how much of the machine's time or storage will be saved by employing each of these options? And how much of your time will be saved?
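A short sketch of the integer bit-flag option from the list above (Oracle syntax to match the question; BITAND stands in for the & operator, and the table/column names are placeholders):
-- One integer column holds all three flags: optionA = bit 1, optionB = bit 2,
-- optionC = bit 4. BITAND extracts an individual flag.
SELECT COUNT(*)
FROM options_table
WHERE BITAND(options, 1) = 1   -- optionA set
  AND BITAND(options, 2) = 2   -- optionB set
  AND BITAND(options, 4) = 0;  -- optionC not set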
In this instance the three-column approach is the one I would recommend. Not only does it keep things simple in terms of extracting data, but should you ever wish, you can also set values against all three columns rather than being limited to one Varchar2 field. If you opt for the single Varchar2 column, it is fairly simple to extract the info you need using substr or another variation, and although this isn't heavy work for an Oracle DB, it does put extra work on the server that is not necessary.

GQL: is myList.contains(myField) any faster than 30 separate myField == myList(i) queries?

I've got a list with 100 Longs in it, and an Entity kind with a Long field. I want to find all the entities whose field value is in the list.
I was thinking, "Great! I'll just write where :p1.contains(field)," but AppEngine will only split this out on less than 31 elements (new Baskin Robbins' slogan?). So, now I'm thinking I'll have to split the list of 100 into multiple lists of 30.
But at this point, with my hopes of a one-liner shot, I realized I could do something like
for (Long number : list)
    GQL("select * from Kind where field = " + number)
essentially splitting it into all the sub-queries myself. My question is: is this equivalent to letting AppEngine split my list of 30 into 30 separate queries? Or is there some back-end magic it does to fetch all 30 sub-queries simultaneously?
Sub-queries are functionally identical to regular queries. Currently, sub-queries are run serially, so doing it yourself isn't going to be any slower than letting the SDK do it for you. In future, though, it's likely that sub-queries will be executed in parallel, making them much more efficient than doing it yourself. It's also possible functionality like 'IN' could be pushed to the backend, avoiding the issue in the app server altogether.
You should be aware, though, that anything that requires you to execute 100 queries is going to be very, very slow. If you can, you should find a way to work around this without doing piles of queries and merging the results.
That function likely utilizes the IN (arg1,arg2,...) operator.
The problem being:
A single query containing != or IN operators is limited to 30 sub-queries.
Each item in the list counts as a sub-query, so you can only fetch 30 items at a time with that query style.
Using SELECT * FROM YourKind WHERE value IN list_property will cause multiple sub-queries to be run. It is these sub-queries that are limited to 30.
Using SELECT * FROM YourKind WHERE list_property = value will use the index that is built for your list_property so should work with as many entries as your list-property can contain†.
† I think this would be limited to the 5000 max items for an index
