Here is a simplified version of the problem I'm trying to solve.
I have a temporary table #MyData with 2 columns: description and value.
I also have tables MyRuleSequenceCollection, MyRuleSequence, and MyRule, which I want to use to assess the data in the temporary table. MyRuleSequence is an ordered list of MyRule records, and MyRuleSequenceCollection is an unordered collection of MyRuleSequence records.
One of the sequences looks for records in the temporary table with descriptions "A" and "B", then attempts to divide A by B. The first rule tests for the presence of A; if it's not there, the process should stop. The second rule tests for the presence of B; if it's not there, the process should stop. The third rule tests that B is not 0; if it is 0, the process should stop. Finally, the fourth rule divides A by B and tests whether the result is more than 1.
Temporary table contains:
A 20
B 5
Result: all 4 rules assessed, final result true
Temporary table contains:
A 20
B 0
Result: only first 3 rules run, no divide by 0 error, final result false
Temporary table contains:
B 20
C 5
Result: only first rule runs, final result false
The only way I can see to design this is either with cursors, or worse, with dynamic SQL.
So I'm looking for design suggestions. Considering the above as just an example (many cases are far more complex), can this process be designed to avoid cursors or dynamic SQL? Could recursion be a solution?
Update: several days with no suggestions or input. Does anyone have an opinion about using a CTE for this? Or is that just a cursor with the DEALLOCATE handled for you?
Sometimes, there really is no valid choice other than using a database cursor.
So that's what I've done. It works, and performance and resource usage are reasonable. It's a firehose cursor: forward-only, local, static, no locking, and every other option I could think of to speed it up. If I have performance problems later, I can subset a couple of the joined/aliased tables (3 tables are left-joined/aliased 10 times) into a table variable and use that in the cursor to speed it up further.
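For reference, here is a minimal sketch of the kind of cursor loop I ended up with. The column, ordering, and procedure names (RuleOrder, dbo.EvaluateRule, and so on) are placeholders rather than my actual schema:

DECLARE @SequenceId int = 1;            -- placeholder: the sequence being evaluated
DECLARE @RuleId int, @Passed bit = 1;

DECLARE rule_cursor CURSOR LOCAL FORWARD_ONLY STATIC READ_ONLY FOR
    SELECT r.RuleId
    FROM MyRuleSequence AS s
    JOIN MyRule AS r ON r.SequenceId = s.SequenceId
    WHERE s.SequenceId = @SequenceId
    ORDER BY r.RuleOrder;               -- rules must run in sequence order

OPEN rule_cursor;
FETCH NEXT FROM rule_cursor INTO @RuleId;

WHILE @@FETCH_STATUS = 0 AND @Passed = 1    -- stop as soon as a rule fails
BEGIN
    -- hypothetical procedure: evaluates the current rule against #MyData
    -- and sets @Passed = 0 on failure, which ends the sequence early
    EXEC dbo.EvaluateRule @RuleId, @Passed OUTPUT;
    FETCH NEXT FROM rule_cursor INTO @RuleId;
END;

CLOSE rule_cursor;
DEALLOCATE rule_cursor;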
I'm left wondering why the automatic aversion to cursors is so strong, even in a case where there isn't a pragmatic alternative.
Summary
I am looking to compare two data sets within Excel and produce an output depending on which has changed, and what it has changed to.
More info
I hold two databases, which are updated independently. I cross compare these databases monthly, to see which database(s) have changed, and who holds the most accurate data. The other database is then amended to reflect the correct value. I am trying to automate the process of deciding which database needs to be updated. I'm comparing not just data change, but data change over time.
Example
On month 1, database 1 contains the value "Foo". Database 2 also contains the value "Foo". On month 2, database 1 now contains the value "Bar", but database 2 still contains the value "Foo". I can ascertain that because database 1 holds a different value, but last month they held the same value, database 1 has been updated, and database 2 should be updated to reflect this.
Table Example
Data1 Month1 | Data2 Month1 | Data1 Month2 | Data2 Month2 | Database to update | Reason
Foo | Foo | Foo | Foo | None | All match
Apple | Apple | Orange | Apple | Data2 | Data1 has new data when they did match previously. Data2 needs to be updated with the new info.
Cat | Dog | Dog | Dog | None | They mismatched previously, but both databases now match.
1 | 1 | 1 | 2 | Data1 | Data2 has new data when they did match previously. Data1 needs to be updated with the new info.
AAA | BBB | AAA | BBB | CHECK | Both databases should match, but you cannot ascertain which should be updated.
ABC | ABC | DEF | GHI | CHECK | Both databases changed, but you cannot tell if Data1 or Data2 is correct as they were updated at the same time.
Current logic
Currently, I'm trying to get this to work using multiple nested =IF statements, combined with some =AND and =NOT statements. Essentially, an example part of the statement would be (database 1, month 1 = DB1M1, etc.): =IF(AND(DB1M1=DB2M1,DB2M1=DB2M2),"None",IF(AND(DB1M1=DB2M1,DB1M1=DB2M2,NOT(DB2M1=DB1M2)),"Data2",IF(ETC,ETC,ETC).
I've had some success with this, but due to the length of the statement it is very messy, and I'm struggling to make it work because it becomes unreadable when I try to work out all the possible outcomes using only =IF clauses. I also have no doubt it's incredibly inefficient, and I'd like to make it more efficient, especially considering the size of the database is around 10,000 lines.
Final Notes / Info
I'd appreciate any help with getting this to work. I'm keen to learn, so any tips and advice are always welcomed.
I'm using MSO 365, version 2202 (I cannot update beyond this). This will be run in the Desktop version of Excel. I would prefer this is done exclusively using formulas, but I am open to using Visual Basic if it would be otherwise impossible or incredibly inefficient. Thanks!
Previous similar scenarios remind me of using bitwise operations or binary numbers. The main idea behind a binary number is that each digit can act as a flag indicating whether a certain property is present or not.
The goal is to identify whether two databases (DB1, DB2) are in sync based on a given value over two periods (M1, M2). If one database is out of sync, we would like to know which action to carry out to bring it back in sync with the other database. Similarly, we would like to know when both databases are out of sync at the end of the period.
Here is the Excel solution in cell M2; then extend the formula down:
=LET(dec,
BIN2DEC(IF(B2=C2,0,1)&IF(D2=E2,0,1)&IF(B2=D2,0,1)&IF(C2=E2,0,1)),
DBsOnSync, ISNUMBER(FIND(dec, "0;10;3;9;11")),
DBsOutOfSync, ISNUMBER(FIND(dec, "7;12;13;14;15")),
IFERROR(IFS(dec=5,"Update DB1", dec=6,"Update DB2", DBsOnSync=TRUE,
"DBs on Sync", DBsOutOfSync=TRUE, "DBs out of Sync"), "Case not defined")
)
The input table tries to consider all possible combinations, so we can build the logic. The highlighted columns are not really necessary; they are just for illustration and testing purposes. The combinations marked in red were already covered by an earlier case, so they do not really need to be taken into account.
Explanation
We build a binary number based on the following conditions, one for each binary digit. This is just an intermediate result; we convert it to a decimal number via BIN2DEC and determine the case for each possible value.
BIN2DEC(IF(B2=C2,0,1)&IF(D2=E2,0,1)&IF(B2=D2,0,1)&IF(C2=E2,0,1))
We have four conditions, so we build a binary number of length 4, where each digit represents a flag for one condition (0 = equal, 1 = not equal).
We build the binary number that will be the input for BIN2DEC via concatenation of the logical conditions we are looking for. Each IF condition represents a binary digit from left to right:
IF(B2=C2,0,1) checks whether DB1 and DB2 are consistent in M1 (intermediate calculation shown in column M1).
IF(D2=E2,0,1) checks whether DB1 and DB2 are consistent in M2 (intermediate calculation shown in column M2).
IF(B2=D2,0,1) checks whether DB1 keeps consistency over time (intermediate calculation shown in column DB1).
IF(C2=E2,0,1) checks whether DB2 keeps consistency over time (intermediate calculation shown in column DB2).
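As a worked example, take the Apple / Apple / Orange / Apple row from the question: B2=Apple, C2=Apple, D2=Orange, E2=Apple. The four conditions produce the binary number 0110 (M1 matches: 0, M2 does not: 1, DB1 changed: 1, DB2 did not: 0). BIN2DEC then returns 6, which maps to "Update DB2", matching the expected outcome that Data2 needs the update.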
Converting the binary number to decimal, we can identify each case by assigning it a decimal number or set of decimal numbers. The following values represent each case:
dec | Scenario
0, 10, 3, 9, 11 | DBs on Sync
5 | Update DB1
6 | Update DB2
7, 12, 13, 14, 15 | DBs out of Sync
We use IFS and FIND to identify each case based on the dec value. FIND looks for dec in the string that represents the set of possible numbers for each case, and ISNUMBER checks whether the number was found. As a last resort, for testing purposes, IFERROR returns "Case not defined" if a value does not match any defined case.
Notes
Columns F:I give a hint about the maximum number of possible combinations. We have four columns with only two possible values each, Sync and NotSync, which gives 2*2*2*2 = 16 combinations, the maximum number of binary numbers of length 4 we can have (we have four conditions).
As you can see from the screenshot, we have fewer unique combinations (12). The reason is that the conditions used to build the binary numbers are not independent, so some combinations are impossible.
I've finished my first semester in a college-level SQL course where we used "SQL queries for Mere Mortals" 3rd edition.
Long term I want to work in data governance or as a data scientist, so digging deeper is needed and I found the Stanford SQL course. Today taking the first mini quiz, I got the answers right but on these two I'm not understanding WHY I got the answers right.
My 'SQL for Mere Mortals' book doesn't even cover hash or tree-based indexes so I've been searching online for them.
I mostly guessed based on what she said but it feels more like luck than "I solidly understand why". So I've ordered "Introduction to Algorithms" 3rd edition by Thomas Cormen and it arrived last week but it will take me a while to read through all 1,229 pages.
Found that book in this other Stack Overflow link => https://stackoverflow.com/questions/66515417/why-is-hash-function-fast
Stanford Course => https://www.edx.org/course/databases-5-sql
I thought a hash index on College.enrollment would not speed things up because the condition limits enrollment to less than a number rather than matching an actual number? I'm guessing, per this link, Better to use "less than equal" or "in" in sql query, that the query would be faster if we used "<=" rather than "<"?
This one was just a process of elimination as it mentions the first item after the WHERE clause, but then was confusing as it mentions the last part of Apply.cName = College.cName.
My questions:
I'm guessing that, similar to how algebra has numerators, denominators, quotients, and many other terms that specifically describe parts of an equation, there are technical terms for this too. How would you use technical terms to describe why these answers are correct?
On the second question, why are the first part of the second line and the last part of the same line referenced as the answers? Why didn't they pick the first part of each, or the last part of each?
For context, most of my SQL queries are written for PostgreSQL now, within PyCharm on Python, but I do a lot of practice using the pgAdmin 4 or MySQL Workbench desktop platforms.
I welcome any recommendations you have on paper books or PDFs that have step-by-step tutorials, as many, many websites have holes or reference technical details that are confusing.
Thanks
1. A hash index is only useful for equality matches, whereas a tree index can be used for inequality (< or >= etc).
With this in mind, College.enrollment < 5000 cannot use a hash index, as it is an inequality. All other options are exact equality matches.
This is why most RDBMSs only let you create tree-based indexes.
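As a rough illustration only (PostgreSQL syntax, which does expose hash indexes; the quiz question itself is dialect-agnostic), the planner can use a hash index for the equality predicate but not for the range predicate:

-- assumed table/column from the quiz: College(enrollment)
CREATE INDEX college_enrollment_hash  ON College USING hash  (enrollment);
CREATE INDEX college_enrollment_btree ON College USING btree (enrollment);

-- can use the hash index (exact equality match)
SELECT * FROM College WHERE enrollment = 5000;

-- cannot use the hash index, but can use the B-tree (range scan)
SELECT * FROM College WHERE enrollment < 5000;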
2. This one is pretty much up in the air.
"the first item after the WHERE clause" is not relevant. Most RDBMSs will reorder the joins and filters as they see fit in order to match indexes and table statistics.
I note that the query as given is poorly written. It should use proper JOIN syntax, which is much clearer, and has been in use for 30 years already.
SELECT * -- you should really specify exact columns
FROM Student AS s -- use aliases
JOIN [Apply] AS a ON a.sID = s.sID -- Apply is a reserved keyword in many RDBMS
JOIN College AS c ON c.cName = a.cName
WHERE s.GPA > 1.5 AND c.cName < 'Cornell';
Now it's hard to say what a compiler would do here. A lot depends on the cardinalities (size of tables) in absolute terms and relative to each other, as well as the data skew in s.GPA and c.cName.
It also depends on whether secondary key (or indeed INCLUDE) columns are added; this is clearly not being considered here.
Given the options for indexes you have above, and no other indexes (not realistic obviously), we could guesstimate:
Student.sID, College.cName
This may result in an efficient backwards scan on College starting from 'Cornell', but Apply would need to be joined with a hash or a naive nested loop (scanning the index each time).
The index on Student would mean an efficient nested loop with an index seek.
Student.sID, Student.GPA
Is this one index or two? If it's two separate indexes, the second will be used, and the first is obviously going to be useless. Apply and College will still need heavy joins.
Apply.cName, College.cName
This would probably get you a merge-join on those two columns, but Student would need a big join.
Apply.sID, Student.GPA
Student could be efficiently scanned from 1.5, and Apply could be seeked, but College requires a big join.
Of these options, the first or the last is probably better, but it's very hard to say without further info.
In a real system, I would have indexes on all tables, and use INCLUDE columns wisely in order to avoid key-lookups. You would want to try to get a better feel for which tables are the ones that need to be filtered early etc.
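For example, a rough sketch of what that might look like in SQL Server syntax; the exact key and INCLUDE column choices are guesses, since they depend on the real workload:

-- covering indexes for the filters and joins above; column choices are assumptions
CREATE INDEX IX_Student_GPA     ON Student (GPA) INCLUDE (sID);
CREATE INDEX IX_Apply_sID_cName ON [Apply] (sID, cName);
CREATE INDEX IX_College_cName   ON College (cName) INCLUDE (enrollment);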
First question
A hash-index is not linearly-searchable (see Slide 7), that is, you cannot perform range-comparisons with a hash-index. This is because (in general terms) hash functions are one-way: given the output of a hash function you cannot determine the input, and the output will be in apparently random order (having a random order is good for ensuring an even load over the set of hashtable bins).
Now, for a contrived and oversimplified example:
Supposing you have these rows:
PK | Enrollment
----------------
1 | 1
2 | 10
3 | 100
4 | 1000
5 | 10000
A perfect hash index of this table would look something like this:
Assuming that the hash of 1 is 0xF822AA896F34253E and the hash of 10 is 0xB383A8BBDAA41F98, and so on...
EnrollmentHash | PhysicalRowPointer
---------------------------------------
0xF822AA896F34253E | 1
0xB383A8BBDAA41F98 | 2
0xA60DCD4E78869C9C | 3
0x49B0AF769E6B1EB3 | 4
0x724FD1728666B90B | 5
So given this hashtable index, looking at the hashes you cannot determine which hash represents larger enrollment values vs. smaller values. But a hashtable index does give you O(1) lookup for single specific values, which is why it works best for discrete, non-continuous, data values, especially columns used in JOIN criteria.
A tree-based index, on the other hand, does preserve relative ordering information about values, but with O( log n ) lookup time.
Second question
First, I need to rewrite the query to use modern JOIN syntax. The old style (using commas) has been obsolete since SQL-92 in 1992, that's almost 30 years ago.
SELECT
*
FROM
Apply
INNER JOIN Student ON Student.sID = Apply.sID
INNER JOIN College ON College.cName = Apply.cName
WHERE
Student.GPA > 1.5
AND
College.cName < 'Cornell'
Now, generally speaking the best way to answer this kind of question would be to know what the STATISTICS (cardinality, value distribution, etc) of the tables are. But without that I can still make some guesses.
I assume that College is the smallest table (~500 rows?), Student will have maybe 1-2m rows, and assuming every Student makes 4-5 applications then the Apply table will have ~5m rows.
...armed with that inference, we can deduce:
Student.sID = Apply.sID is an ID match - so a hash-index would be better in most cases (excepting if the PK clustering matters, but I won't digress).
Student.GPA > 1.5 - this is a range search so having a tree-based index here helps.
College.cName < 'Cornell' - again, this is a range comparison so a tree-based index here helps too.
So the best indexes would be Student.GPA and College.cName, but that isn't an option - so let's see what the benefits of each option are...
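(If it were an option, that would simply be something like the following; shown for illustration only, it is not one of the quiz choices:)

CREATE INDEX IX_Student_GPA   ON Student (GPA);
CREATE INDEX IX_College_cName ON College (cName);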
(As I was writing this, I saw that #charlieface posted their answer which already covers this, so I'll just link to theirs to save my time: https://stackoverflow.com/a/67829326/159145 )
I have a relatively simple query that attempts to calculate the count of rows I'll have to deal with in a later operation. It looks like:
SELECT COUNT(*)
FROM my_table AS t1
WHERE t1.array_of_ids && ARRAY[cast('1' as bigint)];
The tricky piece is that the ARRAY[] portion is determined by the code that invokes the query, so instead of having 1 element as in this example it could have hundreds or thousands. This makes the query take a decent amount of time to run if a user is actively waiting for the calculation to complete.
Is there anything obvious I'm doing wrong or any obvious improvement that could be made?
Thanks!
Edit:
There are not any indexes on the table. I tried to create one with
CREATE INDEX my_index on my_table(array_of_ids);
and it came back with
ERROR: index row requires 8416 bytes, maximum size is 8191
I'm not very experienced here unfortunately. Maybe there are simply too many rows for an index to be useful?
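One thing I'm wondering, though I haven't verified it yet: the default CREATE INDEX builds a B-tree over the whole array value, which is presumably what hits the 8191-byte limit above. A GIN index on the array column is documented to support the && operator, so something like this might be worth trying:

-- hypothetical: index the array elements with GIN instead of a B-tree over the whole array
CREATE INDEX my_index_gin ON my_table USING gin (array_of_ids);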
I ran an explain on the query and the output essentially looks like:
QUERY PLAN | Filter: ((array_of_ids && '{1, 2, 3, 4, 5, 6, 7 ... n}'::bigint[])
so I guess it is automatically doing the ::bigint[]. I tried this as well and the query takes the same time to execute which I guess makes sense.
I realize I'm only pasting a portion of the response to explain (analyze, buffers, format text), but I'm doing this in psql and my system often runs out of memory. There are tons of --- in the output; I am not sure if there is a way to stop psql from doing that.
The plan looks pretty simple, so is this basically saying there is no way to optimize this? I have two huge arrays and it just takes time to determine an overlap? I'm not sure if there is a JOIN solution here; I tried to unnest and do a JOIN on equivalent entries, but the query never returned, so I'm not sure if I got it wrong or if it is just a far slower approach.
Three small number columns [Number(1)] >>
OptionA | 0/1
OptionB | 0/1
OptionC | 0/1
or one larger string column [Varchar2(29)] >>
Options | OptionA=0/1|OptionB=0/1|OptionC=0/1
I'm not sure about the way the database handles tables, but I think that maintaining three Number(1) columns is better than one Varchar2(29) column!
-EDIT-
Let me explain the situation a bit more:
I am working on a common framework where all incoming/outgoing requests and responses are tracked. These interactions can be channeled to a DB, a file, or JMS. All of the configuration is loaded from a table which has a column that corresponds to the output type. Currently I'm using "DB=1|FILE=1|JMS=0" as the value of that column, so that later, if anyone wants to add this for their module, they can easily understand what is going on. In my code I've written simple logic which splits the string by "|" and then uses the exclusive-or operator to switch between choices in a switch case.
Everything is already done, but I don't like the idea that one large column is better than three small ones; plus, switching would remove the string splitting I'm currently doing.
-EDIT-
I finally got it clarified: there may be a situation where we have to add more options. In that case, if we add the data column-wise, it will mean modifying the table, changing the entity, and adding more ifs. Instead, I ended up making an enum out of it, with simple bit-wise logic to switch between options; this way I only need to modify the enum and add a new handler for the new option, and we are good to go.
Using a single column to store multiple pieces of data is probably the worst thing you can do in a database.
Violating first normal form has at least the following disadvantages:
More difficult to query. OptionA = 1 and OptionB = 1 and OptionC = 0 versus substr(options, 9, 1) = '1' and substr(options, 19, 1) = '1' and substr(options, 29, 1) = '0'.
Less flexible. What happens when you need to add another option? Adding a new column is easy. Adding a new format could mess up old queries. For example, if someone tries to read OptionC with substr(options, -1, 1). (Although this is a good reason to use a 3rd option - a separate table.)
No type safety. This can be a very subtle and tricky problem. Let's say you write substr(options, 9, 1) = 1 instead of substr(options, 9, 1) = '1'. If anyone ever gets the format wrong, a single value could ruin lots of queries. Or worse, it only intermittently crashes a small number of queries, because the access paths keep changing. (Although you can prevent this with a check constraint.)
Slower queries. Normally the amount of work done in an expression or condition isn't a significant cost for a query. But adding a lot of unnecessary string manipulation can make a difference.
Less optimizing. Oracle can only build efficient query plans if it can understand your data. For example, let's say that OptionA is "0" 99.9% of the time. When you filter OptionA = 0, Oracle can use a histogram to make a very accurate prediction about the number of rows returned. But for substr(options, 9, 1) = '1' you'll only get a wild guess. If you have complicated queries using this column, you may spend a lot of time trying to "fix" the cardinality estimates. (Although maybe expression statistics could help with this?)
There are times when denormalizing is a good idea. For example, if you have terabytes of data, and compress the table, the single column may take up less space. (But if you're trying to save space, why not use a format like "000" instead?).
If there really is a good reason for this, it definitely needs to be documented. Perhaps add a comment on the column.
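For example (the table and column names here are made up, since they weren't given):

COMMENT ON COLUMN framework_config.output_options IS
  'Pipe-delimited output flags, e.g. DB=1|FILE=1|JMS=0 (1 = enabled, 0 = disabled)';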
For a start, if I am reading your question right, you want each of the options to have one of just two possible values, correct?
If so then you could:
have a separate integer (or boolean) column for each option
have an options column that is a string of 1's and 0's, one digit for each option, e.g. "001"
use an 'options' column that is an integer and use a bit value for each option, e.g. optionA == options & 1, optionB == options & 2 etc. (see the sketch after this list)
some databases have a bit vector data type which you could use. For mysql there is the BIT data type, which can store bit strings up to 64 bits long.
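A minimal Oracle-flavoured sketch of the bit-mask idea from the list above (the table name and bit assignments are assumptions, and Oracle uses BITAND rather than a & operator):

-- options is a NUMBER column used as a bit mask:
--   bit 1 = OptionA, bit 2 = OptionB, bit 4 = OptionC
SELECT *
FROM my_config_table                  -- hypothetical table name
WHERE BITAND(options, 1) = 1          -- OptionA enabled
  AND BITAND(options, 2) = 2          -- OptionB enabled
  AND BITAND(options, 4) = 0;         -- OptionC disabled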
There will be a trade-off between code complexity and efficiency for each of these. Ask yourself, how much of the machine's time or storage will be saved by employing each of these options? And how much of your time will be saved?
In this instance the 3-column approach is the one I would recommend. Not only does this keep things simple in terms of extracting data, but should you ever wish, you could set values against all 3 columns rather than being limited to one Varchar2 field. If you opt for the single Varchar2 column, it is fairly simple to extract the info you need using the SUBSTR function or another variation, and although this isn't heavy work for an Oracle DB, it does put extra work on the server that is not necessary.
I've got a list with 100 Longs in it, and an Entity kind with a Long field. I want to find all the entities whose field value is in the list.
I was thinking, "Great! I'll just write where :p1.contains(field)," but AppEngine will only split this out on less than 31 elements (new Baskin Robbins' slogan?). So, now I'm thinking I'll have to split the list of 100 into multiple lists of 30.
But at this point, my hopes of a one-liner shot, I realized I could do something like
for (Long number : list)
GQL("select * from Kind where field = " + number)
essentially splitting into all the subqueries myself. My question is... is this equivalent to letting appengine split my list of 30 into 30 separate queries? Or is there some back-end magic they do to fetch all 30 sub-queries simultaneously?
Sub-queries are functionally identical to regular queries. Currently, sub-queries are run serially, so doing it yourself isn't going to be any slower than letting the SDK do it for you. In future, though, it's likely that sub-queries will be executed in parallel, making them much more efficient than doing it yourself. It's also possible functionality like 'IN' could be pushed to the backend, avoiding the issue in the app server altogether.
You should be aware, though, that anything that requires you to execute 100 queries is going to be very, very slow. If you can, you should find a way to work around this without doing piles of queries and merging the results.
That function likely utilizes the IN (arg1,arg2,...) operator.
The problem being:
A single query containing != or IN operators is limited to 30 sub-queries.
Each item in the list counts as a sub query so you can only fetch 30 items at a time with that query style.
Using SELECT * FROM YourKind WHERE value IN list_property will cause multiple sub-queries to be run. It is these sub-queries that are limited to 30.
Using SELECT * FROM YourKind WHERE list_property = value will use the index that is built for your list_property so should work with as many entries as your list-property can contain†.
†I think this would be limited to the 5000 max items for an index