To find the highest-level variable amongst 2 ID attributes in a SAS database

My input dataset looks as follows:
DRUG Table
DRUG_CD PARENT_DRUG_CD
DRG_1 DRG_5
DRG_2 DRG_5
DRG_3 DRG_4
DRG_4 --
DRG_5 DRG_6
DRG_6 --
I need to build a dataset with only those drugs (and all the attributes of those drugs) that no longer have a parent, plus an extra column (CHD_DRUG_CD) which contains the lowest-level DRUG_CD.
Resulting DRUG Table
DRUG_CD CHD_DRUG_CD
DRG_6 DRG_1
DRG_6 DRG_2
DRG_6 DRG_5
DRG_4 DRG_3

Parent-child relationships are relatively common in the literature, and you should be able to find plenty of examples out there. Here's a place to start:
http://support.sas.com/kb/25/968.html
The basic concept is that you need to iterate over the dataset to find the child for each parent. You can do that either by creating your own iterator, using POINT= (as in the paper linked) or a DOW loop, or you can do this using other techniques such as a hash table. Hash iterator is probably the fastest solution, although the most complex to understand and to code. Then when you find your child, you identify which one you found (in an array, for example, as in the paper) and return the loop to the beginning to find the next one (or, exit with none found).
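To make the lookup idea concrete, here is a minimal sketch in Python rather than SAS (a SAS hash object or POINT= lookup would play the role of the dictionary): walk each drug up its parent chain to its top-level ancestor, then output one (DRUG_CD, CHD_DRUG_CD) row per drug that has a parent, using the sample data from the question.

parent = {
    "DRG_1": "DRG_5",
    "DRG_2": "DRG_5",
    "DRG_3": "DRG_4",
    "DRG_4": None,
    "DRG_5": "DRG_6",
    "DRG_6": None,
}

def root_of(drug):
    # Follow PARENT_DRUG_CD until we hit a drug with no parent.
    while parent[drug] is not None:
        drug = parent[drug]
    return drug

rows = [(root_of(d), d) for d in parent if parent[d] is not None]
for drug_cd, chd_drug_cd in sorted(rows):
    print(drug_cd, chd_drug_cd)
# DRG_4 DRG_3
# DRG_6 DRG_1
# DRG_6 DRG_2
# DRG_6 DRG_5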

Why is hash index slower using "Less than" in SQL

I've finished my first semester in a college-level SQL course where we used "SQL queries for Mere Mortals" 3rd edition.
Long term I want to work in data governance or as a data scientist, so I need to dig deeper, and I found the Stanford SQL course. Taking the first mini quiz today, I got the answers right, but on these two questions I don't understand WHY the answers are right.
My 'SQL for Mere Mortals' book doesn't even cover hash or tree-based indexes so I've been searching online for them.
I mostly guessed based on what she said, but it feels more like luck than "I solidly understand why". So I've ordered "Introduction to Algorithms", 3rd edition, by Thomas Cormen; it arrived last week, but it will take me a while to read through all 1,229 pages.
Found that book in this other Stack Overflow link => https://stackoverflow.com/questions/66515417/why-is-hash-function-fast
Stanford Course => https://www.edx.org/course/databases-5-sql
I thought a hash index on College.enrollment would not speed things up because the condition limits it to "less than a number" rather than an exact number?? I'm also guessing, per this link (Better to use "less than equal" or "in" in sql query), that the query would be faster if we used "<=" rather than "<"?
This one was just a process of elimination, as the answer mentions the first item after the WHERE clause, but it was confusing because it also mentions the last part, Apply.cName = College.cName.
My questions:
I'm guessing that, similar to how algebra has numerators, denominators, quotients, and many other terms that precisely describe parts of an equation, there are technical terms for this too. How would you use technical terms to describe why these answers are correct?
On the second question, why are the first part of the second line and the last part of the same line referenced as the answers? Why didn't they pick the first part of each, or the last part of each?
For context, most of my SQL queries are now written for PostgreSQL within PyCharm in Python, but I do a lot of practice using the pgAdmin 4 or MySQL Workbench desktop tools.
I welcome any recommendations you have for paper books or PDFs with step-by-step tutorials, as many, many websites have holes or reference technical details that are confusing.
Thanks
1. A hash index is only useful for equality matches, whereas a tree index can be used for inequality (< or >= etc).
With this in mind, College.enrollment < 5000 cannot use a hash index, as it is an inequality. All other options are exact equality matches.
This is why most RDBMSs only let you create tree-based indexes.
2. This one is pretty much up in the air.
"the first item after the WHERE clause" is not relevant. Most RDBMSs will reorder the joins and filters as they see fit in order to match indexes and table statistics.
I note that the query as given is poorly written. It should use proper JOIN syntax, which is much clearer, and has been in use for 30 years already.
SELECT * -- you should really specify exact columns
FROM Student AS s -- use aliases
JOIN [Apply] AS a ON a.sID = s.sID -- Apply is a reserved keyword in many RDBMS
JOIN College AS c ON c.cName = a.cName
WHERE s.GPA > 1.5 AND c.cName < 'Cornell';
Now it's hard to say what a compiler would do here. A lot depends on the cardinalities (size of tables) in absolute terms and relative to each other, as well as the data skew in s.GPA and c.cName.
It also depends on whether secondary key (or indeed INCLUDE) columns are added; this is clearly not being considered here.
Given the options for indexes you have above, and no other indexes (not realistic obviously), we could guesstimate:
Student.sID, College.cName
This may result in an efficient backwards scan on College starting from 'Cornell', but Apply would need to be joined with a hash or a naive nested loop (scanning the index each time).
The index on Student would mean an efficient nested loop with an index seek.
Student.sID, Student.GPA
Is this one index or two? If it's two separate indexes, the second will be used, and the first is obviously going to be useless. Apply and College will still need heavy joins.
Apply.cName, College.cName
This would probably get you a merge-join on those two columns, but Student would need a big join.
Apply.sID, Student.GPA
Student could be efficiently scanned from 1.5, and Apply could be seeked, but College requires a big join.
Of these options, the first or the last is probably better, but it's very hard to say without further info.
In a real system, I would have indexes on all tables, and use INCLUDE columns wisely in order to avoid key-lookups. You would want to try to get a better feel for which tables are the ones that need to be filtered early etc.
First question
A hash-index is not linearly-searchable (see Slide 7), that is, you cannot perform range-comparisons with a hash-index. This is because (in general terms) hash functions are one-way: given the output of a hash function you cannot determine the input, and the output will be in apparently random order (having a random order is good for ensuring an even load over the set of hashtable bins).
Now, for a contrived and oversimplified example:
Supposing you have these rows:
PK | Enrollment
----------------
1 | 1
2 | 10
3 | 100
4 | 1000
5 | 10000
A perfect hash index of this table would look something like this:
Assuming that the hash of 1 is 0xF822AA896F34253E and the hash of 10 is 0xB383A8BBDAA41F98, and so on...
EnrollmentHash | PhysicalRowPointer
---------------------------------------
0xF822AA896F34253E | 1
0xB383A8BBDAA41F98 | 2
0xA60DCD4E78869C9C | 3
0x49B0AF769E6B1EB3 | 4
0x724FD1728666B90B | 5
So given this hashtable index, looking at the hashes you cannot determine which hash represents larger enrollment values vs. smaller values. But a hashtable index does give you O(1) lookup for single specific values, which is why it works best for discrete, non-continuous, data values, especially columns used in JOIN criteria.
Whereas a tree-based index does preserve relative ordering information about values, but with O(log n) lookup time.
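To make the contrast concrete, here is a small hedged sketch in Python using the five rows above: a dict stands in for a hash index (exact-match probes only), while a sorted list plus bisect stands in for a tree-based index, where a range predicate such as enrollment < 5000 becomes a slice.

import bisect

rows = {1: 1, 2: 10, 3: 100, 4: 1000, 5: 10000}   # PK -> Enrollment

# "Hash index": enrollment value -> row PK. Great for equality...
hash_index = {enrollment: pk for pk, enrollment in rows.items()}
print(hash_index.get(1000))          # 4, a single O(1) probe
# ...but useless for a range: there is no order to walk, so we would have to scan.

# "Tree index": (enrollment, PK) pairs kept sorted, so a range becomes a slice.
tree_index = sorted((enrollment, pk) for pk, enrollment in rows.items())
cutoff = bisect.bisect_left(tree_index, (5000,))
print([pk for _, pk in tree_index[:cutoff]])   # [1, 2, 3, 4] -> enrollment < 5000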
Second question
First, I need to rewrite the query to use modern JOIN syntax. The old style (using commas) has been obsolete since SQL-92 in 1992, that's almost 30 years ago.
SELECT
*
FROM
Apply
INNER JOIN Student ON Student.sID = Apply.sID
INNER JOIN College ON College.cName = Apply.cName
WHERE
Student.GPA > 1.5
AND
College.cName < 'Cornell'
Now, generally speaking the best way to answer this kind of question would be to know what the STATISTICS (cardinality, value distribution, etc) of the tables are. But without that I can still make some guesses.
I assume that College is the smallest table (~500 rows?), Student will have maybe 1-2m rows, and assuming every Student makes 4-5 applications then the Apply table will have ~5m rows.
...armed with that inference, we can deduce:
Student.sID = Apply.sID is an ID match - so a hash-index would be better in most cases (excepting if the PK clustering matters, but I won't digress).
Student.GPA > 1.5 - this is a range search so having a tree-based index here helps.
College.cName < 'Cornell' - again, this is a range comparison so a tree-based index here helps too.
So the best indexes would be Student.GPA and College.cName, but that isn't an option - so let's see what the benefits of each option are...
(As I was writing this, I saw that #charlieface posted their answer which already covers this, so I'll just link to theirs to save my time: https://stackoverflow.com/a/67829326/159145 )

Is there a way to optimize this Gremlin query?

I have a graph database which looks like this (simplified) diagram:
Each unique ID has many properties, which are represented as edges from the ID to unique values of that property. Basically that means that if two ID nodes have the same email, then their has_email edges will both point to the same node. In the diagram, the two shown IDs share both a first name and a last name.
I'm having difficulty writing an efficient Gremlin query to find matching IDs, for a given set of "matching rules". A matching rule will consist of a set of properties which must all be the same for IDs to be considered to have come from the same person. The query I'm currently using to match people based on their first name, last name, and email looks like:
g.V().match(
__.as("id").hasId("some_id"),
__.as("id")
.out("has_firstName")
.in("has_firstName")
.as("firstName"),
__.as("id")
.out("has_lastName")
.in("has_lastName")
.as("lastName"),
__.as("id")
.out("has_email")
.in("has_email")
.as("email"),
where("firstName", eq("lastName")),
where("firstName", eq("email")),
where("firstName", neq("id"))
).select("firstName")
The query returns a list of IDs which match the input some_id.
When this query tries to match an ID with a particularly common first name, it becomes very, very slow. I suspect that the match step is the problem, but I've tried to find an alternative, with no luck so far.
The performance of this query will depend on the edge degrees in your graph. Since many people share the same first name, you will most likely have a huge number of edges going into a specific firstName vertex.
You can make assumptions like: there are fewer people with the same last name than people with the same first name, and of course there should be even fewer people who share the same email address. With that knowledge you can start by traversing to the vertices with the lowest degree first and then filter from there:
g.V().hasId("some_id").as("id").
out("has_email").in("has_email").where(neq("id")).
filter(out("has_lastName").where(__.in("has_lastName").as("id"))).
filter(out("has_firstName").where(__.in("has_firstName").as("id")))
With that, the performance will mostly depend on the vertex with the lowest edge degree.
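The same "cheapest filter first" idea can be sketched outside Gremlin. The hedged Python below assumes we already have the candidate ID sets reachable through each property (hypothetical ids and property names) and intersects them starting with the smallest, i.e. the lowest edge degree, so the later, more expensive checks only run on a handful of candidates.

candidates = {
    "email":     {"id_1", "id_7"},                   # emails are nearly unique
    "lastName":  {"id_1", "id_7", "id_9"},
    "firstName": {"id_1", "id_2", "id_7", "id_9"},   # common names -> big set
}

matches = None
for prop in sorted(candidates, key=lambda p: len(candidates[p])):
    ids = candidates[prop]
    matches = ids if matches is None else matches & ids   # cheapest filter first

print(sorted(matches))   # ['id_1', 'id_7']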

How to model the scope of a database transaction within an UML activity diagram?

I need to model a process; the required notation is a UML activity diagram.
An input file is read and for each line a database record would be batch-inserted, but only if all lines of the input file pass some validity checks. If any line violates the validation rules, the entire input file would be rejected.
This seems like a very common pattern, however, the only graphical way of modelling this seems to be modelling begin and end of the transaction as activities.
Isn't there a way to do this more nicely?
Is there a UML or SysML language element that corresponds to a looped transaction?
The diagram is intended for a non-tech customer who would be very confused by the transaction activities.
Structured Activity Nodes have been around in UML for a while (though I have to admit they completely passed my attention so far). The current 2.5 spec says on p. 477:
Loop Nodes
A LoopNode is a StructuredActivityNode that represents an iterative loop. A LoopNode consists of a setupPart, a test and a bodyPart, which identify subsets of the ExecutableNodes contained in the LoopNode. Any ExecutableNode in the LoopNode must be included in the setupPart, test or bodyPart for the LoopNode.
On p. 478:
Notation
A StructuredActivityNode is notated with a dashed round cornered rectangle enclosing its nodes and edges, with the keyword «structured» at the top.
No standard notation is defined for ConditionalNodes, LoopNodes or SequenceNodes.
Note the last sentence. The notation for this has been expanded in 2.5. But honestly I would not use it much here; I would look into BPMN instead, which is a UML-based profile that has become more widespread. Unless you use BPMN, I'd rather stay with basic UML notation like this:
There are two activities, Process File (to the left) and Process Line (shown as a diagram frame to the right). The latter is invoked from the first activity.

Need algorithm for fast storage and retrieval (search) of sets and subsets

I need a way of storing sets of arbitrary size for fast query later on.
I'll be needing to query the resulting data structure for subsets or sets that are already stored.
===
Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting for people to develop the algorithm themselves.
I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want since, from what I understand, it 'clusters' the tuples into simpler, discrete/approximate forms and loses the original tuples.
Now, an even simpler example:
[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]
Query:
[alpha, delta]
Result:
[alpha, beta, gamma, delta] [alpha, epsilon, delta]
So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.
==
Original examples:
For example, say the database contains these sets
[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1]
If I use [A1, C1] as a query, these two sets should be returned as a result:
[A1, B1, C1, D1], [A1, D3, C1]
Example 2:
Database:
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]
Query:
[Distance to berlin: 240km]
Result
[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common states (such as Distance to Berlin: 240km) in such a way that the query is as efficient as possible.
What algorithms are there for such needs?
I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.
Clarifications:
If it helps answer the question, I'm intending on using them for storing states:
Simple example:
[Has milk, Doesn't have eggs, Has Sugar]
I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure
Conclusion
I've implemented the two algorithms proposed in the answers, that is Set-Trie and Inverted Index and did some rudimentary profiling on them. Illustrated below is the duration of a query for a given set for each algorithm. Both algorithms worked on the same randomly generated data set consisting of sets of integers. The algorithms seem equivalent (or almost) performance wise:
I'm confident that I can now contribute to the solution. One possible quite efficient way is a:
Trie, invented by Franklin Mark Liang
Such a special tree is used, for example, in spell checking or autocompletion, and it actually comes close to your desired behavior, especially in allowing you to search for subsets quite conveniently.
The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.
What is a Set-Trie? A tree where each node except the root contains a single attribute value (a number) and a marker (bool) indicating whether a data entry ends at this node. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Trie is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all marked nodes that you reach when you go down the tree and up the search key simultaneously (see below).
But first a drawing by me:
The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.
Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.
Searching including subsets Say you want to search for attributes 4 and 1. First you order them; the search key is {1,4}. Now, starting from the root, you go simultaneously up the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller than or equal to 1. There is only one, namely 1. Inside it you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than 4, which is all of them. You continue until there is nothing left to do, and collect all circles (data entries) that have the attribute value exactly 4 (or whatever the last attribute in the key is). These are {1,2,4} and {1,4}, but not {1,3} (no 4) or {2,4} (no 1).
Insertion This is very easy: go down the tree and store the data entry at the appropriate position. For example, the data entry {2,5} would be stored as a child of {2}.
Add attributes dynamically This is naturally supported; you could immediately insert {1,4,6}. It would come below {1,4}, of course.
I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.
I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.
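For completeness, here is a minimal Python sketch of a Set-Trie along the lines described above (an illustration, not Savnik's reference implementation): children always hold elements larger than their parent, and the superset search prunes any branch whose next element already exceeds the query element it is looking for.

class SetTrieNode:
    def __init__(self):
        self.children = {}   # element -> child node
        self.is_set = False  # marker: a stored set ends here

class SetTrie:
    def __init__(self):
        self.root = SetTrieNode()

    def insert(self, items):
        node = self.root
        for x in sorted(items):      # children always hold larger elements
            node = node.children.setdefault(x, SetTrieNode())
        node.is_set = True

    def supersets(self, query):
        results = []
        self._search(self.root, sorted(set(query)), 0, [], results)
        return results

    def _search(self, node, q, i, path, results):
        if i == len(q):
            self._collect(node, path, results)   # every stored set below is a superset
            return
        for elem in sorted(node.children):
            if elem < q[i]:          # extra element, keep looking for q[i] deeper
                self._search(node.children[elem], q, i, path + [elem], results)
            elif elem == q[i]:       # matched the next query element
                self._search(node.children[elem], q, i + 1, path + [elem], results)
            # elem > q[i]: q[i] cannot occur in this subtree, so prune it

    def _collect(self, node, path, results):
        if node.is_set:
            results.append(set(path))
        for elem, child in node.children.items():
            self._collect(child, path + [elem], results)

trie = SetTrie()
for s in [{1, 2, 4}, {1, 3}, {1, 4}, {2, 3, 5}, {2, 4}]:
    trie.insert(s)
print(trie.supersets({1, 4}))   # [{1, 2, 4}, {1, 4}]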
How about having an inverse index built of hashes?
Suppose you have your values int A, char B, bool C of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.
Then you define a map that maps a hash value (index) to a vector of pointers to the tuples
std::map<size_t,std::vector<TupleStruct*> > mymap;
or, if you can use global indices, just
std::map<size_t,std::vector<size_t> > mymap;
For retrieval by queries X and Y, you need to
get hash value of the queries Xh and Yh
get the corresponding "sets" out of mymap
intersect the sets mymap[Xh] and mymap[Yh]
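A small Python sketch of the same inverted-index idea, assuming each stored set is given an integer id: every element maps to the ids of the sets containing it, and a query is answered by intersecting the posting lists of its elements (the data is the alpha/beta/gamma example from the question).

from functools import reduce

sets = {
    0: {"alpha", "beta", "gamma", "delta"},
    1: {"alpha", "epsilon", "delta"},
    2: {"gamma", "niu", "omega"},
    3: {"omega", "beta"},
}

# Build the inverted index: element -> ids of the sets containing it.
index = {}
for sid, items in sets.items():
    for x in items:
        index.setdefault(x, set()).add(sid)

def query(elems):
    posting_lists = [index.get(x, set()) for x in elems]
    ids = reduce(set.intersection, posting_lists) if posting_lists else set()
    return [sets[i] for i in sorted(ids)]

print(query({"alpha", "delta"}))   # the two sets that contain both alpha and delta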
If I understand your needs correctly, you need a multi-state storing data structure, with retrievals on combinations of these states.
If the states are binary (as in your examples: Has milk/doesn't have milk, has sugar/doesn't have sugar) or could be converted to binary(by possibly adding more states) then you have a lightning speed algorithm for your purpose: Bitmap Indices
Bitmap indices can do such comparisons in memory, and there is literally nothing comparable to them in terms of speed (ANDing bits is what computers can do the fastest).
http://en.wikipedia.org/wiki/Bitmap_index
Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086
Many SQL databases support bitmap indexing, and there are several possible optimizations for it as well (compression etc.):
MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx
Oracle: http://www.orafaq.com/wiki/Bitmap_index
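A tiny hedged Python sketch of the idea, using plain integers as bit vectors (bit i set means record i has that state); answering a query is then just ANDing the bitmaps of the queried states.

records = [
    {"has_milk", "has_sugar"},    # record 0
    {"has_milk"},                 # record 1
    {"has_sugar", "has_eggs"},    # record 2
]

# One bitmap (integer) per state; bit i is set if record i has that state.
bitmaps = {}
for i, states in enumerate(records):
    for state in states:
        bitmaps[state] = bitmaps.get(state, 0) | (1 << i)

def matching(query_states):
    mask = -1                     # all bits set; each AND narrows it down
    for state in query_states:
        mask &= bitmaps.get(state, 0)
    return [i for i in range(len(records)) if mask >> i & 1]

print(matching({"has_milk", "has_sugar"}))   # [0]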
Edit:
Apparently the original research work on bitmap indices is no longer available for free public access.
Links to recent literature on this subject:
Bitmap Index Design Choices and Their Performance Implications
Bitmap Index Design and Evaluation
Compressing Bitmap Indexes for Faster Search Operations
This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).
One of the earliest results in this area is from a paper by Ron Rivest from 1976 [1]. A more recent paper, from 2002, is [2]. Hopefully, this will be enough of a starting point to do a more in-depth literature search.
[1] Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.
[2] Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.
This seems like a custom made problem for a graph database. You make a node for each set or subset, and a node for each element of a set, and then you link the nodes with a relationship Contains. E.g.:
Now you put all the elements A, B, C, D, E in an index/hash table, so you can find a node in constant time in the graph. Typical performance for a query [A,B,C] will be on the order of the smallest node's degree, multiplied by the size of a typical set. E.g. to find [A,B,C], I find that the order of A is one, so I look at all the sets A is in (just S1), and then I check that S1 contains both B and C; since the order of S1 is 4, I have to do a total of 4 comparisons.
A prebuilt graph database like Neo4j comes with a query language, and will give good performance. I would imagine, provided that the typical orders of your database is not large, that its performance is far superior to the algorithms based on set representations.
Hashing is usually an efficient technique for storage and retrieval of multidimensional data. Problem is here that the number of attributes is variable and potentially very large, right? I googled it a bit and found Feature Hashing on Wikipedia. The idea is basically the following:
Construct a hash of fixed length from each data entry (aka feature vector)
The length of the hash must be much smaller than the number of available features. The length is important for the performance.
On the Wikipedia page there is an implementation in pseudocode (create a hash for each feature contained in the entry, then increase the feature vector at that index position (modulo length) by one) and links to other implementations.
Also, here on SO there is a question about feature hashing which, amongst other things, references a scientific paper about Feature Hashing for Large Scale Multitask Learning.
I cannot give a complete solution, but you didn't want one anyway. I'm quite convinced this is a good approach. You'll have to play around with the length of the hash as well as with different hashing functions (Bloom filter being another keyword) to optimize the speed for your special case. Also, there might be even more efficient approaches if, for example, retrieval speed is more important than storage (balanced trees maybe?).
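As a quick illustration of that pseudocode, here is a hedged Python sketch of the hashing step: each entry becomes a fixed-length vector by hashing every feature into a bucket position (hashlib is used here only to keep buckets stable across runs; the vector length and the hash function are the knobs to tune).

import hashlib

def bucket(feature, length):
    # Stable hash so the same feature always lands in the same bucket.
    return int(hashlib.md5(feature.encode()).hexdigest(), 16) % length

def feature_hash(features, length=8):
    vec = [0] * length
    for f in features:
        vec[bucket(f, length)] += 1   # increment the hashed bucket, modulo length
    return vec

print(feature_hash(["Has milk", "Doesn't have eggs", "Has Sugar"]))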

Relational Algebra Query Troubles

I have a problem where I have two relations, one containing attributes song_id, song_name, album_id, and the other containing album_id and album_name. I need to find the names of all the albums that do not have songs in the song relation. The problem is I can only use Rename, Projection, Selection, Grouping(with sum,min,max,count), Cartesian Product, and Natural join. I have spent a good amount of time working on this and would appreciate any help that pointed me in the right direction.
As #ErwinSmout pointed out, difference is a generally easy way to do it. But since you can't use it, there is a tricky workaround using counts. I'm assuming that every album_id present in the songs relation is also present in the albums relation.
PROJECT album_id from the songs relation (note that relational algebra's PROJECT is equivalent to SQL's SELECT DISTINCT). I'll call this relation song_albums. Now let's take the count of the albums relation, call it m, and the count of the new relation, call it n.
Take the Cartesian product of the albums relation and the song_albums relation. This new relation has m*n rows. Now if you do a count grouped by album_name, each of the m album_names will have a count of n. Not very helpful.
But now, we SELECT from the relation rows where albums.album_id != song_albums.album_id. Now, if you do a count grouped by album_name, the count for those albums that were not in the original songs relation will be n, while those that were originally in there will have a count less than n, since rows would have been removed based on how many songs with that album were in the original songs relation.
Edit: As it turns out, this isn't a strictly relational-algebra solution: in SQL, a 1 x 1 table, such as the one containing n, can simply be treated as an integer and used in an equality comparison. However, according to Wikipedia, selection must compare either two attributes of a relation, or an attribute and a constant value.
This obstacle can be dealt with by another ill-advised Cartesian product: we can take the Cartesian product of the 1 x 1 relation containing n with our most recent relation. Now we can make a proper relational-algebra selection, since we have an attribute that is always equal to n.
Since this has gotten rather complex, here is a relational-algebra expression capturing the above english explanation:
Note that n is a 1 x 1 relation with an attribute named "count".
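Written out, a hedged reconstruction of that expression from the steps above (in LaTeX notation, with \rho for rename and \gamma for grouping/aggregation; a sketch, not the original figure) could read:

SA = \rho_{SA}(\pi_{album\_id}(songs))
n = \gamma_{\mathrm{COUNT}(*) \to count}(SA)
R = \gamma_{album\_name,\ \mathrm{COUNT}(*) \to cnt}\big(\sigma_{albums.album\_id \neq SA.album\_id}(albums \times SA)\big)
Result = \pi_{album\_name}\big(\sigma_{cnt = count}(R \times n)\big)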
It's impossible. The problem includes a negation, and in relational algebra that can only be expressed using relational difference, which you're seemingly not allowed to use.
I'm curious to see what your teacher presents as the solution to this problem.
