In what situations would Rank and Dense Rank be useful? - sql-server

I know the syntax for the RANK and DENSE_RANK functions,
but I can't find any real-world uses for them.
For example, DENSE_RANK:
Ranking  UserID
1        500
1        500
2        502
2        502
and RANK:
Ranking  UserID
1        500
1        500
1        500
1        500
1        500
1        500
1        500
8        502
8        502
8        502
8        502
8        502
8        502
8        502
15       504
I can't understand how the 1,1 2,2 values would be useful in the real world.
On the other hand, I understand very clearly what the real-world uses for ROW_NUMBER over a partition are; I just can't see what I can do with this kind of information (dense and regular rank).

You could use it to find the top n rows for each group.
There is a very good explanation here:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2920665938600
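
As a rough illustration of the "top n per group" idea, here is a minimal Python sketch rather than SQL. The department/salary rows and the helper names are made up for the example; the point is only that a dense rank keeps every row tied on one of the n highest values, where a row number would cut ties off arbitrarily.

```python
from collections import defaultdict

def dense_rank_desc(values):
    """Map each value to its dense rank in descending order (ties share a rank, no gaps)."""
    return {v: i for i, v in enumerate(sorted(set(values), reverse=True), start=1)}

def top_n_per_group(rows, n):
    """Return, per group, the names whose score is among the n highest distinct scores."""
    by_group = defaultdict(list)
    for group, name, score in rows:
        by_group[group].append((name, score))
    result = {}
    for group, members in by_group.items():
        ranks = dense_rank_desc(score for _, score in members)
        result[group] = sorted(name for name, score in members if ranks[score] <= n)
    return result

# Hypothetical data: (department, employee, salary)
rows = [
    ("sales", "ann", 500), ("sales", "bob", 500), ("sales", "cy", 400), ("sales", "dee", 300),
    ("ops", "eve", 700), ("ops", "fay", 600), ("ops", "gus", 600), ("ops", "hal", 100),
]
top2 = top_n_per_group(rows, 2)
print(top2)
```

With n = 2, both "sales" and "ops" return three people, because two of them are tied on one of the two highest salaries.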

Specific examples would be arbitrary and not necessarily helpful. So long as you understand what they do (q.v. @Kevin Burton's link), and can remember at least vaguely that this functionality exists, then if or when a situation comes up where they would be useful--if not critical--you'll be able to pull them out of the database developer's bag of tricks. (I've used RANK once, maybe twice, and it was very useful each time, but I can't--and don't need to--recall the details without looking them up.)

Related

Where to find a number-pattern based IQ test?

I'm asking this question in response to this article. The article claims that a computer program has been created that scores 150 on a number-pattern based IQ test. I can find no such test online; has anyone heard of one? I would like to test my own program, which currently answers 6 of 8 questions correctly on this test, but the test doesn't give me an IQ.
Math-based patterns are ideal, like: 1 4 10 22
Not stuff like this: 1 2 1 3 1 4 1
Have a look on Kaggle. They have all kinds of machine learning datasets and challenges. For example, this sounds similar to what you are looking for.

How should I format/set up my dataset/dataframe? And factor -> numeric problems

New to R and new to this forum; I tried searching, and I hope I don't embarrass myself by failing to identify previous answers.
So I've got my data, and I intend to do some kind of GLMMs in the end, but that's far in the future; first I'm going to do some simple GLMs/LMs to learn what I'm doing.
First, about my data:
I have data sampled from 2 "general areas" on opposite sides of the country.
In each of these general areas there are roughly 50 trakts placed (in a grid, random starting point).
Trakts have been revisited each year for 4 years.
A trakt contains 16 sample plots; I intend to work at trakt level, so I use the means of the 16 sample plots for each trakt.
2 x 4 x 50 = 400 rows (the actual number is 373 rows after removing trakts where not enough plots could be sampled due to terrain etc.)
The data in my Excel file is currently divided like this:
Rows = trakts
Columns = the measured variables
I've got 8-10 columns I want to use.
Short example of how the data looks now:
V1 - predictor, 4 different columns
V2 - response variable = proportional data, 1-4 columns depending on which hypothesis I end up testing
The GLMM in the end would look something like (V2~V1+V1+V1,(area,year))
Area  Year  Trakt  V1           V2
A     2015  1      25.165651    0
A     2015  2      11.16894652  0.1
A     2015  3      18.231       0.16
A     2014  1      3.1222       N/A
A     2014  2      6.1651       0.98
A     2014  3      8.651        1
A     2013  1      6.16416      0.16
B     2015  1      9.12312      0.44
B     2015  2      22.2131      0.17
B     2015  3      12.213       0.76
B     2014  1      1.123132     0.66
B     2014  2      0.000        0.44
B     2014  3      5.213265     0.33
B     2013  1      2.1236       0.268
How should I get started on this?
8 different files?
Nested by trakts (do I start nesting now, or later when I'm doing GLMMs)?
I load my data into R with the read.table function.
If I run sapply(dataframe, class):
V1 and V2 are factors, everything else is integer.
If I run sapply(dataframe, mode):
everything is numeric.
So finally, to my actual problems: I have been trying to do normality tests (only tried Shapiro so far), but I keep getting errors that imply my data is not numeric.
Also, when I run a normality test, do I run only one column and evaluate it before moving on to the next column, or should I run several columns? The entire dataset?
Should I, in my case, run independent normality tests for each of my areas and years?
Hope it didn't end up too cluttered.
Best regards

NEXT Function, or test if following Row Group is hidden

I'm using ReportBuilder 2.0 / SQL Server 2008.
I have a report that uses visibility settings on the row groups which results in some row group headings being hidden, which in turn makes report totals seem incorrect. I can't change the visibility settings (for business reasons); what I'm looking for is a way to test EITHER for hidden items, OR for apparently incorrect totals. Consider the following dataset:
ItemCode  SubPhaseCode  SubPhase    BidItem    XTDPrice
1         1             Water       Utility 1  5000
2         1             Water       Utility 2  4000
3         2             Electrical  Utility 3  75000
4         2             Electrical  Utility 3  75000
5         2             Electrical  Utility 3  100000
6         2             Electrical  Utility 4  2500
7         2             Electrical  Utility 4  2500
8         2             Electrical  Utility 4  5064
9         2             Electrical  Utility 5  3000
10        2             Electrical  Utility 5  3000
11        2             Electrical  Utility 5  5796
12        3             Gas         Utility 6  60000
13        3             Gas         Utility 6  60000
14        3             Gas         Utility 6  61547
15        4             Other       Utility 7  6000
16        4             Other       Utility 7  7000
There are 3 Row Groups on the report: one for SubPhaseCode ("Group1"), and two for BidItem ("Group2" and "DetailsGroup"):
Link to Design View Screenshot
The Row Visibility property for Group1 (SubPhaseCode) is:
=IIF(Fields!SubPhaseCode.Value = 3, true, false)
This results in the heading for the SubPhase "Gas" being hidden. This means that, when the report is run, I get something like the following:
Total        475407
Water        9000
-Utility 1   5000
-Utility 2   4000
Electrical   271860
-Utility 3   250000
-Utility 4   10064
-Utility 5   11796
-Utility 6   181547
Other        13000
-Utility 7   13000
The fact that SubPhase 3 ("Gas") is hidden results in 2 apparent errors:
1) The sum for "Electrical" (271860) appears incorrect for the 4 items below it (because there should be another row heading above "Utility 6")
2) The total of 475407 appears incorrect for the 3 groups below it (9000 + 271860 + 13000).
What I am looking for is a way to change the formatting of the headings (especially the Group Headings) if the numbers below them apparently don't add up. I understand how to implement conditional formatting and have done this for the Total. I am unclear how this could be implemented for the Row Group.
I would basically need some kind of a test, for each Row Heading, to see if the following heading would be hidden, according to the rules. This sounds to me like a "NEXT" function, which I know doesn't exist.
Other searches have indicated that I might need to add the desired data to the dataset or modify the underlying SP. Just wondering if there are any simpler solutions.
Thanks much for the help!
I'd leave the hidden SubPhase's values out of the SUM() for the group.
Try:
=SUM(IIF(Fields!SubPhaseCode.Value=3,0,Fields!XTDPrice.Value))
Let me know if this helps.
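
To check what that expression would produce, here is a small Python sketch of the same conditional sum, using the rows from the question. It is not SSRS code; `visible_total` and `HIDDEN_SUBPHASE` are names invented for the example.

```python
# (SubPhaseCode, SubPhase, XTDPrice) rows taken from the question's dataset
rows = [
    (1, "Water", 5000), (1, "Water", 4000),
    (2, "Electrical", 75000), (2, "Electrical", 75000), (2, "Electrical", 100000),
    (2, "Electrical", 2500), (2, "Electrical", 2500), (2, "Electrical", 5064),
    (2, "Electrical", 3000), (2, "Electrical", 3000), (2, "Electrical", 5796),
    (3, "Gas", 60000), (3, "Gas", 60000), (3, "Gas", 61547),
    (4, "Other", 6000), (4, "Other", 7000),
]
HIDDEN_SUBPHASE = 3  # the SubPhaseCode hidden by the visibility rule

def visible_total(rows, hidden_code):
    """Same idea as SUM(IIF(code = hidden, 0, price)): hidden rows contribute 0."""
    return sum(0 if code == hidden_code else price for code, _, price in rows)

full_total = sum(price for _, _, price in rows)
shown_total = visible_total(rows, HIDDEN_SUBPHASE)
print(full_total, shown_total)
```

The second figure equals 9000 + 271860 + 13000, i.e. the three group headings that are actually displayed, so the Total line would agree with what the reader sees.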

Sequences in Graph Database

All,
I am new to the graph database area and want to know if this type of example is applicable to a graph database.
Say I am looking at a baseball game. When each player goes to bat, there are 3 possible outcomes: hit, strikeout, or walk.
For each batter and throughout the baseball season, what I want to figure out is the counts of the sequences.
For example, for batters that went to the plate n times, how many had a particular sequence (e.g. hit/walk/strikeout or hit/hit/hit/hit), and how many of those batters repeated the same sequence, indexed by time? To explain further, time would let me know whether a particular sequence (e.g. hit/walk/strikeout or hit/hit/hit/hit) occurred at the beginning of the season, in the middle, or in the later half.
For a key-value type database, the raw data would look as follows:
Batter   Time   Game  Event      Bat
-------  -----  ----  ---------  ---
Charles  April  1     Hit        1
Charles  April  1     strikeout  2
Charles  April  1     Walk       3
Doug     April  1     Walk       1
Doug     April  1     Hit        2
Doug     April  1     strikeout  3
Charles  April  2     strikeout  1
Charles  April  2     strikeout  2
Doug     May    5     Hit        1
Doug     May    5     Hit        2
Doug     May    5     Hit        3
Doug     May    5     Hit        4
Hence, my output would appear as follows:
Sequence                 Freq  Unique Batters  Time
-----------------------  ----  --------------  -----
hit                      5000  600             April
walk/strikeout           3000  350             April
strikeout/strikeout/hit  2000  175             April
hit/hit/hit/hit/hit      1000  80              April
hit                      6000  800             May
walk/strikeout           3500  425             May
strikeout/strikeout/hit  2750  225             May
hit/hit/hit/hit/hit      1250  120             May
.                        .     .               .
.                        .     .               .
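
To make the desired output concrete, here is a minimal in-memory Python sketch of the counting, not a graph-database solution. It assumes a "sequence" means consecutive at-bats by one batter within a single game, and that sequences are capped at `max_len` events; both are assumptions, as is the function name `sequence_counts`.

```python
from collections import defaultdict

# (batter, time, game, event, bat) rows from the sample above
raw = [
    ("Charles", "April", 1, "Hit", 1), ("Charles", "April", 1, "strikeout", 2),
    ("Charles", "April", 1, "Walk", 3), ("Doug", "April", 1, "Walk", 1),
    ("Doug", "April", 1, "Hit", 2), ("Doug", "April", 1, "strikeout", 3),
    ("Charles", "April", 2, "strikeout", 1), ("Charles", "April", 2, "strikeout", 2),
    ("Doug", "May", 5, "Hit", 1), ("Doug", "May", 5, "Hit", 2),
    ("Doug", "May", 5, "Hit", 3), ("Doug", "May", 5, "Hit", 4),
]

def sequence_counts(rows, max_len=5):
    """Return {(sequence, time): (frequency, unique_batters)} for consecutive at-bats."""
    # Collect each batter's events per game, keyed so they can be ordered by at-bat.
    games = defaultdict(list)
    for batter, time, game, event, bat in rows:
        games[(batter, time, game)].append((bat, event.lower()))
    freq = defaultdict(int)
    batters = defaultdict(set)
    for (batter, time, _game), events in games.items():
        ordered = [event for _, event in sorted(events)]
        # Enumerate every run of 1..max_len consecutive events.
        for start in range(len(ordered)):
            for end in range(start + 1, min(start + max_len, len(ordered)) + 1):
                key = ("/".join(ordered[start:end]), time)
                freq[key] += 1
                batters[key].add(batter)
    return {key: (freq[key], len(batters[key])) for key in freq}

counts = sequence_counts(raw)
print(counts[("hit/hit", "May")])
```

The number of runs grows quickly with game length and `max_len`, which is the scaling concern raised in the question that follows.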
If this is feasible for a graph database, would it also scale? What if instead of 3 possible outcomes for a batter, there were 10,000 potential outcomes with 10,000,000 batters?
More so, the 10,000 unique outcomes would be sequenced in a combinatoric setting (e.g. 10,000 CHOOSE 2, 10,000 CHOOSE 3, etc.).
My question, then, is: if a graph database is appropriate, how would you propose setting up a solution?
Much thanks in advance.

db optimization: computing rank

This question asks how to select a user's rank by his id.
id  name   points
1   john   4635
3   tom    7364
4   bob    234
6   harry  9857
The accepted answer is
SELECT  uo.*,
        (
        SELECT  COUNT(*)
        FROM    users ui
        WHERE   (ui.points, ui.id) >= (uo.points, uo.id)
        ) AS rank
FROM    users uo
WHERE   id = @id
which makes sense. I'd like to understand what the performance tradeoffs would be, between this approach, or by modifying the db structure to store a calculated rank (I guess that would require massive changes every time there's a rank change), or any other approaches that I'm too newb to think of. I'm a db noob.
The performance tradeoff would basically be what you described:
If you modified the structure to store a rank, queries would be very, very simple and fast. However, this would require some overhead any time "points" changed, as you'd have to verify that the rank hasn't changed. If the ranking had changed, you'd have to do multiple updates.
This causes more work (with the potential for bugs) at every update/insert. The tradeoff is very fast reads. If your typical usage is very few modifications compared to millions of reads, AND you found this query to be a bottleneck, it might be worth considering reworking this. However, I would avoid the added complexity and maintainability headaches unless you truly found this to be a problem, since the current solution requires less storage and is very flexible.
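
To make the two options concrete, here is a minimal Python sketch over the four rows from the question. `rank_by_count` mirrors the counting subquery (Python compares tuples the same way as the row-value comparison in the WHERE clause), and `stored_ranks` stands in for a precomputed rank column; both names are invented for the example.

```python
# (id, name, points) rows from the question
users = [(1, "john", 4635), (3, "tom", 7364), (4, "bob", 234), (6, "harry", 9857)]

def rank_by_count(users, user_id):
    """Compute the rank on read: count rows whose (points, id) >= the target's."""
    target = next((points, uid) for uid, _, points in users if uid == user_id)
    return sum(1 for uid, _, points in users if (points, uid) >= target)

def stored_ranks(users):
    """What a stored rank column would hold; it has to be rebuilt when points change."""
    ordered = sorted(users, key=lambda u: (u[2], u[0]), reverse=True)
    return {uid: pos for pos, (uid, _, _) in enumerate(ordered, start=1)}

ranks = stored_ranks(users)
print(rank_by_count(users, 1), ranks[1])
```

The first function scans every row on each read and has nothing to maintain; the second makes a read a plain lookup, but a single change to "points" can shift the rank of many other rows.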
The link you reference is a MySQL question. If the original database had been Oracle, the accepted answer would be to use an analytic function, which does scale, very nicely:
SQL> select id, name, points from users order by id
  2  /

        ID NAME           POINTS
---------- ---------- ----------
         1 john             4635
         3 tom              7364
         4 bob               234
         6 harry            9857
         8 algernon            1
         9 sebastian         234
        10 charles           888

7 rows selected.

SQL> select name, id, points, rank() over (order by points)
  2  from users
  3  /

NAME               ID     POINTS RANK()OVER(ORDERBYPOINTS)
---------- ---------- ---------- -------------------------
algernon            8          1                         1
bob                 4        234                         2
sebastian           9        234                         2
charles            10        888                         4
john                1       4635                         5
tom                 3       7364                         6
harry               6       9857                         7

7 rows selected.

SQL> select name, id, points, dense_rank() over (order by points desc)
  2  from users
  3  /

NAME               ID     POINTS DENSE_RANK()OVER(ORDERBYPOINTSDESC)
---------- ---------- ---------- -----------------------------------
harry               6       9857                                   1
tom                 3       7364                                   2
john                1       4635                                   3
charles            10        888                                   4
bob                 4        234                                   5
sebastian           9        234                                   5
algernon            8          1                                   6

7 rows selected.

SQL>
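
For readers without an Oracle instance at hand, the rank columns above can be reproduced with a short Python sketch. This only illustrates how ties are numbered, not how the database computes them; the `rank` and `dense_rank` helpers are written for this example.

```python
def rank(values, reverse=False):
    """RANK(): tied values share the position of the first tie, leaving gaps after."""
    ordered = sorted(values, reverse=reverse)
    first_pos = {}
    for pos, value in enumerate(ordered, start=1):
        first_pos.setdefault(value, pos)  # keep the earliest position per value
    return [first_pos[value] for value in ordered]

def dense_rank(values, reverse=False):
    """DENSE_RANK(): tied values share a rank and the next rank follows with no gap."""
    ordered = sorted(values, reverse=reverse)
    distinct = {v: i for i, v in enumerate(sorted(set(values), reverse=reverse), start=1)}
    return [distinct[value] for value in ordered]

# The POINTS column from the seven-row table above
points = [4635, 7364, 234, 9857, 1, 234, 888]
print(rank(points))                      # ascending, like rank() over (order by points)
print(dense_rank(points, reverse=True))  # like dense_rank() over (order by points desc)
```

The two users tied on 234 get the same number either way; the difference is whether the next value continues at 4 (RANK, ascending) or simply takes the next integer (DENSE_RANK).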
Does not the 'where' portion of that query internally require reading the entire table? I understand about premature optimization. Academically, it seems that this wouldn't scale further than a few thousand rows.
