selecting rows based on multiple columns - database

Suppose I have a table like the one below:
reg  sub1  sub2  sub3
1    A     F     F
2    F     F     A
3    A     B     B
4    A     B     F
5    A     F     F
sub1, sub2, and sub3 are three different subjects, and the values A, B, and F are grades: A and B carry points, but F is a fail. Now I want the reg of students who have one 'F' (i.e. failed in only one subject), two 'F's, three 'F's, and so on. How do I do that?
My desired output for students with one F (failed in only one subject) is:
reg  sub1  sub2  sub3
4    A     B     F
and for students with two Fs (failed in two subjects):
reg  sub1  sub2  sub3
1    A     F     F
2    F     F     A
5    A     F     F

First, as @Mat said in a comment above, this would be much easier if you normalised your data model.
Second, I'll assume that you are looking for a SQL query. Third, since you didn't mention which database system you are using (SQL Server, DB2, Oracle, Postgres, etc.), I will show a solution that works at least on SQL Server:
SELECT *
FROM grades
WHERE (CASE sub1 WHEN 'F' THEN 1 ELSE 0 END) +
      (CASE sub2 WHEN 'F' THEN 1 ELSE 0 END) +
      (CASE sub3 WHEN 'F' THEN 1 ELSE 0 END) = 2
This finds all records in your table (grades) with exactly two F grades. It converts each F into a 1 and every other grade into a 0, then sums those numbers; that is, it computes the number of Fs in each record.
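For students failing exactly one subject, change the comparison to = 1. To see the whole breakdown in one pass, the same expression can be moved into a derived table and grouped on. A minimal sketch, assuming the same grades table (the alias g and the column name fails are my own, not from the answer):

SELECT fails, COUNT(*) AS num_students
FROM (
    SELECT reg,
           (CASE sub1 WHEN 'F' THEN 1 ELSE 0 END) +
           (CASE sub2 WHEN 'F' THEN 1 ELSE 0 END) +
           (CASE sub3 WHEN 'F' THEN 1 ELSE 0 END) AS fails
    FROM grades
) AS g
GROUP BY fails
ORDER BY fails;

On the sample data this should return fails = 0 with 1 student (reg 3), fails = 1 with 1 student (reg 4), and fails = 2 with 3 students (regs 1, 2, and 5).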

Related

Split row based on result of divide operation

I have shipment data that needs to be split into multiples of one basket. Before:
Pack ID  Brand    Part    Ship Qty  Qty per Basket  divideval  mod  Batch
4        Brand A  Part P  145       50              2          45   OB
4        Brand A  Part P  125       50              2          25   OB2
I need to multiply the rows based on Ship Qty / Qty per Basket. After:
Pack ID  Brand    Part    Ship Qty  Batch
4        Brand A  Part P  50        OB
4        Brand A  Part P  50        OB
4        Brand A  Part P  45        OB
4        Brand A  Part P  50        OB2
4        Brand A  Part P  50        OB2
4        Brand A  Part P  25        OB2
How can I do this using SQL Server?
Since you have already calculated the Qty per Basket, it makes things a little simpler.
The solution uses a numbers/tally table, or a recursive CTE to generate one.
The following query uses a numbers table:
select t.PackID, t.Brand, t.Part,
       case when n.n = 1 and t.[mod] <> 0 then t.[mod] else t.QtyperBasket end as ShipQty,
       t.Batch
from yourtable t
cross join numbers n
where n.n >= 1
  and n.n <= t.divideval + case when t.[mod] > 0 then 1 else 0 end
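If no numbers table exists, the recursive CTE the answer mentions can be sketched as follows. This is an illustration only: the CTE name nums and the cap of 100 rows are my assumptions, and the cap must cover the largest divideval + 1 in the data:

with nums as (
    select 1 as n
    union all
    select n + 1
    from nums
    where n < 100 -- assumed cap; size to the data
)
select t.PackID, t.Brand, t.Part,
       case when nums.n = 1 and t.[mod] <> 0 then t.[mod] else t.QtyperBasket end as ShipQty,
       t.Batch
from yourtable t
cross join nums
where nums.n <= t.divideval + case when t.[mod] > 0 then 1 else 0 end;

With a cap of 100, SQL Server's default MAXRECURSION limit is not hit; for larger caps, append OPTION (MAXRECURSION 0).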

Ancestry path with 'path_id' or 'group_id' and 'level' in a graph, SQL Server 2008

I have the following table:
pairID  source_issue_id  destination_issue_id
----------------------------------------------
1       J                I
2       B                C
3       J                M
4       F                I
5       A                B
6       A                E
7       N                O
8       J                L
9       C                D
10      P                Q
11      G                H
12      B                F
13      L                K
14      C                N
15      A                G
16      E                F
These rows represent edges in a graph.
I'm not using pairID, since it does not seem useful.
I want to get the ancestor path for all nodes, the level at which each pair occurs, the full path, and the 'path group'.
So far I've used the following code:
;with auxPairs as
(
    select
        1 as lvl,
        b.source_issue_id,
        b.destination_issue_id,
        cast((b.source_issue_id + '|' + b.destination_issue_id) as varchar(50)) as "full_path"
    from
        Pairs2 b
    where
        b.source_issue_id not in (select destination_issue_id from Pairs2)
    union all
    select
        a.lvl + 1 as lvl,
        c.source_issue_id,
        c.destination_issue_id,
        cast((a.full_path + '|' + c.destination_issue_id) as varchar(50)) as "full_path"
    from
        Pairs2 c
    join
        auxPairs a on a.destination_issue_id = c.source_issue_id
)
That returns the 'connection level' of two nodes (e.g. A|B is level 1) in lvl, the source and destination nodes and the full path (up to that pair), e.g.
lvl  source  destination  full_path
--------------------------------------
1    A       B            A|B
2    B       C            A|B|C
3    C       N            A|B|C|N
4    N       O            A|B|C|N|O
1    A       B            A|B
2    B       C            A|B|C
3    C       D            A|B|C|D
....
and so on for each path in the tree.
I need to add a path_id or group_id to this, so that I get:
path_id  lvl  source  destination  full_path
------------------------------------------------
1        1    A       B            A|B
1        2    B       C            A|B|C
1        3    C       N            A|B|C|N
1        4    N       O            A|B|C|N|O
2        1    A       B            A|B
2        2    B       C            A|B|C
2        3    C       D            A|B|C|D
....
meaning that the rows with the same path_id are connected in the given order.
NOTE: The alphabetical 'fake' order will not work with actual data. I need to use the path_id and lvl to establish the partial order within the path.
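No answer is preserved in this excerpt, but one possible way to derive path_id (my sketch, not from the thread) is to number the complete root-to-leaf paths and then attach each row to every complete path that its full_path is a prefix of. The leaves CTE and the prefix-match join below are assumptions; the recursive part repeats the question's own code:

;with auxPairs as
(
    select
        1 as lvl,
        b.source_issue_id,
        b.destination_issue_id,
        cast((b.source_issue_id + '|' + b.destination_issue_id) as varchar(50)) as full_path
    from Pairs2 b
    where b.source_issue_id not in (select destination_issue_id from Pairs2)
    union all
    select
        a.lvl + 1,
        c.source_issue_id,
        c.destination_issue_id,
        cast((a.full_path + '|' + c.destination_issue_id) as varchar(50))
    from Pairs2 c
    join auxPairs a on a.destination_issue_id = c.source_issue_id
),
leaves as
(
    -- complete paths: pairs whose destination has no outgoing edge
    select full_path,
           row_number() over (order by full_path) as path_id
    from auxPairs a
    where not exists (select 1 from Pairs2 p
                      where p.source_issue_id = a.destination_issue_id)
)
select l.path_id, a.lvl, a.source_issue_id, a.destination_issue_id, a.full_path
from auxPairs a
join leaves l
    on l.full_path = a.full_path
    or l.full_path like a.full_path + '|%'
order by l.path_id, a.lvl;

The order by inside row_number() only makes the ids deterministic; the order within each path still comes from lvl, so this does not depend on the alphabetical 'fake' order.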

SAS Function that can create every possible combination

I have a dataset that looks like this.
data test;
input cat1 $ cat2 $ score;
datalines;
A D 1
A D 2
A E 3
A E 4
A F 4
B D 3
B D 2
B E 6
B E 5
B F 6
C D 8
C D 5
C E 4
C E 12
C E 2
C F 7
;
run;
I want to create tables based on this table that are summarized forms of this data. For example, I want one table that sums every score for every cat1 and cat2 together, like so:
proc sql;
    create table all as
    select 'all' as cat1,
           'all' as cat2,
           sum(score) as score
    from test
    group by 1, 2;
quit;
I want a table that sums all the scores for cat1='A', regardless of what cat2 is, like so:
proc sql;
    create table a_all as
    select cat1,
           'all' as cat2,
           sum(score) as score
    from test
    where cat1 = 'A'
    group by 1, 2;
quit;
I want a table that sums the scores for cat1='A' and cat2='E', like so:
proc sql;
    create table a_e as
    select cat1,
           cat2,
           sum(score) as score
    from test
    where cat1 = 'A'
      and cat2 = 'E'
    group by 1, 2;
quit;
And so on and so forth. I want a comprehensive set of tables covering every possible combination. I can use loops if they are efficient. The problem is that the real dataset I'm using has 8 categories (as opposed to the 2 here), and within those categories there are as many as 98 levels. So the loops I've been writing are nested 8 levels deep and take a ton of time. They're a pain to debug too.
Is there some kind of function or a special array I can apply that will create this series of tables I'm talking about? Thanks!
I think you want what PROC SUMMARY does by default: with a CLASS statement it produces the summary for every combination of the class variables in a single pass, and the _TYPE_ variable (a string of 0/1 flags when the CHARTYPE option is used) identifies which combination each output row belongs to.
data test;
input cat1 $ cat2 $ score;
datalines;
A D 1
A D 2
A E 3
A E 4
A F 4
B D 3
B D 2
B E 6
B E 5
B F 6
C D 8
C D 5
C E 4
C E 12
C E 2
C F 7
;
run;
proc print;
run;
proc summary data=test chartype;
    class cat:;
    output out=summary sum(score)=;
run;
proc print;
run;

Nested loop with increasing number of elements

I have a longitudinal dataset of 18 time periods. For reasons not to be discussed here, this dataset is in the wide shape, not in the long one. More precisely, time-varying variables have an alphabetic prefix which identifies the time it belongs to. For the sake of this question, consider a quantity of interest called pay. This variable is denoted apay in the first period, bpay in the second, and so on, until rpay.
Importantly, different observations have missing values in this variable in different periods, in an unpredictable way. As a consequence, running a panel over the full number of periods will reduce my number of observations considerably. Hence, I would like to know precisely how many observations a panel of a given length will have. To evaluate this, I want to create variables that, for each period and each number of consecutive periods, count how many respondents have nonmissing values over that time sequence. For example, I want the variable b_count_2 to count how many observations have nonmissing pay in both the first and second periods. This can be achieved with something like this:
local b_count_2 = 0
if apay != . & bpay != . {
    local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
Now, since I want to do this automatically, it has to be in a loop. Moreover, there is a different number of sequences for each period. For example, for the third period there are two sequences (those with pay in periods 2 and 3, and those with pay in periods 1, 2, and 3). Thus, the number of variables to create is 1 + 2 + 3 + ... + 17 = 153. This variability has to be reflected in the loop. I propose the code below, but there are bits that are wrong, or of which I'm unsure, as highlighted in the comments.
local list b c d e f g h i j k l m n o p q r // periods over which to iterate
foreach var of local list { // loop over periods
    local counter = 1 // counter to update; reflects sequence length
    while `counter' > 0 { // loop over sequence lengths
        gen _`var'_counter_`counter' = 0 // generate variable with counter
        if `var'pay != . { // HERE IS PROBLEM 1. NEED TO MAKE THIS CHECK CONDITIONS WITH AN INCREASING NUMBER OF ELEMENTS
            recode _`var'_counter_`counter' (0 = 1) // I'M NOT SURE THIS IS HOW TO UPDATE SPECIFIC OBSERVATIONS.
            local counter = `counter' - 1 // update counter to look for a longer sequence in the next iteration
        }
    }
    local counter = `counter' + 1 // HERE IS PROBLEM 2. NEED TO STOP THIS LOOP! Otherwise counter goes to infinity.
}
An example of the result of the above code (if correct) is the following. Consider a dataset of five observations, over four periods (denoted a, b, c, and d):
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
where 1 means the value is observed in that period, and . means it is not. The objective of the code is to create 1 + 2 + 3 = 6 new variables such that the new dataset is:
Obs  a  b  c  d  b_count_2  c_count_2  c_count_3  d_count_2  d_count_3  d_count_4
1    1  1  .  1  1          0          0          0          0          0
2    1  1  .  .  1          0          0          0          0          0
3    .  .  1  1  0          0          0          1          0          0
4    .  1  1  .  0          1          0          0          0          0
5    1  1  1  1  1          1          1          1          1          1
Now, why is this helpful? Well, because now I can run a set of summarize commands to get a very nice description of the dataset. The code to print this information in one go would be something like this:
local list a b c d e f g h i j k l m n o p q r // periods over which to iterate
foreach var of local list { // loop over periods
    local list `var'_counter_* // group of sequence variables for each period
    foreach var2 of local list { // loop over each element of the list
        quietly sum `var'_counter_`var2' if `var'_counter_`var2' == 1 // count individuals with a sequence of length var2 in period var
        di as text "Wave `var' has a sequence of length `var2' with " as result r(N) as text " observations." // print result
    }
}
For the above example, this produces the following output:
"Wave 'b' has a sequence of length 2 with 3 observations."
"Wave 'c' has a sequence of length 2 with 2 observations."
"Wave 'c' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 2 with 2 observations."
"Wave 'd' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 4 with 1 observations."
This gives me a nice summary of the trade-offs I'm having between a wider panel and a longer panel.
If you insist on doing this with data in wide form, it is very inefficient to create extra variables just to count patterns of missing values. You can create a single string variable that contains the pattern for each observation. Then, it's just a matter of extracting from this pattern variable what you are looking for (i.e. patterns of consecutive periods up to the current wave). You can then loop over lengths of the matching patterns and do counts. Something like:
* create some fake data
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
    gen `pre'pay = runiform() if runiform() < .8
}
* build the pattern of missing data
gen pattern = ""
foreach pre in a b c d e f g {
    qui replace pattern = pattern + cond(mi(`pre'pay), " ", "`pre'")
}
list
qui foreach pre in b c d e f g {
    noi dis "{hline 80}" _n as res "Wave `pre'"
    // the longest substring without a space up to the wave
    gen temp = regexs(1) if regexm(pattern, "([^ ]+`pre')")
    noi tab temp
    // loop over the various substring lengths, from 2 to max length
    gen len = length(temp)
    sum len, meanonly
    local n = r(max)
    forvalues i = 2/`n' {
        count if length(temp) >= `i'
        noi dis as txt "length = " as res `i' as txt " obs = " as res r(N)
    }
    drop temp len
}
If you are open to working in long form, then here is how you would identify spells with contiguous data and how to loop to get the info you want (the data setup is exactly the same as above):
* create some fake data in wide form
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
    gen `pre'pay = runiform() if runiform() < .8
}
* reshape to long form
gen id = _n
reshape long @pay, i(id) j(wave) string
* identify spells of contiguous periods
egen wavegroup = group(wave), label
tsset id wavegroup
tsspell, cond(pay < .)
drop if mi(pay)
foreach pre in b c d e f g {
    dis "{hline 80}" _n as res "Wave `pre'"
    sum _seq if wave == "`pre'", meanonly
    local n = r(max)
    forvalues i = 2/`n' {
        qui count if _seq >= `i' & wave == "`pre'"
        dis as txt "length = " as res `i' as txt " obs = " as res r(N)
    }
}
I echo @Dimitriy V. Masterov in genuine puzzlement that you are using this dataset shape. It can be convenient for some purposes, but for panel or longitudinal data such as you have, working with it in Stata is at best awkward and at worst impracticable.
First, note specifically that
local b_count_2 = 0
if apay != . & bpay != . {
    local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
will only ever be evaluated in terms of the first observation, i.e. as if you had coded
if apay[1] != . & bpay[1] != .
This is documented here. Even if it is what you want, it is not usually a pattern for others to follow.
Second, and more generally, I haven't tried to understand all of the details of your code, as what I see is the creation of a vast number of variables even for tiny datasets as in your sketch. For a series T periods long, you would create a triangular number [(T - 1)T]/2 of new variables; in your example (17 x 18)/2 = 153. If someone had series 100 periods long, they would need 4950 new variables.
Note that because of the first point just made, these new variables would pertain with your strategy only to individual variables like pay and individual panels. Presumably that limitation to individual panels could be fixed, but the main idea seems singularly ill-advised in many ways. In a nutshell, what strategy do you have to work with these hundreds or thousands of new variables except writing yet more nested loops?
Your main need seems to be to identify spells of non-missing and missing values. There is easy machinery for this long since developed. General principles are discussed in this paper and an implementation is downloadable from SSC as tsspell.
On Statalist, people are asked to provide workable examples with data as well as code. See this FAQ. That's entirely equivalent to the long-standing requests here for an MCVE.
Despite all that advice, I would start by looking at the Stata command xtdescribe and associated xt tools already available to you. These tools do require a long data shape, which reshape will provide for you.
Let me add another answer based on the example now added to the question.
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
The aim of this answer is not to provide what the OP asks but to indicate how many simple tools are available to look at patterns of non-missing and missing values, none of which entail the creation of large numbers of extra variables or writing intricate code based on nested loops for every new question. Most of those tools require a reshape long.
. clear
. input a b c d
a b c d
1. 1 1 . 1
2. 1 1 . .
3. . . 1 1
4. . 1 1 .
5. 1 1 1 1
6. end
. rename (a b c d) (y1 y2 y3 y4)
. gen id = _n
. reshape long y, i(id) j(time)
(note: j = 1 2 3 4)
Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        5   ->        20
Number of variables                   5   ->         3
j variable (4 values)                     ->   time
xij variables:
                           y1 y2 ... y4   ->   y
-----------------------------------------------------------------------------
. xtset id time
panel variable: id (strongly balanced)
time variable: time, 1 to 4
delta: 1 unit
. preserve
. drop if missing(y)
(7 observations deleted)
. xtdescribe
      id:  1, 2, ..., 5                                n =         5
    time:  1, 2, ..., 4                                T =         4
           Delta(time) = 1 unit
           Span(time)  = 4 periods
           (id*time uniquely identifies each observation)

Distribution of T_i:   min     5%    25%    50%    75%    95%    max
                         2      2      2      2      3      4      4

     Freq.  Percent    Cum. |  Pattern
    ------------------------+---------
         1    20.00   20.00 |  ..11
         1    20.00   40.00 |  .11.
         1    20.00   60.00 |  11..
         1    20.00   80.00 |  11.1
         1    20.00  100.00 |  1111
    ------------------------+---------
         5   100.00         |  XXXX
* ssc inst xtpatternvar
. xtpatternvar, gen(pattern)
* ssc inst groups
. groups pattern
+--------------------------------------+
|  pattern   Freq.   Percent    % <=   |
|--------------------------------------|
|     ..11       2     15.38    15.38  |
|     .11.       2     15.38    30.77  |
|     11..       2     15.38    46.15  |
|     11.1       3     23.08    69.23  |
|     1111       4     30.77   100.00  |
+--------------------------------------+
. restore
. egen npresent = total(missing(y)), by(time)
. tabdisp time, c(npresent)
----------------------
time | npresent
----------+-----------
1 | 2
2 | 1
3 | 2
4 | 2
----------------------

Condense multiple rows to single row with counts based on unique values in sqlite

I am trying to condense a table which contains multiple rows per event to a smaller table which contains counts of key sub-events within each event. Events are defined based on unique combinations across columns.
As a specific example, say I have the following data involving customer visits to various stores on different dates with different items purchased:
cust  date  store    item_type
a     1     Main St  1
a     1     Main St  2
a     1     Main St  2
a     1     Main St  2
b     1     Main St  1
b     1     Main St  2
b     1     Main St  2
c     1     Main St  1
d     2     Elm St   1
d     2     Elm St   3
e     2     Main St  1
e     2     Main St  1
a     3     Main St  1
a     3     Main St  2
I would like to restructure the data to a table that contains a single line per customer visit on a given day, with appropriate counts. I am trying to understand how to use SQLite to condense this to:
Index  cust  date  store    n_items  item1  item2  item3  item4
1      a     1     Main St  4        1      3      0      0
2      b     1     Main St  3        1      2      0      0
3      c     1     Main St  1        1      0      0      0
4      d     2     Elm St   2        1      0      1      0
5      e     2     Main St  2        2      0      0      0
6      a     3     Main St  2        1      1      0      0
I can do this in Excel for this trivial example (begin with sumproduct(customer * date) as suggested here, followed by a cumulative sum on this column to generate Index, then countif and countifs to generate the desired counts).
Excel is poorly suited to doing this for thousands of rows, so I am looking for a solution using SQLite.
Sadly, my SQLite kung-fu is weak.
I think this is the closest I have found, but I am having trouble understanding exactly how to adapt it.
When I tried a more basic approach to begin by generating a unique index:
CREATE UNIQUE INDEX ui ON t(cust, date);
I get:
Error: indexed columns are not unique
I would greatly appreciate any help with where to start. Many thanks in advance!
To create one result record for each unique combination of column values, use GROUP BY.
The number of records in the group is available with COUNT.
To count specific item types, use a boolean expression like item_type=x, which returns 0 or 1, and sum this over all records in the group:
SELECT cust,
       date,
       store,
       COUNT(*) AS n_items,
       SUM(item_type = 1) AS item1,
       SUM(item_type = 2) AS item2,
       SUM(item_type = 3) AS item3,
       SUM(item_type = 4) AS item4
FROM t
GROUP BY cust,
         date,
         store
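The query above does not produce the Index column from the desired output. If your SQLite is version 3.25 or newer, window functions are available, so one hedged way to add it (assuming visits should simply be numbered by date and then by customer) is:

SELECT ROW_NUMBER() OVER (ORDER BY date, cust) AS "Index",
       cust,
       date,
       store,
       COUNT(*) AS n_items,
       SUM(item_type = 1) AS item1,
       SUM(item_type = 2) AS item2,
       SUM(item_type = 3) AS item3,
       SUM(item_type = 4) AS item4
FROM t
GROUP BY cust, date, store
ORDER BY date, cust;

Window functions are evaluated after GROUP BY, so ROW_NUMBER here numbers the grouped visit rows, not the raw purchase rows.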
