How to put information to next columns - pivot-table

I want to move information from rows into columns, similar to a pivot table in Excel. The rule would be: if several rows share the same Part (first column), append the other rows' values after the first row's values. For example, for the first row of Part 1927, after its last value ('Limitation FP' = 'No Limit'), put the values of the next row with the same Part as 'FP 2', and so on. Is there a solution for this?
Data source:
|Part |Limitation |Final Pr.('FP')|Status FP |Limitation FP |
|------|------------|---------------|----------|---------------|
|1927 |No |2927 |OK |No Limit |
|1927 |No |3927 |OK |NULL |
|1927 |No |4927 |NOT OK |One only |
Result:
|Part |Limitation |FP |Status FP |Limitation FP |FP 2|Status FP 2|Limitation FP 2|FP 3|Status FP 3|Limitation FP 3|
|------|------------|----|----------|---------------|----|-----------|---------------|----|-----------|---------------|
|1927 |No |2927|OK |No Limit |3927|OK |NULL |4927|NOT OK |One only |
I don't know how to produce this kind of pivot-table structure in SQL.
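The reshaping asked for here can be sketched outside SQL to make the logic concrete. The following is a minimal Python illustration, not a SQL answer: the function name `widen` and the choice to group on (Part, Limitation) are assumptions made for the example, and the rows are copied from the question. In SQL the analogous step is commonly done by numbering the rows within each Part and pivoting on that number, but the exact syntax depends on the database, which the question does not name.

```python
from collections import defaultdict

# Rows as (Part, Limitation, FP, Status FP, Limitation FP), copied from the question.
rows = [
    (1927, "No", 2927, "OK", "No Limit"),
    (1927, "No", 3927, "OK", None),
    (1927, "No", 4927, "NOT OK", "One only"),
]

def widen(rows):
    """Collapse rows sharing (Part, Limitation) into one row,
    appending each row's FP fields in their original order."""
    grouped = defaultdict(list)
    for part, limitation, *fp_fields in rows:
        grouped[(part, limitation)].extend(fp_fields)
    return [list(key) + values for key, values in grouped.items()]

wide = widen(rows)
print(wide)
```

Each group becomes a single list: the key columns first, then FP, Status FP and Limitation FP repeated once per source row.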

Related

Use excel to summarise data from a column by identifier

I have a spreadsheet with a column called MRN (the identifier) and the drugs administered next to them. There are duplicates of the MRN in column A that correspond to different courses of drugs. What I'm hoping to do is to summarise all the drugs administered associated with one MRN in one line, removing all duplicates. It looks something like this.
| | A | B |
| 1 | MRN | Item |
| 2 | 1 | cefoTAXime |
| 3 | 1 | ampicillin |
| 4 | 1 | cefoTAXime |
| 5 | 1 | vancomycin |
| 6 | 1 | cefTRIaxone |
| 7 | 2 | ampicillin |
| 8 | 2 | vancomycin |
| 9 | 2 | vancomycin |
I have 3 different formulas. The first is to produce a list of MRNs that are all unique. The second is to pull all drugs by MRN and list them in one line. The third is to remove duplicates from this list. They are below (in order).
{=IFERROR(INDEX($A$2:$A$2885, MATCH(0,COUNTIF(D$1:$D1, $A$2:$A$2885),0 )),"")}
{=INDEX($A$2:$B$2885,SMALL(IF($A$2:$A$2885=$D2,ROW($A$2:$A$2885)),COLUMN(D:D))-4,2)}
{=IFERROR(INDEX($E$2:$AE$2, MATCH(0,COUNTIF(D$3:$D3, $E$2:$AE$2),0 )),"")}
*I know that I can edit the second one by adding IF(ISERROR ...) to remove NA and print blanks if drug not found, but want to keep the formulas as simple as possible at this time.
My problem is that the second formula isn't pulling all the drugs by MRN. In an ideal world I would also be able to combine the second and third formulas into one, but I am not sure how to. Here is a link to a test file that shows my issue and the formulas in action.
https://1drv.ms/x/s!ApoCMYBhswHzhooXnumW2iV7yx-JaA
I appreciate that there may be a better way to do this using python/R, and if that's possible then I'm more than happy to try, but I couldn't make any headway. Thanks for your help and suggestions.
If a count of the number of courses per drug per MRN would work for you, you can do this with Power Query (aka Get & Transform in Excel 2016).
Starting with the data you provided on your worksheet, the results would look like:
M-Code
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"MRN", Int64.Type}, {"Item", type text}}),
#"Grouped Rows" = Table.Group(#"Changed Type", {"MRN"}, {{"Count", each _, type table}}),
#"Expanded Count" = Table.ExpandTableColumn(#"Grouped Rows", "Count", {"MRN", "Item"}, {"Count.MRN", "Count.Item"}),
#"Pivoted Column" = Table.Pivot(#"Expanded Count", List.Distinct(#"Expanded Count"[Count.Item]), "Count.Item", "Count.MRN", List.NonNullCount)
in
#"Pivoted Column"
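Since the question mentions being open to Python, here is a minimal sketch of the "one line per MRN, duplicates removed" summary. It uses only the sample rows from the question; the function name `drugs_by_mrn` is made up for the example, and reading the data out of the actual workbook is left out.

```python
# Sample (MRN, Item) pairs copied from the question.
records = [
    (1, "cefoTAXime"), (1, "ampicillin"), (1, "cefoTAXime"),
    (1, "vancomycin"), (1, "cefTRIaxone"),
    (2, "ampicillin"), (2, "vancomycin"), (2, "vancomycin"),
]

def drugs_by_mrn(records):
    """Map each MRN to its distinct drugs, keeping first-seen order."""
    summary = {}
    for mrn, item in records:
        drugs = summary.setdefault(mrn, [])
        if item not in drugs:  # skip repeated courses of the same drug
            drugs.append(item)
    return summary

summary = drugs_by_mrn(records)
for mrn, drugs in summary.items():
    print(mrn, ", ".join(drugs))
```

This does the work of all three formulas in one pass: unique MRNs are the dictionary keys, and each value is the de-duplicated drug list for that MRN.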

Excel Lookup IP addresses in multiple ranges

I am trying to find a formula for column A that checks an IP address in column B and reports whether it falls within a range, i.e. between the two addresses in columns C and D.
E.G.
A B C D
+---------+-------------+-------------+------------+
| valid? | address | start | end |
+---------+-------------+-------------+------------+
| yes | 10.1.1.5 | 10.1.1.0 | 10.1.1.31 |
| Yes | 10.1.3.13 | 10.1.2.16 | 10.1.2.31 |
| no | 10.1.2.7 | 10.1.1.128 | 10.1.1.223 |
| no | 10.1.1.62 | 10.1.3.0 | 10.1.3.127 |
| yes | 10.1.1.9 | 10.1.4.0 | 10.1.4.255 |
| no | 10.1.1.50 | … | … |
| yes | 10.1.1.200 | | |
+---------+-------------+-------------+------------+
This is supposed to represent an Excel table with 4 columns, a heading row and 7 data rows as an example.
I can do a lateral check with
=IF(AND((B3>C3),(B3 < D3)),"yes","no")
which only checks 1 address against the range next to it.
I need something that will check the 1 IP address against all of the ranges. i.e. rows 1 to 100.
This is checking access list rules against routes to see if I can eliminate redundant rules... but has other uses if I can get it going.
To make it extra special I can not use VBA macros to get it done.
I'm thinking some kind of index match to look it up in an array but not sure how to apply it. I don't know if it can even be done. Good luck.
Ok, so I've been tracking this problem since my initial comment, but have not taken the time to answer because just like Lana B:
I like a good puzzle, but it's not a good use of time if i have to keep guessing
+1 to Lana for her patience and effort on this question.
However, IP addressing is something I deal with regularly, so I decided to tackle this one for my own benefit. Also, no offense, but getting the MIN of the start and the MAX of the end is wrong. This will not account for gaps in the IP white-list. As I mentioned, this required 15 helper columns and my result is simply 1 or 0 corresponding to In or Out respectively. Here is a screenshot (with formulas shown below each column):
The formulas in F2:J2 are:
=NUMBERVALUE(MID(B2,1,FIND(".",B2)-1))
=NUMBERVALUE(MID(B2,FIND(".",B2)+1,FIND(".",B2,FIND(".",B2)+1)-1-FIND(".",B2)))
=NUMBERVALUE(MID(B2,FIND(".",B2,FIND(".",B2)+1)+1,FIND(".",B2,FIND(".",B2,FIND(".",B2)+1)+1)-1-FIND(".",B2,FIND(".",B2)+1)))
=NUMBERVALUE(MID(B2,FIND(".",B2,FIND(".",B2,FIND(".",B2)+1)+1)+1,LEN(B2)))
=F2*256^3+G2*256^2+H2*256+I2
Yes, I used formulas instead of "Text to Columns" to automate the process of adding more information to a "living" worksheet.
The formulas in L2:P2 are the same, but replace B2 with C2.
The formulas in R2:V2 are also the same, but replace B2 with D2.
The formula for X2 is
=SUMPRODUCT(--($P$2:$P$8<=J2)*--($V$2:$V$8>=J2))
I also copied your original "valid" set in column A, which you'll see matches my result.
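The same idea (turn each address into one number, then test it against every start/end pair) can be sketched in Python to check the logic independently of the worksheet. This is an illustration only, since the question rules out VBA and asks for formulas; the function names are made up for the example, and the addresses and ranges are copied from the question's table.

```python
def ip_to_int(address):
    """Convert a dotted-quad IPv4 string to one integer:
    a*256^3 + b*256^2 + c*256 + d, as in the helper-column formula."""
    a, b, c, d = (int(part) for part in address.split("."))
    return a * 256**3 + b * 256**2 + c * 256 + d

def in_any_range(address, ranges):
    """Return True if address lies inside at least one inclusive (start, end) range."""
    value = ip_to_int(address)
    return any(ip_to_int(start) <= value <= ip_to_int(end) for start, end in ranges)

# Start/end pairs and addresses copied from the question's example table.
ranges = [
    ("10.1.1.0", "10.1.1.31"),
    ("10.1.2.16", "10.1.2.31"),
    ("10.1.1.128", "10.1.1.223"),
    ("10.1.3.0", "10.1.3.127"),
    ("10.1.4.0", "10.1.4.255"),
]
addresses = ["10.1.1.5", "10.1.3.13", "10.1.2.7", "10.1.1.62",
             "10.1.1.9", "10.1.1.50", "10.1.1.200"]

results = {addr: in_any_range(addr, ranges) for addr in addresses}
for addr, ok in results.items():
    print(addr, "yes" if ok else "no")
```

Testing each range separately, rather than comparing against the overall minimum start and maximum end, is what keeps gaps between ranges from being counted as valid.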
You will need helper columns.
Organise your data as outlined in the picture.
Split address, start and end into columns at the dots (ribbon menu Data=>Text To Columns, with "." as the delimiter).
Above the start/end parts, calculate the MIN for each start part and the MAX for each end part (e.g. MIN(K5:K1000)).
FORMULAS:
VALIDITY formula - copy into cell D5, and drag down:
=IF(AND(B6>$I$1,B6<$O$1),"In",
IF(OR(B6<$I$1,B6>$O$1),"Out",
IF(B6=$I$1,
IF(C6<$J$1, "Out",
IF( C6>$J$1, "In",
IF( D6<$K$1, "Out",
IF( D6>$K$1, "In",
IF(E6>=$L$1, "In", "Out"))))),
IF(B6=$O$1,
IF(C6>$P$1, "Out",
IF( C6<$P$1, "In",
IF( D6>$Q$1, "Out",
IF( D6<$Q$1, "In",
IF(E6<=$R$1, "In", "Out") )))) )
)))

Loop across many datasets to get one summary table

I have about 100 datasets in Stata. I want to loop across all of them to get one summary table for the proportion of people across all datasets who are taking a drug aceinhib. I can write code which produces a table for each dataset, but what I want is a summary of all these tables in one table.
Here is an example using just 5 datasets:
forval i=1/5 {
capture use "FILEADDRESS\FILENAME`i'", clear
table aceinhib
capture save "FILEADDRESS\NEW_FILENAME`i'", replace
}
This gives me:
----------------------
aceinhib | Freq.
----------+-----------
0 | 1578935
1 | 138,961
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 5671774
1 | 421,732
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 2350391
1 | 198,875
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 884,660
1 | 51,087
----------------------
----------------------
aceinhib | Freq.
----------+-----------
0 | 1470388
1 | 130,614
----------------------
What I want is:
----------------------
aceinhib | Freq.
----------+-----------
0 | 11956148
1 | 941269
----------------------
-- namely, the combined results of the 5 tables above.
Consider this pattern:
scalar a = 0
scalar b = 0
quietly forval i = 1/1000 {
sysuse auto, clear
count if foreign
scalar a = scalar(a) + r(N)
count if !foreign
scalar b = scalar(b) + r(N)
}
gen double count = cond(_n == 1, scalar(a), cond(_n == 2, scalar(b), .))
gen which = cond(_n == 1, "Foreign", cond(_n == 2, "Domestic", ""))
list which count in 1/2
Just accumulate the counts from one file to the next. For the real problem, don't read in the same dataset repeatedly; read a different file on each pass of the loop.
Perhaps this will point you in a useful direction.
clear
tempfile working
save `working', emptyok
forval i=1/5{
quietly use "FILEADDRESS\FILENAME`i'", clear
* replace "somevariable" with the name of a variable that is never missing
collapse (count) N=somevariable, by(aceinhib)
append using `working'
quietly save `working', replace
}
use `working', clear
collapse (sum) N, by(aceinhib)
list
If all files have the same structure, you could append them into one file before your table command. The following solutions also rely on aceinhib being coded as 0/1. If the files are not too large to append, it could be as simple as:
use "FILEADDRESS\FILENAME1", clear
forvalues i = 2/100 {
append using "FILEADDRESS\FILENAME`i'"
}
table aceinhib
If the resulting data file from append is too large, and there are no weights involved, you may continue as you have and employ the replace option for table:
forvalues i = 1/100 {
use "FILENAME`i'", clear
table aceinhib, replace
rename table1 freq
save "NEW_FILENAME`i'"
}
use "NEW_FILENAME1", clear
forvalues i = 2/100 {
append using "NEW_FILENAME`i'"
}
collapse (sum) freq, by(aceinhib)
list
Note that this approach will create data files containing the individual frequency tables. A third approach relies on storing the results of tab into a matrix for each iteration of the loop, and adding them to another matrix to store the cumulative freq of 0/1 values for aceinhib in each dataset:
* initialise the running total (one row per value of aceinhib)
mat aceinhib = (0\0)
forvalues i = 1/100 {
use "FILENAME`i'", clear
tab aceinhib, matcell(aceinhib`i')
mat aceinhib = aceinhib + aceinhib`i'
}
mat list aceinhib
This is how I would approach the problem, although there may be cleaner solutions leveraging user written packages or other base Stata functionality that I haven't included here.
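All of the approaches above come down to summing the per-dataset frequencies category by category. As a cross-check of the arithmetic only (not a replacement for the Stata code), here is a small Python sketch using the five tables printed in the question; the variable names are made up for the example.

```python
from collections import Counter

# Per-dataset frequencies of aceinhib (0/1), copied from the five tables in the question.
tables = [
    {0: 1578935, 1: 138961},
    {0: 5671774, 1: 421732},
    {0: 2350391, 1: 198875},
    {0: 884660, 1: 51087},
    {0: 1470388, 1: 130614},
]

total = Counter()
for table in tables:
    total.update(table)  # Counter.update adds counts instead of replacing them

for value in sorted(total):
    print(value, total[value])
```

The totals agree with the combined table the question asks for.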

SAS - Do a loop while a column value doesn't change

I'm trying, in SAS Guide, to use a loop (via PROC LOOP) to create a new column with an ID that increments whenever the value of a specific column changes.
Just for example I'm looking for something like this:
Date | Name | Status | ID
------------------------------------------
20150101 | Tiago | Single | 1
20150102 | Tiago | Single | 1
20150103 | Tiago | Married | 2
20150104 | Tiago | Divorced | 3
20150105 | Tiago | Divorced | 3
20150106 | Tiago | Married | 4
In this case, the new column will be the ID, that will increment whenever the status changes along the records. With this I can then group by name, to see every change that occurred in time (even if they are repeated).
This question seems a little confused. If the original data is already in the order shown in the sample data, a data step like this will do.
data new;
set test;
by status notsorted;
if first.status then id + 1;
run;
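The notsorted option lets the BY statement work on data that is not sorted by status, so the original order is kept. first.status is true for the first observation in each run of consecutive identical status values. id + 1 is a sum statement: its variable is retained across observations and initialized to 0 rather than missing.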
And by the way, what is PROC LOOP?
To expand Dajun's answer to work with grouping by name.
/* Sort so that name forms groups and id will go in date order */
proc sort data = test;
by name date;
run;
data want;
set test;
/* Tell SAS we want to know when the value of name or status changes */
by name status notsorted;
/* Reset the ID for each group */
if first.name then id = 0;
/* iterate the ID as per DaJun */
if first.status then id + 1;
run;
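For readers who want to see the "increment on change" logic outside SAS, here is a minimal Python sketch. It handles a single name only and uses the status values from the question's example; the function name `run_ids` is made up for the illustration. `itertools.groupby` groups consecutive equal values, which plays the same role as BY with notsorted plus first.status.

```python
from itertools import groupby

# Status values in date order, copied from the question's example.
statuses = ["Single", "Single", "Married", "Divorced", "Divorced", "Married"]

def run_ids(values):
    """Assign 1, 2, 3, ... to successive runs of consecutive equal values."""
    ids = []
    for run_number, (_, run) in enumerate(groupby(values), start=1):
        ids.extend(run_number for _ in run)
    return ids

ids = run_ids(statuses)
print(list(zip(statuses, ids)))
```

The second "Married" gets a new ID because only consecutive repeats are grouped together, which is the behaviour the question asks for.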

Are discrete string values repeated on disk for each duplication?

I have to store a definite set of string values in a column in a large table. You're probably wondering why I don't use another look-up table and set a FK-PK relationship; well imagine there's a good reason for that.
Does Oracle use a compression mechanism for such columns? Or is there any way to make it use one?
If the answer is no, does Oracle just store the exact characters for every duplicate value? Can you provide a reference?
As with dates, Oracle does not compress data for you.
Setting up a simple environment:
create table test ( str varchar2(100) );
insert all
into test values ('aaa')
into test values ('aba')
into test values ('aab')
into test values ('abb')
into test values ('bbb')
select * from dual;
and using DUMP(), which returns the datatype, the length in bytes and the internal representation of the data, you can see what is stored using this query:
select str, dump(str)
from test
The answer is that in every case 3 bytes are stored.
+-----+-----------------------+
| STR | DUMP(STR) |
+-----+-----------------------+
| aaa | Typ=1 Len=3: 97,97,97 |
| aba | Typ=1 Len=3: 97,98,97 |
| aab | Typ=1 Len=3: 97,97,98 |
| abb | Typ=1 Len=3: 97,98,98 |
| bbb | Typ=1 Len=3: 98,98,98 |
+-----+-----------------------+
As jonearles suggests in the linked answer you can use table compression to reduce the amount of stored bytes, but there are a number of trade offs. Declare your table as follows instead:
create table test ( str varchar2(100) ) compress;
Please note all the warnings in the documentation, and jonearles' answer; there are too many to post here.
It's highly unlikely that you need to save a few bytes in this manner.
