Equivalent of Excel Pivoting in Stata

I have been working with country-level survey data in Stata that I needed to reshape. I ended up exporting the .dta to a .csv and making a pivot table in Excel, but I am curious to know how to do this in Stata, as I couldn't figure it out.
Suppose we have the following data:
country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
I would like the data to be reformatted as such:
country sum_1 sum_2
A 4 4
B 3 2
First I tried a simple reshape wide command but got the error that "values of variable response not unique within country" before realizing reshape without additional steps wouldn't work anyway.
Then I tried generating new variables conditional on the value of response and reshaping after that... the whole thing turned into a bit of a mess, so I just used Excel.
Just curious if there is a more intuitive way of doing that transformation.

If you just want a table, then just ask for one:
clear
input str1 country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
end
tabulate country response
           |       response
   country |         1          2 |     Total
-----------+----------------------+----------
         A |         4          4 |         8
         B |         3          2 |         5
-----------+----------------------+----------
     Total |         7          6 |        13
If you want the data to be changed to this, reshape is part of the answer, but you should contract first. collapse is in several ways more versatile, but your "sum" is really a count or frequency, so contract is more direct.
contract country response, freq(sum_)
reshape wide sum_, i(country) j(response)
list
     +-------------------------+
     | country   sum_1   sum_2 |
     |-------------------------|
  1. |       A       4       4 |
  2. |       B       3       2 |
     +-------------------------+
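For comparison, a collapse-based sketch that produces the same result, assuming the input data above are still in memory:
* count observations per (country, response) cell, then pivot wide
generate byte one = 1
collapse (sum) sum_ = one, by(country response)
reshape wide sum_, i(country) j(response)
Here the (sum) of a variable that is always 1 is just a frequency, which is what contract computes in one step.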
In Stata 16 and later, help frames introduces frames as a way to work with multiple datasets in the same session.
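For example, a minimal sketch (Stata 16 and later) that pivots in a copy so the original data stay loaded; the frame name pivoted is arbitrary:
* copy the current (default) frame, then pivot inside the copy
frame copy default pivoted
frame pivoted {
    contract country response, freq(sum_)
    reshape wide sum_, i(country) j(response)
    list
}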

Related

how to query different columns and stack into new table in google sheet

Hi, I have columns like so, where every row is auto-filled.
Columns B:D are from source a, columns E:G from source b, and columns H:J from source c.
sheet data

   |  A  |     B      |   C   |  D   |     E      |   F   |  G   |     H      |   I   |  J
---+-----+------------+-------+------+------------+-------+------+------------+-------+------
 1 |     | Date       | Name  | Cost | Date       | Name  | Cost | Date       | Name  | Cost
 2 |     | 2022-01-02 | Alan  | 5    | 2022-01-03 | James | 6    | 2022-01-02 | Timmy | 5
 3 |     | 2022-01-02 | Hana  | 5    | 2022-01-03 | Paul  | 6    | 2022-01-02 | Jane  | 5
into
summary sheet

   |     A      |   B   |  C   |    D
---+------------+-------+------+---------
 1 | Date       | Name  | Cost | Source
 2 | 2022-01-02 | Alan  | 5    | sourceA
 3 | 2022-01-02 | Hana  | 5    | sourceA
 4 | 2022-01-03 | James | 6    | sourceB
 5 | 2022-01-03 | Paul  | 6    | sourceB
 6 | 2022-01-02 | Timmy | 5    | sourceC
 7 | 2022-01-02 | Jane  | 5    | sourceC
How do I achieve this with a QUERY formula, stacking the ranges on top of one another?
The Source column would use IF, but how do you detect the last row to use in the IF? The number of rows for each source might be different.
This is an array: {}. Inside it, you can use a comma , to put elements next to each other, or a semicolon ; to put one element under another. E.g. having:
={1,2;3,4}
will yield:
   |  A  |  B
---+-----+-----
 1 |  1  |  2
 2 |  3  |  4
In that manner you can do:
={QUERY(B:D, "");
QUERY(E:G, "");
QUERY(H:J, "")}
(The empty string supplies the query argument that QUERY requires; it returns the range unchanged.)
Side note: if your locale is non-English, the comma , in arrays is replaced by a backslash \.
Use array notation to combine the ranges, and use either FILTER() or QUERY() to remove the empty rows (see the FILTER docs and QUERY docs).
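Putting the two answers together, here is a minimal sketch for the layout above, assuming the data really sit in B:D, E:G and H:J with headers in row 1 (Col1 refers to the Date column of each stacked block):
={QUERY({B2:D}, "where Col1 is not null", 0);
QUERY({E2:G}, "where Col1 is not null", 0);
QUERY({H2:J}, "where Col1 is not null", 0)}
To also build the Source column from the question, one hypothetical approach is to append a label column to each block before stacking, e.g. ARRAYFORMULA(IF(LEN(B2:B), "sourceA", )) alongside B2:D.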

Running Delta Issue in Google Data Studio

Data
Datapull | product | value
8/30 | X | 2
8/30 | Y | 3
8/30 | Y | 4
9/30 | X | 5
9/30 | Y | 6
Report
date range dimension: datapull
dimensions: data pull & product
metric: running delta of record count
Chart
For 8/30, the starting totals for product Y are right, but product X shows nothing even though the first row of the data has an entry for product X on 8/30.
The variances for 9/30 are wrong too.
Can someone please let me know if there is a way to do running deltas with two dimensions? This is not calculating correctly.
Using the combination of a breakdown dimension and a running delta calculation breaks the chart!
If you don't mind creating a field for each breakdown dimension value (product), this will work:
case when product="X" then 1 else 0 end
And put each of these fields into 'Metric'.
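With the sample data above, that means one such calculated field per product value; the companion field for product Y would be (the field names are yours to choose):
case when product="Y" then 1 else 0 end
Each field is then its own metric, so the running delta is computed per product without a breakdown dimension.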

Loop for Data Management

For a dataset like the one in the screenshot linked below, I want to change the value of G7_A6, which is either 1 or 2, based on B1_04. For a given ID (each ID appears twice), G7_A6 takes either only 1 or only 2.
I need to use loops in Stata because I have a very large dataset and typing individual IDs is cumbersome:
replace G7_A6="2" if B1_04=="3" | B1_04 =="4" | B1_04 == "5" | B1_04 =="6" |B1_04 =="7"
replace G7_A6="1" if B1_04=="2"
The picture is here: https://i.stack.imgur.com/Yv7RI.png
You do not need a loop for this, as each relationship is row-specific. The code you are using would solve your problem but is a bit cumbersome. A cleaner version would be (note the & in the first line; with | the condition would be true in every row; this version also assumes numeric variables, unlike the string comparisons in your own code):
replace G7_A6 = 2 if B1_04 != 1 & B1_04 != 2
replace G7_A6 = 1 if B1_04 == 1 | B1_04 == 2
This will give you the following data:
id   name         B1_04   G7_A6
1    sam          2       1
1    margaret     1       1
9    Jim          5       2
9    Cinderella   1       1
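An equivalent one-line sketch, again assuming both variables are numeric, uses cond() and inlist():
replace G7_A6 = cond(inlist(B1_04, 1, 2), 1, 2)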
You did not post a question either: you simply said you need a program to do what you already posted.

How can I use the LAG function and return the same value if the subsequent value in a row is duplicated?

I am using the LAG function to move my values one row down.
However, I need to use the same value as the previous row if the item in the SOURCE column is duplicated:
ID | SOURCE | LAG | DESIRED OUTCOME
1 | 4 | - | -
2 | 2 | 4 | 4
3 | 3 | 2 | 2
4 | 3 | 3 | 2
5 | 3 | 3 | 2
6 | 1 | 3 | 3
7 | 4 | 1 | 1
8 | 4 | 4 | 1
As you can see, for instance in the ID range 3-5, the source data doesn't change, so the desired outcome should be fed from the last row with a different value (in this case ID 2).
SQL Server's version of LAG supports an expression in the second argument to determine how many rows back to look. You can replace this with some sort of check not to look back, e.g.
select lagged = lag(data,iif(decider < 0,0,1)) over (order by id)
from (values(0,1,'dog')
,(1,2,'horse')
,(2,-1,'donkey')
,(3,2,'chicken')
,(4,23,'cow'))f(id,decider,data)
This returns the following list
null
dog
donkey
donkey
chicken
Because the decider value on the row with id of 2 was negative.
Well, first, lag may not be the tool for the job; this might be easier to solve with a recursive CTE, since SQL window functions work over sets. That said, our goal here is to come up with a way of describing what we want: we'd like a way to partition our data so that sequential islands of the same value are part of the same set.
One way we can do that is by using lag to help us discover whether the previous row was different or not.
From there, we can keep a running sum over these change events to create partitions. Once we have partitions, we can assign a row number to each element within its partition. Finally, we can use that row number to look back that many elements.
;with d as (
select * from (values
(1,4)
,(2,2)
,(3,3)
,(4,3)
,(5,3)
,(6,1)
,(7,4)
,(8,4)
)f(id,source))
select *,lag(source,rn) over (order by Id)
from (
select *,rn=row_number() over (partition by partition_id order by id)
from (
select *, partition_id = sum(change) over (order by id)
from (
select *,change = iif(lag(source) over (order by id) != source,1,0)
from d
) source_with_change
) partitioned
) row_counted
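Run against the sample data in d, this should return null, 4, 2, 2, 2, 3, 1, 1 for the lagged column, matching the DESIRED OUTCOME column in the question.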
As an aside, this is an absolutely cruel interview question I was asked to do once.

How to force Postgresql "cartesian product" behavior when unnest'ing multiple arrays in select?

Postgresql behaves strangely when unnesting multiple arrays in the select list:
select unnest('{1,2}'::int[]), unnest('{3,4}'::int[]);
 unnest | unnest
--------+--------
      1 |      3
      2 |      4
vs when arrays are of different lengths:
select unnest('{1,2}'::int[]), unnest('{3,4,5}'::int[]);
 unnest | unnest
--------+--------
      1 |      3
      2 |      4
      1 |      5
      2 |      3
      1 |      4
      2 |      5
Is there any way to force the latter behaviour without moving stuff to the from clause?
The SQL is generated by a mapping layer and it will be very much easier for me to implement the new feature I am adding if I can keep everything in the select.
https://www.postgresql.org/docs/10/static/release-10.html
Set-returning functions are now evaluated before evaluation of scalar
expressions in the SELECT list, much as though they had been placed in
a LATERAL FROM-clause item. This allows saner semantics for cases
where multiple set-returning functions are present. If they return
different numbers of rows, the shorter results are extended to match
the longest result by adding nulls. Previously the results were cycled
until they all terminated at the same time, producing a number of rows
equal to the least common multiple of the functions' periods.
(emphasis mine)
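On Postgres 10 and later, then, the different-length example above no longer cycles; a sketch of what to expect instead, with the shorter array null-padded:
select unnest('{1,2}'::int[]), unnest('{3,4,5}'::int[]);
 unnest | unnest
--------+--------
      1 |      3
      2 |      4
        |      5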
I tested with version 12.12 of Postgres and, as mentioned by the OP in a comment, having two unnest() calls works as expected when you move them to the FROM clause, like so:
SELECT a, b
FROM unnest('{1,2}'::int[]) AS a,
unnest('{3,4}'::int[]) AS b;
Then you get the expected table:
 a | b
---+---
 1 | 3
 1 | 4
 2 | 3
 2 | 4
As we can see, in this case the arrays have the same size. In my case, the arrays come from a table. You can name the table first and then the columns inside the unnest() calls, like so:
SELECT a, b
FROM my_table,
unnest(col1) AS a,
unnest(col2) AS b;
You can, of course, select other columns as required. This is a normal cartesian product.
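For completeness, a sketch of the FROM-clause form with arrays of different lengths; this yields the full cross product rather than the null-padded behavior described in the release note above:
SELECT a, b
FROM unnest('{1,2}'::int[]) AS a
CROSS JOIN unnest('{3,4,5}'::int[]) AS b;
-- 6 rows: (1,3), (1,4), (1,5), (2,3), (2,4), (2,5)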
