Running Delta Issue in Google Data Studio

Data
Datapull | product | value
8/30 | X | 2
8/30 | Y | 3
8/30 | Y | 4
9/30 | X | 5
9/30 | Y | 6
Report
date range dimension: datapull
dimensions: datapull & product
metric: running delta of record count
Chart
For 8/30, the starting totals for product Y are right, but product X shows nothing even though the first row of data has an entry for product X on 8/30.
The deltas for 9/30 are wrong too.
Can someone please let me know if there is a way to do running deltas with two dimensions? This is not calculating correctly.

Using the combination of a breakdown dimension and the Running delta calculation breaks the chart!
If you don't mind creating a field for each breakdown dimension value (product), this will work:
case when product="X" then 1 else 0 end
Then put each of these fields into 'Metric':
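For completeness, and assuming the only product values are X and Y as in the sample data, the workaround means one calculated field per product value:

```
case when product="X" then 1 else 0 end
case when product="Y" then 1 else 0 end
```

Add each field as its own metric, apply the Running delta calculation to each, and drop product from the breakdown dimension.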

Comparing rows in spark dataframe to obtain a new column

I'm a beginner in Spark and I'm dealing with a large dataset (over 1.5 million rows and 2 columns). I have to evaluate the cosine similarity of the field "features" between each pair of rows. The main problem is this iteration between the rows and finding an efficient and fast method. I will have to use this method with another dataset of 42.5 million rows, and it would be a big computational problem if I don't find the most efficient way of doing it.
| post_id | features |
| -------- | -------- |
| Bkeur23 |[cat,dog,person] |
| Ksur312kd |[wine,snow,police] |
| BkGrtTeu3 |[] |
| Fwd2kd |[person,snow,cat] |
I've created an algorithm that evaluates the cosine similarity between the i-th and j-th rows. I've tried using lists, and also creating a Spark DF/RDD for each result and merging them using the union function.
The function I've used to evaluate the cosine similarity is the following. It takes two lists as input (the lists of the i-th and j-th rows) and returns the maximum cosine similarity between any pair of elements in the lists. But this is not the problem.
import sys
from math import sqrt
import numpy as np
import tensorflow as tf

def cosineSim(lista1, lista2, embed):
    # embed = hub.KerasLayer(os.getcwd())
    eps = sys.float_info.epsilon
    if (lista1 is not None) and (lista2 is not None):
        if (len(lista1) > 0) and (len(lista2) > 0):
            risultati = {}
            for a in lista1:
                # embed the i-th row's element
                x = tf.constant([a])
                embeddings = embed(x)
                x1 = np.asarray(embeddings)[0].tolist()
                for b in lista2:
                    # embed the j-th row's element
                    x = tf.constant([b])
                    embeddings = embed(x)
                    x2 = np.asarray(embeddings)[0].tolist()
                    dot = 0
                    suma1 = 0
                    sumb1 = 0
                    for i, j in zip(x1, x2):
                        suma1 += i * i
                        sumb1 += j * j
                        dot += i * j
                    cosine_sim = dot / (sqrt(suma1) * sqrt(sumb1) + eps)
                    risultati[a + '-' + b] = cosine_sim
            # keep only the best similarity over all element pairs
            risultati = max(risultati.values())
    return risultati
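The inner loops above boil down to the standard cosine formula; a minimal pure-Python sketch of just that arithmetic (embedding lookup omitted):

```python
import sys
from math import sqrt

def cosine(v1, v2):
    # dot(v1, v2) / (|v1| * |v2|), with eps guarding division by zero
    eps = sys.float_info.epsilon
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2 + eps)
```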
The function I'm using to iterate over the rows is the following one:
from itertools import islice
from pyspark.sql.types import StructType, StructField, StringType, FloatType

def iterazione(df, numero, embed):
    a = 1
    k = 1
    emp_RDD = spark.sparkContext.emptyRDD()
    columns1 = StructType([StructField('Source', StringType(), False),
                           StructField('Destination', StringType(), False),
                           StructField('CosinSim', FloatType(), False)])
    first_df = spark.createDataFrame(data=emp_RDD, schema=columns1)
    for i in df:
        for j in islice(df, a, None):
            r = cosineSim(i[1], j[1], embed)
            if r > 0.45:
                # one-row dataframe per result, merged via union
                z = spark.createDataFrame(data=[(i[0], j[0], r)], schema=columns1)
                first_df = first_df.union(z)
            k = k + 1
            if k == numero:
                k = a + 1
                a = a + 1
    return first_df
The output I desire is something like this:
| Source | Dest | CosinSim |
| -------- | ---- | ------ |
| Bkeur23 | Ksur312kd | 0.93 |
| Bkeur23 | Fwd2kd | 0.673 |
| Ksur312kd | Fwd2kd | 0.76 |
But there is a problem in my "iterazione" function.
I'd like help finding the best way to iterate over these rows. I was also thinking about copying the column "features" as "features2" and applying my function using withColumn, but I don't know how to do that or whether it would work. I want to know if there is a way to do this directly on a Spark dataframe, avoiding the creation of other datasets and merging them later, or if you know a faster, more efficient method. Thank you!
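One possible simplification (a sketch, not tested against your data): the manual a/k counters in iterazione exist only to visit each unordered pair of rows once, which itertools.combinations does directly:

```python
from itertools import combinations

rows = [("Bkeur23", ["cat", "dog", "person"]),
        ("Ksur312kd", ["wine", "snow", "police"]),
        ("Fwd2kd", ["person", "snow", "cat"])]

# every unordered pair (i, j) with i before j, visited exactly once
pairs = [(r1[0], r2[0]) for r1, r2 in combinations(rows, 2)]
```

In Spark itself, a crossJoin of the dataframe with a copy of itself, filtered to keep each pair once, plays the same role and stays inside a single dataframe rather than unioning one-row dataframes.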

Equivalent of Excel Pivoting in Stata

I have been working with country-level survey data in Stata that I needed to reshape. I ended up exporting the .dta to a .csv and making a pivot table in Excel, but I am curious to know how to do this in Stata, as I couldn't figure it out.
Suppose we have the following data:
country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
I would like the data to be reformatted as such:
country sum_1 sum_2
A 4 4
B 3 2
First I tried a simple reshape wide command but got the error "values of variable response not unique within country", before realizing that reshape without additional steps wouldn't work anyway.
Then I tried generating new variables conditional on the value of response and using reshape after that... the whole thing turned into a mess, so I just used Excel.
Just curious if there is a more intuitive way of doing that transformation.
If you just want a table, then just ask for one:
clear
input str1 country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
end
tabulate country response
| response
country | 1 2 | Total
-----------+----------------------+----------
A | 4 4 | 8
B | 3 2 | 5
-----------+----------------------+----------
Total | 7 6 | 13
If you want the data to be changed to this, reshape is part of the answer, but you should contract first. collapse is in several ways more versatile, but your "sum" is really a count or frequency, so contract is more direct.
contract country response, freq(sum_)
reshape wide sum_, i(country) j(response)
list
+-------------------------+
| country sum_1 sum_2 |
|-------------------------|
1. | A 4 4 |
2. | B 3 2 |
+-------------------------+
In Stata 16 and up, help frames introduces frames as a way to work with multiple datasets in the same session.
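As a cross-check outside Stata, the contract-then-reshape step amounts to counting (country, response) pairs and widening; the same counts in plain Python (illustration only, using the data from the question):

```python
from collections import Counter

obs = [("A", 1), ("A", 1), ("A", 2), ("A", 2), ("A", 1),
       ("B", 1), ("B", 2), ("B", 2), ("B", 1), ("B", 1),
       ("A", 2), ("A", 2), ("A", 1)]

counts = Counter(obs)                      # contract: frequency per pair
wide = {c: (counts[(c, 1)], counts[(c, 2)])  # reshape wide: one row per country
        for c in ("A", "B")}
```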

make x in a cell equal 8 and total

I need an Excel formula that will look at a cell and, if it contains an x, treat it as an 8 and add it to the total at the bottom of the table. I have done these in the past and I am so rusty that I cannot remember how I did it.
Generally, I try to break this sort of problem into steps. In this case, that'd be:
Determine whether each cell is 'x' or not, and create a new value accordingly.
Add up the new values.
If your values are in column A (for example), in column B, fill in:
=IF(A1="x", 8, 0) (or in R1C1 mode, =IF(RC[-1]="x", 8, 0)).
Then just sum those values (e.g. =SUM(B1:B3)) for your total.
A | B
+---------+---------+
| VALUES | TEMP |
+---------+---------+
| 0 | 0 <------ '=if(A1="x", 8, 0)'
| x | 8 |
| fish | 0 |
+---------+---------+
| TOTAL | 8 <------ '=sum(B1:B3)'
+---------+---------+
If you want to be tidy, you could also hide the column with your intermediate values in.
(I should add that the way your question is worded, it almost sounds like you want to 'push' a value into the total; as far as I've ever known, you can really only 'pull' values into a total.)
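The two steps above reduce to "map, then sum"; the same logic rendered in plain Python, using the hypothetical values from the sketch:

```python
values = ["0", "x", "fish"]  # column A from the sketch above

# map: 'x' becomes 8, numbers stay, other text becomes 0
temp = [8 if v == "x" else int(v) if v.lstrip("-").isdigit() else 0
        for v in values]
total = sum(temp)
```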
Try this one for the total sum (treating each x as 8):
=SUMIF(<range you want to sum>, "<>x", <range you want to sum>) + 8 * COUNTIF(<range you want to sum>, "x")

Excel - VLOOKUP to return each result in an array, not just the first

I am currently working between two workbooks.
In Workbook A I have the following data.
A ... D E F ... N
1.| ID | Name | Desc | Prod | Country|
2.| 12345 | Apple| Fruit| 10| US|
3.| 12346 | Celery| Veg| 150| US|
4.| 12347 | Mint| Herb| 25| FR|
I have been using the following formula from AHC in Workbook B; the aim is to perform a VLOOKUP which grabs all the IDs, but only where Country = "US".
=VLOOKUP("US", CHOOSE({2,1},Workbook A.xlsx!Table1[ID], Workbook A.xlsx!Table1[Country]), 2, FALSE)
This formula works well, however, my problem comes because the formula will only ever return the first instance in the array. For example, if I include this formula in Workbook B, Col A it will look like this:
A
1.|ID of US|
2.| 12345 |
3.| 12345 |
4.| 12345 |
5.| 12345 |
6.| 12345 |
7.| 12345 |
How would I make this formula work so that it returns each ID which matches "US", not just the first occurrence of a match?
In B2 put this formula (you might need to adjust the rows of the ranges; I went up to row 100):
=IFERROR(INDEX(D$2:D$100,SMALL(IF(N$2:N$100=$A2,ROW(N$2:N$100)-ROW(N$2)+1),ROW(N1))),"")
NOTE:
Step 1) Insert the formula only in cell B2, without the {}.
Step 2) Once the formula is inserted, select the entire formula and press Ctrl + Shift + Enter so the formula gets the {}.
Step 3) Drag it down as many rows as you need to get the list.
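What the array formula computes is the k-th row index where Country matches the criterion; the same selection, sketched in Python for clarity with the data from the question:

```python
table = [("12345", "US"), ("12346", "US"), ("12347", "FR")]

# keep every ID whose country matches, not just the first
us_ids = [id_ for id_, country in table if country == "US"]
```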

Regression loop and store coefficients

I am going to (1) loop a regression over a certain criterion many times, and (2) store a certain coefficient from each regression. Here is an example:
clear
sysuse auto.dta
local x = 2000
while `x' < 5000 {
xi: regress price mpg length gear_ratio i.foreign if weight < `x'
est sto model_`x'
local x = `x' + 100
}
est dir
I just care about one predictor, say mpg here. I want to extract the coefficient of mpg from each regression into one separate file (any file format is OK; .dta would be great) to see if there is a trend as the threshold for weight increases. What I am doing now is to use estout to export the results, something like:
esttab * using test.rtf, replace se stats(r2_a N, labels(R-squared)) star(* 0.10 ** 0.05 *** 0.01) nogap onecell title(regression tables)
estout will export everything, and I then need to edit the results. This works well for regressions with few predictors, but my real dataset has more than 30 variables and the regression will loop at least 100 times (I have a variable Distance with range from 0 to 30,000: it plays the role of weight in the example). Therefore, it is really difficult for me to edit the results without making mistakes.
Is there any other efficient way to solve my problem? Since my case loops over a criterion rather than over a group variable, the statsby command does not seem to work well here.
As @Todd has already suggested, you can just choose the particular results you care about and use postfile to store them as new variables in a new dataset. Note that a forval loop is more direct than your while code, while using xi: is superseded by factor variable notation in recent versions of Stata. (I have not changed that just in case you are using some older version.) Note evaluation of saved results such as _b[_cons] on the fly and the use of parentheses () to stop negative signs being evaluated. Some code examples elsewhere store results temporarily in local macros or scalars, which is quite unnecessary.
sysuse auto.dta, clear
tempname myresults
postfile `myresults' threshold intercept gradient se using myresults.dta
quietly forval x = 2000(200)4800 {
xi: regress price mpg length gear_ratio i.foreign if weight < `x'
post `myresults' (`x') (`=_b[_cons]') (`=_b[mpg]') (`=_se[mpg]')
}
postclose `myresults'
use myresults
list
+---------------------------------------------+
| thresh~d intercept gradient se |
|---------------------------------------------|
1. | 2000 -3699.55 -296.8218 215.0348 |
2. | 2200 -4175.722 -53.19774 54.51251 |
3. | 2400 -3918.388 -58.83933 42.19707 |
4. | 2600 -6143.622 -58.20153 38.28178 |
5. | 2800 -11159.67 -49.21381 44.82019 |
|---------------------------------------------|
6. | 3000 -6636.524 -51.28141 52.96473 |
7. | 3200 -7410.392 -58.14692 60.55182 |
8. | 3400 -2193.125 -57.89508 52.78178 |
9. | 3600 -1824.281 -103.4387 56.49762 |
10. | 3800 -1192.767 -110.9302 51.6335 |
|---------------------------------------------|
11. | 4000 5649.41 -173.9975 74.51212 |
12. | 4200 5784.363 -147.4454 71.89362 |
13. | 4400 6494.47 -93.81158 80.81586 |
14. | 4600 6494.47 -93.81158 80.81586 |
15. | 4800 5373.041 -95.25342 82.60246 |
+---------------------------------------------+
statsby (a command, not a function) is just not designed for this problem at all, so it is not a question of whether it works well.
I would suggest you look at help postfile for an example of how to aggregate the results. I agree that statsby may not be the best approach. Evaluating the interaction between mpg and weight on price may help address what would seem to be a classic question of interaction.
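For readers outside Stata, the postfile pattern is simply "run one model per threshold, post one row per run"; a sketch in Python with a hypothetical fit_below stand-in (it averages a subsample rather than running regress):

```python
def fit_below(threshold, data):
    # hypothetical stand-in for `regress ... if weight < threshold`:
    # returns the mean of y over the subsample, or NaN if empty
    sub = [y for w, y in data if w < threshold]
    return sum(sub) / len(sub) if sub else float("nan")

# made-up (weight, price) observations for illustration
data = [(2100, 4000), (2500, 5200), (3300, 6100), (4100, 9800)]

# the postfile analogue: one result row appended per threshold
rows = [(t, fit_below(t, data)) for t in range(3000, 5001, 1000)]
```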
