Looping dataframes through an mmer model, saving the output in one dataframe

I badly need advice on how to write code that fits a sommer model to each of my dataframes. In addition, the variance-component output should be stored in another dataframe so that I can compute the h2.
To give you an idea here's the scenario.
I have 388 dataframes, each with 3 columns: 2 columns are random-effect variables (REVs), while the other column is the phenotype.
The code I am using is:
ans1 <- mmer(Phenotype ~ 1, random = ~ C1 + C2 + C1:C2, data = dataframe1)
summary(ans1)$varcomp
The last line gives you the variance components of the model.
I need to save the varcomp output in a dataframe, with the phenotype as the column name, in order to compute the h2 at the end of the analysis.
Thank you so much
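A minimal sketch of one way to do this, under assumptions since the data aren't shown: the 388 dataframes are collected in a named list (df_list, an illustrative name), each with columns C1, C2 and Phenotype, and the list names are the phenotype/trait labels.

library(sommer)

# Assumption: df_list <- list(Height = dataframe1, Yield = dataframe2, ...)
varcomp_list <- lapply(names(df_list), function(nm) {
  fit <- mmer(Phenotype ~ 1,
              random = ~ C1 + C2 + C1:C2,
              data   = df_list[[nm]])
  vc <- summary(fit)$varcomp   # table with a VarComp column in recent sommer versions
  data.frame(trait     = nm,
             component = rownames(vc),
             VarComp   = vc$VarComp,
             row.names = NULL)
})

# one row per variance component per trait; reshape to wide
# (one column per phenotype) if that layout is needed for the h2 step
varcomp_all <- do.call(rbind, varcomp_list)

From varcomp_all you can compute the h2 per trait once you decide which term is the genetic variance; for example, if C1 is the genotype effect, something like V_C1 / (V_C1 + V_C2 + V_C1:C2 + V_residual), depending on your design.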

Related

Google Data Studio: how to obtain a SUM related to a COUNT_DISTINCT?

I have a dataset including 3 columns:
ID transac (The unique ID of the transaction - Dimension)
Source (The source of the transaction - Dimension)
Amount € (The amount of the transaction - Stat)
[screenshot of my dataset]
To count the number of transactions (for one or more sources), I use the COUNT_DISTINCT function.
I want to sum the transaction amounts (for one or more sources), but I don't want to add up the amounts of transactions that share the same ID!
Is there a way to do this calculation with a Data Studio function?
Thanks for your answers. :-)
EDIT: I saw that this type of calculation can be done via SQL here, and I would like to do it in Data Studio (so that I don't have to pre-calculate the amounts per source).
IMO, your dataset contains ill-formed data. Each value should relate only to its own line, but this is not the case: if the total is 20, each line should describe that line's contribution to the total. With 4 sources, each line should be 5, or some other values that sum to 20.
To solve it in Data Studio, you would need something like the CALCULATE function in Power BI, but Data Studio doesn't currently support this feature.
But there are some options to consider to repair your data:
If you're sure there are always 4 sources, just create a new calculated field with the expression Amount/4 and SUM it. It is not an elegant solution, but it works.
If your data source is Google Sheets, you can easily repair the data using formulas, like in this example:
Link to spreadsheet
For this spreadsheet, I used this formula in the adjusted_amount column: =C2/COUNTIF(A:A,A2). It divides each amount by the number of rows sharing the same transaction ID. With this column in Data Studio, just use the usual SUM aggregation function to summarize it correctly.
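For illustration (values assumed), a 20 € transaction T1 that appears on 4 source rows would come out as:

ID transac  Source  Amount €  adjusted_amount
T1          src1    20        5
T1          src2    20        5
T1          src3    20        5
T1          src4    20        5

SUM(adjusted_amount) is 20, the true amount of the transaction, counted once.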

Sum of variables in SPSS Statistics 25 for multiple external files

I have 50 external Excel files. For each of these files, say #I, I import the data as follows in SPSS Statistics 25 syntax:
GET DATA /TYPE=XLSX
/FILE='file#I.xlsx'
/SHEET=name 'Sheet2'
/CELLRANGE=full
/READNAMES=on
/ASSUMEDSTRWIDTH=32767.
EXECUTE.
DATASET NAME DataSet1 WINDOW=FRONT.
Then I rank the variables included in file #I (WA and CI) and select at most one single case, as follows:
RANK VARIABLES= WA CI (D)
/RANK
/PRINT=YES
/TIES=LOW.
COUNT SVAR= RWA RCI (1).
SELECT IF( SVAR=2).
EXECUTE.
The task is the following:
I should print the sum of the values of RWA, looping over each Excel file #I. RWA can have the value 1 or be empty. If there are no selected cases (RWA is empty), the contribution to the sum should be 0. The final outcome should be the number of times RWA and RCI share the same top rank across the 50 Excel files.
How can I do this in a smart way?
Since I can't see the real data files, the following is a little in the dark, but I think it should be a viable strategy (you might as well try :)):
* first defining a macro to stack all the files together.
define stackFiles ()
GET DATA /TYPE=XLSX /FILE='file1.xlsx'
/SHEET=name 'Sheet2' /CELLRANGE=full /READNAMES=on /ASSUMEDSTRWIDTH=32767 /keep WA CI.
compute source=1.
exe.
dataset name gen.
!do !i=2 !to 50
GET DATA /TYPE=XLSX /FILE=!concat("'file", !i, ".xlsx'")
/SHEET=name 'Sheet2' /CELLRANGE=full /READNAMES=on /ASSUMEDSTRWIDTH=32767/keep WA CI.
compute source=!i.
exe.
add files /file=gen /file=*.
exe.
!doend.
!enddefine.
* now run the macro.
stackFiles .
* now for the rest of the analysis.
* first split the data by source file, then rank and select.
sort cases by source.
split file by source.
RANK VARIABLES= WA CI (D) /RANK /PRINT=YES /TIES=LOW.
COUNT SVAR= RWA RCI (1).
SELECT IF SVAR=2.
EXECUTE.
At this point you have up to 50 rows remaining - 0 or 1 from each original file. You can count or sum them using DESCRIPTIVES RWA.
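A minimal sketch of that final count, with the assumption (not stated above) that you want a single overall total rather than one total per file:

* turn off the per-file split so the total is computed over all files.
split file off.
* RWA is 1 on every remaining row, so its sum counts the files where RWA and RCI share the top rank.
descriptives variables=RWA /statistics=sum.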

PowerBI (DAX) - Countif Contains Column Value (thousands of rows)

Relatively new to PBI and need a hand with counting totals of delimited values in a single column. So my source column looks something like:
ID Code
1 abc1|bcd2
2 def2|abc1|ghi3
3 bcd2
I've created a new table based on the same query that takes just this column and splits it into individual rows by the pipe delimiter:
Individual Codes
abc1
bcd2
def2
ghi3
Now I'd like to plot the number of occurrences of each individual code in the original Code column. I had intended to do this using a calculated column, but I don't know if that's even the best approach. So I'd have something like:
Individual Codes   Counts
abc1               2
bcd2               2
def2               1
ghi3               1
If it's possible to relate the tables, I'm not sure how. I've tried filter approaches similar to this, but that caused crashes. The current source data has maybe 50k rows (with 8k individual codes), but these values could potentially be 10-100x larger, so I imagine it's best to avoid anything that creates filters of the source data for each row of the Individual Codes table.
Much appreciated!
Starting from the original data in the Power Query Editor, you can split your Code column into new rows using the pipe delimiter, then group the rows by Code and count the rows in each group. When you come back to Visualizations, you have your desired output.
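A minimal Power Query (M) sketch of those steps, assuming the original table is already loaded as a step called Source with columns ID and Code (names taken from the sample above; YourOriginalTable is a placeholder):

let
    // "Source" stands in for whatever step loads the original ID/Code table
    Source = YourOriginalTable,
    // turn the pipe-delimited Code text into a list, then expand to one row per code
    SplitToRows = Table.ExpandListColumn(
        Table.TransformColumns(Source, {{"Code", each Text.Split(_, "|"), type list}}),
        "Code"),
    // group by the individual code and count the rows in each group
    Counted = Table.Group(SplitToRows, {"Code"}, {{"Counts", each Table.RowCount(_), Int64.Type}})
in
    Counted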

Generating Working Hours using SQL Server Query

I have this data and I need to generate a query that will give the output below
You can do this kind of grouping of rows with 2 separate ROW_NUMBER()s: one over all the data ordered by date, and a second one ordered by code and date. To separate the groups from the data, use the difference between these 2 row numbers: when it changes, a new block of data starts. You can then use that difference in a GROUP BY and take the minimum/maximum dates for each block.
For the final layout you can use PIVOT or SUM + CASE; most likely you'll want a new ROW_NUMBER() to get the rows aligned properly. Depending on whether you can have missing or non-matching data, you'll probably need additional checks.
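Since the original screenshots aren't shown, here is a minimal sketch of that row_number-difference (gaps-and-islands) idea against a hypothetical table Shifts(EmpID, Code, EventTime); all names are assumptions:

WITH numbered AS (
    SELECT EmpID, Code, EventTime,
           -- the difference of the two sequences is constant within a block
           ROW_NUMBER() OVER (PARTITION BY EmpID ORDER BY EventTime)
         - ROW_NUMBER() OVER (PARTITION BY EmpID, Code ORDER BY EventTime) AS grp
    FROM Shifts
)
SELECT EmpID, Code,
       MIN(EventTime) AS block_start,   -- e.g. shift start
       MAX(EventTime) AS block_end      -- e.g. shift end
FROM numbered
GROUP BY EmpID, Code, grp
ORDER BY EmpID, block_start;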

SSIS Export all data from one table into multiple files

I have a table called customers which contains around 1,000,000 records. I need to transfer all the records to 8 different flat files with an incrementing number in the filename, e.g. cust01, cust02, cust03, cust04, etc.
I've been told this can be done using a for loop in SSIS. Can someone please give me a guide to help me accomplish this?
The logic behind this should be something like "count number of rows", "divide by 8", "export that amount of rows to each of the 8 files".
To me, it would be more complex to create a package that loops through, calculates the amount of data, and then queries the top N segments.
Instead, I'd just create a package with 9 total connection managers: one to your source database and then 8 identical Flat File Connection Managers, using the pattern FileName1, FileName2, etc. After defining the first FFCM, just copy, paste, and edit the actual file name.
Drag a Data Flow Task onto your Control Flow and wire it up with an OLE DB/ADO.NET/ODBC source. Use a query, not the table itself, as you'll need something to partition the data on. I'm assuming your underlying RDBMS supports the ROW_NUMBER() function. Your source query will be:
SELECT
    MT.*
,   ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 8 AS bucket
FROM
    MyTable AS MT;
That query pulls back all of your data and assigns a monotonically increasing number from 1 to the total row count, to which we then apply the modulo (remainder after dividing) operator. Modding the generated value by 8 guarantees that we only get values from 0 to 7, endpoints inclusive.
You might start to get twitchy about the different number bases (base 0, base 1) being used here, I know I am.
Connect your source to a Conditional Split. Use the bucket column to segment your data into different streams. I would propose that you map bucket 1 to File 1, bucket 2 to File 2, and so on, finally mapping bucket 0 to File 8. That way, instead of everything being a stair step off, I only have to deal with endpoint alignment.
Connect each stream to a Flat File Destination and boom goes the dynamite.
You could create a rownumber with a Script Component (don't worry, it's very easy): http://microsoft-ssis.blogspot.com/2010/01/create-row-id.html
or you could use a rownumber component like http://microsoft-ssis.blogspot.com/2012/03/custom-ssis-component-rownumber.html or http://www.sqlis.com/post/Row-Number-Transformation.aspx
For dividing the data over 8 files you could use the Balanced Data Distributor, or a Conditional Split with a modulo expression using your new rownumber column, for example:
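A sketch of the Conditional Split cases, assuming the generated column is named RowNumber (name assumed; SSIS expressions use == for equality). This mirrors the bucket mapping proposed above, with remainder 0 going to the last file:

cust01 : RowNumber % 8 == 1
cust02 : RowNumber % 8 == 2
...
cust07 : RowNumber % 8 == 7
cust08 : RowNumber % 8 == 0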
