I have a sample table with 3000 integer columns plus an id column.
There are 256k rows where only the id column is filled with numbers; the rest are NULLs.
I exported the table to Parquet format, which produced two files: 678 kB (72k rows) and 815 kB (184k rows).
The export was made with:
COPY INTO '#test/256k_rows_parquet'
FROM x4
file_format = (type=parquet)
Then I made a new table from the first one as:
CREATE TABLE x5 AS SELECT * FROM x4 LIMIT 0
I loaded the parquet files with
COPY INTO x5
(
id, A_1 [...] A3000
)
FROM
(
SELECT $1:_COL_0, $1:_COL_1 [...] $1:_COL_3000
FROM
#test/256k_rows_parquet[...]
)
The problem is that loading the file with 72k rows took 53 s, while loading the other file with 184k rows took 12 minutes. I'm using the smallest warehouse.
Link To Sheet
So I've got an array formula which I've included below. I need to adjust this so that it becomes a weighted average based on variables stored on a sheet titled Variables.
Current Formula:
=ARRAYFORMULA(QUERY(
{PROPER(ADP!A3:A),ADP!E3:E;
PROPER(ADP!J3:J),ADP!S3:S;
PROPER(ADP!Z3:Z),ADP!AG3:AG},
"select Col1, Sum(Col2)
where
Col2 is not null and
Col1 is not null
group by Col1
order by Sum(Col2)
label
Col1 'PLAYER',
Sum(Col2) 'ADP AVG'"))
Here's what I thought would work but doesn't:
=ARRAYFORMULA(QUERY(
{PROPER(ADP!A3:A),ADP!E3:E*(Variables!$F$11/Variables!$F$14);
PROPER(ADP!J3:J),ADP!S3:S*(Variables!$F$12/Variables!$F$14);
PROPER(ADP!Z3:Z),ADP!AG3:AG*(Variables!$F$13/Variables!$F$14)},
"select Col1, Sum(Col2)
where
Col2 is not null and
Col1 is not null
group by Col1
order by Sum(Col2)
label
Col1 'PLAYER',
Sum(Col2) 'ADP AVG'"))
What I'm trying to get is the value pulled in K to be multiplied by the value in Variables!F11, the value pulled in Y to be multiplied by Variables!F12, and the value pulled in AL to be multiplied by Variables!F13, with that numerator divided by the value in Variables!F14.
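In other words, the per-player calculation I'm after is:
ADP AVG = (K * Variables!F11 + Y * Variables!F12 + AL * Variables!F13) / Variables!F14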
After our extensive chat, I'm providing here the answer we came up with, just on the chance it might somehow help someone else. But the issue in your case was less about the technicalities of the formula, and more about the structuring of multiple data sources, and the associated logic to pull the data together.
Here is the main formula:
={"Adjusted
Ranking
by " & Variables!F21;
arrayformula(
if(A2:A<>"",
( if(((D2:D>0) * Source1Used),D2:D,Variables!$F$21)*Variables!$F$12
+ if(((F2:F>0) * Source2Used),F2:F,Variables!$F$21)*Variables!$F$13
+ if(((H2:H>0) * Source3Used),H2:H,Variables!$F$21)*Variables!$F$14
+ if(((J2:J>0) * Source4Used),J2:J,Variables!$F$21)*Variables!$F$15
+ if(((L2:L>0) * Source5Used),L2:L,Variables!$F$21)*Variables!$F$16
+ if(((N2:N>0) * Source6Used),N2:N,Variables!$F$21)*Variables!$F$17 )) / Variables!$F$18) }
A2:A is the list of players' names. The D2:D>0 part tests whether that player has a rating obtained from a particular data source.
Source1Used is a named range for a tickbox cell, where the user can indicate whether that data source is to be included in the calculations.
This formula creates an average value, using from one to six possible sources, user-selectable.
The formula that gave the rating value for one specific source is as follows:
={"Rating in
Source1";ArrayFormula(if(A2:A<>"",if(C2:C,vlookup(A2:A,indirect("ADP!$" & ADP!E3 & "$10:" & ADP!E5),ADP!E6-ADP!E4+1,0),0),""))}
This takes a name in column A, checks whether it is listed in a specific source's data, and if so, pulls back the rating value from that data source. INDIRECT is used because the column locations for each data source may vary; they are obtained from a fixed table, in cells ADP!E3 and E5 (for example, if ADP!E3 holds "E" and ADP!E5 holds "AG100", the lookup range resolves to ADP!$E$10:AG100). E4 and E6 hold the numeric values of those column letters.
I am using an Oracle 12c database in my project, and I have a column "Name" of type VARCHAR2(128 CHAR) NOT NULL. I have approximately 25,328,687 rows in my table.
Now I don't need the "Name" column anymore, so I want to drop it. When I calculated the total size of the data in this column (using lengthb and vsize) across all rows, it was approximately 1.07 GB.
Since the maximum size of the data in this column is specified, shouldn't every row be allocated 128 bytes for this column (ignoring Unicode for simplicity), so that the total space consumed by the column is 128 * number of rows = 3,242,071,936 bytes, or 3.24 GB?
Oracle's VARCHAR2 allocates storage dynamically (the definition says variable-length string data type).
The CHAR datatype is a fixed-length string data type.
create table x (a char(5), b varchar2(5));
insert into x values ('RAM', 'RAM');
insert into x values ('RAMA', 'RAMA');
insert into x values ('RAMAN', 'RAMAN');
SELECT * FROM x WHERE length(a) = 3; -> this will return 0 records
SELECT * FROM x WHERE length(b) = 3; -> this will return 1 record ('RAM')
SELECT length(a) len_a, length(b) len_b FROM x;
The output will be:
len_a | len_b
-------------
5 | 3
5 | 4
5 | 5
Oracle does dynamic allocation for VARCHAR2.
So a string of 4 characters takes 5 bytes: one for the length and 4 bytes for the 4 characters, assuming a single-byte character set.
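You can see this with vsize(), which reports the bytes actually stored. A quick check against the table above (still assuming a single-byte character set):
SELECT vsize(a) AS bytes_a, vsize(b) AS bytes_b FROM x;
Here bytes_a is always 5, because CHAR pads with trailing spaces, while bytes_b comes back as 3, 4, and 5.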
As the other answers say, the storage that a VARCHAR2 column uses is VARying. To get an estimate of the actual amount, you can use
1) The data dictionary
SELECT column_name, avg_col_len, last_analyzed
FROM ALL_TAB_COL_STATISTICS
WHERE owner = 'MY_SCHEMA'
AND table_name = 'MY_TABLE'
AND column_name = 'MY_COLUMN';
The result avg_col_len is the average column length. Multiply it by your number of rows, 25328687, and you get an estimate of roughly how many bytes this column uses. (If last_analyzed is NULL or very old compared to the last big data change, you'll have to refresh the optimizer stats with DBMS_STATS.GATHER_TABLE_STATS('MY_SCHEMA','MY_TABLE') first.)
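For instance, the arithmetic can be done right in the query (a sketch using the row count from the question and the same placeholder names):
SELECT avg_col_len * 25328687 AS estimated_bytes
FROM ALL_TAB_COL_STATISTICS
WHERE owner = 'MY_SCHEMA'
AND table_name = 'MY_TABLE'
AND column_name = 'MY_COLUMN';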
2) Count it yourself on a sample
SELECT sum(s), count(*), avg(s), stddev(s)
FROM (
SELECT vsize(my_column) as s
FROM my_schema.my_table SAMPLE (0.1)
);
This calculates the storage size of a 0.1 percent sample of your table; multiply the sum by 1000 to extrapolate to the full table.
3) To know for sure, I'd run a test with a subset of the data
CREATE TABLE my_test TABLESPACE my_scratch_tablespace NOLOGGING AS
SELECT * FROM my_schema.my_table SAMPLE (0.1);
-- get the size of the test table in megabytes
SELECT round(bytes/1024/1024) as mb
FROM dba_segments WHERE owner='MY_SCHEMA' AND segment_name='MY_TEST';
-- now drop the column
ALTER TABLE my_test DROP (my_column);
-- and measure again
SELECT round(bytes/1024/1024) as mb
FROM dba_segments WHERE owner='MY_SCHEMA' AND segment_name='MY_TEST';
-- move the table to release the space and check how much was freed up
ALTER TABLE my_test MOVE;
SELECT round(bytes/1024/1024) as mb
FROM dba_segments WHERE owner='MY_SCHEMA' AND segment_name='MY_TEST';
You could improve the test by using the same PCTFREE and COMPRESSION settings on your test table.
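For example (the storage settings here are illustrative; copy the actual values from your source table):
CREATE TABLE my_test TABLESPACE my_scratch_tablespace
PCTFREE 10 NOLOGGING ROW STORE COMPRESS ADVANCED
AS SELECT * FROM my_schema.my_table SAMPLE (0.1);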
I have a huge .CSV file with information about triathlon races (People, Times, Country, Overalltime, etc.), all in varchar.
The problem is that one column (Overalltime) stores both time values and varchar values.
The varchar values are DNS, DNF, and DQ, while the time values look like 09:09:30, for example.
When I am creating the table, I have a column like this:
overalltime
-------------
09:09:30
09:10:22
DNF
DNS
But I want to split that column in the table into two columns: one with the time values and another with the varchar values.
What would be the best way to split that column?
One approach is to use a case expression to conditionally break your values into columns:
-- Using CASE to split rows into columns.
WITH SampleData AS
(
-- Provides sample data to play with.
SELECT
r.overalltime
FROM
(
VALUES
('09:09:30'),
('09:09:30'),
('DNF'),
('DNS')
) AS r(overalltime)
)
SELECT
CASE WHEN ISDATE(overalltime) = 1 THEN overalltime ELSE NULL END AS [Time],
CASE WHEN overalltime = 'DNS' THEN 1 ELSE 0 END AS DNS,
CASE WHEN overalltime = 'DNF' THEN 1 ELSE 0 END AS DNF
FROM
SampleData
;
Returns:
Time DNS DNF
09:09:30 0 0
09:09:30 0 0
NULL 0 1
NULL 1 0
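If you're on SQL Server 2012 or later, a variation using TRY_CONVERT gives you a properly typed time column, since TRY_CONVERT returns NULL for any value it cannot convert (a sketch reusing the SampleData CTE above):
SELECT
    TRY_CONVERT(time, overalltime) AS [Time],
    CASE WHEN overalltime = 'DNS' THEN 1 ELSE 0 END AS DNS,
    CASE WHEN overalltime = 'DNF' THEN 1 ELSE 0 END AS DNF
FROM
    SampleData
;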
I'm no pro, but the best way to do it would be to filter the data before importing it into the database. Since you're working from a .csv file, it won't be a problem to split the values into a time column and a second varchar column, and then upload the file once you already have two separate columns.
I have the following matrix in Calc: the first row (1) contains employee numbers, and the first column (A) contains product codes.
Wherever there is an X, that product was sold by the corresponding employee above.
| 0302 | 0303 | 0304 | 0402 |
1625 | X | | X | X |
1643 | | X | X | |
...
We see that product 1643 was sold by employees 0303 and 0304.
What I would like to see is a list of what product was sold by which employees but formatted like this:
1625 | 0302, 0304, 0402 |
1643 | 0303, 0304 |
The reason for this is that we ultimately need this matrix imported into an SQL Server table. We have no access to the origins of this matrix. It contains about 50 employees and 9000+ products.
Thanx for thinking with us!
Try something like this. The CROSS APPLY (VALUES ...) unpivots the employee columns into rows, and the FOR XML PATH('') trick concatenates them per product:
;with data as
(
SELECT *
FROM ( VALUES (1625,'X',NULL,'X','X'),
(1643,NULL,'X','X',NULL))
cs (col1, [0302], [0303], [0304], [0402])
),cte
AS (SELECT col1,
col
FROM data
CROSS apply (VALUES ('0302',[0302]),
('0303',[0303]),
('0304',[0304]),
('0402',[0402])) cs (col, val)
WHERE val IS NOT NULL)
SELECT col1,
LEFT(cs.col, Len(cs.col) - 1) AS col
FROM cte a
CROSS APPLY (SELECT col + ','
FROM cte B
WHERE a.col1 = b.col1
FOR XML PATH('')) cs (col)
GROUP BY col1,
LEFT(cs.col, Len(cs.col) - 1)
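For what it's worth, on SQL Server 2017 or later the FOR XML PATH concatenation can be replaced with STRING_AGG; a sketch reusing the two CTEs above in place of the final SELECT:
SELECT col1,
       STRING_AGG(col, ',') AS col
FROM cte
GROUP BY col1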
I think there are two problems to solve:
get the product codes for the X marks;
concatenate them into a single, comma-separated string.
I can't offer a solution for both issues in one step, but you may handle both issues separately.
1.
To replace the X marks by the respective product codes, you could use an array function to create a second table (matrix). To do so, create a new sheet, copy the first column / first row, and enter the following formula in cell B2:
=IF($B2:$E3="X";$B$1:$E$1;"")
You'll have to adapt the formula so it covers your complete input data (if your last data cell is Z9999, it would be =IF($B2:$Z9999="X";$B$1:$Z$1;"")). My example just covers two rows and four columns.
After modifying it, confirm with CTRL+SHIFT+ENTER to apply it as array formula.
2.
Now, you'll have to concatenate the product codes. LO Calc lacks a feature to concatenate an array, but you could use a simple user-defined function. For such a string-join function, see this answer. Just create a new macro with the StarBasic code provided there and save it. Now, you have a STRJOIN() function at hand that accepts an array and concatenates its values, leaving empty values out.
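A minimal sketch of such a function in StarBasic (my own version, assuming the cell range arrives as a 2-D array of text values; the version from the linked answer may differ in details):
Function STRJOIN(vRange)
    Dim sResult As String
    Dim i As Integer, j As Integer
    If IsArray(vRange) Then
        ' walk the rows and columns of the passed range
        For i = LBound(vRange, 1) To UBound(vRange, 1)
            For j = LBound(vRange, 2) To UBound(vRange, 2)
                If vRange(i, j) <> "" Then
                    If sResult <> "" Then sResult = sResult & ", "
                    sResult = sResult & vRange(i, j)
                End If
            Next j
        Next i
    Else
        sResult = vRange
    End If
    STRJOIN = sResult
End Function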
You could add that function using a helper column on the second sheet and apply it by dragging it down. Finally, to get rid of the cells with the single product IDs, copy the complete second sheet, paste special into a third sheet, pasting only the values. Now, you can remove all columns except the first one (employee IDs) and the last one (with the concatenated product ids).
I created a table in SQL Server for holding the data:
CREATE TABLE [dbo].[mydata](
[prod_code] [nvarchar](8) NULL,
[0100] [nvarchar](10) NULL,
[0101] [nvarchar](10) NULL,
[and so on...]
I created the list of columns in Calc by copying and pasting them transposed. After that I used the CONCATENATE function to build the column list plus datatypes for the CREATE TABLE statement.
I cleaned up the worksheet and imported it into this table using SQL Server's import wizard. Cleaning meant removing unnecessary rows/columns. Since the column names were identical, the mapping was done correctly for 99%.
Now I had the data in SQL Server.
I adapted the code MM93 suggested a bit:
;with data as
(
SELECT *
FROM dbo.mydata <-- here I simply referenced the whole table
),cte
and in the next part I used the same 'worksheet' trick to list and format all the column names, and pasted them in.
),cte
AS (SELECT prod_code, <-- had to replace col1 with 'prod_code'
col
FROM data
CROSS apply (VALUES ('0100',[0100]),
('0101', [0101] ),
(and so on... ),
The result of this query was inserted into a new table, and my colleagues and I are querying our hearts out :)
PS: removing the 'FOR XML' clause resulted in a table with two columns:
prodcode | employee
which contains all the unique combinations of prodcode + employee number, which is a lot faster and much more practical to query.
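In other words, the concatenation step can simply be dropped; a sketch of that shorter variant, based on the data CTE above (column names as in my adapted code):
SELECT prod_code,
       col AS employee
FROM data
CROSS APPLY (VALUES ('0100', [0100]),
                    ('0101', [0101])
                    -- and so on for the remaining columns
            ) cs (col, val)
WHERE val IS NOT NULL;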
I'm trying to take the results of a linear regression performed in R and store those results in a database.
Specifically, what I'm after is the data in coef(summary(myModel)). I can turn that into a dataframe and use sqlSave(), but the coefficient names are not a column in the dataframe. How do I get the coefficients and the variable names into a single dataframe that can be saved using sqlSave()?
For clarity, I'm trying to store the data in a database table that has the columns:
VariableName, Estimate, StdError, tValue, pValue
Is there an easier way to prepare this data for storing in a database? As an example, here's what coef(summary(myModel)) gives:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.52729727 2.623035966 19.64414439 1.941150e-58
factor(person)507 -0.73663931 2.627215539 -0.28038785 7.793456e-01
factor(person)713 -5.18612049 3.317899029 -1.56307363 1.189390e-01
TransCnt 0.02658798 0.005682853 4.67863266 4.132888e-06
factor(Month)5 0.67908563 1.119655304 0.60651312 5.445673e-01
factor(Month)6 2.09595623 1.169658148 1.79193915 7.400639e-02
factor(Month)7 2.91204838 1.333483558 2.18379024 2.964109e-02
# pull the coefficient matrix from the model summary
datOut <- summary(myModel)$coef
# prepend the row names as a VariableName column
# (note: cbind() on a matrix coerces every column to character;
# wrap the result in as.data.frame() before sqlSave() if that matters)
datOut <- cbind(VariableName = rownames(datOut), datOut)
rownames(datOut) <- NULL
If you want to add your own column names:
colnames(datOut) <- c("VariableName", "Estimate", "StdError", "tValue", "pValue")
datOut
The table produced by summary.lm is a matrix. You can coerce it to a dataframe with as.data.frame:
df.coef <- as.data.frame( coef(summary(myModel)) )
The column names will be coerced to names that have no spaces or quotes.
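Putting the two answers together, a minimal end-to-end sketch that matches the target table from the question (the column names are the ones asked for, not R's defaults):
df.coef <- as.data.frame(coef(summary(myModel)))
colnames(df.coef) <- c("Estimate", "StdError", "tValue", "pValue")
df.coef <- cbind(VariableName = rownames(df.coef), df.coef)
rownames(df.coef) <- NULL
# df.coef now has the columns VariableName, Estimate, StdError, tValue, pValue
# and can be written with sqlSave()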