Data import and processing - CSV - sql-server

I have a number of data sources stored as CSVs and a database table structure in 3NF on SQL Server 2008 R2. Some of the columns within the data sources are not required.
What I would like to do is insert the data from each data source into a separate temporary table where the data types (and ideally the column names) are declared based on the data being inserted. If this is not possible, importing everything as the char data type would be acceptable.
From there I would like to select the relevant data into my standard database structure and, once complete, drop the temporary tables.
I think I will have to use some CAST statements to ensure the data ends up in the right format. I am happy to write a stored procedure to do this, although the data is published periodically and the structure occasionally changes, so some manual work will still be required.
I have tried to do this with BULK INSERT, but I receive Msg 208 (invalid object name) for the temporary table:
USE db1;
GO
BULK INSERT #TempDetails FROM 'C:\Documents\Data\Text Files\Details.txt'
WITH
(
DATAFILETYPE = 'char'
, FIELDTERMINATOR = ','
, ROWTERMINATOR = '\n'
);
Returns
Msg 208, Level 16, State 82, Line 2
Invalid object name '#TempDetails'.
What is the best way for me to achieve this?
My fallback alternative is to import the CSVs into Access and then import that into SQL Server for processing, but I would like to do this in code so it is easy to replicate.
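
For what it's worth, Msg 208 here usually just means the temporary table does not exist yet: BULK INSERT never creates its target, so the temp table has to be created first, in the same session, before the load runs. A minimal sketch of the whole pattern described above (the column names, widths and the target table dbo.Details are made-up placeholders):
USE db1;
GO
-- Create the temp table first; BULK INSERT will not create it for you.
-- Columns are illustrative placeholders, all imported as char-style data.
CREATE TABLE #TempDetails
(
    DetailId    varchar(50),
    DetailName  varchar(255),
    DetailValue varchar(50)
);
BULK INSERT #TempDetails FROM 'C:\Documents\Data\Text Files\Details.txt'
WITH
(
    DATAFILETYPE = 'char'
    , FIELDTERMINATOR = ','
    , ROWTERMINATOR = '\n'
);
-- Cast the relevant columns into the normalised structure, then clean up.
INSERT INTO dbo.Details (DetailId, DetailName, DetailValue)  -- hypothetical target table
SELECT CAST(DetailId AS int), DetailName, CAST(DetailValue AS decimal(18, 2))
FROM #TempDetails;
DROP TABLE #TempDetails;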

Related

Problems importing timestamp from Parquet files

I'm exporting data into Parquet files and importing it into Snowflake. The export is done with python (using to_parquet from pandas) on a Windows Server machine.
The exported file has several timestamp columns. Here's the metadata of one of these columns (ParquetViewer):
I'm having weird issues trying to import the timestamp columns into Snowflake.
Attempt 1 (using COPY INTO):
create or replace table STAGING.DIM_EMPLOYEE(
"EmployeeID" NUMBER(38,0),
"ExitDate" TIMESTAMP_NTZ(9)
);
copy into STAGING.DIM_EMPLOYEE
from @S3
pattern='dim_Employee_.*.parquet'
file_format = (type = parquet)
match_by_column_name = case_insensitive;
select * from STAGING.DIM_EMPLOYEE;
The timestamp column is not imported correctly:
It seems that Snowflake assumes the value in the column is in seconds rather than microseconds, and therefore converts it incorrectly.
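(Side note, for illustration only: the second argument of TO_TIMESTAMP_NTZ is what pins the unit down, which is the idea behind Attempt 3 further below. The epoch value here is made up.)
-- scale 6 declares the number to be a microsecond epoch value
SELECT TO_TIMESTAMP_NTZ(1449772662000000, 6);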
Attempt 2 (using external tables):
Then I created an external table:
create or replace external table STAGING.EXT_DIM_EMPLOYEE(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" TIMESTAMP_NTZ(9) AS (CAST(GET($1, 'ExitDate') AS TIMESTAMP_NTZ(9)))
)
location=@S3
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
SELECT * FROM STAGING.EXT_DIM_EMPLOYEE;
The data is still incorrect - still the same issue (seconds instead of microseconds):
Attempt 3 (using external tables, with a modified TO_TIMESTAMP):
I then modified the external table definition to state explicitly that microseconds are used, via TO_TIMESTAMP_NTZ with a scale parameter of 6:
create or replace external table STAGING.EXT_DIM_EMPLOYEE_V2(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" TIMESTAMP_NTZ(9) AS (TO_TIMESTAMP_NTZ(TO_NUMBER(GET($1, 'ExitDate')), 6))
)
location=@CHICOREE_D365_BI_STAGE/
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
SELECT * FROM STAGING.EXT_DIM_EMPLOYEE_V2;
Now the data is correct:
But now the "weird" issue appears:
I can load the data into a table, but the load is quite slow and I get a "Querying (repair)" message during the load. At the end the query does complete, albeit slowly:
I want to load the data from a stored procedure, using a SQL script. When executing the statement via EXECUTE IMMEDIATE, an error is returned:
DECLARE
SQL STRING;
BEGIN
SET SQL := 'INSERT INTO STAGING.DIM_EMPLOYEE ("EmployeeID", "ExitDate") SELECT "EmployeeID", "ExitDate" FROM STAGING.EXT_DIM_EMPLOYEE_V2;';
EXECUTE IMMEDIATE :SQL;
END;
I have also tried defining the timestamp column in the external table as a NUMBER, importing it, and converting it into a timestamp later. This produces the same issue (a SQL execution internal error in the SQL script).
Has anyone experienced an issue like this? It seems to me like a bug.
Basically, my goal is to generate insert/select statements dynamically and execute them (in stored procedures). I have a lot of files (with different schemas) that need to be imported, and I want to create a "universal logic" to load these Parquet files into Snowflake.
As confirmed in the Snowflake Support ticket you opened, this issue was resolved when the Snowflake Support team enabled an internal configuration for Parquet timestamp logical types.
If anyone encounters a similar issue, please submit a Snowflake Support ticket.

Migrating from SQL Server to Hive Table using flat file

I am migrating my data from SQL Server to Hive using the following steps, but there is a data issue with the resulting table. I have tried various options, including checking data types and using the CSV SerDe, but I am not able to get the data aligned properly in the respective columns. The steps I followed:
Export the SQL Server data to a flat file with the fields separated by commas.
Create an external table in Hive as given below and load the data.
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable (
r_date timestamp
, v_nbr varchar(12)
, d_account int
, d_amount decimal(19,4)
, a_account varchar(14)
)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable/mytable.txt' OVERWRITE INTO TABLE myschema.mytable;
There is an issue with the data in every combination I could try.
I also tried OpenCSVSerde, but the result was worse than with the plain text file. I also tried changing the delimiter to a semicolon, but no luck.
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ( "separatorChar" = ",") stored as textfile
location 'gs://mybucket/myschema.db/mytable/';
Can you please suggest a robust approach so that I don't have to deal with these data issues?
Note: currently I don't have the option of connecting my SQL Server table with Sqoop.
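
One observation that may or may not apply here: misaligned columns in a comma-delimited export are very often caused by commas embedded in the data itself. If the export wraps fields in double quotes, OpenCSVSerde can be told the quote and escape characters; note that it then reads every column as a string, so the typed layout needs a cast afterwards. A sketch under those assumptions (same table and bucket names as above, with a hypothetical _raw staging table):
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable_raw (
r_date string
, v_nbr string
, d_account string
, d_amount string
, a_account string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ","
, "quoteChar" = '"'
, "escapeChar" = "\\"
)
STORED AS TEXTFILE
LOCATION 'gs://mybucket/myschema.db/mytable/';
-- Cast into the typed table afterwards, for example:
-- INSERT OVERWRITE TABLE myschema.mytable
-- SELECT CAST(r_date AS timestamp), v_nbr, CAST(d_account AS int),
--        CAST(d_amount AS decimal(19,4)), a_account
-- FROM myschema.mytable_raw;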

Bulk insert CSV file from Azure blob storage to SQL managed instance

I have a CSV file on Azure Blob Storage. It has 4 columns, no headers, and one blank row at the start. I am inserting the CSV file into a SQL Managed Instance table with BULK INSERT, but the table in the database has 5 columns and the CSV file does not have the 5th column.
Therefore it is throwing this error:
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 5 (uId2)
I want to insert those 4 columns from the CSV file into the table and have the 5th column in the table set to NULL.
I am using this code:
BULK INSERT testing
FROM 'test.csv'
WITH (DATA_SOURCE = 'BULKTEST',
FIELDTERMINATOR = ',',
FIRSTROW = 0,
CODEPAGE = '65001',
ROWTERMINATOR = '0x0a'
);
I want that 5th column to be NULL in the database table, since there are only 4 columns in the CSV file.
Sorry, we can't achieve that with BULK INSERT, and none of the other ways work either, in my experience.
Azure SQL Managed Instance is also not supported as a dataset in Data Factory Data Flow. Otherwise we could use a Data Flow derived column to create a new column and map it to the Azure SQL database.
The best way is to edit your CSV file: just add the new column to your CSV files.
Hope this helps.

Is there any script to copy data from distribution database to another user created database?

Due to the limited size of the distribution database, I wish to shift or copy data from there to another database using a script. Is there any script that copies data from the distribution database (a system database) to another user-created database, using a job that runs at a specific interval, such that newly copied data is appended to the previously copied data?
I have tried to copy the data using an INSERT script that selects values from the source database and copies them into the destination database, where the source is a system database table and the destination is a user-defined database table. But it showed me this error: Msg 8101, Level 16, State 1, Line 6 - An explicit value for the identity column in table 'classicmodels.dbo.MStracer_tokens' can only be specified when a column list is used and IDENTITY_INSERT is ON.
SET IDENTITY_INSERT classicmodels.[dbo].MStracer_tokens ON
INSERT INTO classicmodels.[dbo].MStracer_tokens
SELECT *
FROM [distribution].[dbo].MStracer_tokens
SET IDENTITY_INSERT classicmodels.[dbo].MStracer_tokens OFF
I want the same data from the source to be copied to the destination. The destination only stores this copied data.
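
For what it's worth, the error message itself points at the missing piece: with IDENTITY_INSERT ON, the INSERT must name its columns explicitly rather than rely on SELECT *. A minimal sketch of that pattern, with a NOT EXISTS guard so a scheduled job only appends rows it has not copied before (tracer_id is assumed to be the identity column; col1 and col2 are placeholders for the table's remaining columns):
SET IDENTITY_INSERT classicmodels.dbo.MStracer_tokens ON;
-- A column list is required when inserting explicit identity values.
INSERT INTO classicmodels.dbo.MStracer_tokens (tracer_id, col1, col2)
SELECT src.tracer_id, src.col1, src.col2
FROM [distribution].dbo.MStracer_tokens AS src
WHERE NOT EXISTS (
    SELECT 1
    FROM classicmodels.dbo.MStracer_tokens AS dst
    WHERE dst.tracer_id = src.tracer_id
);  -- append only rows not already copied
SET IDENTITY_INSERT classicmodels.dbo.MStracer_tokens OFF;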

Bulk Insert does not insert data

I want to perform a bulk insert for data I get via a stream. The stream delivers survey data, where each row holds the information and answers of one person. I consume the stream via .NET and save the data row by row, each row ending with a vbLf (I checked in Word and could see that there is a new line after each dataset). The data are comma separated. First off I created a table with 1,000 columns, since I do not yet know how much data will come in; no dataset is longer than 500 fields so far, and even in the future it will definitely not exceed 1,000, and if it does I can extend the table. Here is the table I created:
The first two datasets look like this:
"4482359","12526","2014 Company","upload by","new upload","Anonymous","User","anonymous#company.org","","222.222.222.222","1449772662000","undefined","","951071","2015","","2","3","1","5","1","1","3","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","5","1","3","3","3","3","1","2","3","1","3","5","1","","Here ppl can type in some text.","1"
"4482360","12526","2014 Company","upload by","new upload","Anonymous","User","anonymous#company.org","","222.222.222.222","1449772662000","undefined","","951071","2015","","2","5","1","","2","2","3","4","3","1","4","4","4","4","3","3","","4","3","1","4","3","1","4","4","4","3","3","4","4","4","4","3","4","4","4","4","4","4","5","2","3","4","1","3","2","2","5","1","3","","2","","","2"
Now I want to do a Bulk Insert using this command:
USE MyDatabase
BULK INSERT insert_Table FROM 'C:\new.txt'
With (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')
The command runs through and does not throw an error, but I get the message "0 rows affected" and there is no data in the table. Does anyone have an idea what I am doing wrong here?
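
One thing worth checking, offered as a guess rather than a confirmed diagnosis: when ROWTERMINATOR = '\n' is specified, BULK INSERT expects a carriage return/line feed pair, so a file whose rows end in a bare line feed (vbLf) can come back with zero rows. Giving the terminator as a hex byte sidesteps that, and since the sample has no header row, FIRSTROW = 2 would also silently drop the first record:
USE MyDatabase;
BULK INSERT insert_Table
FROM 'C:\new.txt'
WITH
(
    FIRSTROW = 1,            -- no header row in the sample data
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0a'   -- bare line feed (vbLf)
);
The surrounding double quotes will still land in each column as literal characters with a plain comma terminator; on SQL Server 2017 or later, FORMAT = 'CSV' together with FIELDQUOTE handles quoted fields, otherwise they need to be stripped afterwards.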
