I have a Parquet table on S3 that I'm trying to load into a Snowflake table with a single variant column, and then select the fields of this variant column into a table with many columns.
This Parquet table has a field called full_latency; it is an int32. I can confirm this by inspecting any of the Parquet files, or by querying the same table in Redshift. E.g., using the pqrs tool, I see this:
...
...
full_latency: 18123
...
And I can use the same tool to show the schema of the Parquet file, and confirm full_latency to be of type int32.
When I load it into Snowflake, however, this value is converted to some sort of duration type:
select v:full_latency
from my_table
limit 1
Row V:FULL_LATENCY
1 "00:00:18"
There are many int64 columns that are not being changed to whatever format this is. My copy command is extremely simple:
copy into my_table
from @my_table/dt=20211023/
pattern='.*/.*/.*/.*';
My question: why is this happening and how can I make it not happen?
I'm trying to insert into an on-premises SQL database table called PictureBinary:
PictureBinary table
The source of the binary data is a table in another on-premises SQL database called DocumentBinary:
DocumentBinary table
I have a file with all of the IDs of the DocumentBinary rows that need copying. I feed those into a ForEach activity from a Lookup activity. Each of these files has about 180 rows (there are 50 such files, each fed into its own instance of the pipeline, running in parallel).
Lookup and ForEach Activities
So far everything is working. But then, inside the ForEach I have another Lookup activity that tries to get the binary info to pass into a script that will insert it into the other database.
Lookup Binary column
And then the Script activity would insert the binary data into the table PictureBinary (in the other database).
Script to Insert Binary data
But when I debug the pipeline, I get this error when the binary column Lookup is reached:
ErrorCode=DataTypeNotSupported,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column: coBinaryData,The data type ByteArray is not supported from the column named coBinaryData.,Source=,'
I know that the accepted way of storing the files would be to store them on the filesystem and just store the file path to the files in the database. But we are using a NOP database that stores the files in varbinary columns.
Also, if there is a better way of doing this, please let me know.
I tried to reproduce your scenario in my environment and got a similar error.
As per the Microsoft documentation, columns with the Byte Array data type are not supported in the Lookup activity, which is most likely the cause of the error.
To work around this, follow the steps below:
As you explained, you have a file that stores all the IDs of the DocumentBinary rows that need to be copied to the destination. To achieve this, you can simply use a Copy activity with a query that copies the records whose ID column in DocumentBinary equals the ID stored in the file.
First, I took a Lookup activity from which I can get the IDs of the DocumentBinary rows stored in the file.
Then I took a ForEach activity and passed the output of the Lookup activity to it.
After this, I took a Copy activity inside the ForEach activity, with the following source query:
Select * from DocumentBinary
where coDocumentBinaryId = '@{item().PictureId}'
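At runtime, ADF's string interpolation replaces @{item().PictureId} with the current item's value, so for a hypothetical ID of 123 the query sent to the source would be:
Select * from DocumentBinary
where coDocumentBinaryId = '123'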
In the source of the Copy activity, select Query under Use query and pass the above query, substituting your own names.
Now go to Mapping, click on Import Schema, then delete the unwanted columns and map the remaining columns accordingly.
Note: for this to work, the ID columns in both tables must be of the same data type, e.g. both uniqueidentifier or both int.
Sample Input in file:
Output (only the picture IDs contained in the file were copied from source to destination):
I have been searching on the internet for a solution to my problem, but I cannot seem to find any info. I have a large single text file (10 million rows), and I need to create an SSIS package to load these records into different tables based on the transaction group assigned to each record. That is, Tx_Grp1 records would go into the Tx_Grp1 table, Tx_Grp2 records into the Tx_Grp2 table, and so forth. There are 37 different transaction groups in the single delimited text file, and records were inserted into this file in the order they actually occurred (by time). Also, each transaction group has a different number of fields.
Sample data file
date|tx_grp1|field1|field2|field3
date|tx_grp2|field1|field2|field3|field4
date|tx_grp10|field1|field2
.......
Any suggestion on how to proceed would be greatly appreciated.
This task can be solved with SSIS, just with some experience. Here are the main steps and discussion:
Define a Flat File data source for your file, describing all columns. A possible problem here: fields have different data types depending on the tx_group value. If this is the case, I would declare all fields as sufficiently long strings and convert their types later in the data flow.
Create an OLE DB connection manager for the DB you will use to store the results.
Create a main data flow where you will process the file, and add a Flat File Source.
Add a Conditional Split to the output of the Flat File Source, and define there as many filters and outputs as you have transaction groups (example expressions below).
For each transaction group output, add a Data Conversion for the fields if necessary. Note: you cannot change the data type of an existing column; if you need to cast a string to an int, create a new column.
Add an OLE DB Destination for each destination table. Connect it to the proper transaction group output, and map the fields.
Basically, you are done. Test the package thoroughly on a test DB before using it on a production DB.
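To make the Conditional Split step concrete: its conditions are written in the SSIS expression language against the column that holds the group name. Assuming that column is called tx_group (substitute your real column name), the per-output conditions would look roughly like this:
Output Grp1:  tx_group == "tx_grp1"
Output Grp2:  tx_group == "tx_grp2"
...
Output Grp37: tx_group == "tx_grp37"
Rows matching each condition go to the corresponding output; anything that matches no condition falls through to the default output, which is handy for catching unexpected group values.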
I have been using ISQL (SQLAnywhere 12) to import data from CSVs into existing tables using INPUT INTO and never ran into a problem. Today I needed to import data into a table containing an auto-increment column, however, and thought I just needed to leave that column blank, so I tried it with a file containing only 1 row of data (to be safe). Turns out it imported with a 0 in the auto-increment field instead of the next integer value.
Looking at the Sybase documentation, it seems like I should be using LOAD TABLE instead, but the examples look a bit complex.
My questions are the following...
The documentation says the CSV file needs to be on the database server and not the client. I do not have access to the database server itself - can I load the file from within ISQL remotely?
How do I define the columns of the table I'm loading into? What if I am only loading data into a few columns and leaving the rest as NULL?
To confirm, this will leave existing data in the table as-is and simply add to it using whatever is in the CSV?
Many thanks in advance.
Yes. Check out the online documentation for LOAD TABLE - you can use the USING CLIENT FILE clause.
You can specify the column names in parentheses after the table name, i.e. LOAD TABLE mytable (col1, col2, col3) USING CLIENT FILE 'mylocalfile.txt'. Any columns not listed will be set to NULL if the column is nullable, or the equivalent of an empty string if it is not - this is why your autoincrement column was set to 0. You can use the DEFAULTS ON clause to get what you want.
Yes, existing data in the table is not affected.
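Putting 1 and 2 together, a minimal sketch might look like this (table, column, and file names are placeholders; adjust the delimiter and quoting options to match your CSV):
LOAD TABLE mytable (col1, col2, col3)   -- unlisted columns fall back to their defaults
USING CLIENT FILE 'mylocalfile.csv'     -- file is read from the client machine, not the server
DELIMITED BY ','
QUOTES ON
DEFAULTS ON;                            -- without this, the autoincrement column would get 0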
I have created a table having two columns with data type varbinary(max). I am saving PDF files in binary format in these columns. There is no issue while inserting the PDF files into these columns. But when I select even a single record, with only one column of type varbinary in the select list, it takes around one minute to fetch the record. The size of the inserted PDF file is 1 MB. Here is the SQL query to fetch a single record:
select binarypdffile from gm_packet where jobpacketid=1
Kindly suggest if there is a way to improve the performance with varbinary datatype.
Could you try and time the following queries:
SELECT cnt = COUNT(*) INTO #test1 FROM gm_packet WHERE jobpacketid = 1
SELECT binarypdffile INTO #test2 FROM gm_packet WHERE jobpacketid = 1
The first one tests how long it takes to find the record. If it's slow, add an index on the jobpacketid field. Assuming these values come in sequentially I wouldn't worry about performance as records get added in the future. Otherwise you might need to rebuild the index once in a while.
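For example, such an index could be created like this (the index name is my own choice):
CREATE NONCLUSTERED INDEX IX_gm_packet_jobpacketid
    ON gm_packet (jobpacketid);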
The second tests how long it takes to fetch the data from the table (and store it back into another table). Since no data goes out of 'the system' it should show 'raw database performance' without any "external" influence.
Neither should take very long. If they don't, but it still takes a long time to run your original query in SSMS and get the binary data in the grid, then I'm guessing it's either a network issue (Wi-Fi?) or SSMS simply being very bad at representing the blob in the GUI; it's been noticed before =)
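As a quick sanity check on how much data is actually being pulled (same table and column as in the question), DATALENGTH reports the blob size in bytes without SSMS having to render anything:
SELECT DATALENGTH(binarypdffile) AS pdf_size_bytes
FROM gm_packet
WHERE jobpacketid = 1;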
I have a table in my source database with 100 columns on it. I use an SSIS package to load data from the source database table into the destination database table.
The source and destination tables are the same. But sometimes address fields in the destination table are changed, or new data types are added.
So how can I find new columns or changed data types in the destination table compared with the source table?
Is there any stored procedure to find missing columns or address fields and changed data types?
I can check manually, but it's killing my time: I have 50 tables, each consisting of 100 to 200 columns.
Can someone please help me find them?
You can grab all the table names from sys.tables and all the column names from sys.columns.
You can write code to perform this operation on the destination and source databases; however, you might waste a lot of time re-inventing the wheel.
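If you do decide to roll your own, a minimal sketch along those lines (assuming both databases live on the same server; SourceDb and DestDb are placeholder names) could be:
-- Columns that exist in the source but are missing, or have a different
-- type/length/precision, in the destination
SELECT t.name AS table_name,
       c.name AS column_name,
       ty.name AS data_type,
       c.max_length, c.precision, c.scale
FROM SourceDb.sys.tables  AS t
JOIN SourceDb.sys.columns AS c  ON c.object_id = t.object_id
JOIN SourceDb.sys.types   AS ty ON ty.user_type_id = c.user_type_id
EXCEPT
SELECT t.name, c.name, ty.name, c.max_length, c.precision, c.scale
FROM DestDb.sys.tables  AS t
JOIN DestDb.sys.columns AS c  ON c.object_id = t.object_id
JOIN DestDb.sys.types   AS ty ON ty.user_type_id = c.user_type_id;
Run it again with SourceDb and DestDb swapped to also catch columns that exist only in the destination.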
http://www.red-gate.com/products/sql-development/sql-compare/
Just get a trial version of SQL Compare from the sales team at Red Gate. If you like it, just buy it. It only costs $495.
Probably a lot less money than trying to write something yourself!