I have an s3 folder that keeps on getting new files. These files could also have duplicates based on url column.
s3file1.csv - lastmodified 2022-03-01 at 10 UTC
url name
http://a/ jai
http://b/ nitu
s3file2.csv lastmodified 2022-03-01 at 12 UTC
url name
http://a/ aron
http://b/ max
I create my external table as:
create external table test
(
url VARCHAR as (nullif(value:c1,'')::VARCHAR)
refershed_on TIMESTAMP_LTZ(9) as CURRENT_TIMESTAMP()
)
with location = #test_stage
file_format = test_format
auto_refersh=true
pattern = '.*s3file[.]csv';
The issue is that I have duplicates in the table test based on url. And the refreshed_on date is also same for all the rows. How can I remove the duplicates and keep only the entry with latest last modified date unique on url?
The final table test should be having just s3file2.csv data but it has both files data
You will need to add ETL/ELT process to dedup your data. It is an external table, Snowflake will just read the files as they are. If there are duplicates, then the result will have duplicates.
If you capture the filename in your external table, you could add a view over that table that has a window function that only has the latest records. However, this will be using the external table every time you query it, so the performance not be as good. It would be better to just ingest the data and process the data as it comes in incrementally to update the records accordingly. It doesn't sound like the underlying data is being managed in a way that leads to a good external table use-case.
Related
I am new to AzureSynapse and am technically a Data Scientist whos doing a Data Engineering task. Please help!
I have some xlsx files containing raw data that I need to import into an SQL database table. The issue is that the raw data does not have a uniqueidentifer column and I need to add one before inserting the data into my SQL database.
I have been able to successfully add all the rows to the table by adding a new column on the Copy Data command and setting it to be #guid(). However, this sets the guid of every row to the same value (not unique for each row).
GUID mapping:
DB Result:
If I do not add this mapping, the pipeline throws an error stating that it cannot import a NULL Id into the column Id. Which makes sense as this column does not accept NULL values.
Is there a way to have AzureSynapse analystics read in a raw xlsx file and then import it into my DB with a unique identifier for each row? If so, how can I accomplish this?
Many many thanks for any support.
Giving dynamic content to a column in this way would generate the same value for entire column.
Instead, you can generate a new guid for each row using a for each activity.
You can retrieve the data from your source excel file using a lookup activity (my source only has name column). Give the output array of lookup activity to for each activity.
#activity('Lookup1').output.value
Inside for each, since you already have a linked service, create a script activity. In this script activity, you can create a query with dynamic content to insert values into the destination table. The following is the query I built using dynamic content.
insert into demo values ('#{guid()}','#{item().name}')
This allows you to iterate through source rows, insert each row individually while generating new guid every time
You can follow the above procedure to build a query to insert each row with unique identifier value. The following is an image where I used copy data to insert first 2 rows (same as yours) and inserted the next 2 rows using the above procedure.
NOTE: I have taken Azure SQL database for demo, but that does not affect the procedure.
Background
I'm using Azure data factory v2 to load data from on-prem databases (for example SQL Server) to Azure data lake gen2. Since I'm going to load thousands of tables, I've created a dynamic ADF pipeline that loads the data as-is in the source based on parameters for schema, table name, modified date (for identifying increments) and so on. This obviously means I can't specify any type of schema or mapping manually in ADF. This is fine since I want the data lake to hold a persistent copy of the source data in the same structure. The data is loaded into ORC files.
Based on these ORC files I want to create external tables in Snowflake with virtual columns. I have already created normal tables in Snowflake with the same column names and data types as in the source tables, which I'm going to use in a later stage. I want to use the information schema for these tables to dynamically create the DDL statement for the external tables.
The issue
Since column names are always UPPER case in Snowflake, and it's case-sensitive in many ways, Snowflake is unable to parse the ORC file with the dynamically generated DDL statement as the definition of the virtual columns no longer corresponds to the source column name casing. For example it will generate one virtual column as -> ID NUMBER AS(value:ID::NUMBER)
This will return NULL as the column is named "Id" with a lower case D in the source database, and therefore also in the ORC file in the data lake.
This feels like a major drawback with Snowflake. Is there any reasonable way around this issue? The only options I can think of is to:
1. Load the information schema from the source database to Snowflake separately and use that data to build a correct virtual column definition with correct cased column names.
2. Load the records in their entirety into some variant column in Snowflake, converted to UPPER or LOWER.
Both options add a lot of complexity or even messes up the data. Is there any straight forward way to only return the column names from an ORC file? Ultimately I would need to be able to use something like Snowflake's DESCRIBE TABLE on the file in the data lake.
Unless you set the parameter QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE you can declare your column in the casing you want:
CREATE TABLE "MyTable" ("Id" NUMBER);
If your dynamic SQL carefully uses "Id" and not just Id you will be fine.
Found an even better way to achieve this, so I'm answering my own question.
With the below query we can get the path/column names directly from the ORC file(s) in the stage with a hint of the data type from the source. This filters out colums that only contains NULL values. Will most likely create some type of data type ranking table for the final data type determination for the virtual columns we're aiming to define dynamically for the external tables.
SELECT f.path as "ColumnName"
, TYPEOF(f.value) as "DataType"
, COUNT(1) as NbrOfRecords
FROM (
SELECT $1 as "value" FROM #<db>.<schema>.<stg>/<directory>/ (FILE_FORMAT => '<fileformat>')
),
lateral flatten(value, recursive=>true) f
WHERE TYPEOF(f.value) != 'NULL_VALUE'
GROUP BY f.path, TYPEOF(f.value)
ORDER BY 1
I have created 7 tables in SQL Server database, and this tables will be historical tables, that means data will be loaded daily without replacing the old data.
I have created a view by joining these tables. And my requirement here is, when ever the data is loaded in tables, the new data (current day data) should be loaded into the views replacing the old data, and it should be done when ever the table data is loaded.
Can any one please provide me an SQL query for this job?
All you have to do is create a default column on the table (named insertionDate, for example), which default value is current date, which will be the insertion date.
I recommend this approach beacause the default column prevents schema erros.
Then create a view using this column as filter via getdate() such as ... where insertionDate = getdate()
I am trying to use change tracking to copy data incrementally from a SQL Server to an Azure SQL Database. I followed the tutorial on Microsoft Azure documentation but I ran into some problems when implementing this for a large number of tables.
In the source part of the copy activity I can use a query that gives me a change table of all the records that are updated, inserted or deleted since the last change tracking version. This table will look something like
PersonID Age Name SYS_CHANGE_OPERATION
---------------------------------------------
1 12 John U
2 15 James U
3 NULL NULL D
4 25 Jane I
with PersonID being the primary key for this table.
The problem is that the copy activity can only append the data to the Azure SQL Database so when a record gets updated it gives an error because of a duplicate primary key. I can deal with this problem by letting the copy activity use a stored procedure that merges the data into the table on the Azure SQL Database, but the problem is that I have a large number of tables.
I would like the pre-copy script to delete the deleted and updated records on the Azure SQL Database, but I can't figure out how to do this. Do I need to create separate stored procedures and corresponding table types for each table that I want to copy or is there a way for the pre-copy script to delete records based on the change tracking table?
You have to use a LookUp activity before the Copy Activity. With that LookUp activity you can query the database so that you get the deleted and updated PersonIDs, preferably all in one field, separated by comma (so its easier to use in the pre-copy script). More information here: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
Then you can do the following in your pre-copy script:
delete from TableName where PersonID in (#{activity('MyLookUp').output.firstRow.PersonIDs})
This way you will be deleting all the deleted or updated rows before inserting the new ones.
Hope this helped!
In the meanwhile the Azure Data Factory provides the meta-data driven copy task. After going through the dialogue driven setup, a metadata table is created, which has one row for each dataset to be synchronized. I solved this UPSERT problem by adding a stored procedure as well as a table type for each dataset to be synchronized. Then I added the relevant information in the metadata table for each row like this
{
"preCopyScript": null,
"tableOption": "autoCreate",
"storedProcedure": "schemaname.UPSERT_SHOP_SP",
"tableType": "schemaname.TABLE_TYPE_SHOP",
"tableTypeParameterName": "shops"
}
After that you need to adapt the sink properties of the copy task like this (stored procedure, table type, table type parameter name):
#json(item().CopySinkSettings).storedProcedure
#json(item().CopySinkSettings).tableType
#json(item().CopySinkSettings).tableTypeParameterName
If the destination table does not exist, you need to run the whole task once before adding the above variables, because auto-create of tables works only as long as no stored procedure is given in the sink properties.
I have a Person database in SQL Server with tables like address, license, relatives etc. about 20 of them. All the tables have id parameter that is unique per person. There are millions of records in these tables. I need to combine theses records of the person using their common id parameter, and convert to a json table file with some column name changes. This json file then gets pushed to kafka through a producer. If I can get the example with the kafka producer as item writer- fine, but real problem is understanding the strategy and specifics on how to utilize spring batch item reader, processor, and item writer to create the composite json file. This is my first Spring batch application so I am relatively new to this.
I am hoping for the suggestions on the implementation strategy using a composite reader or processor to use person id as the cursor, and query each table using the id for each table , convert the resulting records to json and aggregate it to a composite, relational json file with root table PersonData that feeds to kafka cluster.
Basically I have one data source, same database for the reader. I plan to use Person table to fetch id and other records unique for the person, and use id as the where clause for 19 other tables. convert each resultset from the table to json, and composite the json object at the end and write to kafka.
We had such an requirement in a project and solved it with the following approach.
In Splitflow, that run parallel, we had a step for ever table that loaded the data of the table in the file, sorted by common id (this is optional, but it is easier for testing, if you have the data in files).
Then we implemented our own "MergeReader".
This mergereader had FlatFileItemReaders for every file/table (let's call them dataReaders). All these FlatFileItemReaders were wrapped with a SingleItemPeekableItemReader.
The logic for the read method of the MergeReader is as follows:
public MyContainerPerId read() {
// you need a container to store the items, that belong together
MyContainerPerId container = new MyContainerPerId();
// peek through all "dataReaders" to find the lowest actual key
int lowestId = searchLowestKey();
for (Reader dataReader : dataReaders) {
// I assume, that more than one entry in a table can belong to
// the same person id
wihile (dataReader.peek().getId() == lowestId) {
{
container.add(dataReader.read());
}
}
// the container contains all entries from all tables
// belonging to the same person id
return container;
}
If you need restart capability, you have implement ItemStream in a way, that it keeps track of the current readposition for every dataReader.
I used the Driving Query Based ItemReaders usage pattern described here to solve this issue.
Reader: just a default implementation of JdbcCursoritemReader with sql to fetch
the unique relational id (e.g. select id from person -)
Processor: Uses this long id as the input and a dao implemented by me using
jdbcTemplate from spring fetches data through queries against each of
the table for a specific id (e.g. select * from license where id=) and map results in list format to a POJO
of Person - then convert to json object (using Jackson) and then to
string
Writer: either write the file out with json string or publish json string to a
topic in case of kafka use
We went through similar exercise migrating 100mn + rows from multiple tables as a form of JSON so that we can post it to a message bus.
The idea is create a view, de-normalize the data and read from that view using JdbcPagingItemReader.Reading from one source has less overhead.
When you de-normalize the data make sure you do not get multiple rows for master table.
Example - SQL server -
create or alter view viewName as
select master.col1 , master.col2,
(select dep1.col1,
dep1.col2
from dependent1 dep1
where dep1.col3 = master.col3 for json path
) as dep1
from master master;
The above will give you dependent table data in a json String with one row for each master table data. Once you retrieve the data you can use GSON or Jackson to convert it as POJO.
We tried to avoid JdbcCursoritemReader as it will pull all data in memory and read one by one from it. It does not support pagination.