How do I combine table.include.list and column.include.list in Debezium connector configuration if I need to snapshot part of one table and full data from another table?
Example connector configuration:
"table.include.list":"schema.table1, schema.table2",
"column.include.list":"schema.table1.col1, schema.table1.col2"
With this I get all columns of table1 in Kafka.
So far I have only managed to do it by using two different connectors.
I want to get all columns from table2 and two columns from table1 using a single connector - is that possible?
This should work:
"table.include.list":"schema\.table1,schema\.table2",
"column.include.list":"schema\.table1\.{col1|col2},schema\.table2\..*"
From the Debezium documentation for the column.include.list configuration property:
An optional, comma-separated list of regular expressions that match the fully-qualified names of columns to include in change event record values. Fully-qualified names for columns are of the form databaseName.tableName.columnName.
So you have to configure two things:
the fully qualified table names in the table.include.list property (which is also a comma-separated list of regular expressions)
the fully qualified column names in the column.include.list property
If you want to include all columns from a table, then listing its columns in the column.include.list property is not necessary.
For example:
{
...
"table.include.list": "schema\\.table1,schema\\.table2",
"column.include.list": "schema\\.table1\\.(col1|col2)"
...
}
Column exclusion is configured in the same way, but using the column.exclude.list property.
See the details here (this is for the MySQL connector; the other connectors are similar): https://debezium.io/documentation/reference/stable/connectors/mysql.html#mysql-property-table-include-list
I have a database on SQL Server on premises and need to regularly copy the data from 80 different tables to an Azure SQL Database. For each table the columns I need to select and map are different - for example, for TableA I need columns 1, 2 and 5, while for TableB I need just column 1. The tables are named the same in the source and target, but the column names are different.
I could create multiple Copy data pipelines and select the source and target data sets and map to the target table structures, but that seems like a lot of work for what is ultimately the same process repeated.
I've so far created a meta table, which lists all the tables and the column mapping information. This table holds the following data:
SourceSchema, SourceTableName, SourceColumnName, TargetSchema, TargetTableName, TargetColumnName.
For each table, data is held in this table to map the source tables to the target tables.
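For reference, a minimal sketch of such a mapping table - the table name, schema and data types here are assumptions, not part of the original setup:
CREATE TABLE dbo.ColumnMapping (
    SourceSchema      varchar(128),
    SourceTableName   varchar(128),
    SourceColumnName  varchar(128),
    TargetSchema      varchar(128),
    TargetTableName   varchar(128),
    TargetColumnName  varchar(128)
);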
I have then created a Lookup activity which selects each table from the mapping table. A ForEach loop then iterates over the result and does another lookup to get the source and target column data for the table in the current iteration.
From this information, I'm able to map the Source table and the Sink table in a Copy Data activity created within the foreach loop, but I'm not sure how I can dynamically map the columns, or dynamically select only the columns I require from each source table.
I have the "activity('LookupColumns').output" from the column lookup, but would be grateful if someone could suggest how I can use this to then map the source columns to the target columns for the copy activity. Thanks.
In your case, you can use a dynamic expression in the mapping setting of the Copy Data activity.
You need to provide an expression whose data looks like this:
{
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "Id" }, "sink": { "name": "CustomerID" } },
        { "source": { "name": "Name" }, "sink": { "name": "LastName" } },
        { "source": { "name": "LastModifiedDate" }, "sink": { "name": "ModifiedDate" } }
    ]
}
So you need to add a column named Translator to your meta table, and its value should be JSON like the above. Then use this expression to do the mapping: @item().Translator
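If it helps, adding that column could be as simple as the following - dbo.ColumnMapping is the hypothetical mapping-table name from the sketch above, and each row would hold the TabularTranslator JSON for its table:
ALTER TABLE dbo.ColumnMapping ADD Translator nvarchar(max);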
Reference: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping
Background
I'm using Azure data factory v2 to load data from on-prem databases (for example SQL Server) to Azure data lake gen2. Since I'm going to load thousands of tables, I've created a dynamic ADF pipeline that loads the data as-is in the source based on parameters for schema, table name, modified date (for identifying increments) and so on. This obviously means I can't specify any type of schema or mapping manually in ADF. This is fine since I want the data lake to hold a persistent copy of the source data in the same structure. The data is loaded into ORC files.
Based on these ORC files I want to create external tables in Snowflake with virtual columns. I have already created normal tables in Snowflake with the same column names and data types as in the source tables, which I'm going to use in a later stage. I want to use the information schema for these tables to dynamically create the DDL statement for the external tables.
The issue
Since column names are always UPPER case in Snowflake, and Snowflake is case-sensitive in many ways, it is unable to parse the ORC file with the dynamically generated DDL statement, because the definition of the virtual columns no longer matches the casing of the source column names. For example, it will generate a virtual column as ID NUMBER AS (value:ID::NUMBER).
This returns NULL, because the column is named "Id", with a lower-case d, in the source database and therefore also in the ORC file in the data lake.
This feels like a major drawback of Snowflake. Is there any reasonable way around this issue? The only options I can think of are to:
1. Load the information schema from the source database into Snowflake separately and use that data to build a virtual column definition with correctly cased column names.
2. Load the records in their entirety into a variant column in Snowflake, converted to UPPER or LOWER case.
Both options add a lot of complexity or even mess up the data. Is there any straightforward way to return only the column names from an ORC file? Ultimately I would need something like Snowflake's DESCRIBE TABLE for a file in the data lake.
Unless you set the parameter QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE, you can declare your columns in whatever casing you want by quoting them:
CREATE TABLE "MyTable" ("Id" NUMBER);
If your dynamically generated SQL carefully uses "Id" and not just Id, you will be fine.
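Applied to the external table case, a minimal sketch of a case-preserving virtual column definition - the table, stage and directory names below are illustrative, not from the original setup:
CREATE EXTERNAL TABLE my_external_table (
    -- quote the column name on both sides so the casing matches the ORC file
    "Id" NUMBER AS (value:"Id"::NUMBER)
)
LOCATION = @my_stage/my_directory/
AUTO_REFRESH = FALSE
FILE_FORMAT = (TYPE = ORC);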
Found an even better way to achieve this, so I'm answering my own question.
With the query below we can get the path/column names directly from the ORC file(s) in the stage, with a hint of the data type from the source. It filters out columns that only contain NULL values. I will most likely create some kind of data type ranking table for the final data type determination of the virtual columns we're aiming to define dynamically for the external tables.
SELECT f.path AS "ColumnName"
     , TYPEOF(f.value) AS "DataType"
     , COUNT(1) AS NbrOfRecords
FROM (
    -- read the raw ORC rows from the stage as a single VARIANT column
    SELECT $1 AS value FROM @<db>.<schema>.<stg>/<directory>/ (FILE_FORMAT => '<fileformat>')
),
LATERAL FLATTEN(INPUT => value, RECURSIVE => TRUE) f
WHERE TYPEOF(f.value) != 'NULL_VALUE'
GROUP BY f.path, TYPEOF(f.value)
ORDER BY 1
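Building on this, a rough sketch of how the discovered paths could be turned into case-preserving virtual column definitions. It is purely illustrative: it assumes the result of the query above has been saved into a hypothetical table named orc_columns, and that the TYPEOF() output maps directly to a usable Snowflake type, which is what the data type ranking mentioned above would refine:
SELECT LISTAGG(
           '"' || "ColumnName" || '" ' || "DataType" ||
           ' AS (value:"' || "ColumnName" || '"::' || "DataType" || ')'
         , ',\n') AS column_definitions
FROM orc_columns;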
We have enabled versioning of database records in order to maintain multiple versions of product configurations for our customers. To achieve this, we have created a 'Version' column in all our tables with the default entry 'core_version'. Customers can create a new copy of the same records by changing one or two column values and saving that as 'customer_version1'. So the PK of all our tables is the combination of the ID column and Version.
Something like this:
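For illustration only, an Oracle-style sketch of such a table - every column here other than ID and Version is invented:
CREATE TABLE employee (
    ID       NUMBER                                NOT NULL,
    Version  VARCHAR2(50) DEFAULT 'core_version'   NOT NULL,
    Name     VARCHAR2(100),
    Salary   NUMBER,
    CONSTRAINT pk_employee PRIMARY KEY (ID, Version)
);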
The Version column then acts as an identifier, both when performing CRUD operations via the application and when executing SQL queries directly in the database, to determine which version of the records the operation should affect.
Is there any way to achieve this in Oracle and SQL Server - a default filter for the "Version" column at the schema level that gets added as a mandatory WHERE clause to every query?
Say I want only 'core_version' records: SELECT * FROM employee; should then return only the 3 records belonging to core_version, without the Version filter appearing explicitly in the query.
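One mechanism that comes close to this in SQL Server is Row-Level Security with a filter predicate driven by a session setting; Oracle has a comparable feature in Virtual Private Database (DBMS_RLS). A minimal SQL Server sketch, with illustrative function and policy names, applied to the employee example:
CREATE FUNCTION dbo.fn_version_filter (@Version varchar(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS allowed
    WHERE @Version = CAST(SESSION_CONTEXT(N'version') AS varchar(50));
GO
CREATE SECURITY POLICY dbo.VersionFilter
    ADD FILTER PREDICATE dbo.fn_version_filter(Version) ON dbo.employee
    WITH (STATE = ON);
GO
-- each session declares which version it wants to see
EXEC sp_set_session_context @key = N'version', @value = 'core_version';
SELECT * FROM employee;   -- returns only the core_version rows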
I have a simple flow that extracts rows from an Oracle table and puts them into HBase via NiFi.
To extract the data from the database I am using the QueryDatabaseTable processor, and to put it into HBase I am using the PutHBaseRecord processor.
Usually I use the primary key of my table as the "Row Identifier Field" in PutHBaseRecord.
My problem arises when there is a composite primary key, as the Row Identifier Field property in the PutHBaseRecord processor does not accept multiple columns.
Any help with this would be really appreciated.
Thanks
Unfortunately this is not currently possible with PutHBaseRecord. It would require a code change to the processor to allow specifying multiple field names for the row id; it would then have to get those fields from each record and concatenate them together to form the row id value.
It might be better to make the property a record path expression that creates the row id. That way, if you want a single value you just put something like '/field1', and if you want a composite value you would use something like "concat('/field1', '/field2')".
My main Access table contains 95 rows. One column is a name field with a unique name in each row. Two other tables also have a name column, but the name field in each of these tables contains one or more names separated by a comma and a space. These tables are different lengths too; one has 99 rows, the other 33.
I need to link the data from these tables to a comprehensive form. To do this I think I want to make a crosstab query using the value in the main table's name field. It will need to search the name field of the other tables to see if one of the listed names matches.
Please help.
Are you looking for something like this?
SELECT * FROM mainTable, Tble99Rows, Tbl33Rows
WHERE InStr(Tble99Rows.Name, mainTable.Name) > 0 AND InStr(Tbl33Rows.Name, mainTable.Name) > 0
Note that it can be inaccurate; for example, it will also link a record named Max to one containing Maxine.
For proper table joining, follow database normalization rules - in our case the first rule: all attributes in a relation must have atomic domains, and the values in an atomic domain are indivisible units.
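Purely as an illustration (the table and column names below are invented), a design that follows that rule stores one name per row in a separate table, which turns the fuzzy InStr match into an exact join; SourceTable and SourceID record which table and row each name came from:
CREATE TABLE PersonName (
    SourceTable  TEXT(50),
    SourceID     LONG,
    PersonName   TEXT(255)
);
SELECT m.*, p.SourceTable, p.SourceID
FROM mainTable AS m
INNER JOIN PersonName AS p ON p.PersonName = m.Name;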
EDIT:
Please read more: Is storing a delimited list in a database column really that bad?