How to make a ClickHouse Materialized View work correctly on a 6-node distributed cluster

There are 6 nodes in my ClickHouse cluster. I create local tables (with the ReplicatedReplacingMergeTree engine) and distributed tables on the cluster. A materialized view on node 'cdh03' is created to join data from the local tables and write the result to the distributed table. I then create a materialized view on node 'cdh06', which selects data from the distributed table and pushes it to MySQL via a table with the MySQL engine.
However, although data is correctly written to the distributed table and can be seen on the other nodes, the materialized view on node 'cdh06' writes nothing to MySQL.
I tried creating the materialized view with the same query on other nodes, but it only works correctly on node 'cdh03'. What's wrong with this? And what can I do to make it work?
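For reference, the setup described above might look roughly like the DDL below (cluster, database, table and column names are all hypothetical). One ClickHouse detail matters here: a materialized view behaves as an insert trigger on the table named in its FROM clause, and it only fires on the node where the INSERT is executed. Rows written to the distributed table by the view on 'cdh03' are inserted there, not on 'cdh06', which would explain why only the view on 'cdh03' fires.

```sql
-- Local replicated table and a Distributed table over it (names hypothetical).
CREATE TABLE db.events_local ON CLUSTER my_cluster
(
    id  UInt64,
    val String,
    ver UInt64
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}', ver)
ORDER BY id;

CREATE TABLE db.events_all ON CLUSTER my_cluster AS db.events_local
ENGINE = Distributed(my_cluster, db, events_local);

-- On cdh06: a materialized view intended to push rows into MySQL.
-- It only triggers on INSERTs into db.events_all executed on cdh06 itself,
-- so inserts performed on cdh03 never reach it.
CREATE MATERIALIZED VIEW db.to_mysql TO db.mysql_sink AS
SELECT id, val FROM db.events_all;
```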

Related

Can we implement SymmetricDS in databases which are identical, but where tables have different PK ids for the same records?

Can I implement SymmetricDS in identical databases?
My scenario:
I have two databases:
Database A
Database B
Whatever data change happens in either one of them should be reflected in the other.
Current situation:
Even though the databases are identical, database B has fewer tables than database A.
Consider a table tableA from database A and the same table in database B: the PK ids for the same records are actually different in the two tables.
Can I expand and implement SymmetricDS if I want to expand to a third database?
Currently I am using a mapping table and an API to handle data sync.
Can I move to SymmetricDS for syncing data?
Yes, go ahead
SymmetricDS allows bidirectional synchronization of databases.
Only the tables of database B will be configured for synchronization. The extra tables from database A might be added to the mix using table transformations.
As long as there are uniqueness constraints on the columns in, for example, database A that are PKs in database B, that will not be a problem.
You can add as many types, and instances of those types, of databases as you like. Bear in mind that the graph of database relationships must satisfy the definition of a tree.
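As a hedged sketch, bidirectional synchronization in SymmetricDS is configured by linking two node groups in both directions in its configuration tables (the group ids below are hypothetical; the PK mapping between tableA's differing ids would be handled separately, e.g. via table transformation configuration):

```sql
-- Two node groups, one per database (ids are hypothetical).
INSERT INTO sym_node_group (node_group_id) VALUES ('db-a');
INSERT INTO sym_node_group (node_group_id) VALUES ('db-b');

-- Link them in both directions; 'P' means changes are pushed.
INSERT INTO sym_node_group_link (source_node_group_id, target_node_group_id, data_event_action)
VALUES ('db-a', 'db-b', 'P');
INSERT INTO sym_node_group_link (source_node_group_id, target_node_group_id, data_event_action)
VALUES ('db-b', 'db-a', 'P');
```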

Mirror table vs materialized view

From this excellent video, "Microservices Evolution: How to break your monolithic database" by Edson Yanaga, I know that there are different ways to split a chunk of data off as a separate db for a microservice:
View
Materialized View
Mirror Table using Trigger
Mirror Table using Transactional Code
Mirror Table using ETL tools
Event Sourcing
Could you please explain the difference between a mirrored table and a materialized view?
I'm confused because both of them are stored on disk...
My understanding is:
Mirrored tables
Mirrored tables are generally an exact copy of another, source table: same structure and the same data. Some database platforms allow triggers to be created on the source table which propagate updates from the source table to the mirror table. If the database platform does not provide this functionality, or if the use case dictates, you may perform the update in transactional code instead of a trigger.
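The trigger-based variant can be sketched as follows (SQL Server syntax; table and column names are hypothetical, and corresponding UPDATE/DELETE triggers would be needed for a full mirror):

```sql
-- Keep MirrorTable in step with SourceTable on every insert.
CREATE TRIGGER trg_source_to_mirror ON dbo.SourceTable
AFTER INSERT
AS
BEGIN
    -- 'inserted' holds the rows just added to the source table.
    INSERT INTO dbo.MirrorTable (Id, Payload)
    SELECT Id, Payload
    FROM inserted;
END;
```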
Materialized Views
A Materialized View contains the result of a query. With a regular database view, when the underlying table data changes, querying the view reflects those changes. However, with a materialized view the data is current only at the point in time of creation (or refresh) of the Materialized view. In simple terms, a materialized view is a snapshot of data at a point in time.
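As an illustration of the snapshot behavior (PostgreSQL syntax; table names hypothetical), the stored result only changes when the view is explicitly refreshed:

```sql
-- The query result is computed once and stored on disk.
CREATE MATERIALIZED VIEW order_totals AS
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;

-- Later inserts into 'orders' are NOT visible in order_totals
-- until the snapshot is rebuilt:
REFRESH MATERIALIZED VIEW order_totals;
```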

SSIS Cross-DB "WHERE IN" Clause (or Equivalent) in Azure

I'm currently trying to build a data flow in SSIS to select all records from a mapping table where an ID column exists in the related Item table. There are two complications:
The two tables are currently in different databases on different servers.
The databases are in Azure, for which I've read Linked Servers are not supported.
To be more clear, the job is to migrate data from the Staging environment to Production. I only want to push lookup records into prod if the associated Item IDs are already there. Here's some pseudo-T-SQL to make the goal clear:
SELECT *
FROM [Staging_Server].[SourceDB].[dbo].[Lookup] L
WHERE L.[ID] IN (
    SELECT P.[Item]
    FROM [Production_Server].[TargetDB].[dbo].[Item] P
)
I haven't found a good way to create this in SSIS. I think I've created a work-around that involves sorting both tables and performing a merge join, but sorting both sides is an unnecessary hit on performance. I'm looking for a more direct and intuitive design for this seemingly simple data flow.
Doing this in a data flow, you'd have your source query, sans filter, fed into a Lookup Component that implements the subquery.
The challenge with this is that SSIS is likely on-premises, which means you are going to pull all of your data out of the staging Azure instance to the server running SSIS and push it back to the production Azure instance.
That's a lot of network activity, but as I read the Azure pricing guide, as long as you have the appropriate DTUs you'd be fine. Back in the day, you were charged for reads and not writes, so the idiom was to just push all your data to the target server and then do the comparison there, much as ElendaDBA mentions. The only suggestion I'd make on the implementation is to avoid temporary tables or ad-hoc creation/destruction of them: implement it as a physical table and truncate and reload it prior to transmission to production.
You could create a temp table on the staging server and copy the production data into it. Then you could write a query joining the two tables. After the SSIS package runs, you could drop the temp table on the staging server.
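The staging-side approach could be sketched like this (the `ProdItemIds` table name is hypothetical; loading it from production would be a simple SSIS data flow):

```sql
-- On the staging server: a holding table for production Item ids.
CREATE TABLE dbo.ProdItemIds (ID int PRIMARY KEY);

-- (Load dbo.ProdItemIds from [Production_Server].[TargetDB].[dbo].[Item]
--  via an SSIS data flow, then filter with a plain join.)

SELECT L.*
FROM [SourceDB].[dbo].[Lookup] L
INNER JOIN dbo.ProdItemIds P
    ON P.ID = L.[ID];
```

Making `ProdItemIds` a permanent table that is truncated and reloaded each run, rather than a temp table, keeps the package simpler and avoids repeated create/drop churn.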

Is it possible to connect to an existing relational database (Postgresql) using Neo4j?

As explained in previous posts, I have a PostgreSQL relational database with about 10M rows. I would like to connect Neo4j directly to this existing database, if possible, and define the nodes as specific columns. I have already tried different solutions: first I used the batch importer with a CSV file of my database, then I created a flexible script with Groovy (again using a CSV file). These methods work, but they imply the creation of a CSV file, which isn't ideal in my case. Is there a possibility to connect to my DB directly with Neo4j? Thanks
To link a node in Neo4j to a row in your relational database, you typically store the row's primary key in a property on that node. To link the other way, from the relational db to a graph node, you create a unique identifier for the node, store it as a property, and create an index on it; then store that identifier in the relational database.
In any case, you need some client-side logic when crossing db boundaries. E.g. you do a graph traversal and return the primary keys stored in the node properties; with those, you run a select for each pk in your relational db.
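For example, if a traversal returned the primary keys 17, 42 and 99 from node properties, the client would follow up with a plain lookup on the relational side (table and column names are hypothetical):

```sql
-- Fetch the full rows for the pks collected from the graph traversal.
SELECT *
FROM person
WHERE id IN (17, 42, 99);
```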
Martin Fowler's NoSQL Distilled has a chapter on polyglot persistence.
In Neo4j you can write an unmanaged extension, which might act as an integration point.

SSIS Moving Content From Views from one database to Another

One question though: let's say the publisher database has 100 tables and I use transactional replication to move the data from those 100 tables to the subscriber database. That would be fine.
But let's say I don't want the 100 tables; instead I want to create 3-4 views which contain the key information I want from those 100 tables. How would I achieve this?
1) Firstly, I guess the views need to be created on the publisher database.
2) Secondly, do I then need to create 3/4 tables in the subscriber database which have the same columns as the views from the publisher database?
3) What sort of replication, or maybe even SSIS or something else, would move the data from the publisher views to the subscriber database?
Replication probably wouldn't be as viable or as performant an option as creating an SSIS package for transferring data from those views into the small set of tables in the remote database. SSIS's strongest feature is its ability to transfer large volumes of data quickly from a source into a destination. With a little upkeep, you could potentially transfer just the differences between the two databases at some scheduled interval and have a fairly flexible solution.
SSIS will be the better solution. You would create the tables in your target database. Then, you can create the SSIS package(s) to populate the target tables.
SSIS can use queries on tables or views. And it can also execute a stored procedure to retrieve the data.
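Steps 1) and 2) above could be sketched as follows (all object and column names are hypothetical):

```sql
-- On the publisher: a view consolidating the key information.
CREATE VIEW dbo.vKeyInfo AS
SELECT c.CustomerId, c.Name, o.OrderId, o.Total
FROM dbo.Customers c
JOIN dbo.Orders o ON o.CustomerId = c.CustomerId;

-- On the subscriber: a table with the same columns, which the
-- SSIS package truncates and reloads from the view on each run.
CREATE TABLE dbo.KeyInfo
(
    CustomerId int,
    Name       nvarchar(100),
    OrderId    int,
    Total      decimal(18, 2)
);

TRUNCATE TABLE dbo.KeyInfo;
-- (SSIS data flow: source = dbo.vKeyInfo on the publisher,
--  destination = dbo.KeyInfo on the subscriber.)
```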