SymmetricDS File Synchronization Router - symmetricds

I have 3 nodes (corp, store-1, store-2). I want to do file synchronization that filters based on each node's external.id, so that a file is synchronized only to the right nodes (not to all nodes). I read in the docs that the column match router can do filtered synchronization, but the example there is for database synchronization.
How do I do that for file synchronization? Thanks.

Use the same kind of router, linked through file_trigger_router to your file_trigger, to filter which files go to which target nodes.
Here's an example: http://www.symmetricds.org/doc/3.5/html-single/user-guide.html#filesync-example-2
INSERT INTO sym_file_trigger
(trigger_id,base_dir,recurse,includes_files,excludes_files,sync_on_create,
sync_on_modified,sync_on_delete,before_copy_script,after_copy_script,create_time,
last_update_by,last_update_time)
VALUES
('node_specific','/filesync/server/nodes',1,null,null,1,1,1,'',null,
current_timestamp,'example',current_timestamp);
INSERT INTO sym_file_trigger_router
(trigger_id,router_id,enabled,initial_load_enabled,target_base_dir,
conflict_strategy,create_time,last_update_by,last_update_time)
VALUES
('node_specific','router_files_to_node',1,1,'/filesync/clients','SOURCE_WINS',
current_timestamp,'example',current_timestamp);
INSERT INTO sym_router
(router_id,target_catalog_name,target_schema_name,target_table_name,
source_node_group_id,target_node_group_id,router_type,router_expression,
sync_on_update,sync_on_insert,sync_on_delete,create_time,last_update_by,
last_update_time)
VALUES
('router_files_to_node',null,null,null,'server','client','column',
'RELATIVE_DIR = :NODE_ID ',1,1,1,current_timestamp,'example', current_timestamp);
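With this setup, the column match router compares the file's directory relative to base_dir (RELATIVE_DIR) against the target node's node id, so a file is routed only to the node whose id matches the sub-directory it was dropped into. A rough illustration, assuming the node ids are corp, store-1 and store-2 (the file names are made up):
/filesync/server/nodes/store-1/prices.txt  ->  synced only to store-1
/filesync/server/nodes/store-2/prices.txt  ->  synced only to store-2
If your node ids and external ids are not the same, check whether your SymmetricDS version's column match router supports an :EXTERNAL_ID token in the router expression; otherwise keep the directory names aligned with the node ids as above.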

Related

Arangodb update properties depend on edge type

I am trying to use AQL to update a whole node collection, named Nodes, depending on the type of edges the nodes have.
Requirement:
Basically, if two entities in Nodes are connected by a relation of type "SAME", they should be updated with the same unique groupid property (likewise for groups of more than two).
This would only run once, at the beginning (to populate groupid).
My conceptual approach:
Use AQL.
For each entity in Nodes, query all connected nodes with type = "SAME".
Generate a groupid and update all of them.
Write those ids to a lookup object.
For the next entity, do a lookup and skip the entity if its id is already there.
What I tried
FOR v, e, p IN 1..10 ANY v EntityRelationTest
  OPTIONS { uniqueVertices: "global", bfs: true }
  FILTER p.edges[*].relationType[0] == "EQUALS"
  UPDATE v WITH { typeName2: "test1" } IN EntityTest
  RETURN NEW
But I am quite new to ArangoDB AQL; is something like the above possible?
In the end, what I used was a custom traversal object running directly inside Foxx, in order to get the best of both worlds: performance and correctness. It seems that we cannot do the above with AQL alone.
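For reference, the traversal part on its own can be written in AQL by starting from an explicit document id rather than from the unbound v; it is the group-id generation and the "already visited" bookkeeping across entities that pushed this into Foxx. A minimal sketch, assuming a hypothetical start document EntityTest/start and the relation type value from the requirement above (adjust names to your data):
FOR v, e, p IN 1..10 ANY 'EntityTest/start' EntityRelationTest
  OPTIONS { uniqueVertices: "global", bfs: true }
  // keep only paths whose edges are all of relation type "SAME"
  FILTER p.edges[*].relationType ALL == "SAME"
  RETURN v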

Need help understanding alternatives to scd in SSIS

I am working on a data warehouse project that will involve integrating data from multiple source systems. I have set up an SSIS package that populates the customer dimension and uses the slowly changing dimension tool to keep track of updates to the customer.
I'm running into some issues. Take this example:
Source system A might have a record that looks like this:
First Name, Last Name, Zipcode
Jane, Doe, 14222
Source system B might have a record for the same client that looks like this:
First Name, Last Name, Zipcode
Jane, Doe, Unknown
If I first import the record from system A, I'll have the first name, last name, and zipcode. Great. Now, if I import the client record from system B, I can do fuzzy matching to recognize that this is the same person and use the slowly changing dimension tool to update the information. But in this case, I'm going to lose the zipcode, because the 'Unknown' will overwrite the valid data.
I am wondering if I am approaching this problem in the wrong way. The SCD tool doesn't seem to offer any way of selectively updating attributes based on whether the new data is valid or not. Would a merge statement work better? Am I making some kind of fundamental design mistake that I'm not seeing?
Thanks for any advice!
In my experience the built-in SCD tool is not flexible enough to handle this requirement.
Either a couple of MERGE statements, or a series of UPDATE and INSERT statements, will probably give you the most flexibility with the logic, and the best performance.
There are probably templates out there for SCD Type 2 MERGE statements, but here is the pattern I use:
Merge Target
Using Source
On Target.Key = Source.Key
When Matched And Target.IsCurrent = 1
And (Target.NonKeyAttribute <> Source.NonKeyAttribute
Or IsNull(Target.NonKeyNullableAttribute, '') <> IsNull(Source.NonKeyNullableAttribute, ''))
Then Update Set SCDEndDate = GetDate(), IsCurrent = 0
When Not Matched By Target Then
Insert (Key, ... , SCDStartDate, IsCurrent)
Values (Source.Key, ..., GetDate(), 1)
When Not Matched By Source And Target.IsCurrent = 1 Then
Update Set SCDEndDate = GetDate(), IsCurrent = 0;

Merge Target
Using Source
-- Match only current rows, so the keys expired in the first statement show up
-- as "not matched" here and get a fresh current version inserted.
On Target.Key = Source.Key And Target.IsCurrent = 1
When Not Matched By Target Then
Insert (Key, ... , SCDStartDate, IsCurrent)
Values (Source.Key, ... , GetDate(), 1);
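On the original concern about 'Unknown' overwriting a valid zipcode: once you are writing the MERGE/UPDATE logic yourself, you can fold suspect source values back onto what is already stored. A hedged sketch of that idea, in the same Target/Source placeholder style as the pattern above and with a hypothetical Zipcode column, using standard T-SQL NullIf/Coalesce:
Merge Target
Using Source
On Target.Key = Source.Key
When Matched Then Update Set
    -- NullIf turns the 'Unknown' placeholder into NULL; Coalesce then falls back to the existing value
    Target.Zipcode = Coalesce(NullIf(Source.Zipcode, 'Unknown'), Target.Zipcode);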

clone some relationships according to a condition

I exported two tables named Keys and Acc as CSV files from SQL Server and imported them successfully into Neo4j using the commands below.
CREATE INDEX ON :Keys(IdKey)
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///C:/Keys.txt' AS line
MERGE (k:Keys { IdKey: line[0] })
SET k.KeyNam=line[1], k.KeyLib=line[2], k.KeyTyp=line[3], k.KeySubTyp=line[4]
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///C:/Acc.txt' AS line
MERGE (callerObject:Keys { IdKey : line[0] })
MERGE (calledObject:Keys { IdKey : line[1] })
MERGE (callerObject)-[rc:CALLS]->(calledObject)
SET rc.AccKnd=line[2], rc.Prop=line[3]
Keys holds the source code objects, and Acc holds the relations among them. I imported these two tables three times, for three different application projects. To keep the IdKey property unique across the three applications, I concatenated a five-character application prefix onto IdKey while exporting from SQL Server, because (as I learned from the manuals) an index cannot be created on multiple fields. Now my aim is to construct the relations among applications. For example:
Node1 is a source code object of Application1
Node2 is another source code object of Application1
Node3 is a source code object of Application2
There is already a CALLS relation from Node1 to Node2, because of the record already imported from Acc.
The name of Node2 is equal to the name of Node3, so we can say that Node2 and Node3 are in fact the same source code. Therefore we should create a relation from Node1 to Node3. To do this, I wrote the command below, but I want to be sure that it is correct, because I do not know how long it will take to execute.
MATCH (caller:Keys)-[rel:CALLS]->(called:Keys),(calledNew:Keys)
WHERE calledNew.KeyNam = called.KeyNam
and calledNew.IdKey <> called.IdKey
CREATE (caller)-[:CALLS]->(calledNew)
The following query should be efficient, assuming you also create an index on :Keys(KeyNam).
MATCH (caller:Keys)-[rel:CALLS]->(called:Keys)
WITH caller, COLLECT(called.KeyNam) AS names
MATCH (calledNew:Keys)
WHERE calledNew.KeyNam IN names AND NOT (caller)-[:CALLS]->(calledNew)
CREATE (caller)-[:CALLS]->(calledNew)
Cypher will not use an index when doing comparisons directly between property values. So this query puts all the called names for each caller into a names collection, and then does a comparison between calledNew.KeyNam and the items in that collection. This causes the index to be used, and will speed up the identification of potential duplicate called nodes.
This query also does a NOT (caller)-[:CALLS]->(calledNew) check, to avoid creating duplicate relationships between the same nodes.
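For completeness, that index can be created with the same syntax already used in the question for IdKey:
CREATE INDEX ON :Keys(KeyNam)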

Creating a Neo4j Graph Database Using LOAD CSV

I have a CSV file containing the data that I want to convert into a graph database using Neo4j. The Columns in the file are in the following format :
Person1 | Person2 | Points
Now the ids in Person1 and Person2 are redundant (they repeat across rows), so I am using a MERGE statement instead. But I am not getting the correct results.
For a sample dataset the output seems to be correct, but when I import my full dataset of 2M rows, it somehow doesn't create the relationships.
I am putting the cypher code that I am using currently.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:C:/Users/yogi/Documents/Neo4j/default.graphdb/sample.csv" AS csvline
MERGE (p1:Person {id:toInt(csvline.id1)})
MERGE (p2:Person {id:toInt(csvline.id2)})
CREATE (p1)-[:points{count:toInt(csvline.c)}]->(p2)
Some things you should check:
are you using an index? CREATE INDEX ON :Person(id) should be run before the import.
depending on the Neo4j version you're using, the statement might be subject to the "eager pipe", which effectively disables periodic commit. For more on the eager pipe, see http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
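If you do hit the eager problem, the usual workaround is to split the import into separate passes so each LOAD CSV statement does only one kind of work. A rough sketch reusing the same file path and property names as above (depending on the version you may even need one MERGE per statement):
// pass 1: only create/merge the Person nodes
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:C:/Users/yogi/Documents/Neo4j/default.graphdb/sample.csv" AS csvline
MERGE (p1:Person {id: toInt(csvline.id1)})
MERGE (p2:Person {id: toInt(csvline.id2)});

// pass 2: the nodes already exist, so only relationships are created here
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:C:/Users/yogi/Documents/Neo4j/default.graphdb/sample.csv" AS csvline
MATCH (p1:Person {id: toInt(csvline.id1)})
MATCH (p2:Person {id: toInt(csvline.id2)})
CREATE (p1)-[:points {count: toInt(csvline.c)}]->(p2);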

What DBMS should I use to store openstreetmap as a graph?

Background:
I need to store the following data in a database:
osm nodes with tags;
osm edges with weights (that is an edge between two nodes extracted from 'way' from an .osm file).
Nodes that form edges belonging to the same 'way' should carry the same tags as those ways, i.e. every node in a 'way' that is a highway should have a 'highway' tag.
I need this structure to easily generate a graph based on various filters, e.g. a graph consisting only of nodes and edges which are highways, or a 'foot paths' graph, etc.
Problem:
I have not heard about the spatial index before, so I just parsed an .osm file into a MySQL database:
all nodes to a 'nodes' table (with respective coordinate columns) - OK, about 9,000,000 rows in my case:
INSERT INTO nodes VALUES (node_id, lat, lon);
all ways to an 'edges' table (usually one way creates a few edges) - OK, about 9,000,000 rows as well:
INSERT INTO edges VALUES (edge_id, from_node_id, to_node_id);
add tags to nodes, calculate weights for edges - Problem:
Here is the problematic php script:
$query = mysql_query('SELECT * FROM edges');
$i = 0;
while ($res = mysql_fetch_object($query)) {
    $i++;
    echo "$i\n";
    // look up both endpoint nodes of this edge, one query each
    $node1 = mysql_query('SELECT * FROM nodes WHERE id='.$res->from);
    $node1 = mysql_fetch_object($node1);
    $tag1 = $node1->tags;
    $node2 = mysql_query('SELECT * FROM nodes WHERE id='.$res->to);
    $node2 = mysql_fetch_object($node2);
    $tag2 = $node2->tags;
    // append the edge's tags to each endpoint node, one UPDATE each
    mysql_query('UPDATE nodes SET tags="'.$tag1.$res->tags.'" WHERE nodes.id='.$res->from);
    mysql_query('UPDATE nodes SET tags="'.$tag2.$res->tags.'" WHERE nodes.id='.$res->to);
}
Nohup shows the output of 'echo "$i\n"' every 55-60 seconds (at that rate it would take more than 17 years to finish, since the 'edges' table has more than 9,000,000 rows in my case).
Htop shows a /usr/bin/mysqld process which takes 40-60% of CPU.
The same problem exists for the script that tries to calculate the weight (the distance) of an edge (select all edges; for each edge, select its two nodes from the 'nodes' table, calculate the distance, then update the edges table).
Question:
How can I make these SQL updates faster? Should I tweak any MySQL config settings? Or should I use PostgreSQL with the PostGIS extension? Should I use another structure for my data? Or should I somehow utilize a spatial index?
If I understand you right, there are two things to discuss.
First, your idea of putting the highway tag on the start and stop nodes. A node can have more than one edge connected to it; where do you put the tag from the second edge? Or the third or fourth, if it is a crossing? The reason the highway tag is put in the edges table in the first place is that, from a relational point of view, that is where it belongs.
Second, getting the whole table and processing it outside the database is not the right way. Taking care of this whole process is exactly what a relational database is good at.
I have not worked with MySQL, but from what I have heard you will probably have a lot more fun if you migrate to PostGIS, since PostGIS has much better spatial capabilities (even though you don't need any spatial capabilities for this particular task).
So, if we ignore the first problem and, just to show the concept, say that there are only two edges connected to each node and that each node has two tag fields, tag1 and tag2, then it could look something like this in PostGIS:
UPDATE nodes set tag1=edges.tags from edges where nodes.id=edges.from;
UPDATE nodes set tag2=edges.tags from edges where nodes.id=edges.to;
If you disable the indexes, that should be very fast.
Again, if I have understood you right.
PostgreSQL
Openstreetmap itself uses PostgreSQL, so I guess that's recommended.
See: http://wiki.openstreetmap.org/wiki/PostgreSQL
You can see OSM's database schema at: http://wiki.openstreetmap.org/wiki/Database_Schema
So you can use the same fields, fieldtypes and indexes that OSM uses for maximum compatibility.
MySQL
If you want to import .osm files into a MySQL database, have a look at:
http://wiki.openstreetmap.org/wiki/OsmDB.pm
Here you will find perl code that will create MySQL tables, parse a OSM file and import it into your MySQL database.
Making it faster
If you are updating in bulk, you don't need to update the indexes after every update.
You can just disable the indexes, do all your updates, and re-enable the indexes.
I'm guessing that should be a whole lot faster.
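A rough sketch of what that could look like in MySQL, reusing the table layout from the question (the from_node_id/to_node_id column names come from the pseudocode above, and it assumes edges also carries a tags column as in the PHP script; note that DISABLE KEYS only skips non-unique index maintenance, and only on MyISAM tables):
ALTER TABLE nodes DISABLE KEYS;

-- set-based equivalent of the per-row PHP loop: aggregate each node's edge tags once, then join
-- (GROUP_CONCAT output is limited by group_concat_max_len, raise it if tags are long)
UPDATE nodes
JOIN (SELECT from_node_id AS id, GROUP_CONCAT(tags SEPARATOR '') AS edge_tags
      FROM edges GROUP BY from_node_id) e ON nodes.id = e.id
SET nodes.tags = CONCAT(IFNULL(nodes.tags, ''), e.edge_tags);

UPDATE nodes
JOIN (SELECT to_node_id AS id, GROUP_CONCAT(tags SEPARATOR '') AS edge_tags
      FROM edges GROUP BY to_node_id) e ON nodes.id = e.id
SET nodes.tags = CONCAT(IFNULL(nodes.tags, ''), e.edge_tags);

ALTER TABLE nodes ENABLE KEYS;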
Good luck
