How to bulk load vertices and edges to an existing AWS Neptune Graph

I have a running AWS Neptune graphDB which is being used in a production environment. I have since identified new vertices that I would like to add that will connect to specific existing vertices in the DB.
I have added the original set by splitting it up with the 'csv-to-neptune-bulk-format' script in https://github.com/awslabs/amazon-neptune-tools/tree/master/csv-to-neptune-bulk-format .
My question is, how can I bulk load my additional set in the most efficient way?
I have two ideas on how to approach this, but I'm hoping that someone knows a simpler way.
Approach 1 will be to use the above 'csv-to-neptune-bulk-format' script to split up the new additional set and then bulk load that. I will then have duplicate vertices where the new set overlaps with the original, as the script assigns new vertex IDs to the vertices where the new set connects to the original set. I have a function to merge these duplicate vertices, but this approach can be quite resource intensive.
Approach 2 will be to split up the additional set with the above script and then, in the generated edge CSV, replace the connecting vertex IDs with the matching vertex IDs generated during the first bulk upload. So basically, in the [~id,~label,~from,~to] edge CSV, the ~from column will hold existing vertex IDs from the first bulk upload instead of freshly generated ones.
I'm hoping that I've missed some documentation or logic somewhere that will allow me to use existing vertex IDs to simply bulk load the new processed vertex CSV and the edge CSV that connects the new vertices to the original ones. Any help or advice will be greatly appreciated.

The bulk loader can be used for more than just a first time load into an empty graph. You can use it to add new nodes and edges, and to update existing nodes and edges where you need to add new properties or replace the value of an existing (single cardinality) property.
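In other words, the behaviour you are hoping for already exists: the ~from and ~to columns of an edge file can reference vertex IDs that are already in the graph, so you can bulk load just the new vertex CSV plus an edge CSV that connects the new vertices to the originals. A minimal sketch of such an edge file (the IDs and label here are placeholders):
~id,~label,~from,~to
e-new-1,connects_to,existing-vertex-id-1,new-vertex-id-1
e-new-2,connects_to,existing-vertex-id-2,new-vertex-id-2
As long as the ~from / ~to values match the IDs assigned during your first load, no duplicate vertices are created and no post-load merge is needed.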
I have not used the csv-to-neptune-bulk-format tool; I typically generate the Neptune CSV format for nodes and edges directly.
Can you say a bit more about the current format of the data you want to ingest and why you need to ETL it with that tool? If you can add a bit more info, I will update this answer accordingly.

Related

How to modify the projection of a dataset in an ADF Dataflow

I want to optimize my dataflow by reading just the data I really need.
I created a dataset that maps a view on my database. This dataset is used by different dataflows, so I need a generic projection.
Now I am creating a new dataflow and I want to read just a subset of the dataset.
Here is how I created the dataset:
And this is the generic projection:
Here is how I created the data flow. These are the source settings:
But now I want just a subset of my dataset:
It works, but I think I am doing it wrong:
I want to read data from my dataset (as you can see from the source settings tab), but when I modify the projection I read from the underlying table (as you can see from the source options). It seems inconsistent. What is the correct way to manage this kind of customization?
Thank you
EDIT
The proposed solution does not solve my problem. If I go into Monitor and analyze the executions, this is what I see...
Before, with the workaround I wrote above, I got this:
As you can see, I read just 8 columns from the database.
With the proposed solution, I get this:
And then:
Just to be clear, the purpose of my question is:
How can I read only the data I really need instead of reading all the data and filtering it afterwards?
I found a way (explained in my question), but there is an inconsistency in the configuration of the dataflow (I set a dataset as the input, but in the options I write a query that reads from the DB).
First, import the data as a Source.
You can use a Select transformation in the Data Flow activity to select CustomerID from the imported dataset.
Here you can remove the unwanted columns.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-select
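For reference, the data flow script behind a Select transformation that keeps only CustomerID looks roughly like this (a sketch; the stream names source1 and SelectCustomerID are placeholders):
source1 select(mapColumn(
        CustomerID
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> SelectCustomerID
Note that a Select prunes columns inside the data flow; whether the read against the source is also reduced depends on the source settings and on how much column pruning the service can push down.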

Creating a relationship in Neo4j during import if a value is 1

Hello all you helpful folks!
I have been tasked with converting our RDBMS into a graph database for testing purposes. I am using Neo4j and have been successful in importing various tables into their appropriate nodes. However, I have run into a slight hiccup when it comes to the department node. Certain departments are partnered with a particular department. Within the RDBMS model this is simply a column named Is_Partner, because the database was originally set up with one partner in mind (hence the whole moving-to-a-graph-database thing).
What I need to do is match all departments with an Is_Partner value of 1 and create a relationship from each of them to a specific partner (Edge: ABBR, Value: HR). I have written the script, and it tells me it's successful, but 0 edits are made...
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Department.csv" AS row
MATCH (partner:Department {DepartmentID: row.DepartmentID})
WHERE row.IS_PARTNER = "1"
MERGE (partner)-[:IS_PARTNER_OF]->(Department{ABBR: 'HR'});
I'm pretty new to Graph Databases, but I know Relational Databases quite well. Any help would be appreciated.
Thank you for your time,
Jim Perry
There are a few problems with your query. If you want to filter on the CSV rows, use a WITH statement with a WHERE filter. You also want to MERGE the HR department node separately, and then MERGE the relationship separately.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Department.csv" AS row
WITH row WHERE row.IS_PARTNER = "1"
MATCH (partner:Department {DepartmentID: row.DepartmentID})
MERGE (dept:Department {ABBR: 'HR'})
MERGE (partner)-[:IS_PARTNER_OF]->(dept);
If it still returns no results/changes, check whether your MATCH statement returns anything, as that is usually the problem:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Department.csv" AS row
WITH row WHERE row.IS_PARTNER = "1"
MATCH (partner:Department {DepartmentID: row.DepartmentID})
RETURN partner

How to read and change series numbers in a column in SSIS?

I'm trying to manipulate a column in SSIS which looks like the sample below, after I removed unwanted rows with a Derived Column and a Conditional Split in my Data Flow Task. The source for this is a flat file.
XXX008001161022061116030S1TVCO3057
XXX008002161022061146015S1PUAG1523
XXX009001161022063116030S1DVLD3002
XXX009002161022063146030S1TVCO3057
XXX009003161022063216015S1PUAG1523
XXX010001161022065059030S1MVMA3020
XXX010002161022065129030S1TVCO3057
XXX01000316102206515901551PPE01504
The first three numbers from the left (starting with "008" in the first row) represent a series, and the next three ("001") represent another number within the series. What I need is to renumber the series consecutively, starting from "001", through to the end.
The desired result would thus look like:
XXX001001161022061116030S1TVCO3057
XXX001002161022061146015S1PUAG1523
XXX002001161022063116030S1DVLD3002
XXX002002161022063146030S1TVCO3057
XXX002003161022063216015S1PUAG1523
XXX003001161022065059030S1MVMA3020
XXX003002161022065129030S1TVCO3057
XXX00300316102206515901551PPE01504
...
My potential solution would be to load the file into a temporary database table and query it with SQL from there, but I am trying to avoid this.
The final destination is a flat file.
Does anybody have any ideas on how to pull this off in SSIS? Other solutions are appreciated as well.
Thanks in advance
I would definitely use the staging table approach and use window functions to accomplish this. That said, I could see a use case if SSIS were on a different machine from the database engine and there was a need to offload the processing to the SSIS box.
In that case I would create a script transformation. You can process each row and make the necessary changes before passing the row to the output. You can use C# or VB.
There are many examples out there. Here is MSDN article - https://msdn.microsoft.com/en-us/library/ms136114.aspx
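A rough sketch of the per-row logic in C# (assuming the line arrives as a single column named Line; the column and field names are placeholders):
// Class-level fields in the script component; they persist across rows.
private int newSeries = 0;
private string lastSeries = null;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Characters 4-6 (zero-based index 3, length 3) hold the original series number.
    string series = Row.Line.Substring(3, 3);
    if (series != lastSeries)
    {
        // A new series begins: advance the output counter.
        newSeries++;
        lastSeries = series;
    }
    // Splice the renumbered series back in, zero-padded to three digits.
    Row.Line = Row.Line.Substring(0, 3) + newSeries.ToString("D3") + Row.Line.Substring(6);
}
This relies on the rows arriving in their original file order, which a flat file source feeding synchronous transformations preserves.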

Are incremental adds possible in Neo4j?

I have a quick question. I have a database of one million nodes and 4 million relationships. I created all this data in Neo4j with the import CSV command. After testing the graph database and analyzing the queries according to my needs, I now want to make a PHP program where the data will be loaded automatically and I will get the results at the end (according to my query). Here is the question: my data will update every 15 minutes. Does Neo4j have the ability to do incremental adds, i.e. to show which new relationships or nodes were added in a specific time window? I was thinking of using the time command to see which data was created in that time; correct me if I am wrong. I only want to see the new additions, because I don't want Neo4j to waste time on calculations over already existing nodes/relationships. Is there any other way to do that?
thanks in advance.
You could add a property to store the date/time at which each node is added. Then you could query for everything since the last date/time. I'm not 100% sure about the index performance of that, though.
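A minimal sketch of that approach (created_at is an assumed property name, stamped as part of your import):
// At import time, stamp each new node:
SET n.created_at = timestamp()
// Later, fetch everything added since a given point in time:
MATCH (n:Label)
WHERE n.created_at > {since}
RETURN n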
However, if all you care about is showing the most recently imported, you could have a boolean value with an index:
CREATE INDEX ON :Label(recently_added)
Then when you import your data you can unset all of the current ones and set the new ones like this:
MATCH (n:Label {recently_added: true})
REMOVE n.recently_added
Second query:
MATCH (n:Label {id: {id_list}})
SET n.recently_added = true
That is assuming that you have some sort of unique identifier on the nodes which you can use to set the ones which you just added.

Cross-reference Distance chart in SQL Server 2008 or Excel?

I would like to construct a cross-reference distance chart similar to the one here (the example is a road-distance cross-reference chart) and, ideally, store the data in SQL Server 2008 (preferably the Express version). It needs these properties / abilities:
Every column has a corresponding row with the same name (i.e. not misspelled like my example).
Changing the value at one Row-Column intersection would update the mirror intersection (Column-Row) or the mirror data could be ignored.
The distance-values would need to be end-user editable.
The end-user would need to be able to add, delete or rename a column/row pair.
The end-user needs to be able to sort the columns and have the rows move automatically.
There could be hundreds of pairs.
A look-up query needs to find a distance given a start & destination (row & column).
The distance chart is reasonably straightforward to implement in Excel. Considering this, am I better off...
Using Excel as the user editing UI and then updating an SQL 'thing' with the new data?
Using Excel as the data-source even if it means performance issues with querying the data?
Using an as-yet undiscovered stroke of genius detailed here in an answer?
Sure looks like an Excel application to me, start to end. (heh)
I can't imagine your users typing enough data in to make performance an issue. Excel will only take 32757 rows by ditto columns. If that's enough, I'd say you're golden.
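That said, if the data does end up in SQL Server, a normalized origin/destination table avoids the mirror-intersection problem entirely and makes the look-up trivial. A minimal sketch (table and column names are placeholders):
CREATE TABLE Distance (
    Origin      varchar(100) NOT NULL,
    Destination varchar(100) NOT NULL,
    Miles       int          NOT NULL,
    PRIMARY KEY (Origin, Destination)
);

-- Look up the distance for a start & destination pair:
SELECT Miles
FROM Distance
WHERE Origin = 'CityA' AND Destination = 'CityB';
Storing each pair once (say, with the alphabetically-first city as Origin) sidesteps the duplicated mirror data; the cross-reference grid then becomes just a presentation of this table.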
