Speed up Neo4j relationship creation

I have a CSV file with 1 million rows and 3 columns (NODE_ID_1, PROPERTY_COLUMN, NODE_ID_2).
I also have an existing Neo4j database containing nodes with the Node label. I need to create RELATED_TO relationships between these nodes. I use the following Cypher query, but it is far too slow (it takes more than a day to finish creating the relationships). Do you have any tips to speed up the relationship creation?
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM $url AS row
   WITH row {.*, PROPERTY: toFloat(row.PROPERTY_COLUMN)} AS row
   RETURN row",
  "MATCH (src:Node {node_id: row['NODE_ID_1']}), (dst:Node {node_id: row['NODE_ID_2']})
   MERGE (src)-[r:RELATED_TO]-(dst)
   SET r += {property_column: row.PROPERTY}",
  {batchSize: 1000, batchMode: "BATCH", parallel: false, params: {url: 'file:///path_to_file'}})

Do you have an index on :Node(node_id)? You NEED this for your MATCH operations to be performant.
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/
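If it is missing, here is a minimal sketch of creating it via the official Neo4j Python driver (the connection URI and credentials are placeholders; the same statement can just as well be run in the Neo4j Browser):
from neo4j import GraphDatabase

# Placeholder connection details - adjust to your environment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Single-property index so each MATCH on node_id is an index seek
    # rather than a full label scan.
    # (On Neo4j 4.x+ the equivalent is:
    #  CREATE INDEX node_id_idx IF NOT EXISTS FOR (n:Node) ON (n.node_id))
    session.run("CREATE INDEX ON :Node(node_id)")

driver.close()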

Related

PocketBase: filter data by multiple relation values

I have a collection where the field "users" contains the IDs of two users.
How can I search for the record in which both of these user IDs are present?
I tried
users = ["28gjcow5t1mkn7q", "frvl86sutarujot"]
but it still doesn't work.
Since "users" is a relation field, there must be a collection behind it that can hold multiple, non-unique values, and that collection is the dataset you are looking at. You can query the whole dataset from a script with:
// you can also fetch all records at once via getFullList
const records = await pb.collection('tlds').getFullList(200 /* batch size */, {
sort: '-created',
});
I suggest you look into the PocketBase js-sdk on GitHub.

Database design: Eloquent problem with query

I need to create an app that manages soccer match sheets.
I currently have a table that stores each match with both teams:
match :
-id
-dt_match
-club_home_id
-club_visitor_id
Each team has a sheet listing its players, so I created the table match_sheet to store both teams' sheets:
match_sheet :
-id
-match_id
To store the players on each sheet, I created the table match_sheet_player:
match_sheet_player:
-id
-match_sheet_id
-player_id
Now I need to display in my view only the matches that have both sheets, and I don't know how to achieve that.
The first query I wrote is this:
$matchs_sheets = MatchSheet::all();
$matchs = Match::whereIn('id', $matchs_sheets->pluck('match_id'))->orderByDesc('dt_match')->paginate(5);
But this returns the match even when only one of the two sheets exists. I really need to show a match only if both sheets are there.
Update:
Here is my data for match_sheet:
There are two records with match_id 1659 (1659 is the id of the match), so I would like to show only match 1659 and not 1649, because there is only one record for that match.
Assuming your model relationships are set up correctly, you can ask Laravel to return a match only when it has exactly 2 related sheets, using has(). For instance:
$matches = Match::whereIn('id', $ids)->has('matchSheets', '=', 2)...
Your relationships should be set up like this, for example:
// on Match model
public function matchSheets()
{
    return $this->hasMany(MatchSheet::class);
}

// on MatchSheet model
public function match()
{
    return $this->belongsTo(Match::class);
}
Docs here: https://laravel.com/docs/5.6/eloquent-relationships#one-to-many - I really recommend reading through them, they'll save you huge amounts of time eventually!

Efficient file comparison of two large text files

In our use case, we get large snapshot text files (tsv, csv etc.) from our customer (size around 30GB) with millions of records. The data looks like this:
ItemId (unique), Title, Description, Price etc.
shoe-id1, "title1", "desc1", 10
book-id-2, "title2", "desc2", 5
Whenever we get a snapshot from a customer, we need to compute a "delta":
Inserted - records that were inserted (only present in the latest file, not the previous one),
Updated - same Id but a different value in any of the other columns,
Deleted - records only present in the previous file, not the latest one.
(The data may be out of order in subsequent files and is not sorted on any column.)
We need to be able to run this multiple times a day for different customers.
We currently store all the data from snapshot file 1 in SQL Server (12 shards, partitioned by customerId, containing a billion rows in all) and compute diffs with multiple queries when snapshot file 2 is received. This is proving to be very inefficient (it takes hours, and deletes are particularly tricky). I was wondering if there are any faster solutions out there. I am open to any technology (e.g. Hadoop, a NoSQL database). The key is speed (minutes, preferably).
Normally, the fastest way to tell if an id appears in a dataset is by hashing, so I would make a hash which uses the id as the key and an MD5 checksum or CRC of the remaining columns as the element stored at that key. That should relieve the memory pressure if your data has many columns. Why do I think that? Because you say you have GB of data for millions of records, so I deduce each record must be of the order of kilobytes - i.e. quite wide.
So, I can synthesise a hash of 13M old values in Perl and a hash of 15M new values and then find the added, changed and removed as below.
#!/usr/bin/perl
use strict;
use warnings;
# Set $verbose=1 for copious output
my $verbose=0;
my $million=1000000;
my $nOld=13*$million;
my $nNew=15*$million;
my %oldHash;
my %newHash;
my $key;
my $cksum;
my $i;
my $found;
print "Populating oldHash with $nOld entries\n";
for($i=1;$i<=$nOld;$i++){
    $key=$i-1;
    $cksum=int(rand(2));
    $oldHash{$key}=$cksum;
}
print "Populating newHash with $nNew entries\n";
$key=$million;
for($i=1;$i<=$nNew;$i++){
    $cksum=1;
    $newHash{$key}=$cksum;
    $key++;
}
print "Part 1: Finding new ids (present in newHash, not present in oldHash) ...\n";
$found=0;
for $key (keys %newHash) {
    if(!defined($oldHash{$key})){
        print "New id: $key, cksum=$newHash{$key}\n" if $verbose;
        $found++;
    }
}
print "Total new: $found\n";
print "Part 2: Finding changed ids (present in both but cksum different) ...\n";
$found=0;
for $key (keys %oldHash) {
    if(defined($newHash{$key}) && ($oldHash{$key}!=$newHash{$key})){
        print "Changed id: $key, old cksum=$oldHash{$key}, new cksum=$newHash{$key}\n" if $verbose;
        $found++;
    }
}
print "Total changed: $found\n";
print "Part 3: Finding deleted ids (present in oldHash, but not present in newHash) ...\n";
$found=0;
for $key (keys %oldHash) {
    if(!defined($newHash{$key})){
        print "Deleted id: $key, cksum=$oldHash{$key}\n" if $verbose;
        $found++;
    }
}
print "Total deleted: $found\n";
That takes 53 seconds to run on my iMac.
./hashes
Populating oldHash with 13000000 entries
Populating newHash with 15000000 entries
Part 1: Finding new ids (present in newHash, not present in oldHash) ...
Total new: 3000000
Part 2: Finding changed ids (present in both but cksum different) ...
Total changed: 6000913
Part 3: Finding deleted ids (present in oldHash, but not present in newHash) ...
Total deleted: 1000000
For the purposes of testing, I made the keys in oldHash run from 0..12,999,999 and the keys in newHash run from 1,000,000..15,999,999, so I can easily tell whether it worked: the new keys should be 13,000,000..15,999,999 and the deleted keys should be 0..999,999. I also made the old checksums 0 or 1 at random and the new checksums all 1, so that about 50% of the overlapping ids should appear changed.
Having done it in this relatively simple way, I can now see that you only need the checksum part to find the altered ids, so you could do parts 1 and 3 without the checksum to save memory. You could also do part 2 one element at a time as you load the data, so you wouldn't need to load all the old and all the new ids into memory up front: load the smaller of the two datasets, then check for changes one id at a time as you read through the other, which is less demanding of memory.
Finally, if the approach works, it could readily be re-done in C++ for example, to speed it up further and reduce further the memory demands.
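For what it's worth, the same idea translates directly to Python: load the previous snapshot as an id-to-checksum dict, then stream the latest snapshot row by row, so only one snapshot's worth of keys is held in memory at a time. A rough sketch (the file names, tab delimiter and choice of MD5 are assumptions):
import csv
import hashlib

def row_checksum(columns):
    # Checksum of everything except the id column.
    return hashlib.md5('\t'.join(columns).encode('utf-8')).hexdigest()

# Pass 1: previous snapshot -> {id: checksum}.
old = {}
with open('snapshot_old.tsv', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    next(reader, None)  # skip the header row, if any
    for item_id, *rest in reader:
        old[item_id] = row_checksum(rest)

inserted, updated = [], []

# Pass 2: stream the latest snapshot and classify each row.
with open('snapshot_new.tsv', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    next(reader, None)  # skip the header row, if any
    for item_id, *rest in reader:
        if item_id not in old:
            inserted.append(item_id)
        elif old[item_id] != row_checksum(rest):
            updated.append(item_id)
        old.pop(item_id, None)  # seen: whatever remains at the end was deleted

deleted = list(old)

print(f"Inserted: {len(inserted)}, Updated: {len(updated)}, Deleted: {len(deleted)}")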

Clone some relationships according to a condition

I exported two tables named Keys and Acc as CSV files from SQL Server and imported them successfully into Neo4j using the commands below.
CREATE INDEX ON :Keys(IdKey)
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///C:/Keys.txt' AS line
MERGE (k:Keys { IdKey: line[0] })
SET k.KeyNam=line[1], k.KeyLib=line[2], k.KeyTyp=line[3], k.KeySubTyp=line[4]
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///C:/Acc.txt' AS line
MERGE (callerObject:Keys { IdKey : line[0] })
MERGE (calledObject:Keys { IdKey : line[1] })
MERGE (callerObject)-[rc:CALLS]->(calledObject)
SET rc.AccKnd=line[2], rc.Prop=line[3]
Keys holds the source code objects and Acc holds the relations among them. I imported these two tables three times, for three different application projects. To keep the IdKey property unique across the three applications, I concatenated a five-character application prefix onto IdKey while exporting from SQL Server, because (as I learnt from the manuals) we cannot create an index on multiple fields. Now my aim is to construct the relationships between applications. For example:
Node1 is a source code object of Application1
Node2 is another source code object of Application1
Node3 is a source code object of Application2
There is already a CALLS relationship from Node1 to Node2, because of the corresponding record already imported from Acc.
The name of Node2 is equal to the name of Node3, so we can say that Node2 and Node3 are in fact the same source code object, and we should therefore create a relationship from Node1 to Node3. To achieve this, I wrote the query below, but I want to be sure it is correct, because I do not know how long it will take to execute.
MATCH (caller:Keys)-[rel:CALLS]->(called:Keys),(calledNew:Keys)
WHERE calledNew.KeyNam = called.KeyNam
and calledNew.IdKey <> called.IdKey
CREATE (caller)-[:CALLS]->(calledNew)
The following query should be efficient, assuming you also create an index on :Keys(KeyNam).
MATCH (caller:Keys)-[rel:CALLS]->(called:Keys)
WITH caller, COLLECT(called.KeyNam) AS names
MATCH (calledNew:Keys)
WHERE calledNew.KeyNam IN names AND NOT (caller)-[:CALLS]->(calledNew)
CREATE (caller)-[:CALLS]->(calledNew)
Cypher will not use an index when doing comparisons directly between property values. So this query puts all the called names for each caller into a names collection, and then does a comparison between calledNew.KeyNam and the items in that collection. This causes the index to be used, and will speed up the identification of potential duplicate called nodes.
This query also does a NOT (caller)-[:CALLS]->(calledNew) check, to avoid creating duplicate relationships between the same nodes.
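As a rough sketch of putting that together via the official Neo4j Python driver (connection details are placeholders; both statements can equally be run in the Neo4j Browser):
from neo4j import GraphDatabase

# Placeholder connection details - adjust to your environment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

copy_calls = """
MATCH (caller:Keys)-[rel:CALLS]->(called:Keys)
WITH caller, COLLECT(called.KeyNam) AS names
MATCH (calledNew:Keys)
WHERE calledNew.KeyNam IN names AND NOT (caller)-[:CALLS]->(calledNew)
CREATE (caller)-[:CALLS]->(calledNew)
"""

with driver.session() as session:
    # Index on KeyNam so the IN comparison can use an index lookup.
    session.run("CREATE INDEX ON :Keys(KeyNam)")
    session.run("CALL db.awaitIndexes()")  # wait until the index is online
    summary = session.run(copy_calls).consume()
    print(summary.counters.relationships_created, "relationships created")

driver.close()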

Searching for and matching elements across arrays

I have two tables.
In one table there are two columns: one has the ID and the other the abstract of a document, about 300-500 words long. There are about 500 rows.
The other table has only one column and >18000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.
I am interested in a script that will scan each abstract in table 1 and identify which of the acronyms from table 2 are present in it.
Finally, the program will create a separate table whose first column contains the IDs from table 1 and whose second column contains the acronyms found in the document associated with each ID.
Can someone with expertise in Python, Perl or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo-SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(documents.abstract)
Given the desired semantics, you can use the most straightforward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass # word not an acronym
This is a straightforward implementation; however, it is slow: acronyms.index performs a linear search of our largest list for every word of every abstract. We can improve the algorithm by first building a hash index of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass # word not an acronym
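For instance, if the acronyms and the ID/abstract pairs live in two CSV files (the file names and column layout here are just illustrative), the hash-index version can be driven like this:
import csv

# Hypothetical inputs: acronyms.csv has one acronym per row,
# abstracts.csv has rows of (id, abstract).
with open('acronyms.csv', newline='') as f:
    acronyms = [row[0] for row in csv.reader(f) if row]

with open('abstracts.csv', newline='') as f:
    documents = [(row[0], row[1]) for row in csv.reader(f) if row]

index = {acronym: idx for idx, acronym in enumerate(acronyms)}

joins = []
for doc_id, abstract in documents:
    for word in abstract.split():
        if word in index:
            joins.append((doc_id, acronyms[index[word]]))

# Result table: document ID alongside each acronym found in its abstract.
with open('result.csv', 'w', newline='') as f:
    csv.writer(f).writerows(joins)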
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
Thanks a lot for the quick response.
I assume the pseudo-SQL solution is for MySQL etc.; however, it did not work in Microsoft Access.
The second and third snippets are for Python, I assume. Can I feed the acronyms and documents as input files?
babru
It didn't work in Access because tables are accessed differently (e.g. acronym.[id])
