Neo4j: How do you find the first (shallowest) match for all relationship matches?

I have a data set that duplicates the commit history inside a git repository. Each Commit node has one or more parents, which are also Commit nodes. Commits have a commit_id property and have references to the files that changed in that commit. In other words:
ChangedFile<-[:CHANGED_IN]-Commit
Commit-[:CONTAINS]->ChangedFile
Commit-[:CHILD_OF]->Commit
I'm now trying to write a Cypher query that returns commit/file pairs where each commit contains the most recent change to the file. Since the graph has been designed to mimic git history with parent/child relationships, the query should support choosing a commit to start at, i.e. the HEAD.
Here's what I've got so far:
MATCH
(commit:Commit {commit_id: '460665895c91b2f9018e361b393d7e00dc86b418'}),
(file:ChangedFile)<-[:CHANGED_IN]-(commit)-[:CHILD_OF*]->(parent:Commit)
RETURN
file.path, parent.commit_id
Unfortunately this query returns all the commits that match at any number of levels deep within the [:CHILD_OF*] relationship. I want it to instead stop at the first match for each file. As things stand now, I end up seeing a bunch of duplicate file paths in the result set.
How do I tell Neo4j/Cypher to stop at the first match per file, regardless of depth? I've tried adding UNIQUE and a bunch of other things, but I can't seem to find something that works. Thanks in advance!

Maybe I'm misunderstanding your data model and what you are after, but why are you looking for variable length paths from the commit to its parent? Aren't you just looking for the parent?
MATCH
(commit:Commit {commit_id: '460665895c91b2f9018e361b393d7e00dc86b418'}),
(file:ChangedFile)<-[:CHANGED_IN]-(commit)-[:CHILD_OF]->(parent:Commit)
RETURN
file.path, parent.commit_id
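If you really do want to search arbitrarily deep and keep only the shallowest (most recent) change per file, one way is to order the variable-length matches by path length and take the first commit per file. This is only a sketch, reusing the labels and properties from the question; the *0.. lets the starting commit itself count as depth 0:
MATCH path = (head:Commit {commit_id: '460665895c91b2f9018e361b393d7e00dc86b418'})-[:CHILD_OF*0..]->(ancestor:Commit)
MATCH (file:ChangedFile)<-[:CHANGED_IN]-(ancestor)
WITH file, ancestor, length(path) AS depth
ORDER BY depth
WITH file, collect(ancestor)[0] AS latest
RETURN file.path, latest.commit_id
The collect(...)[0] trick relies on the rows arriving ordered by depth, so each file keeps only the nearest ancestor that changed it.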

Related

SQL command insert where not exists allows 2 identical records to be stored in database

I have a problem with an application that runs on a machine in my factory (where I work, I mean).
Anyway, the application creates unique numbers for packages, which are tied to the unique box numbers they are loaded into. As such, they should always be unique numbers.
Now it was seen in the report file that 2 box numbers were identical, as were the components within. That means that 2 identical numbers were loaded into the database table from which the report is generated.
Now the coding part
The programmers of the application want to solve this by using the
SELECT DISTINCT SQL command
from the database to generate the file, so that it will only ever write one version of the double-registered number. (This is because we don't know how the number was put into the database twice.) I don't want this solution because it only treats one known symptom and not the cause. There might be other effects that we are not aware of.
I have suggested that they use
INSERT WHERE NOT EXISTS SQL command
so that the same item can never be registered.
Now they come back to me and say that that condition already exists...? I can't understand that; is it possible?
The only scenario I can think of is that they perform the INSERT WHERE NOT EXISTS on a combination of fields, not just the number field.
Is it possible that the command can fail in another way? I have no access to the code, so I can't give an example of it, but my problem is a concrete one. I don't want to be railroaded by them, so that is why I am asking you guys.
It seems probable that this is a concurrency problem.
The statement
insert into [] where not exists []
could be submitted twice at the same time and run in parallel. Both statements check their condition and then both insert. This is made more likely by long-running transactions.
To uncover the bug (something will fail somewhere), actually tell the database that this value is meant to be unique. First you would of course have to remove the existing duplicates; then you could, for example, use a
create unique index on [ ]
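As a sketch of what that could look like — the packages table and its id and package_number columns are hypothetical stand-ins for the real schema, and exact syntax varies by database:
Remove existing duplicates first, keeping one row per number:
DELETE FROM packages
WHERE id NOT IN (SELECT MIN(id) FROM packages GROUP BY package_number);
Then enforce uniqueness at the database level:
CREATE UNIQUE INDEX ux_packages_number ON packages (package_number);
The existing insert pattern stays the same, but the unique index now makes the second of two concurrent inserts fail instead of silently creating a duplicate:
INSERT INTO packages (package_number)
SELECT '12345'
WHERE NOT EXISTS (SELECT 1 FROM packages WHERE package_number = '12345');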

How to archive records in multiple tables within one access database

I have a database that contains data from the last 4-5 years. The database has grown quite large, and the application that uses it has been running really slowly. I am looking to archive some data from all the tables. This database has 16-17 tables that have relationships amongst them, and I am looking for a way to perform the archive operation so that I can archive/remove data for a couple of years. I tried reading about APPEND and DELETE queries, but I am not sure how to apply them to multiple tables.
Another problem is that this application was created by somebody else, so I don't have enough knowledge about the database and the way the tables are structured. Any help/suggestions are much appreciated.
The first thing you need to do is gain an understanding of your dataset. If you truly "don't have enough knowledge about the database and the way tables are structured", you're setting yourself up for one huge failure.
The most important piece is to determine how the tables are inter-related. If the original designer was competent, he should have set up relationships within the database and enforced referential integrity. You will need to look at those relationships (go to the Database Tools tab and choose Relationships), or determine what relationships should exist between your data.
Once you've determined how your data is related, you will need to set up new Archive tables to mirror all of the tables you wish to archive. The easiest way to do this is to right-click on the table and choose "Copy", then right-click elsewhere in the pane and choose "Paste". You will get a dialog asking how you want to paste the table.
Choose "Structure Only", since you just want to set the table up. I would give them the same name as the original tables, with "_Archive" tacked onto the end. This way, it will be easy to determine which tables you're working with.
Next up, determine which are your "parent" tables and which are your "children" tables. You do this by determining which fields contain relationships to each other, and how they're related. Any tables with a One-to-Many relationship can be considered "Parents" on the "One" side, and "Children" on the "Many" side.
After this, you will need to determine how you wish to archive. Usually there is some date field within your table that you can use as a guide. Say, for instance, you have a field called "Order Date". You can choose to archive anything with an Order Date before 1/1/2010. To do so, you will write an Append query that appends everything to your new archive table where Order Date <= 12/31/2009. You do this first for the Children tables, and then for the Parent tables.
After this, you will write a Delete query. Essentially the same process as above, but you're deleting from your original tables since the data has already been written to your archive tables. You MUST delete from the children tables first. Then do the same for the parent tables.
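As a sketch of that pair of queries in Access SQL, assuming a hypothetical Orders table with an [Order Date] field and a matching Orders_Archive table created as above:
The append query:
INSERT INTO Orders_Archive
SELECT Orders.*
FROM Orders
WHERE Orders.[Order Date] <= #12/31/2009#;
The matching delete query (children first, then parents):
DELETE Orders.*
FROM Orders
WHERE Orders.[Order Date] <= #12/31/2009#;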
You can now move all the archive tables into a new database, and zip it up to minimize space. Once that's complete, you can Repair & Compact your database and the size should be much smaller.
Always remember to make a copy first! If you make any mistakes, you can't undo them. Creating a copy allows you to go back and retry without losing any data.

Why does SQL Server prevent inserting in WHEN MATCHED of MERGE?

Anyone know why SQL Server prevents inserting from within the WHEN MATCHED clause of a MERGE statement? I understand that the documentation only allows updates or deletes; I'm wondering why this is the case so I can understand MERGE better.
Look at this post for an example.
If you are trying to merge your source into your target, it does not make sense to insert a row if it was found in the target. You may want to update or delete it, though. Inserting what is already there would create duplicates.
As you want to INSERT when you find a MATCH, I presume the condition of the ON clause is met but another field is different. Consider including this field in the ON clause with AND to differentiate between rows that are already present and rows to be inserted.
Common sense says: if you already have it there (the record), why would you want to insert it again? Not to mention that normally the "matching" is on a non-duplicated key.
If you are able to find a situation where a matched record needs to be inserted again, let us know so we can help you.
I think one case might be if I want to keep track of history, e.g. a Telephone field; I want to see what the telephone number was before you changed it.
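A sketch of the ON-clause suggestion above, applied to the telephone-history example; the CustomerHistory and CustomerStaging tables and their columns are hypothetical. Because the changing column is part of the ON clause, a changed telephone number no longer counts as MATCHED, so the WHEN NOT MATCHED branch inserts a fresh history row:
MERGE dbo.CustomerHistory AS target
USING dbo.CustomerStaging AS source
    ON  target.CustomerId = source.CustomerId
    AND target.Telephone  = source.Telephone   -- include the changing column in the match
WHEN NOT MATCHED BY TARGET THEN                 -- new customer or changed telephone
    INSERT (CustomerId, Telephone, RecordedAt)
    VALUES (source.CustomerId, source.Telephone, SYSDATETIME());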

'tail -f' a database table

Is it possible to effectively tail a database table such that when a new row is added an application is immediately notified with the new row? Any database can be used.
Use an ON INSERT trigger.
You will need to check the specifics of how to call external applications with the values contained in the inserted record, or you can write your 'application' as a stored procedure and have it run inside the database.
It sounds like you will want to brush up on databases in general before you paint yourself into a corner with command-line approaches.
Yes, if the database is a flat text file and appends are done at the end.
Yes, if the database supports this feature in some other way; check the relevant manual.
Otherwise, no. Databases tend to be binary files.
I am not sure, but this might work for primitive / flat-file databases; as far as I understand (and I could be wrong), the files of modern databases are binary rather than plain text, so reading a newly added row would not work with that command.
I would imagine most databases allow for write triggers, and you could have a script that triggers on write that tells you some of what happened. I don't know what information would be available, as it would depend on the individual database.
There are a few options here, some of which others have noted:
Periodically poll for new rows. With the way MVCC works though, it's possible to miss a row if there were two INSERTS in mid-transaction when you last queried.
Define a trigger function that will do some work for you on each insert. (In Postgres you can call a NOTIFY command that other processes can LISTEN to; a sketch follows below.) You could combine a trigger with writes to an unpublished_row_ids table to ensure that your tailing process doesn't miss anything. (The tailing process would then delete IDs from the unpublished_row_ids table as it processed them.)
Hook into the database's replication functionality, if it provides any. This should have a means of guaranteeing that rows aren't missed.
I've blogged in more detail about how to do all these options with Postgres at http://btubbs.com/streaming-updates-from-postgres.html.
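For reference, a sketch of the trigger + NOTIFY option for Postgres; the new_orders table, its id column, and the new_row channel name are hypothetical, and EXECUTE FUNCTION requires Postgres 11+ (use EXECUTE PROCEDURE on older versions):
-- Trigger function: publish the new row's id on every insert
CREATE OR REPLACE FUNCTION notify_new_row() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('new_row', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER new_orders_notify
AFTER INSERT ON new_orders
FOR EACH ROW EXECUTE FUNCTION notify_new_row();
-- The tailing client then runs LISTEN new_row; and waits for notifications.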
tail on Linux appears to use inotify to tell when a file changes - it probably uses similar filesystem notification frameworks on other operating systems. So it does detect file modifications.
That said, tail performs an fstat() call after each detected change and will not output anything unless the size of the file increases. Modern DB systems use random file access and reuse DB pages, so it's very possible that an inserted row will not cause the backing file size to change.
You're better off using inotify (or similar) directly, and even better off if you use DB triggers or whatever mechanism your DBMS offers to watch for DB updates, since not all file updates are necessarily row insertions.
I was just in the middle of posting the same exact response as glowcoder, plus another idea:
The low-tech way to do it is to have a timestamp field, and have a program run a query every n minutes looking for records where the timestamp is greater than that of the last run. The same concept can be done by storing the last key seen if you use a sequence, or even adding a boolean field "processed".
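A sketch of that polling query, with a hypothetical orders table, a created_at timestamp column, and a :last_run bind parameter holding the previous run's timestamp:
SELECT *
FROM orders
WHERE created_at > :last_run
ORDER BY created_at;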
With Oracle you can select a pseudo-column called 'rowid' that gives a unique identifier for each row in the table, and rowids are ordinal... new rows get assigned rowids that are greater than any existing rowids.
So, first select max(rowid) from table_name
I assume that one reason for the question is that there are many, many rows in the table... so this first step will tax the db a little and take some time.
Then, select * from table_name where rowid > 'whatever_that_rowid_string_was'
You still have to run the query periodically, but it is now quick and inexpensive.

Safely deleting a Django model from the database using a transaction

In my Django application, I have code that deletes a single instance of a model from the database. There is a possibility that two concurrent requests could both try to delete the same model at the same time. In this case, I want one request to succeed and the other to fail. How can I do this?
The problem is that when deleting an instance with delete(), Django doesn't return any information about whether the command was successful or not. This code illustrates the problem:
b0 = Book.objects.get(id=1)
b1 = Book.objects.get(id=1)
b0.delete()
b1.delete()
Only one of these two delete() commands actually deleted the object, but I don't know which one. No exceptions are thrown and nothing is returned to indicate the success of the command. In pure SQL, the command would return the number of rows deleted; if the value was 0, I would know my delete had failed.
I am using PostgreSQL with the default Read Committed isolation level. My understanding of this level is that each command (SELECT, DELETE, etc.) sees a snapshot of the database, but the next command could see a different snapshot. I believe this means I can't do something like this:
from django.core.exceptions import ObjectDoesNotExist
from django.db.transaction import commit_on_success

# I believe this won't work
@commit_on_success
def view(request):
    try:
        book = Book.objects.get(id=1)
        # Possibility that the instance is deleted by the other request
        # before we get to the next delete()
        book.delete()
    except ObjectDoesNotExist:
        # Already been deleted
        pass
Any ideas?
You can put the constraint right into the SQL DELETE statement by using QuerySet.delete instead of Model.delete:
Book.objects.filter(pk=1).delete()
This will never issue the SELECT query at all, just something along the lines of:
DELETE FROM Book WHERE id=1;
That handles the race condition of two concurrent requests deleting the same record at the same time, but it doesn't tell you whether your delete got there first. For that you would have to get a raw cursor (which Django lets you do), .execute() the above DELETE yourself, and then read the cursor's rowcount attribute, which will be 0 if you didn't wind up deleting anything.
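A sketch of that raw-cursor approach; the app_book table name is a hypothetical default (Django names tables appname_modelname, so check Book._meta.db_table for the real one). In newer Django versions QuerySet.delete() also returns the number of rows deleted, which may be all you need.
from django.db import connection

def delete_book(book_id):
    # Returns True only for the request whose DELETE actually removed the row.
    with connection.cursor() as cursor:
        cursor.execute("DELETE FROM app_book WHERE id = %s", [book_id])
        return cursor.rowcount == 1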
