ArangoDB: Insert as function of query by example - query-optimization

Part of my graph is constructed using a giant join between two large collections, and I run it every time I add documents to either collection.
The query is based on an older post.
FOR fromItem IN fromCollection
  FOR toItem IN toCollection
    FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
    INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {} } INTO edgeCollection
This takes about 55,000 seconds to complete for my dataset. I would absolutely welcome suggestions for making that faster.
But I have two related issues:
I need an upsert. Normally, upsert would be fine, but in this case, since I have no way of knowing the key up front, it wouldn't help me. To get the key up front, I would need to query by example to find the key of the otherwise identical, existing edge. That seems reasonable as long as it doesn't kill my performance, but I don't know how to construct my AQL query conditionally so that it inserts an edge if the equivalent edge does not exist yet, but does nothing if the equivalent edge does exist. How can I do this?
I need to run this every time data gets added to either collection. I need a way to run it only on the newest data so that it doesn't try to join the entire collection. How can I write AQL that joins only the newly inserted records? They're added with arangoimp, and I have no guarantee of the order in which they'll be imported, so I cannot create the edges at the same time as I create the nodes. How can I join only the new data? I don't want to spend 55k seconds every time a record is added.

If you run your query as written without any indexes, then it will have to do two nested full collection scans, as can be seen by looking at the output of
db._explain(<your query here>);
which shows something like:
1 SingletonNode 1 * ROOT
2 EnumerateCollectionNode 3 - FOR fromItem IN fromCollection /* full collection scan */
3 EnumerateCollectionNode 9 - FOR toItem IN toCollection /* full collection scan */
4 CalculationNode 9 - LET #3 = (fromItem.`fromAttributeValue` == toItem.`toAttributeValue`) /* simple expression */ /* collections used: fromItem : fromCollection, toItem : toCollection */
5 FilterNode 9 - FILTER #3
...
If you do
db.toCollection.ensureIndex({ type: "hash", fields: ["toAttributeValue"], unique: false });
Then there will be a single full collection scan over fromCollection, and for each item found there is a hash index lookup in toCollection, which will be much faster. Everything happens in batches, so this alone should already improve the situation considerably. db._explain() will then show:
1 SingletonNode 1 * ROOT
2 EnumerateCollectionNode 3 - FOR fromItem IN fromCollection /* full collection scan */
8 IndexNode 3 - FOR toItem IN toCollection /* hash index scan */
Working only on recently inserted items in fromCollection is relatively easy: simply add an import timestamp to all vertices and use (with @lastRun passed in as a bind parameter):
FOR fromItem IN fromCollection
  FILTER fromItem.timeStamp > @lastRun
  FOR toItem IN toCollection
    FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
    INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {} } INTO edgeCollection
and of course put a skiplist index on the timeStamp attribute in fromCollection.
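A minimal sketch of creating that index from arangosh (collection and attribute names match the query above):
db.fromCollection.ensureIndex({ type: "skiplist", fields: ["timeStamp"], unique: false });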
This should work beautifully to discover new vertices in the fromCollection. It will "overlook" new vertices in the toCollection that are linked to old vertices in fromCollection.
You can discover these by interchanging the roles of fromCollection and toCollection in your query (do not forget the index on fromAttributeValue in fromCollection), remembering to only add edges whose from vertex is old, as in:
FOR toItem IN toCollection
  FILTER toItem.timeStamp > @lastRun
  FOR fromItem IN fromCollection
    FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
    FILTER fromItem.timeStamp <= @lastRun
    INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {} } INTO edgeCollection
These two together should do what you want. Please find the fully worked example here.
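For the first issue (insert an edge only if an equivalent edge does not exist yet), here is a minimal sketch using AQL's UPSERT, which matches on an example document rather than a key, so there is no need to know the key up front. The attribute names mirror the query above, and it can be combined with the timestamp filters as needed:
FOR fromItem IN fromCollection
  FOR toItem IN toCollection
    FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
    UPSERT { _from: fromItem._id, _to: toItem._id }
      INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {} }
      UPDATE {} IN edgeCollection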

Related

Long calculation times with XLOOKUP vs INDEX-MIN-COLUMN

I'm using this formula =IF(B24="","",IFERROR(INDEX(Sheet3!$C$3:$EE$3,,MIN(IF(Sheet3!$C$4:$EE$23=(Sheet2!C24&$K$18),COLUMN(Sheet3!$C:$EE)))-2),"NF")) to return a cell value in the top row of an array - a date in this case.
The search criterion is a combination of a unique project number and a 2-character alphanumeric status code for the project. The array consists of 23 rows in which combinations of the unique numbers are found, each with different status codes.
So essentially, I'm building a FILTERED project status dashboard that returns dates linked to the relevant project status.
The code above is inspired by ( LINK ), which uses a very similar layout, but with town suburbs linked to postal codes instead of project numbers and status codes. The formula works well (though not entered as an array formula), but I don't have a single instance of it in the sheet - I have 3 300 occurrences of this formula.
The problem comes when the user changes the FILTER - Excel recalculates the entire dashboard, which takes anywhere from 2 to 5 minutes. You can hit Escape to cancel the calculation after setting the filter, but Excel just starts calculating again after a few seconds. After that, Excel's response is sluggish and almost unusable. Yes - our hardware is pretty weak ...
I tried XLOOKUP as well, but I can't set the lookup_array to the whole range ( Sheet3!$C$4:$EE$23 ) because its shape doesn't match the return_array ( Sheet3!$C$3:$EE$3 ). Concatenating the lookup arrays with & works, but then you'd have to do that for all 23 rows, and again multiply that by 3 300.
I thought of creating a UDF, but the function will still be called every time Excel recalculates after filtering... 3 300 calls ...
Any ideas on how to make the INDEX version run faster, or make the XLOOKUP accept the lookup_array as Sheet3!$C$4:$EE$23 in the hopes that it'll run faster?
Thank you!
Not really an elegant solution, but it works.
I imported the dataset into a helper sheet, where I combined each cell value with the corresponding value in column A for that row (a name in this case) and the date from row 1 for that column, using an underscore as the delimiter.
This new data range was then given a unique name, EE in this case.
On a second helper sheet, I used the formula =INDEX(Filtered,1+INT((ROW('Sheet1'!C3)-1)/COLUMNS(Filtered)),MOD(ROW('Sheet1'!C3)-1+COLUMNS(Filtered),COLUMNS(Filtered))+1) and dragged it down until it returned a #REF! error, then went back one row before the error.
This transposes all the data into a single column G. Using =UNIQUE(SORT(FILTER(B3:B3240,B3:B3240<> "",""))) then gives me a filtered list of unique values in column H, against which I then run
=IF(H3="","",LEFT(H3, SEARCH("_",H3,1)-1)) for the first data value in I, and
=IF(H3="","",MID(H3, SEARCH("_",H3) + 1, SEARCH("_",H3,SEARCH("_",H3)+1) - SEARCH("_",H3) - 1)) for the middle data value in J, and
=IF(H3="","",IFERROR(TEXT(RIGHT(H3,5),"yyyy-mm-dd"),"NF")) for the last data value in K.
Then just run XLOOKUP across columns I, J and K.
It runs quickly and easily and solves a few of the other issues I had as well.
The second data set has just over 35 000 rows - still works well and fast.
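A minimal sketch of that final lookup, assuming the combined project-number-and-status value ended up in column I and the date in column K (the ranges are placeholders; the lookup value is built as in the original formula):
=XLOOKUP(Sheet2!C24&$K$18, $I$3:$I$5000, $K$3:$K$5000, "NF")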

Keep order in a key-value use case with additions and deletions

I receive a big list of short hashes from an API that I need to store (4 characters each in the examples below).
The list must always be ordered alphanumerically.
Here's how things are flowing :
I get the full list of hashes from the API (Approx 1,000,000). I store this list.
I request the API periodically to get a list of additions and deletions to the list.
For now I store the hash list with an 'order' column in a table:
0 - 6717
1 - 7fcd
2 - 88c6
3 - 9e63
4 - dcb0
5 - fb44
Now let's say I receive the deletions :
1
4
And the additions :
7bd7
0e33
I need to delete the rows 1 and 4 :
0 - 6717
2 - 88c6
3 - 9e63
5 - fb44
And I need to add the additions to the list AND rebuild the order column to keep the alphanumeric order, so that I can do this again:
0 - 0e33
1 - 6717
2 - 7bd7
3 - 88c6
4 - 9e63
5 - fb44
I need this for a PHP Symfony application. I have implemented it with MySQL, but it's pretty slow to create the full list and to rebuild the order column...
As I have a key->value dataset, Redis seems to be a good choice, but there is no bulk rename function for keys.
I am also thinking about MongoDB, creating one document per hash, but I'm not really sure.
What would you do? Thanks
In the relational data model (the "YesSQL" world), a table is an unordered set of rows. Hence, the order of stored values is unpredictable "by design" in the general case (leaving clustered indexes aside). An ordered result is only guaranteed when using an ORDER BY clause:
SELECT key, value FROM my_store WHERE ... ORDER BY key
For performance you need an index/primary key/unique constraint (depending on the DBMS and database design) on the affected column(s). 1M rows is a relatively small amount which shouldn't cause any performance issues. Also be aware of data/index fragmentation when deletions and insertions are frequent.
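A minimal sketch in MySQL, assuming a table named hash_store (the name and the CHAR(4) length are assumptions based on the example values): the alphanumeric order comes from the primary key plus ORDER BY, so nothing has to be renumbered after additions and deletions.
CREATE TABLE hash_store (
    hash CHAR(4) NOT NULL,
    PRIMARY KEY (hash)
);
-- apply a periodic batch of deletions and additions
DELETE FROM hash_store WHERE hash IN ('7fcd', 'dcb0');
INSERT INTO hash_store (hash) VALUES ('7bd7'), ('0e33');
-- read the list back in alphanumeric order
SELECT hash FROM hash_store ORDER BY hash;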

Remove duplicate values based on timestamp

I need your help with an SQL query that has to remove duplicate entries from a table, using the datestamp column as the criterion, in two passes.
The DBMS in question is Microsoft SQL Server.
Here is a little more details:
Terminology: a module is basically a group of single-machine workplaces on which users operate.
Table:
The ModNam column is fixed: there are 15 modules from M A01 to M A15, then the B row from M B01 to M B15, and so on up to row F.
The Pos column is irrelevant at the moment.
The MdCod column holds the code of the machine added to a position in a certain module. It can be replaced by another machine at any given time.
I have one query that will be inserting data into this table by copying entries from another table, every time a new machine is added to one of the positions.
The tricky part for me is a second query that should compare records in two passes:
1) Within the same module (first pass of the query, shown in red in the attached example pic): if the ModNam value is the same and the MdCod matches between entries, the entry with the most recent datestamp stays and the other duplicates are deleted.
2) Across different modules (second pass of the query, shown in purple in the attached example pic): if the ModNam values are different but the MdCod matches between entries, the entry with the most recent datestamp stays and the other duplicates are deleted.
Please help and advise.
Example pic (updated):
Thank you all in advance.
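If the two passes together amount to keeping only the most recent entry per MdCod (whether the duplicates sit in the same module or in different ones), a single statement can cover both. A minimal T-SQL sketch, assuming the table is named Modules and the timestamp column is named DateStamp (both names are assumptions):
WITH ranked AS (
    SELECT MdCod, DateStamp,
           ROW_NUMBER() OVER (PARTITION BY MdCod ORDER BY DateStamp DESC) AS rn
    FROM Modules
)
-- keep the most recent row per machine code, delete the rest
DELETE FROM ranked WHERE rn > 1;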

Returning only specific rows (eg. every 10th: #1, #11, #21...) from query

I need to fetch only specific (kind of "nth rows") from a Solr index. For example, if the full result contains 10000 rows, I want to receive only the first and last row of each 100 item bucket.
items 1 and 100
items 101 and 200
items 201 and 300...
This grouping is dynamic and dependent on the number of results. So, if there are only 5000 total result rows, bucket size is 50 instead of 100. I can calculate the actual indexes but the problem is how to fetch those from Solr.
There are no indexed fields that could be used directly as query parameters. In practice, I am doing a search like "name starts with A" (or some other letter) and want to receive the 1st item starting with A, the 100th item starting with A, the 101st item starting with A, and so on...
The query parameters ( http://wiki.apache.org/solr/CommonQueryParameters ) include "rows" and "start", but these can't skip items, so I would need to fetch each item with a separate query, which is inefficient. I was also thinking about implementing a filter query that would simply filter out items 2...99, 102...199, and so on, but I do not know how to implement that.
I don't know of an easy way to do this, but this will reduce the amount of data that needs to be passed back and forth: Do a regular query with the usual start and rows parameters, but tell Solr to only return the ID field of each document (via the fl parameter). In your client code, store the IDs of the first and last documents, and repeat the query with the next value for start. Once you reach the end of the search results, you have a list of the document IDs you want. Run a new query and give it the list of document IDs you want returned, and this time get the full documents.
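A rough sketch of the two phases with standard query parameters (host, core, and field names are placeholders; fl restricts the response to the id field):
Phase 1, page through the result and collect only the ids of the first and last document of each bucket:
http://localhost:8983/solr/select?q=name:A*&sort=name+asc&fl=id&start=0&rows=100
http://localhost:8983/solr/select?q=name:A*&sort=name+asc&fl=id&start=100&rows=100
...
Phase 2, fetch the full documents for the collected boundary ids (placeholder ids shown):
http://localhost:8983/solr/select?q=id:(id_1 OR id_100 OR id_101 OR id_200)&rows=1000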

Get "surrounding" rows in NHibernate query

I am looking for a way to retrieve the "surrounding" rows in an NHibernate query, given a primary key and a sort order.
E.g. I have a table with log entries and I want to display the entry with primary key 4242 and the previous 5 entries as well as the following 5 entries ordered by date (there is no direct relation between date and primary key). Such a query should return 11 rows in total (as long as we are not close to either end).
The log entry table can be huge and retrieving all to figure it out is not possible.
Is there such a concept as a row number that can be used from within NHibernate? The underlying database is going to be either SQLite or Microsoft SQL Server.
Edit: added a sample.
Imagine data such as the following:
Id Time
4237 10:00
4238 10:00
1236 10:01
1237 10:01
1238 10:02
4239 10:03
4240 10:04
4241 10:04
4242 10:04 <-- requested "center" row
4243 10:04
4244 10:05
4245 10:06
4246 10:07
4247 10:08
When requesting the entry with primary key 4242, we should get the rows 1237, 1238 and 4239 to 4247. The order is by Time, then Id.
Is it possible to retrieve the entries in a single query (which can obviously include subqueries)? Time is a non-unique column, so several entries share the same value, and in this example it is not possible to increase its resolution in a way that makes it unique!
"there is no direct relation between date and primary key" means, that the primary keys are not in a sequential order?
Then I would do it like this:
Item middleItem = Session.Get<Item>(id);
IList<Item> previousFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Le("Time", middleItem.Time))
    .AddOrder(Order.Desc("Time"))
    .SetMaxResults(5)
    .List<Item>();
IList<Item> nextFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Gt("Time", middleItem.Time))
    .AddOrder(Order.Asc("Time"))
    .SetMaxResults(5)
    .List<Item>();
There is the risk of having several items with the same time.
Edit
This should work now.
Item middleItem = Session.Get<Item>(id);
IList<Item> previousFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Le("Time", middleItem.Time))          // less than or equal
    .Add(Expression.Not(Expression.IdEq(middleItem.Id)))  // but not the middle item itself
    .AddOrder(Order.Desc("Time"))
    .SetMaxResults(5)
    .List<Item>();
IList<Item> nextFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Gt("Time", middleItem.Time))          // strictly greater
    .AddOrder(Order.Asc("Time"))
    .SetMaxResults(5)
    .List<Item>();
This should be relatively easy with NHibernate's Criteria API:
List<LogEntry> logEntries = session.CreateCriteria(typeof(LogEntry))
.Add(Expression.InG<int>(Projections.Property("Id"), listOfIds))
.AddOrder(Order.Desc("EntryDate"))
.List<LogEntry>();
Here your listOfIds is just a strongly typed list of integers representing the ids of the entries you want to retrieve (integers 4242-5 through 4242+5 ).
Of course you could also add Expressions that let you retrieve Ids greater than 4242-5 and smaller than 4242+5.
Stefan's solution definitely works, but a better way exists using a single select and nested subqueries:
ICriteria crit = NHibernateSession.CreateCriteria(typeof(Item));
DetachedCriteria dcMiddleTime =
DetachedCriteria.For(typeof(Item)).SetProjection(Property.ForName("Time"))
.Add(Restrictions.Eq("Id", id));
DetachedCriteria dcAfterTime =
DetachedCriteria.For(typeof(Item)).SetMaxResults(5).SetProjection(Property.ForName("Id"))
.Add(Subqueries.PropertyGt("Time", dcMiddleTime));
DetachedCriteria dcBeforeTime =
DetachedCriteria.For(typeof(Item)).SetMaxResults(5).SetProjection(Property.ForName("Id"))
.Add(Subqueries.PropertyLt("Time", dcMiddleTime));
crit.AddOrder(Order.Asc("Time"));
crit.Add(Restrictions.Eq("Id", id) || Subqueries.PropertyIn("Id", dcAfterTime) ||
Subqueries.PropertyIn("Id", dcBeforeTime));
return crit.List<Item>();
This is NHibernate 2.0 syntax, but the same holds true for earlier versions, where you use Expression instead of Restrictions.
I have tested this in a test application and it works as advertised.
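For reference, the intent of that single-query approach written directly as T-SQL (a sketch only, with table and column names assumed; it is not the literal SQL that NHibernate generates):
SELECT *
FROM Item
WHERE Id = @id
   OR Id IN (SELECT TOP 5 Id FROM Item
             WHERE Time > (SELECT Time FROM Item WHERE Id = @id)
             ORDER BY Time ASC)
   OR Id IN (SELECT TOP 5 Id FROM Item
             WHERE Time < (SELECT Time FROM Item WHERE Id = @id)
             ORDER BY Time DESC)
ORDER BY Time ASC;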
