Is there a way to improve query time across relations? - graph-databases

Is there a way to improve query time across relations? I have yet to measure, but I'm currently modeling my data in Memgraph in such a way that I will have several versions, and a typical query will look like:
MATCH (:Version { major: $major, minor: $minor })--(t:Thing { name: $name })
Of course, I can create an index over major/minor and another over name, but I worry that, since nearly all queries will combine a composite index lookup with a relationship traversal, there will still be some performance impact.
I will need to measure this, but I wanted to see whether someone already has some insight for me, and whether there is something I can do to improve this. Also, there are < 100 :Version nodes and > 100,000 :Thing nodes, so the degree of the :Version nodes will be very high.
I'm wondering if I should convert the version into a single field on :Thing, for better indexing.

After exploring, I would say that the current model looks fine. It would be good to have an index on the name of the :Thing nodes, for sure; creating one on :Version is questionable and should be measured.
Use EXPLAIN/PROFILE on the query to see the different options and test the performance.
But I think
MATCH (t:Thing {name : $name})--(:Version {major: $major, minor: $minor})
with an index on :Thing(name) should be enough.
I did reverse the order of the pattern, but that shouldn't actually matter.
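For reference, a minimal sketch of creating that index and checking the plan in Memgraph-style Cypher (labels and properties assumed from the question):
CREATE INDEX ON :Thing(name);
PROFILE MATCH (t:Thing {name: $name})--(:Version {major: $major, minor: $minor}) RETURN t;
The PROFILE output should show the expansion starting from the indexed :Thing lookup rather than scanning all :Version nodes.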

Related

A field with a big array in MongoDB

I am a beginner with Mongo and I made a database with the following topology: some metadata fields and one field that contains the experiment results, a vector of integers with ~150,000 values.
status = db.DataTest.insert_one(
    {
        "person_num": num,
        "life_cycle": cycle,
        "other_metadata": meta_data,
        "results_of_experiment": big_array
    }
)
I inserted something like 7,500 of those documents.
They occupy 8 GB of storage and find operations are really slow.
I don't need to search by the experiment results, only the option to retrieve them from the DB as a chunk of data.
Is there another way to store the experiment results in the DB?
Is using GridFS relevant to this case, and is it not too complicated?
Based on your comments, the most common query is
db.DataTest.find( { "life_cycle": { $gt: 800 } }).limit(5)
Without an index on the life_cycle field, MongoDB is forced to do a collection scan; that is, it fetches and evaluates all documents in your collection one by one. In a large collection, this will take a long time.
MongoDB does not create indexes automatically. You would have to observe your most common queries, and create indexes to support those queries. As far as I know, there is no automatic index creation in any database software; SQL, NoSQL, or otherwise.
Database indexing is a deep subject and cannot be explained in a short answer.
Having said that, if you create an index on the life_cycle field, it should improve your query times but only for the query you posted above. Other query types would likely require different indexes. You can do so in the mongo shell:
db.DataTest.createIndex({life_cycle: 1})
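To confirm the index is actually being used, you can inspect the query plan (a sketch, assuming the collection and field above):
db.DataTest.find( { "life_cycle": { $gt: 800 } } ).limit(5).explain("executionStats")
In the winning plan you want to see an IXSCAN stage instead of a COLLSCAN.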
I encourage you to read these pages to understand more about indexing in MongoDB:
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/applications/indexes/
https://docs.mongodb.com/manual/tutorial/create-indexes-to-support-queries/

Referential Integrity with Neo4j

I am working on a project that uses a graph database to hold click data for a search engine. The nodes can be search terms or URLs, and the edges hold a weight attribute and a percentage of times that search led to someone clicking that URL:
Number of times the URL was clicked / Number of times term was searched
My issue is that when I update the edges, the percentage will be accurate, but if I later update the search term node and the searched count changes, the edge will no longer have the correct percentage. Is there a way in Neo4j to keep referential integrity, like a foreign-key type of thing?
The following info might be helpful.
If you stored the number of clicks instead of the percentage, there would be no way to get inconsistent data. For example:
(:Term {id: 1, nSearches: 123})-[:HAS_URL {weight: 2, nClicks: 17}]->(:Url {id: 2})
With this data model, you'd calculate the percentage whenever you needed it.
For example, to find the 10 terms that have the highest percentage of visits to a specific URL:
MATCH (term:Term)-[r:HAS_URL]->(url:Url {id: 2})
RETURN url, term
ORDER BY toFloat(r.nClicks) / term.nSearches DESC
LIMIT 10;
But notice that the inverse query (find the 10 URLs that have the highest percentage of visits from a specific term) does not even require that you calculate the percentage! This is because in this case the percentages all have the same denominator. So, you can just use nClicks for sorting:
MATCH (term:Term {id: 1})-[r:HAS_URL]->(url:Url)
RETURN term, url
ORDER BY r.nClicks DESC
LIMIT 10;
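To illustrate why this model stays consistent, here is a sketch of how the counters could be maintained (parameter names are assumed); no stored percentage ever needs recomputing:
// a search for a term happened
MATCH (term:Term {id: $termId})
SET term.nSearches = term.nSearches + 1;
// a click from that term to a URL happened
MATCH (term:Term {id: $termId}), (url:Url {id: $urlId})
MERGE (term)-[r:HAS_URL]->(url)
ON CREATE SET r.nClicks = 1
ON MATCH SET r.nClicks = r.nClicks + 1;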
Unfortunately no, Neo4j doesn't support this. You can still do it, with one of two methods. I'll tell you what they both are, then make a recommendation.
Relative to your relational database background, I don't think you're looking for a foreign key or "referential integrity"; I think what you're looking for is more like a trigger. A trigger is like a function or procedure that executes when data changes. In your case, it'd probably be good to have trigger functions that recalculate all of the weight percentages on incident edges.
Option 1 - The capable Max De Marzi has you covered with a description of how you can do triggers in Neo4j. Spoiling the surprise, there's a TransactionEventHandler in the Java API. When the right kind of transaction comes through, you can catch it and do extra stuff.
Option 2 - The server provides an extension/plugin mechanism, so you could write this on your own. This is a big hammer; it can do just about anything, but it's harder to wield, too.
I'd recommend you look into Max's post and the TransactionEventHandler. You might then implement public void afterCommit(TransactionData transactionData, Object o). In that method, you'd check the transaction data to see whether it is something of interest (not all transactions will be). If the transaction updated a search term node or changed a searched count, do your recomputation, fix your weights, and you should be good.
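A rough sketch of what such a handler could look like against the older embedded Java API (the class name and the recomputation helper are made up; only TransactionEventHandler, TransactionData, PropertyEntry and Node come from Neo4j):

import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.event.PropertyEntry;
import org.neo4j.graphdb.event.TransactionData;
import org.neo4j.graphdb.event.TransactionEventHandler;

public class WeightRecalcHandler implements TransactionEventHandler<Object> {

    @Override
    public Object beforeCommit(TransactionData data) {
        return null; // nothing to prepare before the commit
    }

    @Override
    public void afterCommit(TransactionData data, Object state) {
        // Only react when a searched count changed on a node.
        for (PropertyEntry<Node> entry : data.assignedNodeProperties()) {
            if ("nSearches".equals(entry.key())) {
                recomputeWeights(entry.entity()); // hypothetical helper
            }
        }
    }

    @Override
    public void afterRollback(TransactionData data, Object state) {
        // nothing to clean up
    }

    private void recomputeWeights(Node term) {
        // Re-derive the percentage on each incident edge from the stored counts.
    }
}

It would be registered once at startup via graphDb.registerTransactionEventHandler(new WeightRecalcHandler()).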

Salesforce - SOQL Use mod() or similar math functions in SELECT?

I'm trying to query all Opportunities that have a Price that is not a whole number (it has decimals) or whose price is not a multiple of 10.
I'm trying to find prices like: U$S 34,801.23 - U$S 56,103.69 - U$S 50,000.12 etc.
But not: U$S 49,500.00 - U$S 19,110.00 etc.
There are a lot of Opportunities in the database, and I can go through them in code, but I wanted to know if any of you can think of a way to achieve at least a part of this with a query.
SOQL won't let you do any calculations. Check docs here and here:
You must supply a native value—other field names or calculations are not permitted
Generally speaking it's fieldName = value or datefieldname > YESTERDAY (few special literals for date handling).
The easiest option would be to create a formula field on the record (think of them like calculated columns in views in normal databases) with your logic. It could be of text type; call it "weird price" ;) Here's the complete formula functions reference - you have MOD(), IF(condition, true, false) and other goodies.
The only caveat is that you can't GROUP BY a formula (there are also tricks to bypass that ;))
So yeah - if it's a one-time thing with poor reusability, filter with code. If you can accept the low cost (formulas don't use storage, but you might need the field for other purposes), create a field.
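For instance, a checkbox formula field (the Weird_Price__c API name and the use of the standard Amount field are assumptions here) could be:
OR(
  MOD(Amount, 1) <> 0,
  MOD(Amount, 10) <> 0
)
The first MOD catches decimals, the second catches whole numbers that aren't multiples of 10, and the filter then becomes a plain SOQL comparison:
SELECT Id, Name, Amount FROM Opportunity WHERE Weird_Price__c = true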

SSIS/VB.NET Equivalent of SQL IN (anonymous array.Contains())

I've got some SQL which performs complex logic on combinations of GL account numbers and cost centers like this:
WHEN (#IntGLAcct In (
882001, 882025, 83000154, 83000155, 83000120, 83000130,
83000140, 83000157, 83000010, 83000159, 83000160, 83000161,
83000162, 83000011, 83000166, 83000168, 83000169, 82504000,
82504003, 82504005, 82504008, 82504029, 82530003, 82530004,
83000000, 83000100, 83000101, 83000102, 83000103, 83000104,
83000105, 83000106, 83000107, 83000108, 83000109, 83000110,
83000111, 83000112, 83000113, 83100005, 83100010, 83100015,
82518001, 82552004, 884424, 82550072, 82552000, 82552001,
82552002, 82552003, 82552005, 82552012, 82552015, 884433,
884450, 884501, 82504025, 82508010, 82508011, 82508012,
83016003, 82552014, 81000021, 80002222, 82506001, 82506005,
82532001, 82550000, 82500009, 82532000))
Overall, the whole thing performs poorly in a UDF, especially when it's all nested and the order of the steps matters, etc. I can't make it table-driven just yet, because the business logic is so terribly convoluted.
So I'm doing a little exploratory work in moving it into SSIS to see about doing it in a little bit of a different way. Inside my script task, however, I've got to use VB.NET, so I'm looking for an alternative to this:
Select Case IntGLAcct = 882001 OR IntGLAcct = 882025 OR ...
Which is obviously a lot more verbose, and would make it terribly hard to port the process.
Even something like ({90605, 90607, 90610} AS List(Of Integer)).Contains(IntGLAcct) would be easier to port, but I can't get the initializer to give me an anonymous array like that. And there are so many of these little collections, I'm not sure I can create them all in advance.
It really all NEEDS to be in one place. The business changes this logic regularly. My strategy was to use the UDF to mirror their old "include" file, but performance has been poor. Now each of the functions takes just two or three parameters. It turns out that in a dark corner of the existing system they actually build a multi-million-row table of all these results - even though the pre-calced table is not used much.
So my new experiment is (since I'm still building the massive cross-join table to reconcile that part of the process) to go ahead and use the table instead of the code, but populate this table during an SSIS phase instead of calling the UDF 12 million times - because my UDF version basically stopped working within a reasonable time frame and the DBAs are not much help right now. Yet I know that SSIS can process these rows pretty efficiently - because each month I bring in dozens of multi-million-row tables of known good results from the legacy system in minutes AND run queries to reconcile that there are no differences with the new versions.
The SSIS code would theoretically become the keeper of the business logic, and the efficient table would be built from that (based on all known parameter combinations). Of course, if I can simplify the logic down to a real logic table, that would be the ultimate design - but that's not really foreseeable at this point.
Try this:
Array.IndexOf(New Integer() {90605, 90607, 90610}, IntGLAcct) > -1
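If the same groups of accounts get reused across many rules, a sketch like the following (the OverheadAccounts/IsOverhead names are made up, and it assumes Imports System.Linq and System.Collections.Generic) defines each group once and gives fast lookups; the inline-array form from the question also works via LINQ's Contains():
Private Shared ReadOnly OverheadAccounts As New HashSet(Of Integer)( _
    New Integer() {882001, 882025, 83000154, 83000155, 83000120})

Private Function IsOverhead(ByVal IntGLAcct As Integer) As Boolean
    ' HashSet.Contains is a constant-time lookup, useful when the lists are long.
    Return OverheadAccounts.Contains(IntGLAcct)
End Function

' One-off checks, using LINQ's Contains on an anonymous array:
' If (New Integer() {90605, 90607, 90610}).Contains(IntGLAcct) Then ...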
What if you used a conditional split transform on your incoming data set and then used expressions or something similar (I'm not sure if your GL Accounts are fixed or if you're going to dynamically pass them in) to apply to the results? You can then take the resulting data from that and process as necessary.

Autocomplete Dropdown - too much data, timing out

So, I have an autocomplete dropdown with a list of townships. Initially I just had the 20 or so that we had in the database... but recently we have noticed that some of our data lies in other counties... even other states. So, the answer to that was to buy one of those databases with all the towns in the US (yes, I know, geocoding is the answer, but due to time constraints we are doing this until we have time for that feature).
So, when we had 20-25 towns the autocomplete worked stellarly... now that there are 80,000 it's not as easy.
As I type this, I am thinking that the best way to do it is to default to one state, so there will be much less data. I will add a state selector to the page that defaults to NJ; then you can pick another state if need be, which will narrow down the list to < 1000. Though I may still have the same issue? Does anyone know of a workaround for an autocomplete with a lot of data?
should I post teh codez of my webservice?
Are you trying to autocomplete after only 1 character is typed? Maybe wait until 2 or more...?
Also, can you just return the top 10 rows, or something?
Sounds like your application is suffocating on the amount of data being returned, and then attempted to be rendered by the browser.
I assume that your database has the proper indexes, and you don't have a performance problem there.
I would limit the results of your service to no more than, say, 100 results. Users will not look at any more than that anyhow.
I would also only begin retrieving the data from the service once 2 or 3 characters are entered, which will further reduce the scope of the query.
Good Luck!
Stupid question maybe, but... have you checked to make sure you have an index on the town name column? I wouldn't think 80K names should be stressing your database...
I think you're on the right track. Use a series of cascading inputs, State -> County -> Township where each succeeding one grabs the potential population based on the value of the preceding one. Each input would validate against its potential population to avoid spurious inputs. I would suggest caching the intermediate results and querying against them for the autocomplete instead of going all the way back to the database each time.
If you have control of the underlying SQL, you may want to try several "UNION" queries instead of one query with several "OR like" lines in its where clause.
Check out this article on optimizing SQL.
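A sketch of that idea, assuming a towns table searched on two columns (the table and column names are made up):
-- instead of: WHERE town LIKE @q + '%' OR county LIKE @q + '%'
SELECT TOP 10 town AS name FROM towns WHERE town LIKE @q + '%'
UNION
SELECT TOP 10 county FROM towns WHERE county LIKE @q + '%'
ORDER BY name;
Each branch can use its own index seek, whereas the OR version often forces a scan.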
I'd just limit the SQL query with a TOP clause. I also like using a "less than" instead of a like:
select top 10 name from cities where #partialname < name order by name;
that "Ce" will give you "Cedar Grove" and "Cedar Knolls" but also "Chatham" & "Cherry Hill" so you always get ten.
In LINQ:
var q = (from c in db.Cities
         where string.Compare(partialname, c.Name) < 0  // C# has no < operator for strings; Compare translates to SQL
         orderby c.Name
         select c.Name).Take(10);
