OrientDB slow performance for depth-3 traversal - heap memory

I'm doing some traversals on a graph with a dataset of 11M edges and 20,000 nodes.
Using OrientDB, which is supposed to be faster than Neo4j for deep traversals, I'm seeing slow performance.
The two queries used are:
TRAVERSE out("R") FROM #21:39161 MAXDEPTH 3 STRATEGY BREADTH_FIRST
And
traverse * from (select * from l where p='25440') while $depth<=3
The first query returns this error:
[OServer]java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid3920.hprof ...
Heap dump file created [911684234 bytes in 26,730 secs]
The second has been running for about half an hour and has not returned.
Where is the problem?
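One way to confirm whether the problem is simply the size of the fully materialized result set is to bound it and fetch the traversal in pages. The sketch below is a minimal example, assuming the OrientDB 2.x Java document API; the connection URL, credentials, and page size are placeholders, and note that SKIP-based paging re-evaluates the traversal for each page, so it trades memory for time.

import java.util.List;
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

public class PagedTraverse {
    public static void main(String[] args) {
        // Hypothetical connection details; adjust the URL and credentials for your setup.
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("remote:localhost/mydb").open("admin", "admin");
        try {
            final int pageSize = 10_000;
            int skip = 0;
            while (true) {
                // Wrapping the TRAVERSE in a SELECT lets SKIP/LIMIT bound how much is materialized per round trip.
                List<ODocument> page = db.query(new OSQLSynchQuery<ODocument>(
                        "SELECT FROM (TRAVERSE out('R') FROM #21:39161 MAXDEPTH 3 STRATEGY BREADTH_FIRST)"
                                + " SKIP " + skip + " LIMIT " + pageSize));
                if (page.isEmpty()) break;
                // Process each page here instead of accumulating every record in the heap.
                skip += page.size();
            }
        } finally {
            db.close();
        }
    }
}

If even a single small page is slow, the bottleneck is the traversal itself rather than the client-side materialization of results.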

Related

A performance issue resulting from "limit 0" in TDengine database

A limit 0 clause is suspected to trigger a full-table-query bug.
After switching from 2.6.0.32 to 3.0.2.1 today, we found that CPU usage on the three nodes (each node has a 32-core CPU) exceeded 90%, whereas in the original 2.6.0.32 environment CPU usage never exceeded 10%.
Checking with show queries, we found two statements of the form select * from t XXX limit 0;
Between the two environments the query time differs by a factor of more than 30,000 (timings were captured in both 3.0.2.1 and 2.6.0.32).
Next, after changing the statement to select * from t XXX limit 1 in the 3.0.2.1 environment, the elapsed time dropped from 74 seconds to 0.03 seconds and returned to normal.
Finally, comparing the two environments again: once the CPU usage of 3.0 came back down, its query speed also improved.
The configuration and table structure of the two environments are identical. As a detail, the 18 ms query time in 2.6.0.32 is still lower than the 30 ms in 3.0.2.1.

Redshift: many small nodes vs. fewer bigger nodes

Recently I have been facing cluster restarts (outside the maintenance window, at arbitrary times) in AWS Redshift, triggered from the AWS end. They have not been able to identify the exact root cause of the reboots. The error the AWS team captured is "out of object memory".
In the meantime, as a blind try, I am scaling up the cluster to avoid this out-of-object-memory error. Currently I am using the ds2.xlarge node type, but I am not sure which of the options below I should choose:
Many smaller nodes (increase number of nodes in ds2.xlarge)
Few larger nodes (change to ds2.8xlarge and have less number but increased capacity)
Has anyone faced a similar issue in Redshift? Any advice?
Given those configurations, for better performance in this case you should opt for the ds2.8xlarge node type.
One ds2.xlarge node has 13 GB of RAM and 2 slices to perform your workload, compared with a ds2.8xlarge, which has 244 GB of RAM and 16 slices.
Even if you choose 8 ds2.xlarge nodes, you get at most 104 GB of memory, against 244 GB in a single ds2.8xlarge node.
So you should go with the ds2.8xlarge node type to handle the memory issue, and you get a large amount of storage along with it.
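To make the trade-off concrete, here is a small back-of-the-envelope calculation using the per-node figures quoted in this answer (13 GB / 2 slices for ds2.xlarge, 244 GB / 16 slices for ds2.8xlarge); these numbers come from the answer above, not from AWS documentation.

public class RedshiftSizing {
    public static void main(String[] args) {
        // Per-node figures as quoted in the answer above (not verified against AWS docs).
        int xlargeRamGb = 13, xlargeSlices = 2;
        int bigRamGb = 244, bigSlices = 16;
        int smallNodes = 8; // the "many smaller nodes" option

        System.out.printf("8 x ds2.xlarge : %d GB RAM, %d slices%n",
                smallNodes * xlargeRamGb, smallNodes * xlargeSlices); // 104 GB, 16 slices
        System.out.printf("1 x ds2.8xlarge: %d GB RAM, %d slices%n",
                bigRamGb, bigSlices); // 244 GB, 16 slices
    }
}

Both options end up with the same number of slices in this comparison; the difference is the memory available per slice, which is what matters for memory-bound queries.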

Why am I sometimes getting an OOM when getting all documents from an 800 MB index with 8 GB of heap?

I need to refresh an index governed by Solr 7.4. I use SolrJ to access it on a 64-bit Linux machine with 8 CPUs and 32 GB of RAM (8 GB of heap for the indexing part and 24 GB for the Solr server). The index to be refreshed is around 800 MB in size and contains around 36k documents (according to Luke).
Before starting the indexing process itself, I need to "clean" the index and remove the documents that do not match an actual file on disk (e.g. a document was indexed previously but the file has moved since then, so the user would not be able to open it if it appeared on the results page).
To do so, I first need to get the list of documents in the index:
final SolrQuery query = new SolrQuery("*:*"); // Content fields are not loaded to reduce memory footprint
query.addField(PATH_DESCENDANT_FIELDNAME);
query.addField(PATH_SPLIT_FIELDNAME);
query.addField(MODIFIED_DATE_FIELDNAME);
query.addField(TYPE_OF_SCANNED_DOCUMENT_FIELDNAME);
query.addField("id");
query.setRows(Integer.MAX_VALUE); // we want ALL documents in the index not only the first ones
SolrDocumentList results = this.getSolrClient()
        .query(query)
        .getResults(); // This line sometimes gives OOM
When the OOM occurs on the production machine, it happens during that "index cleaning" part, and the stack trace reads:
Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
at org.noggit.CharArr.resize(CharArr.java:110)
at org.noggit.CharArr.reserve(CharArr.java:116)
at org.apache.solr.common.util.ByteUtils.UTF8toUTF16(ByteUtils.java:68)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:868)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:857)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:266)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:541)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:305)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:747)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:272)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:555)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:307)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:200)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:274)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:178)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:50)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:614)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:957)
I've already removed the content fields from the query because there were OOMs before, so I thought retrieving only "small" fields would avoid them, but they still occur. Moreover, when I started the project for the customer we had only 8 GB of RAM (so a heap of 2 GB); we then increased it to 20 GB (heap of 5 GB), and now to 32 GB (heap of 8 GB), and the OOM still appears, although the index is not that large compared to what is described in other SO questions (featuring millions of documents).
Please note that I cannot reproduce it on my less powerful dev machine (16 GB of RAM, so 4 GB of heap), even after copying the 800 MB index from the production machine.
So to me it looks like a memory leak. That's why I followed the NetBeans post on memory leaks on my dev machine with the 800 MB index. From what I see, there does seem to be a leak, since run after run the number of surviving generations keeps increasing during the "index cleaning" (steep lines in the profiler graph):
What should I do? 8 GB of heap is already a huge amount compared to the index characteristics, so increasing the heap further does not seem to make sense, because the OOM appears only during the "index cleaning", not while actually indexing large documents, and it seems to be caused by the surviving generations. Would creating a new query object and then calling getResults on it help the garbage collector?
Is there another way to get all document paths? Or would retrieving them chunk by chunk (pagination) help, even for this small number of documents?
Any help appreciated
After a while I finally came across this post. It describes my issue exactly:
An out of memory (OOM) error typically occurs after a query comes in with a large rows parameter. Solr will typically work just fine up until that query comes in.
So they advise (emphasis mine):
The rows parameter for Solr can be used to return more than the default of 10 rows. I have seen users successfully set the rows parameter to 100-200 and not see any issues. However, setting the rows parameter higher has a big memory consequence and should be avoided at all costs.
And this is what I see while retrieving 100 results per page:
The number of surviving generations has decreased dramatically, although the garbage collector's activity is much more intensive and the computation time is noticeably higher. But if this is the cost of avoiding the OOM, that is OK (the program loses a few seconds per index update, and updates can last several hours)!
Increasing the number of rows to 500 already makes the memory leak happen again (the number of surviving generations increases):
Please note that setting the row number to 200 did not cause the number of surviving generations to increase much (I did not measure it precisely), but it did not perform much better in my test case (less than 2%) than the "100" setting:
So here is the code I use to retrieve all documents from the index, page by page with a cursor (from Solr's wiki):
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrServer.query(q);
    String nextCursorMark = rsp.getNextCursorMark();
    doCustomProcessingOfResults(rsp);
    if (cursorMark.equals(nextCursorMark)) {
        done = true;
    }
    cursorMark = nextCursorMark;
}
TL;DR: Don't pass too large a number to query.setRows, i.e. nothing greater than 100-200, as a higher number is very likely to cause an OOM.

B+ Tree CPU search time

I was just wondering how you would calculate the worst-case time for an unclustered and a clustered B+ tree.
For example, say I have 1,000,000 records (1 row = 100 bytes), disk pages of 4000 bytes, keys of 20 bytes, and a page access time of 40 ms. How would I calculate the unclustered and clustered B+ tree worst-case times from these variables?
I know that to calculate the height/levels of a B+ tree you use the following (I think):
logF(keys)
where F = the number of branches (the fan-out).
With the height you can calculate the final worst-case time, but I don't know how to do that... I've tried searching around, but all I could find were average-case times or examples that weren't very clear.
Any help is appreciated!
I would say logF(keys) is the worst case for finding the page, but after that the worst case would be an unclustered index with all your RIDs pointing to different pages, which means
logF(keys) + N, where N is the number of RIDs in the index node.
So in the end it would be:
H = height of the tree, which would be around 3 or 4.
H + N = 4 + (4000/20) = 204 I/Os
If, say, the pages are in memory and you want to see the CPU time, then it would be
CPU = 204 * 0.04 = 8.16 secs. Although 40 ms for moving a page that is already in memory seems like quite a lot of time to me (for disk reads it would make sense), the calculations themselves are fine.
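As a sanity check on that arithmetic, here is a short sketch that plugs in the numbers from the question (4000-byte pages, 20-byte keys, 1,000,000 records, 40 ms per page access). The fan-out estimate and the one-RID-per-data-page worst case are the same simplifying assumptions used above.

public class BPlusTreeWorstCase {
    public static void main(String[] args) {
        int pageSize = 4000;          // bytes per disk page
        int keySize = 20;             // bytes per key
        int records = 1_000_000;
        double pageAccessSecs = 0.04; // 40 ms per page access

        int fanout = pageSize / keySize; // F = 200 branches per node
        int height = (int) Math.ceil(Math.log(records) / Math.log(fanout)); // about 3 levels
        // Unclustered worst case: every RID in the leaf points to a different data page.
        int ridsPerLeaf = pageSize / keySize; // about 200 RIDs
        int worstCaseAccesses = height + ridsPerLeaf;

        System.out.printf("fanout=%d height=%d worst-case page accesses=%d time=%.2f s%n",
                fanout, height, worstCaseAccesses, worstCaseAccesses * pageAccessSecs);
    }
}

With these inputs it prints a height of 3 and roughly 203 page accesses (about 8.1 s), in the same ballpark as the 204 I/Os / 8.16 s estimate above.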

Which data structure should I use to represent a large number of records, each representing a range of items?

I'm looking for a software representation for a very large number of records (more than 400K).
Each record has two keys: one for the lower bound and one for the upper bound. These numbers represent a range. Each record also has some piece of information, let's call it I. In other words, each record aggregates common item indexes and carries some common description about them.
My software is given an item number, and I have to retrieve the info about it.
I thought about AVL trees, B-trees, or Fibonacci heaps, but I'm not sure which would be best for that large a number of records. I would definitely go for an AVL / balanced AVL tree for a small database.
From a data structure point of view, what you are looking for is an interval tree.
The Wikipedia article is quite good. What you can do is augment a (balanced) binary search tree such as an AVL tree or a red-black tree. Interval trees based on binary search trees have their own section in the classic data structures book by Cormen et al.
A good data structure scales well to large amounts of data. The complexity of the major dictionary operations is O(k + log n), where n is the number of intervals in the tree and k is the number of intervals overlapping the query range. This is usually pretty good: it grows slowly with the number of interval items, except for cases where many or most intervals overlap all the others.
If you cannot hold your data in main memory, a B-Tree would be a good choice.
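If the 400K ranges happen not to overlap (the question does not say, so this is an assumption), you do not even need a full interval tree: a balanced-tree map keyed by the lower bound gives O(log n) lookups. A minimal Java sketch, with the Range type and its fields as hypothetical placeholders:

import java.util.Map;
import java.util.TreeMap;

public class RangeLookup {
    // Hypothetical record type: [low, high] plus the associated info "I".
    record Range(long low, long high, String info) {}

    private final TreeMap<Long, Range> byLowerBound = new TreeMap<>();

    public void add(Range r) {
        byLowerBound.put(r.low(), r);
    }

    /** Returns the info for the range containing item, or null if no range covers it. */
    public String lookup(long item) {
        Map.Entry<Long, Range> e = byLowerBound.floorEntry(item); // greatest lower bound <= item
        if (e != null && item <= e.getValue().high()) {
            return e.getValue().info();
        }
        return null; // item falls in a gap between ranges (relies on ranges not overlapping)
    }
}

If the ranges can overlap, you need the augmented-tree approach described above (each node also storing the maximum upper bound in its subtree) so that all covering intervals can be found.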
Any database will do what you want just fine.
If you are searching on an index, the increase in look-up cost when going from 2 to 4 records is the same as when going from 2 million to 4 million records: one more level in the tree. The relationship is logarithmic in the number of records (equivalently, capacity grows exponentially with tree depth).
