Profiling Mnesia Queries

Our Mnesia DB is running slowly and we think it should be somewhat faster.
So we need to profile it and work out what is happening.
There are a number of options that suggest themselves:
- run fprof and see where the time is going
- run cprof and see which functions are called a lot
However, these are both fairly standard performance-monitoring tools. The question is how to actually do query profiling: which queries are taking the longest. If we were an Oracle or MySQL shop we would just run a query profiler, which would return the sorts of queries that were taking a long time to run. No such tool appears to be available for Mnesia.
So the question is:
- what techniques exist to profile Mnesia
- what tools exist to profile Mnesia - none I think, but prove me wrong :)
- how did you profile your queries and optimise your Mnesia database installation
Expanded In The Light Of Discussion
One of the problems with fprof as a profiling tool is that it only tells you about the particular query you are looking at. So fprof tells me that X is slow, and I tweak it to speed it up. Then, lo and behold, operation Y (which was fast enough) is now dog slow. So I profile Y and realise that the way to make Y quick is to make X slow. So I end up doing a series of bilateral trade-offs...
What I actually need is a way to manage multilateral trade-offs. I now have 2 metric shed-loads of actual user activities logged which I can replay. These logs represent what I would like to optimize.
A 'proper' query analyser on an SQL database would be able to profile the structure of SQL statements, eg all statements with the form:
SELECT [fieldset] FROM [table] WHERE {field = *parameter*}, {field = *parameter*}
and say 285 queries of this form took on average 0.37ms to run
The magic answers come when it says: 17 queries of this form took 6.34s to run and did a full table scan on table X; you should put an index on field Y.
When I have a result set like this over a representative set of user-activities I can then start to reason about trade-offs in the round - and design a test pattern.
The test pattern would be something like:
- activity X would make queries A and C faster but queries E and F slower
- test and measure
- then approve/disapprove
I have been using Erlang long enough to 'know' that there is no query analyser like this; what I would like to know is how other people (who must have had this problem) 'reason' about Mnesia optimisation.
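Absent such an analyser, one way to approximate it is to replay the logged activity and aggregate timings by query shape yourself. The sketch below assumes a hypothetical log of (shape, duration) pairs produced by a replay harness; nothing here is existing Mnesia tooling:

```python
from collections import defaultdict

def summarize_by_shape(log_entries):
    """Aggregate replayed query timings by query shape.

    `log_entries` is assumed to be an iterable of (shape, duration_ms)
    pairs emitted by a replay harness -- a hypothetical format, since
    Mnesia provides no such log itself.  Returns per-shape counts and
    average durations: the raw material for reasoning about
    multilateral trade-offs.
    """
    buckets = defaultdict(list)
    for shape, duration_ms in log_entries:
        buckets[shape].append(duration_ms)
    return {
        shape: {'count': len(times), 'avg_ms': sum(times) / len(times)}
        for shape, times in buckets.items()
    }

log = [
    ('read user by id', 0.4), ('read user by id', 0.3),
    ('scan orders by date', 6.1), ('scan orders by date', 6.6),
]
print(summarize_by_shape(log))
```

With a summary like this over a representative replay, a bilateral trade-off shows up as shifts in the per-shape averages between two runs, which is what lets you reason about the workload in the round.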

I hung back because I don't know much about either Erlang or Mnesia, but I know a lot about performance tuning, and from the discussion so far it sounds pretty typical.
These tools, fprof etc., sound like most tools that take their fundamental approach from gprof: instrumenting functions, counting invocations, sampling the program counter, and so on. Few people have examined the foundations of that practice for a long time. Your frustrations sound typical for users of tools like that.
There's a less-known method you might consider, outlined here. It is based on taking a small number (10-20) of samples of the state of the program at random times, and understanding each one, rather than summarizing. Typically this means examining the call stack, but you may want to examine other information as well. There are different ways to do this, but I just use the pause button in a debugger. I'm not trying to get precise timing or invocation counts; those are indirect clues at best. Instead I ask of each sample "What is it doing and why?" If I find that it is doing some particular activity, such as performing the X query looking for a y-type answer for purpose z, and it's doing it on more than one sample, then the fraction of samples it's doing it on is a rough but reliable estimate of what fraction of the time it is doing that. Chances are good that it is something I can do something about, and get a good speedup.
Here's a case study of the use of the method.
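For the curious, the random-pause idea can be sketched in a few lines. This Python illustration (Python rather than Erlang purely for brevity; the technique, not the code, is the point) samples another thread's stack at random-ish times and tallies what it was doing:

```python
import collections
import sys
import threading
import time

def sample_stacks(target_thread_id, n_samples=10, interval=0.05):
    """Take a handful of pause-style samples of another thread and
    tally the innermost function seen in each.  The goal is not precise
    timing: if one activity shows up in several of 10-20 samples, that
    fraction is a rough but reliable estimate of where the time goes."""
    counts = collections.Counter()
    for _ in range(n_samples):
        time.sleep(interval)
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            # Summarize the sample by its innermost function name; a real
            # session would read the whole stack and ask "why?" as well.
            counts[frame.f_code.co_name] += 1
    return counts

def busy():
    # Stand-in workload so the sampler has something to observe.
    deadline = time.time() + 1.0
    while time.time() < deadline:
        sum(i * i for i in range(1000))

worker = threading.Thread(target=busy)
worker.start()
samples = sample_stacks(worker.ident, n_samples=10)
worker.join()
print(samples.most_common())
```

In a debugger you get the full stack per pause, which carries far more "why" than a function name, but the counting logic is the same.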

Since Mnesia queries are just Erlang functions, I would imagine you can profile them the same way you would profile your own Erlang code. http://www.erlang.org/doc/efficiency_guide/profiling.html#id2266192 has more information on the Erlang profiling tools available.
Update: As a test I ran fprof at home on a test Mnesia instance, tracing an Mnesia QLC query; a sample of the returned output is included below. So it definitely includes more information than just the query call.
....
{[{{erl_lint,pack_errors,1}, 2, 0.004, 0.004}],
{ {lists,map,2}, 2, 0.004, 0.004}, %
[ ]}.
{[{{mnesia_tm,arrange,3}, 1, 0.004, 0.004}],
{ {ets,first,1}, 1, 0.004, 0.004}, %
[ ]}.
{[{{erl_lint,check_remote_function,5}, 2, 0.004, 0.004}],
{ {erl_lint,check_qlc_hrl,5}, 2, 0.004, 0.004}, %
[ ]}.
{[{{mnesia_tm,multi_commit,4}, 1, 0.003, 0.003}],
{ {mnesia_locker,release_tid,1}, 1, 0.003, 0.003}, %
[ ]}.
{[{{mnesia,add_written_match,4}, 1, 0.003, 0.003}],
{ {mnesia,add_match,3}, 1, 0.003, 0.003}, %
[ ]}.
{[{{mnesia_tm,execute_transaction,5}, 1, 0.003, 0.003}],
{ {erlang,erase,1}, 1, 0.003, 0.003}, %
[ ]}.
{[{{mnesia_tm,intercept_friends,2}, 1, 0.002, 0.002}],
{ {mnesia_tm,intercept_best_friend,2}, 1, 0.002, 0.002}, %
[ ]}.
{[{{mnesia_tm,execute_transaction,5}, 1, 0.002, 0.002}],
{ {mnesia_tm,flush_downs,0}, 1, 0.002, 0.002}, %
[ ]}.
{[{{mnesia_locker,rlock_get_reply,4}, 1, 0.002, 0.002}],
{ {mnesia_locker,opt_lookup_in_client,3}, 1, 0.002, 0.002}, %
[ ]}.
{[ ],
{ undefined, 0, 0.000, 0.000}, %
[{{shell,eval_exprs,6}, 0, 18.531, 0.000},
{{shell,exprs,6}, 0, 0.102, 0.024},
{{fprof,just_call,2}, 0, 0.034, 0.027}]}.

Mike Dunlavey's suggestion reminds me of redbug, which allows you to sample calls in production systems. Think of it as an easy-to-use erlang:trace that doesn't give you enough rope to hang your production system.
Using something like this call should give you lots of stack traces to identify where your mnesia transactions are called from:
redbug:start(10000,100,{mnesia,transaction,[stack]}).
It's not possible to get call durations for these traces, though.
If you have organized all mnesia lookups into modules that export an api to perform them, you could also use redbug to get a call-frequency on specific queries only.

Related

Is there a way to improve query time across relations?

Yet to measure, but I'm currently modeling my data in Memgraph in such a way that I will have several versions, and a typical query will look like:
MATCH (:Version { major: $major, minor: $minor })--(t:Thing { name: $name })
Of course, I can create an index over major/minor and over the name, but I worry that since nearly all queries will actually look for a composite index plus a relation, it'll still have some performance impact.
I will need to measure this, but wanted to see whether someone has some insight for me already, and whether there is something I can do to improve this. Also, there are < 100 :Version nodes and > 100000 :Thing nodes, so essentially the degree of :Version will be very high.
Wondering if I should convert this to be a single field on :Thing, for better indexing
After exploring, I would say that the current model looks fine. It would certainly be good to have an index on the name of the :Thing nodes; creating one on :Version is questionable and should be measured.
Use EXPLAIN/PROFILE on the query to compare the different options and test the performance.
But I think
MATCH (t:Thing {name : $name})--(:Version {major: $major, minor: $minor})
with index on Thing(name) should be enough.
I did reverse the order, but it shouldn't actually matter.

Neo4j or MongoDB for relative attributes/relations

I wish to build a database of objects with various types of relations between them. It will be easier for me to explain by giving an example.
I wish to have a set of objects, each is described by a unique name and a set of attributes (say, height, weight, colour, etc.), but instead of values, these attributes may contain values which are relative to other objects. For example, I might have two objects, A and B, where A has height 1 and weight "weight of B + 2", and B has height "height of A + 3" and weight 4.
Some objects may have completely other attributes; for example, object C may represent a box, and objects A and B will be related to C by the relations "I appear x times in C".
Queries may include "what is the height of A/B" or what is the total weight of objects appearing in C with multiplicities.
I am a bit familiar with MongoDB, and fond of its simplicity. I have heard of Neo4j, but never tried working with it. From its description it sounds more suitable for my need (but I can't tell whether it is capable of the task). But is MongoDB, with its simplicity, suitable as well? Or perhaps a different database engine?
I am not sure it matters, but I plan to use python as the engine which processes the queries and their outputs.
Either can do this. I tend to prefer neo4j, but either way could work.
In neo4j, you'd create a graph consisting of a node (A) and its "base" (B). You could then connect them like this:
(A:Node { weight: "base+2" })-[:base]->(B:Node { weight: 2 })
Note that modeling in this way would make it possible to change the base relationship to point to another node without changing anything about A. The downside is that you'd need a mini calculator to expand expressions like "base+2", which is easy but in any case extra work.
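That mini calculator could be as small as the sketch below (the dict-based nodes and the `get_base` callback are hypothetical stand-ins for fetching the node at the other end of the :base relationship):

```python
import operator
import re

# Supported operators in relative expressions like "base+2".
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def resolve_weight(node, get_base):
    """Resolve a weight that may be a literal or a relative expression.

    `node` is a hypothetical dict with a 'weight' property; `get_base`
    returns the node at the end of the :base relationship.  Recursing
    lets chains like A -> B -> C resolve as well.
    """
    value = node['weight']
    if isinstance(value, (int, float)):
        return value
    m = re.fullmatch(r'base([+\-*])(\d+(?:\.\d+)?)', value.replace(' ', ''))
    if not m:
        raise ValueError('unsupported expression: %r' % value)
    op, operand = OPS[m.group(1)], float(m.group(2))
    return op(resolve_weight(get_base(node), get_base), operand)

b = {'weight': 2}
a = {'weight': 'base+2'}
print(resolve_weight(a, lambda n: b))  # -> 4.0
```

Keeping the expression grammar this small is deliberate: the moment you want nesting or references to arbitrary nodes, a trigger-based approach (below) starts to look more attractive.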
Interpreting your question another way, you're in a situation where you'd probably want a trigger. Here's an article on neo4j triggers and how graphs handle this. If parsing that expression "base+2" at read time isn't what you want, and you want to actually set the value on A to be b.weight + 2, then you want a trigger. This would let you define some other function to be run when the graph gets updated in a certain way. In this case, when someone inserts a new :base relationship in the graph, you might check the base value (endpoint of the relationship), add 2 to its weight, and set that new property value on the source of the relationship.
Yes, you can use either DBMS.
To help you decide, this is an example of how to support your uses cases in neo4j.
To create your sample data:
CREATE
(a:Foo {id: 'A'}), (b:Foo {id: 'B'}), (c:Box {id: 123}),
(h1:Height {value: 1}), (w4:Weight {value: 4}),
(a)-[:HAS_HEIGHT {factor: 1, offset: 0}]->(h1),
(a)-[:HAS_WEIGHT {factor: 1, offset: 2}]->(w4),
(b)-[:HAS_WEIGHT {factor: 1, offset: 0}]->(w4),
(b)-[:HAS_HEIGHT {factor: 1, offset: 3}]->(h1),
(c)-[:CONTAINS {count: 5}]->(a),
(c)-[:CONTAINS {count: 2}]->(b);
"A" and "B" are represented by Foo nodes, and "C" by a Box node. Since a given height or weight can be referenced by multiple nodes, this example data model uses shared Weight and Height nodes. The HAS_HEIGHT and HAS_WEIGHT relationships have factor and offset properties to allow adjustment of the height or weight for a particular Foo node.
To query "What is the height of A":
MATCH (:Foo {id: 'A'})-[ra:HAS_HEIGHT]->(ha:Height)
RETURN ra.factor * ha.value + ra.offset AS height;
To query "What is the ratio of the heights of A and B":
MATCH
(:Foo {id: 'A'})-[ra:HAS_HEIGHT]->(ha:Height),
(:Foo {id: 'B'})-[rb:HAS_HEIGHT]->(hb:Height)
RETURN
TOFLOAT(ra.factor * ha.value + ra.offset) /
(rb.factor * hb.value + rb.offset) AS ratio;
Note: TOFLOAT() is used above to make sure integer division, which would truncate, is never used.
To query "What is the total weight of objects appearing in C":
MATCH (:Box {id: 123})-[c:CONTAINS]->()-[rx:HAS_WEIGHT]->(wx:Weight)
RETURN SUM(c.count * (rx.factor * wx.value + rx.offset));
I have not used Mongo, and decided not to after studying it, so filter my opinion with that in mind; others may find my objections easy to overcome. Mongo is not a true graph database: the user must create and manage the relationships. In Neo4j, relationships are "native" and robust.
There is a head to head comparison at this site:
https://db-engines.com/en/system/MongoDB%3bNeo4j
see also: https://stackoverflow.com/questions/10688745/database-for-efficient-large-scale-graph-traversal
There is a distinction between NoSQL (e.g., Mongo) and a true graph database. Many seem to assume that if it is not SQL then it's a graph database. This is not true: most NoSQL databases do not store relationships. The free book describes this in chapter 2.
Personally, I'm sold on Neo4j. It makes relationships, traversing graphs, and collecting lists along the path easy and powerful.

what could be the reason for high "SQL Server parse and compile time"?

A little background first: I have a legacy application with its security rules inside the app. To reuse the database model for an additional app, this time with the security model integrated inside the database, I decided to use views with the security rules inside the view SQL. The logic works well, but the performance was not really good (high I/O caused by scans of some/many tables). So I used indexed views instead of standard views for the basic columns I need for the security, and added an additional view with the security rules on top of the indexed view. That works perfectly as far as I/O performance goes, but now I have poor parse and compile times.
When I clear all buffers, a simple SQL statement against the top view delivers these timings:
SQL Server parse and compile time:
   CPU time = 723 ms, elapsed time = 723 ms.
-----------
7
(1 row(s) affected)
Table '#A7F38F33'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'xADSDocu'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
   CPU time = 0 ms, elapsed time = 0 ms.
When I execute the same statement again, the parse time is of course zero.
In the past I sometimes saw a very long parse time (>1 s) again when I re-executed the same statement later (no DML was done during this time!). Since deactivating all the automatic statistics features I have never seen these long parse times again.
But what could be the reason for such a long initial parse and compile time? It is very large, and with this solution it causes very bad performance in the app itself.
Is there a way to look deeper inside the parse time to find the root cause?
The reason for the poor compile time is the number of indexed views.
“The query optimizer may use indexed views to speed up the query execution. The view does not have to be referenced in the query for the optimizer to consider that view for a substitution.”
https://msdn.microsoft.com/en-us/library/ms191432(v=sql.120).aspx
This means that, when parsing a SQL statement, the optimizer may check all(!) the indexes from indexed views.
I have a sample on my DB where I can see this behaviour: a simple SQL statement on base tables uses the index from an indexed view.
So far so good, but when you reach a limit of about 500 indexes the system escalates, and the optimizer needs at least 10 times more CPU and memory to calculate the plan. This behaviour is nearly the same from version 2008 to 2014.

How to make datastore keys mapreduce-friendly(-er)?

Edit: See my answer. Problem was in our code. MR works fine, it may have a status reporting problem, but at least the input readers work fine.
I ran an experiment several times now and I am now sure that mapreduce (or DatastoreInputReader) has odd behavior. I suspect this might have something to do with key ranges and splitting them, but that is just my guess.
Anyway, here's the setup we have:
- we have an NDB model called "AdGroup"; when creating new entities of this model, we use the same id returned from AdWords (it's an integer), but we use it as a string: AdGroup(id=str(adgroupId))
- we have 1,163,871 of these entities in our datastore (that's what the "Datastore Admin" page tells us - I know it's not an entirely accurate number, but we don't create/delete adgroups very often, so we can say for sure that the number is 1.1 million or more)
mapreduce is started (from another pipeline) like this:
yield mapreduce_pipeline.MapreducePipeline(
    job_name='AdGroup-process',
    mapper_spec='process.adgroup_mapper',
    reducer_spec='process.adgroup_reducer',
    input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
    mapper_params={
        'entity_kind': 'model.AdGroup',
        'shard_count': 120,
        'processing_rate': 500,
        'batch_size': 20,
    },
)
So, I've tried to run this mapreduce several times today without changing anything in the code and without making changes to the datastore. Every time I ran it, mapper-calls counter had a different value ranging from 450,000 to 550,000.
Correct me if I'm wrong, but considering that I use the very basic DatastoreInputReader - mapper-calls should be equal to the number of entities. So it should be 1.1 million or more.
Note: the reason why I noticed this issue in the first place is because our marketing guys started complaining that "it's been 4 days after we added new adgroups and they still don't show up in your app!".
Right now, I can think of only one workaround - write all keys of all adgroups into a blobstore file (one per line) and then use BlobstoreLineInputReader. The writing to blob part would have to be written in a way that does not utilize DatastoreInputReader, of course. Should I go with this for now, or can you suggest something better?
Note: I have also tried using DatastoreKeyInputReader with the same code - the results were similar - mapper-calls were between 450,000 and 550,000.
So, finally, the questions. Is it important how you generate ids for your entities? Is it better to use int ids instead of str ids? In general, what can I do to make it easier for mapreduce to find all of my entities and map them?
PS: I'm still in the process of experimenting with this, I might add more details later.
After further investigation we have found that the error was actually in our code. So, mapreduce actually works as expected (mapper is called for every single datastore entity).
Our code was calling some Google services functions that were sometimes failing (with the wonderfully cryptic ApplicationError messages). Due to these failures, MR tasks were being retried. However, we had set a limit on taskqueue retries. MR did not detect or report this in any way - it was still showing "success" on the status page for all shards. That is why we thought everything was fine with our code and that there was something wrong with the input reader.

App Engine Datastore Viewer, how to show count of records using GQL?

I would think this would be easy for an SQL-alike! What I want is the GQL equivalent of:
select count(*) from foo;
and to get back an answer something similar to:
1972 records.
And I want to do this in GQL from the "command line" in the web-based DataStore viewer. (You know, the one that shows 20 at a time and lets me see "next 20")
Anyway -- I'm sure it's brain-dead easy, I just can't seem to find the correct syntax. Any help would be appreciated.
Thanks!
With straight Datastore Console, there is no direct way to do it, but I just figured out how to do it indirectly, with the OFFSET keyword.
So, given a table, we'll call foo, with a field called type that we want to check for values named "bar":
SELECT * FROM foo WHERE type="bar" OFFSET 1024
(We'll be doing a quick game of "warmer, colder" here, binary style)
Let's say that query returns nothing. Change OFFSET to 512, then 256, 128, 64, ... you get the idea. Same thing in reverse: Go up to 2048, 4096, 8192, 16384, etc. until you see no records, then back off.
I just did one here at work. Started with 2048, and noticed two records came up; there are 2049 in the table. In a more extreme case (let's say there are 3300 records), you could start with 2048, notice there are a lot, go to 4096, there are none... Take the midpoint next (1024 between 2048 and 4096 gives 3072) and notice you have records... From there you could add half the previous step (512) to get 3584, and there are none. Whittle back down by half (256) to get 3328, still none. Once more down by half (128) to get 3200, and there are records. Go up half of the last value (64) to get 3264, and there are still records. Go up half again (32) to 3296 - still records, but so few that you can easily see there are exactly 3300.
The nice thing about this vs. Datastore statistics to see how many records are in a table is you can limit it by the WHERE clause.
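The "warmer, colder" game above is just a binary search, and can be sketched as follows (the `has_rows_at` callback is a hypothetical stand-in for manually running the OFFSET query in the viewer and checking whether anything came back):

```python
def count_by_offset(has_rows_at):
    """Count rows by binary-searching on OFFSET.

    `has_rows_at(offset)` stands in for issuing
    SELECT * FROM foo ... OFFSET <offset> and checking whether any
    rows came back.  The total row count is the smallest offset that
    returns nothing.
    """
    # Double the offset until we overshoot the table...
    hi = 1
    while has_rows_at(hi):
        hi *= 2
    # ...then binary-search for the first empty offset.
    lo = hi // 2
    while lo < hi:
        mid = (lo + hi) // 2
        if has_rows_at(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo  # first offset with no rows == total row count

total = 3300
print(count_by_offset(lambda off: off < total))  # -> 3300
```

Doing this by hand in the viewer takes only about 2*log2(N) queries, a dozen or two for most tables, and as noted it works with any WHERE clause.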
I don't think there is any direct way to get the count of entities via GQL. However, you can get the count directly from the dashboard.
More details - https://cloud.google.com/appengine/docs/python/console/managing-datastore
As it's stated in other questions, it looks like there is no count aggregate function in GQL. The GQL Reference also doesn't say there is the ability to do this, though it doesn't explicitly say that it's not possible.
In the development console (running your application locally) it looks like just clicking the "List Entities" button will show you a list of all entities of a certain type, and you can see "Results 1-10 of (some number)" to get a total count in your development environment.
In production you can use the "Datastore Statistics" tab (the link right underneath the Datastore Viewer), choose "Display Statistics for: (your entity type)" and it will show you the total number of entities, however this is not the freshest view of the data (updated "at least once per day").
Since you can't run arbitrary code in production via the browser, I don't think saying "use .count() on a query" would help, but if you're using the Remote API, the .count() method is no longer capped at 1000 entries as of August, 2010, so you should be able to run print MyEntity.all().count() and get the result you want.
This is one of those surprising things that the datastore just can't do. I think the fastest way to do it would be to select __KEY__ from foo into a List, and then count the items in the list (which you can't do in the web-based viewer).
If you're happy with statistics that can be a little bit stale, you can go to the Datastore Statistics page of the admin console, which will tell you how many entities of each type there were some time ago. It seems like those stats are usually less than 10 hours old. Unfortunately, you can't query them more specifically.
There's no way to get a total count in GQL. Here's a way to get a count using python:
def count_models(model_class, max_fetch=1000):
    total = 0
    cursor = None
    while True:
        query = model_class.all(keys_only=True)
        if cursor:
            query.with_cursor(cursor)
        results = query.fetch(max_fetch)
        total += len(results)
        print('still counting: %d' % total)
        if len(results) < max_fetch:
            return total
        cursor = query.cursor()
You could run this function using the remote_api_shell, or add a custom page to your admin site to run this query. Obviously, if you've got millions of rows you're going to be waiting a while. You might be able to increase max_fetch; I'm not sure what the current fetch limit is.
