Performance issue when querying time-based objects - database

I'm currently working on a MongoDB collection containing documents that look like the following:
{ startTime : Date, endTime: Date, source: String, metaData: {}}
My use case is to retrieve all documents that fall within a queried time frame, so my query looks like this:
db.myCollection.find(
    {
        $and: [
            { "source": aSource },
            { "startTime": { $lte: timeFrame.end } },
            { "endTime": { $gte: timeFrame.start } }
        ]
    }
).sort({ "startTime": 1 })
With an index defined as follows:
db.myCollection.createIndex( { "source" : 1, "startTime": 1, "endTime": 1 } );
The problem is that queries are very slow (several hundred ms on a local database) as soon as the number of documents per source increases.
Using explain shows me that the index is used efficiently (only the matching documents are fetched; everything else is index-only access), so the slowness seems to come from the index scan itself, as this query needs to go over a large portion of the index.
In addition to that, such an index gets huge pretty quickly and therefore seems inefficient.
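For reference, the explain check described above looks roughly like this; "executionStats" reports how many index keys are examined versus how many documents are returned, which is what reveals the wide index scan:

// Rough sketch of the plan inspection mentioned above; aSource and
// timeFrame are the same placeholders as in the query.
db.myCollection.find({
    source: aSource,
    startTime: { $lte: timeFrame.end },
    endTime: { $gte: timeFrame.start }
}).sort({ startTime: 1 }).explain("executionStats")
// Compare executionStats.totalKeysExamined with executionStats.nReturned.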
Is there anything I'm missing that could make these queries faster, or is retrieving all the documents belonging to a given source really the best way to go? I see that MongoDB now provides some time-series features; could those help with my problem?
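(For reference, a time-series collection - available since MongoDB 5.0 - is declared roughly as in the sketch below; whether its bucketing actually helps an interval-overlap query like the one above is exactly what I'm unsure about. The collection name and granularity are just placeholders.)

// Hedged sketch only: a time-series collection keyed on startTime,
// with source as the metaField so measurements are bucketed per source.
db.createCollection("myTimeSeriesCollection", {
    timeseries: {
        timeField: "startTime",     // must be a date field
        metaField: "source",        // groups buckets per source
        granularity: "minutes"      // placeholder; depends on the real data
    }
})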

Related

Database schema design for stock market financial data

I'm figuring out the optimal structure to store financial data with daily inserts.
There are 3 use cases for querying the data:
Querying specific symbols for current data
Finding symbols by current values (e.g. where price < 10 and dividend.amountPaid > 3)
Charting historical values per symbol (e.g. query all dividend.yield between 2010 and 2020)
I am considering MongoDB, but I don't know which structure would be optimal. Embedding all the data per symbol for a duration of 10 years is too much, so I was thinking of embedding the current data per symbol, and creating references to historical documents.
How should I store this data? Is MongoDB not a good solution?
Here's a small example for a data for one symbol.
{
  "symbol": "AAPL",
  "info": {
    "company_name": "Apple Inc.",
    "description": "some long text",
    "website": "http://apple.com",
    "logo_url": "http://apple.com"
  },
  "quotes": {
    "open": 111,
    "close": 321,
    "high": 111,
    "low": 100
  },
  "dividends": {
    "amountPaid": 0.5,
    "exDate": "2020-01-01",
    "yieldOnCost": 10,
    "growth": { value: 111, pct_chg: 10 }, /* some fields could be more attributes than just k/v */
    "yield": 123
  },
  "fundamentals": {
    "num_employees": 123213213,
    "shares": 123123123123,
    ....
  }
}
What approach would you take for storing this data?
Based upon the info (the sample data and the use cases) you had posted, I think storing the historical data as a separate collection sounds fine.
Some of the important factors that affect the database design (or data model) are the amount of data and the kind of queries - the most important queries you plan to perform on the data. Assuming that the JSON data you posted (for a stock symbol) can be used to perform the first two queries, you can start with the idea of storing the historical data as a separate collection. The historical data document for a symbol can cover a year or a range of years - it depends upon the queries, the data size, and the type of information.
MongoDB's document-based model allows a flexible schema, which can be useful for implementing future changes and requirements easily. Note that a MongoDB document can store up to 16 MB of data.
For reference, see MongoDB Data Model Design.
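As a rough sketch of that split (collection and field names here are just illustrative, not prescriptive), the historical data could live in its own collection with one document per symbol per day, which directly supports the third use case (charting):

// Hypothetical "history" collection: one document per symbol per trading day.
{
    "symbol": "AAPL",
    "date": ISODate("2020-01-02T00:00:00Z"),
    "quotes": { "open": 111, "close": 321, "high": 111, "low": 100 },
    "dividends": { "amountPaid": 0.5, "yield": 123 }
}

// Use case 3: chart all dividend yields for a symbol between 2010 and 2020.
db.history.find(
    { "symbol": "AAPL", "date": { $gte: ISODate("2010-01-01"), $lt: ISODate("2021-01-01") } },
    { "date": 1, "dividends.yield": 1 }
).sort({ "date": 1 })

// A compound index such as { symbol: 1, date: 1 } would keep this query efficient.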
Stock market data by itself is huge. Keep it all in one place per company, otherwise you'll end up with a mess sooner or later.
Regarding your example above: logo URLs should point to image files ('.png' etc.), not an HTML page.
Your "quotes" section will get way too big; keep quotes at the top level as their own documents - that's the nice thing with Mongo. Each quotes document should have a date ;) associated with it, and use a date type Mongo supports natively, not a string.

How to query with order by different time intervals

I have a Firebase Realtime Database with time/value entries like the following; there is a 1-minute delay between each object.
"-LhIaB7SP0y-FLb1xFFx" : {
"time" : 1560475623,
"value" : 11.614479842990287
},
"-LhIaJ6PjtbX1VHKlwFM" : {
"time" : 1560475681,
"value" : 11.642968895431837
},
"-LhIaXbX42k8dmApfztL" : {
"time" : 1560475741,
"value" : 11.707783121665505
},
"-LhIaqgYSpUmKbcH1MTN" : {
"time" : 1560475802,
"value" : 11.704004474172576
},
"-LhIb-20G9jnx61vNjS-" : {
"time" : 1560475861,
"value" : 11.69861155382089
},
"-LhIbDdTEdWrhirbjVRa" : {
"time" : 1560475921,
"value" : 11.661539551497276
},
"-LhIbSGKvS2POggUCots" : {
"time" : 1560475981,
"value" : 11.581711077020692
}
I can retrieve the data ordered by "time", but I want to filter it down to intervals of 5 minutes, 1 day, or a week.
this.items = this.db.list(`history/data`, ref => ref.orderByChild("time").limitToLast(1000));
Is there firebase list filtering for that?
The query model of the Firebase Realtime Database works as follows for your code:
It orders the child nodes of the reference by the child you indicate.
It finds the last node in the result, and then returns the 1000 nodes before that.
You can add a condition to your query with startAt(), endAt(), and/or equalTo() to make the database start/end at a specific set of child nodes within the range. But there is no way within these conditions to skip child nodes in the middle. Once the database has found a child node to start returning, it will return all child nodes from there on until the conditions are no longer met.
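As an illustration (keeping the AngularFire style of your snippet), a range condition that reads a single day would look roughly like this; the timestamps are placeholders in the same Unix-seconds format as your data:

// Rough sketch: bound the query on the indexed "time" child to one day.
const dayStart = 1560470400;          // placeholder: start of the day, in seconds
const dayEnd = dayStart + 86400;      // one day later
this.items = this.db.list(`history/data`, ref =>
    ref.orderByChild("time").startAt(dayStart).endAt(dayEnd)
);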
The simplest way I can think of to implement your requirement is to store the data in the aggregation buckets that you want to query on. So if you want to allow reading the data in a way that gives you the first and last item of every week, and the first and last item of every month, you'd store:
2019w24_first: "-LhIaB7SP0y-FLb1xFFx",
2019w24_last: "-LhIbSGKvS2POggUCots",
2019m06_first: "-LhIaB7SP0y-FLb1xFFx",
2019m06_last: "-LhIbSGKvS2POggUCots"
Then each time you write the data, you update the relevant aggregates too.
This sounds incredibly inefficient to folks who come from a relational/SQL background, but it is actually very common in NoSQL databases. By making your write operations do some extra work and storing some duplicate data, your read operations become massively more scalable.
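A minimal sketch of that write path, assuming a hypothetical aggregates node next to history/data and bucket keys named as in the example above (newReading stands for the time/value object being written):

// Hedged sketch only: push the new reading, then point the relevant
// week/month bucket keys at it. The *_first keys would only be written
// the first time a bucket is seen.
const rootRef = firebase.database().ref();
const newRef = rootRef.child("history/data").push(newReading);
rootRef.update({
    "aggregates/2019w24_last": newRef.key,   // newest item of the week bucket
    "aggregates/2019m06_last": newRef.key    // newest item of the month bucket
});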
For some more information on these types of data modeling choices, I recommend:
reading NoSQL data modeling
watching Firebase for SQL developers
watching Getting to know Cloud Firestore, which may be for a different Firebase Database, but many of the principles apply equally.

MongoDB grab last versions from specified version

I have a set of test results in my mongodb database. Each document in the database contains version information, test data, date, test run information etc...
The version is broken up in the document and stored as individual values. For example: { VER_MAJOR : "0", VER_MINOR : "2", VER_REVISION : "3", VER_PATCH : "20" }
My application wants the ability to specify a specific version and grab the document as well as the previous N documents based on the version.
For example:
If version = 0.2.3.20 and n = 5 then the result would return documents with version 0.2.3.20, 0.2.3.19, 0.2.3.18, 0.2.3.17, 0.2.3.16, 0.2.3.15
The solutions that come to my mind are:
Create a new database that contains sorted documents with version information, which can be used to obtain the previous N versions, which in turn can be used to obtain the corresponding N documents in the test results database.
Perform the sorting in the test results database itself, as in number 1. Though if the test results database is large, this will take a very long time. I would also have to consider inserting in order every time.
Creating another database as in option 1 doesn't seem like the right way. But sorting the test results database seems like it would involve a lot of overhead - am I mistaken to worry that option 2 produces a lot of overhead? I have the impression I'd have to query the entire database and then sort it on the application side. Querying the entire database seems like overkill...
db.collection_name.find().sort([Parameters for sorting])
You are quite correct that querying and sorting the entire data set would be very excessive. I probably went overboard on this, but I tried to break everything down in detail below.
Terminology
First things first, a couple of terminology nitpicks. I think you're using the term Database when you mean to use the word Collection. Differentiating between these two concepts will help with navigating documentation and allow for a better understanding of MongoDB.
Collections and Sorting
Second, it is important to understand that documents in a Collection have no inherent ordering. The order in which documents are returned to your app is only applied when retrieving documents from the Collection, such as when specifying .sort() on a query. This means we won't need to copy all of the documents to some other collection; we just need to query the data so that only the desired data is returned in the order we want.
Query
Now to the fun part. The query will look like the following:
db.test_results.find({
    "VER_MAJOR" : "0",
    "VER_MINOR" : "2",
    "VER_REVISION" : "3",
    "VER_PATCH" : { "$lte" : 20 }   // assumes VER_PATCH is stored as a number
}).sort({
    "VER_PATCH" : -1
}).limit(N)
Our query has a direct match on the three leading version fields to limit results to only those values, i.e. the specific version "0.2.3". A range $lte filter is applied on VER_PATCH since we will want more than a single patch revision.
We then sort results by VER_PATCH to return results descending by the patch version. Finally, the limit operator is used to restrict the number of documents being returned.
Index
We're not done yet! Remember how you said that querying the entire collection and sorting it on the app side felt like overkill? Well, the database would be doing exactly that if an index did not exist for this query.
You should follow the equality-sort-range (ESR) rule when determining the order of fields in an index. In this case, that gives us the index:
{ "VER_MAJOR" : 1, "VER_MINOR" : 1, "VER_REVISION" : 1, "VER_PATCH" : 1 }
Creating this index will allow the query to complete by scanning only the results it would return, while avoiding an in-memory sort. More information can be found here.
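For completeness, creating that index from the shell looks like this:

db.test_results.createIndex(
    { "VER_MAJOR" : 1, "VER_MINOR" : 1, "VER_REVISION" : 1, "VER_PATCH" : 1 }
)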

Parse Server, MongoDB - get "liked" state of an object

I am using Parse Server, which runs on MongoDB.
Let's say I have collections User and Comment and a join table of user and comment.
User can like a comment, which creates a new record in a join table.
Specifically in Parse Server, a join table can be defined using a 'relation' field in the collection.
Now when I want to retrieve all comments, I also need to know, whether each of them is liked by the current user. How can I do this, without doing additional queries?
You might say I could create an array field likers in the Comment table and use $elemMatch, but that doesn't seem like a good idea because, potentially, there can be thousands of likes on a comment.
My idea, but I hope there could be a better solution:
I could create an array field someLikers, a relation (join table) field allLikers, and a number field likesCount in the Comment table. Then put the first 100 likers in both someLikers and allLikers, and additional likers only in allLikers. I would always increment likesCount.
Then, when querying a list of comments, I would implement the call with $elemMatch, which would tell me whether the current user is inside someLikers. When I got the comments back, I would check whether any of them have likesCount > 100 AND $elemMatch returned null. If so, I would have to run another query on the join table for those comments, checking (querying) whether they are liked by the current user.
Is there a better option?
Thanks!
I'd advise against directly accessing MongoDB unless you absolutely have to; after all, the way collections and relations are built is an implementation detail of Parse and in theory could change in the future, breaking your code.
Even though you want to avoid multiple queries, I suggest doing just that (depending on your platform you might even be able to run the two Parse queries in parallel):
The first one is the query on Comment for getting all comments you want to display; assuming you have some kind of Post for which comments can be written, the query would find all comments referencing the current post.
The second query is again on Comment, but this time:
constrained to the comments retrieved in the first query, e.g.: containedIn("objectId", arrayOfCommentIDs)
and constrained to the comments having the current user in their likers relation, e.g.: equalTo("likers", currentUser)
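A rough sketch of those two queries with the Parse JavaScript SDK (currentPost, currentUser, and the "post" pointer field are placeholders for whatever your schema actually uses):

// Query 1: all comments you want to display (here: comments on the current post).
const commentQuery = new Parse.Query("Comment");
commentQuery.equalTo("post", currentPost);
const comments = await commentQuery.find();

// Query 2: of those comments, which ones have the current user in their likers relation?
const likedQuery = new Parse.Query("Comment");
likedQuery.containedIn("objectId", comments.map(c => c.id));
likedQuery.equalTo("likers", currentUser);
const likedByCurrentUser = await likedQuery.find();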
Well, a join collection is not really a NoSQL way of thinking ;-)
I don't know ParseServer, so below is just based on pure MongoDB.
What I would do is, in the Comment document, use an array of ObjectIds, one for each user who likes the comment.
Sample document layout
{
"_id" : ObjectId(""),
"name" : "Comment X",
"liked" : [
ObjectId(""),
....
]
}
Then use an aggregation to get the data. I assume you have the _id of the comment and you know the _id of the user.
The following aggregation returns the comment with a like count and a boolean which indicates the user liked the comment.
db.Comment.aggregate([
    {
        $match: {
            _id : ObjectId("your commentId")
        }
    },
    {
        $project: {
            _id : 1,
            name : 1,
            number_of_likes : { $size : "$liked" },
            user_liked : {
                $gt: [
                    {
                        $size: {
                            $filter: {
                                input: "$liked",
                                as: "like",
                                cond: { $eq: ["$$like", ObjectId("your userId")] }
                            }
                        }
                    },
                    0
                ]
            }
        }
    }
]);
This returns:
{
"_id" : ObjectId(""),
"name" : "Comment X",
"number_of_likes" : NumberInt(7),
"user_liked" : true
}
Hope this is what you're after.

What is the fastest ArangoDB friends-of-friends query (with count)

I'm trying to use ArangoDB to get a list of friends-of-friends. Not just a basic friends-of-friends list, I also want to know how many friends the user and the friend-of-a-friend have in common and sort the result.
After several attempts at (re)writing the best performing AQL query, this is what I ended up with:
LET friends = (
    FOR f IN GRAPH_NEIGHBORS('graph', #user, {"direction": "any", "includeData": true, "edgeExamples": { name: "FRIENDS_WITH"}})
        RETURN f._id
)
LET foafs = (
    FOR friend IN friends
        FOR foaf IN GRAPH_NEIGHBORS('graph', friend, {"direction": "any", "includeData": true, "edgeExamples": { name: "FRIENDS_WITH"}})
            FILTER foaf._id != #user AND foaf._id NOT IN friends
            COLLECT foaf_result = foaf WITH COUNT INTO common_friend_count
            RETURN {
                user: foaf_result,
                common_friend_count: common_friend_count
            }
)
FOR foaf IN foafs
    SORT foaf.common_friend_count DESC
    RETURN foaf
Unfortunately, performance is not as good as I would've liked. Compared to the Neo4j versions of the same query (and data), AQL seems quite a bit slower (5-10x).
What I'd like to know is... How can I improve our query to make it perform better?
I am one of the core developers of ArangoDB and tried to optimize your query. As I do not have your dataset I can only talk about my test dataset and would be happy to hear if you can validate my results.
First of all, I am running on ArangoDB 2.7, but in this particular case I do not expect a major performance difference from 2.6.
In my dataset I could execute your query as it is in ~7 sec.
First fix:
In your friends statement you use includeData: true but only return the _id. With includeData: false, GRAPH_NEIGHBORS directly returns the _id, and we can also get rid of the subquery here:
LET friends = GRAPH_NEIGHBORS('graph',
#user,
{"direction": "any",
"edgeExamples": {
name: "FRIENDS_WITH"
}})
This got it down to ~ 1.1 sec on my machine. So I expect that this will be close to the performance of Neo4J.
Why does this have a high impact?
Internally we first find the _id value without actually loading the document's JSON. In your query you do not need any of that data, so we can safely continue without opening it.
But now for the real improvement
Your query goes the "logical" way: it first gets the user's neighbors, then finds their neighbors, counts how often each foaf is found, and sorts the result.
This has to build up the complete foaf network in memory and sort it as a whole.
You can also do it in a different way:
1. Find all friends of user (only _ids)
2. Find all foaf (complete document)
3. For each foaf find all foaf_friends (only _ids)
4. Find the intersection of friends and foaf_friends and COUNT them
This query would look like this:
LET fids = GRAPH_NEIGHBORS("graph",
                           #user,
                           {
                               "direction": "any",
                               "edgeExamples": {
                                   "name": "FRIENDS_WITH"
                               }
                           })
FOR foaf IN GRAPH_NEIGHBORS("graph",
                            #user,
                            {
                                "minDepth": 2,
                                "maxDepth": 2,
                                "direction": "any",
                                "includeData": true,
                                "edgeExamples": {
                                    "name": "FRIENDS_WITH"
                                }
                            })
    LET commonIds = GRAPH_NEIGHBORS("graph",
                                    foaf._id,
                                    {
                                        "direction": "any",
                                        "edgeExamples": {
                                            "name": "FRIENDS_WITH"
                                        }
                                    })
    LET common_friend_count = LENGTH(INTERSECTION(fids, commonIds))
    SORT common_friend_count DESC
    RETURN {user: foaf, common_friend_count: common_friend_count}
Which in my test graph was executed in ~ 0.024 sec
So this gave me a factor of ~250 faster execution time, and I would expect it to be faster than your current query in Neo4j; but as I do not have your dataset I cannot verify it, so it would be good if you could check and let me know.
One last thing
With edgeExamples: { name: "FRIENDS_WITH" } it is the same story as with includeData: we have to find the real edge and look into it. This could be avoided if you store your edges in separate collections based on their name and then drop the edgeExamples entirely. This will further increase performance (especially if there are a lot of edges).
Future
Stay tuned for our next release; we are currently adding more functionality to AQL that will make your case much easier to query and should give another performance boost.
