We are new to MongoDB and would like to use it to store genomic data (165M entries) and retrieve this data by genomic coordinates (ranges). Below is the type of data we store in a single collection. The column names are chrom, chromStart, chromEnd, datasetid, target, biotype:
chr1 9903 10282 ENCSR440COG ZNF239 HEK293
chr1 9904 10252 ENCSR721QZV ZSCAN18 HEK293
chr1 9905 10132 ENCSR241LIH AFF1 K-562
chr1 9905 10238 ENCSR211GNP ZSCAN4 HEK293
chr1 9905 10241 ENCSR776LDJ ZNF645 HEK293
chr1 9905 10243 ENCSR042TWZ SNIP1 MCF-7
chr2 938173 938703 ENCSR000BUL MAX MCF-7
chr2 938174 938376 ENCSR108TYQ GATAD1 Hep-G2
chr3 938174 938412 ENCSR887MXT ZHX1 HeLa-S3
chr3 945236 945377 GSE46055 KDM5B SUM185_SHCTCF
chr4 945236 945488 ENCSR000BPU ETS1 A-549
chr4 945240 945501 GSE76494 CTCF HEK293
chr4 950008 951114 GSE67783 STAG1 HSPC
chr4 950013 950185 ENCSR000BQT TCF3 GM12878
chr4 950015 950797 ENCSR115BLD KDM1A Hep-G2
chr4 950024 950693 GSE88734 ZEB1 MIA-PaCa-2
chr4 950028 950565 ENCSR753GIA TARDBP HEK293T
Typical genomic range queries would be:
db.hsap_all_peaks.find({ chrom: "chr1", chromStart: {$gte: 9905}, chromEnd:{$lte: 10243}} ).count()
db.hsap_all_peaks.find({ chrom: "chr4", chromStart: {$gte: 950013}, chromEnd:{$lte: 950693}} ).pretty()
In the long run, we plan to query on ranges but also on values, for example:
db.hsap_all_peaks.find({ chrom: "chr4", chromStart: {$gte: 950013}, chromEnd:{$lte: 950693}, target: "KDM1A" }).pretty()
This is how we created the indexes for the coordinates:
db.hsap_all_peaks.createIndex(
{chrom:1}
)
db.hsap_all_peaks.createIndex(
{chrom:1,chromStart:1,chromEnd:1}
)
However, the queries take a very long time to execute, and it seems that the index is not being used for chromStart and chromEnd.
Hence my question: what would be the best way to create indexes here?
Extra information:
> db.hsap_all_peaks.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_"
},
{
"v" : 2,
"key" : {
"chrom" : 1,
"chromStart" : 1,
"chromEnd" : 1
},
"name" : "chrom_1_chromStart_1_chromEnd_1"
},
{
"v" : 2,
"key" : {
"chrom" : 1
},
"name" : "chrom_1"
}
]
If you want to recreate a similar collection:
wget http://remap.univ-amu.fr/storage/remap2020/hg38/MACS2/remap2020_all_macs2_hg38_v1_0.bed.gz
gunzip remap2020_all_macs2_hg38_v1_0.bed.gz
mongoimport -d databaseName -c hsap_all_peaks --type tsv --file remap2020_all_macs2_hg38_v1_0.bed -f chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd,itemRgb --numInsertionWorkers 2
explain() output for a typical query:
db.hsap_all_peaks.find({ chrom: "chr2", chromStart: {$gte: 50967094}, chromEnd:{$lte: 50970983} } ).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "remap2020.hsap_all_peaks",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"chrom" : {
"$eq" : "chr2"
}
},
{
"chromEnd" : {
"$lte" : 50970983
}
},
{
"chromStart" : {
"$gte" : 50967094
}
}
]
},
"queryHash" : "2A452369",
"planCacheKey" : "C93EF492",
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"chrom" : 1,
"chromStart" : 1,
"chromEnd" : 1
},
"indexName" : "chrom_1_chromStart_1_chromEnd_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"chrom" : [ ],
"chromStart" : [ ],
"chromEnd" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"chrom" : [
"[\"chr2\", \"chr2\"]"
],
"chromStart" : [
"[50967094.0, inf.0]"
],
"chromEnd" : [
"[-inf.0, 50970983.0]"
]
}
}
},
"rejectedPlans" : [
{
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"chromEnd" : {
"$lte" : 50970983
}
},
{
"chromStart" : {
"$gte" : 50967094
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"chrom" : 1
},
"indexName" : "chrom_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"chrom" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"chrom" : [
"[\"chr2\", \"chr2\"]"
]
}
}
}
]
},
"serverInfo" : {
"host" : "sormiou.local",
"port" : 27017,
"version" : "4.4.3",
"gitVersion" : "913d6b62acfbb344dde1b116f4161360acd8fd13"
},
"ok" : 1
}
To summarise the comments: there is nothing to improve in the index. You already have the one that best supports your query, and MongoDB actually uses it.
Improving performance in general is a broader topic and is probably too wide for the SO format.
In this particular case, if we can assume that chromStart is always less than chromEnd, we can modify the query by adding an upper limit on chromStart equal to the upper limit on chromEnd:
db.hsap_all_peaks.find({
chrom: "chr1",
chromStart: {$gte: 9905, $lte: 10243},
chromEnd:{$lte: 10243}
} )
It will change the index bounds in the explain() output from
"indexBounds" : {
"chrom" : [
"[\"chr1\", \"chr1\"]"
],
"chromStart" : [
"[9905.0, inf.0]"
],
"chromEnd" : [
"[-inf.0, 10243.0]"
]
}
to
"indexBounds" : {
"chrom" : [
"[\"chr1\", \"chr1\"]"
],
"chromStart" : [
"[9905.0, 10243.0]"
],
"chromEnd" : [
"[-inf.0, 10243.0]"
]
}
The smaller the chromStart range, the fewer index nodes the scan needs to examine.
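As a minimal sketch of the rewrite above, the helper below builds the tightened filter from a chromosome and a coordinate range. The function name containedPeaksFilter is hypothetical, and the whole approach assumes that chromStart <= chromEnd holds for every document; if that is not true for your data, the extra bound would change the results.

```javascript
// Build a "peak fully contained in [start, end]" filter.
// Assumption (not verified here): chromStart <= chromEnd for every document.
function containedPeaksFilter(chrom, start, end) {
  if (start > end) {
    throw new RangeError("start must not exceed end");
  }
  return {
    chrom: chrom,
    // The $lte on chromStart is implied by the assumption above,
    // but stating it gives the index scan an upper bound to stop at.
    chromStart: { $gte: start, $lte: end },
    chromEnd: { $lte: end },
  };
}

const filter = containedPeaksFilter("chr1", 9905, 10243);
console.log(JSON.stringify(filter));
// In the mongo shell this could be used as:
//   db.hsap_all_peaks.find(filter).explain("executionStats")
```

Comparing totalKeysExamined in the executionStats output before and after the change is probably the most direct way to see whether the tighter bound helps on real data.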
The purpose of the index is to reduce the amount of data that must be examined to process the query.
With the existing index and query, you are avoiding looking at any document that will not match the query.
The only possible improvement will be in reducing the number of index keys examined.
When a field has only one bound, mongod scans the index from negative infinity for an $lte operator, and out to positive infinity for a $gte operator.
The sample data appears to have the property that chromStart is strictly less than chromEnd for any single document.
If that assumption is correct, you can use that to optimize the query a little bit more by restricting the limits.
Consider the query you explained:
db.hsap_all_peaks.find({
chrom: "chr2",
chromStart: {$gte: 50967094},
chromEnd:{$lte: 50970983}
})
As the explain command reported, the winning plan uses index bounds:
"indexBounds" : {
"chrom" : [ "[\"chr2\", \"chr2\"]" ],
"chromStart" : ["[50967094.0, inf.0]"],
"chromEnd" : ["[-inf.0, 50970983.0]"]
}
Those inf.0 values mean infinity, so the scan probably covers quite a few keys for non-matching documents.
If you were to use both values for both criteria in the query, like:
db.hsap_all_peaks.find({
chrom: "chr2",
chromStart: {$gte: 50967094, $lte: 50970983},
chromEnd:{$gte: 50967094, $lte: 50970983}
})
Those index bounds could be reduced (in theory) to:
"indexBounds" : {
"chrom" : [ "[\"chr2\", \"chr2\"]" ],
"chromStart" : ["[50967094.0, 50970983.0]"],
"chromEnd" : ["[50967094.0, 50970983.0]"]
}
In a very large data set, that could be millions of keys that no longer need to be evaluated.
Or it could be a total waste of time.
I would be very interested to hear if it actually helps.
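To make the "keys that no longer need to be evaluated" argument concrete, here is a toy model in plain JavaScript. It is only an illustration, not how mongod is implemented: a real index scan can also seek past runs of keys. It uses the chr4 rows from the question as (chromStart, chromEnd) pairs in index order; the function name scan and the chosen query range are made up for the example.

```javascript
// Toy model of the (chromStart, chromEnd) part of the index for chr4,
// using the sample rows from the question, kept in index order.
const chr4Keys = [
  [945236, 945488], [945240, 945501], [950008, 951114], [950013, 950185],
  [950015, 950797], [950024, 950693], [950028, 950565],
];

// Visit keys whose chromStart lies in [startLo, startHi]; count how many
// keys are visited and how many also satisfy chromEnd <= endHi.
function scan(keys, startLo, startHi, endHi) {
  let examined = 0;
  let matched = 0;
  for (const [start, end] of keys) {
    if (start < startLo) continue; // before the first key of the scan
    if (start > startHi) break;    // keys are sorted: nothing later qualifies
    examined += 1;
    if (end <= endHi) matched += 1;
  }
  return { examined, matched };
}

// Only a lower bound on chromStart (upper bound is infinity).
const openEnded = scan(chr4Keys, 945236, Infinity, 945501);
// Both bounds on chromStart, as in the rewritten query.
const bounded = scan(chr4Keys, 945236, 945501, 945501);
console.log(openEnded, bounded);
```

In this toy data both scans return the same two matches, but the open-ended one visits all seven keys while the bounded one visits only two. Whether the real collection behaves similarly is something explain("executionStats") would have to confirm.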
I first created a collection and an index on it:
db.first.createIndex({a:1, b:1, c:1, d:1, e:1, f:1});
then inserted data
db.first.insert({a:1, b:2, c:3, d:4, e:5, f:6});
db.first.insert({a:1, b:6});
When making queries like
db.first.find({f: 6, a:1, c:3}).sort({b: -1}).explain();
the index is used (IXSCAN):
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "myproject.first",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"a" : {
"$eq" : 1
}
},
{
"c" : {
"$eq" : 3
}
},
{
"f" : {
"$eq" : 6
}
}
]
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"a" : 1,
"b" : 1,
"c" : 1,
"d" : 1,
"e" : 1,
"f" : 1
},
"indexName" : "a_1_b_1_c_1_d_1_e_1_f_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"a" : [ ],
"b" : [ ],
"c" : [ ],
"d" : [ ],
"e" : [ ],
"f" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "backward",
"indexBounds" : {
"a" : [
"[1.0, 1.0]"
],
"b" : [
"[MaxKey, MinKey]"
],
"c" : [
"[3.0, 3.0]"
],
"d" : [
"[MaxKey, MinKey]"
],
"e" : [
"[MaxKey, MinKey]"
],
"f" : [
"[6.0, 6.0]"
]
}
}
},
"rejectedPlans" : [
{
"stage" : "SORT",
"sortPattern" : {
"b" : -1
},
"inputStage" : {
"stage" : "SORT_KEY_GENERATOR",
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"c" : {
"$eq" : 3
}
},
{
"f" : {
"$eq" : 6
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"a" : 1
},
"indexName" : "a_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"a" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"a" : [
"[1.0, 1.0]"
]
}
}
}
}
}
]
},
"serverInfo" : {
"host" : "Manishs-MacBook-Pro.local",
"port" : 27017,
"version" : "3.6.4",
"gitVersion" : "d0181a711f7e7f39e60b5aeb1dc7097bf6ae5856"
},
"ok" : 1
}
But when I use an $or query
db.first.find({ $or: [{f: 6}, {a:1}]}).explain();
the index is not used; instead the whole collection is scanned (COLLSCAN):
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "myproject.first",
"indexFilterSet" : false,
"parsedQuery" : {
"$or" : [
{
"a" : {
"$eq" : 1
}
},
{
"f" : {
"$eq" : 6
}
}
]
},
"winningPlan" : {
"stage" : "SUBPLAN",
"inputStage" : {
"stage" : "COLLSCAN",
"filter" : {
"$or" : [
{
"a" : {
"$eq" : 1
}
},
{
"f" : {
"$eq" : 6
}
}
]
},
"direction" : "forward"
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "Manishs-MacBook-Pro.local",
"port" : 27017,
"version" : "3.6.4",
"gitVersion" : "d0181a711f7e7f39e60b5aeb1dc7097bf6ae5856"
},
"ok" : 1
}
Please let me know if I am doing something wrong.
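The index is not used with $or because the only index that contains f is a compound index in which f is not the leading field, so the {f: 6} clause is not supported by any index. From the documentation: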
When evaluating the clauses in the $or expression, MongoDB either
performs a collection scan or, if all the clauses are supported by
indexes, MongoDB performs index scans. That is, for MongoDB to use
indexes to evaluate an $or expression, all the clauses in the $or
expression must be supported by indexes. Otherwise, MongoDB will
perform a collection scan.
When using indexes with $or queries, each clause of an $or can use its
own index. Consider the following query:
db.inventory.find( { $or: [ { quantity: { $lt: 20 } }, { price: 10 } ] } )
To support this query, rather than a compound index, you would create
one index on quantity and another index on price:
db.inventory.createIndex( { quantity: 1 } )
db.inventory.createIndex( { price: 1 } )
(Source: MongoDB documentation, "$or Clauses and Indexes")
So just adding individual indexes on fields f and a, like this:
db.first.createIndex({a:1});
db.first.createIndex({f:1});
will allow your
db.first.find({ $or: [{f: 6}, {a:1}]})
query to use indexes.
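The rule quoted above (every $or clause must be supported by an index) can be sketched in plain JavaScript. This is a deliberately simplified model, not the real query planner: it assumes a clause is "supported" when one of its fields is the leading field of some index. The function names clauseSupported and orCanUseIndexes are hypothetical.

```javascript
// Simplified model: a clause can use an index if one of the clause's
// fields is that index's leading (first) field.
function clauseSupported(clause, indexes) {
  const fields = Object.keys(clause);
  return indexes.some((idx) => fields.includes(Object.keys(idx)[0]));
}

// An $or can be answered with index scans only if every clause is supported.
function orCanUseIndexes(clauses, indexes) {
  return clauses.every((clause) => clauseSupported(clause, indexes));
}

const compound = { a: 1, b: 1, c: 1, d: 1, e: 1, f: 1 };
const clauses = [{ f: 6 }, { a: 1 }];

// Only the compound index: {f: 6} has no index with f as leading field.
const withCompoundOnly = orCanUseIndexes(clauses, [compound]);
// With single-field indexes on a and f added, both clauses are covered.
const withSingleFieldIndexes = orCanUseIndexes(clauses, [compound, { a: 1 }, { f: 1 }]);
console.log(withCompoundOnly, withSingleFieldIndexes);
```

Under this model the first check is false and the second is true, which matches the COLLSCAN seen in the explain output and the fix suggested here.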
The issue here is that you've created a compound index on {a:1, b:1, c:1, d:1, e:1, f:1}, and a compound index can only be used efficiently for predicates on a prefix of its fields (a; a and b; a, b and c; and so on). The order in which the fields appear in the query document does not matter, but since the field 'f' is at the tail end of the index, a predicate on f alone, such as the $or clause {f: 6}, cannot use it.
Your queries:
db.first.find({f: 6, a:1, c:3}).sort({b: -1})
db.first.find({ $or: [{f: 6}, {a:1}]})
To let both of the queries above use an index on f, you could build the compound index as below:
db.first.createIndex({ f:1, a:1, b:1, c:1 })
Alternatively, you can build individual indexes on the fields you query and combine them in any order in your queries.
Remember: if you're building a compound index, follow the Equality, Sort, Range (ESR) order for its fields.
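As a rough sketch of the Equality, Sort, Range guideline, the helper below assembles an index key pattern from the three groups of fields. The name esrKeyPattern and its argument shape are made up for illustration; the order of fields within the equality group is a free choice.

```javascript
// Build a compound index key pattern in Equality, Sort, Range order.
// equality: array of field names, sort: { field: direction }, range: array of field names.
function esrKeyPattern({ equality = [], sort = {}, range = [] }) {
  const pattern = {};
  for (const field of equality) pattern[field] = 1;
  for (const [field, direction] of Object.entries(sort)) {
    if (!(field in pattern)) pattern[field] = direction;
  }
  for (const field of range) {
    if (!(field in pattern)) pattern[field] = 1;
  }
  return pattern;
}

// For db.first.find({f: 6, a: 1, c: 3}).sort({b: -1}):
// a, c and f are equality predicates, b is the sort field.
const pattern = esrKeyPattern({ equality: ["a", "c", "f"], sort: { b: -1 } });
console.log(JSON.stringify(pattern));
// In the mongo shell: db.first.createIndex(pattern)
```

For the find-with-sort query this places the sort field b after the three equality fields. Such an index still does not help the {a: 1} clause of the $or query unless another index has a as its leading field.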
I am trying to fetch a few documents from a collection with a find query on an array of nested objects. The nested fields are indexed, but the find query is not fully using the index to fetch the documents.
Here is the structure of a document.
"_id" : ObjectId("5bc6498c1ec4062983c4f4ef"),
"appId" : ObjectId("5bbc775036021bea06d9bbc2"),
"status" : "active",
"segmentations" : [
{
"name" : "ch-1",
"values" : [
'true'
],
"type" : "string"
},
{
"name" : "browerInfo",
"values" : [
"Firefox"
],
"version" : [
"62.0"
],
"majorVersion" : [
"62"
],
"type" : "string"
},
{
"name" : "OS",
"values" : [
"Ubuntu"
],
"type" : "string"
},
{
"name" : "lastVisitTime",
"values" : [
1539721615231.0
],
"type" : "number"
}
]
}
Here are the index fields.
{
"v" : 2,
"key" : {
"appId" : 1,
"status" : 1,
"segmentations.name" : 1,
"segmentations.values" : 1
},
"name" : "SEGMENT_INDEX",
"ns" : "test.Collname"
}
Below is the find query I was executing:
db.Collname.find({
appId: ObjectId("5c6a8ef544ff62c73bdb98fc"),
"segmentations.name": 'ch-1',
'segmentations.values': 'true',
status: 'active'
}, {})
I tried to get the query execution information using
<above query>.explain("executionStats")
The result is
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.Collname",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"appId" : {
"$eq" : ObjectId("5c6a8ef544ff62c73bdb98fc")
}
},
{
"segmentations.name" : {
"$eq" : "ch-1"
}
},
{
"segmentations.values" : {
"$eq" : "true"
}
},
{
"status" : {
"$eq" : "active"
}
}
]
},
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"segmentations.values" : {
"$eq" : "true"
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"appId" : 1.0,
"status" : 1.0,
"segmentations.name" : 1.0,
"segmentations.values" : 1.0
},
"indexName" : "SEGMENT_INDEX",
"isMultiKey" : true,
"multiKeyPaths" : {
"appId" : [],
"status" : [],
"segmentations.name" : [
"segmentations"
],
"segmentations.values" : [
"segmentations",
"segmentations.values"
]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"appId" : [
"[ObjectId('5c6a8ef544ff62c73bdb98fc'), ObjectId('5c6a8ef544ff62c73bdb98fc')]"
],
"status" : [
"[\"active\", \"active\"]"
],
"segmentations.name" : [
"[\"ch-1\", \"ch-1\"]"
],
"segmentations.values" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : []
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 28176,
"executionTimeMillis" : 72,
"totalKeysExamined" : 28176,
"totalDocsExamined" : 28176,
"executionStages" : {
"stage" : "FETCH",
"filter" : {
"segmentations.values" : {
"$eq" : "true"
}
},
"nReturned" : 28176,
"executionTimeMillisEstimate" : 70,
"works" : 28177,
"advanced" : 28176,
"needTime" : 0,
"needYield" : 0,
"saveState" : 220,
"restoreState" : 220,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 28176,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 28176,
"executionTimeMillisEstimate" : 10,
"works" : 28177,
"advanced" : 28176,
"needTime" : 0,
"needYield" : 0,
"saveState" : 220,
"restoreState" : 220,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"appId" : 1.0,
"status" : 1.0,
"segmentations.name" : 1.0,
"segmentations.values" : 1.0
},
"indexName" : "SEGMENT_INDEX",
"isMultiKey" : true,
"multiKeyPaths" : {
"appId" : [],
"status" : [],
"segmentations.name" : [
"segmentations"
],
"segmentations.values" : [
"segmentations",
"segmentations.values"
]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"appId" : [
"[ObjectId('5c6a8ef544ff62c73bdb98fc'), ObjectId('5c6a8ef544ff62c73bdb98fc')]"
],
"status" : [
"[\"active\", \"active\"]"
],
"segmentations.name" : [
"[\"ch-1\", \"ch-1\"]"
],
"segmentations.values" : [
"[MinKey, MaxKey]"
]
},
"keysExamined" : 28176,
"seeks" : 1,
"dupsTested" : 28176,
"dupsDropped" : 0,
"seenInvalidated" : 0
}
}
},
"serverInfo" : {
"host" : "sys3029",
"port" : 27017,
"version" : "4.0.9",
"gitVersion" : "fc525e2d9b0e4bceff5c2201457e564362909765"
},
"ok" : 1.0
}
I could see from executionStats that the "segmentations.values" field is not used in the IXSCAN stage (its bounds are [MinKey, MaxKey]), and that there is an extra filter on "segmentations.values" in the FETCH stage. The IXSCAN stage took just 10ms, whereas the filter stage took 50ms.
I couldn't understand why the field is not included in the IXSCAN stage. My collection has around 3.2 million documents, and because of this issue the query execution time is much higher than expected.
Please help me fix the issue.
Thank you in advance.
Please tell me if I need to change my database structure.
If this is not possible in MongoDB, please suggest another database that supports the above operations.
The following query will use your index for both of your array fields:
.find({
appId: ObjectId("5c6a8ef544ff62c73bdb98fc"),
segmentations:{$elemMatch:{name: 'ch-1',values: 'true'}},
status: 'active'
}, {})
If you are not using $elemMatch, MongoDB can compound the bounds for the array item keys with either the bounds for "segmentations.name" or the bounds for "segmentations.values", but not both.
In order to compound the bounds for "segmentations.name" with the bounds for "segmentations.values", the query must use $elemMatch.
To compound together the bounds for index keys from the same array:
the index keys must share the same field path up to but excluding the
field names,
and the query must specify predicates on the fields
using $elemMatch on that path.
I suggest reading the MongoDB docs about multikey index bounds and also about $elemMatch.
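One thing to keep in mind when switching to $elemMatch is that it is not only an index hint: the two query shapes have different matching semantics. The plain JavaScript sketch below mimics both for simple equality; the helper names and the sample document are hypothetical and only illustrate the difference.

```javascript
// Equality against a field that may be a scalar or an array of scalars.
const hasValue = (field, wanted) =>
  Array.isArray(field) ? field.includes(wanted) : field === wanted;

// Dotted form: { "segmentations.name": n, "segmentations.values": v }
// Each condition may be satisfied by a different array element.
function matchesDotted(doc, name, value) {
  return (
    doc.segmentations.some((s) => hasValue(s.name, name)) &&
    doc.segmentations.some((s) => hasValue(s.values, value))
  );
}

// $elemMatch form: a single array element must satisfy both conditions.
function matchesElemMatch(doc, name, value) {
  return doc.segmentations.some(
    (s) => hasValue(s.name, name) && hasValue(s.values, value)
  );
}

// Hypothetical document: 'ch-1' and 'true' live in different elements.
const doc = {
  segmentations: [
    { name: "ch-1", values: ["false"] },
    { name: "ch-2", values: ["true"] },
  ],
};
const dottedResult = matchesDotted(doc, "ch-1", "true");
const elemMatchResult = matchesElemMatch(doc, "ch-1", "true");
console.log(dottedResult, elemMatchResult);
```

The dotted form matches this document while the $elemMatch form does not. For a segment lookup such as "name is ch-1 and its values contain true", the $elemMatch behaviour is presumably the intended one, but the two queries can return different result sets.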
I have a multikey index on an array field of a collection. When I query the collection with an $elemMatch on that field, the query is very slow despite the index.
So I ran an explain, and the index bounds seem to be incorrect.
I am using MongoDB version 3.2.11.
Here is the collection document structure:
{
"_id" : ObjectId("5c3b2def2157ed8004f6df42"),
...
"optins" : [
{
"active" : true,
"campaign" : "campaign-partenaires",
"register_date" : ISODate("2014-07-29T08:39:14.000Z")
},
{
"active" : false,
"campaign" : "campaign-top-20",
"register_date" : ISODate("2014-07-29T08:39:14.000Z"),
"unregister_date" : ISODate("2018-03-01T09:37:58.000Z"),
},
...
]
}
The index definition:
createIndex(
{
'optins.campaign':1,
'optins.active':1,
'optins.register_date':1,
'optins.unregister_date':1
},
{
background:true,
sparse:false
}
)
The query:
db.getCollection('lead').find(
{
optins : {
$elemMatch : {
campaign: "campaign-partenaires",
active : true,
register_date : {
$gt: ISODate("2014-07-29T08:39:14.000Z"),
$lt: ISODate("2019-07-29T08:39:14.000Z")
}
}
}
})
The explain winning plan input stage:
{
"stage" : "IXSCAN",
"keyPattern" : {
"optins.campaign" : 1.0,
"optins.active" : 1.0,
"optins.register_date" : 1.0,
"optins.unregister_date" : 1.0
},
"indexName" : "optins.campaign_1_optins.active_1_optins.register_date_1_optins.unregister_date_1",
"isMultiKey" : true,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"optins.campaign" : [
"[\"campaign-partenaires\", \"campaign-partenaires\"]"
],
"optins.active" : [
"[true, true]"
],
"optins.register_date" : [
"(true, new Date(1564389554000))"
],
"optins.unregister_date" : [
"[MinKey, MaxKey]"
]
}
}
So I don't understand why the bounds on register_date are
(true, new Date(1564389554000)), where Date(1564389554000) is ISODate("2019-07-29T08:39:14.000Z").
I think they should be
[ISODate("2014-07-29T08:39:14.000Z"), ISODate("2019-07-29T08:39:14.000Z")]
Any help, please?
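As a small aid to reading that bound, the snippet below converts the epoch value printed by explain() and compares it with the two dates in the query. The interpretation in the comments is an assumption rather than something the explain output states: in BSON comparison order, booleans sort just before dates, so a lower endpoint of true would mean "any date below the upper endpoint", which suggests only the $lt half of the range ended up in the bounds.

```javascript
// Upper endpoint printed by explain(), in epoch milliseconds.
const upper = new Date(1564389554000);
console.log(upper.toISOString());

// The two dates used in the query's $gt / $lt predicate.
const queryGt = new Date("2014-07-29T08:39:14.000Z");
const queryLt = new Date("2019-07-29T08:39:14.000Z");

// The upper endpoint equals the $lt value; the $gt value does not appear
// in the printed bounds at all (assumed reading: the lower endpoint `true`
// is just the type boundary below dates).
const upperIsLt = upper.getTime() === queryLt.getTime();
const upperIsGt = upper.getTime() === queryGt.getTime();
console.log(upperIsLt, upperIsGt);
```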
We use MongoDB, and it works fine on our testing server (I think because it holds fewer records), but after moving to production it became slow. Even simple queries take around 10s.
I have checked the indexes too; proper indexes are already defined.
Query
db.products.aggregate([{
"$match": {
"tenant_id": 1031
}
}, {
"$sort": {
"id": -1
}
}, {
"$skip": 0
}, {
"$limit": 20
}]
)
Explain Result
// collection: products
{
"waitedMS" : NumberLong("0"),
"stages" : [
{
"$cursor" : {
"query" : {
"tenant_id" : 1031
},
"sort" : {
"id" : NumberInt("-1")
},
"limit" : NumberLong("20"),
"queryPlanner" : {
"plannerVersion" : NumberInt("1"),
"namespace" : "dbname.products",
"indexFilterSet" : false,
"parsedQuery" : {
"tenant_id" : {
"$eq" : 1031
}
},
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"tenant_id" : {
"$eq" : 1031
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"id" : NumberInt("-1")
},
"indexName" : "id",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : NumberInt("1"),
"direction" : "forward",
"indexBounds" : {
"id" : [
"[MaxKey, MinKey]"
]
}
}
},
"rejectedPlans" : [ ]
}
}
}
],
"ok" : 1
}
Please let me know what I can do to make it fast.
To properly support this query, you need a compound index on {tenant_id: 1, id: -1}. The first part helps with filtering; the second part enables fast sorting of the filtered results.
Some documentation: https://docs.mongodb.com/manual/core/index-compound/#sort-order
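The difference between the plan in the explain output (walk the index on id, then filter on tenant_id) and a plan backed by the compound index can be illustrated with a toy simulation in plain JavaScript. The data set, tenant distribution and function names below are invented for the example; real numbers depend entirely on how the tenant's documents are distributed.

```javascript
// Hypothetical data: 10,000 products with ids 1..10000.
// Tenant 1031 owns only the 50 oldest (lowest-id) products.
const products = [];
for (let id = 1; id <= 10000; id++) {
  products.push({ id, tenant_id: id <= 50 ? 1031 : 2000 + (id % 99) });
}

// Plan from the explain output: walk ids in descending order and
// filter each fetched document on tenant_id until `limit` matches are found.
function viaIdIndex(docs, tenant, limit) {
  const byIdDesc = [...docs].sort((x, y) => y.id - x.id);
  let examined = 0;
  const out = [];
  for (const d of byIdDesc) {
    examined += 1;
    if (d.tenant_id === tenant) out.push(d);
    if (out.length === limit) break;
  }
  return { examined, ids: out.map((d) => d.id) };
}

// Plan with {tenant_id: 1, id: -1}: entries for one tenant are contiguous
// and already ordered by id descending, so only `limit` entries are read.
function viaCompoundIndex(docs, tenant, limit) {
  const entries = docs
    .filter((d) => d.tenant_id === tenant) // stands in for the index seek
    .sort((x, y) => y.id - x.id);
  const page = entries.slice(0, limit);
  return { examined: page.length, ids: page.map((d) => d.id) };
}

const slow = viaIdIndex(products, 1031, 20);
const fast = viaCompoundIndex(products, 1031, 20);
console.log(slow.examined, fast.examined);
// In the mongo shell: db.products.createIndex({ tenant_id: 1, id: -1 })
```

Both plans return the same 20 ids, but in this made-up data set the first one examines almost the entire collection before it finds them, which is the kind of behaviour that would explain a fast test server and a slow production server.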