My question is related to mongo's ability to handle huge arrays.
I would like to send push notification when topic is updated to all subscribers of the topic. Assume a topic can have a million subscribers.
Would it be efficient to hold a huge array in the topic document containing the ids of all users subscribed to it? Or is the conservative way better - hold an array of subscribed topics for each user and then query the users collection to find the subscribers of a specific topic?
Edit:
I would hold an array of subscribed topics in the user collection anyway (for views and edits)
Primary Assumption: Topic-related and person-related metadata is stored in different collections and the collection being discussed here is utilized only to keep track of topic subscribers.
Storing subscribers as a list/array keyed by a topic identifier (an indexed field) makes for an efficient structure. Once you have a topic of interest, you can look up its subscriber list by topic identifier. Here, as @Saleem rightly pointed out, you need to be wary of large subscriber lists pushing documents past the 16 MB document size limit. But instead of complicating the design with a separate collection (as suggested by @Saleem), you can simply split the subscriber list into as many chunks as needed to stay under 16 MB and create multiple documents for the topic in the same collection. Since the topic identifier is an indexed field, lookup time will not suffer: 16 MB can accommodate a very large number of subscriber identifiers, so the number of splits required should be fairly low, if any are needed at all.
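A minimal sketch of that bucketed layout in the mongo shell (the collection and field names are illustrative, and the cap is whatever keeps a bucket comfortably under 16 MB):

    // all buckets for a topic share the indexed topicId
    db.topic_subscribers.createIndex({ topicId: 1 })

    // add a subscriber: push into a bucket that still has room, or create a new one
    db.topic_subscribers.updateOne(
      { topicId: "t1", count: { $lt: 100000 } },
      { $push: { subscribers: "u42" }, $inc: { count: 1 } },
      { upsert: true }
    )

    // fetch the full subscriber list for a topic (one or more bucket documents)
    db.topic_subscribers.find({ topicId: "t1" })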
The other structure you suggested, where the subscriber identifier is the document key and the document holds all of that user's subscribed topics, is intuitively less efficient for a large dataset. It requires looking up every subscriber of the topic at hand; if subscribed topics are stored as a list/array (the likely choice), that query involves an $in clause, which is slower than an indexed-field lookup even for small topic lists over a large user base.
If your array is very big and the cumulative size of the document exceeds 16 MB, then split it into another collection. You can have the topic in one collection and all of its subscribers in a separate collection referencing the topic collection.
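For example (names are illustrative), the separate collection would hold one small document per subscription, indexed by topic:

    // one document per (topic, subscriber) pair
    db.subscriptions.createIndex({ topicId: 1, userId: 1 }, { unique: true })
    db.subscriptions.insertOne({ topicId: "t1", userId: "u42" })

    // all subscribers of a topic, with no 16 MB concern
    db.subscriptions.find({ topicId: "t1" }, { _id: 0, userId: 1 })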
Related
I'm looking for a document design to store the ids of the users who liked a given user. My first approach (maybe not the best) is an array storing the ids of all users who liked someone. Something like:
/users/1
- likedByUserIds: [1,2,3,4,5,...,1000,1000001,...]
I have to query for the users not yet liked and show them first. So how many ids can be stored in an array field of a Firestore document? There can be many user ids from previous likes.
The number of elements in the array is not fixed. The limit you will run into is the maximum total size of the document (1 MB), which is documented here. The total size of the document is not based on a single field; it is based on everything you put in it. You can estimate the size of a document using the information here.
For lists of data that are not bounded within the limits of a single document, it's better to store the items as individual documents in a collection. There is no bound on the number of documents in a collection.
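A hedged sketch of that layout with the Node Admin SDK (the collection names are made up), where each like is its own document in a subcollection:

    const admin = require('firebase-admin');
    admin.initializeApp();
    const db = admin.firestore();

    async function example() {
      // one document per like instead of one ever-growing array field
      await db.collection('users').doc('1')
        .collection('likedBy').doc('1000001')
        .set({ likedAt: admin.firestore.FieldValue.serverTimestamp() });

      // read back everyone who liked user 1 (paginate for large lists)
      const snap = await db.collection('users').doc('1').collection('likedBy').get();
      return snap.docs.map(d => d.id);
    }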
As @Doug stated, "there's a maximum total size for a document: 1MB". There's also a Firestore single-field index exemption limit, which is 200. A single-field index stores a sorted mapping of all the documents in a collection that contain a specific field. Each entry in a single-field index records a document's value for a specific field and the location of the document in the database.
Additionally, there are single-field index exemptions: you can exempt a field from your automatic indexing settings by creating one, and an indexing exemption overrides the database-wide automatic index settings. The exemption list has to be held in memory by every process that could handle a write to your database, and every incoming write is compared against it, which has both memory-management and latency costs - hence the limit.
For further reference, you can check this related post about Firestore single-field index exemption limit.
Here is a solution I found to avoid a full search in Firebase when getting the users that haven't been liked yet.
My users collection is in Firestore; the like bookkeeping is done in Redis:
Add the user: zAdd('users', { score: 0, value: userId })
Add a liked user: zAdd(`liked_users_${uid}`, { score: 0, value: userId })
Generate the not-liked-yet set: zDiffStore('not_liked_yet', ['users', `liked_users_${uid}`])
Read it back: zRange('not_liked_yet', 0, -1)
It just worked!
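For reference, a minimal runnable sketch of the flow above, assuming the node-redis v4 client and that both keys are kept as sorted sets (so zDiffStore applies to them):

    const { createClient } = require('redis');

    async function notLikedYet(uid) {
      const client = createClient();
      await client.connect();

      // all known users, and the users this uid already liked, as sorted sets
      await client.zAdd('users', [{ score: 0, value: 'u1' }, { score: 0, value: 'u2' }]);
      await client.zAdd(`liked_users_${uid}`, [{ score: 0, value: 'u1' }]);

      // users minus already-liked = not liked yet
      await client.zDiffStore('not_liked_yet', ['users', `liked_users_${uid}`]);
      const ids = await client.zRange('not_liked_yet', 0, -1); // e.g. ['u2']

      await client.quit();
      return ids;
    }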
Problem
Sensors check-in periodically, but network connectivity issues may cause them to check-in with the same data more than once.
MongoDB does not allow the unique property on secondary indexes for time series collections (MongoDB 5.0) - see Time Series Limitations.
In addition, calculations need to be done on the data (preferably using aggregations) that involve counting the number of entries, which will be inaccurate if there are duplicates. Not to mention it bloats the database and is just messy.
Question
Is there any way to prevent duplicate entries in a MongoDB Timeseries collection?
I'm having the same issue.
According to the official answer on the MongoDB Community forums, there is no way to ensure unique values in a time series collection.
You can check the full explanations here:
https://www.mongodb.com/community/forums/t/duplicate-data-issue/135023
They consider it a caveat of time series collections compared to normal collections. IMO, it's a crucial gap in MongoDB's time series capability...
There are currently two available solutions:
Use a "normal" collection with a compound unique index on your timestamp and sensor_id fields
Keep using the time series collection, but query your data only through an aggregation pipeline with a $group stage to eliminate duplicate entries
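Hedged sketches of both options in the mongo shell (the collection, field, and metaField names are illustrative):

    // option 1: a regular collection with a compound unique index
    db.readings.createIndex({ sensor_id: 1, timestamp: 1 }, { unique: true })

    // option 2: keep the time series collection, but de-duplicate at query time
    db.readings_ts.aggregate([
      { $group: {
          _id: { sensor: "$metadata.sensor_id", ts: "$timestamp" },
          doc: { $first: "$$ROOT" }
      } },
      { $replaceRoot: { newRoot: "$doc" } },
      { $count: "entries" }   // counts stay accurate despite duplicate check-ins
    ])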
I have 3 types of entity:
Subjects
Topics
Tasks
In each subject there are topics and tasks. The topics can depend on each other. (Of course, a topic that belongs to subject sj1 can only depend on another topic that also belongs to sj1.)
Between tasks and topics there are connections (which must also stay within the same subject) that symbolise the fact that solving a certain task requires being aware of certain topics.
So a task can require multiple topics, and a topic can be required by multiple tasks (an N<--->M connection).
What would be the best way to store this?
Solution 1
Have 3 collections for each type of entity
In tasks and topics, have an index on a subject identifier attribute.
Have an edge collection for storing the [N]<-->[M] connections between topics and tasks.
Solution 2
Have 1 collection for the subjects
For each subject, have one topics collection and one tasks collection. The connection between subjects and tasks/topics can be based on a prefix in the collection names. (I.e. for the chemistry subject we have chemistry_tasks and chemistry_topics collections)
For each subject, have an edge collection for the connections between tasks and topics and another edge collection for the connections among topics (I.e. chemistry_topics_tasks_connections and chemistry_topics_connections)
This way, if I want to search among the topics or tasks of a subject, I don't need to pre-filter them by a subject identifier index; I immediately get the collection that contains all of my data. Moreover, I don't have the overhead of an index entry for each document in tasks and topics.
On the other hand, this will result in a mess of collections.
Side note: there will be at most 50 subjects, but the number of tasks and topics is unbounded.
In your terms, "awareness" is generated through the "graph", which requires no extra indexing to work at its best. ArangoDB automatically creates special "_key" and "_from/_to" indexes, which it uses for graph traversal.
But as for indexing, that's all about search performance - indexes are added based on the data you want to find. It really comes down to how you want to search:
one collection with multiple entity types or
multiple collections segregated by entity type.
There is no penalty for having large collections, and a graph can link documents within a single collection - it doesn't need them to be segregated. Also, you can have multiple edge collections and/or multiple document collections. These are some of the concepts that challenge those of us who, like me, come from a traditional RDBMS - "schemaless" or "multi-model" databases kind of turn normalization on its ear.
Personally, I choose to build fairly large collections based on the data source (I import data from external sources). Each collection contains documents of multiple object/data schemas, identified by an objType attribute. The benefit here is that you can search all documents in the collection on a single field (or even an index with multiple fields, like title + objType), very quickly reducing the set of documents to iterate/traverse - this is usually where real performance gains are made.
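As an illustration only (the collection name entities, the edge collection requires, and the attributes are made up; the exact AQL depends on your model), the single-collection layout still lets you filter on objType and then traverse, here via arangojs:

    const { Database, aql } = require('arangojs');

    async function tasksForChemistryTopics() {
      const db = new Database({ url: 'http://localhost:8529' });
      // narrow by indexed fields first, then walk the graph from each match
      const cursor = await db.query(aql`
        FOR doc IN entities
          FILTER doc.objType == 'topic' AND doc.subject == 'chemistry'
          FOR task IN 1..1 INBOUND doc requires
            RETURN { topic: doc.title, task: task.title }
      `);
      return cursor.all();
    }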
So... I guess I recommend solution #3?
I’m building what can be treated as a slideshow app with CouchDB/PouchDB: each “slide” is its own Couch document, and slides can be reordered or deleted, and new slides can be added in between existing slides or at the beginning or end of the slideshow. A slideshow could grow from one to ≲10,000 slides, so I am sensitive to space- and time-efficiency.
I made the slide creation/editing functionality first, completely underestimating how tricky it is to keep track of slide ordering. This is hard because the order of each slide-document is completely independent of the slide-doc itself, i.e., it’s not something I can sort by time or some number contained in the document. I see numerous questions on StackOverflow about how to keep track of ordering in relational databases:
Efficient way to store reorderable items in a database
What would be the best way to store records order in SQL
How can I reorder rows in sql database
Storing item positions (for ordering) in a database efficiently
How to keep ordering of records in a database table
Linked List in SQL
but all these involve either
using a floating-point secondary key for reordering/creation/deletion, with periodic normalization of indexes (i.e., imagine two documents are order-index 1.0 and 2.0, then a third document in between gets key 1.5, then a fourth gets 1.25, …, until ~31 docs are inserted in between and you get floating-point accuracy problems);
a linked list approach where a slide-document has a previous and next field containing the primary key of the documents on either side of it;
a very straightforward approach of updating all documents for each document reordering/insertion/deletion.
None of these are appropriate for CouchDB: #1 incurs a huge amount of incidental complexity in SQL or CouchDB. #2 is unreliable due to lack of atomic transactions (CouchDB might update the previous document with its new next but another client might have updated the new next document meanwhile, so updating the new next document will fail with 409, and your linked list is left in an inconsistent state). For the same reason, #3 is completely unworkable.
One CouchDB-oriented approach I’m evaluating would create a document that just contains the ordering of the slides: it might contain a primary-key-to-order-number hash object as well as an array that converts order-number-to-primary-key, and just update this object when slides are reordered/inserted/deleted. The downside to this is that Couch will keep a copy of this potentially large document for every order change (reorder/insert/delete)—CouchDB doesn’t support compacting just a single document, and I don’t want to run compaction on my entire database since I love preserving the history of each slide-document. Another downside is that after thousands of slides, each change to ordering involves transmitting the entire object (hundreds of kilobytes) from PouchDB/client to Couch.
A tweak to this approach would be to make a second database just to hold this ordering document and turn on auto-compaction on it. It’ll be more work to keep track of two database connections, and I’ll eventually have to put a lot of data down the wire, but I’ll have a robust way to order documents in CouchDB.
So my questions are: how do CouchDB people usually store the order of documents? And can more experienced CouchDB people see any flaws in my approach outlined above?
Thanks to a tip by @LynHeadley, I wound up writing a library that can subdivide the lexicographical interval between strings: Mudder.js. This lets me insert and move documents around in CouchDB indefinitely, by minting new keys at will, without the overhead of a secondary document to store the ordering. I think this is the right way to solve this problem!
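In PouchDB terms, the idea looks roughly like this (the in-between key is hard-coded here for illustration; Mudder.js is what generates such keys for you):

    const PouchDB = require('pouchdb');
    const db = new PouchDB('slides');

    async function demo() {
      // the _id doubles as the order key; allDocs returns rows sorted by _id
      await db.put({ _id: 'am', title: 'first slide' });
      await db.put({ _id: 'an', title: 'second slide' });

      // insert between the two: any key with 'am' < key < 'an' works, e.g. 'amm'
      await db.put({ _id: 'amm', title: 'inserted in between' });

      const result = await db.allDocs();
      console.log(result.rows.map(r => r.id)); // [ 'am', 'amm', 'an' ]
    }
    demo();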
Based on what I've read, I would choose the "ordering document" approach (i.e., a slideshow document that holds an array of ids, one per slide document). This is really straightforward and accomplishes the use case, so I wouldn't let these concerns get in the way of clean/intuitive code.
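For illustration (the ids are made up), the ordering document can be as small as:

    {
      "_id": "slideshow:intro-to-couchdb",
      "slideIds": ["slide:f3a", "slide:09c", "slide:d71"]
    }

Reordering, inserting, or deleting a slide then means rewriting just this one array.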
You are right that this document can grow potentially very large, compounded by the write-heavy nature of that specific document. This is why compaction exists and is the solution here, so you should not fight against CouchDB on this point.
It is a common misconception that you can use CouchDB's revision history to keep a comprehensive history of your database. The revisions are merely there to aid in write concurrency, not to serve as a full version control system.
CouchDB has auto-compaction enabled by default, and without it your database will grow in size unchecked. Thus, you should abandon the idea of tracking document history using this approach, and instead adopt another, safer alternative. (a list of these alternatives is beyond the scope of this answer)
One of my queries can take a lot of different filters and sort orders depending on user input. This generates a huge index.yaml file of 50+ indexes.
I'm thinking of denormalizing many of my boolean and multi-choice (string) properties into a single string list property. This way, I will reduce the number of query combinations because most queries will simply add a filter to the string list property, and my index count should decrease dramatically.
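Illustratively (the property names are made up), the merged property collapses several filterable fields into one list of name=value tokens, so most filters become equality tests against that single list property:

    // before: separate properties, each filter combination needing its own composite index
    { isPublic: true, color: 'red', size: 'large', created: 1700000000 }

    // after: one string list property plus the sort field
    { flags: ['isPublic=true', 'color=red', 'size=large'], created: 1700000000 }

    // a query then filters flags == 'color=red' AND flags == 'isPublic=true',
    // ordered by created, which a single composite index on (flags, created) can typically serve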
It will surely increase my storage size, but this isn't really an issue as I won't have that much data.
Does this sound like a good idea or are there any other drawbacks with this approach?
As always, this depends on how you want to query your entities. For most of the sorts of queries you could execute against a list property like this, App Engine will already include an automatically built index, which you don't have to specify in index.yaml. Likewise, most queries you'd want to execute that require a composite index either can't be done with a list property or would require an 'exploding' index on that list property.
If you tell us more about the sort of queries you typically run on this object, we can give you more specific advice.
Denormalizing your data to cut back on the number of indices sounds like a good tradeoff. Reducing the number of indices means fewer indices to update (though your one index will see more updates); it is unclear how this will affect performance on GAE. Size will of course be larger if you leave the original fields in place (since you're copying data into the string list property), but this might not be too significant unless your entity was quite large already.
This is complicated a little by the fact that the index on the list will contain one entry for each element of the list on each entity (rather than just one entry per entity). This will certainly impact space and query performance. Also, be wary of creating an index that contains multiple list properties, or you could run into the problem of exploding indices (multiple list properties => one index entry for each combination of values from the lists).
Try experimenting and see how it works in practice for you (use AppStats!).
"It will surely increase my storage size, but this isn't really an issue as I won't have that much data."
If this is true then you have no reason to denormalize.