How to prevent duplicates in a MongoDB time series collection

Problem
Sensors check in periodically, but network connectivity issues may cause them to check in with the same data more than once.
MongoDB does not allow the unique property on secondary indexes for time series collections (MongoDB 5.0); see the Time Series Limitations documentation.
In addition, calculations need to be done on the data (preferably using aggregations) that involve counting the number of entries, which will be inaccurate if there are duplicates. Not to mention it bloats the database and is just messy.
Question
Is there any way to prevent duplicate entries in a MongoDB Timeseries collection?

I'm having the same issue.
According to the official answer on the MongoDB Community forums, there is no way to ensure unique values in a time series collection.
You can check the full explanations here:
https://www.mongodb.com/community/forums/t/duplicate-data-issue/135023
They consider it a caveat of time series collections compared to normal collections. IMO, it's a crucial gap in MongoDB's time series capability...
There are currently two available solutions:
Use a "normal" collection with a compound unique index on your timestamp and sensor_id fields
Keep using the time series collection, but query your data only through an aggregation pipeline with a $group stage to eliminate duplicate entries (both options are sketched below)
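A minimal sketch of both options with pymongo; the collection and field names (sensor_id, ts, value) are assumptions for illustration, not from the original question:

```python
from pymongo import ASCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017")["telemetry"]

# Option 1: a normal collection with a compound unique index, so a
# duplicate check-in fails with DuplicateKeyError at insert time.
readings = db["sensor_readings"]
readings.create_index(
    [("sensor_id", ASCENDING), ("ts", ASCENDING)], unique=True
)

# Option 2: keep the time series collection and deduplicate at read
# time with a $group stage keyed on (sensor_id, ts); counts taken
# after this stage ignore duplicate check-ins.
pipeline = [
    {"$group": {
        "_id": {"sensor": "$meta.sensor_id", "ts": "$ts"},
        "value": {"$first": "$value"},
    }},
    {"$count": "distinct_readings"},
]
print(list(db["sensor_readings_ts"].aggregate(pipeline)))
```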

Related

sql | slow queries | avoid many joins

I am currently working with Java Spring and Postgres.
I have a query on a table; many filters can be applied to it, and each filter needs many joins.
This query is very slow due to the number of joins that must be performed, and also because the table contains many rows.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At the time of the query, perform search operations on said table. This way I would do just one join.
But I have some doubts:
What is the best approach in Postgres to store an item list flat?
I know there is a json datatype; could I use it to hold the information needed for the search and query it with jsonpath? Is this performant with lists?
I also greatly appreciate any advice on another approach that can be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering if it wouldn't be more performant to move to another style of database, like a graph database. At this point the only problem I have is with this specific table; the rest of the workload is simple queries that map very well to relational databases.
Is there any scaling guideline, based on ratios and numbers of items, for choosing which database to use?
Denormalization is a tried and true way to speed up queries/reports/search processes in relational databases. It uses the standard time-vs-space tradeoff: query time goes down at the cost of duplicating the data and increasing write/insert time.
There are third-party tools specifically designed for this use case, including search tools (like Elasticsearch, Solr, etc.) and other document-centric databases. Graph databases are probably not useful in this context; they are focused on traversing relationships, not broad searches.
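A hedged sketch of the trigger-maintained search table the asker describes, executed from Python with psycopg2. The item table, the info_search layout, and all column names are hypothetical; the jsonb column plus GIN index covers the list-valued attributes asked about above:

```python
import psycopg2

# Hypothetical schema: info_search flattens the joined filter fields of
# "item" into one row, kept in sync by a trigger (PostgreSQL 11+; use
# EXECUTE PROCEDURE instead of EXECUTE FUNCTION on older versions).
ddl = """
CREATE TABLE IF NOT EXISTS info_search (
    item_id    bigint PRIMARY KEY REFERENCES item(id),
    flat_text  text,            -- pre-joined searchable fields
    attrs      jsonb            -- list-valued attributes for containment queries
);
CREATE INDEX IF NOT EXISTS info_search_attrs_idx
    ON info_search USING gin (attrs);

CREATE OR REPLACE FUNCTION sync_info_search() RETURNS trigger AS $$
BEGIN
    INSERT INTO info_search (item_id, flat_text, attrs)
    VALUES (NEW.id, NEW.name, '[]'::jsonb)
    ON CONFLICT (item_id) DO UPDATE
        SET flat_text = EXCLUDED.flat_text;
    RETURN NEW;
END $$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS item_sync ON item;
CREATE TRIGGER item_sync AFTER INSERT OR UPDATE ON item
    FOR EACH ROW EXECUTE FUNCTION sync_info_search();
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute(ddl)
```

A search then hits only info_search (e.g. `WHERE attrs @> '[{"color": "red"}]'::jsonb`), with at most one join back to item for the full row.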

Arbitrary document ordering in CouchDB/PouchDB

I’m building what can be treated as a slideshow app with CouchDB/PouchDB: each “slide” is its own Couch document, and slides can be reordered or deleted, and new slides can be added in between existing slides or at the beginning or end of the slideshow. A slideshow could grow from one to ≲10,000 slides, so I am sensitive to space- and time-efficiency.
I made the slide creation/editing functionality first, completely underestimating how tricky it is to keep track of slide ordering. This is hard because the order of each slide-document is completely independent of the slide-doc itself, i.e., it’s not something I can sort by time or some number contained in the document. I see numerous questions on StackOverflow about how to keep track of ordering in relational databases:
Efficient way to store reorderable items in a database
What would be the best way to store records order in SQL
How can I reorder rows in sql database
Storing item positions (for ordering) in a database efficiently
How to keep ordering of records in a database table
Linked List in SQL
but all these involve either
using a floating-point secondary key for reordering/creation/deletion, with periodic normalization of indexes (i.e., imagine two documents are order-index 1.0 and 2.0, then a third document in between gets key 1.5, then a fourth gets 1.25, …, until ~31 docs are inserted in between and you get floating-point accuracy problems);
a linked list approach where a slide-document has a previous and next field containing the primary key of the documents on either side of it;
a very straightforward approach of updating all documents for each document reordering/insertion/deletion.
None of these are appropriate for CouchDB: #1 incurs a huge amount of incidental complexity in SQL or CouchDB. #2 is unreliable due to lack of atomic transactions (CouchDB might update the previous document with its new next but another client might have updated the new next document meanwhile, so updating the new next document will fail with 409, and your linked list is left in an inconsistent state). For the same reason, #3 is completely unworkable.
One CouchDB-oriented approach I’m evaluating would create a document that just contains the ordering of the slides: it might contain a primary-key-to-order-number hash object as well as an array that converts order-number-to-primary-key, and just update this object when slides are reordered/inserted/deleted. The downside to this is that Couch will keep a copy of this potentially large document for every order change (reorder/insert/delete)—CouchDB doesn’t support compacting just a single document, and I don’t want to run compaction on my entire database since I love preserving the history of each slide-document. Another downside is that after thousands of slides, each change to ordering involves transmitting the entire object (hundreds of kilobytes) from PouchDB/client to Couch.
A tweak to this approach would be to make a second database just to hold this ordering document and turn on auto-compaction on it. It’ll be more work to keep track of two database connections, and I’ll eventually have to put a lot of data down the wire, but I’ll have a robust way to order documents in CouchDB.
So my questions are: how do CouchDB people usually store the order of documents? And can more experienced CouchDB people see any flaws in my approach outlined above?
Thanks to a tip by @LynHeadley, I wound up writing a library that could subdivide the lexicographical interval between strings: Mudder.js. This allows me to infinitely insert and move around documents in CouchDB, by creating new keys at will, without any overhead of a secondary document to store the ordering. I think this is the right way to solve this problem!
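For the curious, here is a toy Python sketch of the underlying idea (not Mudder.js's actual algorithm or API): produce a key lexicographically strictly between two existing keys, so inserting or moving a slide touches only that slide's document:

```python
# Toy lexicographic-interval keys over the alphabet a-z.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)

def midpoint(lo: str, hi: str) -> str:
    """Return a key k with lo < k < hi.

    lo == "" means "before everything"; hi == "" means "after everything".
    Assumes lo < hi when both are non-empty. Keys this function produces
    never end in 'a' (the zero digit), which keeps later insertions safe.
    """
    key, i = "", 0
    while True:
        lo_d = ALPHABET.index(lo[i]) if i < len(lo) else 0
        hi_d = ALPHABET.index(hi[i]) if i < len(hi) else BASE
        if hi_d - lo_d > 1:                    # room for a digit in between
            return key + ALPHABET[(lo_d + hi_d) // 2]
        key += ALPHABET[lo_d]                  # digits equal/adjacent: descend
        i += 1

a = midpoint("", "")     # first key ever           -> "n"
b = midpoint(a, "")      # append after a           -> "t"
c = midpoint(a, b)       # squeeze between a and b  -> "q"
assert a < c < b
```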
Based on what I've read, I would choose the "ordering document" approach (i.e., a slideshow document that has an array of ids for each slide document). This is really straightforward and accomplishes the use case, so I wouldn't let these concerns get in the way of clean/intuitive code.
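A hedged sketch of that approach in Python over CouchDB's HTTP API with requests; the database name, document shape, and field names are hypothetical:

```python
import requests

DB = "http://localhost:5984/slideshows"

def move_slide(show_id: str, slide_id: str, new_pos: int) -> None:
    """Reorder one slide by rewriting the single ordering document."""
    doc = requests.get(f"{DB}/{show_id}").json()
    order = doc["slide_ids"]              # array of slide doc _ids
    order.remove(slide_id)
    order.insert(new_pos, slide_id)
    # The _rev fetched above makes the PUT a compare-and-swap: a
    # concurrent writer causes a 409, and we simply re-fetch and retry.
    resp = requests.put(f"{DB}/{show_id}", json=doc)
    if resp.status_code == 409:
        move_slide(show_id, slide_id, new_pos)
```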
You are right that this document can grow potentially very large, compounded by the write-heavy nature of that specific document. This is why compaction exists and is the solution here, so you should not fight against CouchDB on this point.
It is a common misconception that you can use CouchDB's revision history to keep a comprehensive history of your database. The revisions are merely there to aid in write concurrency, not to act as a full version control system.
CouchDB has auto-compaction enabled by default, and without it your database will grow in size unchecked. Thus, you should abandon the idea of tracking document history using this approach, and instead adopt another, safer alternative. (a list of these alternatives is beyond the scope of this answer)

MongoDB large array or query

My question is related to mongo's ability to handle huge arrays.
I would like to send a push notification to all subscribers of a topic when the topic is updated. Assume a topic can have a million subscribers.
Will it be efficient to hold a huge array in the topic document containing all the user ids subscribed to it? Or is the conservative way better: hold an array of subscribed topics for each user and then query the users collection to find the subscribers of a specific topic?
Edit:
I would hold an array of subscribed topics in the user collection anyway (for views and edits)
Primary Assumption: Topic-related and person-related metadata is stored in different collections and the collection being discussed here is utilized only to keep track of topic subscribers.
Storing subscribers as a list/array associated with a topic identifier as the document key (meaning an indexed field) makes for an efficient structure. Once you have a topic of interest you can lookup the subscriber list by topic identifier. Here, as #Saleem rightly pointed out, you need to be wary of large subscriber lists causing documents to exceed the 16MB documents size limit. But, instead of complicating the design by making a different collection to handle this (as suggested by #Saleem), you can simply split the subscriber list (into as many parts as required, using a modulo 16MB operation) and create multiple documents for a topic in the same collection. Given that the topic identifier is an indexed field, lookup time will not be hurt, since 16MB can accomodate a significantly huge number of subscriber identifiers and number of splits required should be fairly low, if needed at all.
The other structure you suggested, where a subscriber identifier is the document key and the document holds all of their subscribed topics, is intuitively not as efficient for a large dataset. It would require looking up every subscriber of the topic at hand. If subscribed topics are stored as a list/array (the likely choice), that query involves an $in clause, which is slower than an indexed-field lookup, even for small topic lists over a significantly large user base.
If your array is very big and the cumulative size of the document would exceed 16 MB, then split it into another collection. You can keep topics in one collection and all of their subscribers in a separate collection that references the topic collection.

Is Couchbase an ordered key-value store?

Are documents in Couchbase stored in key order? In other words, would they allow efficient queries for retrieving all documents with keys falling in a certain range? In particular I need to know if this is true for Couchbase lite.
Query efficiency is correlated with the construction of the views that are added to the server.
Couchbase/Couchbase Lite only stores the indexes specified and generated by the programmer in these views. As Couchbase rebalances it moves documents between nodes, so it seems impractical that key order could be guaranteed or consistent.
(Few databases/datastores guarantee document or row ordering on disk, as indexes provide this functionality more cheaply.)
Couchbase document retrieval is performed via map/reduce queries in views:
A view creates an index on the data according to the defined format and structure. The view consists of specific fields and information extracted from the objects in Couchbase. Views create indexes on your information that enables search and select operations on the data.
source: views intro
A view is created by iterating over every single document within the Couchbase bucket and outputting the specified information. The resulting index is stored for future use and updated with new data stored when the view is accessed. The process is incremental and therefore has a low ongoing impact on performance. Creating a new view on an existing large dataset may take a long time to build but updates to the data are quick.
source: Views Basics
and finally, the section on Translating SQL to map/reduce may be helpful:
In general, for each WHERE clause you need to include the corresponding field in the key of the generated view, and then use the key, keys or startkey / endkey combinations to indicate the data you want to select.
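As a hedged illustration, here is a key-range query against a view over Couchbase Server's HTTP views API (port 8092) using Python's requests; the bucket, design document, and view names are assumptions:

```python
import json
import requests

VIEW = "http://localhost:8092/mybucket/_design/docs/_view/by_key"

# The keys emitted by the view's map function are what get range-scanned,
# not the raw document keys stored in the bucket. startkey/endkey must be
# JSON-encoded.
params = {
    "startkey": json.dumps("user:1000"),
    "endkey": json.dumps("user:2000"),
    "inclusive_end": "false",
}
rows = requests.get(VIEW, params=params).json()["rows"]
for row in rows:
    print(row["id"], row["key"])
```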
In conclusion, Couchbase views constantly update their indexes to ensure optimal query performance. Couchbase Lite is queried similarly; however, its indexing mechanics differ slightly:
View indexes are updated on demand when queried. So after a document changes, the next query made to a view will cause that view's map function to be called on the doc's new contents, updating the view index. (But remember that you shouldn't write any code that makes assumptions about when map functions are called.)
How to improve your view indexing: The main thing you have control over is the performance of your map function, both how long it takes to run and how many objects it allocates. Try profiling your app while the view is indexing and see if a lot of time is spent in the map function; if so, optimize it. See if you can short-circuit the map function and give up early if the document isn't a type that will produce any rows. Also see if you could emit less data. (If you're emitting the entire document as a value, don't.)
from Couchbase Lite - View

Which is the most efficient DB for autosuggest with millions of records

I need to know which would be the best DB for an autosuggest dataset with some 80 million records...
1) Redis
2) Tokyo Cabinet
3) Kyoto Cabinet
This site may have what you're looking for: http://perfectmarket.com/blog/not_only_nosql_review_solution_evaluation_guide_chart
You have several things to consider:
Volume of data - the database should be able to handle lots of records and large files
Speed of inserts and retrieval
Stability - you don't want to go down because you're hammering the DB with lots of hits, as is common with an autosuggest
I know it isn't on your list, but I would go with MongoDB. If you can't then I would go with Redis, simply for the speed factor.
Redis is a great fit for autosuggest because of its sorted sets (implemented as a skiplist). The schema I've used with success basically has each partial word as a key (so "python" would map to the keys "py", "pyt", "pyth", "pytho", and "python"). The data associated with each key is a sorted set, where the member provides lexical ordering of the original phrase (so results come back sorted) and carries an id mapping to the data you wish to return. I then store the ids and data in a hash.
Here is a sample project written in python, with more details: https://github.com/coleifer/redis-completion
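And a minimal Python sketch of the prefix-key scheme described above, using redis-py (simplified: the phrase and id are packed into the sorted set member rather than kept in a separate hash; key names are illustrative):

```python
import redis

r = redis.Redis()

def index_phrase(phrase: str, doc_id: str) -> None:
    # Every prefix of the phrase becomes a sorted set holding members;
    # score 0 everywhere, so members sort lexically by the member string.
    member = f"{phrase.lower()}\x00{doc_id}"
    for i in range(1, len(phrase) + 1):
        r.zadd(f"auto:{phrase[:i].lower()}", {member: 0})

def suggest(prefix: str, limit: int = 10):
    members = r.zrange(f"auto:{prefix.lower()}", 0, limit - 1)
    return [m.decode().split("\x00") for m in members]

index_phrase("python", "doc1")
index_phrase("pytest", "doc2")
print(suggest("pyt"))   # -> [['pytest', 'doc2'], ['python', 'doc1']]
```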
