I am trying to implement a tailable cursor for MongoDB using the C driver. So far I have been able to create it and successfully get the pushed data into my process with the following code:
cursor = mongo_find(connection, DB_TENANT_NAMESPACE, query, bson_empty(&e), 0, 0, MONGO_TAILABLE | MONGO_AWAIT_DATA);
while (1)
{
    /* drain everything currently available on the tailable cursor */
    while (mongo_cursor_next(cursor) == MONGO_OK)
    {
        b = mongo_cursor_bson(cursor);
        if (bson_find(iterator, b, "_id"))
        {
            oid = bson_iterator_oid(iterator);
            bson_oid_to_string(oid, &id);
            printf("ID:%s\n", id);
        }
    }
}
With this code I can get the updates. But looking at the tailable cursor docs, it seems that I need to run mongo_find inside the outer while loop to make sure I get the latest entries. The docs suggest appending a gte condition to the query. Copying from the docs:
query = QUERY( "_id" << GT << lastId ).sort("$natural");
The issue is that the oid is an object which can be converted to a string. I don't really think I should be converting it to an int in order for gte to work. Any ideas?
ObjectIds may be logically compared by those operators, as can Date and Timestamp objects. There should be no need to represent the ObjectId as a string, and there is no practical reason (at least in this case) for comparing an ObjectId to a string.
Note that comparisons involving two different BSON types will follow the BSON comparison order.
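To illustrate why no string or integer conversion is needed, here is a minimal Python sketch (it is not the C driver API). It assumes ObjectIds are ordered by comparing their 12 raw bytes, the first 4 of which are a big-endian creation timestamp. The sample ids and the `newer_than` helper are made up for illustration.

```python
# Hypothetical sample ids; real ObjectIds are 12 bytes (24 hex characters).
older = bytes.fromhex("4e6f2f3a0000000000000001")
newer = bytes.fromhex("4e6f2f3b0000000000000001")


def newer_than(docs, last_id):
    """Return the documents whose _id sorts after last_id (all of them if None)."""
    return [d for d in docs if last_id is None or d["_id"] > last_id]


docs = [{"_id": older}, {"_id": newer}]

# Byte-wise comparison orders the ids directly, with no conversion step.
remaining = newer_than(docs, older)
print(remaining[0]["_id"].hex())
```

In the tailing loop, the role of `last_id` would be played by the last oid read from the cursor before it has to be re-created.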
I need an algorithm which looks simple, but I still can't think of a well-optimised way to do this.
I have the following json object:
[
{
"start": "2000-01-01T04:00:00.000Z",
"end": "2020-01-01T08:00:00.000Z"
}, {
"start": "2000-01-01T05:00:00.000Z",
"end": "2020-01-01T07:00:00.000Z"
}
]
As you can see, the second object is inside the range of the first. I need to iterate over this array and return which dates are conflicting.
My project is in Ruby on Rails right now, but I just need an idea of how to implement the algorithm, so any high-level programming language would be fine.
Any ideas?
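First, we can transform the list of hashes to parse the strings into DateTime objects (Date.parse would drop the time of day, which matters here):
require 'date'
# assumes `input` is the parsed array with symbol keys (:start, :end)
dates = input.map do |hsh|
hsh.transform_values { |str| DateTime.parse(str) }
end
Now we can use a nested loop and use Range#cover? to find out whether there are overlaps: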
conflicting = dates.select.with_index do |date, idx|
[date[:start], date[:end]].any? do |date_to_compare|
dates.map.with_index.any? do |date2, idx2|
next if idx == idx2 # so we don't compare to self
(date2[:start]..date2[:end]).cover?(date_to_compare)
end
end
end
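Since the question allows any high-level language, the same idea can be sketched in Python with a sort-and-sweep instead of comparing every pair. This is only a sketch under assumptions: the `conflicts` and `parse` names are made up, keys are the strings "start" and "end", and ranges that merely touch at an endpoint are counted as conflicting.

```python
from datetime import datetime

# Matches timestamps such as 2000-01-01T04:00:00.000Z
FMT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse(item):
    """Turn one {"start": ..., "end": ...} entry into a (start, end) tuple."""
    return (datetime.strptime(item["start"], FMT),
            datetime.strptime(item["end"], FMT))


def conflicts(items):
    """Return index pairs (i, j), i < j, of entries whose ranges overlap."""
    spans = sorted(parse(it) + (i,) for i, it in enumerate(items))
    found = []
    for a in range(len(spans)):
        _, end_a, i = spans[a]
        for b in range(a + 1, len(spans)):
            start_b, _, j = spans[b]
            if start_b > end_a:
                break  # sorted by start: nothing later can overlap range a
            found.append((min(i, j), max(i, j)))
    return found


data = [
    {"start": "2000-01-01T04:00:00.000Z", "end": "2020-01-01T08:00:00.000Z"},
    {"start": "2000-01-01T05:00:00.000Z", "end": "2020-01-01T07:00:00.000Z"},
]
print(conflicts(data))
```

Sorting first lets the inner loop stop early, so the cost is roughly O(n log n) plus the number of conflicts reported; it is still quadratic when every range overlaps every other.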
Detect a DateTime Object Covered By a Range
There may be a more elegant way to do this, but this seems relatively straightforward to me. The trick is to convert your Hash values into DateTime ranges that can take advantage of the built-in Range#cover? method.
Consider the following:
require 'date'
dates = [
{:start=>"2000-01-01T04:00:00.000Z", :end=>"2020-01-01T08:00:00.000Z"},
{:start=>"2000-01-01T05:00:00.000Z", :end=>"2020-01-01T07:00:00.000Z"},
]
# convert your date hashes into an array of date ranges
date_ranges = dates.map { |hash| hash.values}.map do |array|
(DateTime.parse(array.first) .. DateTime.parse(array.last))
end
# compare sets of dates; report when the first covers the second range
date_ranges.each_slice(2) do |range1, range2|
puts "#{range1} covers #{range2}" if range1.cover? range2
end
Because Range#cover? is Boolean, you might prefer to simply store dates which are covered and do something with them later, rather than taking immediate action on each one. In that case, just use Array#select. For example:
date_ranges.each_slice(2).select { |r1, r2| r1.cover? r2 }
Shove the data into a database using BTREE index on the date fields. Let the DB do the work for you.
Let's say we have the following table:
CREATE TABLE myDate (
id BIGINT UNSIGNED, date_start DATETIME, date_end DATETIME
);
Then you want a BTREE (or BTREE+) index on date_start and date_end, and a HASH index on id.
Once these are in place, feed your table the data, and perform the following select statement to find times that overlap:
-- Query to select dates that are fully contained such as in the example (l contains r):
SELECT l.id, l.date_start, l.date_end, r.id, r.date_start, r.date_end
FROM myDate l JOIN myDate r ON (l.date_start < r.date_start) AND (l.date_end > r.date_end);
-- Query to select dates that overlap on one side (l starts before r and ends inside it, or l starts inside r):
SELECT l.id, l.date_start, l.date_end, r.id, r.date_start, r.date_end
FROM myDate l JOIN myDate r ON ((l.date_start < r.date_start) AND (l.date_end > r.date_start)) OR ((l.date_start > r.date_start) AND (l.date_start < r.date_end));
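Both cases can also be folded into a single overlap condition: two ranges overlap when each one starts before the other ends. The sketch below tries that with Python's built-in sqlite3 module rather than the database assumed above. It relies on ISO 8601 strings of identical format comparing chronologically as text, and the third row is invented to show a non-overlapping range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myDate (id INTEGER PRIMARY KEY, date_start TEXT, date_end TEXT)"
)
conn.executemany(
    "INSERT INTO myDate VALUES (?, ?, ?)",
    [
        (1, "2000-01-01T04:00:00.000Z", "2020-01-01T08:00:00.000Z"),
        (2, "2000-01-01T05:00:00.000Z", "2020-01-01T07:00:00.000Z"),
        (3, "2021-01-01T00:00:00.000Z", "2021-01-02T00:00:00.000Z"),  # made up
    ],
)

# l.id < r.id reports each overlapping pair once and skips self-joins.
rows = conn.execute(
    """
    SELECT l.id, r.id
    FROM myDate l JOIN myDate r
      ON l.id < r.id
     AND l.date_start < r.date_end
     AND r.date_start < l.date_end
    """
).fetchall()
print(rows)
```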
Those strings look like ISO 8601 format. You should be able to easily parse them into a Date/DateTime/or similar object. Check the docs for those classes; they show how you can do that. Then, after parsing into objects, you should be able to compare those date objects simply with the </<=/>=/> operators. With this you will be able to compare starts/ends, and you will be able to determine if a date X is:
(a) fully before the other one
(b) starts before and ends within the other one
(c) fully within the other one
(d) starts within and ends after the other one
(e) fully after the other one
(f) is longer and fully contains the other one
I think that's all possibilities, but you better double-check that. Draw them all on a time axis if needed and see if there are any other possibilities.
When you have code that can do this classification, you're good to go and implement the rest of the logic that builds on it.
but I still can't think about a well optimised way
Don't. Write it first in any way, just to get it working and reliable. Understand the problem from the beginning to the end, thoroughly. Then measure its speed and quality. If it's not good, then write a v2 version based on a first-whatever-guess regarding speed/quality observations. Measure and compare. If it's still not good, then collect code, data sets, measurements, make sure test cases and measurements are repeatable by readers that don't have your computer&network&passwords&etc, and then explain the problem and ask how to fix/optimize that. Without all of this, asking about "optimization"*) mostly leads to pure guessing.
*) OFC assuming that "well optimized way" wasn't an empty buzzword, but a real question re performance
id | name | ipAddress
----+----------+-------------------------
1 | testname | {192.168.1.60,192.168.1.65}
I want to search ipAddress with LIKE. I tried:
{'$mac_ip_addresses.ip_address$': { [OP.contains]: [searchItem]}},
This one also:
{'$mac_ip_addresses.ip_address$': { [OP.Like] : { [OP.any]: [searchItem]}}},
The data type of ipAddress is text[]. I want to search in ipAddress with LIKE.
searchItem contains the IP that need to be searched in the ipAddress field so I want to search in array with LIKE.
I don't know Sequelize, but I can answer from the PostgreSQL side.
There is no short syntax to search for a pattern inside an array in PostgreSQL.
If you want to check the pattern against each array element individually, then you need to unfold the array using unnest:
SELECT id, name, ipaddress
FROM testing
WHERE EXISTS (
SELECT 1 FROM unnest(ipaddress) AS ip
WHERE ip LIKE '8.8.8.%'
);
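To make the semantics of that EXISTS/unnest query concrete, here is a small Python sketch of "does any element match the LIKE pattern". It is only an illustration: `like_to_regex` and `any_like` are made-up helpers, and only the `%` and `_` wildcards are handled (no ESCAPE clause).

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% and _ only, no escapes) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def any_like(values, pattern):
    """True if at least one array element matches the whole pattern."""
    rx = like_to_regex(pattern)
    return any(rx.fullmatch(v) for v in values)


ip_array = ["192.168.1.60", "192.168.1.65"]
print(any_like(ip_array, "192.168.1.%"))
print(any_like(ip_array, "8.8.8.%"))
```

As in SQL, the pattern has to match the whole element, so a bare prefix without `%` does not match.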
If the array is frequently searched this way, it's better to store the data in normalized form.
However, there is a short syntax (plus GIN index support) for equality-based search (see @> and other operators here).
SELECT id, name, ipaddress
FROM testing
WHERE ipaddress @> ARRAY['8.8.8.8'];
What you asked
~~ is the operator used internally to implement SQL LIKE. There is no commutator for it, that is, no operator that works with the left and right operands switched.
That's the one you'd need for your attempt to use the ANY construct with the pattern to the left.
You can create the operator, though, and it's pretty simple:
CREATE OR REPLACE FUNCTION reverse_like (text, text)
RETURNS boolean LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT $2 LIKE $1';
CREATE OPERATOR <~~ (function = reverse_like, leftarg = text, rightarg = text);
Inspired by Jeff Janes' idea here:
Match string pattern to any array element
Then your query can have the pattern to the left of the operator:
SELECT *
FROM mac_ip_addresses
WHERE '192.168.2%.255' <~~ ANY (ipaddress);
Simple, but considerably slower than the EXISTS expression demonstrated by filiprem.
Then again, either query is excruciatingly slow for big tables, since neither can use an index. A normalized DB design with a n:1 table holding one IP each would allow that. It would also occupy several times the space on disk. Still, the much cleaner implementation ...
While stuck with your current design, there is still a way: create a trigram GIN index on a text representation of the array and add a redundant, "sargable" predicate to the query additionally. Confused? Here's the recipe:
First, trigram indexes? Read this if you are not familiar:
PostgreSQL LIKE query performance variations
Neither the cast from text[] to text nor array_to_string() is immutable. But we need that for an expression index. Long story short, fake it with an immutable wrapper function:
CREATE OR REPLACE FUNCTION f_textarr2text(text[])
RETURNS text LANGUAGE sql IMMUTABLE AS $$SELECT array_to_string($1, ',')$$;
CREATE INDEX iparr_trigram_idx ON mac_ip_addresses
USING gin (f_textarr2text(ipaddress) gin_trgm_ops);
Related answer with the long story (and why it's safe):
Indexing an array for full text search
Then your query can be:
SELECT *
FROM mac_ip_addresses
WHERE '192.168.9%.255' <~~ ANY (ipaddress)
AND f_textarr2text(ipaddress) LIKE '%192.168.9%.255%'; -- logically redundant
The added predicate is logically redundant, but can tap into the power of the trigram index.
Much faster for big tables. Still a bit faster, yet:
SELECT *
FROM mac_ip_addresses
WHERE EXISTS (SELECT FROM unnest(ipaddress) ip WHERE ip LIKE '192.168.9%.255')
AND f_textarr2text(ipaddress) LIKE '%192.168.9%.255%';
But that's minor now.
I addressed the question asked, as I took an interest. Might be of interest to the general public. Most probably not what you need, though.
What you need
I want to search in ipAddress with LIKE. searchItem contains the IP that need to be searched in the ipAddress field so I want to search in array with LIKE.
That should probably read:
"I want to search a given IP address (searchItem) in the array ipAddress. My first idea is to use LIKE ..."
Well, LIKE is for pattern matching. To find complete IP addresses in an array, it's the wrong tool. filiprem's second query with array operators is the way to go. Probably good enough.
Using the built-in data type cidr instead of text would be better. And the ip4 data type of the additional ip4r module would be much better, yet. All in combination with standard array operators as demonstrated.
Finally, converting IPv4 addresses to integer and using that with the additional intarray module should be stellar, as far as performance is concerned.
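The conversion itself is simple. As a rough sketch of what "IPv4 address as integer" means on the application side, Python's standard ipaddress module can do it; the `ip_to_int` helper name is made up, and how the integers are then stored in an integer array column is left out.

```python
import ipaddress


def ip_to_int(text):
    """Convert dotted-quad IPv4 text to its 32-bit unsigned integer value."""
    return int(ipaddress.IPv4Address(text))


ips = ["192.168.1.60", "192.168.1.65"]
as_ints = [ip_to_int(ip) for ip in ips]
print(as_ints)

# The conversion is reversible, so nothing is lost by storing integers.
print([str(ipaddress.IPv4Address(n)) for n in as_ints])
```

Values above 2147483647 do not fit a signed 32-bit integer, so an offset or a wider type may be needed depending on the column type.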
I have a set of test results in my mongodb database. Each document in the database contains version information, test data, date, test run information etc...
The version is broken up in the document and stored as individual values. For example: { VER_MAJOR : "0", VER_MINOR : "2", VER_REVISION : "3", VER_PATCH : "20" }
My application wants the ability to specify a specific version and grab the document as well as the previous N documents based on the version.
For example:
If version = 0.2.3.20 and n = 5 then the result would return documents with version 0.2.3.20, 0.2.3.19, 0.2.3.18, 0.2.3.17, 0.2.3.16, 0.2.3.15
The solutions that come to mind are:
1. Create a new database that contains documents with version information and is sorted. It can be used to obtain the previous N versions, which in turn can be used to obtain the corresponding N documents in the test results database.
2. Perform the sorting in the test results database itself, like in number 1. Though if the test results database is large, this will take a very long time. Also consider inserting in order every time.
Creating another database like in option 1 doesn't seem like the right way. But sorting the test results database seems like there will be lots of overhead, am I mistaken that I should be worried about option 2 producing lots of overhead? I have the impression I'd have to query the entire database then sort it on application side. Querying the entire database seems like overkill...
db.collection_name.find().sort([Parameters for sorting])
You are quite correct that querying and sorting the entire data set would be very excessive. I probably went overboard on this, but I tried to break everything down in detail below.
Terminology
First things first, a couple of terminology nitpicks. I think you're using the term Database when you mean to use the word Collection. Differentiating between these two concepts will help with navigating documentation and allow for a better understanding of MongoDB.
Collections and Sorting
Second, it is important to understand that documents in a Collection have no inherent ordering. The order in which documents are returned to your app is only applied when retrieving documents from the Collection, such as when specifying .sort() on a query. This means we won't need to copy all of the documents to some other collection; we just need to query the data so that only the desired data is returned in the order we want.
Query
Now to the fun part. The query will look like the following:
db.test_results.find({
"VER_MAJOR" : "0",
"VER_MINOR" : "2",
"VER_REVISION" : "3",
"VER_PATCH" : { "$lte" : 20 }
}).sort({
"VER_PATCH" : -1
}).limit(N)
Our query has a direct match on the three leading version fields to limit results to only those values, i.e. the specific version "0.2.3". A range $lte filter is applied on VER_PATCH since we will want more than a single patch revision.
We then sort results by VER_PATCH to return results descending by the patch version. Finally, the limit operator is used to restrict the number of documents being returned.
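The filter, sort and limit steps can be mimicked in plain Python to see what the query is expected to return. This is a sketch, not driver code: `previous_versions` is a made-up helper working on an in-memory list, and it assumes the version parts are stored as numbers. The sample document in the question stores them as strings ("20"), in which case a numeric $lte would not match and string ordering would put "9" after "20". The sketch also returns n + 1 documents, because the question's example lists the requested version plus five earlier ones.

```python
def previous_versions(docs, major, minor, revision, patch, n):
    """Return the requested version and up to n earlier patch versions."""
    matches = [
        d for d in docs
        if (d["VER_MAJOR"], d["VER_MINOR"], d["VER_REVISION"]) == (major, minor, revision)
        and d["VER_PATCH"] <= patch          # the range filter ($lte)
    ]
    matches.sort(key=lambda d: d["VER_PATCH"], reverse=True)  # sort descending
    return matches[: n + 1]                  # the limit


# Hypothetical test results for version 0.2.3.x, patches 1 through 25.
docs = [
    {"VER_MAJOR": 0, "VER_MINOR": 2, "VER_REVISION": 3, "VER_PATCH": p}
    for p in range(1, 26)
]
result = previous_versions(docs, 0, 2, 3, 20, 5)
print([d["VER_PATCH"] for d in result])

# Why numeric storage matters: strings compare character by character.
print("9" > "20", 9 > 20)
```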
Index
We're not done yet! Remember how you said that querying the entire collection and sorting it on the app side felt like overkill? Well, the database would be doing exactly that if an index did not exist for this query.
You should follow the equality-sort-range (ESR) rule when determining the order of fields in an index. In this case, this would give us the index:
{ "VER_MAJOR" : 1, "VER_MINOR" : 1, "VER_REVISION" : 1, "VER_PATCH" : 1 }
Creating this index will allow the query to complete by scanning only the results it would return, while avoiding an in-memory sort. More information can be found here.
The following document records a conversation between Milhouse and Bart. I would like to insert a new message with the right num (the next in the example would be 3) in a single operation. Is that possible?
{ user_a:"Bart",
  user_b:"Milhouse",
  conversation:{
    last_msg:2,
    messages:[
      { from:"Bart",
        msg:"Hello",
        num:1
      },
      { from:"Milhouse",
        msg:"Wanna go out ?",
        num:2
      }
    ]
  }
}
In MongoDB, arrays keep their order, so by adding a num attribute, you're only creating more data for something that you could accomplish without the additional field. Just use the position in the array to accomplish the same thing. Grabbing the Xth message in an array will provide faster searches than searching for { num: X }.
To keep the order, I don't think there's an easy way to add the num field besides doing a find() on conversation.last_msg before you insert the new subdocument and increment last_msg.
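The idea of using the array position instead of a stored num can be sketched with a plain Python list standing in for the messages array. The helper names are made up, and the append here only mirrors what a single $push update would do on the server.

```python
conversation = {
    "messages": [
        {"from": "Bart", "msg": "Hello"},
        {"from": "Milhouse", "msg": "Wanna go out ?"},
    ]
}


def add_message(conv, sender, text):
    """Append a message and return its 1-based number (its array position)."""
    conv["messages"].append({"from": sender, "msg": text})
    return len(conv["messages"])


def message_number(conv, num):
    """Fetch message number `num` by position, with no search on a num field."""
    return conv["messages"][num - 1]


new_num = add_message(conversation, "Bart", "Sure")
print(new_num, message_number(conversation, 2)["from"])
```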
Depending on what you need to keep the ordering for, you might consider including a time stamp in your subdocument, which is commonly kept in conversation records anyway and may provide other useful information.
Also, I haven't used it, but there's a Mongoose plugin that may or may not be able to do what you want: https://npmjs.org/package/mongoose-auto-increment
You can't create an auto-increment field, but you can use functions to generate and manage a sequence:
http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
I would recommend using a timestamp rather than a numerical value. By using a timestamp, you can keep the ordering of the subdocument and make other use of it.
OutOfMemoryError caused when db4o databse has 15000+ objects
My question is in reference to my previous question (above). For the same PostedMessage model and same query.
With 100,000 PostedMessage objects, the query takes about 1243 ms to return first 20 PostedMessages.
Now, I have saved 1,000,000 PostedMessage objects in db4o. The same query took 342,132 ms. Which is non-linearly high.
How can I optimize the query speed?
FYR:
The timeSent and timeReceived are Indexed fields.
I am using SNAPSHOT query mode.
I am not using TA/TP.
Do you sort the result? Unfortunately db4o doesn't use the index for sorting / orderBy. That means it will run a regular sort algorithm, with O(n*log(n)). It won't scale linearly.
Also db4o doesn't support a TOP operator. That means even without sorting it takes quite a bit of time to copy the ids to the result set, even when you never read the entities afterwards.
So, there's no really good solution for this, except trying to use some criteria which cut down the result size.
Some adventurous people might use a different query evaluation, but I personally don't recommend that.
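To put the sorting cost in perspective, here is a small Python sketch (unrelated to the db4o API) comparing a full sort with a bounded selection of the first 20 items. `heapq.nsmallest` only keeps 20 candidates at a time, which is roughly what a TOP operator would allow; the data set is random and purely illustrative.

```python
import heapq
import random

random.seed(0)
# Stand-in for 100,000 timestamps in arbitrary order.
times = [random.randrange(10**9) for _ in range(100_000)]

# Full sort: O(n log n) work even though only 20 results are wanted.
first_20_full_sort = sorted(times)[:20]

# Bounded selection: O(n log k) with k = 20, no full ordering required.
first_20_heap = heapq.nsmallest(20, times)

print(first_20_full_sort == first_20_heap)
```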
@Gamlor No, I am not sorting at all. The code is as follows:
public static ObjectSet<PostedMessage> getMessagesBetweenDates(
Calendar after,
Calendar before,
ObjectContainer db) {
if (after == null || before == null || db == null) {
return null;
}
Query q = db.query(); //db is pre-configured to use SNAPSHOT mode.
q.constrain(PostedMessage.class);
Constraint from = q.descend("timeRecieved").constrain(new Long(after.getTimeInMillis())).greater().equal();
q.descend("timeRecieved").constrain(new Long(before.getTimeInMillis())).smaller().equal().and(from);
ObjectSet<PostedMessage> results = q.execute();
return results;
}
The arguments to this method are as follows:
after = 13-09-2011 10:55:55
before = 13-09-2011 10:56:10
And I expect only 10 PostedMessages to be returned between "after" and "before". (I am generating dummy PostedMessage with timeReceived incremented by 1 sec each.)