I need an algorithm that looks simple, but I still can't think of a well-optimised way to do this.
I have the following json object:
[
{
"start": "2000-01-01T04:00:00.000Z",
"end": "2020-01-01T08:00:00.000Z"
}, {
"start": "2000-01-01T05:00:00.000Z",
"end": "2020-01-01T07:00:00.000Z"
}
]
As you can see, the second object's range lies inside the range of the first. I need to iterate over this array and return which date ranges are conflicting.
My project is in Ruby on Rails right now, but I just need an idea of how to implement the algorithm, so any high-level programming language would be fine.
Any ideas?
First, we can transform the list of hashes to parse the dates into Date objects:
require 'date'

dates = input.map do |hsh|
  hsh.transform_values { |str| Date.parse(str) }
end
Now we can use a nested loop and Range#cover? to find whether any ranges conflict:
conflicting = dates.select.with_index do |date, idx|
  [date[:start], date[:end]].any? do |date_to_compare|
    dates.each_with_index.any? do |date2, idx2|
      next if idx == idx2 # skip comparing a range to itself
      (date2[:start]..date2[:end]).cover?(date_to_compare)
    end
  end
end
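For example, with the data from the question as an array of hashes with :start/:end string values (an assumption about how the JSON is parsed), the two snippets above give:

input = [
  { start: "2000-01-01T04:00:00.000Z", end: "2020-01-01T08:00:00.000Z" },
  { start: "2000-01-01T05:00:00.000Z", end: "2020-01-01T07:00:00.000Z" }
]
# after running the two snippets above:
conflicting.size
# => 2 -- both entries are reported, because Date.parse drops the time
#    component, so each parsed range covers the other's start date

Note that if you parse with DateTime.parse instead of Date.parse, the times are kept and only the second entry (which lies entirely inside the first) is reported.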
Detect a DateTime Object Covered By a Range
There may be a more elegant way to do this, but this seems relatively straightforward to me. The trick is to convert your Hash values into DateTime ranges that can take advantage of the built-in Range#cover? method.
Consider the following:
require 'date'
dates = [
{:start=>"2000-01-01T04:00:00.000Z", :end=>"2020-01-01T08:00:00.000Z"},
{:start=>"2000-01-01T05:00:00.000Z", :end=>"2020-01-01T07:00:00.000Z"},
]
# convert your date hashes into an array of date ranges
date_ranges = dates.map { |hash| hash.values }.map do |array|
  (DateTime.parse(array.first)..DateTime.parse(array.last))
end
# compare sets of dates; report when the first covers the second range
date_ranges.each_slice(2) do |range1, range2|
  puts "#{range1} covers #{range2}" if range1.cover?(range2)
end
Because Range#cover? returns a Boolean, you might prefer to simply store the ranges that are covered and do something with them later, rather than acting on each one immediately. In that case, just use select. For example:
date_ranges.each_slice(2).select { |r1, r2| r1.cover? r2 }
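Note that each_slice(2) only compares disjoint pairs (the 1st range with the 2nd, the 3rd with the 4th, and so on), and Range#cover? with a Range argument needs Ruby 2.6 or newer. If you need to check every pairing, one possible variation (not from the original answer) is Array#combination:

# check every pair of ranges, in either direction
date_ranges.combination(2).select { |r1, r2| r1.cover?(r2) || r2.cover?(r1) }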
Shove the data into a database with a BTREE index on the date fields, and let the DB do the work for you.
Let's say we have the following table:
CREATE TABLE myDate (
  id BIGINT UNSIGNED,
  date_start DATETIME,
  date_end DATETIME
);
Then you want a BTREE (or B+tree) index on date_start and date_end, and a HASH index on id.
Once these are in place, feed your table the data, and run SELECT statements like the following to find times that overlap:
-- Query to select dates that are fully contained such as in the example (l contains r):
SELECT l.id, l.date_start, l.date_end, r.id, r.date_start, r.date_end
FROM myDate l JOIN myDate r ON (l.date_start < r.date_start) AND (l.date_end > r.date_end);
-- Query to select dates that overlap on one side (l starts before r and ends inside it, or l starts inside r and ends after it):
SELECT l.id, l.date_start, l.date_end, r.id, r.date_start, r.date_end
FROM myDate l JOIN myDate r ON ((l.date_start < r.date_start) AND (l.date_end > r.date_start) AND (l.date_end < r.date_end)) OR ((l.date_start > r.date_start) AND (l.date_start < r.date_end) AND (l.date_end > r.date_end));
Those strings look like ISO 8601 format. You should be able to parse them easily into Date/DateTime/similar objects; the docs for those classes show how. Then, after parsing into objects, you can compare them with the < / <= / >= / > operators. With this you can compare starts and ends, and determine whether a date range X is:
(a) fully before the other one
(b) starts before and ends within the other one
(c) fully within the other one
(d) starts within and ends after the other one
(e) fully after the other one
(f) is longer and fully contains the other one
I think that's all the possibilities, but you'd better double-check. Draw them all on a time axis if needed and see whether there are any others.
When you have code that can do this classification, you're good to go and can implement the rest of the logic on top of it.
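A minimal sketch of that classification (an assumption on my part: each range is a hash with :start/:end DateTime values, as in the question's data):

require 'date'

# Classify how range x relates to range y.
def classify(x, y)
  return :fully_before   if x[:end]   < y[:start]
  return :fully_after    if x[:start] > y[:end]
  return :contains_other if x[:start] <= y[:start] && x[:end] >= y[:end]
  return :fully_within   if x[:start] >= y[:start] && x[:end] <= y[:end]
  return :overlaps_start if x[:start] < y[:start]
  :overlaps_end
end

a = { start: DateTime.parse("2000-01-01T04:00:00.000Z"), end: DateTime.parse("2020-01-01T08:00:00.000Z") }
b = { start: DateTime.parse("2000-01-01T05:00:00.000Z"), end: DateTime.parse("2020-01-01T07:00:00.000Z") }
classify(b, a) # => :fully_within
classify(a, b) # => :contains_other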
but I still can't think about a well optimised way
Don't. Write it first in any way, just to get it working and reliable. Understand the problem from beginning to end, thoroughly. Then measure its speed and quality. If it's not good, write a v2 based on what those first observations about speed/quality suggest. Measure and compare again. If it's still not good, collect the code, data sets and measurements, make sure the test cases and measurements are repeatable by readers who don't have your computer, network, passwords, etc., and then explain the problem and ask how to fix/optimize it. Without all of this, asking about "optimization"*) mostly leads to pure guessing.
*) Of course, assuming that "well optimised way" wasn't an empty buzzword but a real question about performance.
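That said, if measurement does show that a pairwise comparison is too slow, one common improvement (a sketch of my own, not from any of the answers above) is to sort the ranges by start time and scan once; any range that starts before the latest end seen so far conflicts with an earlier one:

require 'date'

# O(n log n) sketch: assumes an array of hashes with :start/:end ISO 8601 strings.
def conflicting_ranges(ranges)
  parsed = ranges.map { |h| h.transform_values { |s| DateTime.parse(s) } }
  sorted = parsed.sort_by { |h| h[:start] }

  conflicts = []
  latest_end = nil
  sorted.each do |range|
    # conflict: this range starts before some earlier range has ended
    conflicts << range if latest_end && range[:start] <= latest_end
    latest_end = [latest_end, range[:end]].compact.max
  end
  conflicts # extend this to also record the earlier partner if you need both sides
end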
I need to compare the two arrays declared here and return the records that exist only in the filtered_apps array. I am using the contents of the previous_apps array to see whether a record's ID exists in the filtered_apps array. I will be outputting the results to a CSV and displaying the records that exist in both arrays on the console.
My question is this: how do I get the records that only exist in filtered_apps? Easiest for me would be to put those unique records into a new array to work with for the CSV.
start_date = Date.parse("2022-02-05")
end_date = Date.parse("2022-05-17")
valid_year = start_date.year
dupe_apps = []
uniq_apps = []
# Finding applications that meet my criteria:
filtered_apps = FinancialAssistance::Application.where(
:is_requesting_info_in_mail => true,
:aasm_state => "determined",
:submitted_at => {
"$exists" => true,
"$gte" => start_date,
"$lte" => end_date })
# Finding applications that I want to compare against filtered_apps
previous_apps = FinancialAssistance::Application.where(
is_requesting_info_in_mail: true,
:submitted_at => {
"$exists" => true,
"$gte" => valid_year })
# I'm using this to pull the ID that I'm using for comparison just to make the comparison lighter by only storing the family_id
previous_apps.each do |y|
previous_apps_array << y.family_id
end
# This is where I'm doing my comparison and it is not working.
filtered_apps.each do |app|
if app.family_id.in?(previous_apps_array) == false
then #non_dupe_apps << app
else "No duplicate found for application #{app.hbx_id}"
end
end
end
So what am I doing wrong in the last code section?
Let's check your original method first (I fixed the indentation to make it clearer). There are quite a few issues with it:
filtered_apps.each do |app|
  if app.family_id.in?(previous_apps_array) == false
  # Where is "#non_dupe_apps" declared? It isn't anywhere in your example...
  # Also, "then" is not necessary unless you want a one-line if-statement
  then #non_dupe_apps << app
  # This doesn't do anything, it's just a string
  # You need to use "p" or "puts" to output something to the console
  # Note that the "else" is also only triggered when duplicates WERE found...
  else "No duplicate found for application #{app.hbx_id}"
  end # Extra "end" here, this will mess things up
end
end
Also, you haven't declared previous_apps_array anywhere in your example, you just start adding to it out of nowhere.
Getting the difference between 2 arrays is dead easy in Ruby: just use -!
uniq_apps = filtered_apps - previous_apps
You can also do this with ActiveRecord results, since they are just arrays of ActiveRecord objects. However, this doesn't help if you specifically need to compare results using the family_id column.
TIP: Getting the values of only a specific column (or columns) from your database is probably best done with the pluck or select method if you don't need to store any other data about those objects. With pluck you get just an array of values, not the full objects. select works a bit differently: it returns ActiveRecord objects, but with everything except the selected columns filtered out. select is usually better in nested queries, since it doesn't trigger a separate query when used as part of another query, while pluck always triggers one.
# Querying straight from the database
# This is what I would recommend, but it doesn't print the values of duplicates
uniq_apps = filtered_apps.where.not(family_id: previous_apps.select(:family_id))
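If you would rather do the comparison in memory, closer to the original loop, a sketch using pluck might look like this (assuming the same filtered_apps and previous_apps from the question):

# Pull just the family_ids once, then filter the other collection against them
previous_family_ids = previous_apps.pluck(:family_id)
uniq_apps = filtered_apps.reject { |app| previous_family_ids.include?(app.family_id) }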
I highly recommend getting really familiar with at least filter/select and map from the basic array methods. They make things like this much easier. The Ruby docs are a great place to learn about them and others. A very simple example of doing something similar to what you described in your question, using filter/select on two arrays, would be something like this:
arr = [1, 2, 3]
full_arr = [1, 2, 3, 4, 5]

unique_numbers = full_arr.filter do |num|
  if arr.include?(num)
    puts "Duplicates were found for #{num}"
    false
  else
    true
  end
end
# Duplicates were found for 1
# Duplicates were found for 2
# Duplicates were found for 3
=> [4, 5]
NOTE: The OP is working with Ruby 2.5.9, where filter is not yet available as an array method (it was introduced in Ruby 2.6). However, filter is just an alias for select, which is available in earlier versions of Ruby, so they can be used interchangeably. Personally, I prefer filter because, as seen above, select already has another meaning in query contexts, and filter is also the more common term in the other programming languages I usually work with. Of course, when both are available it doesn't really matter which one you use, as long as you keep it consistent.
EDIT: My last answer did, in fact, not work.
Here is the code all nice and working.
It turns out the issue was that when comparing family_id against the set of records, I forgot that the looped record was itself part of that set, so it would be returned too. I added a check to exclude the looped record's own ID and Bob's your uncle.
I added the pass and reject arrays so I could check my work instead of downloading a CSV every time. I'm leaving them in mostly because I'm scared to change anything else.
start_date = Date.parse(date_from)
end_date = Date.parse(date_to)
valid_year = start_date.year
date_range = (start_date)..(end_date)
comparison_apps = FinancialAssistance::Application.by_year(start_date.year).where(
aasm_state:'determined',
is_requesting_voter_registration_application_in_mail:true)
apps = FinancialAssistance::Application.where(
:is_requesting_voter_registration_application_in_mail => true,
:submitted_at => date_range).uniq{ |n| n.family_id}
#pass_array = []
#reject_array = []
apps.each do |app|
  family = app.family
  app_id = app.id
  previous_apps = comparison_apps.where(family_id: family.id, :id.ne => app.id)
  if previous_apps.count > 0
    #reject_array << app
    puts "\e[32mApplicant hbx id \e[31m#{app.primary_applicant.person_hbx_id}\e[32m in family ID \e[31m#{family.id}\e[32m has registered to vote in a previous application.\e[0m"
  else
    <csv fields here>
    csv << [csv fields here]
  end
end
Basically, I pulled the applications into the apps array, then filtered them by the family_id field in each record.
I had to do this because the issue at the bottom of everything was that there were records present in apps that were themselves duplicates, just submitted a few days apart. Since I went on the assumption that the initial apps array would be all unique, I thought the duplicates that showed up were due to the rest of the code not filtering correctly.
I then loop through the apps array in apps.each do, and for each record query comparison_apps for other applications with the same family_id (excluding the looped record's own ID) into previous_apps. Since that is re-evaluated on each pass, if it ever holds more than 0 records the app gets called out as already submitted; otherwise it goes to my CSV report.
Thanks for the help on this, it really got my brain thinking in another direction that I needed to. It also helped improve the code even though the issue was at the very beginning.
I am trying to query if a certain date belongs to a specific range of dates. Source code example:
billing_period_found = BillingPeriod.query(
    ndb.AND(
        transaction.date > BillingPeriod.start_date,
        transaction.date < BillingPeriod.end_date)
).get()
Data definition:
class Transaction(ndb.Model):
    date = ndb.DateProperty(required=False)

class BillingPeriod(ndb.Model):
    start_date = ndb.DateProperty(required=False)
    end_date = ndb.DateProperty(required=False)
Getting the following error:
TypeError: can't compare datetime.date to DateProperty
The error message does make sense, because datetime is different from DateProperty. However, as you can see, the definition for transaction.date is not datetime, so I don't get where this attempt to compare datetime.date to DateProperty is coming from. Anyway, if I figure out how to convert datetime to DateProperty, I guess it would fix the problem.
Any ideas on how to solve this?
Thanks!
The App Engine datastore does not allow queries with inequalities on multiple properties (this is not a limitation of ndb, but of the underlying datastore). Selecting date-range entities that contain a certain date is a typical example of a task this makes impossible to achieve in a single query.
Check out Optimizing an inequality query in ndb over two properties for an example of this question and, in the answer, one suggestion that might work: query for (in your case) all BillingPeriod entities with end_date greater than the desired date, perhaps with a projection to get just their key and start_date; then, in your own application, keep only those with start_date less than the desired date (if you only want one of them, a next over the iterator will stop as soon as it finds one).
Edit: the issue above is problem #1 with this code; once solved, problem #2 arises -- as clearly listed at https://cloud.google.com/appengine/docs/python/ndb/queries, the property in ndb queries must always be on the left of the comparison operator. So, one can't write date < BillingPeriod.end_date, as that would put the property on the right; rather, one writes BillingPeriod.end_date > date.
I am learning Cypher on Neo4j and I am having trouble understanding how to perform an efficient 'join' equivalent in Cypher.
I am using the standard Matrix character example and I have added some nodes to the mix called 'Gun' with a relation of ':GIVEN_TO'. You can see the console with my query result here:
http://console.neo4j.org/r/rog2hv
The query I am using is:
MATCH (Neo:Crew { name: 'Neo' })-[:KNOWS*..]->(other:Crew),(other)<-[:GIVEN_TO]-(g:Gun),(Neo)<-[:GIVEN_TO]-(g2:Gun)
RETURN count(g2);
I have given Neo 4 guns, but when I perform the above I get a count of '12'. This seems to happen because there are 3 'others' and 3*4 = 12, so I get a multiplied result.
What should my query look like to get the correct count ('4') from the example?
Edit:
The reason I am not querying through Guns directly as suggested by #ceej is because in my real use case I have to do this traversal as described above. Adding DISTINCT does not do anything for my result.
The reason you get 12 guns instead of 4 is that your query produces a cartesian product. This is because you have asked for items in the same MATCH statement without joining them. #ceej rightly pointed out that if you want to find Neo's guns, you would do as he suggested in his first query.
If you wanted to get a list of the crew members and their guns then you could do something like this...
MATCH (crew:Crew)<-[:GIVEN_TO]-(g:Gun)
RETURN crew.name, collect(g.name)
Which finds all of the crew members with guns and returns their name and the guns that they were given.
If you wanted to invert it and get a list of the guns and the respective crew members they were given to, you could do the following...
MATCH (crew:Crew)<-[:GIVEN_TO]-(g:Gun)
RETURN g.name, collect(crew.name)
If you wanted to find all of the crew that knew Neo, multiple levels deep, and were given a gun, you could write the query like this...
MATCH (crew:Crew)<-[:GIVEN_TO]-(g:Gun)
WITH crew, g
MATCH (neo:Crew {name: 'Neo'})-[:KNOWS*0..]->(crew)
RETURN crew.name, collect(g.name)
That finds all the crew that were given guns and then determines which of them have a :KNOWS path to Neo.
Forgive me, but I am unclear why you have the initial MATCH in your query. From your explanation it would appear that you are trying to get the number of :Gun nodes linked to Neo by the :GIVEN_TO relationship. In which case all you need is the latter part of your query, which would give you something like
MATCH (neo:Crew { name: 'Neo' })<-[:GIVEN_TO]-(g:Gun)
RETURN count(g)
Furthermore, to make sure that you are only counting distinct :Gun nodes you can add DISTINCT to the RETURN statement.
MATCH (neo:Crew { name: 'Neo' })<-[:GIVEN_TO]-(g:Gun)
RETURN count( DISTINCT g )
This is possibly unnecessary in your case but can be helpful when the pattern that you are matching on can arrive at the same node by different traversals.
Have I misunderstood your requirement?
Thanks in advance for any and all help!
I'm running a query on the datastore that looks like this:
forks = Thing.query(ancestor=user.subscriber_key).filter(
Thing.status==True,
Thing.fork_of==thing_key,
Thing.start_date <= user.day_threshold(),
Thing.level.IN([1,2,3,4,5])).order(
Thing.level)
This query works and returns the results I expect. However, I would like to sort it on one additional field (Thing.last_touched). If I add this to the sort, it won't work, because Thing.last_touched is not the property the inequality filter is applied to. I can't add an additional inequality filter, since we're only allowed one, and it isn't needed anyway (actually, that's why Thing.level.IN is there: not needed as a filter, but required for the sort).
So, what I'm wondering is, could I run the query with the filters that I want, and then run code to sort the query results myself? I know I could pull all the parameters I want to sort and store them in dictionaries and sort them that way, but it seems to me there ought to be a way to handle this with the query.
I've searched for days for this but have had no luck.
Just in case you need it, here's the class definition of Thing:
class Thing(ndb.Model):
    title = ndb.StringProperty()
    level = ndb.IntegerProperty()
    fork = ndb.BooleanProperty()
    recursion_level = ndb.IntegerProperty()
    fork_of = ndb.KeyProperty()
    creation_date = ndb.DateTimeProperty(auto_now_add=True)
    last_touched = ndb.DateTimeProperty(auto_now=True)
    status = ndb.BooleanProperty()
    description = ndb.StringProperty()
    owner_id = ndb.StringProperty()
    frequency = ndb.IntegerProperty()
    start_date = ndb.DateTimeProperty(auto_now_add=True)
    due_date = ndb.DateTimeProperty()
One of the main reasons Google AppEngine is so fast even when dealing with insane amounts of data is its very limited query options. All standard queries are "scans" over an index, i.e. there is some table (index) that keeps references to your actual data entries, sorted by ONE of the data's properties. So, let's say you add the following entries:
Thing A: start-date = Wednesday (I'm just going to use weekdays for simplicity)
Thing B: start-date = Friday
Thing C: start-date = Monday
Thing D: start-date = Thursday
Then, AppEngine will create an index that looks like this:
1 - Monday -> Thing C
2 - Wednesday -> Thing A
3 - Thursday -> Thing D
4 - Friday -> Thing B
Now, any query will correspond to a continuous block in this (or another) index. If you, for example, say "All Things with start-date >= Tuesday", it will return entries in row 2 through 4 (i.e. Thing A, Thing D, and Thing B in that exact order!). If you query for "< Thursday", you get 1-2. If you say "> Tuesday and <= Thursday" you get 2-3.
And if you are doing inequality filters on a different property, AppEngine will use a different index.
This is why you can only do one inequality filter, and why the sort order must also be on the property you apply the inequality filter to. AppEngine is simply not designed to return items 1, 2, 4 (with a gap*) out of an index, or items 4, 2, 3 (no gap, but out of order).
So, if you need to sort your entries on a property other than the one you use for inequality filtering, you basically have 3 choices:
Perform your query with the inequality filter, read all results into memory, and sort them in your code afterwards (I think this is what you mean by storing them in a dictionary)
Perform your query WITHOUT the inequality filter, but sorted on the right property. Then, as you loop over the returned entries, simply check the inequality yourself and drop the ones that don't match
Perform your query with the inequality filter and just return the items in the wrong order, and let the client-application worry about sorting them! ;)
Generally I would assume that you have far more unused resources available client-side to do the sorting, so I would probably go for option 3 in most cases. But if you need to sort the entries server-side (e.g. for a mobile app targeting older smartphones), then whether option 1 or option 2 is better depends on the size of your database and the fraction of entries that usually match your inequality filter. If the inequality filter only removes a small fraction of the entries, option 2 might be much faster (as it doesn't require any super-linear sorting), but if you have a huge database of entries and only a very small number of them match the inequality, definitely go for option 1.
BTW: The talk "App Engine Datastore Under the Covers" from Google I/O 2008 might be a very helpful resource. It's a bit technical, but it gives a great overview of this topic, and I consider it must-know information if you want to do anything in AppEngine. Note, though, that the talk is a bit outdated. There are a bunch more things you can do with queries nowadays, but ALL of these extra things (if I understand correctly) are API functions that in the end just generate a set of several simple queries (exactly like the ones described in the talk) and then combine the results in memory in your application (just like you would if you did your own sorting).
*There are some exceptions where AppEngine can generate the intersection of two (or more?) index-scans to drop items from the results, but I don't think that you could use that to change the order of the returned entries.
I am trying to implement a tailable cursor for MongoDB using the C driver. Up till now I have been able to create it and successfully get the pushed data into my process with the following code:
cursor = mongo_find(connection, DB_TENANT_NAMESPACE, query, bson_empty(&e), 0, 0, MONGO_TAILABLE | MONGO_AWAIT_DATA);
while (1)
{
    while (mongo_cursor_next(cursor) == MONGO_OK)
    {
        b = mongo_cursor_bson(cursor);
        if (bson_find(iterator, b, "_id"))
        {
            oid = bson_iterator_oid(iterator);
            bson_oid_to_string(oid, &id);
            printf("ID:%s\n", id);
        }
    }
}
With this code I can get the updates. But looking at the tailable cursor docs, it seems that I need to run mongo_find inside the outer while loop to make sure I get the latest entries. The docs suggest appending to the query with gte. Copying from the docs:
query = QUERY( "_id" << GT << lastId ).sort("$natural");
The issue is that the oid is an object, which can be converted to a string. I don't really think I should be converting it to an int in order for gte to work. Any ideas?
ObjectIds may be logically compared with those operators, as can Date and Timestamp objects. There should be no need to represent the ObjectId as a string, and there is no practical reason (at least in this case) for comparing an ObjectId to a string.
Note that comparisons involving two different BSON types will follow this compare order.