SOLR: Range query with sum - solr

I have an example schema like:
id:1,date:2012-05-01,parent:p1
id:1,date:2012-05-01,parent:p2
id:1,date:2012-05-01,parent:p3
id:1,date:2012-05-02,parent:p1
id:1,date:2012-05-02,parent:p4
I would like to perform a range query on "date" and know how many new/unique parents occurred each day. In other words, I would like to see how many NEW parents were added over time. For the given data the output should look like:
2012-04-30:0 (no parents existed at that time)
2012-05-01:3 (because three new parents occurred on 2012-05-01: p1,p2,p3)
2012-05-02:4 (3 parents from 2012-05-01 plus 1 new unique parent, p4, which occurred on 2012-05-02, for a total of 4)
2012-05-03:4 (no new parent was added this day...)
Is this kind of query even possible in SOLR?

Yes, this should be fairly simple if I understand your question correctly. Adding something like
fq=date:[2012-05-05T00:00:00Z TO 2012-05-06T00:00:00Z]
to your query will fetch you all documents with a date between 5 May and 6 May. Make sure to store your dates in ISO 8601 format.
For more, check out the date examples here: http://wiki.apache.org/solr/SolrQuerySyntax
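As a rough illustration, the filter above could be assembled from Python dates as follows. The helper name solr_date_filter is hypothetical, midnight UTC is assumed for both bounds, and q=*:* is only a placeholder main query; the fq syntax itself is the one shown in the answer.

```python
from datetime import date
from urllib.parse import urlencode


def solr_date_filter(field, start, end):
    """Build a Solr range filter from start to end, both taken at midnight UTC."""
    return f"{field}:[{start.isoformat()}T00:00:00Z TO {end.isoformat()}T00:00:00Z]"


fq = solr_date_filter("date", date(2012, 5, 5), date(2012, 5, 6))

# q=*:* (match everything) is an assumption; substitute your own main query.
params = urlencode({"q": "*:*", "fq": fq})
```

urlencode takes care of escaping the colons, brackets and spaces, so params can be appended to the select URL as-is.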
EDIT: I understand your question better now: you're looking for field collapsing (result grouping).
Try
&group=true&group.field=parent&group.limit=1
and count the number of groups returned (one per distinct parent).
If you want them with values for each date, you'll want to facet by date:
&facet=true&facet.field=date
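Faceting by date returns per-day document counts rather than a running total of distinct parents, so the cumulative figures in the question may still need a small client-side pass. The sketch below assumes the (date, parent) pairs have already been fetched from Solr; the function name and the in-memory list are illustrative only.

```python
def cumulative_new_parents(docs, days):
    """For each day, count the distinct parents first seen on or before that day."""
    first_seen = {}
    for day, parent in docs:
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if parent not in first_seen or day < first_seen[parent]:
            first_seen[parent] = day
    return {
        day: sum(1 for seen in first_seen.values() if seen <= day)
        for day in days
    }


# The sample data from the question.
docs = [
    ("2012-05-01", "p1"),
    ("2012-05-01", "p2"),
    ("2012-05-01", "p3"),
    ("2012-05-02", "p1"),
    ("2012-05-02", "p4"),
]
days = ["2012-04-30", "2012-05-01", "2012-05-02", "2012-05-03"]
totals = cumulative_new_parents(docs, days)
```

For the sample data, totals holds 0, 3, 4 and 4 for the four days, matching the output asked for in the question.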

Related

ValueFilter for DateTime Attributes

I'm working with the Blog app and I see how to filter the blog posts by year using the Visual Query Designer. I use the query string value that holds the year in the ValueFilter, and my properties are as follows:
Attribute: PublicationMoment
Value: [QueryString:year]-01-01 and [QueryString:year]-12-31
Operation: between
How would I get the posts from a specific month and year if those values are passed via query string parameters? Because the months of the year have a varying number of days, I'm not sure how you would accomplish this in the Value field of the ValueFilter. Currently I'm passing the two-digit month as the parameter.
I tried something like: [QueryString:year]-[Querystring:month]
Operation: contains
but the above operation doesn't really work because the datatype is a DateTime object.
I could do it in the razor view but I'm afraid that the paging datasource would have too many pages in it since it would be based on the larger subset of posts for the given year that was passed in the querystring parameter.
Is there any way to do this with the filter?
Basically, dates are not perfectly handled yet, but there are a few ways to do it using the visual query:
Use the correct dates in the query, like between [QueryString:Start] and [QueryString:End], and calculate those dates where you generate the links.
Since your main problem with the "between" filter is that it would include the last day too, you could also use two filters: one >= the first date and another < the second date. The first date would be day 1 of the requested year and month; the second would be day 1 of the following month.
Last but not least: if you do it with razor and LINQ you shouldn't run into any performance issues - it's technically the same thing the pipeline does and it's been tested to perform well with tens of thousands of records.
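The date arithmetic behind the first two options can be sketched in Python as below. The month_bounds helper is hypothetical, and the Start/End parameter names are taken from the [QueryString:Start] and [QueryString:End] tokens mentioned above; in the Blog app itself the same calculation would live wherever the links are generated.

```python
from datetime import date
from urllib.parse import urlencode


def month_bounds(year, month):
    """Return (first day of the month, first day of the following month)."""
    first = date(year, month, 1)
    if month == 12:
        next_first = date(year + 1, 1, 1)  # December rolls over into January
    else:
        next_first = date(year, month + 1, 1)
    return first, next_first


start, end = month_bounds(2012, 2)

# Query string for a link; pair it with ">= Start" and "< End" filters.
link_query = urlencode({"Start": start.isoformat(), "End": end.isoformat()})
```

Using the first day of the next month as an exclusive upper bound avoids having to know how many days the month has.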

What is the optimized way for queries on partial dates in GAE Text Search?

I need to get entities filtered by month rather than by complete date values (e.g. birthdays) using Google App Engine Text Search. Having checked the GAE docs, I think it is not possible to query date fields by month directly.
So in order to filter by month or day, we are considering saving each date component, Day (DD), Month (MM) and Year (YYYY), as a separate NUMBER field alongside the complete date field.
I verified locally that this works. But is splitting the date into separate fields the correct approach when we want to query on date components?
Is there any known limit on the number of fields per document, apart from the 10GB size limit in GAE Text Search?
Please advise.
Thanks,
Naresh
The only time NUMBER or DATE fields make sense is if you need to query on ranges of values. In other cases they are wasteful.
I can't tell from your question exactly what queries you want to run. Are you looking for a (single) specific day of the month (e.g., January 6 -- of any year)? Or just "anything in June (again, without regard to year)"? Or is it a date range: something like January 20 through February 19? Or July 1 through September 30?
If it's a range then NUMBER values may make sense. But if it's just a single specific month, or a single month and day-of-month combination, then you're better off storing month and day as separate ATOM fields.
Anything that looks like a number but won't be searched by numerical range or used in arithmetic isn't really a number, and is probably best stored as an ATOM. Examples are phone numbers and zip codes (unless you're terribly clever and want to do something like "all zip codes in San Francisco look like 941xx"; even then, you're probably better off just storing the "941" prefix as an ATOM).

Is it possible to create an SQL query that displays results like this?

Background
I have a database that holds records of all assets in an office. Each asset has a condition, a category name and an age.
A ConditionID can be:
In use
Spare
In Circulation
CategoryID values are:
Phone
PC
Laptop
and Age is just a field called AquiredDate which holds values like:
2009-04-24 15:07:51.257
Example
I've created an example of the inputs of the query to explain better what I need if possible.
NB.
Inputs are in Orange in the above example.
I've split the example into two separate queries.
Count would be the output
Question
Is this type of query and result set possible using SQL alone? And if so, where do I start? Would it be easier to use MS Excel as well?
Yes, it is possible. For your orange fields you can just filter, e.g.
where CategoryID ='Phone' and ConditionID in ('In use', 'In Circulation')
For the yellow one, you could take the datediff in days between the acquired date and now, divide it by 365 and floor the result. To get the last category (6+ years), take the minimum of 5 and the calculated value, so you get 0 for everything between 0 and 1 year old, and so on up to 5, which holds everything older.
When you group by that calculated column and additionally select the count, you get what you desire.
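A minimal, runnable sketch of that approach follows, using SQLite from Python so it can be tried without a database server. The table name Assets, the sample rows and the fixed "today" date are all assumptions made for the demo, and SQLite's julianday() stands in for datediff; on SQL Server the same idea would use DATEDIFF and a CASE expression or similar for the cap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; column names follow the question.
conn.execute(
    "CREATE TABLE Assets (CategoryID TEXT, ConditionID TEXT, AquiredDate TEXT)"
)
conn.executemany(
    "INSERT INTO Assets VALUES (?, ?, ?)",
    [
        ("Phone", "In use", "2009-04-24 15:07:51.257"),
        ("Phone", "In Circulation", "2011-06-01 09:00:00"),
        ("Phone", "Spare", "2011-07-01 09:00:00"),  # filtered out by condition
        ("Phone", "In use", "2003-01-15 12:00:00"),
        ("PC", "In use", "2011-02-01 08:30:00"),  # filtered out by category
    ],
)

AGE_BUCKETS_SQL = """
SELECT MIN(5, CAST((julianday(:today) - julianday(AquiredDate)) / 365 AS INTEGER))
           AS AgeBucket,          -- whole years of age, capped at 5
       COUNT(*) AS AssetCount
FROM Assets
WHERE CategoryID = 'Phone'
  AND ConditionID IN ('In use', 'In Circulation')
GROUP BY AgeBucket
ORDER BY AgeBucket
"""

# A fixed reference date keeps the demo reproducible; use the current date in practice.
rows = conn.execute(AGE_BUCKETS_SQL, {"today": "2012-06-15"}).fetchall()
```

Each row in rows is an (AgeBucket, AssetCount) pair. With a cap of 5, bucket 5 means "five years or older"; raise the cap if the report needs a separate 5-6 year band.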

Pushing term 1 results into term 3 fields for new students joining mid-year

I have a table that contains test marks from different terms, ca1_percent, sa1_percent, ca2_percent and sa2_percent. These 4 fields reside in the Results table that contains results from the different terms.
I used a self-join relationship on the match field overall_percent_match, which is calculated using year & " " & subject & " " & _kf_studentID. This relationship allows me to obtain the test results from past terms (of a year). For example, my term 3 results will contain results from term 1 and term 2 (of each subject). All works fine unless a new student joins partway through the year. If he joins in term 3, his ca2 results (done in term 3) will fall into his ca1_percent column (which is supposed to contain term 1 results), unlike the records of students who were there from the start.
Image shows what I mean.
I could not figure out the solution. Can anyone help me?
This StackOverflow link contains more details of my work that was done related to this problem.
The underlying problem, per your prior query, is that you're pulling the values through:
GetNthRecord(SA1_Results_Match::mark_percent,2)
This statement assumes the existence of an N=1, N=2 and N=3. To make this work properly you could do any of the following:
Ensure that your Results table always has records from the prior semester, even if the student joins later in the semester. You could keep using GetNthRecord this way, but you will always need to ensure that the records are in order.
Use an ExecuteSQL statement to gather only the correct semester's results for the correct summary field.
Make four separate relationships, with separate Table Occurrences, to define ca1, sa1, ca2 and sa2 each separately. This looks like what you started out trying to do in the prior question.

strange appengine query result

What am I doing wrong in this query?
SELECT * FROM TreatmentPlanDetails
WHERE
accountId = 'ag5zfmRvbW9kZW50d2ViMnIRCxIIQWNjb3VudHMYtcjdAQw' AND
status = 'done' AND
category = 'chirurgia orale' AND
setDoneCalendarEventStartTimestamp >= [timestamp for 6 june 2012] AND
setDoneCalendarEventStartTimestamp <= [timestamp for 11 june 2012] AND
deleteStatus = 'notDeleted'
ORDER BY setDoneCalendarEventStartTimestamp ASC
I am not getting any record and I am sure there are records meeting the where clause conditions. To get the correct records I have to widen the timestamp interval by 1 millisecond. Is it normal? Furthermore, if I modify this query by removing the category filter, I am getting the correct results. This is definitely weird.
I also asked on google groups, but I got no answer. Anyway, for details:
https://groups.google.com/forum/?fromgroups#!searchin/google-appengine/query/google-appengine/ixPIvmhCS3g/d4OP91yTkrEJ
Let's talk specifically about creating the timestamps that go into the query. What code are you using to create the timestamp values? Apparently that's important, because fiddling with it a little bit affects the query. It may be relevant that in the datastore, timestamps are recorded as integers representing POSIX timestamps with microseconds, i.e. the number of microseconds since 1/1/1970 UTC (not counting leap seconds). It's also relevant that dates (i.e. without a time) are represented as midnight, i.e. the earliest time on that day. But please show us the exact code. (It may also be important to show the actual content of the record that you're attempting to retrieve.)
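To make the midnight point concrete, here is a small sketch of how the two bounds could be computed as microseconds since the epoch. It assumes UTC and microsecond precision; whether setDoneCalendarEventStartTimestamp actually holds microseconds, milliseconds or seconds is not known from the question, and to_micros is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(moment):
    """Whole microseconds since the Unix epoch, using integer arithmetic."""
    return (moment - EPOCH) // timedelta(microseconds=1)


# 6 June 2012 00:00 UTC: inclusive lower bound (use >=).
start_us = to_micros(datetime(2012, 6, 6, tzinfo=timezone.utc))

# 12 June 2012 00:00 UTC: exclusive upper bound (use <), which covers all of 11 June.
end_us = to_micros(datetime(2012, 6, 12, tzinfo=timezone.utc))
```

If the upper bound is built from "11 June" alone, it means midnight at the start of that day, so a <= comparison would miss anything recorded later on 11 June.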
An aside that is not specific to your question: Entity property names count as part of your storage quota. If this is going to be a huge dataset, you might pay more $$ than you'd like for property names like setDoneCalendarEventStartTimestamp.
Because you write:
"if I modify this query by removing the category filter, I am getting the correct results"
this probably means that the category property was not indexed at the time you wrote the matching records to the datastore. You have to rewrite your records to the datastore if you want them added to the newly created index.
