Sunspot Solr: indexing and searching range data

I am storing availability timings for my users: for each day of the week they enter the hours during which they are available.
For example, Mr X would be available on
Sunday for 2-5, 8-12, 15-18
Monday for 1-3, 5-8, 10-12
and so on for the entire week.
What would be the best way of indexing and searching this data in Solr?
A database query for searching such a dataset would look like this:
select * from schedule inner join days on schedule.day_id = days.id
where days.name = 'Sunday' and schedule.start>=5 and schedule.end>=8

Use the DateRangeField, which became available in Solr 5. It allows you to query for documents containing ranges that match your query time.
fq={!field f=dateRange op=Contains}[2013 TO 2018]
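A minimal schema sketch for that setup, assuming the field is named dateRange as in the fq example above:
<fieldType name="dateRange" class="solr.DateRangeField"/>
<field name="dateRange" type="dateRange" multiValued="true" indexed="true" stored="true"/>
Each availability slot is then indexed as one range value on the document, and op=Contains returns documents whose indexed range fully contains the queried range.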
Before Solr 5 there is a neat hack that uses Solr's spatial support to query for overlapping durations (checking whether a point is contained within the expected time area, and so on).
Depending on the needed resolution, you could also index seven different fields (monday through sunday) and store an integer for each hour that the person is available. You can then query such a field with a regular query, such as available_sunday:15, to find matching persons.
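As a sketch of that alternative (field names are hypothetical), assume multi-valued integer fields available_sunday through available_saturday holding each free hour. Mr X's Sunday slots above would be indexed roughly as available_sunday = 2,3,4,5,8,9,10,11,12,15,16,17,18 (or whichever hour convention you pick), and a search for someone free on Sunday from 5 to 8 becomes:
fq=available_sunday:(5 AND 6 AND 7 AND 8)
Every hour in the requested window must be present on the document for it to match.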

Related

SQL Server full text search max performance limits within time window

Let's say we have a table with 100 million records, which are sales transactions from multiple sellers. Each record has around 14 columns:
TABLE SellerTransactions
string SellerId,
string ProductId,
DateTime CreateDate,
string BankNumber,
string Name (name + ' ' + surname + ' ' + alias),
string Comments,
decimal Amount
etc...
Each year we will add around 60 million new records, and the record count will grow by roughly 10% yearly.
Search will be done by seller Id, then by product Id or product Ids (for multiple products in a time period for that seller).
Every search will also be filtered by time, usually a 1-week period (mostly the last week), but in the worst case we may search across all the data we have: years.
A search should also be possible by bank number, or a full-text search by name or part of it.
SELECT *
FROM SellerTransactions
WHERE SellerId = 'Seller1Guid'
AND ProductId IN ('ProductGuid1', 'ProductGuid2', ...)
AND CreateDate <= GETDATE()
AND CreateDate >= DATEADD(DAY, -7, GETDATE())
AND CONTAINS(Name, '"BoB Skynet"')
ORDER BY CreateDate
OFFSET 20 ROWS FETCH NEXT 20 ROWS ONLY
So, what would search time consumption look like in these scenarios?
After filtering by Ids & time range, searching within up to 10k records
After filtering by Ids & time range, searching within up to 100k records
After filtering by Ids & time range, searching within up to 1 million records
After filtering by Ids & time range, searching within 1 to 10 million records
In these cases, would it be 0.5 s, 1 s, or up to 20 s?
Also, how would search performance change if we added a search over the Comments column as well as Name?
CONTAINS(Name, '"BoB Skynet"') OR CONTAINS(Comments, 'online')
In most cases the search will run over small record counts; in very rare cases we would go through millions of rows, but how much time would that take?
When would it be a good idea to move to Elasticsearch, for example?
We are storing quite a bit of data, but the request count for this data is usually small.
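Not part of the original question, but as a rough sketch of the indexing that would typically back this query shape in SQL Server (all names here are hypothetical, and the full-text KEY INDEX must reference an existing unique index on the table):
-- unique key index required by the full-text index
CREATE UNIQUE INDEX UX_SellerTransactions_Id ON SellerTransactions (Id);
-- full-text index covering the free-text columns
CREATE FULLTEXT CATALOG SellerSearchCatalog;
CREATE FULLTEXT INDEX ON SellerTransactions (Name, Comments)
    KEY INDEX UX_SellerTransactions_Id ON SellerSearchCatalog;
-- supports the SellerId + date-range filter and sort
CREATE NONCLUSTERED INDEX IX_SellerTransactions_Seller_Date
    ON SellerTransactions (SellerId, CreateDate)
    INCLUDE (ProductId, Amount);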

Google Data Studio date aggregation - average number of daily users over time

This should be simple so I think I am missing it. I have a simple line chart that shows Users per day over 28 days (X axis is date, Y axis is number of users). I am using hard-coded 28 days here just to get it to work.
I want to add a scorecard for average daily users over the 28 day time frame. I tried to use a calculated field AVG(Users) but this shows an error for re-aggregating an aggregated value. Then I tried Users/28, but the result oddly is the value of Users for today. The division seems to be completely ignored.
What is the best way to show the average number of daily users over a time frame? Average daily users over 10 days, 20 days, etc.
Try creating a new metric that counts the dates, e.g.
Count of Date = COUNT(Date) or
Count of Date = COUNT_DISTINCT(Date) in case you have duplicated dates
Then create another metric for average users
Users AVG = (Users / Count of Date)
The average depends on the timeframe you have selected: if you select the last 28 days, the average is over those 28 dates; if you filter to 20 days, the average is over those 20 days, and so on.
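As a quick worked example (numbers are made up): if the selected range contains 28 distinct dates and Users sums to 1,400 over that range, then
Users AVG = 1400 / 28 = 50
i.e. an average of 50 daily users for the selected timeframe.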
Hope that helps.
I have been able to do this in an extremely crude and ugly manner using Google Sheets as a means to do the calculation and serve as a data source for Data Studio.
This may be useful for other people trying to do the same thing. This assumes you know how to work with GA data in Sheets and are starting with a Report Configuration. There must be a better way.
Example for Average Number of Daily Users over the last 7 days:
Edit the Report Configuration fields:
Report Name: create one report per day, in this case 7 reports. Name them (for example) Users-1 through Users-7. These are your Row 2 values. You'll have 7 columns, with the first report name in column B.
Start Date and End Date: use TODAY()-X where X is the number of days previous to define the start and end dates for each report. Each report will contain the user count for one day. Report Users-1 will use TODAY()-1 for start and end, etc.
Metrics: enter the metrics, e.g. ga:users and ga:newUsers
Create the reports
Use 'Run reports' to have the result sheets created and populated.
Create a sheet for an interim data set you will use as the basis for the average calculation. The first column is date, the remaining columns are for the metrics, in this case Users and New Users.
Populate the interim data set with the dates and values. You will reference the Report Configuration to get the dates, and you will pull the metrics from each of the individual reports. At this stage you have a sheet with date in first columns and values in subsequent columns with a row for each day's values. Be sure to use a header.
Finally, create a sheet that averages the values in the interim data set. This sheet will have a column for each metric, with one value per column. The one value is calculated from the series in the interim data set, for example =AVERAGE(interim_sheet_reference!range), or any other calculation you'd like to do.
At last, you can use Data Studio to connect to this data source and use the values. For counts of users such as this example, you would use Sum as the aggregation field type when you are creating the data source.
It's super ugly but it works.
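As a sketch of that final averaging step (the sheet name and range are hypothetical), if the interim data set lives on a sheet named Interim with a header in row 1 and the seven daily Users values in B2:B8, the average cell is simply:
=AVERAGE(Interim!B2:B8)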

Strange App Engine query result

What am I doing wrong in this query?
SELECT * FROM TreatmentPlanDetails
WHERE
accountId = 'ag5zfmRvbW9kZW50d2ViMnIRCxIIQWNjb3VudHMYtcjdAQw' AND
status = 'done' AND
category = 'chirurgia orale' AND
setDoneCalendarEventStartTimestamp >= [timestamp for 6 june 2012] AND
setDoneCalendarEventStartTimestamp <= [timestamp for 11 june 2012] AND
deleteStatus = 'notDeleted'
ORDER BY setDoneCalendarEventStartTimestamp ASC
I am not getting any records, and I am sure there are records meeting the WHERE clause conditions. To get the correct records I have to widen the timestamp interval by 1 millisecond. Is that normal? Furthermore, if I remove the category filter, I get the correct results. This is definitely weird.
I also asked on Google Groups but got no answer. Anyway, for details:
https://groups.google.com/forum/?fromgroups#!searchin/google-appengine/query/google-appengine/ixPIvmhCS3g/d4OP91yTkrEJ
Let's talk specifically about creating timestamps to go into the query. What code are you using to create the timestamp record? Apparently that's important, because fuzzing with it a little bit affects the query. It may be relevant that in the datastore, timestamps are recorded as integers representing posix timestamps with microseconds, i.e. the number of microseconds since 1/1/1970 UTC (not counting leap seconds). It's also relevant that dates (i.e. without a time) are represented as midnight, i.e. the earliest time on that day. But please show us the exact code. (It may also be important to show the actual content of the record that you're attempting to retrieve.)
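One thing worth checking, sketched in GQL-style notation with the dates from the question: if the upper bound is the timestamp for midnight on 11 June, any record stored later that day falls just outside the range, which would also explain why widening the interval changes the results. Using an exclusive upper bound at midnight of the following day avoids that:
setDoneCalendarEventStartTimestamp >= DATETIME('2012-06-06 00:00:00') AND
setDoneCalendarEventStartTimestamp < DATETIME('2012-06-12 00:00:00')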
An aside that is not specific to your question: Entity property names count as part of your storage quota. If this is going to be a huge dataset, you might pay more $$ than you'd like for property names like setDoneCalendarEventStartTimestamp.
Because you write:
"if I modify this query by removing the category filter, I am getting the correct results"
this probably means that the category was not indexed at the time you wrote the matching records to the datastore. You have to re-write your records to the datastore if you want them added to the newly created index.
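For reference, a query like this (several equality filters plus an inequality and sort on setDoneCalendarEventStartTimestamp) also needs a composite index. In the Python runtime's index.yaml, the entry would look roughly like this sketch (ascending order assumed):
indexes:
- kind: TreatmentPlanDetails
  properties:
  - name: accountId
  - name: status
  - name: category
  - name: deleteStatus
  - name: setDoneCalendarEventStartTimestamp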

Sum field by date range

I'm using Solr 3.6 and I'm kind of stuck trying to perform a special query.
I'm currently using facets by date range, with facet.date.gap set to +1DAY. The facet returns the count of docs in each date range, but I also need the sum of a particular field over the same ranges used in the facet. Essentially I need to count how many votes I have daily, weekly, monthly, whatever... it depends on the gap parameter.
Any ideas? Should I use the group.query or facet.query?
One suggestion I have is to treat the weeks and days separately and index them; for example, today is part of the 24th week. Another suggestion is not to rule out multiple searches to service one request: one to calculate all the other facets and one to return counts for the given date range (based on the search results from the first query).
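Not something the answer above spells out, but as a sketch of the second-query idea: Solr 3.x ships the StatsComponent, which can sum a numeric field, so assuming a votes field and a created_at date field (hypothetical names) you could fire one extra request per bucket of interest, e.g. for a single day:
q=*:*&rows=0&fq=created_at:[2012-06-24T00:00:00Z TO 2012-06-25T00:00:00Z]&stats=true&stats.field=votes
If you also index a plain day (or week) field, stats.facet on that field can return the per-bucket sums in a single request instead.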

SOLR travel site: on date queries

I am looking to implement Solr for a hotel bookings site. Search based on location, hotel names, and facilities works very well, and so does the faceting. What I have not been able to figure out is how to search for a hotel given check-in and check-out dates.
E.g.: the user searches for "Hotels in New York" and selects Check-in Date: 10 Feb 2012 and Check-out Date: 12 Feb 2012 from the date selection box.
This is how I have the data:
Hotel_Name    10Feb2012   11Feb2012   ........   31Dec2012
Hotel1        2 rooms     3 rooms                10 rooms
Hotel2        1 room      4 rooms     ........   12 rooms
Now if the query is for Hotel2 for 3 rooms from check-in date 10 Feb 2012 to 11 Feb 2012, it shouldn't match, because there is only one room available on 10 Feb.
If the query for Hotel2 is for 1 room from check-in date 10 Feb 2012 to 11 Feb 2012, then it should be part of the search result.
Use the ISO 8601 format for your date-times.
Complete date plus hours, minutes, seconds and a decimal fraction of a second
YYYY-MM-DDThh:mm:ss.sTZD (eg 1997-07-16T19:20:30.45Z)
Both your database and Solr will understand date-times from strings that conform to that format.
So,
store the data in the DB and Solr with compatible date-time formats (off the top of my head, Solr requires a Z appended to the date-time, otherwise it's invalid);
your search interface must format all dates that way when querying Solr.
Solr can do conditional expressions, facets, range-bucket faceting, etc. with dates.
I would go with the following schema:
hotel_name : string (for faceting)
hotel_name_searchable : text (for searching; this is a copy field: look it up)
room_id : string
start_date : date (when the room is available)
end_date : date (if not booked, set it to a far-future date, say 2040)
For each room you are ever tracking, store the date-times between which it is free.
You can search for rooms between the start_date and end_date.
Do faceting on hotel_name, so your search for rooms with "check-in date 10 Feb 2012 to 11 Feb 2012" gets you:
Hotel1:[r1,r2,r3]
Hotel2:[r8]
Hotel3:[r2,r3,r4]
Faceting on hotel_name groups the matching rooms by hotel, and facet.mincount can be used to return only hotels having the required number of rooms.
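A sketch of such a query for a 10-11 Feb stay, assuming the schema above and ISO 8601 date-times (a room matches when it is free from before check-in until after check-out):
q=*:*&fq=start_date:[* TO 2012-02-10T00:00:00Z]&fq=end_date:[2012-02-11T00:00:00Z TO *]&facet=true&facet.field=hotel_name&facet.mincount=1
Raising facet.mincount to the number of rooms the user needs keeps only hotels with at least that many free rooms.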
A little warning: I may be a bit rusty on faceting, as I used to do a lot of processing on Solr results itself.
